Recently, I wanted to read HDF5 files in Rust and WebAssembly. In particular, I was visualizing some point cloud files from the company where I work, Zivid, in my (still very experimental) personal side project, Visula.

I quickly found out that the hdf5 crate does not support compiling to the wasm32 target. Inspired by the pyfive and jsfive libraries, I therefore decided to try and write a pure Rust library for HDF5 files instead. I figured I might even learn some more details of the HDF5 specification, which I probably should admit I have also criticised before.

That is how Oxifive came to be. It is far from feature complete, but supports a small part of the specification that covers what I currently need myself.

At first, I wanted to implement the same API as the hdf5 crate, allowing for code like this:

let file = hdf5::File::open("my_file.h5").unwrap();
let group = file.group("group").unwrap();
let group2 = file.group("group").unwrap();
let dataset = group.dataset("data").unwrap();
let array = dataset.read_2d::<f32>().unwrap();
println!("Data {:#?}", array);

First of all, the above API assumes that there is a file system, while I would like the API to work on a general byte reader on the web. That is easily solved by accepting an io::Read instead of a str when opening the file.
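
As a minimal sketch of what reading from a general byte source could look like, the function below accepts anything that implements `Read` and checks for the 8-byte HDF5 format signature. The function name `starts_with_hdf5_signature` is made up for this illustration and is not part of Oxifive, and only offset 0 is checked even though the format also allows the superblock at later offsets.

```rust
use std::io::{self, Cursor, Read};

/// The 8-byte signature that starts an HDF5 superblock.
const HDF5_SIGNATURE: [u8; 8] = [0x89, b'H', b'D', b'F', b'\r', b'\n', 0x1a, b'\n'];

/// Hypothetical helper: checks whether `reader` starts with the HDF5 signature.
/// Only offset 0 is checked in this sketch.
fn starts_with_hdf5_signature<R: Read>(reader: &mut R) -> io::Result<bool> {
    let mut buffer = [0u8; 8];
    match reader.read_exact(&mut buffer) {
        Ok(()) => Ok(buffer == HDF5_SIGNATURE),
        // Fewer than 8 bytes available: treat it as "not an HDF5 file".
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    // On the web the bytes might come from a download; here they are hard-coded.
    let mut bytes = HDF5_SIGNATURE.to_vec();
    bytes.extend_from_slice(&[0u8; 16]);
    let mut in_memory = Cursor::new(bytes);
    println!("{}", starts_with_hdf5_signature(&mut in_memory)?);
    Ok(())
}
```

The same function works unchanged with a `std::fs::File` on platforms that have a file system, since both types implement `Read`.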

However, the above API also requires holding on to, and keeping open, a reference to the actual file reader in every object returned by the API, such as for the groups and dataset in the above code.

Sharing such a file handle is very doable in Rust. One option is to stuff it into a thread-safe mutex-guarded container like Arc<Mutex<Box<dyn Read>>>. However, this adds a bit of unnecessary complexity and overhead if there is only one thread involved, like in my WebAssembly use case. It also makes it harder for the user to know when the file is actually closed, as any stray object still in scope will keep the handle open.
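
To make the trade-off concrete, here is a rough sketch of the shared-handle approach. The `File` and `Group` types and the `read_byte` method are hypothetical stand-ins, not the hdf5 crate's or Oxifive's types, and I have added a `Send` bound to the boxed reader so that the handle could actually cross threads.

```rust
use std::io::{Cursor, Read};
use std::sync::{Arc, Mutex};

// Hypothetical shared handle type for this sketch.
type SharedReader = Arc<Mutex<Box<dyn Read + Send>>>;

struct File {
    reader: SharedReader,
}

struct Group {
    reader: SharedReader,
}

impl File {
    fn new(reader: impl Read + Send + 'static) -> Self {
        let boxed: Box<dyn Read + Send> = Box::new(reader);
        File { reader: Arc::new(Mutex::new(boxed)) }
    }

    // Every child object gets its own clone of the Arc.
    fn group(&self, _name: &str) -> Group {
        Group { reader: Arc::clone(&self.reader) }
    }
}

impl Group {
    // Reading requires locking the mutex, even on a single thread.
    fn read_byte(&self) -> std::io::Result<u8> {
        let mut buffer = [0u8; 1];
        self.reader.lock().unwrap().read_exact(&mut buffer)?;
        Ok(buffer[0])
    }
}

fn main() -> std::io::Result<()> {
    let file = File::new(Cursor::new(vec![1u8, 2, 3]));
    let group = file.group("group");
    let group2 = file.group("group2");
    drop(file);
    // The reader is still alive: both groups hold a reference to it.
    println!("handles: {}", Arc::strong_count(&group.reader));
    println!("first byte: {}", group.read_byte()?);
    println!("second byte: {}", group2.read_byte()?);
    Ok(())
}
```

Dropping `file` does not close anything here, which is exactly the "stray object keeps the handle open" behavior described above.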

Another option is to ensure objects from the same file share a reader with a common lifetime, but that will often result in errors about multiple mutable borrows. Take the following mock-up of the API as an example of my attempt at this, inspired by the zip crate. It will not compile, because there will be multiple mutable borrows of the group at the same time:

use std::io::Read;

struct File<R> {
    reader: R,
}

impl<R: Read> File<R> {
    pub fn new(reader: R) -> Self {
        File {
            reader,
        }
    }
    pub fn group<'a>(&'a mut self, name: &str) -> Group<'a> {
        Group {
            reader: &mut self.reader,
        }
    }
}

struct Group<'a> {
    reader: &'a mut dyn Read,
}

impl<'a> Group<'a> {
    pub fn group(&'a mut self, name: &str) -> Group<'a> {
        Group {
            reader: self.reader,
        }
    }

    pub fn dataset(&'a mut self, name: &str) -> Dataset<'a> {
        Dataset {
            reader: self.reader,
        }
    }
}

struct Dataset<'a> {
    reader: &'a mut dyn Read,
}

fn main() -> std::io::Result<()> {
    let input = std::fs::File::open("my_file.h5")?;
    let mut file = File::new(input);
    let mut group = file.group("group");
    let mut group2 = group.group("group2");
    let mut group3 = group.group("group3"); // this additional borrow is not allowed
    let mut dataset = group.dataset("data"); // nor this
    Ok(())
}

There are probably ways to improve the above code to get around this, but I could not find one that did not get me into even more complicated lifetime issues.

And by the way, I think the above lifetime sharing works well in the zip crate because it does not have the same concept of nested object hierarchies like we have in the hdf5 crate (and HDF5 files for that matter). The problematic borrowing occurs sooner if you have multiple child objects referring to the same parent object, all sharing the same reader.

Instead, I figured that I can leave an opinionated high-level API for later, and instead let the user handle the lifetime of the main reader explicitly. This can be done by turning the File into a FileReader that needs to be passed into each call:

let input = Box::new(std::fs::File::open("my_file.h5")?);
let mut file = oxifive::FileReader::new(input)?;
let group = file.group("group")?;
let group2 = group.group(&mut file, "group2")?;
let data = group.dataset(&mut file, "data")?;
let array = data.read::<f32, Ix3>(&mut file)?;
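
To show why this sidesteps the borrowing problem, here is a toy mock-up of the idea. None of this is Oxifive's actual implementation: the `root` and `read_u8_at` methods, the `address` field, the `Seek` bound, and the fake one-byte "file format" are all assumptions made for the illustration. The point is only that objects hold plain data rather than a borrow of the reader.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

// Hypothetical stand-in for the reader that the user owns.
struct FileReader<R> {
    reader: R,
}

// Objects hold plain data (here a byte offset), not a borrow of the reader.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Group {
    address: u64,
}

impl<R: Read + Seek> FileReader<R> {
    fn new(reader: R) -> Self {
        FileReader { reader }
    }

    fn root(&self) -> Group {
        Group { address: 0 }
    }

    fn read_u8_at(&mut self, address: u64) -> io::Result<u8> {
        self.reader.seek(SeekFrom::Start(address))?;
        let mut buffer = [0u8; 1];
        self.reader.read_exact(&mut buffer)?;
        Ok(buffer[0])
    }
}

impl Group {
    // Toy lookup: the byte stored at this group's address is the child's address.
    // A real implementation would parse HDF5 structures and use `_name`.
    fn group<R: Read + Seek>(&self, file: &mut FileReader<R>, _name: &str) -> io::Result<Group> {
        let child = file.read_u8_at(self.address)?;
        Ok(Group { address: u64::from(child) })
    }
}

fn main() -> io::Result<()> {
    let mut file = FileReader::new(Cursor::new(vec![2u8, 0, 3, 0]));
    let group = file.root();
    // The mutable borrow of `file` ends with each call, so siblings are fine.
    let group2 = group.group(&mut file, "group2")?;
    let group3 = group.group(&mut file, "group3")?;
    let nested = group2.group(&mut file, "nested")?;
    println!("{:?} {:?} {:?}", group2, group3, nested);
    Ok(())
}
```

Because each mutable borrow of `file` lasts only for the duration of one call, any number of groups and datasets can coexist.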

An alternative twist to this API is to add more member functions on the FileReader itself. We can then pass in the groups and datasets to the FileReader:

let input = Box::new(std::fs::File::open("my_file.h5")?);
let mut file = oxifive::FileReader::new(input)?;
let group = file.group("group")?;
let group2 = file.group_from_parent(group, "group2")?;
let data = file.dataset_from_parent(group, "data")?;
let array = file.read::<f32, Ix3>(data)?;

I am a bit torn about which API I prefer personally, but think I will stick with the first one for the time being.

There is one big risk with both of these APIs, though: the user can pass the wrong file into these calls. The FileReader would then try to read bytes from the wrong HDF5 file when parsing an object. However, this can be caught at runtime by giving each file a unique identifier, storing it in every object created from that file, and verifying that the identifiers match in every call.
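
A minimal sketch of such a check could look like the following. The global counter, the `WrongFile` error variant, and the `id` and `file_id` fields are assumptions for this example and may not match how Oxifive ends up doing it. The actual reader is left out to keep the focus on the identifier.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter that hands out one identifier per opened file.
static NEXT_FILE_ID: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, PartialEq)]
enum Error {
    WrongFile { expected: u64, found: u64 },
}

// Hypothetical reader; the byte source itself is omitted in this sketch.
struct FileReader {
    id: u64,
}

// Every object remembers which file it was created from.
struct Group {
    file_id: u64,
}

impl FileReader {
    fn new() -> Self {
        FileReader { id: NEXT_FILE_ID.fetch_add(1, Ordering::Relaxed) }
    }

    fn group(&mut self, _name: &str) -> Group {
        Group { file_id: self.id }
    }
}

impl Group {
    fn group(&self, file: &mut FileReader, _name: &str) -> Result<Group, Error> {
        // Refuse to parse if the caller handed us a different file.
        if file.id != self.file_id {
            return Err(Error::WrongFile { expected: self.file_id, found: file.id });
        }
        Ok(Group { file_id: file.id })
    }
}

fn main() {
    let mut file_a = FileReader::new();
    let mut file_b = FileReader::new();
    let group = file_a.group("group");
    assert!(group.group(&mut file_a, "group2").is_ok());
    assert!(group.group(&mut file_b, "group2").is_err());
}
```

The check costs one integer comparison per call, and it turns a confusing parse failure into an explicit error.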

And why the name Oxifive?

Well, the Rust API Guidelines discourage crate names with “rust” or “rs” in them. So I did as many others have done before me: I looked for something related to physical rust and chose “oxi” as short for “oxidized”. The “five” is to tag along with the “pyfive” and “jsfive” libraries that also implement native HDF5 readers.