
[Question] is hdf5 the best format for data storage? #98

Open
Howuhh opened this issue Jun 23, 2023 · 4 comments

Comments

@Howuhh
Contributor

Howuhh commented Jun 23, 2023

Question

While hdf5 (via h5py) is the most popular approach for storing multi-dimensional arrays, it has some major limitations. For example, the inability to read data from multiple processes / threads simultaneously, which can be important for implementing efficient data loading.

There is an alternative, Zarr, which is very similar but a bit more capable. I think a discussion on this would be useful to the community.
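
For illustration, a minimal sketch of what concurrent reads could look like with Zarr; the store path "dataset.zarr" and the "observations" array are hypothetical, not an actual Minari layout.

```python
# Hypothetical Zarr store: multiple worker processes read slices in parallel.
import zarr
from multiprocessing import Pool

def read_episode(idx):
    # Each worker opens the store independently; Zarr's chunked layout
    # means only the requested slice is pulled from disk.
    root = zarr.open("dataset.zarr", mode="r")
    return root["observations"][idx].sum()

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(read_episode, range(8))
```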

@elliottower
Member

elliottower commented Jun 23, 2023

Haven't tested it myself, but it looks like hdf5 and h5py should be able to support multi-process reads (https://docs.h5py.org/en/latest/swmr.html?#multiprocess-concurrent-write-and-read), although multithreading doesn't seem to be possible as far as I can see.
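
For reference, a minimal sketch of the single-writer/multiple-reader (SWMR) pattern from those docs; the file name and dataset layout here are assumptions.

```python
import h5py

# Writer: SWMR requires the latest file format, and datasets must exist
# before the mode is switched on.
f = h5py.File("data.h5", "w", libver="latest")
dset = f.create_dataset("observations", shape=(0,), maxshape=(None,), dtype="f4")
f.swmr_mode = True

# Reader (in another process): can open the same file while the writer
# keeps appending, without locking it out.
r = h5py.File("data.h5", "r", libver="latest", swmr=True)
print(r["observations"].shape)
```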

Haven't heard of Zarr before, but did some googling vs. hdf5 and saw a few issues saying it was slower than hdf5, and this paper seems to reach the same conclusion: https://arxiv.org/pdf/2207.09503.pdf (though it only mentions multithreading once in the beginning, so I'm guessing it doesn't test that extensively, and that does seem to be one of Zarr's main advantages). As you say though, Zarr does have advantages in concurrency and chunking, which sounds useful (https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/). Here's something comparing it with parquet: https://sites.google.com/view/raybellwaves/blog/zarr-vs-parquet

We talked about this a bit in the lambda meeting today; the plan is to reach out to different people and see what they would prefer, to avoid arbitrarily changing formats and then switching to yet another one in the future.

It seems like Apache Arrow may be a good choice: it's used by huggingface datasets, as well as Ray Data and, as of recently, pandas. You can save tables to disk as either Arrow/Feather files (uncompressed afaik, but fast to read) or Parquet files (compressed and more intended for long-term storage). It looks like huggingface saves them directly as Arrow files, so that seems like a reasonable thing to do here too imo. Converting between Parquet and Arrow is supposed to be very easy, and I think Parquet would be flexible enough to support complex nested data like Minari has, but maybe not (more info: https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/)
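
As a rough sketch of that Arrow <-> Parquet round trip with pyarrow (the episode schema here is purely illustrative):

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Illustrative episode table; nested list columns are the kind of
# structure Minari-style observations would need.
table = pa.table({
    "episode_id": [0, 0, 1],
    "reward": [0.0, 1.0, 0.5],
    "observation": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

feather.write_feather(table, "episodes.arrow")  # fast to read back
pq.write_table(table, "episodes.parquet")       # compressed, long-term storage
table_back = pq.read_table("episodes.parquet")  # lossless round trip
```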

John mentioned something today about an issue with storing tuples in huggingface datasets though, so it sounds like there may be some unexpected issues to work out. I posted in the dev channel, but it feels important to figure out these sorts of things early on, as the longer it's delayed the more painful it will be to switch formats.

@jamartinh

I have done some research over several months to identify formats and their advantages.

In the end, I have found hdf5 and h5py to be the best option:

- It is a well-established standard, even more so than other kinds of file formats.
- It allows saving numpy arrays directly and easily.
- It allows "single writer, multiple readers" directly.
- It supports compression.
- If done carefully, it only puts into RAM the data you are actually reading, such as a single episode. This lets you open a file, read the stats, and filter, i.e., only load episodes of interest, without consuming all the RAM, which allows for big files and also makes data access faster.
- I use hdf5 for multiprocessing: each instance opens the same file but only loads the data it needs, so RAM stays safe for big files and multi-process workloads.
- My tests indicate that opening an hdf5 file just to read a single episode is faster than with any other file type (a sketch of this access pattern is below).
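
A minimal sketch of that per-episode lazy-loading pattern, assuming a hypothetical layout with one HDF5 group per episode and a "total_reward" attribute for filtering:

```python
import h5py

with h5py.File("dataset.h5", "r") as f:
    # Group and dataset handles are cheap; no array data is read yet.
    ep = f["episode_42"]
    # Filter on stored stats first, so uninteresting episodes never hit RAM.
    if ep.attrs.get("total_reward", 0.0) > 100.0:
        observations = ep["observations"][:]  # only this slice is loaded
```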

@elliottower
Member


Thanks for the feedback, it seems to be the most widely used format in the field of offline RL as well, so in terms of compatibility and standardizing things it's probably the best choice. There's definitely an argument to be made for Zarr, but imo the best approach is to support alternative file formats like that as an option, while still maintaining compatibility with HDF5 as the standard.

@eugeneteoh

safetensors could be a good option.

Also, I would store each transition as a separate file; a single file will be huge when the observation space is large (e.g. images).
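
A minimal sketch of that per-transition idea using safetensors' numpy API; the file naming and keys are assumptions.

```python
import numpy as np
from safetensors.numpy import save_file, load_file

# One transition per file; keys and shapes are illustrative.
transition = {
    "observation": np.zeros((84, 84, 3), dtype=np.uint8),
    "action": np.array([1], dtype=np.int64),
    "reward": np.array([0.5], dtype=np.float32),
}
save_file(transition, "transition_000001.safetensors")

# Loading reads the header first, so large image observations don't
# all have to be resident in RAM at once.
back = load_file("transition_000001.safetensors")
```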
