Order of attrs fields not stable in interchange format #184

Open
JGuetschow opened this issue Dec 4, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@JGuetschow
Contributor

Describe the bug

The order of the attrs fields in the interchange format is not always the same, which leads to differences when the same data is saved by different users. In the YAML file this is directly visible, but since we have also seen differing checksums for binary files, there might be a similar problem there.

Failing Test

No built-in test is known to be failing. We noticed this when re-reading a dataset version in the Andrew cement data repository.
See this PR.
@crdanielbusch can you clone primap2 and run make test to see if anything fails for you?

Expected behavior

Dataset metadata (and actual data) should always be ordered in the same way, so that when saving with DataLad only actual data differences are detected as changes, and not a reordering of metadata or data.
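To illustrate the expected behavior, here is a minimal sketch of what a stable attrs order could look like. It is not primap2's actual writer: the helper names `canonicalize` and `attrs_digest` are hypothetical, the attrs values are made up for illustration, and the sketch assumes that all attrs keys are strings. JSON from the standard library stands in for the YAML serialization.

```python
import hashlib
import json


def canonicalize(obj):
    """Recursively rebuild mappings with their keys in sorted order.

    Hypothetical helper, assuming all mapping keys are strings.
    """
    if isinstance(obj, dict):
        return {key: canonicalize(obj[key]) for key in sorted(obj)}
    if isinstance(obj, (list, tuple)):
        return [canonicalize(item) for item in obj]
    return obj


def attrs_digest(attrs):
    """Checksum of the attrs after putting them in canonical key order."""
    text = json.dumps(canonicalize(attrs), ensure_ascii=False)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Illustrative attrs: same content, different insertion order.
attrs_user_a = {
    "area": "area (ISO3)",
    "cat": "category (IPCC2006)",
    "scen": "scenario (PRIMAP)",
}
attrs_user_b = {
    "scen": "scenario (PRIMAP)",
    "cat": "category (IPCC2006)",
    "area": "area (ISO3)",
}

# Serializing as-is preserves insertion order, so the text differs.
naive_a = json.dumps(attrs_user_a)
naive_b = json.dumps(attrs_user_b)

# After canonicalization both users get the same checksum.
digest_a = attrs_digest(attrs_user_a)
digest_b = attrs_digest(attrs_user_b)
```

With something along these lines, `naive_a` and `naive_b` differ although the attrs are equal, while `digest_a` and `digest_b` match. Whether sorting alphabetically or using a fixed, documented field order is preferable would be a design decision for the interchange format.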

System (please complete the following information):
Original data read on Linux Mint, Python 3.10.12, pandas 1.2.1, primap2 0.9.7, xarray 2023.10.1
Conflicting read on Mac OS: @crdanielbusch can you add your package versions here?

@JGuetschow JGuetschow added the bug Something isn't working label Dec 4, 2023
@mikapfl
Member

mikapfl commented Jun 13, 2024

In general, this might be hard to achieve for binary data due to timestamps. But we can of course add this as a requirement, start testing for it, and specify in the data format descriptions that the data needs to be ordered in a specific way and have a stable bitstream. This might, for example, limit our options for compression.

@mikapfl
Member

mikapfl commented Jun 13, 2024

For example, zip files always contain a timestamp, and other compression algorithms change their bitstream with newer versions of the compression libraries (usually to achieve higher compression ratios). We'd need to do some research into actually stable binary data formats if we need to commit to bitstream stability for longer periods (years).
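The timestamp point can be reproduced with the standard library alone. The following is only a sketch: the helper `zip_bytes`, the file name, and the payload are made up for illustration, and the entries are stored uncompressed so that no compression library influences the bytes.

```python
import io
import zipfile

# Earliest timestamp the zip format can represent.
FIXED_TIMESTAMP = (1980, 1, 1, 0, 0, 0)


def zip_bytes(name, payload, date_time):
    """Write a single uncompressed entry and return the raw archive bytes.

    Hypothetical helper for illustration only.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        info = zipfile.ZipInfo(name, date_time=date_time)
        # The default depends on the platform, so pin it (3 = Unix).
        info.create_system = 3
        archive.writestr(info, payload)
    return buffer.getvalue()


payload = b"area,cat,value\nDEU,2.A.1,1.0\n"

# Same content, same pinned timestamp: identical bytes.
first = zip_bytes("data.csv", payload, FIXED_TIMESTAMP)
second = zip_bytes("data.csv", payload, FIXED_TIMESTAMP)

# Same content, different timestamp: the archive bytes differ.
later = zip_bytes("data.csv", payload, (2024, 6, 13, 12, 0, 0))
```

Here `first` and `second` are byte-identical, while `later` differs even though the payload is unchanged. Pinning the timestamp only addresses this one source of variation; it does not help with bitstream changes between versions of compression libraries.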
