
Replace checksums files by Dataset infos json #95

Merged
merged 12 commits into master on May 14, 2020

Conversation

lhoestq
Member

@lhoestq lhoestq commented May 13, 2020

Better verifications when loading a dataset

I replaced the urls_checksums directory, which used to contain checksums.txt and cached_sizes.txt, with a single file: dataset_infos.json. It's just a dict config_name -> DatasetInfo.

It simplifies and improves how verifications of checksums and split sizes are done, as they're all stored in DatasetInfo (one per config). Also, having access to the DatasetInfo up front makes it possible to check disk space before running download_and_prepare for a given config.

The dataset infos json file is human-readable; you can take a look at the squad one that I generated in this PR.
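For illustration, here's a minimal sketch of what reading such a file could look like. The config name, field names, and numbers below are made up to mimic the squad example; they are not the library's exact schema:

```python
import json
import os
import tempfile

# Illustrative dataset_infos.json content: config_name -> DatasetInfo-like dict.
# The field names and numbers here are assumptions, not the exact schema.
sample = {
    "plain_text": {
        "download_size": 35142551,
        "dataset_size": 89789763,
        "splits": {"train": {"num_examples": 87599}},
    }
}

path = os.path.join(tempfile.mkdtemp(), "dataset_infos.json")
with open(path, "w") as f:
    json.dump(sample, f)

# Because the infos are available before download_and_prepare, a loader
# can e.g. estimate the required disk space up front.
with open(path) as f:
    infos = json.load(f)

info = infos["plain_text"]
needed = info["download_size"] + info["dataset_size"]
print(needed)  # 124932314
```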

Renaming

In line with these changes, I did some renaming:
save_checksums -> save_infos
ignore_checksums -> ignore_verifications

For example, when you are creating a dataset you now have to run
nlp-cli test path/to/my/dataset --save_infos --all_configs
instead of
nlp-cli test path/to/my/dataset --save_checksums --all_configs
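The renamed flag also matters at load time: ignore_verifications skips the checksum/split-size checks. Below is a hedged sketch of the kind of check this enables; the function and its arguments are hypothetical, not the library's API:

```python
# Hypothetical sketch of checksum verification backed by dataset_infos.json.
# `expected` would come from the recorded infos; `computed` from fresh downloads.
def verify_checksums(expected, computed, ignore_verifications=False):
    if ignore_verifications:
        return True
    bad = [url for url, sha in expected.items() if computed.get(url) != sha]
    if bad:
        raise ValueError(f"Checksums failed for: {bad}")
    return True

expected = {"https://example.com/train.json": "abc123"}
print(verify_checksums(expected, {"https://example.com/train.json": "abc123"}))  # True
```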

And now, the fun part

We'll have to rerun nlp-cli test ... --save_infos --all_configs for all the datasets.
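A rerun over all datasets could be scripted along these lines. This is only a sketch: the "datasets" directory layout is an assumption, and only the command itself comes from the PR:

```python
import subprocess
from pathlib import Path

def infos_command(dataset_dir):
    # The command from this PR: regenerate dataset_infos.json for every config.
    return ["nlp-cli", "test", str(dataset_dir), "--save_infos", "--all_configs"]

# Sketch: assumes each dataset lives in its own subdirectory of "datasets".
# If that directory doesn't exist, the loop simply does nothing.
for dataset_dir in sorted(Path("datasets").glob("*")):
    if dataset_dir.is_dir():
        subprocess.run(infos_command(dataset_dir), check=True)
```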


Feedback appreciated!

Member

@thomwolf thomwolf left a comment


Ok, really clean!
I like the logic (not a huge fan of using _asdict_inner but it makes sense).
I think it's a nice improvement!

How should we update the files in the repo? Run a big job on a server or on somebody's computer who has most of the datasets already downloaded?

Contributor

@jplu jplu left a comment


Perfect! Much better than the simple checksum file ^^

@patrickvonplaten
Contributor

Great! LGTM :-)

@patrickvonplaten
Contributor

> Ok, really clean!
> I like the logic (not a huge fan of using _asdict_inner but it makes sense).
> I think it's a nice improvement!
>
> How should we update the files in the repo? Run a big job on a server or on somebody's computer who has most of the datasets already downloaded?

Maybe we can split the updates among us. IMO most datasets run very quickly.
I think I've downloaded 50 datasets; 80% loaded in <5min, 15% in <1h, and then there's wmt, which is still downloading (12h and counting).
I deleted my cache because the wmt downloads require quite a lot of space, so I only have parts of the wmt datasets on my computer.

@mariamabarham I guess you have downloaded most of the datasets, no?

@lhoestq lhoestq merged commit a576579 into master May 14, 2020
@lhoestq lhoestq deleted the replace-checksums-dset-info branch May 14, 2020 08:58