
Replace checksums files by Dataset infos json #95

Merged
merged 12 commits into master on May 14, 2020

Conversation

lhoestq
Member

@lhoestq lhoestq commented May 13, 2020

Better verifications when loading a dataset

I replaced the urls_checksums directory, which used to contain checksums.txt and cached_sizes.txt, with a single file: dataset_infos.json. It's just a dict config_name -> DatasetInfo.

It simplifies and improves how verifications of checksums and split sizes are done, as they're all stored in DatasetInfo (one per config). Also, having access to the DatasetInfo up front makes it possible to check disk space before running download_and_prepare for a given config.

The dataset infos json file is human-readable; you can take a look at the squad one that I generated in this PR.
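For illustration, here's a minimal sketch of what reading such a file could look like. The config name, field names, and numbers below are made up to mimic the squad example; they are not the library's exact schema:

```python
import json
import os
import tempfile

# Illustrative dataset_infos.json content: config_name -> DatasetInfo-like dict.
# The field names and numbers here are assumptions, not the exact schema.
sample = {
    "plain_text": {
        "download_size": 35142551,
        "dataset_size": 89789763,
        "splits": {"train": {"num_examples": 87599}},
    }
}

path = os.path.join(tempfile.mkdtemp(), "dataset_infos.json")
with open(path, "w") as f:
    json.dump(sample, f)

# Because the infos are available before download_and_prepare, a loader
# can e.g. estimate the required disk space up front.
with open(path) as f:
    infos = json.load(f)

info = infos["plain_text"]
needed = info["download_size"] + info["dataset_size"]
print(needed)  # 124932314
```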

Renaming

In line with these changes, I did some renaming:
save_checksums -> save_infos
ignore_checksums -> ignore_verifications

For example, when you are creating a dataset you now have to run
nlp-cli test path/to/my/dataset --save_infos --all_configs
instead of
nlp-cli test path/to/my/dataset --save_checksums --all_configs
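The renamed flag also matters at load time: ignore_verifications skips the checksum/split-size checks. Below is a hedged sketch of the kind of check this enables; the function and its arguments are hypothetical, not the library's API:

```python
# Hypothetical sketch of checksum verification backed by dataset_infos.json.
# `expected` would come from the recorded infos; `computed` from fresh downloads.
def verify_checksums(expected, computed, ignore_verifications=False):
    if ignore_verifications:
        return True
    bad = [url for url, sha in expected.items() if computed.get(url) != sha]
    if bad:
        raise ValueError(f"Checksums failed for: {bad}")
    return True

expected = {"https://example.com/train.json": "abc123"}
print(verify_checksums(expected, {"https://example.com/train.json": "abc123"}))  # True
```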

And now, the fun part

We'll have to rerun nlp-cli test ... --save_infos --all_configs for all the datasets.
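A rerun over all datasets could be scripted along these lines. This is only a sketch: the "datasets" directory layout is an assumption, and only the command itself comes from the PR:

```python
import subprocess
from pathlib import Path

def infos_command(dataset_dir):
    # The command from this PR: regenerate dataset_infos.json for every config.
    return ["nlp-cli", "test", str(dataset_dir), "--save_infos", "--all_configs"]

# Sketch: assumes each dataset lives in its own subdirectory of "datasets".
# If that directory doesn't exist, the loop simply does nothing.
for dataset_dir in sorted(Path("datasets").glob("*")):
    if dataset_dir.is_dir():
        subprocess.run(infos_command(dataset_dir), check=True)
```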


Feedback appreciated!

Member

@thomwolf thomwolf left a comment


Ok, really clean!
I like the logic (not a huge fan of using _asdict_inner but it makes sense).
I think it's a nice improvement!

How should we update the files in the repo? Run a big job on a server or on somebody's computer who has most of the datasets already downloaded?

Contributor

@jplu jplu left a comment


Perfect! Much better than the simple checksum file ^^

@patrickvonplaten
Contributor

Great! LGTM :-)

@patrickvonplaten
Contributor

> Ok, really clean!
> I like the logic (not a huge fan of using _asdict_inner but it makes sense).
> I think it's a nice improvement!
>
> How should we update the files in the repo? Run a big job on a server or on somebody's computer who has most of the datasets already downloaded?

Maybe we can split the updates among us. IMO most datasets run very quickly.
I think I've downloaded 50 datasets; 80% loaded in <5min, 15% in <1h, and then there's wmt, which is still downloading (12h and counting).
I deleted my cache because the wmt downloads require quite a lot of space, so I only have parts of the wmt datasets on my computer.

@mariamabarham I guess you have downloaded most of the datasets, no?

@lhoestq lhoestq merged commit a576579 into master May 14, 2020
@lhoestq lhoestq deleted the replace-checksums-dset-info branch May 14, 2020 08:58