Replace checksums files by Dataset infos json #95
Conversation
Ok, really clean!
I like the logic (not a huge fan of using `_asdict_inner`, but it makes sense).
I think it's a nice improvement!
How should we update the files in the repo? Run a big job on a server or on somebody's computer who has most of the datasets already downloaded?
Perfect! Much better than the simple checksum file ^^
Great! LGTM :-)
Maybe we can split the updates among us... IMO most datasets run very quickly. @mariamabarham I guess you have downloaded most of the datasets, no?
Better verifications when loading a dataset
I replaced the `urls_checksums` directory, which used to contain `checksums.txt` and `cached_sizes.txt`, with a single file `dataset_infos.json`. It's just a dict `config_name` -> `DatasetInfo`.

This simplifies and improves how verifications of checksums and split sizes are done, since they're all stored in `DatasetInfo` (one per config). Also, having access to `DatasetInfo` up front makes it possible to check disk space before running `download_and_prepare` for a given config.

The dataset infos json file is human-readable; you can take a look at the squad one that I generated in this PR.
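To illustrate the idea, here is a minimal sketch of what a `dataset_infos.json` file could look like and how a loader could use it. The field names (`download_checksums`, `download_size`, `splits`) and values below are illustrative placeholders, not the exact `DatasetInfo` schema from this PR:

```python
import json

# Hypothetical dataset_infos.json content: one entry per config name,
# bundling download checksums and split sizes in a single JSON dict.
example = {
    "plain_text": {
        "download_checksums": {
            "https://example.com/train.json": {"num_bytes": 1024, "checksum": "abc123"}
        },
        "download_size": 1024,
        "splits": {"train": {"num_examples": 100}},
    }
}

# The file is plain JSON, so it stays human-readable on disk.
serialized = json.dumps(example, indent=2)
infos = json.loads(serialized)

# Having the info before downloading lets a loader check required
# disk space up front, before download_and_prepare runs.
needed = infos["plain_text"]["download_size"]
print(needed)  # 1024
```

The key design point is that everything a loader needs to verify a config (checksums and split sizes) lives in one dict keyed by config name, instead of being spread across separate text files.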
Renaming
In line with these changes, I did some renaming:
- `save_checksums` -> `save_infos`
- `ignore_checksums` -> `ignore_verifications`

For example, when you are creating a dataset you now have to run

`nlp-cli test path/to/my/dataset --save_infos --all_configs`

instead of

`nlp-cli test path/to/my/dataset --save_checksums --all_configs`
And now, the fun part
We'll have to rerun `nlp-cli test ... --save_infos --all_configs` for all the datasets.

Feedback appreciated!