
[Tests] Local => aws #123

Merged · 3 commits · May 15, 2020

Conversation

Contributor

@patrickvonplaten commented May 15, 2020

Change the default tests from local => aws.

By default we set aws=True, local=False, slow=False.
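These defaults can be read from the environment with a small helper; a minimal sketch (the helper name and exact parsing semantics are illustrative, not necessarily the repo's actual utility):

```python
import os

def parse_flag_from_env(key, default=False):
    """Read a 0/1-style flag from the environment; fall back to `default`."""
    value = os.environ.get(key)
    if value is None:
        return default
    return value not in ("0", "false", "False")

# Defaults proposed in this PR: AWS tests on, local and slow tests off.
RUN_AWS = parse_flag_from_env("RUN_AWS", default=True)
RUN_LOCAL = parse_flag_from_env("RUN_LOCAL", default=False)
RUN_SLOW = parse_flag_from_env("RUN_SLOW", default=False)
```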

1. RUN_AWS=1 (default)

This runs 4 tests per dataset script.

a) Does the dataset script have a valid etag / Can it be reached on AWS?
b) Can we load its builder_class?
c) Can we load all dataset configs?
d) Most importantly: Can we load the dataset?

Important: to reduce test time, we currently test only the first config of each dataset. Total test time is around 1min20s.
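A rough sketch of how checks a)–d) can be gated behind RUN_AWS (stdlib unittest is used here for illustration; the class, method names, and bodies are hypothetical, not the PR's actual test code):

```python
import os
import unittest

RUN_AWS = os.environ.get("RUN_AWS", "1") == "1"  # AWS tests run by default

class AWSDatasetScriptTest(unittest.TestCase):
    @unittest.skipUnless(RUN_AWS, "set RUN_AWS=1 to enable AWS tests")
    def test_a_script_has_valid_etag(self):
        ...  # a) HEAD-request the script on AWS and check its etag

    @unittest.skipUnless(RUN_AWS, "set RUN_AWS=1 to enable AWS tests")
    def test_b_load_builder_class(self):
        ...  # b) import the dataset's builder class

    @unittest.skipUnless(RUN_AWS, "set RUN_AWS=1 to enable AWS tests")
    def test_c_load_all_configs(self):
        ...  # c) instantiate every dataset config

    @unittest.skipUnless(RUN_AWS, "set RUN_AWS=1 to enable AWS tests")
    def test_d_load_dataset(self):
        ...  # d) load the dataset itself (first config only, for speed)
```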

2. RUN_LOCAL=1 RUN_AWS=0

This should be done when debugging dataset scripts in the ./datasets folder.

This runs only 1 test per dataset script, equivalent to AWS test d): Can we load the dataset from the local ./datasets directory?

3. RUN_SLOW=1

We should set these tests up to run maybe once per week? @thomwolf

The slow tests include two more important tests.

e) Can we load the dataset with all possible configs? This test will probably fail at the moment because a lot of dummy data is missing. We should add the dummy data step by step to be sure that all configs work.

f) Test that the actual dataset can be loaded. This will take quite some time to run, but is important to make sure that the "real" data can be loaded. It will also test whether the dataset script has the correct checksums file, which is currently not tested with aws=True. @lhoestq - is there an easy way to cheaply check whether the dataset_infos.json is correct for each dataset script?

@patrickvonplaten changed the title [Tests] default to aws tests [Tests] Local => AWS May 15, 2020
@patrickvonplaten changed the title [Tests] Local => AWS [Tests] Local => aws May 15, 2020
@lhoestq
Member

lhoestq commented May 15, 2020

For each dataset, if there exists a dataset_infos.json, then the command nlp-cli test path/to/my/dataset --all_configs succeeds only if the dataset_infos.json is correct. The infos are correct if the sizes and checksums of the downloaded files are correct, and if the number of examples in each split is correct.

Note: the test command is supposed to test the script, that's why it runs the script even if the cached files already exist. Let me know if it's good for you.
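The check described above can be pictured with a small hypothetical helper (a sketch, not the nlp-cli implementation): the recorded infos pass only if the file size, the checksum, and the per-split example counts all match.

```python
import hashlib

def verify_dataset_infos(recorded, downloaded_bytes, examples_per_split):
    """Hypothetical sketch: return True iff the recorded infos match the
    downloaded data (size, checksum, and split example counts)."""
    if recorded["num_bytes"] != len(downloaded_bytes):
        return False
    if recorded["checksum"] != hashlib.sha256(downloaded_bytes).hexdigest():
        return False
    return recorded["splits"] == examples_per_split
```

In the real command, these values would come from dataset_infos.json on one side and from actually running the dataset script on the other.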

@patrickvonplaten
Contributor Author

> For each dataset, if there exists a dataset_infos.json, then the command nlp-cli test path/to/my/dataset --all_configs succeeds only if the dataset_infos.json is correct. The infos are correct if the sizes and checksums of the downloaded files are correct, and if the number of examples in each split is correct.
>
> Note: the test command is supposed to test the script, that's why it runs the script even if the cached files already exist. Let me know if it's good for you.

Does it have to download the whole dataset to check whether the checksums are correct? I guess so, no?

@patrickvonplaten merged commit 7d97b83 into master May 15, 2020
@patrickvonplaten deleted the release_aws_tests branch May 15, 2020 10:03
@lhoestq
Member

lhoestq commented May 15, 2020

> > For each dataset, if there exists a dataset_infos.json, then the command nlp-cli test path/to/my/dataset --all_configs succeeds only if the dataset_infos.json is correct. The infos are correct if the sizes and checksums of the downloaded files are correct, and if the number of examples in each split is correct.
> > Note: the test command is supposed to test the script, that's why it runs the script even if the cached files already exist.
>
> Does it have to download the whole dataset to check whether the checksums are correct? I guess so, no?

Yes, it has to download them all (unless they were already downloaded, in which case it just uses the cached files).
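Why the full download is needed: a checksum is computed over every byte of the file, so there is no cheap partial check. A minimal sketch (assuming SHA-256 over a stream of chunks; the helper name is illustrative):

```python
import hashlib

def sha256_of_stream(chunks):
    """Compute a SHA-256 digest over a stream of byte chunks. Every byte
    must be consumed, which is why the data has to be fully downloaded
    (or read back from the local download cache)."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()
```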
