Adding new splits to a dataset script with existing old splits info in metadata's dataset_info fails #5315

Open · polinaeterna opened this issue Nov 30, 2022 · 3 comments · May be fixed by #5327
Labels: bug (Something isn't working)

@polinaeterna (Contributor)

Describe the bug

If you first create a custom dataset with a specific set of splits and generate its metadata with datasets-cli test ... --save_info, then change your script so that it returns more splits, loading the dataset fails.

That's what happened in https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/discussions/2#6385fd1269634850f8ddff48.

Steps to reproduce the bug

  1. Create a dataset script whose _split_generators returns, for example, only a "train" split. To reproduce exactly, copy https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/blob/main/food_vision_199_classes.py (a minimal sketch of such a script is shown after these steps).
  2. Run datasets-cli test dataset_script.py --save_info --all_configs: this generates metadata YAML in README.md containing info about the splits, for example:
     splits:
     - name: train
       num_bytes: 2973286
       num_examples: 19747
  3. Change your script so that it returns another set of splits, for example "train" and "test" (uncomment those lines).
  4. Run load_dataset and get the following error:
Traceback (most recent call last):
  File "/home/daniel/code/pytorch/env/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/commands/datasets_cli.py", line 39, in main
    service.run()
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/commands/test.py", line 141, in run
    builder.download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 1555, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 913, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 1356, in _prepare_split
    split_info = self.info.splits[split_generator.name]
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/splits.py", line 525, in __getitem__
    instructions = make_file_instructions(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/arrow_reader.py", line 111, in make_file_instructions
    name2filenames = {
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/arrow_reader.py", line 112, in <dictcomp>
    info.name: filenames_for_dataset_split(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/naming.py", line 78, in filenames_for_dataset_split
    prefix = filename_prefix_for_split(dataset_name, split)
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/naming.py", line 57, in filename_prefix_for_split
    if os.path.basename(name) != name:
  File "/home/daniel/code/pytorch/env/lib/python3.8/posixpath.py", line 143, in basename
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
  5. Bonus: try to regenerate the metadata in README.md with datasets-cli as in step 2 and get the same error.
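
For reference, a minimal sketch of the relevant part of such a script (the class, feature, and file names here are illustrative, not taken from the actual food_vision_199_classes script):

```python
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    """Toy dataset script: starts out with only a "train" split (step 1)."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"path": "train.txt"}),
            # step 3: uncommenting this after running `--save_info` triggers the error
            # datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"path": "test.txt"}),
        ]

    def _generate_examples(self, path):
        with open(path, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}
```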

This is because dataset.info.splits contains only the "train" split, so when we do self.info.splits[split_generator.name] for the new "test" split, the key is not found and SplitDict.__getitem__ falls back to interpreting it as a slice instruction (something like info.splits['train[50%]']), which it is not, and it fails.
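
The failure can be reproduced directly on the split dict; here is a minimal sketch, assuming the datasets 2.7 internals shown in the traceback (the SplitDict built from the metadata keeps dataset_name=None, which is what eventually reaches os.path.basename):

```python
from datasets.splits import SplitDict, SplitInfo

# what gets read back from the saved metadata: only the "train" split
splits = SplitDict()  # dataset_name stays None, as in the traceback
splits.add(SplitInfo(name="train", num_bytes=2973286, num_examples=19747))

# "test" is not a key, so __getitem__ treats it as a slice instruction and
# ends up calling os.path.basename(None) deep inside make_file_instructions:
splits["test"]  # TypeError: expected str, bytes or os.PathLike object, not NoneType
```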

Expected behavior

To be discussed?

This can be worked around by first removing the stale splits information from the metadata file (see the snippet below). But I wonder if there is a better way.
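
For illustration, the block to delete would look like this in the README.md YAML (values from step 2 above):

```yaml
dataset_info:
  # ...
  splits:          # <- delete this whole block; it only knows about "train"
  - name: train
    num_bytes: 2973286
    num_examples: 19747
```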

Environment info

  • Datasets version: 2.7.1
  • Python version: 3.8.13
polinaeterna self-assigned this Nov 30, 2022
polinaeterna changed the title from "Adding new splits in a dataset script with existing old splits info in metadata's dataset_info fails" to "Adding new splits to a dataset script with existing old splits info in metadata's dataset_info fails" Nov 30, 2022
polinaeterna added the "bug (Something isn't working)" label Nov 30, 2022
@albertvillanova (Member) commented Nov 30, 2022

EDIT:
I think in this case, the metadata files (either README or JSON) should not be read (i.e. self.info.splits should be None).

One idea:

  • I think ideally we should set this behavior when we pass --save_info to the CLI test command (see the sketch below)
  • However, currently the builder is unaware of this: the save_info arg is not passed to it
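
A rough sketch of that idea as a hypothetical patch to the CLI test command (the self._save_infos attribute name is an assumption, not confirmed against the current code):

```python
# hypothetical patch inside datasets/commands/test.py, before preparing the builder
if self._save_infos:
    # the metadata is about to be regenerated and overwritten anyway, so
    # ignore the stale splits read from README.md / dataset_infos.json
    builder.info.splits = None
builder.download_and_prepare()
```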

@polinaeterna (Contributor, Author) commented Dec 1, 2022

> I think in this case

@albertvillanova You mean in cases when the script was changed?

I suggest that we:

  • add a check for the slice-like format (e.g. 'split_name[n%]') here: https://github.com/huggingface/datasets/blob/main/src/datasets/splits.py#L523 to catch things like this.
  • the error here happens before splits verification, in _prepare_split, and _prepare_split doesn't perform any verification and doesn't know about it. So we can pass this parameter down and take the splits from split_generator, not from self.info.splits, when verify_infos is False (see the sketch below).
  • we can check whether the split names from _split_generators and self.info.splits are the same before preparing the splits (if verify_infos=True) so that we don't spend time generating unwanted data.
  • provide some user-friendly warnings about the ignore_verifications parameter so that users know that, if something doesn't match, they can ignore it.
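
As a sketch of the second bullet (hypothetical signature, not the actual diff of #5327), _prepare_split could fall back to the generator's own SplitInfo when verification is off:

```python
# hypothetical change inside DatasetBuilder._prepare_split
def _prepare_split(self, split_generator, verify_infos=True, **prepare_split_kwargs):
    if verify_infos:
        # current behavior: trust the splits recorded in the metadata
        split_info = self.info.splits[split_generator.name]
    else:
        # the recorded metadata may be stale (e.g. a newly added "test" split),
        # so trust the SplitGenerator returned by _split_generators instead
        split_info = split_generator.split_info
    # ...continue writing the arrow files for split_info as before...
```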

I started it here: https://github.com/huggingface/datasets/pull/5327/files

What do you think @albertvillanova ?

@albertvillanova (Member) commented Dec 2, 2022

I edited my previous comment:

  • First I proposed setting self.info.splits to None when ignore_verifications=True
    • I thought it was the easiest implementation because ignore_verifications is passed to DatasetBuilder.download_and_prepare
    • However, afterwards, I realized this might not be a good idea for this use case:
      • A user wants to optimize the loading of the dataset, and passes ignore_verifications=True to skip all the verifications
        • In this case, we still want self.info.splits to be read from the metadata file
  • Then, I thought that it might be better to set self.info.splits to None when we pass --save_info to the CLI test: if we are going to save the info to the metadata file, it makes no sense to read the info from the metadata file
    • This implementation is not so easy because the Builder knows nothing about --save_info

I agree with you there are 2 things to be addressed here:

  • One is what I have just commented: self.info.splits should be None in this case
  • The other: a validation should be implemented when calling make_file_instructions and/or SplitDict.__getitem__, so that when passing a nonexistent split name (e.g. "training") to it, we get a more descriptive error than TypeError: expected str, bytes or os.PathLike object, not NoneType (see the sketch below)
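
For the second point, a rough sketch of what such a validation could look like (the subclass is shown only for illustration; the regex and error message are assumptions, not the actual fix):

```python
import re

from datasets.splits import SplitDict

# e.g. "train" or "train[50%]" (illustrative pattern)
_SPLIT_OR_SLICE_RE = re.compile(r"^[\w./-]+(\[.*\])?$")

class ValidatingSplitDict(SplitDict):
    # hypothetical: validate the key before delegating to SplitDict's
    # slice-instruction handling (which crashes on dataset_name=None)
    def __getitem__(self, key):
        if str(key) not in self and (
            self.dataset_name is None or not _SPLIT_OR_SLICE_RE.fullmatch(str(key))
        ):
            raise ValueError(
                f"Unknown split {str(key)!r}. Splits recorded in dataset_info: {sorted(self)}. "
                "If you added new splits to your script, update or delete the stale "
                "splits metadata, or pass ignore_verifications=True."
            )
        return super().__getitem__(key)
```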
