Adding new splits to a dataset script with existing old splits info in metadata's dataset_info fails #5315

Open · polinaeterna opened this issue Nov 30, 2022 · 3 comments · May be fixed by #5327
Labels: bug (Something isn't working)

@polinaeterna (Contributor)

Describe the bug

If you first create a custom dataset with a specific set of splits and generate its metadata with datasets-cli test ... --save_info, then change your script so that it returns more splits, loading the dataset fails.

That's what happened in https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/discussions/2#6385fd1269634850f8ddff48.

Steps to reproduce the bug

  1. Create a dataset script whose _split_generators returns, for example, only a "train" split. To reproduce exactly, copy https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/blob/main/food_vision_199_classes.py (a minimal sketch of such a script is shown after these steps).
  2. Run datasets-cli test dataset_script.py --save_info --all_configs: this generates metadata YAML in README.md containing info about the splits, for example:
     splits:
     - name: train
       num_bytes: 2973286
       num_examples: 19747
  3. Change your script so that it returns another set of splits, for example "train" and "test" (uncomment those lines).
  4. Run load_dataset and get the following error:
Traceback (most recent call last):
  File "/home/daniel/code/pytorch/env/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/commands/datasets_cli.py", line 39, in main
    service.run()
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/commands/test.py", line 141, in run
    builder.download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 1555, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 913, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/builder.py", line 1356, in _prepare_split
    split_info = self.info.splits[split_generator.name]
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/splits.py", line 525, in __getitem__
    instructions = make_file_instructions(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/arrow_reader.py", line 111, in make_file_instructions
    name2filenames = {
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/arrow_reader.py", line 112, in <dictcomp>
    info.name: filenames_for_dataset_split(
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/naming.py", line 78, in filenames_for_dataset_split
    prefix = filename_prefix_for_split(dataset_name, split)
  File "/home/daniel/code/pytorch/env/lib/python3.8/site-packages/datasets/naming.py", line 57, in filename_prefix_for_split
    if os.path.basename(name) != name:
  File "/home/daniel/code/pytorch/env/lib/python3.8/posixpath.py", line 143, in basename
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
  5. Bonus: try to regenerate the metadata in README.md with datasets-cli as in step 2 and get the same error.
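
For reference, a minimal sketch of the relevant part of such a script (the class, feature, and file names here are illustrative, not taken from the actual food_vision_199_classes script):

```python
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    """Toy dataset script: starts out with only a "train" split (step 1)."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"path": "train.txt"}),
            # step 3: uncommenting this after running `--save_info` triggers the error
            # datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"path": "test.txt"}),
        ]

    def _generate_examples(self, path):
        with open(path, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}
```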

This is because dataset.info.splits contains only the "train" split, so when we do self.info.splits[split_generator.name] for the new "test" split, the key is not found and SplitDict.__getitem__ falls back to interpreting it as a slice instruction (something like info.splits['train[50%]']), which it is not, and it fails.
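
The failure can be reproduced directly on the split dict; here is a minimal sketch, assuming the datasets 2.7 internals shown in the traceback (the SplitDict built from the metadata keeps dataset_name=None, which is what eventually reaches os.path.basename):

```python
from datasets.splits import SplitDict, SplitInfo

# what gets read back from the saved metadata: only the "train" split
splits = SplitDict()  # dataset_name stays None, as in the traceback
splits.add(SplitInfo(name="train", num_bytes=2973286, num_examples=19747))

# "test" is not a key, so __getitem__ treats it as a slice instruction and
# ends up calling os.path.basename(None) deep inside make_file_instructions:
splits["test"]  # TypeError: expected str, bytes or os.PathLike object, not NoneType
```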

Expected behavior

To be discussed?

This can be worked around by first removing the stale splits information from the metadata file (see the snippet below). But I wonder if there is a better way.
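
For illustration, the block to delete would look like this in the README.md YAML (values from step 2 above):

```yaml
dataset_info:
  # ...
  splits:          # <- delete this whole block; it only knows about "train"
  - name: train
    num_bytes: 2973286
    num_examples: 19747
```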

Environment info

  • Datasets version: 2.7.1
  • Python version: 3.8.13
polinaeterna self-assigned this Nov 30, 2022
polinaeterna changed the title from "Adding new splits in a dataset script with existing old splits info in metadata's dataset_info fails" to "Adding new splits to a dataset script with existing old splits info in metadata's dataset_info fails" Nov 30, 2022
polinaeterna added the "bug (Something isn't working)" label Nov 30, 2022
@albertvillanova (Member) commented Nov 30, 2022

EDIT:
I think in this case, the metadata files (either README or JSON) should not be read (i.e. self.info.splits should be None).

One idea:

  • I think ideally we should set this behavior when we pass --save_info to the CLI test command (see the sketch below)
  • However, currently the builder is unaware of this: the save_info arg is not passed to it
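
A rough sketch of that idea as a hypothetical patch to the CLI test command (the self._save_infos attribute name is an assumption, not confirmed against the current code):

```python
# hypothetical patch inside datasets/commands/test.py, before preparing the builder
if self._save_infos:
    # the metadata is about to be regenerated and overwritten anyway, so
    # ignore the stale splits read from README.md / dataset_infos.json
    builder.info.splits = None
builder.download_and_prepare()
```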

@polinaeterna (Contributor, Author) commented Dec 1, 2022

> I think in this case

@albertvillanova You mean in cases when the script was changed?

I suggest that we:

  • add a check for the slice-like format (e.g. 'split_name[n%]') here: https://github.com/huggingface/datasets/blob/main/src/datasets/splits.py#L523 to catch things like this.
  • the error here happens before splits verification, in _prepare_split, and _prepare_split doesn't perform any verification and doesn't know about it. So we can pass this parameter down and take the splits from split_generator, not from self.info.splits, when verify_infos is False (see the sketch below).
  • we can check whether the split names from _split_generators and self.info.splits are the same before preparing the splits (if verify_infos=True) so that we don't spend time generating unwanted data.
  • provide some user-friendly warnings about the ignore_verifications parameter so that users know that, if something doesn't match, they can ignore it.
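
As a sketch of the second bullet (hypothetical signature, not the actual diff of #5327), _prepare_split could fall back to the generator's own SplitInfo when verification is off:

```python
# hypothetical change inside DatasetBuilder._prepare_split
def _prepare_split(self, split_generator, verify_infos=True, **prepare_split_kwargs):
    if verify_infos:
        # current behavior: trust the splits recorded in the metadata
        split_info = self.info.splits[split_generator.name]
    else:
        # the recorded metadata may be stale (e.g. a newly added "test" split),
        # so trust the SplitGenerator returned by _split_generators instead
        split_info = split_generator.split_info
    # ...continue writing the arrow files for split_info as before...
```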

I started it here: https://github.com/huggingface/datasets/pull/5327/files

What do you think @albertvillanova ?

@albertvillanova (Member) commented Dec 2, 2022

I edited my previous comment:

  • First I proposed setting self.info.splits to None when ignore_verifications=True
    • I thought it was the easiest implementation because ignore_verifications is passed to DatasetBuilder.download_and_prepare
    • However, afterwards, I realized this might not be a good idea for this use case:
      • A user wants to optimize the loading of the dataset, and passes ignore_verifications=True to skip all the verifications
        • In this case, we still want self.info.splits to be read from the metadata file
  • Then, I thought that it might be better to set self.info.splits to None when we pass --save_info to the CLI test: if we are going to save the info to the metadata file, it makes no sense to read the info from the metadata file
    • This implementation is not so easy because the Builder knows nothing about --save_info

I agree with you there are 2 things to be addressed here:

  • One is what I have just commented: self.info.splits should be None in this case
  • The other: a validation should be implemented when calling make_file_instructions and/or SplitDict.__getitem__, so that when passing a nonexistent split name (e.g. "training") to it, we get a more descriptive error than TypeError: expected str, bytes or os.PathLike object, not NoneType (see the sketch below)
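
For the second point, a rough sketch of what such a validation could look like (the subclass is shown only for illustration; the regex and error message are assumptions, not the actual fix):

```python
import re

from datasets.splits import SplitDict

# e.g. "train" or "train[50%]" (illustrative pattern)
_SPLIT_OR_SLICE_RE = re.compile(r"^[\w./-]+(\[.*\])?$")

class ValidatingSplitDict(SplitDict):
    # hypothetical: validate the key before delegating to SplitDict's
    # slice-instruction handling (which crashes on dataset_name=None)
    def __getitem__(self, key):
        if str(key) not in self and (
            self.dataset_name is None or not _SPLIT_OR_SLICE_RE.fullmatch(str(key))
        ):
            raise ValueError(
                f"Unknown split {str(key)!r}. Splits recorded in dataset_info: {sorted(self)}. "
                "If you added new splits to your script, update or delete the stale "
                "splits metadata, or pass ignore_verifications=True."
            )
        return super().__getitem__(key)
```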
