Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional per-dataset default config name #889

Merged
merged 12 commits into from
Nov 30, 2020

Conversation

joeddav
Copy link
Contributor

@joeddav joeddav commented Nov 25, 2020

This PR adds a DEFAULT_CONFIG_NAME class attribute to DatasetBuilder. This allows a dataset to have a specified default config name when a dataset has more than one config but the user does not specify it. For example, after defining DEFAULT_CONFIG_NAME = "combined" in PolyglotNER, a user can now do the following:

ds = load_dataset("polyglot_ner")

which is equivalent to,

ds = load_dataset("polyglot_ner", "combined")

In effect (for this particular dataset configuration), this means that if the user doesn't specify a language, they are given the combined dataset including all languages.

Since it doesn't always make sense to have a default config, this feature is opt-in. If DEFAULT_CONFIG_NAME is not defined and a user does not pass a config for a dataset with multiple configs available, a ValueError is raised like usual.

Let me know what you think about this approach @lhoestq @thomwolf and I'll add some documentation and define a default for some of our existing datasets.

@lhoestq
Copy link
Member

lhoestq commented Nov 26, 2020

I like the idea ! And the approach is right imo

Note that by changing this we will have to add a way for users to get the config lists of a dataset. In the current user workflow, the user could see the list of the config when the missing config error is raised but now it won't be the case because of the default config.

@lhoestq
Copy link
Member

lhoestq commented Nov 26, 2020

Maybe let's add a test in the test_builder.py test script ?

src/datasets/inspect.py Outdated Show resolved Hide resolved
@joeddav joeddav changed the title [WIP] optional per-dataset default config name Optional per-dataset default config name Nov 28, 2020
@joeddav
Copy link
Contributor Author

joeddav commented Nov 28, 2020

@lhoestq Okay great, I added a test as well as two new inspect functions: get_dataset_config_names and get_dataset_infos (the latter is something I've been wanting anyway). As a quick hack, you can also just pass a random config name (e.g. an empty string) to load_dataset to get the config names in the error msg as before. Also added a couple paragraphs to the adding new datasets doc.

I'll send a separate PR incorporating this in existing datasets so we can get this merged before our sprint on Monday.

Any ideas on the failing tests? I'm having trouble making sense of it. Edit: nvm, it was master.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot ! LGTM :)

@@ -573,3 +584,17 @@ def test_custom_writer_batch_size(self):
dataset3 = dummy_builder3.as_dataset("train")
self.assertEqual(len(dataset3._data[0].chunks), 10)
del dataset1, dataset2, dataset3

def test_config_names(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this test !

Comment on lines +100 to +127
def get_dataset_infos(path: str):
"""Get the meta information about a dataset, returned as a dict mapping config name to DatasetInfoDict.

Args:
path (``str``): path to the dataset processing script with the dataset builder. Can be either:
- a local path to processing script or the directory containing the script (if the script has the same name as the directory),
e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'``
- a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``)
e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'``
"""
module_path, _ = prepare_module(path)
builder_cls = import_main_class(module_path, dataset=True)
return builder_cls.get_all_exported_dataset_infos()


def get_dataset_config_names(path: str):
"""Get the list of available config names for a particular dataset.

Args:
path (``str``): path to the dataset processing script with the dataset builder. Can be either:
- a local path to processing script or the directory containing the script (if the script has the same name as the directory),
e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'``
- a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``)
e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'``
"""
module_path, _ = prepare_module(path)
builder_cls = import_main_class(module_path, dataset=True)
return list(builder_cls.builder_configs.keys())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it

logger.info("No config specified, defaulting to first: %s/%s", self.name, builder_config.name)
if self.DEFAULT_CONFIG_NAME is not None:
builder_config = self.builder_configs.get(self.DEFAULT_CONFIG_NAME)
logger.info("No config specified, defaulting to: %s/%s", self.name, builder_config.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we said offline, this should be warning indeed.

The other one on the other hand No config specified... should stay at info level since there is no ambiguity in which config to load.

docs/source/add_dataset.rst Outdated Show resolved Hide resolved
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@joeddav joeddav merged commit 1d8bc19 into huggingface:master Nov 30, 2020
@joeddav joeddav deleted the default-config branch November 30, 2020 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants