Optional per-dataset default config name #889

joeddav · 2020-11-25T21:02:30Z

This PR adds a DEFAULT_CONFIG_NAME class attribute to DatasetBuilder. It lets a dataset with more than one config declare which config is loaded when the user does not specify one. For example, after defining DEFAULT_CONFIG_NAME = "combined" in PolyglotNER, a user can now do the following:

ds = load_dataset("polyglot_ner")

which is equivalent to:

ds = load_dataset("polyglot_ner", "combined")

In effect (for this particular dataset configuration), this means that if the user doesn't specify a language, they are given the combined dataset including all languages.

Since it doesn't always make sense to have a default config, this feature is opt-in. If DEFAULT_CONFIG_NAME is not defined and a user does not pass a config for a dataset with multiple configs available, a ValueError is raised as usual.
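To illustrate the opt-in behaviour described above, here is a minimal, self-contained sketch. It is not the library's implementation: ToyBuilder, resolve_config, and the plain-string configs are hypothetical stand-ins, and only the DEFAULT_CONFIG_NAME and builder_configs names come from this PR.

```python
from typing import Dict, Optional


class ToyBuilder:
    """Hypothetical stand-in for a dataset builder (not the real DatasetBuilder)."""

    # Opt-in: a subclass sets this to the config name to use when none is passed.
    DEFAULT_CONFIG_NAME: Optional[str] = None
    # Maps config name -> config object (plain strings here, for brevity).
    builder_configs: Dict[str, str] = {}

    @classmethod
    def resolve_config(cls, name: Optional[str] = None) -> str:
        """Return the config name to load, or raise ValueError if it is ambiguous."""
        available = sorted(cls.builder_configs)
        if name is not None:
            if name not in cls.builder_configs:
                raise ValueError(f"Unknown config {name!r}. Available: {available}")
            return name
        if cls.DEFAULT_CONFIG_NAME is not None:
            # The dataset author opted in to a default.
            return cls.DEFAULT_CONFIG_NAME
        if len(cls.builder_configs) == 1:
            # Only one config exists, so there is nothing to disambiguate.
            return available[0]
        raise ValueError(f"Config name is missing. Available: {available}")


class WithDefault(ToyBuilder):
    DEFAULT_CONFIG_NAME = "combined"
    builder_configs = {"en": "English", "fr": "French", "combined": "All languages"}


class WithoutDefault(ToyBuilder):
    builder_configs = {"en": "English", "fr": "French"}
```

In this sketch, WithDefault.resolve_config() returns "combined", while WithoutDefault.resolve_config() raises a ValueError that lists the available configs.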

Let me know what you think about this approach @lhoestq @thomwolf and I'll add some documentation and define a default for some of our existing datasets.

lhoestq · 2020-11-26T10:05:37Z

I like the idea! And the approach is right imo.

Note that with this change we will have to add a way for users to list the configs of a dataset. In the current workflow, the user sees the list of configs when the missing-config error is raised, but that will no longer be the case once a default config is defined.

lhoestq · 2020-11-26T17:30:04Z

Maybe let's add a test in the test_builder.py test script ?

src/datasets/inspect.py

joeddav · 2020-11-28T19:38:39Z

@lhoestq Okay great, I added a test as well as two new inspect functions: get_dataset_config_names and get_dataset_infos (the latter is something I've been wanting anyway). As a quick hack, you can also just pass a random config name (e.g. an empty string) to load_dataset to get the config names in the error msg as before. Also added a couple paragraphs to the adding new datasets doc.

I'll send a separate PR incorporating this in existing datasets so we can get this merged before our sprint on Monday.

Any ideas on the failing tests? I'm having trouble making sense of it. Edit: nvm, it was master.

lhoestq

Thanks a lot ! LGTM :)

lhoestq · 2020-11-30T14:01:54Z

tests/test_builder.py

@@ -573,3 +584,17 @@ def test_custom_writer_batch_size(self):
            dataset3 = dummy_builder3.as_dataset("train")
            self.assertEqual(len(dataset3._data[0].chunks), 10)
            del dataset1, dataset2, dataset3
+
+    def test_config_names(self):


thanks for adding this test !

lhoestq · 2020-11-30T14:02:13Z

src/datasets/inspect.py

+def get_dataset_infos(path: str):
+    """Get the meta information about a dataset, returned as a dict mapping config name to DatasetInfoDict.
+
+    Args:
+        path (``str``): path to the dataset processing script with the dataset builder. Can be either:
+            - a local path to processing script or the directory containing the script (if the script has the same name as the directory),
+                e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'``
+            - a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``)
+                e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'``
+    """
+    module_path, _ = prepare_module(path)
+    builder_cls = import_main_class(module_path, dataset=True)
+    return builder_cls.get_all_exported_dataset_infos()
+
+
+def get_dataset_config_names(path: str):
+    """Get the list of available config names for a particular dataset.
+
+    Args:
+        path (``str``): path to the dataset processing script with the dataset builder. Can be either:
+            - a local path to processing script or the directory containing the script (if the script has the same name as the directory),
+                e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'``
+            - a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``)
+                e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'``
+    """
+    module_path, _ = prepare_module(path)
+    builder_cls = import_main_class(module_path, dataset=True)
+    return list(builder_cls.builder_configs.keys())


lhoestq · 2020-11-30T14:03:28Z

src/datasets/builder.py

-            logger.info("No config specified, defaulting to first: %s/%s", self.name, builder_config.name)
+            if self.DEFAULT_CONFIG_NAME is not None:
+                builder_config = self.builder_configs.get(self.DEFAULT_CONFIG_NAME)
+                logger.info("No config specified, defaulting to: %s/%s", self.name, builder_config.name)


As we said offline, this should indeed be a warning.

The other "No config specified..." message, on the other hand, should stay at info level, since there is no ambiguity about which config to load.
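A rough sketch of the log-level split discussed in this comment, using only the standard logging module. The pick_config function, its parameters, and the logger name are hypothetical; the message strings follow the diff above.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger("default_config_sketch")  # hypothetical logger name


def pick_config(dataset_name: str, config_names: Sequence[str], default_name: Optional[str] = None) -> str:
    """Choose a config when the user passed none, logging at the level discussed above."""
    if default_name is not None:
        # Other configs exist, so the fallback is worth surfacing: warning level.
        logger.warning("No config specified, defaulting to: %s/%s", dataset_name, default_name)
        return default_name
    if len(config_names) > 1:
        raise ValueError(f"Config name is missing. Available: {list(config_names)}")
    only_name = config_names[0]
    # A single config leaves no ambiguity: info level is enough.
    logger.info("No config specified, defaulting to first: %s/%s", dataset_name, only_name)
    return only_name
```

With this split, a user running at the default WARNING level would see the fallback message for a dataset that defines a default, but not for a single-config dataset.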

docs/source/add_dataset.rst

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

joeddav added 3 commits November 25, 2020 15:36

add optional default config name

2204837

add default to polyglot ner

6d1700f

style

becb9e3

add functions to get dataset meta info

439f261

lhoestq reviewed Nov 27, 2020

joeddav added 4 commits November 28, 2020 13:25

use builder_configs

e4c418b

add config names test

30c50df

style

1d2230f

add docs

2e9a673

joeddav changed the title ~~[WIP] optional per-dataset default config name~~ Optional per-dataset default config name Nov 28, 2020

joeddav added 3 commits November 28, 2020 14:44

polyglot ner combined const -> private

f041442

Merge branch 'master' into default-config

9c2a18d

default config warning

8f2e498

lhoestq approved these changes Nov 30, 2020

doc typo

0428e8e

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

joeddav merged commit 1d8bc19 into huggingface:master Nov 30, 2020

joeddav deleted the default-config branch November 30, 2020 17:27
