-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optional per-dataset default config name #889
Conversation
I like the idea ! And the approach is right imo Note that by changing this we will have to add a way for users to get the config lists of a dataset. In the current user workflow, the user could see the list of the config when the missing config error is raised but now it won't be the case because of the default config. |
Maybe let's add a test in the test_builder.py test script ? |
@lhoestq Okay great, I added a test as well as two new inspect functions: I'll send a separate PR incorporating this in existing datasets so we can get this merged before our sprint on Monday. Any ideas on the failing tests? I'm having trouble making sense of it. Edit: nvm, it was master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot ! LGTM :)
@@ -573,3 +584,17 @@ def test_custom_writer_batch_size(self): | |||
dataset3 = dummy_builder3.as_dataset("train") | |||
self.assertEqual(len(dataset3._data[0].chunks), 10) | |||
del dataset1, dataset2, dataset3 | |||
|
|||
def test_config_names(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this test !
def get_dataset_infos(path: str): | ||
"""Get the meta information about a dataset, returned as a dict mapping config name to DatasetInfoDict. | ||
|
||
Args: | ||
path (``str``): path to the dataset processing script with the dataset builder. Can be either: | ||
- a local path to processing script or the directory containing the script (if the script has the same name as the directory), | ||
e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'`` | ||
- a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``) | ||
e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'`` | ||
""" | ||
module_path, _ = prepare_module(path) | ||
builder_cls = import_main_class(module_path, dataset=True) | ||
return builder_cls.get_all_exported_dataset_infos() | ||
|
||
|
||
def get_dataset_config_names(path: str): | ||
"""Get the list of available config names for a particular dataset. | ||
|
||
Args: | ||
path (``str``): path to the dataset processing script with the dataset builder. Can be either: | ||
- a local path to processing script or the directory containing the script (if the script has the same name as the directory), | ||
e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'`` | ||
- a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``) | ||
e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'`` | ||
""" | ||
module_path, _ = prepare_module(path) | ||
builder_cls = import_main_class(module_path, dataset=True) | ||
return list(builder_cls.builder_configs.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love it
src/datasets/builder.py
Outdated
logger.info("No config specified, defaulting to first: %s/%s", self.name, builder_config.name) | ||
if self.DEFAULT_CONFIG_NAME is not None: | ||
builder_config = self.builder_configs.get(self.DEFAULT_CONFIG_NAME) | ||
logger.info("No config specified, defaulting to: %s/%s", self.name, builder_config.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we said offline, this should be warning indeed.
The other one on the other hand No config specified...
should stay at info level since there is no ambiguity in which config to load.
This PR adds a
DEFAULT_CONFIG_NAME
class attribute toDatasetBuilder
. This allows a dataset to have a specified default config name when a dataset has more than one config but the user does not specify it. For example, after definingDEFAULT_CONFIG_NAME = "combined"
in PolyglotNER, a user can now do the following:which is equivalent to,
In effect (for this particular dataset configuration), this means that if the user doesn't specify a language, they are given the combined dataset including all languages.
Since it doesn't always make sense to have a default config, this feature is opt-in. If
DEFAULT_CONFIG_NAME
is not defined and a user does not pass a config for a dataset with multiple configs available, a ValueError is raised like usual.Let me know what you think about this approach @lhoestq @thomwolf and I'll add some documentation and define a default for some of our existing datasets.