-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata validation #2107
Metadata validation #2107
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool thanks :)
I left a few comments.
Also I was wondering this is really needed to have utils.metadata
as a submodule of datasets
? This is only used by the CI so I'm not sure we should have this in the actual datasets
package.
I'm unclear on the suggestion, would you rather have a root-level |
Ok that makes sense if we want to have functions that parse the metadata for users |
Hi @gchhablani, yes I think at the moment the best solution is for you to write in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good :)
Can you just add docstrings before we merge ?
Can you add tests as well ?
src/datasets/utils/metadata.py
Outdated
known_multilingualities, known_multilingualities_url = load_json_resource("multilingualities.json") | ||
|
||
|
||
def dict_from_readme(f: Path) -> Optional[Dict[str, List[str]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def dict_from_readme(f: Path) -> Optional[Dict[str, List[str]]]: | |
def dict_from_readme(path: Path) -> Optional[Dict[str, List[str]]]: |
Use explicit argument names.
Can you also add a docstring ?
This reverts commit ab82a6c.
cc @abhi1thakur |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @theo-m @SBrandeis !
pydantic
metadata schema with dedicated validators against our taxonomyfor reference with the current validation we have
365378 datasets with invalid metadata! full error report here.