Metadata validation #2107
Merged
Changes from 9 commits
32 commits
fadc0a0
basic validation
7a4b594
ci script and test change
c3c97ea
color is better
2fe5787
check all option
0f68ce4
validate size cats & multiling, point to reference file urls on error
2d264e8
add validation to ci and rename files
fc46ec3
spurrious change to trigger CI
58763d2
add qa reqs
115d252
disallow empty lists
9ae048e
better error msg: show all invalid values rather than first one
299e907
some code shuffling & better error msg for langcodes
b4a0665
add pyyaml to qa reqs
7eeb647
fix package file loading
3a94086
include json resources
e4409a9
reflect changes to size cats from https://github.com/huggingface/data…
9450b5f
trying another format for package_data
58709bf
ci works! fixing the readme like a good citizen 🤗
702a8a1
escape validation everywhere it's allowed in the tagging app
d3eec3c
code review: more json files, conditional import
59d7dde
Merge remote-tracking branch 'origin/master' into theo/config-validator
84de013
pointers to integrate readme metadata in class (wip)
7fbd51d
no pydantic
0aefcae
Merge remote-tracking branch 'origin/master' into theo/config-validator
ab82a6c
fix docs?
a4953db
Revert "fix docs?"
4cfd2e8
Merge remote-tracking branch 'origin/master' into theo/config-validator
e63d325
remove pointers to add readme to loader
2f2e197
Merge branch 'master' into theo/config-validator
SBrandeis
3102ccf
Get rid of langcodes, some refactor
SBrandeis
a9846fd
Update languages.json
SBrandeis
551ae96
Refactor, add tests
SBrandeis
8afb25a
I said, tests!!
SBrandeis
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| #!/usr/bin/env python | ||
| | ||
| """ This script will run in CI and make sure all new changes to datasets readme files have valid metadata yaml headers. | ||
| | ||
| """ | ||
| | ||
| from pathlib import Path | ||
| from subprocess import check_output | ||
| from typing import List | ||
| | ||
| from pydantic import ValidationError | ||
| | ||
| from datasets.utils.metadata import DatasetMetadata | ||
| | ||
| | ||
| def get_changed_files(repo_path: Path) -> List[Path]: | ||
|     diff_output = check_output(["git", "diff", "--name-only", "HEAD..origin/master"], cwd=repo_path) | ||
|     changed_files = [Path(repo_path, f) for f in diff_output.decode().splitlines()] | ||
|     return changed_files | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     import logging | ||
|     from argparse import ArgumentParser | ||
| | ||
|     logging.basicConfig(level=logging.DEBUG) | ||
| | ||
|     ap = ArgumentParser() | ||
|     ap.add_argument("--repo_path", type=Path, default=Path.cwd()) | ||
|     ap.add_argument("--check_all", action="store_true") | ||
|     args = ap.parse_args() | ||
| | ||
|     repo_path: Path = args.repo_path | ||
|     if args.check_all: | ||
|         readmes = [dd / "README.md" for dd in (repo_path / "datasets").iterdir()] | ||
|     else: | ||
|         changed_files = get_changed_files(repo_path) | ||
|         readmes = [ | ||
|             f | ||
|             for f in changed_files | ||
|             if f.exists() and f.name.lower() == "readme.md" and f.parent.parent.name == "datasets" | ||
|         ] | ||
| | ||
|     failed: List[Path] = [] | ||
|     for readme in sorted(readmes): | ||
|         try: | ||
|             DatasetMetadata.from_readme(readme) | ||
|             logging.debug(f"✅️ Validated '{readme.relative_to(repo_path)}'") | ||
|         except ValidationError as e: | ||
|             failed.append(readme) | ||
|             logging.warning(f"❌ Failed to validate '{readme.relative_to(repo_path)}':\n{e}") | ||
|         except Exception as e: | ||
|             failed.append(readme) | ||
|             logging.warning(f"⁉️ Something unexpected happened on '{readme.relative_to(repo_path)}':\n{e}") | ||
| | ||
|     if len(failed) > 0: | ||
|         logging.info(f"❌ Failed on {len(failed)} files.") | ||
|         exit(1) | ||
|     else: | ||
|         logging.info("All is well, keep up the good work 🤗!") | ||
|         exit(0) | ||
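
The path filter in the CI script above can be tried in isolation. The following is a minimal sketch, not the PR's code: `select_dataset_readmes` is a hypothetical helper name, the sample paths are made up, and the `f.exists()` check from the script is left out so the example runs without a checkout.

```python
from pathlib import Path
from typing import Iterable, List


def select_dataset_readmes(changed: Iterable[Path]) -> List[Path]:
    """Keep only paths shaped like `<...>/datasets/<name>/README.md` (hypothetical helper)."""
    return sorted(
        p
        for p in changed
        # Same two conditions as the script: file name and grandparent directory name.
        if p.name.lower() == "readme.md" and p.parent.parent.name == "datasets"
    )


# Made-up list of changed files, standing in for the `git diff --name-only` output.
changed = [
    Path("datasets/squad/README.md"),
    Path("datasets/squad/squad.py"),
    Path("README.md"),
    Path("docs/datasets/README.md"),
]
selected = select_dataset_readmes(changed)
```

Only `datasets/squad/README.md` survives: the top-level README has no `datasets` grandparent, and `docs/datasets/README.md` sits one level too high.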
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,155 @@ | ||
| import json | ||
| import logging | ||
| from pathlib import Path | ||
| from typing import Any, Callable, Dict, List, Optional, Tuple | ||
| | ||
| import langcodes as lc | ||
| import yaml | ||
| from pydantic import BaseModel, conlist, validator | ||
| | ||
| | ||
| BASE_REF_URL = "https://github.com/huggingface/datasets/tree/master/src/datasets/utils" | ||
| this_url = f"{BASE_REF_URL}/{__file__}" | ||
| logger = logging.getLogger(__name__) | ||
| | ||
| | ||
| def dict_from_readme(f: Path) -> Optional[Dict[str, List[str]]]: | ||
|     with f.open() as fi: | ||
|         content = [line.strip() for line in fi] | ||
| | ||
|     if content[0] == "---" and "---" in content[1:]: | ||
|         yamlblock = "\n".join(content[1 : content[1:].index("---") + 1]) | ||
|         metada_dict = yaml.safe_load(yamlblock) or dict() | ||
|         return metada_dict | ||
| | ||
| | ||
| def load_json_resource(resource: str) -> Tuple[Dict, str]: | ||
|     utils_dir = Path(__file__).parent | ||
|     with open(utils_dir / "resources" / resource) as fi: | ||
|         return json.load(fi), f"{BASE_REF_URL}/resources/{resource}" | ||
| | ||
| | ||
| known_licenses, known_licenses_url = load_json_resource("licenses.json") | ||
| known_task_ids, known_task_ids_url = load_json_resource("tasks.json") | ||
| known_creators, known_creators_url = load_json_resource("creators.json") | ||
| known_size_categories = ["unknown", "n<1K", "1K<n<10K", "10K<n<100K", "100K<n<1M", "n>1M"] | ||
| known_multilingualities = { | ||
|     "monolingual": "contains a single language", | ||
|     "multilingual": "contains multiple languages", | ||
|     "translation": "contains translated or aligned text", | ||
|     "other": "other type of language distribution", | ||
| } | ||
| | ||
| | ||
| def tagset_validator(values: List[str], reference_values: List[str], name: str, url: str) -> List[str]: | ||
|     for v in values: | ||
|         if v not in reference_values: | ||
|             raise ValueError(f"'{v}' is not a registered tag for '{name}', reference at {url}") | ||
|     return values | ||
| | ||
| | ||
| def splitter(values: List[Any], predicate_fn: Callable[[Any], bool]) -> Tuple[List[Any], List[Any]]: | ||
|     trues, falses = list(), list() | ||
|     for v in values: | ||
|         if predicate_fn(v): | ||
|             trues.append(v) | ||
|         else: | ||
|             falses.append(v) | ||
|     return trues, falses | ||
| | ||
| | ||
| class DatasetMetadata(BaseModel): | ||
|     annotations_creators: conlist(str, min_items=1) | ||
|     language_creators: conlist(str, min_items=1) | ||
|     languages: conlist(str, min_items=1) | ||
|     licenses: conlist(str, min_items=1) | ||
|     multilinguality: conlist(str, min_items=1) | ||
|     size_categories: conlist(str, min_items=1) | ||
|     source_datasets: conlist(str, min_items=1) | ||
|     task_categories: conlist(str, min_items=1) | ||
|     task_ids: conlist(str, min_items=1) | ||
| | ||
|     @classmethod | ||
|     def from_readme(cls, f: Path) -> "DatasetMetadata": | ||
|         metadata_dict = dict_from_readme(f) | ||
|         if metadata_dict is not None: | ||
|             return cls(**metadata_dict) | ||
|         else: | ||
|             raise ValueError(f"did not find a yaml block in '{f}'") | ||
| | ||
|     @classmethod | ||
|     def from_yaml_string(cls, string: str) -> "DatasetMetadata": | ||
|         metada_dict = yaml.safe_load(string) or dict() | ||
|         return cls(**metada_dict) | ||
| | ||
|     @classmethod | ||
|     def empty(cls) -> "DatasetMetadata": | ||
|         return cls( | ||
|             annotations_creators=list(), | ||
|             language_creators=list(), | ||
|             languages=list(), | ||
|             licenses=list(), | ||
|             multilinguality=list(), | ||
|             size_categories=list(), | ||
|             source_datasets=list(), | ||
|             task_categories=list(), | ||
|             task_ids=list(), | ||
|         ) | ||
| | ||
|     @validator("annotations_creators") | ||
|     def annotations_creators_must_be_in_known_set(cls, annotations_creators: List[str]) -> List[str]: | ||
|         return tagset_validator(annotations_creators, known_creators["annotations"], "annotations", known_creators_url) | ||
| | ||
|     @validator("language_creators") | ||
|     def language_creators_must_be_in_known_set(cls, language_creators: List[str]) -> List[str]: | ||
|         return tagset_validator(language_creators, known_creators["language"], "annotations", known_creators_url) | ||
| | ||
|     @validator("languages") | ||
|     def language_code_must_be_recognized(cls, languages: List[str]): | ||
|         for code in languages: | ||
|             try: | ||
|                 lc.get(code) | ||
|             except lc.tag_parser.LanguageTagError: | ||
|                 raise ValueError( | ||
|                     f"'{code}' is not recognised as a valid language code (BCP47 norm), you can refer to https://github.com/LuminosoInsight/langcodes" | ||
|                 ) | ||
|         return languages | ||
| | ||
|     @validator("licenses") | ||
|     def licenses_must_be_in_known_set(cls, licenses: List[str]): | ||
|         return tagset_validator(licenses, list(known_licenses.keys()), "licenses", known_licenses_url) | ||
| | ||
|     @validator("task_categories") | ||
|     def task_category_must_be_in_known_set(cls, task_categories: List[str]): | ||
|         # TODO: we're currently ignoring all values starting with 'other' as our task taxonomy is bound to change | ||
|         # in the near future and we don't want to waste energy in tagging against a moving taxonomy. | ||
|         known_set = list(known_task_ids.keys()) | ||
|         others, to_validate = splitter(task_categories, lambda e: e.startswith("other")) | ||
|         return [*tagset_validator(to_validate, known_set, "tasks_ids", known_task_ids_url), *others] | ||
| | ||
|     @validator("task_ids") | ||
|     def task_id_must_be_in_known_set(cls, task_ids: List[str]): | ||
|         # TODO: we're currently ignoring all values starting with 'other' as our task taxonomy is bound to change | ||
|         # in the near future and we don't want to waste energy in tagging against a moving taxonomy. | ||
|         known_set = [tid for _cat, d in known_task_ids.items() for tid in d["options"]] | ||
|         others, to_validate = splitter(task_ids, lambda e: e.startswith("other")) | ||
|         return [*tagset_validator(to_validate, known_set, "tasks_ids", known_task_ids_url), *others] | ||
| | ||
|     @validator("multilinguality") | ||
|     def multilinguality_must_be_in_known_set(cls, multilinguality: List[str]): | ||
|         return tagset_validator(multilinguality, list(known_multilingualities.keys()), "multilinguality", this_url) | ||
| | ||
|     @validator("size_categories") | ||
|     def size_categories_must_be_in_known_set(cls, size_cats: List[str]): | ||
|         return tagset_validator(size_cats, known_size_categories, "size_categories", this_url) | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     from argparse import ArgumentParser | ||
| | ||
|     ap = ArgumentParser(usage="Validate the yaml metadata block of a README.md file.") | ||
|     ap.add_argument("readme_filepath") | ||
|     args = ap.parse_args() | ||
| | ||
|     readme_filepath = Path(args.readme_filepath) | ||
|     DatasetMetadata.from_readme(readme_filepath) | ||
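
Two ideas from this file can be illustrated with the standard library alone: locating the `---` delimited header at the top of a README, and reporting every unregistered tag at once (the direction the commit "better error msg: show all invalid values rather than first one" points to). This is a sketch under assumptions, not the PR's implementation: `extract_yaml_block` and `find_invalid_tags` are hypothetical names, the sample README is made up, and the block is returned as text rather than parsed with PyYAML.

```python
from typing import List, Optional


def extract_yaml_block(text: str) -> Optional[str]:
    """Return the text between a leading '---' line and the next '---' line, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Lines 1..i-1 form the header body; the delimiters are excluded.
            return "\n".join(lines[1:i])
    return None  # opening delimiter without a closing one


def find_invalid_tags(values: List[str], reference_values: List[str]) -> List[str]:
    """Collect every value that is missing from the reference list, in input order."""
    return [v for v in values if v not in reference_values]


# Made-up README content for illustration.
readme = "---\nlicenses:\n- mit\n---\n# Dataset card\n"
block = extract_yaml_block(readme)
invalid = find_invalid_tags(["found", "bogus", "other", "nope"], ["found", "other"])
```

Unlike `dict_from_readme` in the diff, this sketch does not strip each line before joining, so indentation inside the header is preserved for whatever parser consumes it.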
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| { | ||
| "language": [ | ||
| "found", | ||
| "crowdsourced", | ||
| "expert-generated", | ||
| "machine-generated", | ||
| "other" | ||
| ], | ||
| "annotations": [ | ||
| "found", | ||
| "crowdsourced", | ||
| "expert-generated", | ||
| "machine-generated", | ||
| "no-annotation", | ||
| "other" | ||
| ] | ||
| } | ||
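
The JSON above has the two keys that the validators read as `known_creators["annotations"]` and `known_creators["language"]`. A minimal sketch of checking both creators fields against it follows; `check_creators` is a hypothetical helper, the JSON is inlined as a string instead of being loaded from the resources directory, and the sample metadata is made up.

```python
import json
from typing import Dict, List

# Inlined copy of the reference lists shown in the diff above.
CREATORS_JSON = """
{
  "language": ["found", "crowdsourced", "expert-generated", "machine-generated", "other"],
  "annotations": ["found", "crowdsourced", "expert-generated", "machine-generated",
                  "no-annotation", "other"]
}
"""
known_creators = json.loads(CREATORS_JSON)


def check_creators(metadata: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each creators field to the values missing from its reference list."""
    field_to_key = {"annotations_creators": "annotations", "language_creators": "language"}
    errors: Dict[str, List[str]] = {}
    for field, key in field_to_key.items():
        bad = [v for v in metadata.get(field, []) if v not in known_creators[key]]
        if bad:
            errors[field] = bad
    return errors


# "no-annotation" is registered for annotations only, so it is flagged for language_creators.
errors = check_creators(
    {"annotations_creators": ["no-annotation"], "language_creators": ["no-annotation", "found"]}
)
```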
Use explicit argument names.
Can you also add a docstring?
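
The code this comment is attached to is not visible in this view, so the following is only one possible reading of the request, applied to `tagset_validator` from the diff: the function body is kept as shown there, a docstring is added, and the call site passes every argument by name. The docstring wording and the `https://example.com/licenses.json` URL are placeholders, not taken from the PR.

```python
from typing import List


def tagset_validator(values: List[str], reference_values: List[str], name: str, url: str) -> List[str]:
    """Check that every value is a registered tag.

    Args:
        values: tags read from the README metadata.
        reference_values: the registered tags for this field.
        name: field name used in the error message.
        url: where the reference list can be consulted.

    Returns:
        The input values, unchanged, when all of them are registered.

    Raises:
        ValueError: on the first value missing from reference_values.
    """
    for v in values:
        if v not in reference_values:
            raise ValueError(f"'{v}' is not a registered tag for '{name}', reference at {url}")
    return values


# Call with explicit argument names; the URL is a placeholder.
licenses = tagset_validator(
    values=["mit"],
    reference_values=["mit", "apache-2.0"],
    name="licenses",
    url="https://example.com/licenses.json",
)
```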