Metadata validation #2107
Merged
Commits (32 in total; this view shows changes from 18 commits)
fadc0a0 basic validation
7a4b594 ci script and test change
c3c97ea color is better
2fe5787 check all option
0f68ce4 validate size cats & multiling, point to reference file urls on error
2d264e8 add validation to ci and rename files
fc46ec3 spurrious change to trigger CI
58763d2 add qa reqs
115d252 disallow empty lists
9ae048e better error msg: show all invalid values rather than first one
299e907 some code shuffling & better error msg for langcodes
b4a0665 add pyyaml to qa reqs
7eeb647 fix package file loading
3a94086 include json resources
e4409a9 reflect changes to size cats from https://github.com/huggingface/data…
9450b5f trying another format for package_data
58709bf ci works! fixing the readme like a good citizen 🤗
702a8a1 escape validation everywhere it's allowed in the tagging app
d3eec3c code review: more json files, conditional import
59d7dde Merge remote-tracking branch 'origin/master' into theo/config-validator
84de013 pointers to integrate readme metadata in class (wip)
7fbd51d no pydantic
0aefcae Merge remote-tracking branch 'origin/master' into theo/config-validator
ab82a6c fix docs?
a4953db Revert "fix docs?"
4cfd2e8 Merge remote-tracking branch 'origin/master' into theo/config-validator
e63d325 remove pointers to add readme to loader
2f2e197 Merge branch 'master' into theo/config-validator (SBrandeis)
3102ccf Get rid of langcodes, some refactor (SBrandeis)
a9846fd Update languages.json (SBrandeis)
551ae96 Refactor, add tests (SBrandeis)
8afb25a I said, tests!! (SBrandeis)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | | |
| | | #!/usr/bin/env python |
| | | |
| | | """ This script will run in CI and make sure all new changes to datasets readme files have valid metadata yaml headers. |
| | | |
| | | """ |
| | | |
| | | from pathlib import Path |
| | | from subprocess import check_output |
| | | from typing import List |
| | | |
| | | from pydantic import ValidationError |
| | | |
| | | from datasets.utils.metadata import DatasetMetadata |
| | | |
| | | |
| | | def get_changed_files(repo_path: Path) -> List[Path]: |
| | |     diff_output = check_output(["git", "diff", "--name-only", "HEAD..origin/master"], cwd=repo_path) |
| | |     changed_files = [Path(repo_path, f) for f in diff_output.decode().splitlines()] |
| | |     return changed_files |
| | | |
| | | |
| | | if __name__ == "__main__": |
| | |     import logging |
| | |     from argparse import ArgumentParser |
| | | |
| | |     logging.basicConfig(level=logging.DEBUG) |
| | | |
| | |     ap = ArgumentParser() |
| | |     ap.add_argument("--repo_path", type=Path, default=Path.cwd()) |
| | |     ap.add_argument("--check_all", action="store_true") |
| | |     args = ap.parse_args() |
| | | |
| | |     repo_path: Path = args.repo_path |
| | |     if args.check_all: |
| | |         readmes = [dd / "README.md" for dd in (repo_path / "datasets").iterdir()] |
| | |     else: |
| | |         changed_files = get_changed_files(repo_path) |
| | |         readmes = [ |
| | |             f |
| | |             for f in changed_files |
| | |             if f.exists() and f.name.lower() == "readme.md" and f.parent.parent.name == "datasets" |
| | |         ] |
| | | |
| | |     failed: List[Path] = [] |
| | |     for readme in sorted(readmes): |
| | |         try: |
| | |             DatasetMetadata.from_readme(readme) |
| | |             logging.debug(f"✅️ Validated '{readme.relative_to(repo_path)}'") |
| | |         except ValidationError as e: |
| | |             failed.append(readme) |
| | |             logging.warning(f"❌ Failed to validate '{readme.relative_to(repo_path)}':\n{e}") |
| | |         except Exception as e: |
| | |             failed.append(readme) |
| | |             logging.warning(f"⁉️ Something unexpected happened on '{readme.relative_to(repo_path)}':\n{e}") |
| | | |
| | |     if len(failed) > 0: |
| | |         logging.info(f"❌ Failed on {len(failed)} files.") |
| | |         exit(1) |
| | |     else: |
| | |         logging.info("All is well, keep up the good work 🤗!") |
| | |         exit(0) |
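
The README selection step in the script above can be looked at in isolation. The following is a minimal sketch, not code from this PR: the function name `select_dataset_readmes` is hypothetical, and it only reproduces the filter condition used in the list comprehension (the file still exists, is named `README.md` in any letter case, and sits exactly at `datasets/<name>/README.md`).

```python
from pathlib import Path
from typing import Iterable, List


def select_dataset_readmes(changed_files: Iterable[Path]) -> List[Path]:
    """Keep only existing README files located at datasets/<name>/README.md.

    Hypothetical helper: it mirrors the filter in the CI script, nothing more.
    """
    return sorted(
        f
        for f in changed_files
        # f.exists() skips files that the diff lists because they were deleted.
        if f.exists() and f.name.lower() == "readme.md" and f.parent.parent.name == "datasets"
    )
```

Because the check relies on `f.parent.parent.name`, a README at the repository root or in a nested sub-folder of a dataset is ignored, as are paths reported by `git diff` that no longer exist on disk.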
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,188 @@ | | |
| | | import json |
| | | import logging |
| | | from pathlib import Path |
| | | from typing import Any, Callable, Dict, List, Optional, Tuple |
| | | |
| | | |
| | | # loading package files: https://stackoverflow.com/a/20885799 |
| | | try: |
| | |     import importlib.resources as pkg_resources |
| | | except ImportError: |
| | |     # Try backported to PY<37 `importlib_resources`. |
| | |     import importlib_resources as pkg_resources |
| | | |
| | | import langcodes as lc |
| | | import yaml |
| | | from pydantic import BaseModel, conlist, validator |
| | | |
| | | from . import resources |
| | | |
| | | |
| | | BASE_REF_URL = "https://github.com/huggingface/datasets/tree/master/src/datasets/utils" |
| | | this_url = f"{BASE_REF_URL}/{__file__}" |
| | | logger = logging.getLogger(__name__) |
| | | |
| | | |
| | | def load_json_resource(resource: str) -> Tuple[Dict, str]: |
| | |     content = pkg_resources.read_text(resources, resource) |
| | |     return json.loads(content), f"{BASE_REF_URL}/resources/{resource}" |
| | | |
| | | |
| | | known_licenses, known_licenses_url = load_json_resource("licenses.json") |
| | | known_task_ids, known_task_ids_url = load_json_resource("tasks.json") |
| | | known_creators, known_creators_url = load_json_resource("creators.json") |
| | | known_size_categories = [ |
| | |     "unknown", |
| | |     "n<1K", |
| | |     "1K<n<10K", |
| | |     "10K<n<100K", |
| | |     "100K<n<1M", |
| | |     "1M<n<10M", |
| | |     "10M<n<100M", |
| | |     "100M<n<1B", |
| | |     "1B<n<10B", |
| | |     "10B<n<100B", |
| | |     "100B<n<1T", |
| | |     "n>1T", |
| | | ] |
| | | known_multilingualities = { |
| | |     "monolingual": "contains a single language", |
| | |     "multilingual": "contains multiple languages", |
| | |     "translation": "contains translated or aligned text", |
| | |     "other": "other type of language distribution", |
| | | } |
| | | |
| | | |
| | | def dict_from_readme(f: Path) -> Optional[Dict[str, List[str]]]: |

Suggested change:
- def dict_from_readme(f: Path) -> Optional[Dict[str, List[str]]]:
+ def dict_from_readme(path: Path) -> Optional[Dict[str, List[str]]]:

Use explicit argument names. Can you also add a docstring?
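
To illustrate the review request, here is a hedged sketch of what `dict_from_readme` could look like with an explicit argument name and a docstring. The body of the function is not visible in this view, so everything below the signature is an assumption: it supposes the metadata header is a YAML block delimited by `---` lines at the top of the README, and the helper `yaml_block_from_readme` is a hypothetical name introduced for the example.

```python
from pathlib import Path
from typing import Dict, List, Optional


def yaml_block_from_readme(path: Path) -> Optional[str]:
    """Return the text between the leading '---' markers of a README, or None.

    Assumption: the metadata header starts on the first line with '---'
    and ends at the next line that is exactly '---'.
    """
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:end])
    return None  # opening marker without a closing one


def dict_from_readme(path: Path) -> Optional[Dict[str, List[str]]]:
    """Load the YAML metadata header of the README at `path` as a dict.

    Returns None when the README has no metadata header.
    """
    import yaml  # PyYAML; imported here so the helper above stays stdlib-only

    block = yaml_block_from_readme(path)
    if block is None:
        return None
    # An empty header parses to None, so fall back to an empty dict.
    return yaml.safe_load(block) or {}
```

Splitting the text extraction from the YAML parsing keeps the file handling testable without PyYAML installed; whether the real implementation is structured this way is not something this page confirms.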
Empty file.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | | |
| | | { |
| | |     "language": [ |
| | |         "found", |
| | |         "crowdsourced", |
| | |         "expert-generated", |
| | |         "machine-generated", |
| | |         "other" |
| | |     ], |
| | |     "annotations": [ |
| | |         "found", |
| | |         "crowdsourced", |
| | |         "expert-generated", |
| | |         "machine-generated", |
| | |         "no-annotation", |
| | |         "other" |
| | |     ] |
| | | } |
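
Two of the commits above describe the intended checks: "disallow empty lists" and "show all invalid values rather than first one". A minimal sketch of how a tag list might be checked against a resource like the one above follows. It is an illustration under assumptions, not the PR's validator: `validate_tags` and `KNOWN_CREATORS` are hypothetical names, and the allowed values are copied inline from the JSON file instead of being loaded from the package.

```python
from typing import Dict, List

# Allowed values copied from the JSON resource shown above.
KNOWN_CREATORS: Dict[str, List[str]] = {
    "language": ["found", "crowdsourced", "expert-generated", "machine-generated", "other"],
    "annotations": [
        "found",
        "crowdsourced",
        "expert-generated",
        "machine-generated",
        "no-annotation",
        "other",
    ],
}


def validate_tags(field: str, values: List[str], allowed: List[str]) -> None:
    """Raise ValueError if `values` is empty or contains tags outside `allowed`.

    Hypothetical helper: it reports every invalid value at once, not just the first.
    """
    if not values:
        raise ValueError(f"'{field}' must not be an empty list")
    invalid = [value for value in values if value not in allowed]
    if invalid:
        raise ValueError(f"'{field}' has invalid values {invalid}; allowed values are {allowed}")
```

For example, `validate_tags("language", ["found", "bogus"], KNOWN_CREATORS["language"])` raises a `ValueError` that names `bogus`, while a list made only of known tags passes silently.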