-
Notifications
You must be signed in to change notification settings - Fork 3k
Add Validation For README #2121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
38 commits
Select commit
Hold shift + click to select a range
993acd6
Add Initial README parser
gchhablani c2f0699
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani 014c49d
Add basic validation checks
gchhablani 2040602
Minor fix
gchhablani 79c2ad0
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani 7a1654b
Changes from review
gchhablani c6d2345
Merge branch 'master' of https://github.com/huggingface/datasets into…
gchhablani 99d2222
Make main into a function in readme_parser
gchhablani 51c08ae
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani 4e54669
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani 1d788a9
Move README validator to scripts
gchhablani 2d13f70
Arrange README validation files
gchhablani a1c1f67
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani ee31e15
Update readme validator class
gchhablani ae60ce5
Add from_string tests
gchhablani 362e464
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani 057d0d9
Add PyTest tests
gchhablani cd4b69e
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani 35e08d8
Add tests for from_readme
gchhablani a3de91a
Add ReadMe validator script
gchhablani 8dd3feb
Fix style
gchhablani 87b0668
Remove print statement
gchhablani 1d49a4d
Add validator to CircleCI
gchhablani d9f0ac3
Fix style
gchhablani 414fc2e
Add YAML files to setup resources
gchhablani 0c3425a
Make validator executable
gchhablani 933fdf7
Add no subsections test
gchhablani cd895a1
Add incorrect YAML test
gchhablani a3bdb1f
Fix style
gchhablani 6e85d4a
Fix tests
gchhablani 10386e7
Fix tests
gchhablani b4ca9ca
Fix style
gchhablani a69c019
Fix escape character issue
gchhablani f12a105
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani d45ec9b
Add three-level heading validation limit
gchhablani 309a69e
Merge remote-tracking branch 'upstream/master' into add-readme-parser
gchhablani cdcffe0
Add either text or subsection option
gchhablani ffdfcb6
Fix style
gchhablani File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| #!/usr/bin/env python | ||
|
|
||
| """ This script will run in CI and make sure all new changes to datasets readme files have valid readme content.""" | ||
|
|
||
| from pathlib import Path | ||
| from subprocess import check_output | ||
| from typing import List | ||
|
|
||
| from datasets.utils.readme import ReadMe | ||
|
|
||
|
|
||
| def get_changed_files(repo_path: Path) -> List[Path]: | ||
| diff_output = check_output(["git", "diff", "--name-only", "HEAD..origin/master"], cwd=repo_path) | ||
| changed_files = [Path(repo_path, f) for f in diff_output.decode().splitlines()] | ||
| return changed_files | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| import logging | ||
| from argparse import ArgumentParser | ||
|
|
||
| logging.basicConfig(level=logging.DEBUG) | ||
|
|
||
| ap = ArgumentParser() | ||
| ap.add_argument("--repo_path", type=Path, default=Path.cwd()) | ||
| ap.add_argument("--check_all", action="store_true") | ||
| args = ap.parse_args() | ||
|
|
||
| repo_path: Path = args.repo_path | ||
| if args.check_all: | ||
| readmes = [dd / "README.md" for dd in (repo_path / "datasets").iterdir()] | ||
| else: | ||
| changed_files = get_changed_files(repo_path) | ||
| readmes = [ | ||
| f | ||
| for f in changed_files | ||
| if f.exists() and f.name.lower() == "readme.md" and f.parent.parent.name == "datasets" | ||
| ] | ||
|
|
||
| failed: List[Path] = [] | ||
| for readme in sorted(readmes): | ||
| try: | ||
| ReadMe.from_readme(readme) | ||
| logging.debug(f"✅️ Validated '{readme.relative_to(repo_path)}'") | ||
| except ValueError as e: | ||
| failed.append(readme) | ||
| logging.warning(f"❌ Validation failed for '{readme.relative_to(repo_path)}':\n{e}") | ||
| except Exception as e: | ||
| failed.append(readme) | ||
| logging.warning(f"⁉️ Something unexpected happened on '{readme.relative_to(repo_path)}':\n{e}") | ||
|
|
||
| if len(failed) > 0: | ||
| logging.info(f"❌ Failed on {len(failed)} files.") | ||
| exit(1) | ||
| else: | ||
| logging.info("All is well, keep up the good work 🤗!") | ||
| exit(0) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,279 @@ | ||
| import logging | ||
| from dataclasses import dataclass | ||
| from pathlib import Path | ||
| from typing import Any, List, Tuple | ||
|
|
||
| import yaml | ||
|
|
||
|
|
||
| # loading package files: https://stackoverflow.com/a/20885799 | ||
| try: | ||
| import importlib.resources as pkg_resources | ||
| except ImportError: | ||
| # Try backported to PY<37 `importlib_resources`. | ||
| import importlib_resources as pkg_resources | ||
|
|
||
| from . import resources | ||
|
|
||
|
|
||
| BASE_REF_URL = "https://github.com/huggingface/datasets/tree/master/src/datasets/utils" | ||
| this_url = f"{BASE_REF_URL}/{__file__}" | ||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def load_yaml_resource(resource: str) -> Tuple[Any, str]: | ||
| content = pkg_resources.read_text(resources, resource) | ||
| return yaml.safe_load(content), f"{BASE_REF_URL}/resources/{resource}" | ||
|
|
||
|
|
||
| readme_structure, known_readme_structure_url = load_yaml_resource("readme_structure.yaml") | ||
|
|
||
| FILLER_TEXT = [ | ||
| "[Needs More Information]", | ||
| "[More Information Needed]", | ||
| "(https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)", | ||
| ] | ||
|
|
||
| # Dictionary representation of section/readme, error_list, warning_list | ||
| ReadmeValidatorOutput = Tuple[dict, List[str], List[str]] | ||
|
|
||
|
|
||
| @dataclass | ||
| class Section: | ||
| name: str | ||
| level: str | ||
| lines: List[str] = None | ||
|
|
||
| def __post_init__(self): | ||
| self.text = "" | ||
| self.is_empty_text = True | ||
| self.content = {} | ||
| self.parsing_error_list = [] | ||
| self.parsing_warning_list = [] | ||
| if self.lines is not None: | ||
| self.parse() | ||
|
|
||
| def parse(self): | ||
| current_sub_level = "" | ||
| current_lines = [] | ||
| code_start = False | ||
| for line in self.lines: | ||
| if line.strip(" \n") == "": | ||
| continue | ||
| elif line.strip(" \n")[:3] == "```": | ||
| code_start = not code_start | ||
| elif line.split()[0] == self.level + "#" and not code_start: | ||
| if current_sub_level != "": | ||
| self.content[current_sub_level] = Section(current_sub_level, self.level + "#", current_lines) | ||
| current_lines = [] | ||
| else: | ||
| if current_lines != []: | ||
| self.text += "".join(current_lines).strip() | ||
| if self.text != "" and self.text not in FILLER_TEXT: | ||
| self.is_empty_text = False | ||
| current_lines = [] | ||
|
|
||
| current_sub_level = " ".join(line.split()[1:]).strip(" \n") | ||
| else: | ||
| current_lines.append(line) | ||
| else: | ||
| if current_sub_level != "": | ||
| if current_sub_level in self.content: | ||
| self.parsing_error_list.append( | ||
| f"Multiple sections with the same heading `{current_sub_level}` have been found. Please keep only one of these sections." | ||
| ) | ||
| self.content[current_sub_level] = Section(current_sub_level, self.level + "#", current_lines) | ||
| else: | ||
| if current_lines != []: | ||
| self.text += "".join(current_lines).strip() | ||
| if self.text != "" and self.text not in FILLER_TEXT: | ||
| self.is_empty_text = False | ||
|
|
||
| def validate(self, structure: dict) -> ReadmeValidatorOutput: | ||
| """Validates a Section class object recursively using the structure provided as a dictionary. | ||
|
|
||
| Args: | ||
| structute (:obj: `dict`): The dictionary representing expected structure. | ||
|
|
||
| Returns: | ||
| :obj: `ReadmeValidatorOutput`: The dictionary representation of the section, and the errors. | ||
| """ | ||
| # Header text validation | ||
| error_list = [] | ||
| warning_list = [] | ||
| if structure["allow_empty"] is False: | ||
| # If content is expected | ||
| if self.is_empty_text and self.content == {}: | ||
| # If no content is found, mention it in the error_list | ||
| error_list.append(f"Expected some content in section `{self.name}` but it is empty.") | ||
|
|
||
| if structure["allow_empty_text"] is False: | ||
| # If some text is expected | ||
| if self.is_empty_text: | ||
| # If no text is found, mention it in the error_list | ||
| error_list.append( | ||
| f"Expected some text in section `{self.name}` but it is empty (text in subsections are ignored)." | ||
| ) | ||
| # Subsections Validation | ||
| if structure["subsections"] is not None: | ||
| # If subsections are expected | ||
| if self.content == {}: | ||
| # If no subsections are present | ||
| values = [subsection["name"] for subsection in structure["subsections"]] | ||
| # Mention the expected values in the error_list | ||
| error_list.append( | ||
| f"Section `{self.name}` expected the following subsections: {', '.join(['`'+x+'`' for x in values])}. Found 'None'." | ||
| ) | ||
| else: | ||
| # If some subsections are present | ||
| structure_names = [subsection["name"] for subsection in structure["subsections"]] | ||
| for idx, name in enumerate(structure_names): | ||
| if name not in self.content: | ||
| # If the expected subsection is not present | ||
| error_list.append(f"Section `{self.name}` is missing subsection: `{name}`.") | ||
| else: | ||
| # If the subsection is present, validate subsection, return the result | ||
| # and concat the errors from subsection to section error_list | ||
|
|
||
| # Skip sublevel validation if current level is `###` | ||
| if self.level == "###": | ||
| continue | ||
| else: | ||
| _, subsec_error_list, subsec_warning_list = self.content[name].validate( | ||
| structure["subsections"][idx] | ||
| ) | ||
| error_list += subsec_error_list | ||
| warning_list += subsec_warning_list | ||
|
|
||
| for name in self.content: | ||
| if name not in structure_names: | ||
| # If an extra subsection is present | ||
| warning_list.append( | ||
| f"`{self.name}` has an extra subsection: `{name}`. Skipping further validation checks for this subsection as expected structure is unknown." | ||
| ) | ||
| error_list = self.parsing_error_list + error_list | ||
| warning_list = self.parsing_warning_list + warning_list | ||
| if error_list: | ||
| # If there are errors, do not return the dictionary as it is invalid | ||
| return {}, error_list, warning_list | ||
| else: | ||
| return self.to_dict(), error_list, warning_list | ||
|
|
||
| def to_dict(self) -> dict: | ||
| """Returns the dictionary representation of a section.""" | ||
| return { | ||
| "name": self.name, | ||
| "text": self.text, | ||
| "is_empty_text": self.is_empty_text, | ||
| "subsections": [value.to_dict() for value in self.content.values()], | ||
| } | ||
|
|
||
|
|
||
| class ReadMe(Section): # Level 0 | ||
| def __init__(self, name: str, lines: List[str], structure: dict = None): | ||
| super().__init__(name=name, level="") # Not using lines here as we need to use a child class parse | ||
| self.structure = structure | ||
| self.yaml_tags_line_count = -2 | ||
| self.tag_count = 0 | ||
| self.lines = lines | ||
| if self.lines is not None: | ||
| self.parse() | ||
|
|
||
| # Validation | ||
| if self.structure is None: | ||
| content, error_list, warning_list = self.validate(readme_structure) | ||
| else: | ||
| content, error_list, warning_list = self.validate(self.structure) | ||
|
|
||
| error_list = self.parsing_error_list + error_list | ||
| warning_list = self.parsing_warning_list + warning_list | ||
| if error_list != [] or warning_list != []: | ||
| errors = "\n".join(list(map(lambda x: "-\t" + x, error_list + warning_list))) | ||
| error_string = f"The following issues were found for the README at `{self.name}`:\n" + errors | ||
| raise ValueError(error_string) | ||
|
|
||
| @classmethod | ||
| def from_readme(cls, path: Path, structure: dict = None): | ||
| with open(path) as f: | ||
| lines = f.readlines() | ||
| return cls(path, lines, structure) | ||
|
|
||
| @classmethod | ||
| def from_string(cls, string: str, structure: dict = None, root_name: str = "root"): | ||
| lines = string.split("\n") | ||
| return cls(root_name, lines, structure) | ||
|
|
||
| def parse(self): | ||
| # Skip Tags | ||
| line_count = 0 | ||
|
|
||
| for line in self.lines: | ||
| self.yaml_tags_line_count += 1 | ||
| if line.strip(" \n") == "---": | ||
| self.tag_count += 1 | ||
| if self.tag_count == 2: | ||
| break | ||
| line_count += 1 | ||
| if self.tag_count == 2: | ||
| self.lines = self.lines[line_count + 1 :] # Get the last + 1 th item. | ||
| else: | ||
| self.lines = self.lines[self.tag_count :] | ||
| super().parse() | ||
|
|
||
| def __str__(self): | ||
| """Returns the string of dictionary representation of the ReadMe.""" | ||
| return str(self.to_dict()) | ||
|
|
||
| def validate(self, readme_structure): | ||
| error_list = [] | ||
| warning_list = [] | ||
| if self.yaml_tags_line_count == 0: | ||
| warning_list.append("Empty YAML markers are present in the README.") | ||
| elif self.tag_count == 0: | ||
| warning_list.append("No YAML markers are present in the README.") | ||
| elif self.tag_count == 1: | ||
| warning_list.append("Only the start of YAML tags present in the README.") | ||
| # Check how many first level sections are present. | ||
| num_first_level_keys = len(self.content.keys()) | ||
| if num_first_level_keys > 1: | ||
| # If more than one, add to the error list, continue | ||
| error_list.append( | ||
| f"The README has several first-level headings: {', '.join(['`'+x+'`' for x in list(self.content.keys())])}. Only one heading is expected. Skipping further validation for this README." | ||
| ) | ||
| elif num_first_level_keys < 1: | ||
| # If less than one, append error. | ||
| error_list.append( | ||
| f"The README has no first-level headings. One heading is expected. Skipping further validation for this README." | ||
| ) | ||
|
|
||
| else: | ||
| # If one exactly | ||
| start_key = list(self.content.keys())[0] # Get the key | ||
| if start_key.startswith("Dataset Card for"): # Check correct start | ||
|
|
||
| # If the starting is correct, validate all the sections | ||
| _, sec_error_list, sec_warning_list = self.content[start_key].validate( | ||
| readme_structure["subsections"][0] | ||
| ) | ||
| error_list += sec_error_list | ||
| warning_list += sec_warning_list | ||
| else: | ||
| # If not found, append error | ||
| error_list.append( | ||
| f"No first-level heading starting with `Dataset Card for` found in README. Skipping further validation for this README." | ||
| ) | ||
| if error_list: | ||
| # If there are errors, do not return the dictionary as it is invalid | ||
| return {}, error_list, warning_list | ||
| else: | ||
| return self.to_dict(), error_list, warning_list | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| from argparse import ArgumentParser | ||
|
|
||
| ap = ArgumentParser(usage="Validate the content (excluding YAML tags) of a README.md file.") | ||
| ap.add_argument("readme_filepath") | ||
| args = ap.parse_args() | ||
| readme_filepath = Path(args.readme_filepath) | ||
| readme = ReadMe.from_readme(readme_filepath) | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future we may have subclasses of this to have more finegrained validation per section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this class can be extended and we can keep a section-to-class mapping in the future. For now, this should be fine, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's fine for now