Add Validation For README #2121
Conversation
Good start! Here are some proposed next steps:
I have added basic validation checking in the class. It works based on a YAML string: the YAML string determines the expected structure and which text is to be checked. Please let me know your thoughts. I haven't added a variable that keeps track of whether the text is empty or not, but it can be done easily if required.
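For illustration, a hypothetical sketch of what such a structure string could look like; apart from `allow_empty`, which appears in the diffs below, every key here is an assumption rather than the PR's actual schema:

```python
# Hypothetical structure definition parsed from a YAML string.
import yaml

EXAMPLE_STRUCTURE = yaml.safe_load(
    """
    name: ""
    allow_empty: true
    subsections:
      - name: Dataset Card for X
        allow_empty: true
        subsections:
          - name: Table of Contents
            allow_empty: false
          - name: Dataset Description
            allow_empty: false
    """
)
```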
This looks like a good start! Do you think you can have a way to collect all the validation failures of a README and then raise an error showing all the failures instead of using print? Then we can create a test for this.
Hi @lhoestq, I have added changes accordingly. I prepared a list that stores all the errors and raises them at the end. I'm not sure if there is a better way.
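A minimal sketch of that collect-then-raise pattern; `section.is_empty`, `section.name`, and `structure["allow_empty"]` are names that appear in the diffs below, while `validate` itself is a hypothetical helper:

```python
# Sketch: gather every failure, then raise a single ValueError at the end.
def validate(sections, structure):
    error_list = []
    for section in sections:
        if not structure["allow_empty"] and section.is_empty:
            error_list.append(f"Expected some text in section `{section.name}`.")
    if error_list:
        raise ValueError("\n".join(error_list))
```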
lhoestq left a comment
Nice! Now I'm curious to see the results if we run this on all the dataset cards ^^'
src/datasets/utils/readme_parser.py
Outdated
```python
error_list = []
if structure["allow_empty"] == False:
    if section.is_empty:
        print(section.text)
```
```python
print(section.text)
```
Please find the output for the existing READMEs here: http://p.ip.fi/2vYU
Thanks,
lhoestq left a comment
Very cool, thanks!
Feel free to add a few docstrings and type hints. I also left a few comments:
```python
]


class Section:
```
In the future we may have subclasses of this to have more fine-grained validation per section.
I think this class can be extended and we can keep a section-to-class mapping in the future. For now, this should be fine, right?
Yes, it's fine for now.
src/datasets/utils/readme.py
Outdated
```python
with open(resource) as f:
    content = yaml.safe_load(f)
```
You may need to use pkg_resources here to load the YAML data.
See an example here: datasets/src/datasets/utils/metadata.py, lines 25 to 27 in 8e903b5:

```python
def load_json_resource(resource: str) -> Tuple[Any, str]:
    content = pkg_resources.read_text(resources, resource)
    return json.loads(content), f"{BASE_REF_URL}/resources/{resource}"
```
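For context, package-resource loading is needed because the YAML file ships inside the installed package (possibly as a zip or wheel), where a path relative to the working directory would not resolve. A minimal sketch of the analogous YAML loader, assuming a hypothetical `datasets.utils.resources` package and file name:

```python
# Sketch: read a YAML resource from the installed package, not the filesystem.
from importlib import resources  # stdlib counterpart of pkg_resources.read_text
import yaml


def load_yaml_resource(resource: str) -> dict:
    content = resources.read_text("datasets.utils.resources", resource)
    return yaml.safe_load(content)


# structure = load_yaml_resource("readme_structure.yaml")  # hypothetical file name
```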
Okay, I'll use pkg_resources, but can you please explain why it is needed?
src/datasets/utils/readme.py
Outdated
```python
    return error_list


def validate_readme(file_path):
```
Could you write a few tests for this function? That would be appreciated.
Yes, I will add the tests.
Hi @lhoestq, I have added some basic tests and have also restructured the code slightly. There is one print statement currently, and I'm not sure how to remove it. Basically, I want to warn but not stop further validation. I can't append to a list because the print happens during parsing, while the error list only exists during validation. For example:

```md
---
---
# Dataset Card for FashionMNIST
## Dataset Description
## Dataset Description
```

In this case, I check for validation only in the latest entry. I can also raise an error (ideal case scenario), but still, it is in the parsing step. In tests, I'm using a dummy YAML string for the structure; we can also make it into a file, but I feel that is not a hard requirement. Let me know your thoughts. I will also add more tests. However, I would love to be able to check the exact message in the test when an error is raised. I checked a couple of methods but couldn't get it working. Let me know if you're aware of a way to do that.
lhoestq left a comment
Thanks!
src/datasets/utils/readme.py
Outdated
```python
print(
    f"Multiple sections with the same heading '{current_sub_level}' have been found. Using the latest one found."
)
```
Maybe you could also have `self.parsing_error_list` and `self.parsing_warning_list`?
This way, in `validate` you could get the errors and warnings with `section.parsing_error_list` and `section.parsing_warning_list`.
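A rough sketch of that suggestion, with hypothetical `add_subsection` and `validate` methods (the PR's real signatures may differ):

```python
# Sketch: record parsing issues on the instance instead of printing them,
# so validate() can fold them into its error/warning report.
class Section:
    def __init__(self, name: str):
        self.name = name
        self.subsections = {}  # heading -> Section
        self.parsing_error_list = []  # fatal issues found while parsing
        self.parsing_warning_list = []  # e.g. duplicate headings

    def add_subsection(self, heading: str, subsection: "Section"):
        if heading in self.subsections:
            # Previously a print(); now kept for validation to report.
            self.parsing_warning_list.append(
                f"Multiple sections with the same heading '{heading}' have been found. "
                "Using the latest one found."
            )
        self.subsections[heading] = subsection

    def validate(self, structure: dict):
        # Start from whatever parsing already collected.
        error_list = list(self.parsing_error_list)
        warning_list = list(self.parsing_warning_list)
        # ... structural checks against `structure` would append to error_list ...
        return error_list, warning_list
```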
Should I also add `self.validate_error_list` and `self.validate_warning_list`?
Currently I am raising both warnings and errors together. Should I handle them separately?
As you want.
The advantage of keeping the parsing errors and warnings in attributes is that you can access them from the validate methods.
tests/test_readme_util.py
Outdated
```python
class TestReadMeUtils(unittest.TestCase):
    def test_from_string(self):
        ReadMe.from_string(README_CORRECT, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_EMPTY_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_INCORRECT_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_NO_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_TEXT, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_SUBSECTION, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MULTIPLE_WRONG_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_WRONG_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_EMPTY, EXPECTED_STRUCTURE)
```
Here you could use pytest to check for the error messages.
You can find some documentation here:
https://docs.pytest.org/en/stable/assert.html#assertions-about-expected-exceptions
Note that pytest doesn't use the `unittest.TestCase` class. Instead, you have to define a test function.
For example:

```python
def test_from_string():
    ReadMe.from_string(README_CORRECT, EXPECTED_STRUCTURE)
    with pytest.raises(ValueError) as excinfo:
        ReadMe.from_string(README_EMPTY_YAML, EXPECTED_STRUCTURE)
    assert "empty" in str(excinfo.value)
```

Does that sound good to you?
Also, you can use `@pytest.mark.parametrize(...)` to run your test functions on all the dummy YAML you defined, if it sounds more convenient for you.
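A hedged sketch of the parametrized version, reusing the README constants from the test above; the `match` fragments are illustrative placeholders, not the PR's real error messages:

```python
# Sketch: one parametrized test instead of repeated assertRaises blocks.
# README_* and EXPECTED_STRUCTURE are the constants from the test file above.
import pytest

from datasets.utils.readme import ReadMe


@pytest.mark.parametrize(
    "readme_md, expected_error",
    [
        (README_EMPTY_YAML, "empty"),  # placeholder message fragment
        (README_NO_YAML, "YAML"),  # placeholder message fragment
        (README_MISSING_TEXT, "text"),  # placeholder message fragment
    ],
)
def test_from_string_errors(readme_md, expected_error):
    with pytest.raises(ValueError, match=expected_error):
        ReadMe.from_string(readme_md, EXPECTED_STRUCTURE)
```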
Oh, I thought I was restricted to unittest. Cool, I'll write pytest test cases and also check the error messages. I assume that is better?
That would be ideal, thanks!
tests/test_readme_util.py
Outdated
```python
expected_error = expected_error.format(path=path).encode("unicode_escape").decode("ascii")
with pytest.raises(ValueError, match=expected_error):
    ReadMe.from_readme(path, example_yaml_structure)
```
`match` is supposed to be a regex; however, you are passing a path that may be a Windows path.
Instead of escaping the backslashes from Windows, you can just escape the full string so that it will be treated as a simple literal:

```diff
-expected_error = expected_error.format(path=path).encode("unicode_escape").decode("ascii")
-with pytest.raises(ValueError, match=expected_error):
-    ReadMe.from_readme(path, example_yaml_structure)
+expected_error = expected_error.format(path=path)
+with pytest.raises(ValueError, match=re.escape(expected_error)):
+    ReadMe.from_readme(path, example_yaml_structure)
```
src/datasets/utils/readme.py
Outdated
```python
if self.is_empty:
    # If no header text is found, mention it in the error_list
    error_list.append(f"Expected some header text for section `{self.name}`.")
```
Maybe have a more explicit message like: "Expected some text in section `{self.name}` but it is empty (text in subsections is ignored)."
Thanks! You really did an amazing job on this one :)
As discussed offline, the next step is to integrate this into the pytest suite, and allow running the validation of all READMEs with a RUN_SLOW=1 parameter (i.e. mark the full test with the slow decorator).
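A rough sketch of that follow-up, assuming the dataset cards live under `./datasets/*/README.md` and defining an inline `slow` marker; the repo's own `slow` decorator may differ from this one:

```python
# Sketch: validate every dataset card, but only when RUN_SLOW=1 is set.
import os
from pathlib import Path

import pytest

from datasets.utils.readme import ReadMe

slow = pytest.mark.skipif(
    os.environ.get("RUN_SLOW", "0") != "1",
    reason="set RUN_SLOW=1 to validate all dataset cards",
)


@slow
@pytest.mark.parametrize("path", sorted(Path("./datasets").glob("*/README.md")))
def test_dataset_card_is_valid(path):
    # from_readme(path, structure) matches the signature used in the tests above.
    ReadMe.from_readme(str(path), EXPECTED_STRUCTURE)
```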
Hi @lhoestq, @yjernite
This is a simple README parser. All classes specific to different sections can inherit the `Section` class, and we can define more attributes in each. Let me know if this is going in the right direction :)
Currently the output looks like this for `to_dict()` on the FashionMNIST `README.md`:

```json
{ "name": "./datasets/fashion_mnist/README.md", "attributes": "", "subsections": [ { "name": "Dataset Card for FashionMNIST", "attributes": "", "subsections": [ { "name": "Table of Contents", "attributes": "- [Dataset Description](#dataset-description)\n  - [Dataset Summary](#dataset-summary)\n  - [Supported Tasks](#supported-tasks-and-leaderboards)\n  - [Languages](#languages)\n- [Dataset Structure](#dataset-structure)\n  - [Data Instances](#data-instances)\n  - [Data Fields](#data-instances)\n  - [Data Splits](#data-instances)\n- [Dataset Creation](#dataset-creation)\n  - [Curation Rationale](#curation-rationale)\n  - [Source Data](#source-data)\n  - [Annotations](#annotations)\n  - [Personal and Sensitive Information](#personal-and-sensitive-information)\n- [Considerations for Using the Data](#considerations-for-using-the-data)\n  - [Social Impact of Dataset](#social-impact-of-dataset)\n  - [Discussion of Biases](#discussion-of-biases)\n  - [Other Known Limitations](#other-known-limitations)\n- [Additional Information](#additional-information)\n  - [Dataset Curators](#dataset-curators)\n  - [Licensing Information](#licensing-information)\n  - [Citation Information](#citation-information)\n  - [Contributions](#contributions)", "subsections": [] }, { "name": "Dataset Description", "attributes": "- **Homepage:** [GitHub](https://github.com/zalandoresearch/fashion-mnist)\n- **Repository:** [GitHub](https://github.com/zalandoresearch/fashion-mnist)\n- **Paper:** [arXiv](https://arxiv.org/pdf/1708.07747.pdf)\n- **Leaderboard:**\n- **Point of Contact:**", "subsections": [ { "name": "Dataset Summary", "attributes": "Fashion-MNIST is a dataset of Zalando's article images\u2014consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.", "subsections": [] }, { "name": "Supported Tasks and Leaderboards", "attributes": "[More Information Needed]", "subsections": [] }, { "name": "Languages", "attributes": "[More Information Needed]", "subsections": [] } ] }, { "name": "Dataset Structure", "attributes": "", "subsections": [ { "name": "Data Instances", "attributes": "A data point comprises an image and its label.", "subsections": [] }, { "name": "Data Fields", "attributes": "- `image`: a 2d array of integers representing the 28x28 image.\n- `label`: an integer between 0 and 9 representing the classes with the following mapping:\n | Label | Description |\n | --- | --- |\n | 0 | T-shirt/top |\n | 1 | Trouser |\n | 2 | Pullover |\n | 3 | Dress |\n | 4 | Coat |\n | 5 | Sandal |\n | 6 | Shirt |\n | 7 | Sneaker |\n | 8 | Bag |\n | 9 | Ankle boot |", "subsections": [] }, { "name": "Data Splits", "attributes": "The data is split into training and test set. The training set contains 60,000 images and the test set 10,000 images.", "subsections": [] } ] }, { "name": "Dataset Creation", "attributes": "", "subsections": [ { "name": "Curation Rationale", "attributes": "**From the arXiv paper:**\nThe original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. \"If it doesn't work on MNIST, it won't work at all\", they said. \"Well, if it does work on MNIST, it may still fail on others.\"\nHere are some good reasons:\n- MNIST is too easy. Convolutional nets can achieve 99.7% on MNIST. Classic machine learning algorithms can also achieve 97% easily. Check out our side-by-side benchmark for Fashion-MNIST vs. MNIST, and read \"Most pairs of MNIST digits can be distinguished pretty well by just one pixel.\"\n- MNIST is overused. In this April 2017 Twitter thread, Google Brain research scientist and deep learning expert Ian Goodfellow calls for people to move away from MNIST.\n- MNIST can not represent modern CV tasks, as noted in this April 2017 Twitter thread, deep learning expert/Keras author Fran\u00e7ois Chollet.", "subsections": [] }, { "name": "Source Data", "attributes": "", "subsections": [ { "name": "Initial Data Collection and Normalization", "attributes": "**From the arXiv paper:**\nFashion-MNIST is based on the assortment on Zalando\u2019s website. Every fashion product on Zalando has a set of pictures shot by professional photographers, demonstrating different aspects of the product, i.e. front and back looks, details, looks with model and in an outfit. The original picture has a light-gray background (hexadecimal color: #fdfdfd) and stored in 762 \u00d7 1000 JPEG format. For efficiently serving different frontend components, the original picture is resampled with multiple resolutions, e.g. large, medium, small, thumbnail and tiny.\nWe use the front look thumbnail images of 70,000 unique products to build Fashion-MNIST. Those products come from different gender groups: men, women, kids and neutral. In particular, whitecolor products are not included in the dataset as they have low contrast to the background. The thumbnails (51 \u00d7 73) are then fed into the following conversion pipeline:\n1. Converting the input to a PNG image.\n2. Trimming any edges that are close to the color of the corner pixels. The \u201ccloseness\u201d is defined by the distance within 5% of the maximum possible intensity in RGB space.\n3. Resizing the longest edge of the image to 28 by subsampling the pixels, i.e. some rows and columns are skipped over.\n4. Sharpening pixels using a Gaussian operator of the radius and standard deviation of 1.0, with increasing effect near outlines.\n5. Extending the shortest edge to 28 and put the image to the center of the canvas.\n6. Negating the intensities of the image.\n7. Converting the image to 8-bit grayscale pixels.", "subsections": [] }, { "name": "Who are the source image producers?", "attributes": "**From the arXiv paper:**\nEvery fashion product on Zalando has a set of pictures shot by professional photographers, demonstrating different aspects of the product, i.e. front and back looks, details, looks with model and in an outfit.", "subsections": [] } ] }, { "name": "Annotations", "attributes": "", "subsections": [ { "name": "Annotation process", "attributes": "**From the arXiv paper:**\nFor the class labels, they use the silhouette code of the product. The silhouette code is manually labeled by the in-house fashion experts and reviewed by a separate team at Zalando. Zalando is the Europe\u2019s largest online fashion platform. Each product contains only one silhouette code.", "subsections": [] }, { "name": "Who are the annotators?", "attributes": "**From the arXiv paper:**\nThe silhouette code is manually labeled by the in-house fashion experts and reviewed by a separate team at Zalando.", "subsections": [] } ] }, { "name": "Personal and Sensitive Information", "attributes": "[More Information Needed]", "subsections": [] } ] }, { "name": "Considerations for Using the Data", "attributes": "", "subsections": [ { "name": "Social Impact of Dataset", "attributes": "[More Information Needed]", "subsections": [] }, { "name": "Discussion of Biases", "attributes": "[More Information Needed]", "subsections": [] }, { "name": "Other Known Limitations", "attributes": "[More Information Needed]", "subsections": [] } ] }, { "name": "Additional Information", "attributes": "", "subsections": [ { "name": "Dataset Curators", "attributes": "Han Xiao and Kashif Rasul and Roland Vollgraf", "subsections": [] }, { "name": "Licensing Information", "attributes": "MIT Licence", "subsections": [] }, { "name": "Citation Information", "attributes": "@article{DBLP:journals/corr/abs-1708-07747,\n  author    = {Han Xiao and\n               Kashif Rasul and\n               Roland Vollgraf},\n  title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning\n               Algorithms},\n  journal   = {CoRR},\n  volume    = {abs/1708.07747},\n  year      = {2017},\n  url       = {http://arxiv.org/abs/1708.07747},\n  archivePrefix = {arXiv},\n  eprint    = {1708.07747},\n  timestamp = {Mon, 13 Aug 2018 16:47:27 +0200},\n  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1708-07747},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}", "subsections": [] }, { "name": "Contributions", "attributes": "Thanks to [@gchhablani](https://github.com/gchablani) for adding this dataset.", "subsections": [] } ] } ] } ] }
```

Thanks,
Gunjan