Open
This change improves the parser of recipes at rezeptwelt.de:
- detect ingredient groups
- support HTML layout for newer recipes, especially for instruction parsing
- add prep time
- add equipment entries
jayaddison
reviewed
Oct 17, 2024
| def site_name(self): | ||
| raise StaticValueException(return_value="Rezeptwelt") | ||
| return "Thermomix Rezeptwelt" |
Contributor
Suggested change
| return "Thermomix Rezeptwelt" | |
| raise StaticValueException(return_value="Thermomix Rezeptwelt") |
I admit this is a slightly unusual pattern; we use it so that the library's interface can indicate whether a value was retrieved from the source HTML or is a static/constant value returned by the code.
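To illustrate the idea, here is a minimal, self-contained sketch. The `StaticValueException` class below is a simplified stand-in written for this example, not the library's actual implementation, and `RezeptweltSketch` and `read_field` are hypothetical names.

```python
class StaticValueException(Exception):
    """Stand-in for illustration: signals a hard-coded (not parsed) value."""

    def __init__(self, return_value):
        super().__init__(f"static value: {return_value!r}")
        self.return_value = return_value


class RezeptweltSketch:
    """Hypothetical scraper with a single static field."""

    def site_name(self):
        # The value is constant, so raise instead of returning it directly.
        raise StaticValueException(return_value="Thermomix Rezeptwelt")


def read_field(scraper, field):
    """Hypothetical caller: return (value, is_static) for a scraper method."""
    try:
        return getattr(scraper, field)(), False
    except StaticValueException as exc:
        # The caller still gets the value, plus the fact that it is static.
        return exc.return_value, True


value, is_static = read_field(RezeptweltSketch(), "site_name")
```

A caller written this way receives the same string either way, but can also tell that it did not come from the page.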
jayaddison
reviewed
Oct 17, 2024
Comment on lines +106 to +111
| def prep_time(self): | ||
| tag = self.soup.find(itemprop="performTime", content=nonempty) | ||
| return get_minutes(tag['content']) if tag else None | ||
|
|
||
| def equipment(self): | ||
| return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=nonempty)] |
Contributor
BeautifulSoup (bs4 / self.soup) allows non-empty content filtering by passing a boolean True value, so I think we can simplify these methods slightly:
Suggested change
| def prep_time(self): | |
| tag = self.soup.find(itemprop="performTime", content=nonempty) | |
| return get_minutes(tag['content']) if tag else None | |
| def equipment(self): | |
| return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=nonempty)] | |
| def prep_time(self): | |
| tag = self.soup.find(itemprop="performTime", content=True) | |
| return get_minutes(tag['content']) if tag else None | |
| def equipment(self): | |
| return [tag['content'] for tag in self.soup.find_all("meta", itemprop="tool", content=True)] |
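For reference, a small standalone sketch of how `True` behaves as an attribute filter in bs4. The HTML snippet is made up for this example and is not taken from rezeptwelt.de. As far as I understand, `True` matches on the attribute being present, so it may be worth checking how an empty `content=""` was handled by the original `nonempty` helper before swapping it out.

```python
from bs4 import BeautifulSoup

# Invented sample markup; one "tool" entry deliberately lacks a content attribute.
html = """
<meta itemprop="performTime" content="PT30M">
<meta itemprop="tool" content="Thermomix TM6">
<meta itemprop="tool">
<meta itemprop="tool" content="Varoma">
"""

soup = BeautifulSoup(html, "html.parser")

# content=True only matches tags that carry a content attribute.
time_tag = soup.find(itemprop="performTime", content=True)
perform_time = time_tag["content"] if time_tag else None

# The <meta itemprop="tool"> without content is skipped.
tools = [
    tag["content"]
    for tag in soup.find_all("meta", itemprop="tool", content=True)
]
```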
jayaddison
reviewed
Oct 17, 2024
Comment on lines +31 to +35
| tag = self.soup.find("div", itemprop="author") | ||
| if tag: | ||
| return normalize_string(tag.get_text()) | ||
| tag = self.soup.find("span", {"id": "viewRecipeAuthor"}) | ||
| return normalize_string(tag.get_text()) |
Contributor
Some observations here:
- The retrieval from an itemprop="author" attribute is essentially schema.org metadata retrieval; we have an existing helper method that implements that, so let's re-use it here.
- The information contained in the viewRecipeAuthor element seems more specific than the schema metadata, which is sometimes generic. So let's prefer viewRecipeAuthor when it is present.
What this leads me to when adapting the code locally is:
Suggested change
| tag = self.soup.find("div", itemprop="author") | |
| if tag: | |
| return normalize_string(tag.get_text()) | |
| tag = self.soup.find("span", {"id": "viewRecipeAuthor"}) | |
| return normalize_string(tag.get_text()) | |
| name_from_schema = self.schema.author() | |
| name_from_hyperlink = None | |
| tag = self.soup.find("span", {"id": "viewRecipeAuthor"}) | |
| if tag: | |
| name_from_hyperlink = tag.get_text() | |
| return normalize_string(name_from_hyperlink or name_from_schema) |
Note: the word "von" in some of the test data seems redundant, so we can remove it (these changes affect the expected author values).
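The preference order in the suggestion can be sketched in isolation as follows. `pick_author` is a hypothetical helper, and the whitespace collapsing is only a rough stand-in for `normalize_string`, whose exact behaviour (including how it treats `None`) I have not checked here.

```python
def pick_author(name_from_hyperlink, name_from_schema):
    """Prefer the page-specific author name; fall back to schema metadata."""
    # `or` also falls through on an empty string, not only on None.
    chosen = name_from_hyperlink or name_from_schema
    if not chosen:
        return None
    # Rough stand-in for normalize_string: collapse runs of whitespace.
    return " ".join(chosen.split())
```

For example, `pick_author(None, "Rezeptwelt")` falls back to the schema value, while `pick_author("  Thermo Fan ", "Rezeptwelt")` keeps the more specific name.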