-
Notifications
You must be signed in to change notification settings - Fork 513
feat: Add html_to_text helper function
#792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add tests.
|
The That is, for HTML of this format we will get rather strange results <!DOCTYPE html>
<html>
<head>
<title>List with different elements</title>
</head>
<body>
<ul>
<li>This is a <a href="https://example.com">link</a> element</li>
<li>This is a text with <b>bold</b> word</li>
<li>This is an element<br>with line break</li>
</ul>
</body>
</html> |
Yes, I thought about it. But am not so sure about requirements for our function. I can try to follow more closely the JS implementation. |
I'm not quite sure either But just as a thought, since we always have You could try something like this: from lxml import html
text = html.fromstring(html_text).text_content().strip()and regex for clear spaces |
TODO: Fix last differences and add more tests according to JS implementation.
src/crawlee/_utils/html_to_text.py
Outdated
| # Compress white spaces outside of pre block | ||
| compr = re.sub(r'\s+', ' ', page_element.get_text()) | ||
| # If text is empty or ends with a whitespace, don't add the leading whitespace or new line | ||
| if (compr.startswith((' ', '\n'))) and re.search(r'(^|\s)$', text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JS version has only compr.startswith(' '), but I can't understand how it passed the test. In Python to pass exactly same test I had to do: compr.startswith((' ', '\n')
|
Tagging @Mantisus for review - I believe you have the most field experience with beautifulsoup. |
vdusek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just suprised that html_to_text is in private _utils module, not exposed publicly, and not being used anywhere. Is that intended 🙂 ?
src/crawlee/_utils/html_to_text.py
Outdated
| } | ||
|
|
||
|
|
||
| def html_to_text(source: str | BeautifulSoup) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the "entrypoint"? If it is, we need to expose it to the user in a better way. crawlee._utils is private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want to integrate it to some Crawlers? similar to e.g. enqueue links
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, was not really sure where or how is this going to be used :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expose it in BScrawler. It is so BS-dependent that you anyway have to install crawlee[beautifulsoup] to be able to use it.
After checking Parsel or Playwright I realized, that I am not sure I would be able to implement this tree-based processing with those two packages alone.
src/crawlee/_utils/html_to_text.py
Outdated
| else: | ||
| # Block elements must be surrounded by newlines(unless beginning of text) | ||
| is_block_tag = page_element.name.lower() in BLOCK_TAGS | ||
| if is_block_tag and not re.search(r'(^|\n)$', text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use regex with pre-compile re.compile? I think this would be useful if the function will be used frequently during the crawler process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Expose this function in BS crawler.
vdusek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this simply another helper method in BeautifulSoupCrawler or BeautifulSoupCrawlingContext, like the many others we already have? IMO this introduces a somewhat different "end-user interface" compared to what we've had so far.
I am not very sure where to put it. It is kind of general utility, that can be used by any Crawler, but due to it's implementation it has to be used only when BS is installed. I re-exported it from that BS-related init as that already contains all the BS based functionality. |
vdusek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding exposing this on the context level, I was thinking about something like this:
class BeautifulSoupCrawlingContext(ParsedHttpCrawlingContext[BeautifulSoup]):
...
def html_to_text(self) -> str:
"""Converts HTML content to plain text."""
return html_to_text(self.parsed_content)Doesn't that make sense to you? @janbuchar @Pijukatel
If |
|
btw. I tried to convert this funcionality to Parsel using LLMs and it seems fine: from __future__ import annotations
import re
from parsel import Selector
SKIP_TAGS = {'script', 'style', 'canvas', 'svg', 'noscript', 'title'}
BLOCK_TAGS = {
'p',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'ol',
'ul',
'li',
'pre',
'address',
'blockquote',
'dl',
'div',
'fieldset',
'form',
'table',
'tr',
'select',
'option',
}
_ANY_CONSECUTIVE_WHITE_SPACES = re.compile(r'\s+')
_EMPTY_OR_ENDS_WITH_NEW_LINE = re.compile(r'(^|\n)$')
def html_to_text(source: str | Selector) -> str:
"""Converts markup string to newline-separated plain text without tags using Parsel."""
selector = Selector(text=source) if isinstance(source, str) else source
text = ''
def _extract_text(elements: list[Selector]) -> None:
"""Custom parser for Parsel elements to simulate the behavior of the original function."""
nonlocal text
for element in elements:
tag = element.root.tag if hasattr(element.root, 'tag') else None
if tag in SKIP_TAGS:
continue
if tag == 'br':
text += '\n'
elif tag == 'td':
_extract_text(element.xpath('./node()'))
text += '\t'
else:
is_block_tag = tag in BLOCK_TAGS if tag else False
if is_block_tag and not re.search(_EMPTY_OR_ENDS_WITH_NEW_LINE, text):
text += '\n'
if tag:
_extract_text(element.xpath('./node()'))
else:
compr = re.sub(_ANY_CONSECUTIVE_WHITE_SPACES, ' ', element.root.strip())
text += compr
if is_block_tag and not text.endswith('\n'):
text += '\n'
# Start processing the root elements
_extract_text(selector.xpath('/*'))
return text.strip()Then using it from the Crawler: from __future__ import annotations
import asyncio
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext, html_to_text
async def main() -> None:
crawler = ParselCrawler()
# Define the default request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
context.log.info(f'Processing {context.request.url} ...')
# Extract data from the page.
data = {
'url': context.request.url,
'title': context.selector.css('title').get(),
'text': html_to_text(context.selector),
}
# Push the extracted data to the default dataset.
await context.push_data(data)
await crawler.run(['https://crawlee.dev/'])
if __name__ == '__main__':
asyncio.run(main())Output seems okay: parsel_html_to_text_output.txt |
I'd prefer to just keep one version of the function, but we can make it Parsel-based for all I care |
Add both implementations to respective contexts. Set same tests for both.
vdusek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! More comments only regarding the docs.
src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawling_context.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Vlada Dusek <[email protected]>
vdusek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
html_to_text helper function
Co-authored-by: Jan Buchar <[email protected]>
Description
Add simple html_to_text helper function. This generates new line separated plain text without tags.
Function is done to perform the same way as existing implementation in Javascript.
Add tests.
Issues