
Conversation

@Pijukatel Pijukatel (Collaborator) commented Dec 9, 2024

Description

Add a simple html_to_text helper function. It generates newline-separated plain text without tags.
The function is designed to behave the same way as the existing JavaScript implementation.
Add tests.
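
For illustration, the intended behavior on a tiny snippet (a hypothetical example, not taken from the PR's tests; the import is omitted since the function's public location is discussed below):

# Block-level tags are turned into newlines, inline tags are dropped,
# and runs of whitespace are compressed:
html_to_text('<div><p>Hello <b>world</b></p><p>Bye</p></div>')
# -> 'Hello world\nBye'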

Issues

Closes: A helper function for simple HTML tag removal
@Pijukatel Pijukatel added the labels 'enhancement' (New feature or request) and 't-tooling' (Issues with this label are in the ownership of the tooling team) on Dec 9, 2024
@github-actions bot added this to the '104th sprint - Tooling team' milestone on Dec 9, 2024
@github-actions bot added the label 'tested' (Temporary label used only programmatically for some analytics) on Dec 9, 2024
@Pijukatel Pijukatel changed the title from 'Add simple html_to_text helper function.' to 'Add simple html_to_text helper function' on Dec 9, 2024
@Pijukatel Pijukatel changed the title from 'Add simple html_to_text helper function' to 'feat: Add simple html_to_text helper function' on Dec 9, 2024
@Mantisus Mantisus (Collaborator) commented Dec 9, 2024

get_text in BeautifulSoup splits the text at every tag it encounters.

That is, for HTML like this we will get rather strange results:

<!DOCTYPE html>
<html>
<head>
   <title>List with different elements</title>
</head>
<body>
   <ul>
       <li>This is a <a href="https://example.com">link</a> element</li>
       <li>This is a text with <b>bold</b> word</li>
       <li>This is an element<br>with line break</li>
   </ul>
</body>
</html>
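
For a concrete illustration of the concern, a minimal sketch (assuming the helper joins text nodes with a newline separator):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, 'html.parser')  # html_source holds the document above
print(soup.get_text(separator='\n'))
# Every tag boundary becomes a separator, so the first list item comes out as:
# This is a
# link
#  element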

@Pijukatel Pijukatel (Collaborator, Author) replied, quoting @Mantisus's comment above:

Yes, I thought about it. But I am not so sure about the requirements for our function. I can try to follow the JS implementation more closely.

@Mantisus Mantisus (Collaborator) commented Dec 9, 2024

Yes, I thought about it. But I am not so sure about the requirements for our function. I can try to follow the JS implementation more closely.

I'm not quite sure either.

But just as a thought: since we always have lxml in our dependencies along with beautifulsoup, you could try something like this:

from lxml import html

text = html.fromstring(html_text).text_content().strip()

plus a regex to clean up the whitespace.
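
A minimal sketch of that suggestion (the exact whitespace regex is an assumption):

import re

from lxml import html


def lxml_html_to_text(html_text: str) -> str:
    # text_content() drops all tags but keeps the original whitespace,
    # so collapse consecutive whitespace runs afterwards.
    text = html.fromstring(html_text).text_content().strip()
    return re.sub(r'\s+', ' ', text)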

@Pijukatel Pijukatel marked this pull request as ready for review December 11, 2024 10:13
# Compress white spaces outside of pre block
compr = re.sub(r'\s+', ' ', page_element.get_text())
# If text is empty or ends with a whitespace, don't add the leading whitespace or new line
if (compr.startswith((' ', '\n'))) and re.search(r'(^|\s)$', text):
@Pijukatel (Collaborator, Author) commented on the lines above:

The JS version has only compr.startsWith(' '), but I can't understand how it passed the test. In Python, to pass exactly the same test, I had to do: compr.startswith((' ', '\n')).
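
For context: str.startswith accepts a tuple of prefixes and matches if any of them matches, which is what lets the Python variant catch both cases:

compr = '\nsome text'
compr.startswith(' ')           # False - misses the leading newline
compr.startswith((' ', '\n'))   # True  - matches either prefix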

@Pijukatel Pijukatel requested a review from janbuchar December 11, 2024 10:16
@janbuchar janbuchar requested a review from Mantisus December 11, 2024 10:50
@janbuchar janbuchar (Collaborator) commented:

Tagging @Mantisus for review - I believe you have the most field experience with beautifulsoup.

@vdusek vdusek (Collaborator) left a comment:

I'm just surprised that html_to_text lives in the private _utils module, is not exposed publicly, and is not used anywhere. Is that intended 🙂?

}


def html_to_text(source: str | BeautifulSoup) -> str:
A collaborator commented:

Is this the "entrypoint"? If it is, we need to expose it to the user in a better way. crawlee._utils is private.

A collaborator commented:

Don't we want to integrate it into some crawlers, similar to e.g. the enqueue_links helper?

@Pijukatel (Collaborator, Author) replied:

Yes, I was not really sure where or how this is going to be used :-)

@Pijukatel (Collaborator, Author) added:

I exposed it in the BS crawler. It is so BS-dependent that you have to install crawlee[beautifulsoup] to be able to use it anyway.
After checking Parsel and Playwright, I realized that I am not sure I could implement this tree-based processing with those two packages alone.

else:
    # Block elements must be surrounded by newlines (unless at the beginning of text)
    is_block_tag = page_element.name.lower() in BLOCK_TAGS
    if is_block_tag and not re.search(r'(^|\n)$', text):
A collaborator commented:

Is it possible to pre-compile the regexes with re.compile? I think this would be useful if the function is used frequently during the crawling process.
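
For illustration, the suggested pattern (the pattern names below match the ones the PR ends up using in the Parsel version further down; page_element and text come from the surrounding helper):

import re

# Compile once at module level, reuse on every call:
_ANY_CONSECUTIVE_WHITE_SPACES = re.compile(r'\s+')
_EMPTY_OR_ENDS_WITH_NEW_LINE = re.compile(r'(^|\n)$')

# Inside the helper, call the compiled patterns directly:
compr = _ANY_CONSECUTIVE_WHITE_SPACES.sub(' ', page_element.get_text())
if not _EMPTY_OR_ENDS_WITH_NEW_LINE.search(text):
    text += '\n'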

@Pijukatel (Collaborator, Author) replied:

Done

Expose this function in BS crawler.
@vdusek vdusek (Collaborator) left a comment:

Why isn't this simply another helper method in BeautifulSoupCrawler or BeautifulSoupCrawlingContext, like the many others we already have? IMO this introduces a somewhat different "end-user interface" compared to what we've had so far.

@Pijukatel Pijukatel (Collaborator, Author) replied:

Why isn't this simply another helper method in BeautifulSoupCrawler or BeautifulSoupCrawlingContext, like the many others we already have? IMO this introduces a somewhat different "end-user interface" compared to what we've had so far.

I am not very sure where to put it. It is a kind of general utility that can be used by any crawler, but due to its implementation it can only be used when BS is installed. I re-exported it from the BS-related __init__, as that already contains all the BS-based functionality.

@vdusek vdusek (Collaborator) left a comment:

Regarding exposing this on the context level, I was thinking about something like this:

class BeautifulSoupCrawlingContext(ParsedHttpCrawlingContext[BeautifulSoup]):
    ...

    def html_to_text(self) -> str:
        """Converts HTML content to plain text."""
        return html_to_text(self.parsed_content)

Doesn't that make sense to you? @janbuchar @Pijukatel
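
A hypothetical handler using that method (a sketch, not code from the PR):

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # No extra import needed - the context already carries the parsed page.
    text = context.html_to_text()
    await context.push_data({'url': context.request.url, 'text': text})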

@janbuchar janbuchar (Collaborator) replied:

Regarding exposing this on the context level, I was thinking about something like this: […] Doesn't that make sense to you?

If html_to_text is also available as a public utility, sure, why not. I just want to make sure that you can use this from ParselCrawler, for instance. Even if you need to install beautifulsoup first.
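
For illustration, a sketch of that combination (the import path is an assumption based on the re-export mentioned above; requires crawlee[beautifulsoup] to be installed):

from crawlee.beautifulsoup_crawler import html_to_text  # assumed re-export location


@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    # Re-serialize the Parsel-parsed page and feed it to the BS-based helper.
    text = html_to_text(context.selector.get())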

@vdusek vdusek (Collaborator) commented Dec 16, 2024

Btw, I tried to convert this functionality to Parsel using LLMs and it seems fine:

from __future__ import annotations

import re

from parsel import Selector

SKIP_TAGS = {'script', 'style', 'canvas', 'svg', 'noscript', 'title'}
BLOCK_TAGS = {
    'p',
    'h1',
    'h2',
    'h3',
    'h4',
    'h5',
    'h6',
    'ol',
    'ul',
    'li',
    'pre',
    'address',
    'blockquote',
    'dl',
    'div',
    'fieldset',
    'form',
    'table',
    'tr',
    'select',
    'option',
}

_ANY_CONSECUTIVE_WHITE_SPACES = re.compile(r'\s+')
_EMPTY_OR_ENDS_WITH_NEW_LINE = re.compile(r'(^|\n)$')


def html_to_text(source: str | Selector) -> str:
    """Converts markup string to newline-separated plain text without tags using Parsel."""
    selector = Selector(text=source) if isinstance(source, str) else source
    text = ''

    def _extract_text(elements: list[Selector]) -> None:
        """Custom parser for Parsel elements to simulate the behavior of the original function."""
        nonlocal text
        for element in elements:
            tag = element.root.tag if hasattr(element.root, 'tag') else None

            if tag in SKIP_TAGS:
                continue
            if tag == 'br':
                text += '\n'
            elif tag == 'td':
                _extract_text(element.xpath('./node()'))
                text += '\t'
            else:
                is_block_tag = tag in BLOCK_TAGS if tag else False

                if is_block_tag and not re.search(_EMPTY_OR_ENDS_WITH_NEW_LINE, text):
                    text += '\n'

                if tag:
                    _extract_text(element.xpath('./node()'))
                else:
                    compr = re.sub(_ANY_CONSECUTIVE_WHITE_SPACES, ' ', element.root.strip())
                    text += compr

                if is_block_tag and not text.endswith('\n'):
                    text += '\n'

    # Start processing the root elements
    _extract_text(selector.xpath('/*'))

    return text.strip()

Then using it from the Crawler:

from __future__ import annotations

import asyncio

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext, html_to_text


async def main() -> None:
    crawler = ParselCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.selector.css('title').get(),
            'text': html_to_text(context.selector),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Output seems okay: parsel_html_to_text_output.txt

@janbuchar janbuchar (Collaborator) replied:

Btw, I tried to convert this functionality to Parsel using LLMs and it seems fine:

I'd prefer to just keep one version of the function, but we can make it Parsel-based for all I care.

Add both implementations to respective contexts.
Set same tests for both.
@Pijukatel Pijukatel requested a review from vdusek December 17, 2024 09:05
@vdusek vdusek (Collaborator) left a comment:

Nice! A few more comments, only regarding the docs.

@vdusek vdusek (Collaborator) left a comment:

LGTM

@vdusek vdusek changed the title from 'feat: Add simple html_to_text helper function' to 'feat: Add html_to_text helper function' on Dec 18, 2024
@Pijukatel Pijukatel merged commit 2b9d970 into master Dec 18, 2024
23 checks passed
@Pijukatel Pijukatel deleted the helper-function-tag-removal branch December 18, 2024 15:43