
Conversation

@Pijukatel Pijukatel (Collaborator) commented Dec 9, 2024

Description

Add a simple html_to_text helper function. It generates newline-separated plain text without tags.
The function is designed to behave the same way as the existing JavaScript implementation.
Add tests.
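
For illustration, the intended behavior on a tiny snippet (a hypothetical example, not taken from the PR's tests; the import is omitted since the function's public location is discussed below):

# Block-level tags are turned into newlines, inline tags are dropped,
# and runs of whitespace are compressed:
html_to_text('<div><p>Hello <b>world</b></p><p>Bye</p></div>')
# -> 'Hello world\nBye'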

Issues

Closes: A helper function for simple HTML tag removal
@Pijukatel Pijukatel added the labels 'enhancement' (New feature or request) and 't-tooling' (Issues with this label are in the ownership of the tooling team) on Dec 9, 2024
@github-actions bot added this to the '104th sprint - Tooling team' milestone on Dec 9, 2024
@github-actions bot added the label 'tested' (Temporary label used only programmatically for some analytics) on Dec 9, 2024
@Pijukatel Pijukatel changed the title from 'Add simple html_to_text helper function.' to 'Add simple html_to_text helper function' on Dec 9, 2024
@Pijukatel Pijukatel changed the title from 'Add simple html_to_text helper function' to 'feat: Add simple html_to_text helper function' on Dec 9, 2024
@Mantisus Mantisus (Collaborator) commented Dec 9, 2024

get_text in BeautifulSoup splits the text at every tag it encounters.

That is, for HTML like this we will get rather strange results:

<!DOCTYPE html>
<html>
<head>
   <title>List with different elements</title>
</head>
<body>
   <ul>
       <li>This is a <a href="https://example.com">link</a> element</li>
       <li>This is a text with <b>bold</b> word</li>
       <li>This is an element<br>with line break</li>
   </ul>
</body>
</html>
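
For a concrete illustration of the concern, a minimal sketch (assuming the helper joins text nodes with a newline separator):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, 'html.parser')  # html_source holds the document above
print(soup.get_text(separator='\n'))
# Every tag boundary becomes a separator, so the first list item comes out as:
# This is a
# link
#  element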

@Pijukatel Pijukatel (Collaborator, Author) replied, quoting @Mantisus's comment above:

Yes, I thought about it. But I am not so sure about the requirements for our function. I can try to follow the JS implementation more closely.

@Mantisus Mantisus (Collaborator) commented Dec 9, 2024

Yes, I thought about it. But I am not so sure about the requirements for our function. I can try to follow the JS implementation more closely.

I'm not quite sure either.

But just as a thought: since we always have lxml in our dependencies along with beautifulsoup, you could try something like this:

from lxml import html

text = html.fromstring(html_text).text_content().strip()

plus a regex to clean up the whitespace.
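
A minimal sketch of that suggestion (the exact whitespace regex is an assumption):

import re

from lxml import html


def lxml_html_to_text(html_text: str) -> str:
    # text_content() drops all tags but keeps the original whitespace,
    # so collapse consecutive whitespace runs afterwards.
    text = html.fromstring(html_text).text_content().strip()
    return re.sub(r'\s+', ' ', text)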

@Pijukatel Pijukatel marked this pull request as ready for review December 11, 2024 10:13
# Compress white spaces outside of pre block
compr = re.sub(r'\s+', ' ', page_element.get_text())
# If text is empty or ends with a whitespace, don't add the leading whitespace or new line
if (compr.startswith((' ', '\n'))) and re.search(r'(^|\s)$', text):
@Pijukatel (Collaborator, Author) commented on the lines above:

The JS version has only compr.startsWith(' '), but I can't understand how it passed the test. In Python, to pass exactly the same test, I had to do: compr.startswith((' ', '\n')).
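
For context: str.startswith accepts a tuple of prefixes and matches if any of them matches, which is what lets the Python variant catch both cases:

compr = '\nsome text'
compr.startswith(' ')           # False - misses the leading newline
compr.startswith((' ', '\n'))   # True  - matches either prefix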

@Pijukatel Pijukatel requested a review from janbuchar December 11, 2024 10:16
@janbuchar janbuchar requested a review from Mantisus December 11, 2024 10:50
@janbuchar janbuchar (Collaborator) commented:

Tagging @Mantisus for review - I believe you have the most field experience with beautifulsoup.

@vdusek vdusek (Collaborator) left a comment:

I'm just surprised that html_to_text lives in the private _utils module, is not exposed publicly, and is not used anywhere. Is that intended 🙂?

}


def html_to_text(source: str | BeautifulSoup) -> str:
A collaborator commented:

Is this the "entrypoint"? If it is, we need to expose it to the user in a better way. crawlee._utils is private.

A collaborator commented:

Don't we want to integrate it into some crawlers, similar to e.g. the enqueue_links helper?

@Pijukatel (Collaborator, Author) replied:

Yes, I was not really sure where or how this is going to be used :-)

@Pijukatel (Collaborator, Author) added:

I exposed it in the BS crawler. It is so BS-dependent that you have to install crawlee[beautifulsoup] to be able to use it anyway.
After checking Parsel and Playwright, I realized that I am not sure I could implement this tree-based processing with those two packages alone.

else:
    # Block elements must be surrounded by newlines (unless at the beginning of text)
    is_block_tag = page_element.name.lower() in BLOCK_TAGS
    if is_block_tag and not re.search(r'(^|\n)$', text):
A collaborator commented:

Is it possible to pre-compile the regexes with re.compile? I think this would be useful if the function is used frequently during the crawling process.
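
For illustration, the suggested pattern (the pattern names below match the ones the PR ends up using in the Parsel version further down; page_element and text come from the surrounding helper):

import re

# Compile once at module level, reuse on every call:
_ANY_CONSECUTIVE_WHITE_SPACES = re.compile(r'\s+')
_EMPTY_OR_ENDS_WITH_NEW_LINE = re.compile(r'(^|\n)$')

# Inside the helper, call the compiled patterns directly:
compr = _ANY_CONSECUTIVE_WHITE_SPACES.sub(' ', page_element.get_text())
if not _EMPTY_OR_ENDS_WITH_NEW_LINE.search(text):
    text += '\n'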

@Pijukatel (Collaborator, Author) replied:

Done

Expose this function in BS crawler.
@vdusek vdusek (Collaborator) left a comment:

Why isn't this simply another helper method in BeautifulSoupCrawler or BeautifulSoupCrawlingContext, like the many others we already have? IMO this introduces a somewhat different "end-user interface" compared to what we've had so far.

@Pijukatel Pijukatel (Collaborator, Author) replied:

Why isn't this simply another helper method in BeautifulSoupCrawler or BeautifulSoupCrawlingContext, like the many others we already have? IMO this introduces a somewhat different "end-user interface" compared to what we've had so far.

I am not very sure where to put it. It is a kind of general utility that can be used by any crawler, but due to its implementation it can only be used when BS is installed. I re-exported it from the BS-related __init__, as that already contains all the BS-based functionality.

@vdusek vdusek (Collaborator) left a comment:

Regarding exposing this on the context level, I was thinking about something like this:

class BeautifulSoupCrawlingContext(ParsedHttpCrawlingContext[BeautifulSoup]):
    ...

    def html_to_text(self) -> str:
        """Converts HTML content to plain text."""
        return html_to_text(self.parsed_content)

Doesn't that make sense to you? @janbuchar @Pijukatel
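
A hypothetical handler using that method (a sketch, not code from the PR):

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # No extra import needed - the context already carries the parsed page.
    text = context.html_to_text()
    await context.push_data({'url': context.request.url, 'text': text})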

@janbuchar janbuchar (Collaborator) replied:

Regarding exposing this on the context level, I was thinking about something like this: […] Doesn't that make sense to you?

If html_to_text is also available as a public utility, sure, why not. I just want to make sure that you can use this from ParselCrawler, for instance. Even if you need to install beautifulsoup first.
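
For illustration, a sketch of that combination (the import path is an assumption based on the re-export mentioned above; requires crawlee[beautifulsoup] to be installed):

from crawlee.beautifulsoup_crawler import html_to_text  # assumed re-export location


@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    # Re-serialize the Parsel-parsed page and feed it to the BS-based helper.
    text = html_to_text(context.selector.get())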

@vdusek vdusek (Collaborator) commented Dec 16, 2024

Btw, I tried to convert this functionality to Parsel using LLMs and it seems fine:

from __future__ import annotations

import re

from parsel import Selector

SKIP_TAGS = {'script', 'style', 'canvas', 'svg', 'noscript', 'title'}
BLOCK_TAGS = {
    'p',
    'h1',
    'h2',
    'h3',
    'h4',
    'h5',
    'h6',
    'ol',
    'ul',
    'li',
    'pre',
    'address',
    'blockquote',
    'dl',
    'div',
    'fieldset',
    'form',
    'table',
    'tr',
    'select',
    'option',
}

_ANY_CONSECUTIVE_WHITE_SPACES = re.compile(r'\s+')
_EMPTY_OR_ENDS_WITH_NEW_LINE = re.compile(r'(^|\n)$')


def html_to_text(source: str | Selector) -> str:
    """Converts markup string to newline-separated plain text without tags using Parsel."""
    selector = Selector(text=source) if isinstance(source, str) else source
    text = ''

    def _extract_text(elements: list[Selector]) -> None:
        """Custom parser for Parsel elements to simulate the behavior of the original function."""
        nonlocal text
        for element in elements:
            tag = element.root.tag if hasattr(element.root, 'tag') else None

            if tag in SKIP_TAGS:
                continue
            if tag == 'br':
                text += '\n'
            elif tag == 'td':
                _extract_text(element.xpath('./node()'))
                text += '\t'
            else:
                is_block_tag = tag in BLOCK_TAGS if tag else False

                if is_block_tag and not re.search(_EMPTY_OR_ENDS_WITH_NEW_LINE, text):
                    text += '\n'

                if tag:
                    _extract_text(element.xpath('./node()'))
                else:
                    compr = re.sub(_ANY_CONSECUTIVE_WHITE_SPACES, ' ', element.root.strip())
                    text += compr

                if is_block_tag and not text.endswith('\n'):
                    text += '\n'

    # Start processing the root elements
    _extract_text(selector.xpath('/*'))

    return text.strip()

Then using it from the Crawler:

from __future__ import annotations

import asyncio

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext, html_to_text


async def main() -> None:
    crawler = ParselCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.selector.css('title').get(),
            'text': html_to_text(context.selector),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Output seems okay: parsel_html_to_text_output.txt

@janbuchar janbuchar (Collaborator) replied:

Btw, I tried to convert this functionality to Parsel using LLMs and it seems fine:

I'd prefer to just keep one version of the function, but we can make it Parsel-based for all I care.

Add both implementations to respective contexts.
Set same tests for both.
@Pijukatel Pijukatel requested a review from vdusek December 17, 2024 09:05
@vdusek vdusek (Collaborator) left a comment:

Nice! A few more comments, only regarding the docs.

@vdusek vdusek (Collaborator) left a comment:

LGTM

@vdusek vdusek changed the title from 'feat: Add simple html_to_text helper function' to 'feat: Add html_to_text helper function' on Dec 18, 2024
@Pijukatel Pijukatel merged commit 2b9d970 into master Dec 18, 2024
23 checks passed
@Pijukatel Pijukatel deleted the helper-function-tag-removal branch December 18, 2024 15:43