Add plain text retrieval #41

malberts · 2025-06-01T15:21:31Z

Closes #39

This uses a library to do the HTML stripping.

It still runs into the context window limit: #30

The wikitext there is actually present in the HTML: https://en.wikipedia.org/w/rest.php/v1/page/Earth/with_html. It did strip the HTML, although it looks like something went slightly wrong here:

JeroenDeDauw · 2025-06-01T21:59:22Z

src/tools/get-page.ts

+				type: 'text',
+				text: `Text:\n${ stripHtml( result.html ).result }`
+			} );
+		}


Looks like this should be its own function

results.push( newFunction() )

Parts of this will get rewritten in #38 anyway.
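
A minimal sketch of what that extraction could look like. The helper name `buildPlainTextContent`, the `TextContent` shape, and the injected `strip` parameter are all hypothetical and not taken from the PR; the naive regex stripper only stands in for the library call so the example runs on its own.

```typescript
// Shape of one entry in the tool's result list (assumed from the diff above).
interface TextContent {
	type: 'text';
	text: string;
}

// Anything with the same call shape as stripHtml( html ).result.
type HtmlStripper = ( html: string ) => { result: string };

// Hypothetical helper: builds the plain text entry in one place.
function buildPlainTextContent( html: string, strip: HtmlStripper ): TextContent {
	return {
		type: 'text',
		text: `Text:\n${ strip( html ).result }`
	};
}

// Naive stand-in stripper for demonstration only; the PR uses a library instead.
const naiveStrip: HtmlStripper = ( html ) => ( {
	result: html.replace( /<[^>]*>/g, '' ).trim()
} );

const results: TextContent[] = [];
results.push( buildPlainTextContent( '<p>Hello <b>Earth</b></p>', naiveStrip ) );
```

Passing the stripper in as an argument is just one option; it keeps the helper easy to test without the library installed.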

JeroenDeDauw · 2025-06-01T22:00:30Z

Looks reasonable. I do wonder whether information gets lost and whether that matters. For instance, knowing that something is an item in a bulleted list.

malberts · 2025-06-02T00:22:03Z

I'm also not sure. At least right now we have source, HTML and plain text, so if your response looks wrong you can nudge it to use HTML.

I'm also slightly wondering about the utility of plaintext vs HTML when we still run into the context limit issue. The chunking workaround in #38 needs to be applied anyway. Although I guess having a smaller string is still better than a longer string.

The string-strip-html library supports not stripping some tags (and probably some other config I did not look at, including full control via callback), so we can tweak that. But then I wonder about the semantics of other tags, e.g. tables and emphasis. And if we then end up with partial HTML, do we still need plain text and full HTML? We might have to tweak some of the descriptions we provide in the tool definition, to default a generic "give me page content" to whichever is the most common.
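
To make the "partial HTML" idea concrete, here is a rough standalone sketch that strips every tag except an allowlist. It deliberately does not use the string-strip-html API (the library's own options would be the real route); `stripHtmlKeeping` is a hypothetical name, and the regex is a simplification that ignores comments, scripts, and malformed markup.

```typescript
// Hypothetical helper: remove all tags except those named in keepTags.
// The text between tags is left untouched.
function stripHtmlKeeping( html: string, keepTags: string[] ): string {
	const keep = new Set( keepTags.map( ( tag ) => tag.toLowerCase() ) );
	// Matches opening and closing tags and captures the tag name.
	return html.replace(
		/<\/?([a-zA-Z][a-zA-Z0-9-]*)[^>]*>/g,
		( match: string, name: string ) => ( keep.has( name.toLowerCase() ) ? match : '' )
	);
}

// Keep list structure, drop paragraphs and emphasis.
const partialHtml = stripHtmlKeeping(
	'<p>Moons:</p><ul><li><em>Luna</em></li></ul>',
	[ 'ul', 'li' ]
);
```

With this input the list markup survives while `<p>` and `<em>` are dropped, which shows the trade-off raised above: the list structure is preserved but the emphasis semantics are lost.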

alistair3149 · 2025-06-02T16:14:12Z

Do we need to strip the HTML at all, knowing that the LLM can understand HTML?
If we are just looking for a generic "give me page content" tool, we can use a Markdown converter instead, since the LLM will mostly process the Markdown syntax.

malberts · 2025-06-02T19:24:09Z

Stripping HTML was partially an attempt to reduce the size.

I don't know if there is a real requirement to render the page content in the LLM. It seems like more effort to convert HTML to Markdown. And if the whole HTML fits in the context window, then I suspect you could just ask the LLM to render the HTML as Markdown, or just render it in whatever way it is able to.

alistair3149 · 2025-06-02T19:51:50Z

The implementation looks good to me.

I am not sure if the plain text output is helpful since there are often templates that go into the HTML and wikitext, which can mess up the output. Some of the templates or other HTML elements can contain information important to the page.

Alternatively, it might be more useful to implement some kind of page summary extraction, similar to the Page Content Service summary endpoint on WMF wikis or the TextExtracts API.

malberts · 2025-06-03T11:27:08Z

Getting a deterministic page summary might be useful (as opposed to asking the LLM to get the full text and summarize it). However, my understanding of TextExtracts is that it returns a snippet of the content, so it's not really a summarized version of the content if that snippet is not written as a summary. This might still be useful wrapped in a tool like get-page-text-extracts if you just need a part of the page.
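
As a rough illustration of what such a tool might request, the sketch below builds a TextExtracts query URL. The function name `buildTextExtractsUrl` and the `introOnly` flag are hypothetical; the query parameters (`prop=extracts`, `explaintext`, `exintro`) are the ones documented for the TextExtracts extension, which has to be installed on the target wiki for this to work.

```typescript
// Hypothetical helper: builds an Action API URL that asks TextExtracts
// for a plain text extract of one page. It performs no network request.
function buildTextExtractsUrl( apiUrl: string, title: string, introOnly = true ): string {
	const url = new URL( apiUrl );
	url.searchParams.set( 'action', 'query' );
	url.searchParams.set( 'prop', 'extracts' );
	// Return plain text instead of limited HTML.
	url.searchParams.set( 'explaintext', '1' );
	if ( introOnly ) {
		// Only the content before the first section heading.
		url.searchParams.set( 'exintro', '1' );
	}
	url.searchParams.set( 'titles', title );
	url.searchParams.set( 'format', 'json' );
	return url.toString();
}

const extractUrl = buildTextExtractsUrl( 'https://en.wikipedia.org/w/api.php', 'Earth' );
```

Note that `exintro` returns the lead section as written, so this is still a snippet of the page rather than a generated summary.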

But by itself, getting a summary/extract is not an alternative for when the full text is needed. The main question for this PR is whether there is a technical advantage in explicitly providing non-HTML content.

JeroenDeDauw · 2025-06-03T22:00:05Z

Given that it is not clear if we need the plain text, we could keep the PR open for potential future continuation.

JeroenDeDauw reviewed Jun 1, 2025

Add plain text retrieval

32ce7c1

malberts force-pushed the plaintext branch from 708a410 to 32ce7c1 Compare June 2, 2025 19:40

malberts marked this pull request as draft June 21, 2025 12:38
