-
Notifications
You must be signed in to change notification settings - Fork 9
Add plain text retrieval #41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| type: 'text', | ||
| text: `Text:\n${ stripHtml( result.html ).result }` | ||
| } ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this should be its own function
results.push( newFunction() )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parts of this will get rewritten in #38 anyway.
|
Looks reasonable. I do wonder if info gets lost and if that is important. For instance, knowing that something is a list item in a bulleted list |
|
I'm also not sure. At least right now we have source, HTML and plain text, so if your response looks wrong you can nudge it to use HTML. I'm also slightly wondering about the utility of plaintext vs HTML when we still run into the context limit issue. The chunking workaround in #38 needs to be applied anyway. Although I guess having a smaller string is still better than a longer string. The |
|
Do we need to strip the HTML at all knowing that the LLM can understand HTML? |
|
Stripping HTML was partially an attempt to reduce the size. I don't know if there is a real requirement to render the page content in the LLM. It seems like more effort to convert HTML to markdown. And if the whole HTML fit in the context window, then I suspect you could just ask the LLM to render the HTML as markdown, or just render it in whatever way it is able to. |
|
The implementation looks good to me. I am not sure if the plain text output is helpful since there are often templates that go into the HTML and wikitext, which can mess up the output. Some of the templates or other HTML elements can contain information important to the page. Alternatively, it might be more useful for implement some kind of page summary extraction. Which is similar to the Page Content Service Summary endpoint on WMF wikis or the TextExtracts API |
|
Getting a deterministic page summary might be useful (as opposed to asking the LLM to get the full text and summarize it). However, my understanding of TextExtracts is that it returns a snippet of the content, so it's not really a summarized version of the content if that snippet is not written as a summary. This might still be useful wrapped in a tool like But by itself, getting a summary/extract is not an alternative for when the full text is needed. The main question for this PR is whether there is a technical advantage in explicitly providing non-HTML content. |
|
Given that it is not clear if we need the plain text, we could keep the PR open for potential future continuation. |
Closes #39
This uses a library to do the HTML stripping.
It still runs into the context window limit: #30
The wikitext there is actually present in the HTML: https://en.wikipedia.org/w/rest.php/v1/page/Earth/with_html. It did strip the HTML, although it looks like something went slightly wrong here:
