Towards better documentation search  #6307

@squidfunk

Background

As you may have read in one of my recent comments, we're currently revising our search implementation. The current search is based on Lunr.js, which is also the search engine that MkDocs has been using since Material for MkDocs started in 2016. In the beginning, we felt that this was a good fit, as Lunr.js allows searching in the browser without the need for an external service. This makes deploying documentation much simpler, since search is and should always be a central component of each and every good documentation site.

In the past years, we've invested hundreds of hours into making search better. With the help of our awesome sponsors, we were able to ship rich search previews, support for more sophisticated tokenizers, support for Chinese, as well as better highlighting. Additionally, we made search almost twice as fast. However, in order to progress and solve the many open issues that are related to search, we decided to throw out Lunr.js. There are several reasons for that, the most important being that it has been unmaintained since 2020. Additionally, Lunr.js only allows ranking with BM25, which is a good basis, but almost all issues that are related to weird rankings are caused by the fact that BM25 is not ideal for stable typeahead search. It was meant for full-word retrieval and is almost impossible to tame for the many different use cases that we've seen in the wild. Again, we've invested a lot of time to improve the situation, but we've reached a point where this doesn't make sense anymore.

This is the reason why we're currently releasing so few new features: we're putting our entire energy into finishing the new search implementation. We're already almost on par with Lunr.js' functionality, but now have an entirely modular architecture, which will allow us to swap out everything. Yes, I mean everything: the ranking algorithm, wildcard matching, the inverted index implementation, yada, yada, yada. Solving the documentation search problem is a personal affair for me. I really hate that there's not yet a solution that works reliably, can run anywhere, and is modular so it can be easily customized.

This is what we're building.

As you may already suspect, this is a pretty big project, which is why it is taking so long. We feel it is the perfect moment to venture into this problem, because we've gathered a lot of use cases that we can now balance and optimize for. However, please understand that this takes time, so I kindly ask you to be a little more patient. Development on this project is after all 99% done by me, @squidfunk, and we're rewriting something that millions of users are using each and every day. That needs care.

Where we're currently at

First of all: search will be a separate, new project! This means you will be able to use the same engine in your other projects as well. Additionally, here's a non-exhaustive list of things we're planning to ship in the first version:

  • Modular engines – search should not only allow searching for text in an inverted index, but also support new use cases like nearest neighbors on vector embeddings. We designed the new search so that multiple engines can be configured for the same set of documents, e.g. store text and title in an inverted index, and store embeddings in a vector store – all from the same document. They should then be searched and ranked together. Additionally, document fields can be tokenized differently, and the tokenizing algorithm can be based on a regular expression or a function, allowing for maximum flexibility.

  • Powerful plugin system – plugins are first-class citizens! The new search is completely modularized. For example, the inverted index itself does not compute scores – scoring is implemented as a plugin. This means that alternative ranking plugins can be implemented. The plugin architecture is dead simple, but insanely powerful. As far as I can currently tell, there is nothing that could not be implemented as a plugin.

  • Document metadata – authors should be able to configure which parts of document metadata should be included with the documents, so that documents can be indexed with custom metadata. Currently, only text, title, location and tags are included. The new search should allow configuring which fields are indexed and how, i.e., how they should render in search results, if they should render at all (think keywords or aliases), etc. This would also allow slicing the search into different sections, e.g. for the blog, API reference, etc., by allowing the author to render those as tabs in the search bar.

  • Better accuracy – the current implementation uses Lunr.js, which uses OR to combine terms. This is not ideal for document search, as users have repeatedly reported that they expect the number of search results to narrow as more terms are entered. The new search will make it easy to switch to AND as a default combinator.

  • Detect misspellings – if a typo is entered, e.g. instlal, the engine should detect the typo and correct it to install. Many engines support this, so we should find a way to do the same.

  • Offline-first – it goes without saying that one of the highest priorities is that search will keep working offline. The new search implementation will, of course, still not need a server.

  • Span queries – searches like "single page application" should be ranked higher when those words appear together. This removes the need for exact search within quotes, which is something many non-technical users don't even know is possible in search engines like Google or Bing. The goal is that entering a few words should be enough, no special syntax should be needed.

  • Compound words – the current search allows indexing words like PascalCase as Pascal and Case by using clever lookaheads, but it also means that searching for the entire term in lowercase, pascalcase, will not return any results. This should be fixed in a way that both can be found.

  • Document hierarchy – the search index should be organized hierarchically, so that the explicit navigation structure and implicit table of contents hierarchy yield more context to search results, helping to disambiguate repetitive documentation.
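
To make the "Better accuracy" and "Compound words" items above more concrete, here is a minimal TypeScript sketch. It is not the API of the new search project: `tokenize`, `buildIndex` and `search` are hypothetical names, the tokenizer is ASCII-only, and compounds are only split at lowercase-to-uppercase boundaries. It indexes each compound both as a whole and as its parts, and combines query terms with AND by default.

```typescript
// Hypothetical sketch: none of these names come from the new search project.
type Combinator = "AND" | "OR";

// Split text on non-alphanumeric characters (ASCII only). Each word is emitted
// lowercased as a whole, plus its lowercased parts if it is a compound:
// "PascalCase" -> "pascalcase", "pascal", "case".
function tokenize(text: string): string[] {
  const tokens: string[] = [];
  for (const word of text.split(/[^A-Za-z0-9]+/)) {
    if (word.length === 0) continue;
    tokens.push(word.toLowerCase());
    // Insert a space at every lowercase/digit -> uppercase boundary.
    const parts = word.replace(/([a-z0-9])([A-Z])/g, "$1 $2").split(" ");
    if (parts.length > 1) {
      for (const part of parts) tokens.push(part.toLowerCase());
    }
  }
  return tokens;
}

// Inverted index: term -> set of ids (array positions) of matching documents.
function buildIndex(docs: string[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  docs.forEach((doc, id) => {
    for (const term of tokenize(doc)) {
      let postings = index.get(term);
      if (postings === undefined) {
        postings = new Set<number>();
        index.set(term, postings);
      }
      postings.add(id);
    }
  });
  return index;
}

// Combine the postings of all query terms: AND intersects, OR unites.
function search(
  index: Map<string, Set<number>>,
  query: string,
  combinator: Combinator = "AND"
): number[] {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  let result: Set<number> | undefined;
  for (const term of terms) {
    const postings = index.get(term) ?? new Set<number>();
    if (result === undefined) {
      result = new Set<number>(postings);
    } else if (combinator === "AND") {
      for (const id of result) {
        if (!postings.has(id)) result.delete(id);
      }
    } else {
      for (const id of postings) result.add(id);
    }
  }
  return result === undefined ? [] : [...result].sort((a, b) => a - b);
}
```

With the documents "Install the PascalCase plugin", "Install the theme" and "Plugin reference", the query `install plugin` matches all three documents with OR but only the first with AND, and both `pascalcase` and `pascal` find the first document.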

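The "Detect misspellings" item above could be approached with an edit distance that counts a swap of two adjacent characters as a single edit (the optimal string alignment variant of Damerau-Levenshtein). The sketch below is only one possible approach, not what the new search does; `editDistance` and `suggest` are hypothetical names, and a real engine would avoid scanning the whole vocabulary.

```typescript
// Hypothetical sketch: one possible way to detect typos, not the project's method.

// Optimal string alignment distance: insertions, deletions, substitutions and
// transpositions of adjacent characters each cost 1.
function editDistance(a: string, b: string): number {
  const d: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    d.push(new Array<number>(b.length + 1).fill(0));
    d[i][0] = i;
  }
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1, // deletion
        d[i][j - 1] + 1, // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
      if (
        i > 1 &&
        j > 1 &&
        a.charAt(i - 1) === b.charAt(j - 2) &&
        a.charAt(i - 2) === b.charAt(j - 1)
      ) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
      }
    }
  }
  return d[a.length][b.length];
}

// Return the vocabulary term closest to `term`, or undefined if no term is
// within `maxDistance` edits. Ties are resolved in favor of the earlier term.
function suggest(
  term: string,
  vocabulary: string[],
  maxDistance: number = 2
): string | undefined {
  let best: string | undefined;
  let bestDistance = maxDistance + 1;
  for (const candidate of vocabulary) {
    const distance = editDistance(term, candidate);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}
```

Under this metric, `instlal` is a single transposition away from `install`, so `suggest("instlal", ["install", "instance", "search"])` picks `install`.
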
Here's a list of ideas, partially based on open change requests, which we will implement after the first version is out and has reached a stable state. We believe all of those features will be great additions:

  • Stemming and segmentation – of course, search should be multi-lingual and support language-specific stemming and text segmentation for logographic languages like Japanese and Chinese. We should check whether we can use browser-based APIs for text segmentation, or if not available, maybe fall back to a polyfill. Alternatively (or ideally?), segmentation could be done during build time, so that the payload shipped to the user is even smaller. Additionally, authors should be allowed to provide stopwords. Here's an interesting stemmer implementation.

  • Compact summaries – the current search indexes the HTML and divides it into blocks on the top-level. If a long list is contained, and a single word matches in that list, the entire list is rendered as part of the search results. This is not ideal, since the user has to scroll through a lot of irrelevant content. The new search should provide an intelligent summarization algorithm, possibly with a configurable way to detect endings of sentences and paragraphs.

  • Federated search – it should be possible to federate the search with other sites that are built with the same engine, so that a single MkDocs site and a federated search can be built from multiple MkDocs projects. This could also be applied to different versions. The author must be able to influence the rendering of federated results.

  • Caching – since we're re-architecting the entire search implementation, we can leverage caching, so that the index can be completely persisted and restored from memory without the need to rebuild it every time.

  • Deep linking – the entire search must be serializable to a URL query string, so that the query the user entered, as well as all filters that were selected can be directly linked to.

  • Adaptive rendering – the search result list should be much smaller than it currently is, only including text when the search has only a few results, adapting to what the user expects. When the user only enters a few characters, a lot of documents will be returned. The more characters are entered, the fewer results will be returned, and at some threshold, the document text should be shown. This threshold should be configurable and tunable.

  • Fuzzy finder – as opposed to the common tokenization and ranking with BM25, search should support indexing data like file paths, class names, attribute names, etc. with a fuzzy finder approach, similar to what IDEs like VS Code do when you're using autocomplete.

  • Allow using search as a component in Markdown – allow the user to embed search bars at arbitrary locations, possibly re-configured.

  • Rich results – not only code blocks should be renderable, but also Mermaid diagrams and code annotations.

  • Recommend term removal – if a search matches no results, recommend to the user which term can be removed.

  • Search history – we could preserve the search history, which means that users have an easy way to go back to previous search results without having to re-enter their queries. Entries in the search history could be cleared by the user one by one.

  • Synonyms – authors should be allowed to provide synonyms for specific words. We need to think of a good way to signal to the user that a synonym was found, or we just replace the word with the synonym in search results.

  • Index non-Markdown sources – It should be possible to index other contents alongside Markdown, including HTML, PDFs, etc., possibly with the help of plugins.

  • Arbitrary sections – authors should be allowed to add custom sections like Admonitions or tabs (or whatever) to the search, in order to provide an even more flexible structure.

  • Search separator testing – we should provide a method for authors to easily test the search separator on their site.
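
As an illustration of the "Deep linking" item above, the following sketch serializes a search state to a URL query string and back using the standard `URLSearchParams` API. The `SearchState` shape and the parameter names `q` and `filter` are assumptions made for this example, not the format the new search will use.

```typescript
// Hypothetical sketch: the state shape and parameter names are assumptions.
interface SearchState {
  query: string; // the text the user entered
  filters: string[]; // e.g. selected tags or sections
}

// Serialize the state so that it can be appended to a URL after "?".
function toQueryString(state: SearchState): string {
  const params = new URLSearchParams();
  params.set("q", state.query);
  for (const filter of state.filters) params.append("filter", filter);
  return params.toString();
}

// Restore the state from a query string (a leading "?" is accepted).
function fromQueryString(queryString: string): SearchState {
  const params = new URLSearchParams(queryString);
  return {
    query: params.get("q") ?? "",
    filters: params.getAll("filter"),
  };
}
```

For example, the query `single page application` with the filters `blog` and `api` serializes to `q=single+page+application&filter=blog&filter=api`, and parsing that string yields the original state again.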

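The "Fuzzy finder" item above refers to matching the characters of a pattern in order, but not necessarily adjacently, as editors do for file paths. A minimal sketch of that idea follows. The scoring weights are arbitrary, the matching is greedy rather than optimal, and `fuzzyScore` and `fuzzyFind` are hypothetical names that do not reflect how VS Code or the new search actually rank candidates.

```typescript
// Hypothetical sketch: greedy subsequence matching with arbitrary weights.

// Return a score if every character of `pattern` occurs in `candidate` in the
// same order (case-insensitive), or undefined if it does not match.
function fuzzyScore(pattern: string, candidate: string): number | undefined {
  const p = pattern.toLowerCase();
  const c = candidate.toLowerCase();
  let score = 0;
  let matched = 0; // number of pattern characters matched so far
  let previousMatch = -2; // index of the last matched candidate character
  for (let i = 0; i < c.length && matched < p.length; i++) {
    if (c.charAt(i) !== p.charAt(matched)) continue;
    score += 1;
    // Reward runs of consecutive matches.
    if (i === previousMatch + 1) score += 2;
    // Reward matches at the start of the candidate or of a path/name segment.
    if (i === 0 || "/._-".includes(c.charAt(i - 1))) score += 3;
    previousMatch = i;
    matched++;
  }
  return matched === p.length ? score : undefined;
}

// Return all matching candidates, best score first.
function fuzzyFind(pattern: string, candidates: string[]): string[] {
  const scored: { candidate: string; score: number }[] = [];
  for (const candidate of candidates) {
    const score = fuzzyScore(pattern, candidate);
    if (score !== undefined) scored.push({ candidate, score });
  }
  scored.sort((a, b) => b.score - a.score);
  return scored.map((entry) => entry.candidate);
}
```

With this scoring, the pattern `sw` ranks `search/worker.ts` above `assets/rows.ts`, because both matched characters start a path segment in the former.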

This list is far from complete. We have so many more ideas, which we'll share when the time has come. We'll keep this issue updated, so feel free to subscribe or check back from time to time. We hope to push out the first candidate before the end of this year! Thank you for your patience and for your trust in Material for MkDocs.

Metadata

Labels: announcement (Issue announces news or new features)
