Towards better documentation search 

## Background

As you may have read in one of my recent comments, we're currently revising our search implementation. The current search is based on [Lunr.js](https://github.com/olivernn/lunr.js), which is also the search engine that MkDocs was already using when Material for MkDocs started in 2016. In the beginning, we felt that this was a good fit, as Lunr.js allows searching in the browser without the need for an external service. This makes deploying documentation much simpler, which matters because search is, and should always be, a central component of every good documentation site.

Over the past years, we've invested hundreds of hours into making search better. With the help of our awesome sponsors, we were able to ship [rich search previews](https://squidfunk.github.io/mkdocs-material/blog/2021/09/13/search-better-faster-smaller/#rich-search-previews), [support for more sophisticated tokenizers](https://squidfunk.github.io/mkdocs-material/blog/2021/09/13/search-better-faster-smaller/#tokenizer-lookahead), [support for Chinese](https://squidfunk.github.io/mkdocs-material/blog/2022/05/05/chinese-search-support/), as well as [better highlighting](https://squidfunk.github.io/mkdocs-material/blog/2021/09/13/search-better-faster-smaller/#accurate-highlighting). Additionally, we made search almost [twice as fast](https://squidfunk.github.io/mkdocs-material/blog/2021/09/13/search-better-faster-smaller/#benchmarks). However, in order to make progress and solve the many open issues related to search, we decided to throw out Lunr.js. There are several reasons for that, the most important being that it has been unmaintained since 2020. Additionally, Lunr.js only allows ranking with [BM25](https://en.wikipedia.org/wiki/Okapi_BM25), which is a good basis, but almost all issues related to weird rankings are caused by the fact that BM25 is not ideal for stable typeahead search. It was meant for full-word retrieval and is almost impossible to tame for the many different use cases that we've seen in the wild. Again, we've invested a lot of time to improve the situation, but we've reached a point where this doesn't make sense anymore.
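
For reference, this is the textbook form of BM25 as given on the Wikipedia page linked above. The scoring inside Lunr.js may differ in detail, so treat this as background rather than a description of the current implementation:

```math
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \, (k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```

Here $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents, $n(q_i)$ is the number of documents containing $q_i$, and $k_1$ and $b$ are free parameters. Every factor is defined over whole terms. One plausible reading of the typeahead problem is that a partially typed prefix has to be expanded to whichever indexed terms it happens to match, so the set of contributing terms, and with it the IDF and frequency factors, can shift with each keystroke.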

This is why we're currently releasing so few new features: we're putting our entire energy into finishing the new search implementation. We're already almost on par with Lunr.js' functionality, but now have an __entirely modular architecture__, which will allow us to swap out everything. Yes, I mean __everything__: the ranking algorithm, wildcard matching, the inverted index implementation, yada, yada, yada. Solving the documentation search problem is a personal affair for me. I really hate that there's not yet a solution that works reliably, can run anywhere, and is modular so it can be easily customized.

This is what we're building.

As you may already suspect, this is a pretty big project, which is why it is taking so long. We feel it is the perfect moment to venture into this problem, because we've gathered a lot of use cases that we can now balance and optimize for. However, please understand that this takes time, so I kindly ask you to be a little more patient. Development on this project is, after all, 99% done by me, @squidfunk, and we're rewriting something that millions of users are using each and every day. That needs care.

## Where we're currently at

First of all: __search will be a separate, new project__! This means you will be able to use the same engine in your other projects as well. Additionally, here's a non-exhaustive list of things we're planning to ship in the first version:

- [x] __Modular engines__: Search should not only allow searching for text in an inverted index, but also support new use cases like nearest-neighbor search on vector embeddings. We designed the new search so that multiple engines can be configured for the same set of documents, e.g. store `text` and `title` in an inverted index, and store `embeddings` in a vector store – all from the same document. They should then be searched and ranked together. Additionally, document fields can be tokenized differently, and the tokenizing algorithm can be based on a regular expression or a function, allowing for maximum flexibility.

    - #5936

- [x] __Powerful plugin system__: Plugins are first-class citizens! The new search is completely modularized. For example, the inverted index itself does not compute scores – scoring is implemented as a plugin. This means alternative ranking plugins can be implemented. The plugin architecture is dead simple, but insanely powerful. As far as I can tell, there is nothing that could not be implemented as a plugin.

    - #4980

- [x] __Document metadata__ – authors should be able to configure which parts of document metadata are included with the documents, so that documents can be indexed with custom metadata. Currently, only `text`, `title`, `location` and `tags` are included. The new search should allow configuring how each field is indexed and how it renders in search results, if it should render at all (think keywords or aliases), etc. This would also allow slicing the search into different sections, e.g. for the blog, API reference, etc., by allowing the author to render those as tabs in the search bar.

    - #3174
    - #4983
    - #4965
- [x] __Better accuracy__ – the current implementation uses Lunr.js, which uses `OR` to combine terms. This is not ideal for document search, as users have repeatedly reported that they expect entering more terms to narrow down the search results. The new search will make it easy to switch to `AND` as the default combinator.
- [x] __Detect misspellings__ – if a typo is entered, e.g. `instlal`, the engine should detect the typo and correct it to `install`. Many engines support this, so we should find a way to do the same.
- [x] __Offline-first__ – it goes without saying that one of the highest priorities is that search will keep working offline. The new search implementation will, of course, still not need a server.


- [x] __Span queries__ – searches like "single page application" should be ranked higher when those words appear together. This removes the need for exact search within quotes, which is something many non-technical users don't even know is possible in search engines like Google or Bing. The goal is that entering a few words should be enough; no special syntax should be needed.
 
- [x] __Compound words__ – the current search allows indexing words like `PascalCase` as `Pascal` and `Case` by using clever lookaheads, but it also means that searching for the entire term in lowercase, `pascalcase`, will not return any results. This should be fixed so that both can be found.

    - #6632
- [x] __Document hierarchy__ – the search index should be organized hierarchically, so that the explicit navigation structure and implicit table of contents hierarchy yield more context to search results, helping to disambiguate repetitive documentation. 

    - #3787
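
To make two of the items above more tangible (a tokenizer given as a function that keeps compound words findable, and `AND` versus `OR` as the combinator), here's a minimal TypeScript sketch. All names (`tokenize`, `buildIndex`, `search`, `Combinator`) are made up for illustration and are not the API of the new engine; ranking is left out entirely.

```typescript
// Illustrative sketch only: all names are hypothetical, not the new engine's API.
type Combinator = "AND" | "OR";
interface Doc { id: string; text: string }

// Tokenizer as a plain function: split on non-alphanumerics, then split each
// word at lower-to-upper boundaries. Compounds are also kept whole, so
// `PascalCase` yields `pascalcase`, `pascal` and `case`.
function tokenize(text: string): string[] {
  const tokens: string[] = [];
  for (const word of text.split(/[^A-Za-z0-9]+/)) {
    if (word === "") continue;
    const parts = word.replace(/([a-z0-9])([A-Z])/g, "$1 $2").split(" ");
    if (parts.length > 1) tokens.push(word.toLowerCase());
    for (const part of parts) tokens.push(part.toLowerCase());
  }
  return tokens;
}

// Inverted index: token -> set of ids of the documents containing it.
function buildIndex(docs: Doc[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const doc of docs) {
    for (const token of tokenize(doc.text)) {
      const ids = index.get(token) ?? new Set<string>();
      ids.add(doc.id);
      index.set(token, ids);
    }
  }
  return index;
}

// With "AND", every query token must match; with "OR", any one is enough.
function search(
  index: Map<string, Set<string>>,
  query: string,
  combinator: Combinator = "AND"
): string[] {
  const postings = tokenize(query).map(t => index.get(t) ?? new Set<string>());
  const candidates = new Set<string>();
  for (const ids of postings) ids.forEach(id => candidates.add(id));
  const all = Array.from(candidates).sort();
  return combinator === "AND"
    ? all.filter(id => postings.every(ids => ids.has(id)))
    : all;
}

const demoIndex = buildIndex([
  { id: "a", text: "Install the PascalCase plugin" },
  { id: "b", text: "Install plugins offline" },
  { id: "c", text: "camelCase versus snake_case" },
]);
```

With this toy index, `search(demoIndex, "install pascalcase")` returns only document `a`, while passing `"OR"` as the third argument also returns `b`. Adding a term narrows the result set under `AND` and widens it under `OR`, which is the behavior users reported as surprising.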
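
For the misspelling item, one common technique (an assumption here, not necessarily what the new engine does) is to compare the entered term against the indexed vocabulary with an edit distance that counts a swap of two adjacent characters as a single edit. The sketch below uses the optimal string alignment distance; `editDistance`, `suggest`, and the default threshold of one edit are hypothetical.

```typescript
// Illustrative sketch only: function names and the threshold are assumptions.

// Optimal string alignment distance: Levenshtein plus transposition of two
// adjacent characters, so `instlal` -> `install` counts as a single edit.
function editDistance(a: string, b: string): number {
  const d: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    d.push([i]);
    for (let j = 1; j <= b.length; j++) {
      if (i === 0) { d[0].push(j); continue; }
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      let best = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution or match
      );
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        best = Math.min(best, d[i - 2][j - 2] + 1); // transposition
      }
      d[i].push(best);
    }
  }
  return d[a.length][b.length];
}

// Return the closest indexed term within `maxDistance` edits, or undefined
// if the term is already known or nothing is close enough.
function suggest(
  term: string,
  vocabulary: string[],
  maxDistance = 1
): string | undefined {
  if (vocabulary.indexOf(term) >= 0) return undefined;
  let best: string | undefined;
  let bestDistance = maxDistance + 1;
  for (const candidate of vocabulary) {
    const distance = editDistance(term, candidate);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}
```

Scanning the whole vocabulary like this is linear in the number of indexed terms, which is fine for a sketch. A real engine would more likely restrict the candidates first, e.g. by prefix or term length.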

Here's a list of ideas, partially based on open change requests, which we will implement after the first version is out and has reached a stable state. We believe all of those features will be great additions:



- [ ] __Stemming and segmentation__ – of course, search should be multi-lingual and support language-specific stemming and text segmentation for languages written without spaces between words, like Japanese and Chinese. We should check whether we can use browser-based APIs for text segmentation, or if not available, maybe fall back to a polyfill. Alternatively (or ideally?), segmentation could be done during build time, so that the payload shipped to the user is even smaller. Additionally, authors should be able to provide their own [stopwords](https://github.com/6/stopwords-json). [Here's an interesting stemmer implementation](https://github.com/axa-group/nlp.js/tree/master/packages/lang-ar/src).

- [ ] __Compact summaries__ – the current search indexes the HTML and divides it into blocks at the top level. If a block contains a long list, and a single word matches in that list, the entire list is rendered as part of the search results. This is not ideal, since the user has to scroll through a lot of irrelevant content. The new search should provide an intelligent summarization algorithm, possibly with a configurable way to detect endings of sentences and paragraphs.

    - #4278

- [ ] __Federated search__ – it should be possible to federate the search with other sites that are built with the same engine, so that a single MkDocs site and a federated search can be built from multiple MkDocs projects. This could also be applied to different versions. The author must be able to influence the rendering of federated results.

    - #5230

- [ ] __Caching__ – since we're re-architecting the entire search implementation, we can leverage caching, so that the index can be completely persisted and restored from memory without the need to rebuild it every time.

    - #5391

- [ ] __Deep linking__ – the entire search must be serializable to a URL query string, so that the query the user entered, as well as all filters that were selected can be directly linked to.

- [ ] __Adaptive rendering__ – the search result list should be much smaller than it currently is, only including text when the search has only a few results, adapting to what the user expects. When the user only enters a few characters, a lot of documents will be returned. The more characters are entered, the fewer results will be returned, and at some threshold, the document text should be shown. This threshold should be configurable and tunable.



- [ ] __Fuzzy finder__ – as opposed to the common tokenization and ranking with BM25, search should support indexing data like file paths, class names, attribute names, etc. with a fuzzy finder approach, similar to what IDEs like VS Code do when you're using autocomplete.

    - #4466 

- [ ] __Allow to use search as a component in Markdown__ – allow the user to embed search bars at arbitrary locations, possibly re-configured.

    - #6858

- [ ] __Rich results__ – not only code blocks should be renderable, but also Mermaid diagrams and code annotations.

- [ ] __Recommend term removal__ – if a search matches no results, recommend to the user which term can be removed. 

- [ ] __Search history__ – we could preserve the search history, which would give users an easy way to go back to previous search results without having to re-enter their queries. Entries in the search history could be cleared by the user one by one.

- [ ] __Synonyms__ – authors should be allowed to provide synonyms for specific words. We need to think of a good way to signal to the user that a synonym was found, or we just replace the word with the synonym in search results.

- [ ] __Index non-Markdown sources__ – it should be possible to index other content alongside Markdown, including HTML, PDFs, etc., possibly with the help of plugins.

- [ ] __Arbitrary sections__ – authors should be allowed to add custom sections like Admonitions or tabs (or whatever) to the search, in order to provide an even more flexible structure.

- [ ] __Search separator testing__ – we should provide a method for authors to easily test the search separator on their site.
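
As a rough illustration of the deep linking idea, the search state could round-trip through a URL query string using the standard `URLSearchParams` API. The `SearchState` shape and the `q` and `filter.<name>` parameter names below are assumptions made up for this sketch, not a decided format.

```typescript
// Illustrative sketch only: the state shape and parameter names are assumptions.
interface SearchState {
  query: string;
  filters: Record<string, string[]>; // e.g. { tags: ["setup", "plugins"] }
}

// Serialize the query and every selected filter value into a query string.
function toQueryString(state: SearchState): string {
  const params = new URLSearchParams();
  params.set("q", state.query);
  for (const name of Object.keys(state.filters).sort()) {
    for (const value of state.filters[name]) {
      params.append(`filter.${name}`, value);
    }
  }
  return params.toString();
}

// Restore the state; unknown parameters are ignored.
function fromQueryString(queryString: string): SearchState {
  const params = new URLSearchParams(queryString);
  const state: SearchState = { query: params.get("q") ?? "", filters: {} };
  params.forEach((value, key) => {
    if (key.startsWith("filter.")) {
      const name = key.slice("filter.".length);
      if (!state.filters[name]) state.filters[name] = [];
      state.filters[name].push(value);
    }
  });
  return state;
}
```

Sorting the filter names keeps the generated link stable regardless of the order in which filters were selected, so the same search always yields the same URL.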
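
The fuzzy finder idea can be sketched as in-order subsequence matching with a few bonuses. This is a deliberately naive, greedy version with arbitrary weights; it is not the algorithm VS Code uses, and `fuzzyScore` and `fuzzyFind` are hypothetical names.

```typescript
// Illustrative sketch only: the scoring weights are arbitrary assumptions.

// Returns a score if all characters of `pattern` occur in `candidate` in
// order (case-insensitive), or null otherwise. Consecutive matches and
// matches at the start of a path or name segment score higher.
function fuzzyScore(pattern: string, candidate: string): number | null {
  const p = pattern.toLowerCase();
  const c = candidate.toLowerCase();
  let score = 0;
  let pi = 0;
  let previousMatch = -2;
  for (let ci = 0; ci < c.length && pi < p.length; ci++) {
    if (c[ci] !== p[pi]) continue;
    score += 1;
    if (ci === previousMatch + 1) score += 2;                   // consecutive run
    if (ci === 0 || "/._-".indexOf(c[ci - 1]) >= 0) score += 3; // segment start
    previousMatch = ci;
    pi++;
  }
  return pi === p.length ? score : null;
}

// Keep only matching candidates, best score first, ties broken by name.
function fuzzyFind(pattern: string, candidates: string[]): string[] {
  const scored: { candidate: string; score: number }[] = [];
  for (const candidate of candidates) {
    const score = fuzzyScore(pattern, candidate);
    if (score !== null) scored.push({ candidate, score });
  }
  scored.sort((x, y) => y.score - x.score || x.candidate.localeCompare(y.candidate));
  return scored.map(entry => entry.candidate);
}
```

For example, given the paths `docs/setup/search.md`, `src/search/index.ts` and `docs/changelog.md`, the pattern `sidx` matches only `src/search/index.ts`, because it is the only path that contains those four characters in order.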

---

This list is far from complete. We have so many more ideas, which we'll share when the time has come. We'll keep this issue updated, so feel free to subscribe or check back from time to time. We hope to push out the first candidate before the end of this year! Thank you for your patience and for your trust in Material for MkDocs.
