NicolasPllr1/kwsearch

Keyword Search Engine

A keyword search engine written in Zig and compiled to WASM.

Boolean Search

Keywords are implicitly ANDed together to form boolean search queries. [1]

An AND query means that a document matches if and only if it contains all the search keywords.
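
As a rough illustration of AND semantics (not the actual Zig implementation), the sketch below builds a tiny in-memory inverted index and intersects posting lists. The names `buildIndex`, `intersect`, and `searchAnd`, the tokenization rule, and the `Map`-based index shape are all assumptions made for this example; the real index is a serialized binary file.

```javascript
// Hypothetical in-memory inverted index: term -> sorted array of document IDs.
// The Map-based shape and the tokenizer are assumptions for illustration only.
function tokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

function buildIndex(docs) {
  const index = new Map();
  docs.forEach((text, id) => {
    // Deduplicate terms so each document appears once per posting list.
    for (const term of new Set(tokenize(text))) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(id); // IDs arrive in increasing order, so lists stay sorted
    }
  });
  return index;
}

// Intersect two sorted posting lists with a two-pointer merge.
function intersect(a, b) {
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i]);
      i++;
      j++;
    } else if (a[i] < b[j]) {
      i++;
    } else {
      j++;
    }
  }
  return out;
}

// A document matches only if it appears in the posting list of every keyword.
function searchAnd(index, query) {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  const lists = terms.map((term) => index.get(term) ?? []);
  // Starting from the shortest list keeps intermediate results small.
  lists.sort((x, y) => x.length - y.length);
  return lists.reduce((acc, list) => intersect(acc, list));
}
```

A keyword that appears in no document yields an empty posting list, so the whole query returns no results, which is exactly the "if and only if it contains all the search keywords" rule.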

Currently, search results are not ranked. However, the index already stores term frequencies and position data, which I included to support TF-IDF or BM25-style ranking in the future.
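
To show how stored frequencies could feed a ranking function, here is a minimal sketch of the standard BM25 formula. Ranking is not implemented in the engine; the function name `bm25`, the `doc` and `stats` shapes, and the default parameters `k1 = 1.2` and `b = 0.75` are assumptions chosen for this example.

```javascript
// Hypothetical BM25 scorer. Assumed shapes:
//   doc   = { length: number, termFreq: Map<string, number> }
//   stats = { numDocs: number, avgLength: number, docFreq: Map<string, number> }
function bm25(queryTerms, doc, stats, k1 = 1.2, b = 0.75) {
  let score = 0;
  for (const term of queryTerms) {
    const tf = doc.termFreq.get(term) ?? 0;
    if (tf === 0) continue; // a term absent from the document contributes nothing
    const n = stats.docFreq.get(term) ?? 0;
    // Inverse document frequency: rarer terms weigh more.
    const idf = Math.log(1 + (stats.numDocs - n + 0.5) / (n + 0.5));
    // Length normalization: longer-than-average documents are penalized.
    const norm = 1 - b + b * (doc.length / stats.avgLength);
    score += (idf * tf * (k1 + 1)) / (tf + k1 * norm);
  }
  return score;
}
```

With AND queries, every matching document contains every keyword, so a scorer like this would only reorder the matches rather than change which documents are returned.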

Markdown files

Documents are assumed to be Markdown files.

Instead of indexing files as a whole, they are parsed into an AST at the heading level. Each node in the tree essentially consists of a title (the heading text) and some content (the text until the next heading).

For instance, given this document:

# Hello world

File over app.

## Foo

## Bar

Chocolate bar

  • Searching for "hello file" matches the level-1 node (with heading "Hello world" and content "File over app.").
  • Searching for "chocolate" matches one level-2 node (with heading "Bar" and content "Chocolate bar").

I find this way more useful than just matching entire documents, which could be very long.
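
The heading-level splitting can be sketched as follows. This is a simplified illustration, not the parser used by the indexer: the function name `splitByHeadings` and the node shape `{ level, title, content }` are assumptions, and the sketch returns a flat list rather than a tree.

```javascript
// Hypothetical splitter: one node per ATX heading ("#" to "######").
// Simplifications: text before the first heading is dropped, and "#" lines
// inside fenced code blocks would wrongly be treated as headings.
function splitByHeadings(markdown) {
  const nodes = [];
  let current = null;
  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      current = { level: match[1].length, title: match[2].trim(), lines: [] };
      nodes.push(current);
    } else if (current) {
      // Everything up to the next heading belongs to the current node.
      current.lines.push(line);
    }
  }
  return nodes.map((node) => ({
    level: node.level,
    title: node.title,
    content: node.lines.join("\n").trim(),
  }));
}
```

Applied to the example document above, this yields three nodes: "Hello world" (level 1), "Foo" (level 2, empty content), and "Bar" (level 2, content "Chocolate bar"). Each node can then be indexed as its own searchable unit.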

Pipeline

The system consists of a CLI indexer and a WASM search API.

1. Build and Index

Run the build script with the directory containing the Markdown files you want to index:

./build.sh <data-dir>

It will:

  1. Build a search index for the documents in <data-dir> (search-index.bin).
  2. Generate a metadata file [2] (docs-mapping.json).
  3. Compile the search logic to WASM [3].

You can then search the indexed documents from JavaScript running in the browser, using the search API exposed by the WASM module. Alongside the WASM binary, you will need the index and the document mapping JSON.

2. Use in the browser

To use the engine in a browser, load the WASM binary and the generated index. You can then query the index via the exposed API.

The provided SearchEngine wrapper handles the WASM memory management and querying.

const searchEngine = new SearchEngine();

// Initialization from the build assets
await searchEngine.initialize(
  './search.wasm', // search logic and exposed API
  './search-index.bin', // serialized index data structure
  './docs-mapping.json' // serialized metadata
);

const results = searchEngine.search("zig wasm performance");

// Results (IDs) are enriched with metadata (titles and links):
// [{ docId: 10, title: "Optimizing Zig", link: "/posts/opt.html" }, ...]
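
Once results come back, they still need to be displayed. The sketch below turns a result array into an HTML list, assuming the result shape shown in the comment above (`docId`, `title`, `link`). The helpers `escapeHtml` and `renderResults` are hypothetical and not part of the provided wrapper.

```javascript
// Escape text before placing it in HTML, since titles come from indexed documents.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical renderer: builds a linked list of result titles.
function renderResults(results) {
  if (results.length === 0) return "<p>No results.</p>";
  const items = results.map(
    (r) => `<li><a href="${escapeHtml(r.link)}">${escapeHtml(r.title)}</a></li>`
  );
  return `<ul>${items.join("")}</ul>`;
}
```

In a page, the returned string could be assigned to the `innerHTML` of a results container each time the search input changes.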

Why?

I wanted to learn more about "old-school" search engines and their internals. And I also needed one for my personal website!

The choice of Zig was a happy coincidence. I attended the TigerBeetle world tour in Paris (Dec. 2025), where we were encouraged to present a project. I felt I had no choice but to write this one in Zig for the occasion ;)!

You can find the slides from my presentation in this repo. They provide a high-level overview of the architecture and how an inverted index is built and used to power the search engine.

Good resources

Footnotes

  1. Support for OR and other operators could come in the future.

  2. Also a document mapping JSON. This could be removed, but it helps when debugging the index.

  3. See wasm-api.zig for the API.
