A modular Node.js + Express API that implements semantic caching for efficient LLM workflows.
This project provides:
- A prompt module to handle requests to LLM providers (e.g., OpenAI, DeepSeek, Gemini)
- A semantic cache module to reduce token costs and latency by caching LLM responses based on semantic similarity rather than exact string match
Basic use case: before calling an expensive LLM API, check whether a semantically similar question has already been answered. If so, instantly return the cached response from Redis.
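In code, that flow looks roughly like the sketch below. The `semanticCache` and `promptService` objects are hypothetical stand-ins for the real services under `src/modules/`:

```ts
// Hypothetical stand-ins for the real services under src/modules/.
declare const semanticCache: {
  lookup(question: string): Promise<string | null>;
  store(question: string, response: string): Promise<void>;
};
declare const promptService: {
  send(question: string): Promise<string>;
};

async function answer(question: string): Promise<string> {
  // 1. Vector search in Redis for a semantically similar, already-answered question.
  const cached = await semanticCache.lookup(question);
  if (cached !== null) return cached; // cache hit: no tokens spent

  // 2. Cache miss: pay for the LLM call, then cache the answer for next time.
  const response = await promptService.send(question);
  await semanticCache.store(question, response);
  return response;
}
```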
```
src/
│   server.ts
│
├───modules
│   ├───prompt
│   │   ├───controllers
│   │   │       prompt.controller.ts
│   │   ├───dtos
│   │   │       prompt.dto.ts
│   │   ├───interfaces
│   │   │   ├───abstract
│   │   │   │       base-prompt-service.ts
│   │   │   ├───prompt-response.interface.ts
│   │   │   ├───prompt.service.interface.ts
│   │   │   └───request-body.interface.ts
│   │   ├───routes
│   │   │       prompt.routes.ts
│   │   └───services
│   │           deepseek-prompt.service.ts
│   │
│   └───semantic_cache
│       ├───config
│       │       cache.config.ts
│       ├───interfaces
│       │       cache.config.interface.ts
│       │       document.interface.ts
│       │       embedding.repository.interface.ts
│       │       ft-raw-search-result.interface.ts
│       │       vector-search-result.interface.ts
│       ├───providers
│       │       gemini.provider.ts
│       │       openai.provider.ts
│       ├───services
│       │   ├───core
│       │   │       embedding.service.ts
│       │   │       semantic-cache.service.ts
│       │   ├───search
│       │   │       cached-answer.service.ts
│       │   │       parse-vector-search.service.ts
│       │   │       vector-search.service.ts
│       │   └───storage
│       │           document-storage.service.ts
│       │           index-management.service.ts
│       └───utils
│               buffer-to-float32.ts
│               cosine-similarity.ts
│               float32-to-buffer.ts
│
└───shared
    ├───exceptions
    ├───middlewares
    ├───services
    └───utils
```
Instead of caching by exact string match, semantic caching uses vector embeddings to find semantically similar queries. When a user sends a question:
- We convert each user question into an embedding, a sort of mathematical representation of meaning (basically a very big float vector);
- We store the embedding + original question + LLM response in a Redis vector database;
- When a new query comes in, we perform a vector search in Redis to retrieve the 5 most semantically similar questions;
- We then calculate how close each of these questions is to the current one using cosine similarity, which yields a score between -1 and 1 (for text embeddings, typically between 0 and 1);
- If the similarity exceeds our threshold (0.82), we return the cached response, skipping the LLM call entirely.
This way we reduce token usage and improve user experience by returning answers way faster!
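A hedged sketch of that lookup path, assuming the node-redis client, the official `openai` SDK, and a RediSearch index named `idx:cache` built with the COSINE distance metric (the index name and model are illustrative; the real logic lives in the services under `semantic_cache/`). For a COSINE index, RediSearch reports `dist = 1 - cosine similarity`, so the sketch recovers the score straight from the KNN result; a standalone `cosineSimilarity` helper mirroring `utils/cosine-similarity.ts` is included for reference:

```ts
import OpenAI from "openai";
import { createClient } from "redis";

const SIMILARITY_THRESHOLD = 0.82;

// One way to embed the question; the project supports OpenAI and Gemini
// embeddings, and the model choice here is illustrative.
async function embed(question: string): Promise<number[]> {
  const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  return res.data[0].embedding;
}

// RediSearch expects vectors as raw little-endian float32 bytes
// (cf. utils/float32-to-buffer.ts).
function float32ToBuffer(vector: number[]): Buffer {
  return Buffer.from(new Float32Array(vector).buffer);
}

// Plain cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]
// (cf. utils/cosine-similarity.ts).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function lookupCachedAnswer(question: string): Promise<string | null> {
  const queryEmbedding = await embed(question);

  const client = createClient();
  await client.connect();

  // KNN query: fetch the 5 nearest stored questions by vector distance.
  const results = await client.ft.search(
    "idx:cache", // assumed index name
    "*=>[KNN 5 @embedding $BLOB AS dist]",
    {
      PARAMS: { BLOB: float32ToBuffer(queryEmbedding) },
      RETURN: ["question", "response", "dist"],
      DIALECT: 2,
    }
  );
  await client.quit();

  for (const doc of results.documents) {
    const similarity = 1 - Number(doc.value.dist);
    // Only skip the LLM when the match clears the 0.82 threshold.
    if (similarity >= SIMILARITY_THRESHOLD) {
      return doc.value.response as string;
    }
  }
  return null; // cache miss: the caller falls through to the LLM provider
}
```

The threshold is a trade-off: raise it and you get fewer false cache hits but more LLM calls; lower it and you save more tokens at the risk of returning a cached answer to a subtly different question.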
Provides a consistent entry point for sending queries to different LLM APIs. It handles:
- Input validation with class-validator
- Request formatting
- Response normalization
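As a rough illustration of the validation step (the actual DTO lives in `prompt.dto.ts`; the field name here is hypothetical), class-transformer turns the raw body into a typed DTO and class-validator checks it:

```ts
import "reflect-metadata";
import { plainToInstance } from "class-transformer";
import { IsNotEmpty, IsString, validate } from "class-validator";

// Illustrative DTO: the real fields are defined in
// src/modules/prompt/dtos/prompt.dto.ts.
class PromptDto {
  @IsString()
  @IsNotEmpty()
  prompt!: string;
}

// Transform the raw request body into a typed DTO, then validate it
// before the controller ever touches the prompt.
async function validateBody(body: unknown): Promise<PromptDto> {
  const dto = plainToInstance(PromptDto, body);
  const errors = await validate(dto);
  if (errors.length > 0) {
    throw new Error(`Invalid request body: ${errors.join(", ")}`);
  }
  return dto;
}
```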
- TypeScript
- Node.js + Express
- Redis with RediSearch
- OpenAI / Gemini APIs
- class-validator, class-transformer
- Axios
```bash
git clone https://github.com/garzuze/ts_semantic_cache.git
cd ts_semantic_cache
pnpm install
docker compose up -d
```
The server will run on http://localhost:3000.
Note: You will need to set your environment variables with the API keys from the providers of your choice.
`.env.example` provides a template for your `.env` configuration. The system currently supports the OpenAI and Gemini embeddings APIs, plus DeepSeek and Gemini as prompt providers.
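For illustration only (the variable names below are hypothetical; `.env.example` is the authoritative reference), a config layer such as `cache.config.ts` would typically read them like this:

```ts
// Hypothetical variable names: check .env.example for the real ones.
export const config = {
  openaiApiKey: process.env.OPENAI_API_KEY,
  geminiApiKey: process.env.GEMINI_API_KEY,
  deepseekApiKey: process.env.DEEPSEEK_API_KEY,
  redisUrl: process.env.REDIS_URL ?? "redis://localhost:6379",
};
```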
- Create OpenAI prompt service
- Create Claude prompt service
- Add factory pattern
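The factory item would presumably centralize provider selection so controllers stay provider-agnostic. A hypothetical sketch (the classes are illustrative stubs, not the project's real exports):

```ts
interface PromptService {
  send(prompt: string): Promise<string>;
}

// Stub implementations; the real services live under
// src/modules/prompt/services/.
class DeepseekPromptService implements PromptService {
  async send(prompt: string): Promise<string> {
    return `deepseek answer to: ${prompt}`; // stub
  }
}

class GeminiPromptService implements PromptService {
  async send(prompt: string): Promise<string> {
    return `gemini answer to: ${prompt}`; // stub
  }
}

type ProviderName = "deepseek" | "gemini";

// The factory maps a provider name (e.g., from config or the request)
// to a concrete prompt service, so callers never construct one directly.
function createPromptService(provider: ProviderName): PromptService {
  switch (provider) {
    case "deepseek":
      return new DeepseekPromptService();
    case "gemini":
      return new GeminiPromptService();
  }
}
```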