brokenashish/cfstore


CFStore - Semantic File Storage

A modern file storage and semantic search system built with Cloudflare Workers, R2, and Durable Objects. This application provides file upload capabilities with AI-powered semantic search through WebSocket connections.

Features

  • File Upload to R2: Upload files directly to Cloudflare R2 object storage
  • Semantic Search: AI-powered semantic search using Cloudflare AI embeddings
  • Durable Objects: Persistent SQLite database for metadata and user information
  • WebSocket Support: Real-time search results through WebSocket connections
  • React UI: Beautiful, responsive React interface for file management
  • User Tracking: SQLite database storing user information and search history

Architecture

Components

  1. Cloudflare Worker (src/worker.ts)

    • Handles HTTP requests and routing
    • Manages file uploads to R2
    • Serves the React application
    • Proxies WebSocket connections to Durable Objects
  2. Durable Object (src/durable-object.ts)

    • Manages SQLite database with three tables:
      • files: File metadata and embeddings
      • users: User information and activity
      • search_history: Search queries and results
    • Handles WebSocket connections for real-time search
    • Performs semantic search using cosine similarity
    • Generates embeddings using Cloudflare AI
  3. React UI (embedded in worker.ts)

    • File upload interface with drag-and-drop
    • Semantic search with real-time results
    • File browser with download capabilities
    • WebSocket status indicator

API Endpoints

REST API

  • GET / - Serves the React application
  • POST /api/upload - Upload a file to R2
    • FormData: file, userId
  • GET /api/files - List all files in R2
  • GET /api/download/:filename - Download a file from R2
  • POST /api/search - Perform semantic search
    • Body: { query: string, userId?: string }
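The Worker has to map each incoming method and path onto one of the endpoints above. The following is a minimal sketch of that dispatch, not the actual code in src/worker.ts; the `Route` type and `matchRoute` function are hypothetical names introduced for illustration.

```typescript
// Hypothetical route table mirroring the endpoints listed in this README.
type Route =
  | { name: "app" }
  | { name: "upload" }
  | { name: "listFiles" }
  | { name: "download"; filename: string }
  | { name: "search" }
  | { name: "websocket" }
  | { name: "notFound" };

function matchRoute(method: string, pathname: string): Route {
  if (method === "GET" && pathname === "/") return { name: "app" };
  if (method === "POST" && pathname === "/api/upload") return { name: "upload" };
  if (method === "GET" && pathname === "/api/files") return { name: "listFiles" };
  if (method === "POST" && pathname === "/api/search") return { name: "search" };
  if (method === "GET" && pathname === "/api/ws") return { name: "websocket" };

  // GET /api/download/:filename
  const prefix = "/api/download/";
  if (method === "GET" && pathname.indexOf(prefix) === 0 && pathname.length > prefix.length) {
    // Clients are assumed to URL-encode the filename segment.
    return { name: "download", filename: decodeURIComponent(pathname.slice(prefix.length)) };
  }

  return { name: "notFound" };
}
```

Returning a tagged union keeps the routing decision separate from the handlers, so it can be tested without an R2 bucket or a Durable Object.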

WebSocket API

Connect to /api/ws for real-time search:

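// Connect (local dev server shown; use your deployed host in production)
const ws = new WebSocket('ws://localhost:8787/api/ws')

ws.addEventListener('open', () => {
  // Register
  ws.send(JSON.stringify({ type: 'register', userId: 'user123' }))

  // Search
  ws.send(JSON.stringify({ type: 'search', query: 'your query', userId: 'user123' }))
})

// Response: { type: 'search_results', results: [...] }
ws.addEventListener('message', (event) => {
  const message = JSON.parse(event.data)
  if (message.type === 'search_results') console.log(message.results)
})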
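
The message shapes in this example can be written down as types, which makes it easier to validate what arrives over the socket. This is a sketch based only on the three messages shown here; `ClientMessage`, `SearchResultsMessage`, `encodeClientMessage`, and `parseServerMessage` are hypothetical names, and the element type of `results` is left as `unknown` because this README does not specify it.

```typescript
// Messages the client sends, as shown in the README example.
type ClientMessage =
  | { type: "register"; userId: string }
  | { type: "search"; query: string; userId: string };

// The one server message documented in the README.
interface SearchResultsMessage {
  type: "search_results";
  results: unknown[];
}

function encodeClientMessage(message: ClientMessage): string {
  return JSON.stringify(message);
}

// Returns null for anything that is not a well-formed search_results message.
function parseServerMessage(raw: string): SearchResultsMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (_err) {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const candidate = value as { type?: unknown; results?: unknown };
  if (candidate.type === "search_results" && Array.isArray(candidate.results)) {
    return { type: "search_results", results: candidate.results };
  }
  return null;
}
```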

Durable Object Endpoints

  • POST /index - Index a file with embeddings
  • POST /search - Perform semantic search
  • GET /user?userId=xxx - Get user information
  • POST /user - Update user information
  • GET /stats - Get usage statistics

Database Schema

Files Table

CREATE TABLE files (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  filename TEXT NOT NULL,
  text TEXT,
  userId TEXT NOT NULL,
  fileType TEXT,
  fileSize INTEGER,
  embedding TEXT,
  createdAt TEXT DEFAULT CURRENT_TIMESTAMP
)

Users Table

CREATE TABLE users (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  userId TEXT UNIQUE NOT NULL,
  username TEXT,
  email TEXT,
  createdAt TEXT DEFAULT CURRENT_TIMESTAMP,
  lastActive TEXT DEFAULT CURRENT_TIMESTAMP
)

Search History Table

CREATE TABLE search_history (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  userId TEXT NOT NULL,
  query TEXT NOT NULL,
  resultsCount INTEGER,
  searchedAt TEXT DEFAULT CURRENT_TIMESTAMP
)
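
When reading these tables from TypeScript, it can help to mirror the columns as row types. The interfaces below are a sketch derived from the CREATE TABLE statements above; the names `FileRow`, `UserRow`, `SearchHistoryRow`, and `parseEmbedding` are hypothetical, and columns without NOT NULL are typed as nullable.

```typescript
interface FileRow {
  id: number;
  filename: string;
  text: string | null;
  userId: string;
  fileType: string | null;
  fileSize: number | null;
  embedding: string | null; // JSON-encoded array of numbers
  createdAt: string;
}

interface UserRow {
  id: number;
  userId: string;
  username: string | null;
  email: string | null;
  createdAt: string;
  lastActive: string;
}

interface SearchHistoryRow {
  id: number;
  userId: string;
  query: string;
  resultsCount: number | null;
  searchedAt: string;
}

// Decodes the TEXT embedding column; returns null if it is missing or malformed.
function parseEmbedding(row: FileRow): number[] | null {
  if (row.embedding === null) return null;
  let value: unknown;
  try {
    value = JSON.parse(row.embedding);
  } catch (_err) {
    return null;
  }
  if (!Array.isArray(value)) return null;
  for (const item of value) {
    if (typeof item !== "number") return null;
  }
  return value as number[];
}
```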

Setup

Prerequisites

  • Node.js 18+
  • Cloudflare account
  • Wrangler CLI installed globally

Installation

  1. Clone the repository:

     git clone <repository-url>
     cd cfstore

  2. Install dependencies:

     npm install

  3. Configure your Cloudflare account in wrangler.toml:

    • Update the R2 bucket name if needed
    • Ensure AI binding is available in your account
  4. Create the R2 bucket:

     wrangler r2 bucket create cfstore-files

  5. Deploy the worker:

     npm run deploy

Development

Run the development server:

npm run dev

This starts a local development server at http://localhost:8787.

Configuration

wrangler.toml

name = "cfstore"
main = "src/worker.ts"
compatibility_date = "2024-11-01"
node_compat = true

[durable_objects]
bindings = [
  { name = "SEMANTIC_STORE", class_name = "SemanticStore" }
]

[[migrations]]
tag = "v1"
# SQLite-backed Durable Objects are declared with new_sqlite_classes
new_sqlite_classes = ["SemanticStore"]

[[r2_buckets]]
binding = "FILE_BUCKET"
bucket_name = "cfstore-files"

[ai]
binding = "AI"

How It Works

File Upload Flow

  1. User uploads a file through the React UI
  2. Worker receives the file and uploads it to R2
  3. Worker extracts text content from the file
  4. Worker sends metadata to Durable Object
  5. Durable Object generates embeddings using Cloudflare AI
  6. Durable Object stores metadata and embeddings in SQLite
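
The Worker side of this flow (steps 2 to 4) can be sketched as follows. This is an illustration under stated assumptions rather than the code in src/worker.ts: `UploadLike`, `FileStore`, `Indexer`, `IndexRequest`, and `handleUpload` are hypothetical names standing in for the uploaded file, the R2 bucket, and the Durable Object, and treating only `text/*` MIME types as extractable is an assumption based on the Limitations section.

```typescript
// Minimal subset of the standard File interface that the flow needs.
interface UploadLike {
  name: string;
  type: string;
  size: number;
  text(): Promise<string>;
}

// Stand-in for the R2 bucket binding.
interface FileStore {
  put(key: string, file: UploadLike): Promise<void>;
}

// Metadata sent to the Durable Object for indexing.
interface IndexRequest {
  filename: string;
  text: string;
  userId: string;
  fileType: string;
  fileSize: number;
}

// Stand-in for the Durable Object, which embeds and stores the metadata.
interface Indexer {
  index(request: IndexRequest): Promise<void>;
}

async function handleUpload(
  file: UploadLike,
  userId: string,
  store: FileStore,
  indexer: Indexer
): Promise<IndexRequest> {
  // Step 2: store the raw file.
  await store.put(file.name, file);

  // Step 3: extract text (assumed here to apply to text/* files only).
  const isText = file.type.indexOf("text/") === 0;
  const text = isText ? await file.text() : "";

  // Step 4: hand the metadata to the Durable Object; steps 5 and 6 happen there.
  const request: IndexRequest = {
    filename: file.name,
    text,
    userId,
    fileType: file.type,
    fileSize: file.size,
  };
  await indexer.index(request);
  return request;
}
```

Passing the store and indexer as parameters lets the flow run against in-memory fakes in tests, with the real bindings supplied in production.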

Semantic Search Flow

  1. User enters a search query
  2. Query is sent via WebSocket or REST API
  3. Durable Object generates an embedding for the query
  4. Durable Object compares the query embedding with each stored embedding using cosine similarity
  5. Durable Object returns the 10 most similar files
  6. Results are sent back to the client
  7. The search is logged in the search_history table
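
The ranking step (cosine similarity followed by a top-10 cut) can be sketched like this. It assumes the stored embeddings have already been decoded into number arrays; `IndexedFile`, `SearchHit`, `cosineSimilarity`, and `topMatches` are hypothetical names, and the project's own implementation may differ in detail.

```typescript
interface IndexedFile {
  filename: string;
  embedding: number[];
}

interface SearchHit {
  filename: string;
  score: number;
}

// cos(a, b) = (a . b) / (|a| * |b|); returns 0 when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every file against the query and keeps the `limit` best, highest first.
function topMatches(query: number[], files: IndexedFile[], limit = 10): SearchHit[] {
  return files
    .map((file) => ({
      filename: file.filename,
      score: cosineSimilarity(query, file.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

This is a linear scan over all stored embeddings, which is simple and adequate for a small collection held by a single Durable Object.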

Embedding Generation

Uses Cloudflare AI's @cf/baai/bge-base-en-v1.5 model to generate embeddings:

  • Text is limited to 5000 characters for embedding
  • Embeddings are stored as JSON strings in SQLite
  • Cosine similarity is used to compare embeddings
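
The preparation and storage steps around the model call can be sketched as below. Two things here are assumptions: that the 5000-character limit is applied by keeping the first 5000 characters, and that the model response carries one vector per input string in a `data` array (which matches the Workers AI text-embedding response format as far as I know). `MAX_EMBEDDING_CHARS`, `EmbeddingResponse`, `truncateForEmbedding`, and `toStoredEmbedding` are hypothetical names.

```typescript
const MAX_EMBEDDING_CHARS = 5000;

// Assumption: text beyond the limit is dropped from the end.
function truncateForEmbedding(text: string): string {
  return text.length > MAX_EMBEDDING_CHARS ? text.slice(0, MAX_EMBEDDING_CHARS) : text;
}

// Assumed response shape: one embedding vector per input string.
interface EmbeddingResponse {
  shape: number[];
  data: number[][];
}

// Serializes the first vector as a JSON string for the `embedding` TEXT column.
function toStoredEmbedding(response: EmbeddingResponse): string {
  const vector = response.data[0];
  if (vector === undefined || vector.length === 0) {
    throw new Error("Embedding response contained no vector");
  }
  return JSON.stringify(vector);
}
```

In the Durable Object, the response would come from a call along the lines of `env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [truncateForEmbedding(text)] })`, with `env.AI` being the binding declared in wrangler.toml.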

Technologies Used

  • Cloudflare Workers: Serverless edge computing
  • Cloudflare R2: S3-compatible object storage
  • Durable Objects: Stateful serverless objects with SQLite
  • Cloudflare AI: AI inference at the edge
  • React: UI framework (loaded via CDN)
  • TypeScript: Type-safe development

Features in Detail

Semantic Search

The semantic search uses vector embeddings to find similar documents based on meaning rather than exact text matches. This allows for:

  • Finding documents with similar concepts
  • Language understanding beyond keyword matching
  • Ranking results by semantic similarity

SQLite Database

The Durable Object uses SQLite for persistent storage:

  • ACID compliant transactions
  • Efficient indexing for fast queries
  • Relational data modeling
  • Full SQL query support

WebSocket Support

Real-time bidirectional communication:

  • Instant search results
  • Connection status updates
  • Automatic reconnection on disconnect
  • Efficient for repeated queries
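
One common way to schedule automatic reconnection is exponential backoff with a cap. This README does not say which strategy the UI uses, so the following is only a sketch of that general technique; `reconnectDelayMs` and its default values are hypothetical.

```typescript
// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped at maxMs.
function reconnectDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  const safeAttempt = Math.max(0, Math.floor(attempt));
  return Math.min(maxMs, baseMs * Math.pow(2, safeAttempt));
}
```

A client would call this from its `close` handler, wait for the returned delay, open a new socket, and reset the attempt counter once the connection succeeds.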

Limitations

  • Text extraction is basic (only for text files)
  • Embedding generation limited to 5000 characters
  • A single Durable Object instance holds all data (sharding would be needed to scale)
  • R2 bucket names must be unique within your Cloudflare account
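
If sharding were added, one possible approach is to derive a shard name from the user ID and pass it to the Durable Object namespace's `idFromName`. The sketch below is hypothetical and not part of the project: `fnv1a`, `shardNameFor`, and the `shard-N` naming are introduced for illustration, and partitioning by user would limit each search to that user's shard.

```typescript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Maps a user to one of `shardCount` shard names, always the same one for the same user.
function shardNameFor(userId: string, shardCount: number): string {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new Error("shardCount must be a positive integer");
  }
  return "shard-" + (fnv1a(userId) % shardCount);
}
```

With the binding from wrangler.toml, the Worker could then resolve the object with `env.SEMANTIC_STORE.idFromName(shardNameFor(userId, 4))`, where the shard count of 4 is an arbitrary example.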

Future Enhancements

  • Support for PDF and document parsing
  • Batch file upload
  • Advanced filtering and sorting
  • User authentication
  • File sharing and permissions
  • Analytics dashboard
  • Full-text search fallback
  • Multiple R2 bucket support

License

MIT

Support

For issues and questions, please open an issue on GitHub.
