Thunk is an AI-powered research assistant that autonomously performs multi-step web research and synthesizes the findings into cited reports using Google's Vertex AI and Gemini models.
Thunk replicates advanced research capabilities similar to Google's Gemini Deep Research, providing:
- Autonomous Research Planning - Breaks down complex queries into structured research steps
- Multiple Search Providers - Support for web search (SerpAPI) and academic papers (arXiv)
- Intelligent Content Filtering - AI-powered quality filtering and relevance evaluation
- Content Analysis - Fetches and processes web pages, PDFs, and academic papers
- Vertex AI RAG Integration - Stores and manages research findings using Google's RAG engine
- Interactive CLI - User-friendly command-line interface with clarification support
- Report Synthesis - Generates comprehensive reports with proper citations
graph TD
A[User Query] --> B[Clarification Questions]
B --> C[Research Planning]
C --> D[Generate Search Queries]
D --> E[Parallel Web Search]
E --> F[Source Evaluation & Filtering]
F --> G[Content Fetching]
G --> H[Content Summarization]
H --> I[Document Storage]
I --> J{Research Complete?}
J -->|No| K[Focused Research]
K --> D
J -->|Yes| L[Vertex AI RAG Synthesis]
L --> M[Final Report]
N[Existing Corpus] --> L
subgraph "Storage Layer"
O[Vertex AI RAG Corpus]
P[Local Backup]
end
I --> O
I --> P
N --> O
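The loop in the diagram (search, store, check completeness, then run focused research if needed) can be sketched roughly as follows. This is a minimal illustration, not Thunk's actual implementation: `ResearchState`, `run_research_loop`, the three callables, and the `max_rounds` cap are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ResearchState:
    """Documents gathered so far and the number of search rounds run (hypothetical type)."""
    documents: List[str] = field(default_factory=list)
    rounds: int = 0


def run_research_loop(
    initial_queries: List[str],
    search_and_store: Callable[[List[str]], List[str]],
    is_complete: Callable[[ResearchState], bool],
    focus_queries: Callable[[ResearchState], List[str]],
    max_rounds: int = 3,  # assumed safety cap, not taken from the source
) -> ResearchState:
    """Repeat search -> store -> completeness check until done or capped."""
    state = ResearchState()
    queries = initial_queries
    while queries and state.rounds < max_rounds:
        # "Parallel Web Search" through "Document Storage" collapsed into one call
        state.documents.extend(search_and_store(queries))
        state.rounds += 1
        # "Research Complete?" decision node
        if is_complete(state):
            break
        # "Focused Research" feeds new queries back into the loop
        queries = focus_queries(state)
    return state
```

Passing the phases in as callables keeps the control flow visible without assuming anything about how each phase is implemented.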
The research flow follows these phases:
- Query Analysis - Analyzes complexity and generates clarifying questions
- Research Planning - Creates structured multi-step research plan
- Parallel Execution - Concurrent search queries with rate limiting
- Quality Filtering - AI-powered source relevance evaluation
- Content Processing - Fetch, summarize, and store documents
- Completeness Assessment - Determines if additional focused research is needed
- Synthesis - Uses Vertex AI RAG for enhanced report generation
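The parallel execution phase can be illustrated with a small `asyncio` sketch. The source does not say which mechanism Thunk uses for rate limiting; the version below simply caps the number of in-flight searches with a semaphore, which is one common way to stay under a provider's limits. `search_all`, `fake_search`, and `max_concurrent` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def search_all(
    queries: List[str],
    search: Callable[[str], Awaitable[List[str]]],
    max_concurrent: int = 3,  # assumed default, not taken from the source
) -> List[List[str]]:
    """Run one search per query with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(query: str) -> List[str]:
        # Each task waits here until a slot is free
        async with semaphore:
            return await search(query)

    # gather returns results in the same order as the input queries
    return await asyncio.gather(*(limited(q) for q in queries))


async def fake_search(query: str) -> List[str]:
    """Stand-in for a real provider call."""
    await asyncio.sleep(0.01)
    return [f"result for {query}"]


if __name__ == "__main__":
    print(asyncio.run(search_all(["a", "b", "c", "d"], fake_search, max_concurrent=2)))
```

A semaphore bounds concurrency rather than requests per second; a provider with a strict per-second quota would need an additional delay or token bucket.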
- Python 3.9+
- uv - Fast Python package manager (install uv)
- Google Cloud Project with Vertex AI API enabled
- API Keys for SerpAPI and Google Cloud authentication
# Clone the repository
git clone [email protected]:wbeard/thunk.git
cd thunk
# Install dependencies with uv
uv sync
# Set up Google Cloud authentication
gcloud auth application-default login

Create a .env file with your API keys:
# Required for web search (not needed for arXiv-only research)
SERPAPI_KEY=your_serpapi_key_here
# Always required
GOOGLE_CLOUD_PROJECT=your_gcp_project_id
# Optional (with defaults)
VERTEX_AI_LOCATION=us-central1
RAG_CORPUS_NAME=research_corpus_6
RAG_MODEL_NAME=gemini-2.5-flash-preview-05-20
MODEL_NAME=gemini-2.5-flash-preview-05-20

SerpAPI Key (for web search):
- Sign up at SerpAPI
- Get your API key from the dashboard
- Add to `.env` file as `SERPAPI_KEY`
- Note: Not required if using `--search-provider arxiv` exclusively
Google Cloud Setup:
- Create a Google Cloud Project
- Enable Vertex AI API in the console
- Set up authentication: `gcloud auth application-default login`
- Set project ID in `.env` as `GOOGLE_CLOUD_PROJECT`
# Check configuration
uv run thunk --check-config

The CLI provides multiple ways to interact with the research agent:
# Simple web research query
uv run thunk "Latest developments in quantum computing 2024"
# Academic paper research using arXiv
uv run thunk "quantum computing" --search-provider arxiv
# With specific corpus and output options
uv run thunk "AI safety research trends" --corpus my_research --output report.md --full
# Quiet mode (minimal output)
uv run thunk "research query" --quiet

# Start interactive session with web search
uv run thunk --interactive
# Start interactive session with arXiv search
uv run thunk --interactive --search-provider arxiv
# Available commands in interactive mode:
# <query> - Run research with clarification
# regenerate <query> - Regenerate from existing corpus
# regenerate-no-rag <query> - Regenerate without Vertex AI RAG
# corpus-info - Show corpus information
# quit - Exit interactive mode

# Regenerate report from existing corpus (uses Vertex AI RAG)
uv run thunk --regenerate "quantum computing trends"
# Regenerate without using Vertex AI RAG
uv run thunk --regenerate "AI developments" --no-rag
# Regenerate with specific corpus
uv run thunk --regenerate "research topic" --corpus specific_corpus

# Check configuration and API keys
uv run thunk --check-config
# Enable debug mode with verbose output
uv run thunk "research query" --debug
# Use specific RAG corpus
uv run thunk "query" --corpus my_research_corpus

| Option | Short | Description |
|---|---|---|
| `--interactive` | `-i` | Run in interactive mode |
| `--query` | `-q` | Research query (alternative to positional) |
| `--regenerate` | `-r` | Regenerate from existing corpus |
| `--search-provider` | `-s` | Search provider: web (default) or arxiv |
| `--output` | `-o` | Output file for report |
| `--corpus` | `-c` | Name of Vertex AI RAG corpus to use |
| `--no-save` | | Don't save report to file |
| `--full` | | Display full report (don't truncate) |
| `--quiet` | `-k` | Minimal output mode |
| `--debug` | `-d` | Enable debug mode with verbose output |
| `--no-rag` | | Don't use Vertex AI RAG for regeneration |
| `--check-config` | | Check configuration and exit |
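For readers who want to see how the options above fit together, here is a rough `argparse` sketch that mirrors the table. It is an illustration only, not Thunk's actual parser: the function name `build_parser` and the destination name `query_positional` are assumptions, and the real CLI may use a different library or different internals.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options listed in the table (hypothetical)."""
    parser = argparse.ArgumentParser(prog="thunk")
    # Optional positional query, e.g. thunk "AI trends 2024"
    parser.add_argument("query_positional", nargs="?", help="Research query")
    parser.add_argument("-i", "--interactive", action="store_true")
    parser.add_argument("-q", "--query", help="Research query (alternative to positional)")
    parser.add_argument("-r", "--regenerate", metavar="QUERY")
    parser.add_argument("-s", "--search-provider", choices=["web", "arxiv"], default="web")
    parser.add_argument("-o", "--output", help="Output file for report")
    parser.add_argument("-c", "--corpus", help="Name of Vertex AI RAG corpus to use")
    parser.add_argument("--no-save", action="store_true")
    parser.add_argument("--full", action="store_true")
    parser.add_argument("-k", "--quiet", action="store_true")
    parser.add_argument("-d", "--debug", action="store_true")
    parser.add_argument("--no-rag", action="store_true")
    parser.add_argument("--check-config", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["AI trends 2024", "--corpus", "ai_research", "--debug"])
    print(args.query_positional, args.corpus, args.debug, args.search_provider)
```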
# First-time setup
uv run thunk --check-config
# Interactive research session with web search
uv run thunk --interactive
# Academic research session with arXiv
uv run thunk --interactive --search-provider arxiv
# Quick web research with custom corpus
uv run thunk "AI trends 2024" --corpus ai_research --debug
# Academic paper research
uv run thunk "machine learning quantum computing" --search-provider arxiv
# Regenerate previous research
uv run thunk --regenerate "previous query" --corpus ai_research

The following environment variables can be configured:

| Variable | Required | Default | Description |
|---|---|---|---|
| `SERPAPI_KEY` | Only for web search | - | SerpAPI key for web search |
| `GOOGLE_CLOUD_PROJECT` | Yes | - | Google Cloud project ID |
| `VERTEX_AI_LOCATION` | No | `us-central1` | Vertex AI region |
| `RAG_CORPUS_NAME` | No | `research_corpus_6` | Name for RAG corpus |
| `RAG_MODEL_NAME` | No | `gemini-2.5-flash-preview-05-20` | Model for RAG synthesis |
| `MODEL_NAME` | No | `gemini-2.5-flash-preview-05-20` | Primary Gemini model |
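The required/optional split in the table can be expressed as a small loader. The sketch below is a standalone illustration and is not the project's `ResearchConfig`: `load_settings` and `DEFAULTS` are hypothetical names, while the default values and the error message wording are taken from this README.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the table above
DEFAULTS = {
    "VERTEX_AI_LOCATION": "us-central1",
    "RAG_CORPUS_NAME": "research_corpus_6",
    "RAG_MODEL_NAME": "gemini-2.5-flash-preview-05-20",
    "MODEL_NAME": "gemini-2.5-flash-preview-05-20",
}


def load_settings(
    search_provider: str = "web",
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Resolve settings from the environment, applying defaults and required checks."""
    env = os.environ if env is None else env
    # GOOGLE_CLOUD_PROJECT is always required; SERPAPI_KEY only for web search
    required = ["GOOGLE_CLOUD_PROJECT"]
    if search_provider == "web":
        required.append("SERPAPI_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing required configuration: " + ", ".join(missing))
    # Optional variables fall back to their defaults when unset or empty
    settings = {name: env.get(name) or default for name, default in DEFAULTS.items()}
    settings.update({name: env[name] for name in required})
    return settings
```

Accepting an explicit `env` mapping makes the function easy to exercise without touching the real process environment.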
The ResearchConfig class automatically loads and validates configuration:
from src.thunk.types import ResearchConfig

# Automatically loads from environment variables
config = ResearchConfig(corpus_display_name="my_corpus", search_provider="web")

# For arXiv-only research (no SERPAPI_KEY needed)
config_arxiv = ResearchConfig(corpus_display_name="my_corpus", search_provider="arxiv")

# Validates required configuration
try:
    config._validate_config()
    print("Configuration is valid")
except ValueError as e:
    print(f"Configuration error: {e}")

Common configuration issues:
- Missing API Keys:
  - For web search: Ensure `SERPAPI_KEY` and `GOOGLE_CLOUD_PROJECT` are set
  - For arXiv search: Only `GOOGLE_CLOUD_PROJECT` is required
- Authentication: Run `gcloud auth application-default login`
- Corpus Name: Provide via `--corpus` argument or `RAG_CORPUS_NAME` environment variable
# Debug configuration issues
uv run thunk --check-config
# Check configuration for arXiv search (no SerpAPI needed)
uv run thunk --check-config --search-provider arxiv
# Example error messages:
# "Missing required configuration: SERPAPI_KEY" (for web search)
# "Missing required configuration: GOOGLE_CLOUD_PROJECT"
# "Missing required configuration: --corpus argument or RAG_CORPUS_NAME environment variable"

Thunk supports multiple search providers through a pluggable architecture:
- Provider: SerpAPI integration
- Content: Web pages, news articles, blog posts
- Requirements: `SERPAPI_KEY` environment variable
- Usage: `--search-provider web` (default)

- Provider: arXiv.org academic repository
- Content: Research papers, preprints, academic publications
- Requirements: No additional API keys needed
- Usage: `--search-provider arxiv`
# General web research
uv run thunk "latest AI developments 2024" --search-provider web
# Academic research
uv run thunk "transformer architectures" --search-provider arxiv
# Mixed research (use different providers in separate sessions)
uv run thunk "quantum computing applications" --search-provider web --corpus quantum_web
uv run thunk "quantum computing theory" --search-provider arxiv --corpus quantum_papers

The system is designed to be extensible:
- Add new search providers - Implement the `SearchProvider` interface
- Add new content fetchers - Support different file types
- Implement custom RAG engines - Beyond Vertex AI
- Create specialized event subscribers - For different use cases
- Extend the CLI - With additional features
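As an illustration of the first extension point, a pluggable provider interface might look like the sketch below. The README names a `SearchProvider` interface but does not show its methods, so everything here is an assumption: the `search` signature, the `SearchResult` fields, the `PROVIDERS` registry, and `get_provider` are hypothetical and may not match the real code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Type


@dataclass
class SearchResult:
    """One search hit (hypothetical shape)."""
    title: str
    url: str
    snippet: str = ""


class SearchProvider(ABC):
    """Assumed interface: one method that maps a query to results."""

    @abstractmethod
    def search(self, query: str, max_results: int = 10) -> List[SearchResult]:
        ...


class StaticProvider(SearchProvider):
    """Toy provider returning a canned result, for illustration only."""

    def search(self, query: str, max_results: int = 10) -> List[SearchResult]:
        results = [SearchResult(title=f"About {query}", url="https://example.com/1")]
        return results[:max_results]


# Registry keyed by the value that would be passed to --search-provider
PROVIDERS: Dict[str, Type[SearchProvider]] = {"static": StaticProvider}


def get_provider(name: str) -> SearchProvider:
    """Instantiate a registered provider or fail with a clear error."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"Unknown search provider: {name}") from None
```

With a registry like this, adding a provider would mean writing one subclass and one dictionary entry, leaving the rest of the pipeline unchanged.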
[Add your license information here]
Thunk provides a powerful foundation for automated research tasks while maintaining flexibility for customization and extension.