Lumiwealth · grzesir · Nov 18, 2025 · Nov 19, 2025 · Nov 21, 2025 · Nov 21, 2025
@@ -41,9 +41,18 @@ jobs:
 
     env:
       AIOHTTP_NO_EXTENSIONS: 1
+      # CRITICAL: Set this to "none" so tests use their explicit data sources.
+      # Tests that want ThetaData must explicitly request it.
+      # Without this, the default is "ThetaData", which overrides ALL backtests.
+      BACKTESTING_DATA_SOURCE: none
       POLYGON_API_KEY: ${{ secrets.POLYGON_API_KEY }}
       THETADATA_USERNAME: ${{ secrets.THETADATA_USERNAME }}
       THETADATA_PASSWORD: ${{ secrets.THETADATA_PASSWORD }}
+      # NOTE (2025-11-28): Data Downloader is a production proxy for ThetaData that allows
+      # shared access without requiring a local ThetaTerminal JAR. When these are set,
+      # ThetaData tests will use the remote downloader instead of spawning a local process.
+      DATADOWNLOADER_BASE_URL: ${{ secrets.DATADOWNLOADER_BASE_URL }}
+      DATADOWNLOADER_API_KEY: ${{ secrets.DATADOWNLOADER_API_KEY }}
       ALPACA_TEST_API_KEY: ${{secrets.ALPACA_TEST_API_KEY}} # Required for alpaca unit tests
       ALPACA_TEST_API_SECRET: ${{secrets.ALPACA_TEST_API_SECRET}} # Required for alpaca unit tests
       TRADIER_TEST_ACCESS_TOKEN: ${{secrets.TRADIER_TEST_ACCESS_TOKEN}} # Required for tradier unit tests

@@ -0,0 +1,43 @@
+# LumiBot Agent Instructions (Theta / Downloader Focus)
+
+These rules are mandatory whenever you work on ThetaData integrations.
+
+1. **Never launch ThetaTerminal locally.** Production has the only licensed session. Starting the jar (even briefly or via Docker) instantly terminates the prod connection and halts all customers.
+2. **Use the shared downloader endpoint.** All tests/backtests must set `DATADOWNLOADER_BASE_URL=http://44.192.43.146:8080` (or whatever prod IP `/version` reports) and `DATADOWNLOADER_API_KEY=<secret>`. Do not take a shortcut by hitting Theta
+   directly.
+3. **Respect the queue/backoff contract.** LumiBot no longer enforces a 30 s client timeout; instead it listens for the downloader’s `{"error":"queue_full"}` responses and retries with exponential backoff. If you add new downloader
+   integrations, reuse that helper so we never DDoS the server.
+4. **Long commands = safe-timeout.** Wrap backtests/pytest/stress jobs with `/Users/robertgrzesik/bin/safe-timeout <duration> …` to ensure we never spawn orphaned processes.
+5. **Artifacts.** When demonstrating fixes, capture `Strategy\ Library/logs/*.log`, tear sheets, and downloader stress JSONs so the accuracy/dividend/resilience story stays reproducible.
+
+Failure to follow these rules will break everyone's workflows—double-check env vars before running anything.
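
As a rough illustration of the queue/backoff contract in rule 3, here is a minimal sketch of a retry loop that reacts to `{"error":"queue_full"}`. The function name `fetch_with_backoff`, its parameters, and the jitter amount are hypothetical; the real LumiBot helper is not shown in this document and should be reused instead of this sketch.

```python
import random
import time
from typing import Callable


def fetch_with_backoff(
    request: Callable[[], dict],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `request` until the downloader stops answering queue_full.

    Hypothetical helper: `request` is any callable that returns the decoded
    JSON payload of one downloader call.
    """
    for attempt in range(max_attempts):
        payload = request()
        if payload.get("error") != "queue_full":
            return payload
        if attempt == max_attempts - 1:
            break
        # Exponential backoff: base_delay * 2^attempt, capped at max_delay,
        # plus up to 10% random jitter so clients do not retry in lockstep.
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"downloader queue still full after {max_attempts} attempts")
```

Passing `sleep` as a parameter is only a design convenience: it lets a test substitute a recorder instead of actually waiting.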
+
+---
+
+## Test Philosophy (CRITICAL FOR ALL PROJECTS)
+
+### Test Age = Test Authority
+
+When tests fail, how you fix them depends on **how old the test is**:
+
+| Test Age | Authority Level | How to Fix |
+|----------|----------------|------------|
+| **>1 year old** | LEGACY - High authority | **Fix the CODE**, not the test. These tests have proven themselves over time. |
+| **6-12 months** | ESTABLISHED - Medium authority | Investigate carefully. Likely fix the code, but could be test issue. |
+| **1-6 months** | NEW - Lower authority | Test may need adjustment. Still verify code isn't broken. |
+| **<1 month** | EXPERIMENTAL | Test is still being refined. Adjust as needed. |
+
+### Check Test Age Before Fixing
+
+```bash
+git log --format="%ai" --follow -- tests/path/to/test.py | tail -1
+```
+
+### Conflict Resolution
+
+When old tests and new tests conflict:
+1. **Old test wins by default** - it has a proven track record
+2. If the new test represents genuinely new functionality, **ask the user for judgment**
+3. Document any judgment calls in the test file with comments
+
+This philosophy applies to ALL projects, not just LumiBot.
@@ -0,0 +1,291 @@
+# CLAUDE.md - AI Assistant Instructions for LumiBot
+
+## Quick Start
+
+**First, read these files:**
+1. `BACKTESTING_ARCHITECTURE.md` - Understand the backtesting data flow
+2. `AGENTS.md` - Critical rules for ThetaData (DO NOT SKIP)
+
+## Project Overview
+
+LumiBot is a trading and backtesting framework supporting multiple data sources (Yahoo, ThetaData, Polygon) and brokers (Alpaca, Interactive Brokers, Tradier, etc.).
+
+## Key Locations
+
+| What | Where |
+|------|-------|
+| LumiBot library | `/Users/robertgrzesik/Documents/Development/lumivest_bot_server/strategies/lumibot/` |
+| Strategy Library | `/Users/robertgrzesik/Documents/Development/Strategy Library/` |
+| Demo strategies | `/Users/robertgrzesik/Documents/Development/Strategy Library/Demos/` |
+| Environment config | `Demos/.env` for strategies, `lumibot/.env` for library |
+| Backtest logs | `/Users/robertgrzesik/Documents/Development/Strategy Library/logs/` |
+
+## Critical Rules
+
+### ThetaData Rules (MUST FOLLOW)
+
+1. **NEVER run ThetaTerminal locally** - It will kill production connections
+2. **Only use the Data Downloader** at `http://44.192.43.146:8080`
+3. **Always compare ThetaData vs Yahoo** - Yahoo is the gold standard for split-adjusted prices
+4. See `AGENTS.md` for complete rules
+
+### Data Source Selection
+
+The `BACKTESTING_DATA_SOURCE` env var **OVERRIDES** explicit code:
+```bash
+# In .env file
+BACKTESTING_DATA_SOURCE=thetadata  # Uses ThetaData regardless of code
+BACKTESTING_DATA_SOURCE=yahoo      # Uses Yahoo regardless of code
+BACKTESTING_DATA_SOURCE=none       # Uses whatever class the code specifies
+```
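
The override behaviour can be pictured with a small sketch. This is not LumiBot's actual resolution code: `resolve_data_source` is a hypothetical name, and the fallback to `thetadata` when the variable is unset is taken from the comment in the CI workflow rather than from the library source.

```python
import os
from typing import Mapping, Optional


def resolve_data_source(code_choice: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the data source a backtest would use (illustrative only)."""
    env = os.environ if env is None else env
    # Assumption: an unset variable behaves like "thetadata", per the CI comment.
    override = env.get("BACKTESTING_DATA_SOURCE", "thetadata").strip().lower()
    if override == "none":
        # "none" means: respect whatever class the strategy code specifies.
        return code_choice
    return override
```

Under these assumptions, only `none` lets the code's own choice through; any other value, including the implicit default, replaces it.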
+
+### Cache Management
+
+If you see wrong or stale data:
+1. Bump `LUMIBOT_CACHE_S3_VERSION` (e.g., v5 → v6)
+2. Clear local cache: `rm -rf ~/Library/Caches/lumibot/`
+
+## Common Tasks
+
+### Run a Backtest
+
+```bash
+cd "/Users/robertgrzesik/Documents/Development/Strategy Library/Demos"
+python3 "TQQQ 200-Day MA.py"
+```
+
+### Compare Yahoo vs ThetaData
+
+1. Edit `Demos/.env`:
+   - Set `BACKTESTING_DATA_SOURCE=yahoo`
+2. Run backtest, note results
+3. Edit `Demos/.env`:
+   - Set `BACKTESTING_DATA_SOURCE=thetadata`
+4. Run backtest, compare results
+5. Results should match within ~1-2%
+
+### Check Backtest Results
+
+```bash
+ls -la "/Users/robertgrzesik/Documents/Development/Strategy Library/logs/" | grep TQQQ | tail -10
+```
+
+Look at `*_tearsheet.csv` for CAGR and metrics.
+
+## Known Issues & Fixes
+
+### ✅ ThetaData Split Adjustment (FIXED - Nov 28, 2025)
+
+**Status:** FIXED - Split handling now correct
+
+**Root cause:** The `_apply_corporate_actions_to_frame()` function was being called 26+ times per backtest without any idempotency check, causing split adjustments to be applied multiple times.
+
+**Fix applied:**
+1. Added idempotency check at start of `_apply_corporate_actions_to_frame()` - checks for `_split_adjusted` column marker
+2. Added marker at end of function after successful adjustment
+3. Cache version bumped to v7
+
+### ✅ ThetaData Dividend Split Adjustment (FIXED - Nov 28, 2025)
+
+**Status:** FIXED - 17/21 dividends now match Yahoo within 5%
+
+**Root causes found:**
+1. `_update_cash_with_dividends()` was called 3 times per day without idempotency
+2. ThetaData dividend amounts were UNADJUSTED for splits
+3. ThetaData returned duplicate dividends for same ex_date (e.g., 2019-03-20 appeared 4x)
+4. ThetaData returned special distributions with `less_amount > 0` (e.g., 2015-07-02)
+
+**Fixes applied:**
+1. Added `_dividends_applied_tracker` in `_strategy.py` to prevent multiple applications
+2. Added split adjustment to `get_yesterday_dividends()` in `thetadata_backtesting_pandas.py`
+3. Added deduplication by ex_date in `_normalize_dividend_events()`
+4. Added filter for `less_amount > 0` to exclude special distributions
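
Fixes 3 and 4 can be sketched on plain dictionaries. The function below is a hypothetical stand-in, not the body of `_normalize_dividend_events()`; the field names `ex_date`, `amount`, and `less_amount` are assumed from the descriptions above, and keeping the first event per date is an assumption about the tie-break.

```python
from typing import List


def normalize_dividends(events: List[dict]) -> List[dict]:
    """Drop special distributions and keep one event per ex_date (illustrative)."""
    seen = set()
    cleaned = []
    for event in events:
        # Special distributions carry a positive less_amount; skip them.
        if event.get("less_amount", 0) > 0:
            continue
        # Keep only the first event seen for each ex_date.
        if event["ex_date"] in seen:
            continue
        seen.add(event["ex_date"])
        cleaned.append(event)
    return cleaned
```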
+
+**Verified split adjustment:**
+- ThetaData cumulative factor for 2014 dividends: 48x (2×3×2×2×2)
+- After adjustment: $0.01182 raw → $0.000246 adjusted ≈ Yahoo's $0.000250 ✓
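
The arithmetic behind this check is simple enough to reproduce. The sketch below uses only the numbers quoted above; the helper name `adjust_dividend` is hypothetical.

```python
from math import prod
from typing import Sequence


def adjust_dividend(raw_amount: float, split_ratios: Sequence[int]) -> float:
    """Divide a raw dividend by the cumulative split factor."""
    return raw_amount / prod(split_ratios)


factor = prod([2, 3, 2, 2, 2])                      # 48
adjusted = adjust_dividend(0.01182, [2, 3, 2, 2, 2])
yahoo = 0.000250
relative_gap = abs(adjusted - yahoo) / yahoo        # about 1.5%, inside the 5% tolerance
```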
+
+**Current results:** ~47% CAGR with ThetaData vs ~42% with Yahoo (gap due to phantom dividends)
+
+### ⚠️ ThetaData Phantom Dividend (KNOWN ISSUE - Reported to ThetaData)
+
+**Status:** KNOWN DATA QUALITY ISSUE - Reported to ThetaData support team
+
+| Date | ThetaData | Yahoo | Status |
+|------|-----------|-------|--------|
+| 2014-09-18 | $0.41 raw | None | ⚠️ PHANTOM - main cause of CAGR gap |
+| 2015-07-02 | $1.22 raw | None | ✅ FILTERED (less_amount=22.93) |
+| 2020-12-23 | $0.000283 | None | ⚠️ PHANTOM |
+| 2021-12-23 | $0.000119 | None | ⚠️ PHANTOM |
+
+**Root cause:** ThetaData phantom dividends are DATA ERRORS in the SIP feed, not Return of Capital (ROC) distributions. Confirmed via Perplexity research - these amounts don't appear in any other financial database (Yahoo, Bloomberg, SEC filings).
+
+**Workaround options:**
+1. Use `BACKTESTING_DATA_SOURCE=yahoo` for dividend-sensitive strategies
+2. Wait for ThetaData to fix the data quality issue
+3. Accept the ~5-percentage-point CAGR gap (~47% vs ~42%) as a known ThetaData limitation
+
+**Key files:**
+- `lumibot/tools/thetadata_helper.py` - `_apply_corporate_actions_to_frame()`, `_normalize_dividend_events()`
+- `lumibot/backtesting/thetadata_backtesting_pandas.py` - `get_yesterday_dividends()`
+- `lumibot/strategies/_strategy.py` - `_update_cash_with_dividends()`
+
+### ✅ ThetaData Zero-Price Data Filtering (FIXED - Nov 28, 2025)
+
+**Status:** FIXED - Zero-price rows now filtered automatically
+
+**Root cause:** ThetaData sometimes returns rows with all-zero OHLC values (e.g., Saturday 2019-06-08 for MELI), which caused `ZeroDivisionError` when strategies tried to calculate positions.
+
+**Fix applied:**
+1. Added zero-price filtering when loading from cache (`thetadata_helper.py` lines ~2501-2513)
+2. Added zero-price filtering when receiving new data from ThetaData (`thetadata_helper.py` lines ~2817-2829)
+3. Cache is self-healing - bad data is filtered on load
+
+**Unit tests added:**
+- `TestZeroPriceFiltering` class with 6 tests covering all edge cases
+- Tests verify: zero-row removal, valid-zero-volume preservation, weekend-zero handling, partial-zeros, empty DF, all-zero DF
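
The idea behind the filter can be shown on plain dictionaries instead of a DataFrame. This is a sketch, not the code in `thetadata_helper.py`: the column names are assumed, and keeping rows where only some prices are zero is an assumption, since the actual partial-zero behaviour is defined by the `TestZeroPriceFiltering` tests rather than by this document.

```python
from typing import Dict, List

PRICE_FIELDS = ("open", "high", "low", "close")


def drop_zero_price_rows(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Remove rows whose OHLC values are all zero; volume is ignored."""
    return [
        row for row in rows
        if any(row[field] != 0 for field in PRICE_FIELDS)
    ]
```

Because the check looks only at prices, a bar with valid prices and zero volume survives, which matches the "valid-zero-volume preservation" case listed above.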
+
+### Cache Version Mismatch
+
+Always ensure `.env` files have matching cache versions:
+- `lumibot/.env`
+- `Demos/.env`
+
+## Test Philosophy (CRITICAL - READ THIS)
+
+### Test Age = Test Authority
+
+When tests fail, how you fix them depends on **how old the test is**:
+
+| Test Age | Authority Level | How to Fix |
+|----------|----------------|------------|
+| **>1 year old** | LEGACY - High authority | **Fix the CODE**, not the test. These tests have proven themselves over time. |
+| **6-12 months** | ESTABLISHED - Medium authority | Investigate carefully. Likely fix the code, but could be test issue. |
+| **1-6 months** | NEW - Lower authority | Test may need adjustment. Still verify code isn't broken. |
+| **<1 month** | EXPERIMENTAL | Test is still being refined. Adjust as needed. |
+
+### Check Test Age Before Fixing
+
+```bash
+# Check when a test file was first created
+git log --format="%ai" --follow -- tests/path/to/test.py | tail -1
+
+# Check when a specific test function was added
+git log -p --all -S 'def test_function_name' -- tests/
+```
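
Once the creation date is known, mapping it to a row of the table is mechanical. The sketch below is one possible reading of the table: the function name is hypothetical, and the day counts (365, 182, 30) are approximations chosen here, not values from the source.

```python
from datetime import date


def authority_level(created: date, today: date) -> str:
    """Map a test's age to the authority levels in the table (illustrative)."""
    age_days = (today - created).days
    if age_days > 365:
        return "LEGACY"        # fix the code, not the test
    if age_days >= 182:
        return "ESTABLISHED"   # investigate carefully
    if age_days >= 30:
        return "NEW"           # test may need adjustment
    return "EXPERIMENTAL"      # still being refined
```

For example, a test created in August 2023 and checked in November 2025 falls in the LEGACY row.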
+
+### Conflict Resolution
+
+When old tests and new tests conflict:
+1. **Old test wins by default** - it has a proven track record
+2. If the new test represents genuinely new functionality, ask the user
+3. Document any judgment calls in the test file with comments
+
+### Adding Comments to Tests
+
+For tests over 1 year old, add a comment when modifying:
+```python
+# LEGACY TEST (created Aug 2023) - This test has proven correct behavior
+# DO NOT modify expected values without understanding the full impact
+# If this test fails, FIX THE CODE, not this test
+def test_important_behavior():
+    ...
+```
+
+## Testing Checklist for Data Source Changes
+
+1. Run TQQQ 200-Day MA with Yahoo (2013-2025) → expect ~30-45% CAGR
+2. Run same strategy with ThetaData → should match Yahoo within ~5%
+3. Check for anomalous daily returns (>50% gain/loss indicates split issue)
+4. Compare specific prices around split dates (esp. Jan 13, 2022 2:1 forward split)
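
Step 3 of the checklist can be automated with a few lines. This is a minimal sketch over a plain list of closing prices; the function name and the way closes are obtained are assumptions, and only the 50% threshold comes from the checklist.

```python
from typing import List, Tuple


def anomalous_returns(closes: List[float], threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Return (index, daily_return) pairs whose move exceeds the threshold."""
    flagged = []
    for i in range(1, len(closes)):
        prev = closes[i - 1]
        if prev == 0:
            continue  # zero prices cannot yield a return; skip to avoid dividing by zero
        change = closes[i] / prev - 1.0
        if abs(change) > threshold:
            flagged.append((i, change))
    return flagged
```

Any index it returns is a candidate split date to compare against the other data source.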
+
+## Architecture Quick Reference
+
+```
+Strategy.backtest()
+    │
+    ▼
+Data Source Selection (env var overrides code)
+    │
+    ├── Yahoo: yfinance → split-adjusted prices
+    ├── ThetaData: Data Downloader → split-adjusted prices
+    └── Polygon: Polygon API → handles splits in cache validation
+    │
+    ▼
+BacktestingBroker (simulates trades)
+    │
+    ▼
+Results (tearsheet, trades, logs)
+```
+
+See `BACKTESTING_ARCHITECTURE.md` for detailed data flow diagrams.
+
+## CRITICAL: MCP Servers (MUST USE)
+
+**ATTENTION CLAUDE:** These MCP servers are **ESSENTIAL** for working effectively on LumiBot. They have proven to make a **massive difference** in debugging complex data issues, validating financial data accuracy, and navigating this large codebase. **USE THEM PROACTIVELY AND FREQUENTLY.**
+
+**IF THESE MCP SERVERS ARE NOT AVAILABLE:** Inform the user immediately and strongly recommend they install them. The debugging experience without these tools is significantly degraded. Installation instructions can be found in the Claude Code documentation.
+
+### Why These Tools Matter for LumiBot
+
+LumiBot deals with:
+- Complex financial data from multiple sources (Yahoo, ThetaData, Polygon)
+- Split adjustments, dividend calculations, and corporate actions
+- Large codebase with intricate data flow paths
+- Data accuracy requirements where small errors compound over time
+
+Without MCP tools, debugging these issues is slow and error-prone. With them, you can validate data against external sources, systematically break down problems, and navigate the codebase efficiently.
+
+---
+
+### Sequential Thinking (`mcp__sequential-thinking__sequentialthinking`) - **USE THIS FIRST**
+
+**STRONGLY RECOMMENDED** for ANY complex debugging task. This tool has been critical for:
+- Breaking down backtesting discrepancies into systematic steps
+- Analyzing why ThetaData vs Yahoo results differ
+- Planning fixes that don't introduce regressions
+- Debugging split/dividend calculation issues
+
+**USE IT:** Before diving into complex code changes, use sequential thinking to plan your approach.
+
+### Perplexity (`mcp__perplexity__*`) - **ESSENTIAL FOR DATA VALIDATION**
+
+**CRITICAL** for validating financial data. LumiBot data issues often stem from incorrect source data. Perplexity lets you:
+- `perplexity_search` - Verify stock split dates and ratios
+- `perplexity_research` - Deep dive into dividend history discrepancies
+- `perplexity_ask` - Quick validation of corporate action data
+
+**REAL EXAMPLE:** We discovered ThetaData "phantom dividends" by using Perplexity to cross-reference against Yahoo, Bloomberg, and SEC filings. This identified data quality issues that would have been impossible to find otherwise.
+
+**USE IT:** Whenever you see unexpected financial data, validate it with Perplexity before assuming the code is wrong.
+
+### Memory (`mcp__memory__*`) - **TRACK YOUR FINDINGS**
+
+**HIGHLY RECOMMENDED** for maintaining context across debugging sessions:
+- Store known data source discrepancies
+- Record phantom dividend dates and amounts
+- Track cache version changes and their reasons
+- Remember which fixes were applied and why
+
+**USE IT:** When you discover an issue, store it in memory so future sessions don't have to rediscover it.
+
+### X-Ray (`mcp__xray__*`) - **NAVIGATE THE CODEBASE**
+
+**PARTIALLY USEFUL** for understanding the LumiBot codebase:
+- `explore_repo` - **WORKS** - Map the directory structure
+- `what_breaks` - **WORKS** - Find all usages of a function (text search)
+- `find_symbol` - **REQUIRES ast-grep** - If this fails, install: `brew install ast-grep`
+
+**USE IT:** Before modifying any function, use `what_breaks` to understand the impact. If `find_symbol` fails, use Grep tool instead.
+
+### Context7 (`mcp__context7__*`) - **GET CURRENT DOCS**
+
+**USEFUL** for library documentation:
+- `resolve-library-id` - Find library IDs
+- `get-library-docs` - Get current pandas, yfinance, polygon docs
+
+### Chrome DevTools (`mcp__chrome-devtools__*`)
+
+**USEFUL** for debugging Data Downloader issues:
+- Test API endpoints directly
+- Inspect network responses from ThetaData
@@ -5,7 +5,16 @@
 
 Lumibot is a backtesting and trading library for stocks, options, crypto, futures and more. It is made so that the same code you use for backtesting can be used for live trading, making it easy to transition from backtesting to live trading. Lumibot is a highly flexible library that allows you to create your own strategies and indicators, and backtest them on historical data. It is also highly optimized for speed, so you can backtest your strategies quickly and efficiently.
 
-**IMPORTANT: This library requires data for backtesting. The recommended data source is [Polygon.io](https://polygon.io/?utm_source=affiliate&utm_campaign=lumi10) (a free tier is available too). Please click the link to give us credit for the sale, it helps support this project. You can use the coupon code 'LUMI10' for 10% off.**
+**IMPORTANT: This library requires data for backtesting. Our recommended data source is [ThetaData](https://www.thetadata.net/) because they provide the deepest historical coverage we’ve found and directly support BotSpot. Use the promo code `BotSpot10` at checkout for 10% off the first order (the code also tells ThetaData you were referred by us).**
+
+> **Contributor note:** Read `AGENTS.md` before running anything Theta-related. That file spells out the hard rules—never launch ThetaTerminal or the shared downloader locally, always point LumiBot at the AWS-hosted downloader, and wrap all long
+> commands with `/Users/robertgrzesik/bin/safe-timeout`. Breaking these rules kills the only licensed Theta session.
+
+## Architecture Documentation
+
+- `BACKTESTING_ARCHITECTURE.md` - Detailed documentation of the backtesting data flow (Yahoo, ThetaData, Polygon data sources, caching, and data flow diagrams)
+- `CLAUDE.md` - AI assistant instructions for working with the codebase
+- `AGENTS.md` - Critical rules for ThetaData and production safety
 
 ## Documentation - 👇 Start Here 👇