-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
fix(db): prevent FK constraint failures on worker restart #889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix(db): prevent FK constraint failures on worker restart #889
Conversation
…otmack#846) When the worker restarts, it discards the stale memory_session_id and the SDK generates a new one. However, observations were being stored BEFORE this new ID was registered in the sdk_sessions parent table, causing FK constraint violations and infinite retry loops. Fix: - Add ensureMemorySessionIdRegistered() to SessionStore.ts that updates the parent table if the memory_session_id differs from what's stored - Call this method in ResponseProcessor.ts BEFORE storeObservations() - Add tests covering the worker restart scenario This ensures the FK target always exists before child records reference it. Co-authored-by: C. Copus <[email protected]>
Greptile OverviewGreptile SummaryFixes an FK ordering bug on worker restart by ensuring Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
autonumber
participant Worker as Worker/Agent
participant RP as ResponseProcessor
participant SS as SessionStore
participant DB as SQLite (sdk_sessions/observations)
Worker->>RP: processAgentResponse(text, session)
RP->>SS: ensureMemorySessionIdRegistered(sessionDbId, memorySessionId)
SS->>DB: SELECT memory_session_id FROM sdk_sessions WHERE id=?
alt session missing
SS-->>RP: throw Error("Session <id> not found")
else id differs or NULL
SS->>DB: UPDATE sdk_sessions SET memory_session_id=? WHERE id=?
SS-->>RP: OK
else already matches
SS-->>RP: no-op
end
RP->>SS: storeObservations(memorySessionId, project, observations, summary,...)
SS->>DB: INSERT observations (FK -> sdk_sessions.memory_session_id)
SS-->>RP: { observationIds, summaryId }
RP-->>Worker: done (then async chroma/broadcast)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 files reviewed, 2 comments
| /** | ||
| * Tests for FK constraint fix (Issue #846) | ||
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P1] Unused Database import in new test file
tests/fk-constraint-fix.test.ts imports Database from bun:sqlite but never uses it. This will trip linters/formatters and adds noise; remove the unused import to keep the test focused.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/fk-constraint-fix.test.ts
Line: 1:3
Comment:
[P1] Unused `Database` import in new test file
`tests/fk-constraint-fix.test.ts` imports `Database` from `bun:sqlite` but never uses it. This will trip linters/formatters and adds noise; remove the unused import to keep the test focused.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
|
|
||
| import { describe, it, expect, beforeEach, afterEach } from 'bun:test'; | ||
| import { SessionStore } from '../src/services/sqlite/SessionStore.js'; | ||
| import { Database } from 'bun:sqlite'; | ||
|
|
||
| describe('FK Constraint Fix (Issue #846)', () => { | ||
| let store: SessionStore; | ||
| let testDbPath: string; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P2] Temp DB path can collide across fast tests
Using Date.now() for testDbPath can collide when tests start in the same millisecond (especially under parallelism), which can cause flaky failures or cross-test contamination. Consider using a random suffix (e.g. crypto.randomUUID() in Bun) or mkdtemp/tmpdir to guarantee uniqueness.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/fk-constraint-fix.test.ts
Line: 12:19
Comment:
[P2] Temp DB path can collide across fast tests
Using `Date.now()` for `testDbPath` can collide when tests start in the same millisecond (especially under parallelism), which can cause flaky failures or cross-test contamination. Consider using a random suffix (e.g. `crypto.randomUUID()` in Bun) or `mkdtemp`/`tmpdir` to guarantee uniqueness.
How can I resolve this? If you propose a fix, please make it concise.
- Remove unused Database import - Use crypto.randomUUID() instead of Date.now() for test DB paths to prevent collisions in parallel test execution
3e8db0b to
92d2476
Compare
Two issues fixed: 1. AbortController remained aborted after generator finished - new generators got insta-killed (awkward) 2. SDK can return different session_id on resume - now synced to DB to prevent FK constraint failures Closes thedotmack#846 Co-authored-by: C. Copus <[email protected]>
FK constraint fix added a getSessionById call in ResponseProcessor. Tests didn't know about it. Tests failed. Tests learned their lesson. - Add getSessionById mock to gemini_agent.test.ts - Add getSessionById mock to response-processor.test.ts (8 locations) - Update error message test to match new dual-source check Co-authored-by: C. Copus <[email protected]>
When SDK generator fails with unrecoverable error (e.g., "Claude not found"), the .finally() block was like "oh no, pending work! restart!" and the error would immediately recur. Rinse, repeat, 10M+ log lines, 1GB+ disk eaten. Now we track unrecoverable patterns and skip restart: - Claude executable not found - CLAUDE_CODE_PATH issues - ENOENT / spawn failures Your disk space is safe now. Co-authored-by: C. Copus <[email protected]>
Race conditions were spawning agent armies: 1. Crash recovery vs HTTP requests (1s window of chaos) 2. Concurrent HTTP requests spawning multiple generators 3. queueDepth telemetry lying (always showed 0) The fix: - spawnInProgress Map blocks concurrent generator spawns - crashRecoveryScheduled Set prevents duplicate recovery attempts - queueDepth now reads from database (the truth) - 6 tests to keep zombies dead Co-authored-by: C. Copus <[email protected]>
92d2476 to
742f614
Compare
- Add ensureMemorySessionIdRegistered mock to GeminiAgent tests - Fix logger.formatTool() parallel test pollution via inline impl - Update ResponseProcessor error message expectation All 796 tests now pass. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
Fixes #846 - FK constraint failures after worker restart causing infinite retry loops and high CPU usage.
Problem
When the worker restarts:
memory_session_id = NULL(or stale value)memory_session_idduring executionstoreObservation()tries to INSERT with this new IDsdk_sessionsrow doesn't have this ID yetRoot Cause
The
updateMemorySessionId()function exists but was never called between session creation and observation storage. There was a timing gap where observations referenced amemory_session_idthat didn't exist in the parent table.Solution
Added
ensureMemorySessionIdRegistered()that:memory_session_idmatches what's insdk_sessionsResponseProcessor.tsright beforestoreObservations()This ensures the FK target always exists before child records reference it.
Changes
src/services/sqlite/SessionStore.ts- AddensureMemorySessionIdRegistered()methodsrc/services/worker/agents/ResponseProcessor.ts- Call guard before storagetests/fk-constraint-fix.test.ts- 4 tests covering worker restart scenariostests/worker/agents/response-processor.test.ts- Update mocks for new methodTest Plan
bun test tests/fk-constraint-fix.test.ts- 4/4 passingRelated