@Et9797 Et9797 commented Feb 2, 2026

Summary

Fixes #846 - FK constraint failures after worker restart causing infinite retry loops and high CPU usage.

Problem

When the worker restarts:

  1. Session exists with memory_session_id = NULL (or stale value)
  2. SDK generates a new memory_session_id during execution
  3. storeObservation() tries to INSERT with this new ID
  4. FK CONSTRAINT FAILS - parent sdk_sessions row doesn't have this ID yet
  5. Infinite retry loop burns CPU

Root Cause

The updateMemorySessionId() function existed but was never called between session creation and observation storage. This left a timing gap in which observations referenced a memory_session_id that did not yet exist in the parent table.

Solution

Added ensureMemorySessionIdRegistered() that:

  1. Checks if the provided memory_session_id matches what's in sdk_sessions
  2. If not, UPDATEs the parent table FIRST before any child INSERT
  3. Called in ResponseProcessor.ts right before storeObservations()

This ensures the FK target always exists before child records reference it.
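
The ordering described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: InMemorySessionStore, createSession(), and observationCount() are hypothetical stand-ins for the real SQLite-backed SessionStore, and the FK check is simulated rather than enforced by SQLite. The guard's three branches (session missing, ID differs or NULL, already matches) follow the behavior described in this PR.

```typescript
// Hypothetical in-memory stand-in for the sdk_sessions and observations tables.
type SessionRow = { id: number; memorySessionId: string | null };

class InMemorySessionStore {
  private sessions = new Map<number, SessionRow>();
  private observations: { memorySessionId: string; text: string }[] = [];

  // Hypothetical helper: creates a parent row, optionally with a stale ID.
  createSession(id: number, memorySessionId: string | null = null): void {
    this.sessions.set(id, { id, memorySessionId });
  }

  // Guard: make sure the parent row carries the ID before any child insert.
  ensureMemorySessionIdRegistered(sessionDbId: number, memorySessionId: string): void {
    const row = this.sessions.get(sessionDbId);
    if (!row) throw new Error(`Session ${sessionDbId} not found`);
    if (row.memorySessionId !== memorySessionId) {
      row.memorySessionId = memorySessionId; // models the UPDATE on sdk_sessions
    }
    // Already matching: nothing to do.
  }

  // Models an INSERT with a foreign key on sdk_sessions.memory_session_id.
  storeObservation(memorySessionId: string, text: string): void {
    const parentExists = Array.from(this.sessions.values()).some(
      (s) => s.memorySessionId === memorySessionId,
    );
    if (!parentExists) throw new Error("FOREIGN KEY constraint failed");
    this.observations.push({ memorySessionId, text });
  }

  observationCount(): number {
    return this.observations.length;
  }
}

const demo = new InMemorySessionStore();
demo.createSession(1); // memory_session_id starts as NULL
demo.ensureMemorySessionIdRegistered(1, "mem-after-restart");
demo.storeObservation("mem-after-restart", "first observation");
```

Calling storeObservation() without the guard in this model throws, which mirrors the failure sequence listed under Problem.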

Changes

  • src/services/sqlite/SessionStore.ts - Add ensureMemorySessionIdRegistered() method
  • src/services/worker/agents/ResponseProcessor.ts - Call guard before storage
  • tests/fk-constraint-fix.test.ts - 4 tests covering worker restart scenarios
  • tests/worker/agents/response-processor.test.ts - Update mocks for new method

Test Plan

  • bun test tests/fk-constraint-fix.test.ts - 4/4 passing
  • Full test suite - no regressions from our changes
  • Manual test: kill worker, restart, create observation → should succeed

Related

…otmack#846)

When the worker restarts, it discards the stale memory_session_id and the SDK
generates a new one. However, observations were being stored BEFORE this new
ID was registered in the sdk_sessions parent table, causing FK constraint
violations and infinite retry loops.

Fix:
- Add ensureMemorySessionIdRegistered() to SessionStore.ts that updates the
  parent table if the memory_session_id differs from what's stored
- Call this method in ResponseProcessor.ts BEFORE storeObservations()
- Add tests covering the worker restart scenario

This ensures the FK target always exists before child records reference it.

Co-authored-by: C. Copus <[email protected]>

greptile-apps bot commented Feb 2, 2026

Greptile Overview

Greptile Summary

Fixes an FK ordering bug on worker restart by ensuring sdk_sessions.memory_session_id is updated/registered before inserting FK-constrained child rows (observations/summaries). The change adds a small guard method on SessionStore and calls it from ResponseProcessor immediately before storeObservations(), plus adds a new test file and updates ResponseProcessor test mocks to include the new method.

Confidence Score: 4/5

  • This PR is generally safe to merge and addresses a real FK ordering bug, with only minor test hygiene issues noted.
  • The functional change is small and localized (a pre-insert guard updating the parent FK target) and is covered by targeted tests. Remaining concerns are non-functional (unused import and potential flakiness in temp DB naming) rather than correctness of the fix.
  • tests/fk-constraint-fix.test.ts

Important Files Changed

| Filename | Overview |
|---|---|
| src/services/sqlite/SessionStore.ts | Adds ensureMemorySessionIdRegistered() to update sdk_sessions.memory_session_id before FK-constrained child inserts; change is small and consistent with existing DB access patterns. |
| src/services/worker/agents/ResponseProcessor.ts | Calls ensureMemorySessionIdRegistered() before storeObservations() to prevent FK failures after worker restart; no other logic changes. |
| tests/fk-constraint-fix.test.ts | Adds focused tests covering worker-restart FK scenario, but includes an unused import and a potentially flaky temp DB naming strategy. |
| tests/worker/agents/response-processor.test.ts | Updates SessionStore mocks to include ensureMemorySessionIdRegistered() so ResponseProcessor tests keep passing; otherwise unchanged. |

Sequence Diagram

sequenceDiagram
  autonumber
  participant Worker as Worker/Agent
  participant RP as ResponseProcessor
  participant SS as SessionStore
  participant DB as SQLite (sdk_sessions/observations)

  Worker->>RP: processAgentResponse(text, session)
  RP->>SS: ensureMemorySessionIdRegistered(sessionDbId, memorySessionId)
  SS->>DB: SELECT memory_session_id FROM sdk_sessions WHERE id=?
  alt session missing
    SS-->>RP: throw Error("Session <id> not found")
  else id differs or NULL
    SS->>DB: UPDATE sdk_sessions SET memory_session_id=? WHERE id=?
    SS-->>RP: OK
  else already matches
    SS-->>RP: no-op
  end
  RP->>SS: storeObservations(memorySessionId, project, observations, summary,...)
  SS->>DB: INSERT observations (FK -> sdk_sessions.memory_session_id)
  SS-->>RP: { observationIds, summaryId }
  RP-->>Worker: done (then async chroma/broadcast)


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 2 comments


Comment on lines +1 to +3
/**
* Tests for FK constraint fix (Issue #846)
*

[P1] Unused Database import in new test file

tests/fk-constraint-fix.test.ts imports Database from bun:sqlite but never uses it. This will trip linters/formatters and adds noise; remove the unused import to keep the test focused.


Comment on lines 12 to 19

import { describe, it, expect, beforeEach, afterEach } from 'bun:test';
import { SessionStore } from '../src/services/sqlite/SessionStore.js';
import { Database } from 'bun:sqlite';

describe('FK Constraint Fix (Issue #846)', () => {
  let store: SessionStore;
  let testDbPath: string;

[P2] Temp DB path can collide across fast tests

Using Date.now() for testDbPath can collide when tests start in the same millisecond (especially under parallelism), which can cause flaky failures or cross-test contamination. Consider using a random suffix (e.g. crypto.randomUUID() in Bun) or mkdtemp/tmpdir to guarantee uniqueness.
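
One way to apply this suggestion is sketched below. The helper name uniqueTestDbPath and the default prefix are hypothetical and not taken from the PR; the sketch only shows that a UUID suffix under the OS temp directory avoids any dependence on the clock.

```typescript
import { randomUUID } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: builds a per-test database path with a random suffix,
// so two tests starting in the same millisecond still get different files.
function uniqueTestDbPath(prefix = "fk-constraint-test"): string {
  return join(tmpdir(), `${prefix}-${randomUUID()}.db`);
}
```

A beforeEach hook could assign testDbPath = uniqueTestDbPath() and the matching afterEach could delete that file.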


- Remove unused Database import
- Use crypto.randomUUID() instead of Date.now() for test DB paths
  to prevent collisions in parallel test execution
@Et9797 Et9797 force-pushed the fix/fk-constraint-memory-session-id branch 2 times, most recently from 3e8db0b to 92d2476 on February 3, 2026 15:19
Et and others added 4 commits February 3, 2026 16:20
Two issues fixed:
1. AbortController remained aborted after generator finished - new generators
   got insta-killed (awkward)
2. SDK can return different session_id on resume - now synced to DB to
   prevent FK constraint failures

Closes thedotmack#846

Co-authored-by: C. Copus <[email protected]>
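
The first issue in this commit comes down to the fact that an AbortController cannot be reset once aborted. A minimal sketch of the remedy follows, assuming a hypothetical ActiveSession shape and a hypothetical freshSignalFor() helper; the PR's actual session object and call site are not shown here.

```typescript
// Hypothetical session shape; the real session object in the PR is not shown.
interface ActiveSession {
  sessionDbId: number;
  abortController: AbortController;
}

// Returns a signal that is safe to hand to a newly started generator.
function freshSignalFor(session: ActiveSession): AbortSignal {
  if (session.abortController.signal.aborted) {
    // An aborted controller stays aborted forever, so swap in a new one.
    session.abortController = new AbortController();
  }
  return session.abortController.signal;
}
```

Without the swap, any generator started with the old signal would see it as already aborted and stop immediately.
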
FK constraint fix added a getSessionById call in ResponseProcessor.
Tests didn't know about it. Tests failed. Tests learned their lesson.

- Add getSessionById mock to gemini_agent.test.ts
- Add getSessionById mock to response-processor.test.ts (8 locations)
- Update error message test to match new dual-source check

Co-authored-by: C. Copus <[email protected]>
When SDK generator fails with unrecoverable error (e.g., "Claude not found"),
the .finally() block was like "oh no, pending work! restart!" and the error
would immediately recur. Rinse, repeat, 10M+ log lines, 1GB+ disk eaten.

Now we track unrecoverable patterns and skip restart:
- Claude executable not found
- CLAUDE_CODE_PATH issues
- ENOENT / spawn failures

Your disk space is safe now.

Co-authored-by: C. Copus <[email protected]>
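
The restart decision described in this commit can be sketched as a pattern check. The regular expressions and the function names isUnrecoverable() and shouldRestart() below are illustrative assumptions; the exact strings and structure used in the PR are not shown.

```typescript
// Illustrative patterns only; the exact strings matched in the PR are not shown.
const UNRECOVERABLE_PATTERNS: RegExp[] = [
  /claude.*not found/i, // e.g. the Claude executable is missing
  /CLAUDE_CODE_PATH/,   // misconfigured executable path
  /ENOENT/,             // file or binary does not exist
  /spawn/i,             // process spawn failures
];

function isUnrecoverable(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return UNRECOVERABLE_PATTERNS.some((pattern) => pattern.test(message));
}

// Decides whether the generator should be restarted after it finishes.
// `error` is undefined when the generator ended without failing.
function shouldRestart(error: unknown, hasPendingWork: boolean): boolean {
  if (error !== undefined && isUnrecoverable(error)) return false;
  return hasPendingWork;
}
```

With this check, pending work alone no longer triggers a restart when the last failure is one that would recur immediately.
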
Race conditions were spawning agent armies:
1. Crash recovery vs HTTP requests (1s window of chaos)
2. Concurrent HTTP requests spawning multiple generators
3. queueDepth telemetry lying (always showed 0)

The fix:
- spawnInProgress Map blocks concurrent generator spawns
- crashRecoveryScheduled Set prevents duplicate recovery attempts
- queueDepth now reads from database (the truth)
- 6 tests to keep zombies dead

Co-authored-by: C. Copus <[email protected]>
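
A minimal sketch of how a spawnInProgress Map can block concurrent spawns is shown below. Only the Map name comes from the commit message; startGeneratorOnce(), the Promise<void> value type, and the spawn callback are assumptions made for illustration.

```typescript
// Map name taken from the commit message; the value type is an assumption.
const spawnInProgress = new Map<number, Promise<void>>();

// Hypothetical wrapper: at most one spawn per session runs at a time.
async function startGeneratorOnce(
  sessionDbId: number,
  spawn: () => Promise<void>,
): Promise<void> {
  const existing = spawnInProgress.get(sessionDbId);
  if (existing) return existing; // a spawn is already running for this session

  // Register the pending spawn before yielding, and clear it when it settles.
  const pending = spawn().finally(() => spawnInProgress.delete(sessionDbId));
  spawnInProgress.set(sessionDbId, pending);
  return pending;
}
```

Because the lookup and the set() happen with no await in between, two callers that arrive back to back share one spawn instead of starting two generators.
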
@Et9797 Et9797 force-pushed the fix/fk-constraint-memory-session-id branch from 92d2476 to 742f614 on February 3, 2026 15:21
- Add ensureMemorySessionIdRegistered mock to GeminiAgent tests
- Fix logger.formatTool() parallel test pollution via inline impl
- Update ResponseProcessor error message expectation

All 796 tests now pass.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Development

Successfully merging this pull request may close these issues.

Bug Report: Generator aborted due to stale memory_session_id after DB restore or worker restart
