
fix(session-db): advance flush cursor per-message to prevent duplicate writes on partial failure#19407

Closed
Bartok9 wants to merge 2 commits into NousResearch:main from Bartok9:bartok9/tea-17-fix-flush-cursor-advancement

Conversation


@Bartok9 Bartok9 commented May 3, 2026

Summary

_flush_messages_to_session_db (run_agent.py) iterates messages[flush_from:] and calls SessionDB.append_message for each row, then sets self._last_flushed_db_idx = len(messages) after the loop completes. If any individual append_message raises mid-loop — typical triggers: SQLite "database is locked" from concurrent Hermes processes sharing the same state.db, transient disk-full, schema-evolution race — control jumps to the broad except Exception clause without the cursor having advanced.

The next flush — usually called from the next exit path of the same agent run a few hundred ms later, since _persist_session fans out to multiple call sites — re-evaluates flush_from = max(start_idx, self._last_flushed_db_idx), gets the original (un-advanced) value, and re-writes the rows that did commit before the failure. The user sees their transcript silently grow duplicates: each user/assistant pair appears twice, FTS5 surfaces the same message twice in search results, and message_count on the session row drifts to 2× the true conversation length. This compounds across runs, because every retry re-writes the already-committed prefix again.
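The failure mode above can be reproduced with a minimal sketch. The class and method names here are illustrative stand-ins for the real run_agent.py code, and FlakyDB stands in for a SessionDB whose append_message raises once mid-loop:

```python
import sqlite3

class FlushingAgent:
    """Illustrative stand-in for the agent's flush logic (pre-fix)."""
    def __init__(self, db):
        self.db = db
        self._last_flushed_db_idx = 0

    def flush(self, messages, start_idx=0):
        flush_from = max(start_idx, self._last_flushed_db_idx)
        try:
            for msg in messages[flush_from:]:
                self.db.append_message(msg)  # may raise mid-loop
            # BUG: the cursor only advances if the whole loop succeeds
            self._last_flushed_db_idx = len(messages)
        except Exception:
            pass  # broad handler: cursor still points at flush_from

class FlakyDB:
    """Commits rows until the failing call number, then raises once."""
    def __init__(self, fail_on_call):
        self.rows = []
        self.calls = 0
        self.fail_on_call = fail_on_call

    def append_message(self, msg):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise sqlite3.OperationalError("database is locked")
        self.rows.append(msg)

db = FlakyDB(fail_on_call=3)
agent = FlushingAgent(db)
messages = [f"m{i}" for i in range(5)]
agent.flush(messages)  # m0 and m1 commit, the write of m2 raises
agent.flush(messages)  # retry starts from 0 again: m0 and m1 duplicated
```

After the second flush, db.rows contains m0 and m1 twice each — the duplicated-prefix symptom described above.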

What changed

The fix is mechanical: advance self._last_flushed_db_idx = flush_from + i + 1 inside the loop immediately after each successful append_message call, and remove the now-redundant post-loop assignment. With per-row advancement, a mid-loop failure on row N+1 leaves the cursor at N, so the next flush correctly skips rows 0..N and only re-attempts N+1 onward — no duplicates of the rows that succeeded.
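The per-row advancement can be sketched as follows (again with illustrative stand-in names; the real method is _flush_messages_to_session_db). A clean run also demonstrates the success-path equivalence claimed below:

```python
class FlushingAgent:
    """Illustrative stand-in with the per-row cursor advancement."""
    def __init__(self, db):
        self.db = db
        self._last_flushed_db_idx = 0

    def flush(self, messages, start_idx=0):
        flush_from = max(start_idx, self._last_flushed_db_idx)
        try:
            for i, msg in enumerate(messages[flush_from:]):
                self.db.append_message(msg)
                # Advance immediately after each successful write, so a
                # failure on row N+1 leaves the cursor at N.
                self._last_flushed_db_idx = flush_from + i + 1
        except Exception:
            pass  # next flush resumes from the committed prefix

class ListDB:
    """In-memory stand-in for SessionDB."""
    def __init__(self):
        self.rows = []
    def append_message(self, msg):
        self.rows.append(msg)

db = ListDB()
agent = FlushingAgent(db)
messages = [f"m{i}" for i in range(5)]
agent.flush(messages)
# Success path: flush_from + (len - 1) + 1 == len(messages)
assert agent._last_flushed_db_idx == len(messages)
```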

This is equivalent to the prior behaviour in the success path: when the loop completes normally, the final per-iteration assignment writes flush_from + (len(messages[flush_from:]) - 1) + 1 = len(messages), matching the old post-loop value. The 4 existing dedup tests confirm this — they all still pass unchanged.

Test coverage

The new test tests/run_agent/test_860_dedup.py::TestFlushDeduplication::test_flush_advances_cursor_per_message_on_partial_failure builds a 5-message conversation, monkey-patches db.append_message to raise sqlite3.OperationalError("database is locked") on the 3rd invocation, calls flush, then asserts:

  1. The first 2 messages were committed (cursor at 2, not 0).
  2. After the patched append_message is restored and flush re-runs with the same message list, the 3rd message onward gets written exactly once.
  3. Total session rows == len(messages) — no duplicates of rows 1 and 2.
  4. Row order/content match the original send order.
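The scenario behind these four assertions can be sketched like this. FixedAgent is a stand-in for the agent under test, not the upstream class, and the real test wraps the scenario in a pytest test rather than a plain function:

```python
import sqlite3
from unittest.mock import MagicMock

class FixedAgent:
    """Stand-in with the per-row cursor advancement applied."""
    def __init__(self, db):
        self.db = db
        self._last_flushed_db_idx = 0

    def flush(self, messages):
        flush_from = self._last_flushed_db_idx
        try:
            for i, msg in enumerate(messages[flush_from:]):
                self.db.append_message(msg)
                self._last_flushed_db_idx = flush_from + i + 1
        except Exception:
            pass

def run_partial_failure_scenario():
    rows, calls = [], {"n": 0}

    def append(msg):
        calls["n"] += 1
        if calls["n"] == 3:  # raise on the 3rd invocation only
            raise sqlite3.OperationalError("database is locked")
        rows.append(msg)

    db = MagicMock()
    db.append_message.side_effect = append
    agent = FixedAgent(db)
    messages = [f"m{i}" for i in range(5)]

    agent.flush(messages)
    assert agent._last_flushed_db_idx == 2  # 1. first 2 committed

    agent.flush(messages)                   # 2. retry the same list
    assert rows == messages                 # 3+4. each row once, in order
    return agent, rows

agent, rows = run_partial_failure_scenario()
```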

Verified bug repro: against an unmodified upstream run_agent.py the new test fails at the cursor assertion ("got 0, expected 2"). With the fix applied, all 4 existing dedup tests + 1 new regression test pass, plus 21 adjacent compression/persistence tests pass with no regressions.

Targeted run:

```
pytest tests/run_agent/test_860_dedup.py \
  tests/run_agent/test_compression_persistence.py \
  tests/run_agent/test_413_compression.py \
  tests/run_agent/test_compress_focus_plugin_fallback.py \
  -o addopts=''
```

31 passed.

Fixes #12563

Made with Cursor

Co-authored-by: Cursor <cursoragent@cursor.com>
@alt-glitch alt-glitch added the type/bug (Something isn't working), P2 (Medium — degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels on May 3, 2026
@alt-glitch

Related to PR #17146 (flush cursor recovery from persisted prefix) — both address SessionDB flush cursor bugs but from different failure modes.


The original test tried to instantiate AIAgent without the required patches
(get_tool_definitions, check_toolset_requirements, OpenAI) which caused CI
failures. Rewrote using the same fixture pattern as test_413_compression.py
and replaced SessionDB dependency with a MagicMock so tests don't depend
on internal DB schema. Covers the same scenarios including the NousResearch#12563
partial-failure cursor regression test.
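The MagicMock replacement described above can be sketched as follows; the helper name and message shape are illustrative, not the actual fixture from test_413_compression.py:

```python
from unittest.mock import MagicMock

def make_mock_session_db():
    """Stand-in for SessionDB: records appended messages in memory,
    so tests assert on recorded calls instead of the DB schema."""
    db = MagicMock()
    db.rows = []
    db.append_message.side_effect = db.rows.append
    return db

db = make_mock_session_db()
db.append_message({"role": "user", "content": "hi"})
```

Because every attribute of a MagicMock is itself a mock, the agent under test can call any SessionDB method without the test depending on the real implementation; only append_message is given behaviour here.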

Bartok9 commented May 5, 2026

Changes present in upstream main after rebase. Closing as resolved.



Successfully merging this pull request may close these issues.

[Bug] create_session fails silently and messages are written in duplicate; the session disappears from state.db
