fix(scheduler): treat Error state sequences as finished in PagedAttention #2111

Draft
glaziermag wants to merge 1 commit into EricLBuehler:master from glaziermag:fix/scheduler-error-state-hang
glaziermag commented Apr 16, 2026

Problem

Sequences that enter SequenceState::Error were not matched by is_finished_paged_attn(). In PagedAttention, that means an errored sequence can remain in scheduler state instead of being treated as terminal for cleanup.

SequenceState::Error is set on real inference error paths, including handle_seq_error_stateaware_ok! and handle_pipeline_forward_error!. Scheduler cleanup calls free_finished_sequence_groups(), which depends on is_finished_paged_attn().

Fix

Add SequenceState::Error to the terminal states returned by is_finished_paged_attn(), alongside FinishedAborted, FinishedIgnored, and Done.
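
A minimal sketch of the change described above, not the actual code in mistralrs-core/src/sequence.rs. Only Error, Done, FinishedAborted, and FinishedIgnored are named in this PR; the running and waiting variants, the free-function form of the predicate, and the free_finished helper are assumptions made for illustration.

```rust
// Simplified stand-in for the real SequenceState enum.
// The Waiting / RunningPrompt / RunningCompletion variants are assumed.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SequenceState {
    Waiting,
    RunningPrompt,
    RunningCompletion,
    Done,
    FinishedAborted,
    FinishedIgnored,
    Error,
}

/// Terminal-state predicate after the fix: Error joins the finished states.
fn is_finished_paged_attn(state: SequenceState) -> bool {
    matches!(
        state,
        SequenceState::FinishedAborted
            | SequenceState::FinishedIgnored
            | SequenceState::Done
            | SequenceState::Error
    )
}

/// Hypothetical cleanup pass: drop every sequence whose state is terminal
/// and return how many entries were freed.
fn free_finished(running: &mut Vec<SequenceState>) -> usize {
    let before = running.len();
    running.retain(|state| !is_finished_paged_attn(*state));
    before - running.len()
}
```

Without the Error arm, a cleanup pass shaped like free_finished would keep an errored entry in the list on every call, which is the lingering scheduler state described under Problem.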

Files changed

  • mistralrs-core/src/sequence.rs

Validation

Current branch head after Agent 5 follow-up: 593b6f298989769801cabaee4102c4069be3ce7a.

A100 hardware/software used by Agent 5:

  • GCP a2-highgpu-1g, 1x NVIDIA A100-SXM4-40GB, 40960 MiB
  • Driver 580.126.09; nvidia-smi CUDA 13.0; CUDA toolkit 12.9.41
  • Rust 1.95.0; Cargo 1.95.0

Before/after targeted check:

cargo test -q -p mistralrs-core error_state_is_finished_for_paged_attention_cleanup --lib
  • base 2d4ba4f16f61e5e18be085d0dd137bc95cba038a plus the injected regression test: failed; SequenceState::Error must be terminal for PagedAttention cleanup.
  • previous PR head 0715e0a2ddb90099d353547dc9840f03897f23f2 plus the same regression test: passed.

Committed tests on current head:

cargo test -q -p mistralrs-core error_state_is_finished_for_paged_attention --lib
cargo test -q -p mistralrs-core paged_attention_finished_state_predicate_preserves_existing_states --lib

A100 result: both passed.

The committed coverage verifies that Error is terminal for PagedAttention cleanup and that existing Done / FinishedAborted / FinishedIgnored behavior remains terminal while normal running states remain nonterminal.
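
The shape of that coverage can be sketched as a table-driven check. This is an illustrative reconstruction, not the committed tests: the nonterminal variant names, the free-function predicate, and the misclassified_states helper are assumptions.

```rust
// Simplified stand-in for the real enum; nonterminal variant names are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SequenceState {
    Waiting,
    RunningPrompt,
    RunningCompletion,
    Done,
    FinishedAborted,
    FinishedIgnored,
    Error,
}

/// Predicate under test, in its post-fix form.
fn is_finished_paged_attn(state: SequenceState) -> bool {
    matches!(
        state,
        SequenceState::FinishedAborted
            | SequenceState::FinishedIgnored
            | SequenceState::Done
            | SequenceState::Error
    )
}

/// Returns every state whose classification differs from the expected table.
/// An empty result means both invariants hold: Error is terminal, and the
/// pre-existing terminal and running states keep their classification.
fn misclassified_states() -> Vec<SequenceState> {
    use SequenceState::*;
    let expected = [
        (Error, true),
        (Done, true),
        (FinishedAborted, true),
        (FinishedIgnored, true),
        (Waiting, false),
        (RunningPrompt, false),
        (RunningCompletion, false),
    ];
    expected
        .into_iter()
        .filter(|&(state, want)| is_finished_paged_attn(state) != want)
        .map(|(state, _)| state)
        .collect()
}
```

Listing the running states explicitly guards against a fix that over-corrects by treating every state as finished.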

Merge / Issue-Linking Note

The branch commit with stale auto-close wording has been rewritten. Final squash/merge wording should use Refs #2058 or Related to #2058, without a closing keyword before #2058, so #2058 is not closed automatically. Safe wording: “Fixes the scheduler-state half: SequenceState::Error is terminal for PagedAttention cleanup; does not prove the full Windows RTX 3080 hang is fixed.”

Relationship to #2058 / #2076

Classification: TARGETED.

This is the scheduler-state half of #2076. It is related to #2058, but it is not a full runtime reproduction of #2058. Agent 5 did not recover a live A100 inference repro where a sequence enters SequenceState::Error, blocks KV cleanup, and then prevents new scheduling. The original #2058 report used Windows 11, RTX 3080 12GB, CUDA 12.9 / driver 591.86, and google/gemma-4-E2B-it.

Safe merge claim: this PR makes SequenceState::Error terminal for PagedAttention cleanup. It should not claim to fully resolve the #2058 report without a runtime before/after on the original hang conditions.

github-actions Bot commented Apr 16, 2026

Code Metrics Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                  5          305          210           52           43
 CSS                       3          281          252            5           24
 CUDA                     59        17661        13824         1637         2200
 Dockerfile                1           38           21            8            9
 HTML                      2           27           27            0            0
 JavaScript                3          392          387            2            3
 Jinja2                    7          694          656            5           33
 JSON                     25         9346         9343            0            3
 Makefile                  1            6            5            0            1
 MDX                       1          147            0          132           15
 Metal Shading Lan|       31        11647         9007         1064         1576
 PowerShell                1          300          227           30           43
 Python                  129         9969         8194          456         1319
 Shell                     2          489          331           96           62
 Plain Text                3         3723            0         2413         1310
 TOML                     27         1309         1145           36          128
 TypeScript               11         1607         1371           66          170
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                119         8232            0         5591         2641
 |- BASH                  52          491          432           34           25
 |- Dockerfile             2            5            5            0            0
 |- JSON                  16          582          582            0            0
 |- PowerShell             3            5            5            0            0
 |- Python                22          687          604            5           78
 |- Rust                  13          415          362            1           52
 |- TOML                   9          107           83            3           21
 |- YAML                   1            9            9            0            0
 (Total)                            10533         2082         5634         2817
─────────────────────────────────────────────────────────────────────────────────
 Rust                    571       245656       216375         6437        22844
 |- Markdown             379         9235          452         7653         1130
 (Total)                           254891       216827        14090        23974
─────────────────────────────────────────────────────────────────────────────────
 Svelte                   18         1831         1696           50           85
 |- CSS                    1            4            4            0            0
 |- JavaScript            18          876          727           24          125
 (Total)                             2711         2427           74          210
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                  1025       326405       266585        25848        33972
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

glaziermag force-pushed the fix/scheduler-error-state-hang branch 2 times, most recently from 212955c to 8041506 (April 18, 2026 02:30)
glaziermag marked this pull request as draft (May 5, 2026 19:16)
glaziermag commented

Agent 6 follow-up on existing A100 validation: this remains valid as a targeted scheduler fix. Classification: TARGETED; feasibility: FEASIBLE_NOW. The A100 work validates the SequenceState::Error terminal-state cleanup invariant, but does not prove the full original runtime issue. Safe wording should stay limited to Error-state sequences being treated as finished in PagedAttention. Recommendation: keep open/draft for review.

glaziermag force-pushed the fix/scheduler-error-state-hang branch 3 times, most recently from 4abee42 to 391e829 (May 20, 2026 02:25)
glaziermag force-pushed the fix/scheduler-error-state-hang branch from 391e829 to bf8e9d2 (May 20, 2026 21:14)