Catch and retry when race condition occurs during incident creation #882

fabianvf · 2025-09-30T12:25:28Z

Summary by CodeRabbit

Bug Fixes
- Improved reliability of incident creation under high concurrency with guarded retries and clearer failure reporting for rare duplicate-key races.
Tests
- Added stress tests for multiple simultaneous users and duplicate-violation scenarios.
- Introduced tests covering sequential, concurrent, and batch creation of incidents sharing the same violation.
- Clarified test output messaging for readability.

Related to #883

Signed-off-by: Fabian von Feilitzsch <[email protected]>

coderabbitai · 2025-09-30T12:25:35Z

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title accurately and concisely summarizes the core change, which is introducing a catch-and-retry mechanism for race conditions during incident creation, without extraneous details or generic phrasing.



coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (1)
319-370: Well-implemented retry mechanism for race condition handling.

The two-attempt retry logic correctly handles the race condition where multiple concurrent requests try to create the same violation. The implementation:

1. Queries for the existing violation first
2. Creates the violation if missing
3. Catches the duplicate key IntegrityError on the first attempt
4. Retries after a brief delay

A few considerations:

- The 0.01s sleep (line 365) might be insufficient under very high contention. Consider increasing to 0.05-0.1s or using a small random jitter to reduce thundering herd effects.
- The error message matching on "kai_violations_pkey" (line 361) is database-specific. This works for PostgreSQL but might need adjustment for other databases. Consider documenting this assumption or checking the specific exception type if the ORM provides it.
- Static analysis flags line 370's RuntimeError message as too long. Consider extracting to a constant or shortening it.

Optional refinements:
-                await asyncio.sleep(0.01)
+                await asyncio.sleep(0.05)  # Brief delay to reduce contention
-    raise RuntimeError("Failed to create incident after retries")
+    raise RuntimeError("Incident creation failed after retries")
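For illustration, the jitter suggestion could look roughly like the sketch below. It is not the code from server.py: `DuplicateKeyError`, `retry_on_duplicate`, and the parameter names are hypothetical stand-ins, and the default delay values are only taken from the 0.05-0.1s range mentioned above.

```python
import asyncio
import random


class DuplicateKeyError(Exception):
    """Hypothetical stand-in for the database's duplicate-key IntegrityError."""


async def retry_on_duplicate(operation, attempts=2, base_delay=0.05, jitter=0.05):
    """Await `operation()`, retrying when it raises DuplicateKeyError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await operation()
        except DuplicateKeyError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Random jitter spreads out retries from requests that collided
            # at the same moment, so they do not collide again in lockstep.
            await asyncio.sleep(base_delay + random.uniform(0, jitter))
```

Re-raising the original exception on the last attempt keeps the database error visible to the caller instead of replacing it with a generic message.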
kai_mcp_solution_server/tests/test_duplicate_violations.py (1)
106-208: Well-designed concurrent stress test.

This test effectively exercises the race condition by launching multiple concurrent clients. The dual success criteria (demonstrating the bug vs. confirming the fix) make the test valuable throughout the development lifecycle.

One minor concern: the broad exception catching on line 140 is flagged by static analysis. While acceptable for a test that's explicitly checking for various error conditions, consider being more specific if possible.

If you want to address the static analysis hint while still catching all errors:
             except Exception as e:
+                # Catch all exceptions to report them without failing the test
                 return file_num, e
The comment documents the intentional broad exception handling.
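A minimal sketch of that pattern, assuming a hypothetical `create_incident` coroutine in place of the real MCP client call: each worker returns its exception rather than raising, and the Ruff rule is silenced inline with a `noqa` comment so the intent is explicit.

```python
import asyncio


async def run_client(file_num, create_incident):
    """Call the (hypothetical) create_incident coroutine and capture any failure."""
    try:
        return file_num, await create_incident(file_num)
    except Exception as e:  # noqa: BLE001 - intentional: report errors, don't abort the run
        return file_num, e


async def run_all(num_clients, create_incident):
    """Launch all clients concurrently and split results into successes and failures."""
    results = await asyncio.gather(
        *(run_client(i, create_incident) for i in range(num_clients))
    )
    successes = [(n, r) for n, r in results if not isinstance(r, Exception)]
    failures = [(n, r) for n, r in results if isinstance(r, Exception)]
    return successes, failures
```

Collecting failures instead of raising lets the test report every error from a single run, which is useful when the goal is to count how many clients hit the race.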

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b53f24f and a838859.

📒 Files selected for processing (3)

kai_mcp_solution_server/Makefile (1 hunks)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (1 hunks)
kai_mcp_solution_server/tests/test_duplicate_violations.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (2)

kai_mcp_solution_server/src/kai_mcp_solution_server/db/dao.py (2)

DBViolation (202-234)

DBIncident (245-295)

kai_mcp_solution_server/src/kai_mcp_solution_server/constants.py (1)

log (9-10)

kai_mcp_solution_server/tests/test_duplicate_violations.py (2)

kai_mcp_solution_server/ts-mcp-client/src/types/sdk.d.ts (1)

CallToolResult (22-24)

kai/jsonrpc/util.py (1)

model_dump (96-97)

🪛 Ruff (0.13.1)

kai_mcp_solution_server/src/kai_mcp_solution_server/server.py

370-370: Avoid specifying long messages outside the exception class

(TRY003)

kai_mcp_solution_server/tests/test_duplicate_violations.py

140-140: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: Run e2e test (ubuntu-24.04, bash, ChatOpenAI, kai-test-generation)
GitHub Check: Run e2e test (windows-latest, cmd, ChatOpenAI, kai-test-generation)
GitHub Check: Run e2e test (ubuntu-22.04-arm, bash, ChatOpenAI, kai-test-generation)
GitHub Check: Run e2e test (macos-13, bash, ChatOpenAI, kai-test-generation)
GitHub Check: Run e2e test (macos-latest, bash, ChatOpenAI, kai-test-generation)
GitHub Check: Stress Test with PostgreSQL
GitHub Check: Check Types on PR Runner (ubuntu-latest, 3.12, x64)

🔇 Additional comments (4)

kai_mcp_solution_server/Makefile (1)

125-136: LGTM! Enhanced stress testing coverage.

The updated test-stress target now includes both multiple users and duplicate violations stress tests, providing comprehensive coverage for the race condition handling introduced in this PR.

kai_mcp_solution_server/tests/test_duplicate_violations.py (3)

24-35: LGTM! Proper test setup.

The test setup correctly enforces the required MCP_SERVER_URL environment variable and applies SSL bypass for testing against external servers.

42-104: LGTM! Comprehensive sequential test.

This test correctly verifies that creating incidents with the same violation sequentially works as expected, with each incident receiving a unique ID while reusing the existing violation.

210-270: LGTM! Batch creation test validates sequential processing.

This test correctly verifies that the batch creation endpoint (create_multiple_incidents) handles multiple incidents with the same violation, which works because they're processed sequentially within the batch.

Signed-off-by: Fabian von Feilitzsch <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (1)
376-376: Optional: Consider using a custom exception class.

The defensive RuntimeError is appropriate as a safeguard that should never execute. However, the static analysis tool suggests extracting long messages to exception classes for better maintainability.

Consider creating a custom exception:
class IncidentCreationError(Exception):
    """Raised when incident creation fails after all retry attempts."""
    pass
Then use it:
-    raise RuntimeError("Failed to create incident after retries")
+    raise IncidentCreationError("Failed to create incident after retries")
Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a838859 and ab88586.

📒 Files selected for processing (1)

kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (2)

kai_mcp_solution_server/src/kai_mcp_solution_server/db/dao.py (2)

DBViolation (202-234)

DBIncident (245-295)

kai_mcp_solution_server/src/kai_mcp_solution_server/constants.py (1)

log (9-10)

🪛 Ruff (0.13.1)

kai_mcp_solution_server/src/kai_mcp_solution_server/server.py

376-376: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: Run e2e test (macos-13, bash, ChatOpenAI, kai-test-generation)
GitHub Check: Run e2e test (windows-latest, cmd, ChatOpenAI, kai-test-generation)
GitHub Check: Run e2e test (macos-latest, bash, ChatOpenAI, kai-test-generation)
GitHub Check: Run e2e test (ubuntu-24.04, bash, ChatOpenAI, kai-test-generation)
GitHub Check: Run e2e test (ubuntu-22.04-arm, bash, ChatOpenAI, kai-test-generation)
GitHub Check: Stress Test with PostgreSQL

🔇 Additional comments (3)

kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (3)

319-321: LGTM: Retry parameters are well-chosen.

The max attempts (3) and base delay (50ms) with exponential backoff provide a reasonable retry window (~350ms total) for handling transient race conditions during concurrent violation creation.
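One reading that reproduces the ~350ms figure is sketched below. It assumes the delay doubles on each attempt and that a sleep follows every failed attempt, including the last; the actual schedule in server.py may differ, and the constant names here are hypothetical.

```python
MAX_ATTEMPTS = 3
BASE_DELAY = 0.05  # seconds (50ms)


def backoff_delays(max_attempts=MAX_ATTEMPTS, base_delay=BASE_DELAY):
    """Delay after each failed attempt: base_delay * 2**attempt."""
    return [base_delay * (2**attempt) for attempt in range(max_attempts)]


delays = backoff_delays()  # 50ms, 100ms, 200ms
total = sum(delays)        # roughly 0.35 seconds
```

If the loop does not sleep after the final attempt, the window shrinks to about 150ms (50ms + 100ms).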

323-360: LGTM: Transaction scoping is correct for idempotent retries.

Each retry attempt correctly opens a fresh transaction, performs the full violation lookup/creation and incident creation atomically, and ensures the incident ID is flushed before returning. This properly handles the race condition where concurrent requests attempt to create the same violation.
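The shape of that logic can be sketched with the standard-library `sqlite3` module. This is a synchronous stand-in, not the PR's async ORM code; the table layout and the `create_incident` and `make_db` helpers are hypothetical.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory database with hypothetical violations/incidents tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE violations (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE incidents ("
        "id INTEGER PRIMARY KEY, "
        "violation_id TEXT REFERENCES violations(id))"
    )
    return conn


def create_incident(conn, violation_id, max_attempts=3, base_delay=0.05):
    """Get-or-create the violation and insert an incident, one transaction per attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            # `with conn` commits on success and rolls back on an exception,
            # so a failed attempt leaves nothing behind for the next one.
            with conn:
                row = conn.execute(
                    "SELECT id FROM violations WHERE id = ?", (violation_id,)
                ).fetchone()
                if row is None:
                    # A concurrent writer may insert the same row between the
                    # SELECT and this INSERT; that surfaces as IntegrityError.
                    conn.execute(
                        "INSERT INTO violations (id) VALUES (?)", (violation_id,)
                    )
                cur = conn.execute(
                    "INSERT INTO incidents (violation_id) VALUES (?)", (violation_id,)
                )
                return cur.lastrowid
        except sqlite3.IntegrityError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt)
```

Because the lookup is repeated inside each new transaction, a retry after a lost race finds the violation the other request created and only inserts the incident.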

362-374: Verify duplicate-violation retry works on both PostgreSQL and SQLite.

The retry logic in server.py matches only on "kai_violations_pkey", which catches Postgres primary-key errors but misses SQLite’s “UNIQUE constraint failed” messages. Update the handler to detect SQLite unique-constraint failures (e.g. inspect SQLAlchemy’s exception attributes or error codes) and add an integration test using the SQLite DSN that triggers a duplicate-violation IntegrityError to confirm the backoff/retry path.
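A backend-agnostic check might look like the sketch below. It is an assumption-laden illustration rather than the project's handler: `is_duplicate_key_error` is a hypothetical helper, it relies on the ORM wrapper exposing the driver exception as `.orig` (as SQLAlchemy's DBAPI error wrappers do), and on the driver exposing the SQLSTATE as `pgcode` (psycopg2) or `sqlstate` (psycopg 3, asyncpg). Those attribute names should be verified against the driver actually in use.

```python
import sqlite3

PG_UNIQUE_VIOLATION = "23505"  # PostgreSQL SQLSTATE for unique_violation


def is_duplicate_key_error(exc):
    """Return True if `exc` looks like a unique/primary-key violation."""
    # Unwrap the driver-level exception when an ORM wrapper is present.
    driver_exc = getattr(exc, "orig", None) or exc

    # PostgreSQL drivers: compare the SQLSTATE instead of the constraint name.
    sqlstate = getattr(driver_exc, "pgcode", None) or getattr(
        driver_exc, "sqlstate", None
    )
    if sqlstate == PG_UNIQUE_VIOLATION:
        return True

    # SQLite reports both UNIQUE and PRIMARY KEY collisions with this message.
    if isinstance(driver_exc, sqlite3.IntegrityError):
        return "UNIQUE constraint failed" in str(driver_exc)

    return False
```

Matching on the SQLSTATE avoids depending on the constraint name "kai_violations_pkey", which would change if the table or key were renamed.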

JonahSussman

LGTM

…882)

Bug Fixes
- Improved reliability of incident creation under high concurrency with guarded retries and clearer failure reporting for rare duplicate-key races.
Tests
- Added stress tests for multiple simultaneous users and duplicate-violation scenarios.
- Introduced tests covering sequential, concurrent, and batch creation of incidents sharing the same violation.
- Clarified test output messaging for readability.

Related to #883

Signed-off-by: Fabian von Feilitzsch <[email protected]>
Signed-off-by: Cherry Picker <[email protected]>

Catch and retry when race condition occurs during incident creation

a838859

Signed-off-by: Fabian von Feilitzsch <[email protected]>

fabianvf requested a review from JonahSussman September 30, 2025 12:25

coderabbitai bot reviewed Sep 30, 2025


Increase retries + add exponential backoff

ab88586

Signed-off-by: Fabian von Feilitzsch <[email protected]>

coderabbitai bot reviewed Sep 30, 2025


fabianvf added the cherry-pick/release-0.8 label ("This PR should be cherry-picked to release-0.8 branch") Sep 30, 2025

JonahSussman approved these changes Sep 30, 2025


fabianvf merged commit 859ca14 into konveyor:main Sep 30, 2025
15 checks passed

fabianvf deleted the duplicate-violation-handling branch September 30, 2025 14:39

fabianvf mentioned this pull request Oct 9, 2025

[bug] Concurrent create_incident requests can cause race condition when creating violation in DB #883

Closed
