-
Notifications
You must be signed in to change notification settings - Fork 53
Catch and retry when race condition occurs during incident creation #882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Catch and retry when race condition occurs during incident creation #882
Conversation
Signed-off-by: Fabian von Feilitzsch <[email protected]>
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🧪 Early access (Sonnet 4.5): enabledWe are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience. Note:
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (1)
319-370: Well-implemented retry mechanism for race condition handling.The two-attempt retry logic correctly handles the race condition where multiple concurrent requests try to create the same violation. The implementation:
- Queries for the existing violation first
- Creates the violation if missing
- Catches the duplicate key IntegrityError on the first attempt
- Retries after a brief delay
A few considerations:
The 0.01s sleep (line 365) might be insufficient under very high contention. Consider increasing to 0.05-0.1s or using a small random jitter to reduce thundering herd effects.
The error message matching on "kai_violations_pkey" (line 361) is database-specific. This works for PostgreSQL but might need adjustment for other databases. Consider documenting this assumption or checking the specific exception type if the ORM provides it.
Static analysis flags line 370's RuntimeError message as too long. Consider extracting to a constant or shortening it.
Optional refinements:
- await asyncio.sleep(0.01) + await asyncio.sleep(0.05) # Brief delay to reduce contention- raise RuntimeError("Failed to create incident after retries") + raise RuntimeError("Incident creation failed after retries")kai_mcp_solution_server/tests/test_duplicate_violations.py (1)
106-208: Well-designed concurrent stress test.This test effectively exercises the race condition by launching multiple concurrent clients. The dual success criteria (demonstrating the bug vs. confirming the fix) make the test valuable throughout the development lifecycle.
One minor concern: the broad exception catching on line 140 is flagged by static analysis. While acceptable for a test that's explicitly checking for various error conditions, consider being more specific if possible.
If you want to address the static analysis hint while still catching all errors:
except Exception as e: + # Catch all exceptions to report them without failing the test return file_num, eThe comment documents the intentional broad exception handling.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
kai_mcp_solution_server/Makefile(1 hunks)kai_mcp_solution_server/src/kai_mcp_solution_server/server.py(1 hunks)kai_mcp_solution_server/tests/test_duplicate_violations.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (2)
kai_mcp_solution_server/src/kai_mcp_solution_server/db/dao.py (2)
DBViolation(202-234)DBIncident(245-295)kai_mcp_solution_server/src/kai_mcp_solution_server/constants.py (1)
log(9-10)
kai_mcp_solution_server/tests/test_duplicate_violations.py (2)
kai_mcp_solution_server/ts-mcp-client/src/types/sdk.d.ts (1)
CallToolResult(22-24)kai/jsonrpc/util.py (1)
model_dump(96-97)
🪛 Ruff (0.13.1)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py
370-370: Avoid specifying long messages outside the exception class
(TRY003)
kai_mcp_solution_server/tests/test_duplicate_violations.py
140-140: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Run e2e test (ubuntu-24.04, bash, ChatOpenAI, kai-test-generation)
- GitHub Check: Run e2e test (windows-latest, cmd, ChatOpenAI, kai-test-generation)
- GitHub Check: Run e2e test (ubuntu-22.04-arm, bash, ChatOpenAI, kai-test-generation)
- GitHub Check: Run e2e test (macos-13, bash, ChatOpenAI, kai-test-generation)
- GitHub Check: Run e2e test (macos-latest, bash, ChatOpenAI, kai-test-generation)
- GitHub Check: Stress Test with PostgreSQL
- GitHub Check: Check Types on PR Runner (ubuntu-latest, 3.12, x64)
🔇 Additional comments (4)
kai_mcp_solution_server/Makefile (1)
125-136: LGTM! Enhanced stress testing coverage.The updated test-stress target now includes both multiple users and duplicate violations stress tests, providing comprehensive coverage for the race condition handling introduced in this PR.
kai_mcp_solution_server/tests/test_duplicate_violations.py (3)
24-35: LGTM! Proper test setup.The test setup correctly enforces the required MCP_SERVER_URL environment variable and applies SSL bypass for testing against external servers.
42-104: LGTM! Comprehensive sequential test.This test correctly verifies that creating incidents with the same violation sequentially works as expected, with each incident receiving a unique ID while reusing the existing violation.
210-270: LGTM! Batch creation test validates sequential processing.This test correctly verifies that the batch creation endpoint (
create_multiple_incidents) handles multiple incidents with the same violation, which works because they're processed sequentially within the batch.
Signed-off-by: Fabian von Feilitzsch <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (1)
376-376: Optional: Consider using a custom exception class.The defensive
RuntimeErroris appropriate as a safeguard that should never execute. However, the static analysis tool suggests extracting long messages to exception classes for better maintainability.Consider creating a custom exception:
class IncidentCreationError(Exception): """Raised when incident creation fails after all retry attempts.""" passThen use it:
- raise RuntimeError("Failed to create incident after retries") + raise IncidentCreationError("Failed to create incident after retries")Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (2)
kai_mcp_solution_server/src/kai_mcp_solution_server/db/dao.py (2)
DBViolation(202-234)DBIncident(245-295)kai_mcp_solution_server/src/kai_mcp_solution_server/constants.py (1)
log(9-10)
🪛 Ruff (0.13.1)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py
376-376: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Run e2e test (macos-13, bash, ChatOpenAI, kai-test-generation)
- GitHub Check: Run e2e test (windows-latest, cmd, ChatOpenAI, kai-test-generation)
- GitHub Check: Run e2e test (macos-latest, bash, ChatOpenAI, kai-test-generation)
- GitHub Check: Run e2e test (ubuntu-24.04, bash, ChatOpenAI, kai-test-generation)
- GitHub Check: Run e2e test (ubuntu-22.04-arm, bash, ChatOpenAI, kai-test-generation)
- GitHub Check: Stress Test with PostgreSQL
🔇 Additional comments (3)
kai_mcp_solution_server/src/kai_mcp_solution_server/server.py (3)
319-321: LGTM: Retry parameters are well-chosen.The max attempts (3) and base delay (50ms) with exponential backoff provide a reasonable retry window (~350ms total) for handling transient race conditions during concurrent violation creation.
323-360: LGTM: Transaction scoping is correct for idempotent retries.Each retry attempt correctly opens a fresh transaction, performs the full violation lookup/creation and incident creation atomically, and ensures the incident ID is flushed before returning. This properly handles the race condition where concurrent requests attempt to create the same violation.
362-374: Verify duplicate-violation retry works on both PostgreSQL and SQLite.The retry logic in server.py matches only on
"kai_violations_pkey", which catches Postgres primary-key errors but misses SQLite’s “UNIQUE constraint failed” messages. Update the handler to detect SQLite unique-constraint failures (e.g. inspect SQLAlchemy’s exception attributes or error codes) and add an integration test using the SQLite DSN that triggers a duplicate-violation IntegrityError to confirm the backoff/retry path.
JonahSussman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…882) Bug Fixes Improved reliability of incident creation under high concurrency with guarded retries and clearer failure reporting for rare duplicate-key races. Tests Added stress tests for multiple simultaneous users and duplicate-violation scenarios. Introduced tests covering sequential, concurrent, and batch creation of incidents sharing the same violation. Clarified test output messaging for readability. Related to #883 Signed-off-by: Fabian von Feilitzsch <[email protected]> Signed-off-by: Cherry Picker <[email protected]>
…882) Bug Fixes Improved reliability of incident creation under high concurrency with guarded retries and clearer failure reporting for rare duplicate-key races. Tests Added stress tests for multiple simultaneous users and duplicate-violation scenarios. Introduced tests covering sequential, concurrent, and batch creation of incidents sharing the same violation. Clarified test output messaging for readability. Related to #883 Signed-off-by: Fabian von Feilitzsch <[email protected]> Signed-off-by: Cherry Picker <[email protected]>
Summary by CodeRabbit
Bug Fixes
Tests
Related to #883