Fix roster fallback bug causing PID mapping corruption after pmm-agen… #4893
base: v3
Hi @MuraliMon, thanks for your contribution! Would you consider signing the License Agreement above?
I have agreed to the CLA, but it's not reflected here.
Git says "murali.a" contributed this code, not "MuraliMon". Please sign the CLA while logged in with the former account.
46644eb to 5b5019a
…t crash

This commit addresses a production incident where pmm-agent crashes resulted in incorrect PID mappings for RDS exporters. The core issue was that the roster fallback mechanism relied on stale database records instead of forcing a fresh state synchronization.

Problem: When pmm-agent restarts after crashing, it receives a new process ID, but the database still contains the old PID. The fallback logic attempted to update the obsolete PID, which failed silently, causing metrics to stop flowing.

Solution:
1. roster.go - Removed database fallback mechanism; now returns an error when roster entries are missing
2. handler.go - Added error handling to trigger state refresh when roster lookups fail
3. roster_test.go - Updated tests to expect errors for missing entries

Expected outcome: Upon restart, the system triggers a fresh state synchronization, rebuilding the roster with correct PIDs and automatically recovering from crashes.
5b5019a to d70bfaf
@ademidoff CLA is signed.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##               v3    #4893      +/-   ##
==========================================
+ Coverage   45.90%   45.97%   +0.07%
==========================================
  Files         366      362       -4
  Lines       38438    40710    +2272
==========================================
+ Hits        17646    18718    +1072
- Misses      19093    20302    +1209
+ Partials     1699     1690       -9
PMM-14267
This change fixes the root cause of RDS exporter PID mapping corruption that occurs when pmm-agent crashes and restarts. The bug was in the roster fallback mechanism that used stale database records instead of triggering a fresh state synchronization.
Problem:
Root Cause:
The fallback logic in roster.go:87-98 assumed that the database always holds the correct current state. This is false after a crash: database PIDs remain stale until the next successful SetState.
Solution:
Benefits:
Changes:
Testing:
To verify fix, simulate pmm-agent crash:
Refs: Production incident on 2026-01-06 in which a pmm-agent crash left 203 RDS agent records with incorrect PID mappings, resulting in missing node metrics.