@MuraliMon MuraliMon commented Jan 7, 2026

PMM-14267

This change fixes the root cause of RDS exporter PID mapping corruption that occurs when pmm-agent crashes and restarts. The bug was in the roster fallback mechanism that used stale database records instead of triggering a fresh state synchronization.

Problem:

  • pmm-agent maintains roster.m (in-memory map) tracking RDS exporter groups
  • When pmm-agent crashes (exit status 2), roster.m is lost
  • pmm-agent restarts with NEW PID (e.g., 38561)
  • Database still contains OLD PID (e.g., 223) from before crash
  • roster.get() fallback queries database with STALE PIDs
  • SetState tries to update OLD PID → fails silently
  • NEW PID never gets tracked → metrics stop flowing
  • Affects all 7 AWS accounts (multi-exporter architecture)

Root Cause:
The fallback logic in roster.go:87-98 assumed that the database always holds the correct current state. That assumption does not hold after a crash: database PIDs stay stale until the next successful SetState.
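
The failure mode can be illustrated with a minimal Go sketch. This is a simplified model, not the actual roster.go code: the record type, the lookupWithFallback function, and the agent ID "rds-exporter-1" are hypothetical, and only the example PIDs (223 and 38561) come from the description above.

```go
package main

import "fmt"

// record is a hypothetical, simplified database row for one exporter.
type record struct {
	AgentID string
	PID     int
}

// lookupWithFallback models the old behaviour: consult the in-memory map
// first and, if the entry is missing, fall back to whatever the database holds.
// It returns the resolved PID, whether it came from the database, and
// whether anything was found at all.
func lookupWithFallback(mem map[string]int, db []record, agentID string) (int, bool, bool) {
	if pid, found := mem[agentID]; found {
		return pid, false, true
	}
	for _, rec := range db {
		if rec.AgentID == agentID {
			return rec.PID, true, true
		}
	}
	return 0, false, false
}

func main() {
	// The database still holds the PID recorded before the crash.
	db := []record{{AgentID: "rds-exporter-1", PID: 223}}

	// The in-memory map is empty after the restart.
	mem := map[string]int{}

	// PID of the process that is actually running after the restart.
	const livePID = 38561

	pid, fromDB, _ := lookupWithFallback(mem, db, "rds-exporter-1")
	fmt.Printf("resolved pid=%d fromDB=%t stale=%t\n", pid, fromDB, pid != livePID)
}
```

In this model the fallback resolves the pre-crash PID without reporting any error, which is why the mismatch goes unnoticed.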

Solution:

  1. Remove database fallback from roster.get()
  2. Return error when roster entry not found (indicates stale state)
  3. Handle error in handler.go by triggering RequestStateUpdate()
  4. Fresh SetState rebuilds roster.m with correct NEW PIDs
  5. Normal operation resumes with accurate PID mappings
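
Steps 1 to 3 can be sketched in Go as follows. This is an illustration under stated assumptions, not the code in this PR: the roster type is a simplified stand-in for roster.m, and ErrRosterEntryMissing, stateUpdater, recordingUpdater, handleLookup, and all IDs are hypothetical names. The real RequestStateUpdate may have a different signature.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrRosterEntryMissing is a hypothetical sentinel error for a missing entry.
var ErrRosterEntryMissing = errors.New("roster entry missing")

// roster is a simplified stand-in for the in-memory map described above.
type roster struct {
	mu sync.RWMutex
	m  map[string][]string // group ID -> exporter agent IDs
}

func newRoster() *roster {
	return &roster{m: make(map[string][]string)}
}

// add stores the exporter IDs for a group, as a fresh SetState would.
func (r *roster) add(groupID string, agentIDs []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.m[groupID] = agentIDs
}

// get returns the exporter IDs for a group. There is no database fallback:
// a missing entry is reported as an error so the caller can react to it.
func (r *roster) get(groupID string) ([]string, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ids, ok := r.m[groupID]
	if !ok {
		return nil, fmt.Errorf("%w: %s", ErrRosterEntryMissing, groupID)
	}
	return ids, nil
}

// stateUpdater is a hypothetical, simplified view of the component that
// asks pmm-agent to resend its state.
type stateUpdater interface {
	RequestStateUpdate(pmmAgentID string)
}

// recordingUpdater records refresh requests so the sketch can be inspected.
type recordingUpdater struct{ requested []string }

func (u *recordingUpdater) RequestStateUpdate(pmmAgentID string) {
	u.requested = append(u.requested, pmmAgentID)
}

// handleLookup resolves a group and requests a state refresh when the
// roster has no entry for it.
func handleLookup(r *roster, u stateUpdater, pmmAgentID, groupID string) ([]string, error) {
	ids, err := r.get(groupID)
	if errors.Is(err, ErrRosterEntryMissing) {
		u.RequestStateUpdate(pmmAgentID)
		return nil, err
	}
	return ids, err
}

func main() {
	r := newRoster()
	u := &recordingUpdater{}

	// After a restart the roster is empty, so the lookup fails and a refresh is requested.
	_, err := handleLookup(r, u, "pmm-agent-1", "rds-group-1")
	fmt.Println("lookup error:", err)
	fmt.Println("refresh requested for:", u.requested)

	// A fresh SetState would repopulate the roster; later lookups then succeed.
	r.add("rds-group-1", []string{"rds-exporter-1"})
	ids, err := handleLookup(r, u, "pmm-agent-1", "rds-group-1")
	fmt.Println("after refresh:", ids, err)
}
```

Returning a wrapped sentinel error lets the caller distinguish "entry missing" from other failures with errors.Is and request a refresh only in that case.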

Benefits:

  • Fixes PID mapping corruption for ALL grouped agents (RDS exporters)
  • Self-healing: automatically recovers from pmm-agent crashes
  • Maintains multi-exporter architecture (7 processes for isolation)
  • No data loss: brief interruption during state refresh only
  • Prevents silent failures: errors are logged and handled

Changes:

  • managed/services/agents/roster.go: Remove DB fallback, return error for missing entries
  • managed/services/agents/handler.go: Catch roster errors, trigger state refresh
  • managed/services/agents/roster_test.go: Update tests to expect errors for missing entries

Testing:
To verify the fix, simulate a pmm-agent crash:

  1. Kill pmm-agent process (supervisord will restart it)
  2. Monitor logs for "Roster entry missing, triggering state refresh"
  3. Verify SetState is called with new PID
  4. Confirm all 7 RDS exporters get correct PIDs in database

Refs: Production incident on 2026-01-06 where pmm-agent crash caused 203 RDS agent records to have incorrect PID mappings, resulting in missing node metrics


@MuraliMon MuraliMon requested a review from a team as a code owner January 7, 2026 12:12
@MuraliMon MuraliMon requested review from JiriCtvrtka and idoqo and removed request for a team January 7, 2026 12:12
@it-percona-cla
Contributor

it-percona-cla commented Jan 7, 2026

CLA assistant check
All committers have signed the CLA.

@ademidoff
Member

Hi @MuraliMon,

Thanks for your contribution!

Would you consider signing the License Agreement above?

@MuraliMon
Author

I have agreed to the CLA, but it's not reflected here.

@ademidoff
Member

> I have agreed to the CLA, but it's not reflected here.

Git says "murali.a" contributed this code, not "MuraliMon". Please sign the CLA while logged in with the former account.

@MuraliMon MuraliMon force-pushed the fix-roster-fallback-bug branch from 46644eb to 5b5019a Compare February 9, 2026 06:11
…t crash

This commit addresses a production incident where pmm-agent crashes resulted
in incorrect PID mappings for RDS exporters. The core issue was that the
roster fallback mechanism relied on stale database records instead of forcing
a fresh state synchronization.

Problem:
When pmm-agent restarts after crashing, it receives a new process ID, but the
database still contains the old PID. The fallback logic attempted to update
the obsolete PID, which failed silently, causing metrics to stop flowing.

Solution:
1. roster.go - Removed database fallback mechanism; now returns an error when
   roster entries are missing
2. handler.go - Added error handling to trigger state refresh when roster
   lookups fail
3. roster_test.go - Updated tests to expect errors for missing entries

Expected outcome:
Upon restart, the system triggers a fresh state synchronization, rebuilding
the roster with correct PIDs and automatically recovering from crashes.
@MuraliMon MuraliMon force-pushed the fix-roster-fallback-bug branch from 5b5019a to d70bfaf Compare February 9, 2026 07:32
@MuraliMon
Author

@ademidoff CLA is signed

@codecov

codecov bot commented Feb 9, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 45.97%. Comparing base (ae3203a) to head (5302a92).
⚠️ Report is 2 commits behind head on v3.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| managed/services/agents/handler.go | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##               v3    #4893      +/-   ##
==========================================
+ Coverage   45.90%   45.97%   +0.07%     
==========================================
  Files         366      362       -4     
  Lines       38438    40710    +2272     
==========================================
+ Hits        17646    18718    +1072     
- Misses      19093    20302    +1209     
+ Partials     1699     1690       -9     

| Flag | Coverage Δ |
|---|---|
| managed | 46.58% <33.33%> (-0.02%) ⬇️ |
