[coordinator] Fix status reporting of out-of-order component updates #13119

Open
VihasMakwana wants to merge 5 commits into elastic:main from VihasMakwana:fix-status-components

@VihasMakwana VihasMakwana commented Mar 11, 2026

What does this PR do?

When transitioning from the otel runtime to the process runtime, an otel component that takes too long to stop emits its Stopped state only after the stop timeout expires. By that time, the process runtime has already reported a Starting state for the same component.
On receiving the Stopped state from the old runtime, the coordinator erroneously removes the new Starting state.

This PR fixes the flow by introducing a new LastCreatedAt variable for a component. We only process a state update when it comes from the same instance of the component or from a newer instance.

Why is it important?

Buggy scenario:

  1. Component c is created at time=0s.
  2. It transitions to the Starting state. We report this state because it is the first state for this component.
  3. The user updates the config and changes the runtime.
  4. The following steps take place concurrently:
    • The coordinator stops the component and waits for it to exit.
    • The coordinator creates a new component model and runs it in process mode at time=1s. It also reports a Starting state for the component.
  5. The older component takes too long to stop, so we forcefully kill it and report a Stopped state.
  6. The coordinator receives this stale Stopped event and erroneously removes the new component from the status map.

After the PR:

  1. Component c is created at startTime=0s.
  2. It transitions to the Starting state. We report this state because it is the first state for this component.
  3. The user updates the config and changes the runtime.
  4. The following steps take place concurrently:
    • The coordinator stops the component and waits for it to exit.
    • The coordinator creates a new component model and runs it in process mode at startTime=1s. It also reports a Starting state for the component.
  5. The older component takes too long to stop, so we forcefully kill it and report a Stopped state.
  6. The coordinator receives this stale Stopped event but ignores it, since the stored startTime of the current component is later than the startTime of the instance that emitted the event.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

Related issues

@VihasMakwana VihasMakwana self-assigned this Mar 11, 2026
@VihasMakwana VihasMakwana added the Team:Elastic-Agent-Control-Plane, backport-8.19, and backport-9.3 labels Mar 11, 2026
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@VihasMakwana VihasMakwana requested a review from cmacknz March 11, 2026 09:19
@VihasMakwana VihasMakwana force-pushed the fix-status-components branch 2 times, most recently from c118dfe to 68abcc5 Compare March 11, 2026 09:26
@VihasMakwana VihasMakwana force-pushed the fix-status-components branch from 68abcc5 to faa0fdf Compare March 11, 2026 09:27

@swiatekm swiatekm left a comment

This is a pretty nice solution to the problem, well done! Have you tested whether it solves it in practice? I recall @cmacknz being able to reproduce it consistently on his dev machine.

@VihasMakwana

> This is a pretty nice solution to the problem, well done! Have you tested whether it solves it in practice? I recall @cmacknz being able to reproduce it consistently on his dev machine.

Unfortunately, I haven't been able to reproduce this. I'm thinking of adding a time.Sleep to the filebeat receiver's Shutdown method to simulate a delay for now and test this out.

-	Component: component.Component{
-		ID: id,
-	},
+	Component: comp.Component,

@VihasMakwana

This one is needed because the state coordinator needs LastConfiguredAt to correctly handle state transitions.

@VihasMakwana VihasMakwana requested a review from swiatekm March 11, 2026 14:47
@VihasMakwana

@swiatekm @cmacknz I was able to test my fix after adding a 5-second time.Sleep to the beatreceiver's shutdown method. The stale Stopped state no longer overwrites the process runtime's new Starting state.

@VihasMakwana VihasMakwana changed the title [status] Fix status reporting of out-of-order component updates [coordinator] Fix status reporting of out-of-order component updates Mar 11, 2026
@cmacknz

cmacknz commented Mar 11, 2026

Thanks, agree this is a nice solution. I'll let Mikolaj do the approving.

@elasticmachine

💛 Build succeeded, but was flaky

cc @VihasMakwana

Labels

backport-8.19: Automated backport to the 8.19 branch
backport-9.3: Automated backport to the 9.3 branch
Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team

Development

Successfully merging this pull request may close these issues.

[beats receivers] Switching from otel to process runtime can remove components from status output

4 participants