Conversation

@paulojmdias
Member

@paulojmdias paulojmdias commented Sep 15, 2025

Description

This PR adds support for a configurable refresh_interval to the resourcedetectionprocessor.
When set, resource detection re-runs periodically in the background, so resource attributes (e.g., cloud metadata, container tags) can be updated without restarting the collector.

Refresh is disabled by default, so the current behavior is preserved.
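
For context, the new option would surface in the processor's Go config roughly as follows (a sketch based on the description above; field names, tags, and defaults are assumptions, not copied from the diff):

```go
package resourcedetectionprocessor

import "time"

// Config sketch: only the parts relevant to this PR are shown.
type Config struct {
	// Detectors lists the detectors to run, e.g. []string{"ec2"}.
	Detectors []string `mapstructure:"detectors"`

	// RefreshInterval, when non-zero, re-runs resource detection
	// periodically in the background. The zero-value default disables
	// refreshing and preserves the current one-shot behavior.
	RefreshInterval time.Duration `mapstructure:"refresh_interval"`
}
```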

Link to tracking issue

Fixes #42663

Testing

I tested this in AWS by updating tags while the processor was running.

Unit tests covering the new behavior were also added.

Documentation

Updated README.md to document the new parameter.

@paulojmdias paulojmdias marked this pull request as ready for review September 18, 2025 22:07
@paulojmdias paulojmdias requested review from a team and dashpole as code owners September 18, 2025 22:07
@github-actions github-actions bot added the processor/resourcedetection Resource detection processor label Sep 18, 2025
@github-actions github-actions bot requested a review from Aneurysm9 September 18, 2025 22:07
@atoulme
Contributor

atoulme commented Sep 19, 2025

This seems like a neat enhancement.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 10, 2025
@paulojmdias
Member Author

/label -stale

@mx-psi
Member

mx-psi commented Oct 10, 2025

@dashpole @Aneurysm9 this is waiting for your review

@paulojmdias
Member Author

/label -stale

@github-actions github-actions bot removed the Stale label Oct 11, 2025
@paulojmdias
Member Author

Tests are failing due to #43625

Comment on lines 888 to 890
- It may take some time for newly detected resource attributes to be applied.
- Changes to resource attributes can result in the creation of new metric time series.
- Frequent refreshes can increase resource usage and should be configured carefully.
Member

This documentation could use a clarification pass. What is "some time"? How frequent is too frequent to refresh detected resource attributes? Should there be a lower bound on the refresh_interval?

Member Author

Good points! I've clarified the documentation:

- Specific timing: newly detected attributes are applied within "up to refresh_interval duration".
- Performance guidance: values below 5 minutes can increase resource usage; values below 1 minute are strongly discouraged.
- No enforced minimum, but clear recommendations are provided.
- Added context about the metric cardinality impact.
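
For reference, the clarified README notes might read along these lines (a sketch of the wording, not the exact diff):

- Newly detected resource attributes may take up to refresh_interval to be applied.
- Changes to resource attributes can create new metric time series, increasing cardinality.
- Values below 5 minutes can increase resource usage; values below 1 minute are strongly discouraged. No minimum is enforced.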

defer p.mu.RUnlock()
res = p.detectedResource
if res == nil {
	return pcommon.NewResource(), "", nil
Contributor

Should we initialize res as pcommon.NewResource() so we don't need this check?

Member Author

I've refactored this to initialize the return values upfront with their defaults, so we don't need the nil check anymore.
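
A minimal sketch of that shape, assuming the provider type and snapshot fields seen in the surrounding snippets:

```go
package internal

import (
	"sync"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// resourceSnapshot mirrors the fields referenced in the snippets above
// (assumed shape).
type resourceSnapshot struct {
	resource  pcommon.Resource
	schemaURL string
	err       error
}

type ResourceProvider struct {
	mu               sync.RWMutex
	detectedResource *resourceSnapshot
}

// Get initializes its return values with safe defaults upfront, so no nil
// check on the result is needed when nothing has been detected yet.
func (p *ResourceProvider) Get() (res pcommon.Resource, schemaURL string, err error) {
	res = pcommon.NewResource()

	p.mu.RLock()
	defer p.mu.RUnlock()

	if p.detectedResource != nil {
		res = p.detectedResource.resource
		schemaURL = p.detectedResource.schemaURL
		err = p.detectedResource.err
	}
	return res, schemaURL, err
}
```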


p.detectedResource.resource = res
p.detectedResource.schemaURL = mergedSchemaURL
// Only fail if ALL detectors failed.
Contributor

Is this the current behavior? It seems like we should fail if any of the detectors failed...

Member Author

I've rephrased the comment to be clearer about the partial success behavior.
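
To illustrate the partial-success semantics (a sketch; the detector interface below is a minimal stand-in for the processor's internal one, not its actual definition):

```go
package internal

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// detector is a minimal stand-in for the processor's detector interface.
type detector interface {
	Detect(ctx context.Context) (pcommon.Resource, string, error)
}

// detectAll tolerates individual detector failures and merges whatever
// succeeded; it returns an error only when every detector failed.
func detectAll(ctx context.Context, detectors []detector) (pcommon.Resource, string, error) {
	merged := pcommon.NewResource()
	var schemaURL string
	var errs []error

	for _, d := range detectors {
		res, url, err := d.Detect(ctx)
		if err != nil {
			errs = append(errs, err)
			continue
		}
		if url != "" {
			schemaURL = url
		}
		// Merge this detector's attributes into the combined resource.
		res.Attributes().Range(func(k string, v pcommon.Value) bool {
			v.CopyTo(merged.Attributes().PutEmpty(k))
			return true
		})
	}

	// Only fail if ALL detectors failed; partial success is success.
	if len(detectors) > 0 && len(errs) == len(detectors) {
		return merged, schemaURL, errors.Join(errs...)
	}
	return merged, schemaURL, nil
}
```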

// Keep the last good snapshot if the refresh errored OR returned an empty resource.
if hadPrevSuccess && (err != nil || IsEmptyResource(res)) {
	p.logger.Warn("resource refresh yielded empty or error; keeping previous snapshot", zap.Error(err))
	return prev.resource, prev.schemaURL, prev.err
Contributor

If we return the previous error, won't that cause refreshLoop to log the previous error? That seems very confusing if the error differs from our current error.

Member Author

Yeah, you're right, thanks for pointing that out. Since prev.err is guaranteed to be nil here (from the hadPrevSuccess check), I've changed it to explicitly return nil, which makes the intent clear and avoids any confusion in the logs.
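
The adjusted block would then look roughly like this (same fragment as above, with only the return value changed):

```go
// Keep the last good snapshot if the refresh errored OR returned an empty resource.
if hadPrevSuccess && (err != nil || IsEmptyResource(res)) {
	p.logger.Warn("resource refresh yielded empty or error; keeping previous snapshot", zap.Error(err))
	// hadPrevSuccess implies prev.err == nil, so return nil explicitly
	// instead of prev.err to avoid resurfacing a stale error in the logs.
	return prev.resource, prev.schemaURL, nil
}
```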

	rdp.wg.Add(1)
	go rdp.refreshLoop(client)
}
return nil
Contributor

Should we move the initial attempt at detection to Start so that we don't need the initOnce logic in the resource provider?

Member Author

I've moved the initial detection to Start(), which now calls Refresh() directly, removed the initOnce field entirely, and made Get() truly read-only with no initialization logic. Tests have been updated accordingly to call Refresh() first for initialization. Please let me know what you think about this change 🙏
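
A sketch of that flow (a fragment, not a compilable unit; Refresh, refreshLoop, and refreshInterval are assumed names following the discussion):

```go
// Start performs the initial detection synchronously, then launches the
// periodic refresh goroutine only when refresh_interval is configured.
func (rdp *resourceDetectionProcessor) Start(ctx context.Context, _ component.Host) error {
	// Initial detection now happens here, so the provider no longer
	// needs initOnce and Get() stays purely read-only.
	if err := rdp.provider.Refresh(ctx); err != nil {
		return err
	}
	if rdp.refreshInterval > 0 {
		rdp.wg.Add(1)
		go rdp.refreshLoop(ctx)
	}
	return nil
}
```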


stopCh  chan struct{}
wg      sync.WaitGroup
current atomic.Pointer[resourceSnapshot]
Contributor

I think a big part of why the code is so confusing is that you are caching the latest resource here and in internal/resourcedetection.go. Please only store the current value in one place, and concentrate the refresh goroutine logic and handling of errors there.

Member Author

I've refactored the code to store the resource in a single location (the provider's detectedResource). I removed the current atomic.Pointer from the processor and moved all refresh logic into the provider with StartRefreshing() and StopRefreshing() methods. Now the processor just calls provider.Get() to retrieve the cached resource, and the provider handles all the refresh goroutine logic and error handling internally. This eliminates the duplicate state and makes the architecture much cleaner. Let me know if it makes sense to you 🙏
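
A sketch of the consolidated provider (StartRefreshing/StopRefreshing are the names from the comment above; the loop body is an assumption, and stopCh/wg are the fields shown earlier, now living on the provider):

```go
// StartRefreshing launches the background loop that periodically re-runs
// detection; the provider is now the single owner of the cached resource
// and of the refresh goroutine.
func (p *ResourceProvider) StartRefreshing(ctx context.Context, interval time.Duration) {
	p.stopCh = make(chan struct{})
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// Refresh re-runs detection, swaps in the new snapshot,
				// and handles/logs errors internally.
				p.Refresh(ctx)
			case <-p.stopCh:
				return
			}
		}
	}()
}

// StopRefreshing signals the loop to exit and waits for it to finish.
func (p *ResourceProvider) StopRefreshing() {
	close(p.stopCh)
	p.wg.Wait()
}
```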
