[processor/resourcedetection] Add support for dynamic refresh resource attributes #42697
base: main
Conversation
This seems like a neat enhancement.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
/label -stale
@dashpole @Aneurysm9 this is waiting for your review
/label -stale
Tests are failing due to #43625
- It may take some time for newly detected resource attributes to be applied.
- Changes to resource attributes can result in the creation of new metric time series.
- Frequent refreshes can increase resource usage and should be configured carefully.
This documentation could use a clarification pass. What is "some time"? How frequent is too frequent to refresh detected resource attributes? Should there be a lower bound on the refresh_interval?
Good points! I've clarified the documentation with:
- Specific timing: "up to refresh_interval duration"
- Performance guidance: Values below 5 minutes can increase resource usage; below 1 minute is strongly discouraged
- No enforced minimum, but clear recommendations provided
- Added context about metric cardinality impact
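For illustration only, a minimal Go sketch of what that configuration surface could look like, consistent with the "no enforced minimum" point (the struct, field name, and Validate method here are assumptions for this discussion, not necessarily the PR's actual code):

```go
package resourcedetectionprocessor

import (
	"errors"
	"time"
)

// Config is a trimmed-down sketch; the real processor config has many more fields.
type Config struct {
	// RefreshInterval controls how often resource detection is re-run.
	// The zero value (the default) disables background refresh entirely.
	RefreshInterval time.Duration `mapstructure:"refresh_interval"`
}

// Validate rejects only a negative interval; there is deliberately no enforced
// minimum. The README instead recommends staying at or above ~5 minutes and
// strongly discourages values below 1 minute.
func (cfg *Config) Validate() error {
	if cfg.RefreshInterval < 0 {
		return errors.New("refresh_interval must not be negative")
	}
	return nil
}
```

In a collector YAML config this would then be set as a duration string, e.g. something like refresh_interval: 10m under the processor's settings.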
defer p.mu.RUnlock()
res = p.detectedResource
if res == nil {
    return pcommon.NewResource(), "", nil
should we initialize res as pcommon.NewResource() so we don't need this check?
I've refactored this to initialize the return values upfront with their defaults, so we don't need the nil check anymore.
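Roughly, the refactor described here looks like the following sketch (the snapshot type and field names are stand-ins inferred from the diff, not the exact code):

```go
package internal

import (
	"context"
	"sync"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// snapshot is a hypothetical container for the last detection result.
type snapshot struct {
	resource  pcommon.Resource
	schemaURL string
}

type ResourceProvider struct {
	mu               sync.RWMutex
	detectedResource *snapshot
}

// Get initializes its return values with defaults upfront, so there is no
// separate early-return path for a missing snapshot.
func (p *ResourceProvider) Get(context.Context) (pcommon.Resource, string, error) {
	res := pcommon.NewResource() // default: empty resource
	schemaURL := ""              // default: no schema URL

	p.mu.RLock()
	defer p.mu.RUnlock()
	if p.detectedResource != nil {
		res = p.detectedResource.resource
		schemaURL = p.detectedResource.schemaURL
	}
	return res, schemaURL, nil
}
```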
p.detectedResource.resource = res
p.detectedResource.schemaURL = mergedSchemaURL
// Only fail if ALL detectors failed.
Is this the current behavior? It seems like we should fail if any of the detectors failed...
I've rephrased the comment to be clearer about the partial success behavior.
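For context, a self-contained sketch of the "only fail if ALL detectors failed" behavior; the detector interface, the errors.Join aggregation, and the attribute merge below are illustrative assumptions rather than the PR's exact code:

```go
package internal

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// detector is a simplified stand-in for the processor's detector interface.
type detector interface {
	Detect(ctx context.Context) (pcommon.Resource, string, error)
}

// detectAll merges whatever the detectors return. Individual failures are
// collected but tolerated; an error is reported only when every detector
// failed, which preserves partial success.
func detectAll(ctx context.Context, detectors []detector) (pcommon.Resource, error) {
	merged := pcommon.NewResource()
	var errs error
	succeeded := 0

	for _, d := range detectors {
		res, _, err := d.Detect(ctx)
		if err != nil {
			errs = errors.Join(errs, err)
			continue
		}
		succeeded++
		// Copy the detected attributes into the merged resource.
		res.Attributes().Range(func(k string, v pcommon.Value) bool {
			v.CopyTo(merged.Attributes().PutEmpty(k))
			return true
		})
	}

	if succeeded == 0 && errs != nil {
		return merged, errs // only fail if ALL detectors failed
	}
	return merged, nil
}
```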
// Keep the last good snapshot if the refresh errored OR returned an empty resource.
if hadPrevSuccess && (err != nil || IsEmptyResource(res)) {
    p.logger.Warn("resource refresh yielded empty or error; keeping previous snapshot", zap.Error(err))
    return prev.resource, prev.schemaURL, prev.err
If we return the previous error, won't that cause refreshLoop to log the previous error? That seems very confusing if the error differs from our current error.
Yeah, you're right, thanks for pointing that out. Since prev.err is guaranteed to be nil here (from the hadPrevSuccess check), I've changed it to explicitly return nil to make the intent clear and avoid any confusion in the logs.
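Concretely, the fallback path then reads something like this sketch (helper shape and names are assumed for illustration):

```go
package internal

import (
	"context"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.uber.org/zap"
)

// snapshot is a hypothetical container for the last successful detection.
type snapshot struct {
	resource  pcommon.Resource
	schemaURL string
}

// refreshOnce re-runs detection and keeps the last good snapshot when the new
// result is an error or an empty resource. Because a previous snapshot only
// exists after a successful refresh, the fallback returns an explicit nil
// error rather than a stale prev.err.
func refreshOnce(
	ctx context.Context,
	logger *zap.Logger,
	prev *snapshot,
	detect func(context.Context) (pcommon.Resource, string, error),
) (pcommon.Resource, string, error) {
	res, schemaURL, err := detect(ctx)
	// Attributes().Len() == 0 stands in for the IsEmptyResource check above.
	if prev != nil && (err != nil || res.Attributes().Len() == 0) {
		logger.Warn("resource refresh yielded empty or error; keeping previous snapshot", zap.Error(err))
		return prev.resource, prev.schemaURL, nil // explicit nil, not prev.err
	}
	return res, schemaURL, err
}
```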
    rdp.wg.Add(1)
    go rdp.refreshLoop(client)
}
return nil
Should we move the initial attempt at detection to Start so that we don't need the initOnce logic in the resource provider?
I've moved the initial detection to Start(), which now calls Refresh() directly, removed the initOnce field entirely, and made Get() truly read-only with no initialization logic. Tests have been updated accordingly to call Refresh() first for initialization. Please let me know what you think about this change 🙏
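A rough sketch of that Start() flow under the described refactor (the provider interface and method names mirror the discussion above but are assumptions, not the final code):

```go
package resourcedetectionprocessor

import (
	"context"
	"net/http"
	"time"

	"go.opentelemetry.io/collector/component"
)

// provider is a stand-in for the internal resource provider discussed above.
type provider interface {
	Refresh(ctx context.Context, client *http.Client) error
	StartRefreshing(interval time.Duration, client *http.Client)
}

type resourceDetectionProcessor struct {
	provider        provider
	client          *http.Client
	refreshInterval time.Duration
}

// Start performs the initial detection synchronously, so Get() can stay a
// pure read, then launches the background refresh only when an interval is
// configured. Whether a failed initial detection should fail Start is a
// separate design choice; the error is simply propagated here for brevity.
func (rdp *resourceDetectionProcessor) Start(ctx context.Context, _ component.Host) error {
	if err := rdp.provider.Refresh(ctx, rdp.client); err != nil {
		return err
	}
	if rdp.refreshInterval > 0 {
		rdp.provider.StartRefreshing(rdp.refreshInterval, rdp.client)
	}
	return nil
}
```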
stopCh  chan struct{}
wg      sync.WaitGroup
current atomic.Pointer[resourceSnapshot]
I think a big part of why the code is so confusing is that you are caching the latest resource here and in internal/resourcedetection.go. Please only store the current value in one place, and concentrate the refresh goroutine logic and handling of errors there.
I've refactored the code to store the resource in a single location (the provider's detectedResource). I removed the current atomic.Pointer from the processor and moved all refresh logic into the provider with StartRefreshing() and StopRefreshing() methods. Now the processor just calls provider.Get() to retrieve the cached resource, and the provider handles all the refresh goroutine logic and error handling internally. This eliminates the duplicate state and makes the architecture much cleaner. Let me know if it makes sense to you 🙏
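A condensed sketch of that single-owner design (the StartRefreshing/StopRefreshing names come from the reply above; the snapshot type and the refresh callback are placeholders):

```go
package internal

import (
	"sync"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// snapshot holds one detection result.
type snapshot struct {
	resource  pcommon.Resource
	schemaURL string
}

// ResourceProvider owns the single cached snapshot and the refresh goroutine.
type ResourceProvider struct {
	mu               sync.RWMutex
	detectedResource snapshot // the one and only cached copy

	stopCh chan struct{}
	wg     sync.WaitGroup
}

// StartRefreshing launches a background loop that re-runs detection on each
// tick and overwrites the cached snapshot under the write lock.
func (p *ResourceProvider) StartRefreshing(interval time.Duration, refresh func() snapshot) {
	p.stopCh = make(chan struct{})
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-p.stopCh:
				return
			case <-ticker.C:
				snap := refresh()
				p.mu.Lock()
				p.detectedResource = snap
				p.mu.Unlock()
			}
		}
	}()
}

// StopRefreshing signals the loop to exit and waits for it to finish.
func (p *ResourceProvider) StopRefreshing() {
	if p.stopCh == nil {
		return
	}
	close(p.stopCh)
	p.wg.Wait()
}
```

With this shape the processor only ever reads through the provider's Get(), and all ticking, locking, and error handling live in one place.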
Description
This PR adds support for a configurable refresh_interval in the resourcedetection processor. When set, resource detection re-runs periodically in the background, allowing resource attributes (e.g., cloud metadata, container tags) to be updated without restarting the collector.
Since refresh is disabled by default, the current behavior is maintained.
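For a sense of what this means downstream, a hypothetical sketch of how a cached detected resource is stamped onto incoming metrics; the function name and shape are illustrative only, not the PR's code:

```go
package resourcedetectionprocessor

import (
	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// applyDetected copies the currently cached detected attributes onto the
// resource of each incoming metric batch. Because the provider refreshes its
// cache in the background, updated cloud metadata or tags appear on new
// batches without a collector restart. (The real processor applies the
// equivalent logic per signal and also honors its override setting.)
func applyDetected(detected pcommon.Resource, md pmetric.Metrics) {
	rms := md.ResourceMetrics()
	for i := 0; i < rms.Len(); i++ {
		target := rms.At(i).Resource().Attributes()
		detected.Attributes().Range(func(k string, v pcommon.Value) bool {
			v.CopyTo(target.PutEmpty(k))
			return true
		})
	}
}
```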
Link to tracking issue
Fixes #42663
Testing
I tested this in AWS by updating tags while the processor was running.
I also added tests to cover this feature.
Documentation
Updated README.md with this parameter.