-
Notifications
You must be signed in to change notification settings - Fork 2.7k
feat(query/v2): add max-trace-size limit with early termination to prevent OOM #7527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(query/v2): add max-trace-size limit with early termination to prevent OOM #7527
Conversation
…early termination Signed-off-by: ADITYATIWARI342005 <[email protected]>
|
Hi @yurishkuro @AnmolxSingh This PR is optimal and addresses the issue #7495 completely. Please review it. |
…n in translator Signed-off-by: ADITYATIWARI342005 <[email protected]>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7527 +/- ##
==========================================
- Coverage 96.50% 96.48% -0.03%
==========================================
Files 385 385
Lines 23316 23422 +106
==========================================
+ Hits 22501 22598 +97
- Misses 627 633 +6
- Partials 188 191 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Metrics Comparison SummaryTotal changes across all snapshots: 53 Detailed changes per snapshotsummary_metrics_snapshot_cassandra📊 Metrics Diff SummaryTotal Changes: 53
🆕 Added Metrics
View diff sample+http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="+Inf",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="0",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="10",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="100",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="1000",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="10000",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="25",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
...View diff sample+http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="+Inf",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.005",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.01",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.025",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.05",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.075",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.1",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
...View diff sample+http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="+Inf",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="0",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="10",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="100",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="1000",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="10000",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
+http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="25",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="0.62.0",server_address="localhost",server_port="13133",url_scheme="http"}
... |
|
@yurishkuro |
Add configurable trace size limiting to prevent OOM issues when retrieving large traces. Implements early termination with proper iterator stopping, comprehensive unit tests, and full configuration wiring through v1/v2 query services and jaeger extension. Addresses maintainer feedback by preserving original function signatures and adding validation tests. Signed-off-by: ADITYATIWARI342005 <[email protected]>
Fix critical bug where truncated traces would prevent processing of subsequent traces in the same sequence. Now properly resets truncated state when a new trace ID is detected, allowing multiple traces to be processed correctly even when some exceed the size limit. Signed-off-by: ADITYATIWARI342005 <[email protected]>
…ot review on the test Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com> Signed-off-by: ADITYA TIWARI <[email protected]>
Add tests to cover edge cases in applyTraceSizeLimit: - Error handling in sequence processing - Truncated trace continuation across multiple chunks - Multiple traces where some exceed limit while others are processed These tests ensure the max-trace-size feature works correctly in all scenarios." Signed-off-by: ADITYA TIWARI <[email protected]>
|
@yurishkuro |
- Remove redundant V1TracesFromSeq2WithLimit function - Merge limiting logic into V1TracesFromSeq2 with maxTraceSize parameter - Fix trace boundary detection with proper getFirstTraceID helper - Optimize to single-pass processing instead of multiple loops - Update all call sites to use unified API Signed-off-by: ADITYATIWARI342005 <[email protected]>
…undant logic) + Created single helper: extractTraceIDAndCount() - does both operations in one loop Signed-off-by: ADITYATIWARI342005 <[email protected]>
…ds repetition \n Core truncation scenarios covered No duplication with existing tests Less code to maintain Signed-off-by: ADITYATIWARI342005 <[email protected]>
|
Hi @yurishkuro , Thank you for your guidance and efforts so far! |
Which problem is this PR solving?
Description of the changes
Adds
max_trace_sizeparameter to v2 query service with early termination when limit is reached, preventing OOM on large traces.Key changes:
V1TracesFromSeq2now accepts explicitmaxTraceSize intparameterapplyTraceSizeLimithelper truncates traces and logs warningsModified files:
internal/storage/v2/v1adapter/translator.go- limiting logicinternal/storage/v2/v1adapter/translator_test.go- test coverageinternal/storage/v2/v1adapter/{tracereader,spanreader}.go- parameter wiringHow was this change tested?
go test ./internal/storage/v2/v1adapter -count=1passesChecklist