
Commit 59c4d11

Merge remote-tracking branch 'origin/master' into vandonr/process2
2 parents: c05d5d6 + fb53844

File tree

32 files changed: +4572 -1319 lines changed

.azure-pipelines/ultimate-pipeline.yml

Lines changed: 2 additions & 2 deletions
@@ -70,8 +70,8 @@ variables:
   nativeBuildDotnetSdkVersion: 7.0.306
   # These are the Managed DevOps pool names we use
   linuxTasksPool: azure-managed-linux-tasks
-  linuxX64SmokePool: azure-managed-linux-smoke
-  linuxX64Pool: azure-managed-linux-x64-1
+  linuxX64SmokePool: azure-managed-linux-smoke-2
+  linuxX64Pool: azure-managed-linux-x64-2
   linuxArm64Pool: azure-managed-linux-arm64-2
   windowsX64Pool: azure-managed-windows-x64-1

.gitlab-ci.yml

Lines changed: 9 additions & 0 deletions
@@ -232,3 +232,12 @@ dsm_throughput:
     - if: '$CI_PIPELINE_SOURCE == "schedule" && $BENCHMARK_RUN == "true"'
       when: always
     - when: manual
+
+
+validate_supported_configurations_local_file:
+  stage: build
+  rules:
+    - when: on_success
+  extends: .validate_supported_configurations_local_file
+  variables:
+    LOCAL_JSON_PATH: "tracer/src/Datadog.Trace/Configuration/supported-configurations.json"

.gitlab/one-pipeline.locked.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # DO NOT EDIT THIS FILE MANUALLY
 # This file is auto-generated by automation.
 include:
-  - remote: https://gitlab-templates.ddbuild.io/libdatadog/one-pipeline/ca/5ad4e568659a0e385e3cd429b7845ad8e2171cfe0e27ee5b9eeb4cd4b67825f5/one-pipeline.yml
+  - remote: https://gitlab-templates.ddbuild.io/libdatadog/one-pipeline/ca/b8fc1b7f45e49ddd65623f217f03c38def169aff6f2518380e5b415514e4cb81/one-pipeline.yml

docs/development/CI/TroubleshootingCIFailures.md

Lines changed: 264 additions & 1 deletion
@@ -54,6 +54,37 @@ The human-readable build URL format is:
 https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=<BUILD_ID>
 ```

+### Using Azure DevOps MCP (AI Assistant Integration)
+
+If you're using an AI assistant with the Azure DevOps MCP server, you can use these tools for cleaner queries:
+
+#### Get build information
+Ask your assistant to use `mcp__azure-devops__pipelines_get_builds` with:
+- `project: "dd-trace-dotnet"`
+- `buildIds: [<BUILD_ID>]`
+
+This returns structured build data including status, result, queue time, and trigger information.
+
+#### Get build logs
+Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log` with:
+- `project: "dd-trace-dotnet"`
+- `buildId: <BUILD_ID>`
+
+Note: Large builds may have very large logs that exceed token limits. In that case, fall back to curl/jq to target specific log IDs.
+
+#### Get a specific log by ID
+Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log_by_id` with:
+- `project: "dd-trace-dotnet"`
+- `buildId: <BUILD_ID>`
+- `logId: <LOG_ID>`
+- Optional: `startLine` and `endLine` to limit output
+
+**Advantages of the MCP approach:**
+- Structured JSON responses (no manual parsing)
+- Works naturally in conversation with AI assistants
+- Handles authentication automatically
+- Can combine multiple queries in a single request
+
 ## Investigating Test Failures

 ### Find failed tasks in a build
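The MCP build lookup added above wraps the same Azure DevOps REST API used throughout this guide. As a rough sketch only (the Builds list endpoint and `api-version=7.0` are assumed, and the field selection is illustrative), an equivalent query with curl/jq would look like:

```bash
# Sketch: a REST equivalent of the MCP pipelines_get_builds lookup (endpoint shape assumed)
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds?buildIds=<BUILD_ID>&api-version=7.0" \
| jq '.value[] | {id, status, result, queueTime, sourceBranch}'
```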
@@ -73,18 +104,48 @@ curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_a

 ### Download and search logs

+#### Option 1: Using curl (works everywhere)
+
 ```bash
 curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
 | grep -i "fail\|error"
 ```

+#### Option 2: Using Azure CLI (recommended on Windows)
+
+```bash
+az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?api-version=7.0" \
+2>&1 | grep -i "fail\|error"
+```
+
+Note: You may see a warning about authentication - this is safe to ignore for public builds.
+
+#### Option 3: Using GitHub CLI for a quick overview
+
+```bash
+# Get a quick summary of all checks for a PR
+gh pr checks <PR_NUMBER>
+
+# Get detailed PR status including links to Azure DevOps
+gh pr view <PR_NUMBER> --json statusCheckRollup
+```
+
 ### Get detailed context around failures

+Using curl:
+
 ```bash
 curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
 | grep -A 30 "TestName.That.Failed"
 ```

+Or with Azure CLI:
+
+```bash
+az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?api-version=7.0" \
+2>&1 | grep -A 30 "TestName.That.Failed"
+```
+
 ## Mapping Commits to Builds

 Azure DevOps builds test **merge commits** (`refs/pull/<PR_NUMBER>/merge`), not branch commits directly.
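Because Azure DevOps tests the merge ref rather than the branch head, one way to recover the branch commit behind a build is to inspect the merge commit's parents. This is only a sketch based on standard git/GitHub semantics (the second parent of a PR merge commit is normally the PR branch head):

```bash
# Sketch: inspect a PR merge ref's parents (second parent is normally the PR branch head)
git fetch origin "refs/pull/<PR_NUMBER>/merge"
git log -1 --format="merge=%H parents=%P" FETCH_HEAD
```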
@@ -107,8 +168,156 @@ To find which branch commit caused a failure:

 The build queued shortly after the commit was pushed is likely testing that commit.

+## Determining If Failures Are Related to Your Changes
+
+When tests fail on master after your PR is merged, determine whether the failures are new or pre-existing:
+
+### Compare with previous build on master
+
+```bash
+# List recent builds on master
+az pipelines runs list \
+--organization https://dev.azure.com/datadoghq \
+--project dd-trace-dotnet \
+--branch master \
+--top 10 \
+--query "[].{id:id, result:result, sourceVersion:sourceVersion, finishTime:finishTime}" \
+--output table
+
+# Find the build for the commit before yours
+git log --oneline HEAD~1..HEAD # Identify your commit and the previous one
+
+# Compare failed tasks between builds
+# Your build:
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<YOUR_BUILD_ID>/timeline" \
+| jq -r '.records[] | select(.result == "failed") | .name'
+
+# Previous build:
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<PREVIOUS_BUILD_ID>/timeline" \
+| jq -r '.records[] | select(.result == "failed") | .name'
+```
+
+**New failures** only appear in your build → likely related to your changes.
+**Same failures** appear in both → likely pre-existing/flaky tests.
+
+### Master-only tests
+
+Some tests (profiler integration tests, exploration tests) run only on the master branch, not on PRs. If you see failures on master that you didn't see on your PR:
+
+1. Not seeing them on the PR is expected - those tests don't run on PRs
+2. Compare with the previous successful master build to confirm they're new
+3. The failures are likely related to your changes
+
+## Understanding Test Infrastructure
+
+When test failures don't make obvious sense, investigate the test infrastructure to understand how tests are configured.
+
+### Finding test configuration
+
+Tests may set up environments differently than production code. For example:
+
+```bash
+# Find how a specific test sets up environment variables
+grep -r "DD_DOTNET_TRACER_HOME\|DD_TRACE_ENABLED" profiler/test/
+
+# Look for test helper classes
+find . -name "*EnvironmentHelper*.cs" -o -name "*TestRunner*.cs"
+
+# Check what environment variables a test actually sets
+# Read the test code path from the failing test name:
+# Example: Datadog.Profiler.SmokeTests.WebsiteAspNetCore01Test.CheckSmoke
+# Path: profiler/test/Datadog.Profiler.IntegrationTests/SmokeTests/WebsiteAspNetCore01Test.cs
+```
+
+**Common gotchas:**
+- Profiler tests may disable the tracer (`DD_TRACE_ENABLED=0`)
+- Different test suites (tracer vs profiler) have different configurations
+- Test environment may not match production deployment
+
+### Cross-cutting test failures
+
+Changes in one component may affect tests for another component:
+
+- **Managed tracer changes** may affect profiler tests (they share the managed loader)
+- **Native changes** may affect managed tests (if they change initialization order)
+- **Environment variable handling** may affect both tracer and profiler
+
+**Investigation strategy:**
+1. Identify which component the failing test is for (tracer, profiler, debugger, etc.)
+2. Compare with your changes - do they touch shared infrastructure?
+3. Check if the test configuration differs from production (e.g., disabled features)
+4. Trace through initialization code to find the interaction point
+
+### Tracing error messages to source code
+
+When you find an error message in logs, trace it back to source code:
+
+```bash
+# Search for the error message across the codebase
+grep -r "One or multiple services failed to start" .
+
+# Example output:
+# profiler/src/ProfilerEngine/Datadog.Profiler.Native/CorProfilerCallback.cpp:710
+# Log::Error("One or multiple services failed to start after a delay...");
+```
+
+This helps you understand:
+- Which component is logging the error (native/managed, tracer/profiler)
+- The context of the failure (initialization, shutdown, runtime)
+- Related code that might be affected
+
 ## Common Test Failure Patterns

+### Infrastructure Failures (Not Your Code)
+
+Some failures are infrastructure-related and can be retried without code changes:
+
+#### Docker Rate Limiting
+
+```
+toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
+```
+
+**Solution**: Retry the failed job in Azure DevOps. This is a transient Docker Hub rate limit issue.
+
+#### Timeout/Network Issues
+
+```
+##[error]The job running on runner X has exceeded the maximum execution time
+TLS handshake timeout
+Connection reset by peer
+```
+
+**Solution**: Retry the failed job. These are typically transient network issues.
+
+#### Identifying Flaky Tests and Retry Attempts
+
+Azure DevOps automatically retries some failed stages. You can identify retried tasks in the build timeline:
+
+**Using curl/jq:**
+```bash
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/timeline" \
+| jq -r '.records[] | select(.previousAttempts != null and (.previousAttempts | length) > 0) | "\(.name): attempt \(.attempt), previous attempts: \(.previousAttempts | length)"'
+```
+
+**Using Azure DevOps MCP:**
+Ask your assistant to check the build timeline for tasks with `previousAttempts` or `attempt > 1`.
+
+**What this means:**
+- `"attempt": 2` with `"result": "succeeded"` → The task failed initially but passed on retry (likely a flake)
+- `"previousAttempts": [...]` → Contains the IDs of previous failed attempts
+
+**When you see retried tasks:**
+1. If a task succeeded on retry after an initial failure, it's likely a flaky/intermittent issue
+2. The overall build result may still show as "failed" even if the retry succeeded, depending on pipeline configuration
+3. Check if the failure pattern is known (see "Flaky Profiler Stack Walking Failures" below)
+
+**How to retry a failed job:**
+1. Open the build in Azure DevOps: `https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=<BUILD_ID>`
+2. Find the failed stage/job
+3. Click the "..." menu → "Retry failed stages" or "Retry stage"
+4. Only failed stages are retried; successful stages are not re-run
+
 ### Unit Test Failures

 Failed unit tests typically appear in logs as:
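The "Compare failed tasks between builds" step in the hunk above can be collapsed into a single comparison by feeding both timeline queries to `diff`. A minimal sketch, reusing the same endpoint and jq filter as the docs (`<PREVIOUS_BUILD_ID>` and `<YOUR_BUILD_ID>` are placeholders as above):

```bash
# Sketch: diff the failed-task lists of the previous master build vs yours
BASE="https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds"
diff \
  <(curl -s "${BASE}/<PREVIOUS_BUILD_ID>/timeline" | jq -r '.records[] | select(.result == "failed") | .name' | sort) \
  <(curl -s "${BASE}/<YOUR_BUILD_ID>/timeline" | jq -r '.records[] | select(.result == "failed") | .name' | sort)
```

Lines prefixed with `>` in the output are failures that appear only in your build.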
@@ -131,6 +340,25 @@ Integration test failures may indicate:

 Check the specific integration test logs for details about which service or scenario failed.

+#### Flaky Profiler Stack Walking Failures (Alpine/musl)
+
+**Symptom:**
+```
+Failed to walk N stacks for sampled exception: E_FAIL (80004005)
+```
+or
+```
+Failed to walk N stacks for sampled exception: CORPROF_E_STACKSNAPSHOT_UNSAFE
+```
+
+**Appears in**: Smoke tests on Alpine Linux (musl libc), particularly `installer_smoke_tests` `linux alpine_3_1-alpine3_14`
+
+**Cause**: Race condition in the profiler when unwinding call stacks while threads are running. This is a known limitation on Alpine/musl platforms and appears intermittently.
+
+**Solution**: Retry the failed job. The smoke test check `CheckSmokeTestsForErrors` has an allowlist for known patterns, but some error codes like `E_FAIL` may occasionally slip through.
+
+**Note**: The profiler only logs these warnings every 100 failures to avoid log spam, so seeing this message indicates multiple stack walking attempts have failed.
+
 ### Build Failures

 Build failures typically show:
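Since the profiler only logs this warning every 100 failures (per the note above), a rough way to gauge how often stack walking actually failed in a smoke-test log is to count matches. A sketch using the same log-download endpoint as the rest of the guide (the grep pattern simply matches the symptom text shown above):

```bash
# Sketch: count known flaky stack-walking warnings in a downloaded log
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
| grep -c "Failed to walk .* stacks for sampled exception"
```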
@@ -159,6 +387,35 @@ Common verification failures:

 ## Example Investigation Workflow

+### Quick Investigation (AI Assistant with MCP)
+
+If you're using an AI assistant with Azure DevOps MCP:
+
+```
+"Why did Azure DevOps build <BUILD_ID> fail?"
+```
+
+The assistant will:
+1. Get build information using `mcp__azure-devops__pipelines_get_builds`
+2. Identify the result and any failed stages
+3. Check for retry attempts to identify flaky tests
+4. Provide guidance on whether to retry or investigate further
+
+### Quick Investigation (GitHub CLI)
+
+```bash
+# 1. Get quick overview of all checks
+gh pr checks <PR_NUMBER>
+
+# 2. If Azure DevOps checks failed, check the logs directly
+# Get the build ID from the Azure DevOps URL in the output above, then:
+BUILD_ID=<build_id_from_checks>
+az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/<LOG_ID>?api-version=7.0" \
+2>&1 | grep -i "error\|fail\|toomanyrequests"
+```
+
+### Detailed Investigation
+
 ```bash
 # 1. Find your PR number
 gh pr list --head <BRANCH_NAME>
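Pulling the build ID out of the check results (step 2 of the GitHub CLI flow above) can also be scripted. This is only a sketch: the exact URL field in `statusCheckRollup` differs between status contexts and check runs, so `targetUrl`/`detailsUrl` are assumptions and both are tried:

```bash
# Sketch: list Azure DevOps build IDs referenced by a PR's checks (field names may vary)
gh pr view <PR_NUMBER> --json statusCheckRollup \
| jq -r '.statusCheckRollup[] | (.targetUrl? // .detailsUrl? // empty)' \
| grep -o 'buildId=[0-9]*' | sort -u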
@@ -175,11 +432,17 @@ BUILD_ID=<your_build_id>
 curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/timeline" \
 | jq -r '.records[] | select(.result == "failed") | "\(.name): log.id=\(.log.id)"'

-# 4. Download and examine logs
+# 4. Download and examine logs (choose one method)
 LOG_ID=<log_id_from_above>
+
+# Using curl:
 curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}" \
 | grep -A 30 "FAIL"

+# Or using Azure CLI (Windows):
+az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}?api-version=7.0" \
+2>&1 | grep -A 30 "FAIL"
+
 # 5. Open build in browser for full details
 open "https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=${BUILD_ID}"
 ```
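Building on steps 3 and 4 of the detailed workflow, the logs of every failed task can also be fetched in one pass. A sketch that reuses the same timeline and log endpoints, assuming `BUILD_ID` is already set as in step 2 (the 50-line tail is arbitrary):

```bash
# Sketch: dump the tail of every failed task's log for a build
API="https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}"
curl -s "${API}/timeline" \
| jq -r '.records[] | select(.result == "failed" and .log != null) | .log.id' \
| while read -r log_id; do
    echo "=== log ${log_id} ==="
    curl -s "${API}/logs/${log_id}" | tail -n 50
  done
```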
