@@ -54,6 +54,37 @@ The human-readable build URL format is:
5454https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=<BUILD_ID>
5555```
5656
57+ ### Using Azure DevOps MCP (AI Assistant Integration)
58+
59+ If you're using an AI assistant with the Azure DevOps MCP server, you can use these tools for cleaner queries:
60+
61+ #### Get build information
62+ Ask your assistant to use `mcp__azure-devops__pipelines_get_builds` with:
63+ - `project: "dd-trace-dotnet"`
64+ - `buildIds: [<BUILD_ID>]`
65+
66+ This returns structured build data including status, result, queue time, and trigger information.
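
If MCP isn't available, roughly the same summary can be pulled from the REST API used elsewhere in this guide (a sketch; the `jq` filter just picks a few common fields of the build object):

```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>?api-version=7.0" \
  | jq '{status, result, queueTime, sourceBranch, sourceVersion}'
```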
67+
68+ #### Get build logs
69+ Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log` with:
70+ - `project: "dd-trace-dotnet"`
71+ - `buildId: <BUILD_ID>`
72+
73+ Note: Logs for large builds can exceed token limits. In that case, fall back to curl/jq and target specific log IDs.
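
To find which log IDs to target, you can list the build's logs first (a sketch, assuming the logs endpoint's usual `value[]` response shape with `id` and `lineCount` fields):

```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs?api-version=7.0" \
  | jq -r '.value[] | "log \(.id): \(.lineCount) lines"'
```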
74+
75+ #### Get specific log by ID
76+ Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log_by_id` with:
77+ - `project: "dd-trace-dotnet"`
78+ - `buildId: <BUILD_ID>`
79+ - `logId: <LOG_ID>`
80+ - Optional: `startLine` and `endLine` to limit output
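
If MCP isn't available, the REST logs endpoint accepts the same kind of range limits via `startLine` and `endLine` query parameters (a sketch):

```bash
# Fetch only lines 1-200 of a specific log
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?startLine=1&endLine=200&api-version=7.0"
```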
81+
82+ **Advantages of the MCP approach:**
83+ - Structured JSON responses (no manual parsing)
84+ - Works naturally in conversation with AI assistants
85+ - Handles authentication automatically
86+ - Can combine multiple queries in a single request
87+
5788## Investigating Test Failures
5889
5990### Find failed tasks in a build
@@ -73,18 +104,48 @@ curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_a
73104
74105### Download and search logs
75106
107+ #### Option 1: Using curl (works everywhere)
108+
76109```bash
77110curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
78111 | grep -i "fail\|error"
79112```
80113
114+ #### Option 2: Using Azure CLI (recommended on Windows)
115+
116+ ```bash
117+ az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?api-version=7.0" \
118+ 2>&1 | grep -i "fail\|error"
119+ ```
120+
121+ Note: You may see a warning about authentication - this is safe to ignore for public builds.
122+
123+ #### Option 3: Using GitHub CLI for a quick overview
124+
125+ ```bash
126+ # Get quick summary of all checks for a PR
127+ gh pr checks <PR_NUMBER>
128+
129+ # Get detailed PR status including links to Azure DevOps
130+ gh pr view <PR_NUMBER> --json statusCheckRollup
131+ ```
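
To pull out just the failing checks and their Azure DevOps links, a `--jq` filter along these lines can help (a sketch: the field names depend on whether a check is reported as a check run or a commit status, so treat `conclusion`/`state` and `detailsUrl`/`targetUrl` as assumptions to adjust):

```bash
gh pr view <PR_NUMBER> --json statusCheckRollup \
  --jq '.statusCheckRollup[]
        | select((.conclusion // .state) == "FAILURE")
        | "\(.name // .context)\t\(.detailsUrl // .targetUrl)"'
```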
132+
81133### Get detailed context around failures
82134
135+ Using curl:
136+
83137```bash
84138curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
85139 | grep -A 30 "TestName.That.Failed"
86140```
87141
142+ Or with Azure CLI:
143+
144+ ```bash
145+ az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?api-version=7.0" \
146+ 2>&1 | grep -A 30 "TestName.That.Failed"
147+ ```
148+
88149## Mapping Commits to Builds
89150
90151Azure DevOps builds test **merge commits** (`refs/pull/<PR_NUMBER>/merge`), not branch commits directly.
@@ -107,8 +168,156 @@ To find which branch commit caused a failure:
107168
108169The build queued shortly after the commit was pushed is likely testing that commit.
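
One way to make that correlation concrete (a sketch; it assumes the Builds list API's `branchName` and `$top` filters and the `queueTime`/`sourceVersion` fields):

```bash
# The commit's committer date (a rough proxy for when it was pushed)
git show -s --format='%H %cI' <COMMIT_SHA>

# Which builds were queued against the PR's merge ref, and when?
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds?branchName=refs/pull/<PR_NUMBER>/merge&\$top=10&api-version=7.0" \
  | jq -r '.value[] | "\(.id)\t\(.queueTime)\t\(.sourceVersion)"'
```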
109170
171+ ## Determining If Failures Are Related to Your Changes
172+
173+ When tests fail on master after your PR is merged, determine if failures are new or pre-existing:
174+
175+ ### Compare with previous build on master
176+
177+ ```bash
178+ # List recent builds on master
179+ az pipelines runs list \
180+ --organization https://dev.azure.com/datadoghq \
181+ --project dd-trace-dotnet \
182+ --branch master \
183+ --top 10 \
184+ --query "[].{id:id, result:result, sourceVersion:sourceVersion, finishTime:finishTime}" \
185+ --output table
186+
187+ # Find the build for the commit before yours
188+ git log --oneline -2  # Identify your commit and the previous one
189+
190+ # Compare failed tasks between builds
191+ # Your build:
192+ curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<YOUR_BUILD_ID>/timeline" \
193+ | jq -r '.records[] | select(.result == "failed") | .name'
194+
195+ # Previous build:
196+ curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<PREVIOUS_BUILD_ID>/timeline" \
197+ | jq -r '.records[] | select(.result == "failed") | .name'
198+ ```
199+
200+ **New failures** appear only in your build → likely related to your changes.
201+ **Same failures** appear in both builds → likely pre-existing/flaky tests.
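
To isolate the new failures mechanically, you can diff the two lists from the queries above (a sketch using `comm` on the sorted task names):

```bash
# Prints failed task names that appear only in YOUR build
comm -13 \
  <(curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<PREVIOUS_BUILD_ID>/timeline" \
      | jq -r '.records[] | select(.result == "failed") | .name' | sort) \
  <(curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<YOUR_BUILD_ID>/timeline" \
      | jq -r '.records[] | select(.result == "failed") | .name' | sort)
```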
202+
203+ ### Master-only tests
204+
205+ Some tests (profiler integration tests, exploration tests) run only on the master branch, not on PRs. If you see failures on master that you didn't see on your PR:
206+
207+ 1. This is expected: those tests simply don't run on PRs.
208+ 2. Compare with the previous successful master build to confirm the failures are new.
209+ 3. If they are new, they are likely related to your changes.
210+
211+ ## Understanding Test Infrastructure
212+
213+ When a test failure has no obvious cause, investigate the test infrastructure to understand how the test is configured.
214+
215+ ### Finding test configuration
216+
217+ Tests may set up environments differently than production code. For example:
218+
219+ ```bash
220+ # Find how a specific test sets up environment variables
221+ grep -r "DD_DOTNET_TRACER_HOME\|DD_TRACE_ENABLED" profiler/test/
222+
223+ # Look for test helper classes
224+ find . -name "*EnvironmentHelper*.cs" -o -name "*TestRunner*.cs"
225+
226+ # Check what environment variables a test actually sets
227+ # Read the test code path from failing test name:
228+ # Example: Datadog.Profiler.SmokeTests.WebsiteAspNetCore01Test.CheckSmoke
229+ # Path: profiler/test/Datadog.Profiler.IntegrationTests/SmokeTests/WebsiteAspNetCore01Test.cs
230+ ```
231+
232+ **Common gotchas:**
233+ - Profiler tests may disable the tracer (`DD_TRACE_ENABLED=0`)
234+ - Different test suites (tracer vs profiler) have different configurations
235+ - Test environment may not match production deployment
236+
237+ ### Cross-cutting test failures
238+
239+ Changes in one component may affect tests for another component:
240+
241+ - **Managed tracer changes** may affect profiler tests (they share the managed loader)
242+ - **Native changes** may affect managed tests (if they change initialization order)
243+ - **Environment variable handling** may affect both tracer and profiler
244+
245+ **Investigation strategy:**
246+ 1. Identify which component the failing test is for (tracer, profiler, debugger, etc.)
247+ 2. Compare with your changes: do they touch shared infrastructure? (See the sketch below.)
248+ 3. Check if test configuration differs from production (e.g., disabled features)
249+ 4. Trace through initialization code to find the interaction point
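
For step 2, a quick way to see whether your commit touches shared infrastructure (a sketch; `<YOUR_COMMIT_SHA>` is a placeholder):

```bash
# Files changed by your commit
git show --stat --oneline <YOUR_COMMIT_SHA>

# Group the changed paths by top-level component (tracer/, profiler/, shared/, ...)
git diff --name-only <YOUR_COMMIT_SHA>~1 <YOUR_COMMIT_SHA> | cut -d/ -f1-2 | sort | uniq -c
```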
250+
251+ ### Tracing error messages to source code
252+
253+ When you find an error message in logs, trace it back to source code:
254+
255+ ```bash
256+ # Search for the error message across the codebase
257+ grep -r "One or multiple services failed to start" .
258+
259+ # Example output:
260+ # profiler/src/ProfilerEngine/Datadog.Profiler.Native/CorProfilerCallback.cpp:710
261+ # Log::Error("One or multiple services failed to start after a delay...");
262+ ```
263+
264+ This helps you understand:
265+ - Which component is logging the error (native/managed, tracer/profiler)
266+ - The context of the failure (initialization, shutdown, runtime)
267+ - Related code that might be affected
268+
110269## Common Test Failure Patterns
111270
271+ ### Infrastructure Failures (Not Your Code)
272+
273+ Some failures are infrastructure-related and can be retried without code changes:
274+
275+ #### Docker Rate Limiting
276+
277+ ```
278+ toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
279+ ```
280+
281+ **Solution**: Retry the failed job in Azure DevOps. This is a transient Docker Hub rate limit issue.
282+
283+ #### Timeout/Network Issues
284+
285+ ```
286+ ##[error]The job running on runner X has exceeded the maximum execution time
287+ TLS handshake timeout
288+ Connection reset by peer
289+ ```
290+
291+ **Solution**: Retry the failed job. These are typically transient network issues.
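
Before retrying, you can confirm the log matches one of these transient signatures (a sketch reusing the log endpoint from earlier):

```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
  | grep -iE "toomanyrequests|TLS handshake timeout|Connection reset by peer|exceeded the maximum execution time"
```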
292+
293+ #### Identifying Flaky Tests and Retry Attempts
294+
295+ Azure DevOps automatically retries some failed stages. You can identify retried tasks in the build timeline:
296+
297+ **Using curl/jq:**
298+ ```bash
299+ curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/timeline" \
300+ | jq -r '.records[] | select(.previousAttempts != null and (.previousAttempts | length) > 0) | "\(.name): attempt \(.attempt), previous attempts: \(.previousAttempts | length)"'
301+ ```
302+
303+ **Using Azure DevOps MCP:**
304+ Ask your assistant to check the build timeline for tasks with `previousAttempts` or `attempt > 1`.
305+
306+ **What this means:**
307+ - `"attempt": 2` with `"result": "succeeded"` → The task failed initially but passed on retry (likely a flake)
308+ - `"previousAttempts": [...]` → Contains IDs of previous failed attempts
309+
310+ **When you see retried tasks:**
311+ 1. If a task succeeded on retry after an initial failure, it's likely a flaky/intermittent issue
312+ 2. The overall build result may still show as "failed" even if the retry succeeded, depending on pipeline configuration
313+ 3. Check if the failure pattern is known (see "Flaky Profiler Stack Walking Failures" below)
314+
315+ **How to retry a failed job:**
316+ 1. Open the build in Azure DevOps: `https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=<BUILD_ID>`
317+ 2. Find the failed stage/job
318+ 3. Click the "..." menu → "Retry failed stages" or "Retry stage"
319+ 4. Only failed stages will be retried; successful stages are not re-run
320+
112321### Unit Test Failures
113322
114323Failed unit tests typically appear in logs as:
@@ -131,6 +340,25 @@ Integration test failures may indicate:
131340
132341Check the specific integration test logs for details about which service or scenario failed.
133342
343+ #### Flaky Profiler Stack Walking Failures (Alpine/musl)
344+
345+ **Symptom:**
346+ ```
347+ Failed to walk N stacks for sampled exception: E_FAIL (80004005)
348+ ```
349+ or
350+ ```
351+ Failed to walk N stacks for sampled exception: CORPROF_E_STACKSNAPSHOT_UNSAFE
352+ ```
353+
354+ **Appears in**: Smoke tests on Alpine Linux (musl libc), particularly `installer_smoke_tests` → `linux alpine_3_1-alpine3_14`
355+
356+ **Cause**: Race condition in the profiler when unwinding call stacks while threads are running. This is a known limitation on Alpine/musl platforms and appears intermittently.
357+
358+ **Solution**: Retry the failed job. The smoke test check `CheckSmokeTestsForErrors` has an allowlist for known patterns, but some error codes like `E_FAIL` may occasionally slip through.
359+
360+ **Note**: The profiler logs these warnings only once every 100 failures to avoid log spam, so seeing this message indicates that many stack walking attempts have failed.
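
To confirm a smoke test failure matches this known pattern before retrying (a sketch; it just counts occurrences in the job's log):

```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
  | grep -cE "Failed to walk [0-9]+ stacks for sampled exception"
```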
361+
134362### Build Failures
135363
136364Build failures typically show:
@@ -159,6 +387,35 @@ Common verification failures:
159387
160388## Example Investigation Workflow
161389
390+ ### Quick Investigation (AI Assistant with MCP)
391+
392+ If you're using an AI assistant with Azure DevOps MCP:
393+
394+ ```
395+ "Why did Azure DevOps build <BUILD_ID> fail?"
396+ ```
397+
398+ The assistant will:
399+ 1. Get build information using `mcp__azure-devops__pipelines_get_builds`
400+ 2. Identify the result and any failed stages
401+ 3. Check for retry attempts to identify flaky tests
402+ 4. Provide guidance on whether to retry or investigate further
403+
404+ ### Quick Investigation (GitHub CLI)
405+
406+ ```bash
407+ # 1. Get quick overview of all checks
408+ gh pr checks <PR_NUMBER>
409+
410+ # 2. If Azure DevOps checks failed, check the logs directly
411+ # Get the build ID from the Azure DevOps URL in the output above, then:
412+ BUILD_ID=<build_id_from_checks>
413+ az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/<LOG_ID>?api-version=7.0" \
414+ 2>&1 | grep -i "error\|fail\|toomanyrequests"
415+ ```
416+
417+ ### Detailed Investigation
418+
162419```bash
163420# 1. Find your PR number
164421gh pr list --head <BRANCH_NAME>
@@ -175,11 +432,17 @@ BUILD_ID=<your_build_id>
175432curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/timeline" \
176433 | jq -r '.records[] | select(.result == "failed") | "\(.name): log.id=\(.log.id)"'
177434
178- # 4. Download and examine logs
435+ # 4. Download and examine logs (choose one method)
179436LOG_ID=<log_id_from_above>
437+
438+ # Using curl:
180439curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}" \
181440 | grep -A 30 "FAIL"
182441
442+ # Or using Azure CLI (Windows):
443+ az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}?api-version=7.0" \
444+ 2>&1 | grep -A 30 "FAIL"
445+
183446# 5. Open build in browser for full details
184447open "https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=${BUILD_ID}"
185448```