Contributor

Copilot AI commented Oct 13, 2025

Fix race condition in concurrent DeploymentState access causing intermittent AzureDeployerTests failures

Root Cause Analysis

The tests fail intermittently due to a race condition when multiple Azure Bicep resources are provisioned in parallel (via Task.WhenAll in AzureEnvironmentResource.ProvisionAzureBicepResourcesAsync). All provisioning tasks share the same DeploymentState JsonObject through ProvisioningContext and access it concurrently without synchronization.

The race condition occurs when:

  1. Multiple threads check jsonObj[key] and get null
  2. They try to add the same property
  3. One succeeds, others fail or cause enumeration errors
  4. During retries, another thread modifies the collection
  5. This causes InvalidOperationException: Collection was modified or KeyNotFoundException
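
The check-then-add hazard in steps 1 to 3 can be replayed deterministically on a single thread. The sketch below is illustrative only: `CheckThenAddReplay` and the `"Azure"` key are hypothetical names, and it uses plain `JsonObject.Add` rather than the TryAdd-with-retry logic in JsonExtensions.cs. Truly concurrent mutation can fail in less predictable ways, as the list above describes.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical helper, not part of the Aspire code base.
public static class CheckThenAddReplay
{
    // Replays the interleaving where two callers both observe a missing
    // key before either of them adds it.
    public static bool SecondAddFails()
    {
        var state = new JsonObject();
        const string key = "Azure";

        // Step 1: both callers check the key and see null.
        bool callerASawNull = state[key] is null;
        bool callerBSawNull = state[key] is null;

        // Step 2: caller A adds the property and succeeds.
        if (callerASawNull)
        {
            state.Add(key, new JsonObject());
        }

        // Step 3: caller B acts on its stale check and fails, because
        // JsonObject.Add rejects a duplicate property name.
        try
        {
            if (callerBSawNull)
            {
                state.Add(key, new JsonObject());
            }
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

Holding one lock across both the check and the add removes the window between steps 1 and 2.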

Implementation Plan

  • Analyze the issue and identify root cause (concurrent access to shared DeploymentState)
  • Add thread-safe synchronization at ProvisioningContext level (not just Prop method)
  • Update BicepProvisioner to use WithDeploymentState for all state modifications
  • Build and test the fix
  • Verify tests pass consistently (ran multiple tests 5 times each - all passed)
  • Update unit tests to reflect new design
  • Verify concurrent tests pass consistently (ran 5 times - all passed)
  • Revert unnecessary changes to JsonExtensions.cs per code review feedback
  • Clean up and finalize

Changes Made

  1. Modified src/Aspire.Hosting.Azure/Provisioning/ProvisioningContext.cs

    • Added private _deploymentStateLock object
    • Added WithDeploymentState(Action<JsonObject>) method for synchronized state access
    • Added WithDeploymentState<T>(Func<JsonObject, T>) method for synchronized state reads
    • This provides thread-safety at the point where the JsonObject is accessed
  2. Modified src/Aspire.Hosting.Azure/Provisioning/Provisioners/BicepProvisioner.cs

    • Wrapped all DeploymentState modifications in context.WithDeploymentState()
    • Ensures all concurrent access is properly synchronized
  3. src/Aspire.Hosting.Azure/Provisioning/JsonExtensions.cs

    • Kept original implementation unchanged
    • The existing TryAdd logic with retry handles race conditions at the JSON level
    • Thread-safety is now primarily enforced at the ProvisioningContext level
  4. Updated tests/Aspire.Hosting.Azure.Tests/JsonExtensionsTests.cs

    • Removed concurrent access test (no longer applicable since thread-safety is at ProvisioningContext level)
    • Kept basic functional tests for Prop method
  5. Updated tests/Aspire.Hosting.Azure.Tests/ProvisioningContextTests.cs

    • Added WithDeploymentState_ConcurrentAccess_IsThreadSafe test
    • Added WithDeploymentState_Action_ExecutesSuccessfully test
    • Added WithDeploymentState_Func_ReturnsValue test
    • Added WithDeploymentState_ConcurrentReadsAndWrites_MaintainsConsistency test
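
A minimal sketch of the locking pattern described in change 1, assuming a private lock object guarding a single JsonObject. `DeploymentStateHolder` is a hypothetical stand-in, not the actual ProvisioningContext; only the `_deploymentStateLock` field name and the two `WithDeploymentState` signatures come from the list above.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical stand-in for ProvisioningContext; only the locking pattern is shown.
public sealed class DeploymentStateHolder
{
    // Private lock object: no outside code can lock on it.
    private readonly object _deploymentStateLock = new();
    private readonly JsonObject _deploymentState = new();

    // Runs a mutation while holding the lock.
    public void WithDeploymentState(Action<JsonObject> action)
    {
        ArgumentNullException.ThrowIfNull(action);
        lock (_deploymentStateLock)
        {
            action(_deploymentState);
        }
    }

    // Runs a read while holding the lock and returns its result.
    public T WithDeploymentState<T>(Func<JsonObject, T> func)
    {
        ArgumentNullException.ThrowIfNull(func);
        lock (_deploymentStateLock)
        {
            return func(_deploymentState);
        }
    }
}
```

A caller would write `holder.WithDeploymentState(state => { state["name"] = "acr"; });` and read back with `holder.WithDeploymentState<string?>(state => state["name"]?.GetValue<string>());`. The read overload should return a value copied out of the JSON rather than a live JsonNode, since a node that escapes the lock can still be touched concurrently.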

Test Results

All tests pass consistently:

  • JsonExtensionsTests: 3/3 tests passing
  • ProvisioningContextTests: 14/14 tests passing (includes 4 new thread-safety tests)
  • AzureDeployerTests: 18/18 tests passing

Design Rationale

Synchronizing at the ProvisioningContext level is preferable to locking inside Prop because:

  1. Addresses the root cause: Protects ALL DeploymentState access, not just Prop calls
  2. Prevents future issues: Any code accessing DeploymentState must use WithDeploymentState
  3. Clear ownership: ProvisioningContext owns the lock for its DeploymentState
  4. Follows best practices: Lock is a private object owned by the class
  5. More maintainable: Thread-safety is explicit and localized to ProvisioningContext

This addresses the feedback that making Prop thread-safe "just kicks the problem down to the next thing that breaks when multiple threads are reading/writing JSON at the same time."

Original prompt

This section details the original issue you should resolve

<issue_title>[Failing test]: Aspire.Hosting.Azure.Tests.AzureDeployerTests fail on Azure DevOps: the given key 'AZURE_CONTAINER_REGISTRY_NAME' was not present in the dictionary.</issue_title>
<issue_description>### Is there an existing issue for this?

  • I have searched the existing issues

Build information

Build:

Fill in the error message template

Error Message

Multiple tests in Aspire.Hosting.Azure.Tests.AzureDeployerTests are failing exclusively on Azure DevOps (passing on GitHub Actions workflows) with KeyNotFoundException when trying to access Azure resource configuration keys.

Primary Failures (KeyNotFoundException)

All 7 tests fail with the same root cause - missing Azure resource configuration keys:

  1. DeployAsync_WithMultipleComputeEnvironments_Works (line 370) - Missing: AZURE_CONTAINER_REGISTRY_NAME
  2. DeployAsync_WithContainer_Works (line 188) - Missing: AZURE_CONTAINER_REGISTRY_NAME
  3. DeployAsync_WithProjectResource_Works (line 291) - Missing: AZURE_CONTAINER_REGISTRY_NAME
  4. DeployAsync_WithDockerfile_Works (line 235) - Missing: AZURE_CONTAINER_REGISTRY_NAME
  5. DeployAsync_WithOnlyAzureResources_PrintsDashboardUrl (line 639) - Missing: AZURE_CONTAINER_APPS_ENVIRONMENT_DEFAULT_DOMAIN
  6. DeployAsync_WithSingleRedisCache_CallsDeployingComputeResources (line 589) - Missing: AZURE_CONTAINER_REGISTRY_NAME
  7. DeployAsync_WithAzureFunctionsProject_Works (line 891) - Missing: AZURE_CONTAINER_REGISTRY_NAME

Secondary Failures (Missing Project Files)

2 additional tests failed due to missing test project files:

  • PublishAsAzureAppServiceWebsite_ThrowsIfNoEnvironment
  • MultipleAzureAppServiceEnvironmentsSupported

Stack Trace

System.Collections.Generic.KeyNotFoundException : The given key 'AZURE_CONTAINER_REGISTRY_NAME' was not present in the dictionary.
    at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
    at Aspire.Hosting.Azure.Tests.AzureDeployerTests.DeployAsync_WithMultipleComputeEnvironments_Works() in D:\a\_work\1\s\tests\Aspire.Hosting.Azure.Tests\AzureDeployerTests.cs:line 370
    --- End of stack trace from previous location ---

Similar stack traces occur for other failing tests at different line numbers (188, 235, 291, 589, 639, 891).

Notable Environment Differences (AzDO vs GitHub Actions)

  • Docker Unavailability: Could not harvest all abandoned containers {"Error": "exec: \"docker\": executable file not found in %PATH%"}
  • User Secrets Path Issues (warnings):
warn: Aspire.Hosting.Azure.Provisioning.Internal.DefaultUserSecretsManager[0]
      Failed to save user secrets.
      System.InvalidOperationException: User secrets path could not be determined.

Other info

Test Context and Environment

The tests are using mock Azure provisioning with test credentials:
info: Aspire.Hosting.Azure.Provisioning.Internal.PublishModeProvisioningContextProvider[0]
Default subscription: Test Subscription (/subscriptions/12345678-1234-1234-1234-123456789012)
info: Aspire.Hosting.Azure.Provisioning.Internal.PublishModeProvisioningContextProvider[0]
Tenant: 87654321-4321-4321-4321-210987654321
info: Aspire.Hosting.Azure.Provisioning.Internal.PublishModeProvisioningContextProvider[0]
Using existing resource group test-rg.

Analysis

Root Cause: The tests expect Azure resource configuration keys (primarily AZURE_CONTAINER_REGISTRY_NAME and AZURE_CONTAINER_APPS_ENVIRONMENT_DEFAULT_DOMAIN) to be present in a dictionary during the deployment provisioning phase, but these keys are not being populated in the AzDO/Helix environment.

Report

Summary

24-Hour Hit Count | 7-Day Hit Count | 1-Month Count
0 | 0 | 0

<agent_instructions>This seems to be a race condition. It fails only sometimes. Debug the test, and figure out how to reproduce the issue. Once you can reproduce it, figure out how to fix the problem.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@radical cc @mitchdenny @eerhardt @captainsafia The fact that these ...

Fixes #11728


@github-actions
Contributor

github-actions bot commented Oct 13, 2025

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 11974

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 11974"

@radical radical marked this pull request as ready for review October 13, 2025 02:07
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a race condition in JsonExtensions.Prop() that was causing intermittent failures in AzureDeployerTests when multiple Azure Bicep resources were provisioned concurrently. The race condition occurred when multiple threads tried to access the same JSON property simultaneously, leading to InvalidOperationException or KeyNotFoundException.

Key changes:

  • Added thread-safe locking mechanism to JsonExtensions.Prop() method
  • Simplified the get-or-create logic by replacing TryAdd with lock-protected Add
  • Added comprehensive unit tests to verify thread-safety and concurrent access behavior

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
src/Aspire.Hosting.Azure/Provisioning/JsonExtensions.cs | Added lock-based synchronization to prevent race conditions in concurrent property access
tests/Aspire.Hosting.Azure.Tests/JsonExtensionsTests.cs | Added unit tests including concurrent access verification and edge case coverage

if (node is not null)
// Lock on the JsonObject to ensure thread-safe access when multiple
// bicep resources are being provisioned in parallel
lock (jsonObj)

Copilot AI Oct 13, 2025

Locking on the JsonObject instance itself can be problematic as it's a public object that external code might also lock on, potentially causing deadlocks. Consider using a private static readonly object or ConcurrentDictionary for thread-safe operations instead.

Member

@JamesNK JamesNK Oct 13, 2025

I agree with this. I think thread safety would be better added to prevent multiple threads from using the JSON at once, rather than making Prop method itself safe.

Making Prop safe against itself just kicks the problem down to the next thing that breaks when multiple threads are reading/writing JSON at the same time.

Member

@copilot Address the feedback here.

Member

I agree with this. I think thread safety would be better added to prevent multiple threads from using the JSON at once, rather than making Prop method itself safe.
Making Prop safe against itself just kicks the problem down to the next thing that breaks when multiple threads are reading/writing JSON at the same time.

@copilot consider this feedback. Look at how the json objects are being used and from where. And how we can make it thread safe for writes, and reads.

Member

I agree with this. I think thread safety would be better added to prevent multiple threads from using the JSON at once, rather than making Prop method itself safe.
Making Prop safe against itself just kicks the problem down to the next thing that breaks when multiple threads are reading/writing JSON at the same time.

@copilot This feedback has not been correctly addressed. Look at how the json objects are being used and from where. And how we can make it thread safe for writes, and reads. And then fix that. Add relevant tests too.

Copilot AI changed the title from "[WIP] Fix failing AzureDeployerTests due to missing key" to "Fix race condition in JsonExtensions.Prop causing intermittent AzureDeployerTests failures" Oct 13, 2025
Copilot AI requested a review from radical October 13, 2025 02:11
Member

@JamesNK JamesNK left a comment

@dotnet-policy-service dotnet-policy-service bot added the needs-author-action label (an issue or pull request that requires more info or actions from the author) Oct 13, 2025
Contributor Author

Copilot AI commented Oct 13, 2025

See #11974 (comment)

Changed to use ConditionalWeakTable<JsonObject, object> to associate a dedicated lock object with each JsonObject instance, avoiding locking on external objects. Commit: eeb2ee7
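
A rough sketch of what this intermediate approach might look like, assuming one lock object per JsonObject instance held in a ConditionalWeakTable. `JsonLockSketch` and `GetOrAddObject` are hypothetical names and this is not the code from the commit; it only illustrates associating a dedicated lock with each instance instead of locking on the instance itself.

```csharp
using System.Runtime.CompilerServices;
using System.Text.Json.Nodes;

// Hypothetical helper, not the actual JsonExtensions implementation.
public static class JsonLockSketch
{
    // Maps each JsonObject to its own lock object without keeping the
    // JsonObject alive: entries vanish once the key is collected.
    private static readonly ConditionalWeakTable<JsonObject, object> s_locks = new();

    // Returns the child object stored under 'key', creating it if missing.
    public static JsonObject GetOrAddObject(JsonObject parent, string key)
    {
        // GetValue is thread-safe and hands every caller the same lock
        // object for a given parent instance.
        object gate = s_locks.GetValue(parent, _ => new object());

        lock (gate)
        {
            JsonNode? node = parent[key];
            if (node is null)
            {
                node = new JsonObject();
                parent.Add(key, node);
            }

            // Throws InvalidOperationException if the existing node is not an object.
            return node.AsObject();
        }
    }
}
```

This only protects callers that go through the helper; any code that reads or writes the same JsonObject directly bypasses the lock.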

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • aka.ms
    • Triggering command: /usr/local/bin/bicep build /tmp/aspire5E11xy/env.module.bicep --stdout (dns block)
    • Triggering command: /usr/local/bin/bicep build /tmp/aspiredwJdjF/teststorage.module.bicep --stdout (dns block)
    • Triggering command: /usr/local/bin/bicep build /tmp/aspirefAt6Ni/env.module.bicep --stdout (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action label (an issue or pull request that requires more info or actions from the author) Oct 13, 2025
@radical
Member

radical commented Oct 13, 2025

I agree with this. I think thread safety would be better added to prevent multiple threads from using the JSON at once, rather than making Prop method itself safe.
Making Prop safe against itself just kicks the problem down to the next thing that breaks when multiple threads are reading/writing JSON at the same time.

@copilot This feedback has not been correctly addressed. Reconsider it. Look at how the json objects are being used and from where. And how we can make it thread safe for writes, and reads. And then fix that. Add relevant tests too.

Address feedback to provide thread-safety at the ProvisioningContext level
rather than in the Prop method itself. This ensures all DeploymentState
access is synchronized when multiple resources are provisioned in parallel.

- Added WithDeploymentState methods to ProvisioningContext for thread-safe access
- Updated BicepProvisioner to use WithDeploymentState for all state modifications
- Reverted JsonExtensions.Prop to simpler implementation without locking
- Updated tests: removed concurrent test from JsonExtensionsTests
- Added comprehensive thread-safety tests to ProvisioningContextTests

Co-authored-by: radical <[email protected]>
Contributor Author

Copilot AI commented Oct 13, 2025

I agree with this. I think thread safety would be better added to prevent multiple threads from using the JSON at once, rather than making Prop method itself safe.
Making Prop safe against itself...

Reconsidered the approach. Thread-safety is now provided at the ProvisioningContext level using WithDeploymentState() methods, which protect all concurrent access to the shared DeploymentState JsonObject. This ensures that all reads and writes are synchronized when multiple bicep resources are provisioned in parallel, not just Prop calls. Commit: (see latest commit)
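
One way to exercise this kind of synchronization is a small concurrent-writer check along the following lines. It is a sketch under assumptions, not one of the tests added in this PR: `ConcurrentStateCheck`, the `resource{i}` keys, and the standalone lock object are hypothetical, standing in for the context-owned lock.

```csharp
using System.Linq;
using System.Text.Json.Nodes;
using System.Threading.Tasks;

// Hypothetical check, not taken from ProvisioningContextTests.
public static class ConcurrentStateCheck
{
    // Starts 'writers' parallel tasks that each add one property to a
    // shared JsonObject under a common lock, then returns the final count.
    public static async Task<int> RunAsync(int writers)
    {
        var state = new JsonObject();
        var gate = new object();

        var tasks = Enumerable.Range(0, writers).Select(i => Task.Run(() =>
        {
            // Every write to the shared object happens under the lock.
            lock (gate)
            {
                state[$"resource{i}"] = new JsonObject { ["index"] = i };
            }
        }));

        await Task.WhenAll(tasks);

        // Reads take the same lock as writes.
        lock (gate)
        {
            return state.Count;
        }
    }
}
```

With every write serialized, the returned count should equal the number of writers. A passing run does not prove the absence of races, so repeating the check several times gives more confidence.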

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • aka.ms
    • Triggering command: /usr/local/bin/bicep build /tmp/aspireP5tScT/env.module.bicep --stdout (dns block)
    • Triggering command: /usr/local/bin/bicep build /tmp/aspireGhNFSV/teststorage.module.bicep --stdout (dns block)
    • Triggering command: /usr/local/bin/bicep build /tmp/aspireVYGl6O/env.module.bicep --stdout (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title from "Fix race condition in JsonExtensions.Prop causing intermittent AzureDeployerTests failures" to "Fix race condition in concurrent DeploymentState access causing intermittent AzureDeployerTests failures" Oct 13, 2025
@davidfowl davidfowl dismissed JamesNK’s stale review October 18, 2025 08:17

Issue is fixed and it's blocking PRs!

// Note: This method should only be called from within a synchronized context
// (e.g., within ProvisioningContext.WithDeploymentState) when multiple threads
// may be accessing the same JsonObject concurrently
node = new JsonObject();
Member

@copilot undo these changes since the core issue is fixed in WithDeploymentState

Contributor Author

Reverted JsonExtensions.cs to its original implementation. The thread-safety is now solely handled by WithDeploymentState at the ProvisioningContext level. Commit: (see latest)

As suggested by @davidfowl, revert changes to JsonExtensions.cs since
the core thread-safety issue is now properly handled by WithDeploymentState
at the ProvisioningContext level. The original TryAdd logic with retry
is restored.

Co-authored-by: davidfowl <[email protected]>
Contributor Author

Copilot AI commented Oct 18, 2025

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • aka.ms
    • Triggering command: /usr/local/bin/bicep build /tmp/aspireV5SyCd/env.module.bicep --stdout (dns block)
    • Triggering command: /usr/local/bin/bicep build /tmp/aspirev61VwK/teststorage.module.bicep --stdout (dns block)
    • Triggering command: /usr/local/bin/bicep build /tmp/aspiremjDgAU/teststorage.module.bicep --stdout (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@davidfowl davidfowl merged commit fec45b6 into main Oct 18, 2025
306 of 308 checks passed
@dotnet-policy-service dotnet-policy-service bot added this to the 13.0 milestone Oct 18, 2025
@eerhardt eerhardt deleted the copilot/fix-azure-deployer-tests branch October 20, 2025 16:17
@github-actions github-actions bot locked and limited conversation to collaborators Nov 20, 2025
Development

Successfully merging this pull request may close these issues.

[Failing test]: Aspire.Hosting.Azure.Tests.AzureDeployerTests fail: the given key 'AZURE_CONTAINER_REGISTRY_NAME' was not present in the dictionary.

4 participants