ACA backend: Cosmos DB latency/failures – tune SDK timeouts/retries and partition/RU (rg-artagent-contoso) #6

@JinLee794

Summary

  • Cosmos dependencies: audioagentcollection (76 calls, 14 failures, p95 60s) and users (14 calls, 6 failures, p95 120s). Failures correlate with identity issues and long SDK timeouts.

Severity & SLA risk

  • sev3 (elevated latency and intermittent failures)

Detection

  • App Insights dependency telemetry shows the failure and latency metrics listed in the Summary.
  • Exceptions include upsert failures tied to prior token issues.

Impacted components

  • Cosmos DB client configuration; partitioning/RU provisioning.

Suspected cause

  • Long default timeouts and retry budgets; possibly hot partition/RU throttling. Confidence: medium.

Recommended actions

  • Set a bounded requestTimeout (e.g., 5–10s) and a bounded overall retry policy
  • Enable fast-fail on cancellation from upstream
  • Verify partition keys and RU/s; increase or autoscale where needed
  • Add diagnostics capturing RU charge/throttle codes
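
The first two actions could be sketched roughly as follows. Because the SDK and its version are not yet known (see Missing info), this is an SDK-agnostic outline using only the Python standard library, not the Cosmos client's own configuration. `call_with_budget`, `ThrottledError`, `RetryBudgetExceeded`, and the `cancelled` callback are hypothetical names introduced for illustration; the 5s per-attempt and 10s overall values are taken from the suggested 5–10s range and are assumptions to tune.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when the call is cancelled or the retry budget is used up."""


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling (HTTP 429) error from the SDK."""

    def __init__(self, retry_after_s: float = 0.0) -> None:
        super().__init__("throttled")
        self.retry_after_s = retry_after_s


def call_with_budget(
    op: Callable[[float], T],
    *,
    attempt_timeout_s: float = 5.0,   # assumed per-attempt bound
    overall_budget_s: float = 10.0,   # assumed bound across all attempts
    max_attempts: int = 3,
    base_backoff_s: float = 0.1,
    cancelled: Callable[[], bool] = lambda: False,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(timeout_s) with a per-attempt timeout and an overall deadline."""
    deadline = clock() + overall_budget_s
    last_error: Optional[BaseException] = None
    for attempt in range(max_attempts):
        # Fast-fail if the upstream caller has already given up.
        if cancelled():
            raise RetryBudgetExceeded("cancelled by upstream")
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            # Never give a single attempt more time than the budget has left.
            return op(min(attempt_timeout_s, remaining))
        except (ThrottledError, TimeoutError) as exc:
            last_error = exc
            # Exponential backoff with jitter; honor a server hint if present.
            backoff = base_backoff_s * (2 ** attempt) + random.uniform(0, base_backoff_s)
            wait = max(backoff, getattr(exc, "retry_after_s", 0.0))
            if clock() + wait >= deadline:
                break
            sleep(wait)
    raise RetryBudgetExceeded("retry budget exhausted") from last_error
```

Here `op` is expected to wrap the actual Cosmos call and pass the given timeout through to the client. With these assumed defaults no call chain can run longer than about 10s, which would rule out the 60–120s calls seen in the dependency data.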

Acceptance criteria

  • p95 < 400ms, p99 < 1s on key operations
  • Dependency failure rate <1%
  • No prolonged 60–120s Cosmos calls
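
As a rough way to check these criteria against exported dependency durations, the sketch below computes nearest-rank percentiles and the failure rate. The function names and the input shape (a list of durations in milliseconds plus a failure count) are assumptions, not an existing tool, and reading "no prolonged calls" as "no call reaches 60s" is one interpretation.

```python
import math
from typing import Dict, Sequence


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def failure_rate(calls: int, failures: int) -> float:
    """Fraction of calls that failed."""
    return failures / calls


def check_criteria(durations_ms: Sequence[float], failures: int) -> Dict[str, bool]:
    """Evaluate each acceptance criterion for one dependency."""
    return {
        "p95_under_400ms": percentile(durations_ms, 95) < 400,
        "p99_under_1s": percentile(durations_ms, 99) < 1000,
        "failure_rate_under_1pct": failure_rate(len(durations_ms), failures) < 0.01,
        # Interpretation: no single call reaches the 60s mark.
        "no_call_over_60s": max(durations_ms) < 60_000,
    }
```

For reference, the counts in the Summary give failure rates of 14/76 (about 18.4%) for audioagentcollection and 6/14 (about 42.9%) for users, both well above the 1% target.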

Follow-ups

  • Add load test to validate RU/partitioning

Missing info

  • Database/account names, SDK version in use
