Circuit breaker retry budgets count retries inconsistently #30205

@AdamEAnderson

Circuit breaker retry budgets count retries inconsistently: retry budgets count retries in backoff as active, but don't count them towards the calculation of the limit.

Description:
Retry budgets allow you to configure a circuit breaker based on the proportion of outstanding requests that are retries. I will ignore the minimum retry concurrency control (min_retry_concurrency) for this issue (assume it is 0) as it isn't pertinent.

Here is how that is calculated (a rough code sketch follows the list):

  • Limit: (num_pending_requests + num_active_requests) * budget_percent / 100.0
  • Count: number of outstanding retries, either active, pending, in backoff, or waiting for rate limiting

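To make the current check concrete, here is a minimal sketch of the calculation described above. The names (RetryBudgetSnapshot, retryFitsBudget) are illustrative only and do not correspond to Envoy's actual internals:

#include <cstdint>

// Hypothetical, simplified model of the retry budget check described above.
// Field and function names are illustrative; they are not Envoy's real types.
struct RetryBudgetSnapshot {
  uint64_t num_pending_requests;
  uint64_t num_active_requests;
  uint64_t num_outstanding_retries; // active, pending, in backoff, or rate limited
  double budget_percent;            // e.g. 100.0
};

// Returns true if one additional retry still fits inside the budget.
bool retryFitsBudget(const RetryBudgetSnapshot& s) {
  const double limit =
      (s.num_pending_requests + s.num_active_requests) * s.budget_percent / 100.0;
  return s.num_outstanding_retries + 1 <= limit;
}

// Scenario described below: the only request just completed with a retriable
// error, so pending == active == 0 and the limit is 0. Even at a 100% budget
// the retry overflows: retryFitsBudget({0, 0, 0, 100.0}) == false.
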
If the router decides a request needs a retry, it checks against the limit before proceeding, and if it doesn't fit returns the error instead of retrying.

The problem is that even when the retry budget is 100%, retries can still overflow due to how the limit is calculated.

Imagine the following scenario: a request to the upstream has just completed and received a retriable error, and the retry budget is set to 100%.

In this scenario, the retry limit is 0 as there are no active or pending requests. When the router checks the retry budget, it sees that an additional retry would overflow because there is no space for additional retries, even though the budget is 100%.

This behavior is super unintuitive and not well described in the docs. It seems to me that the idea behind a retry budget is something like:

  • Limit retries to a percentage of the outstanding requests to a single upstream, so that you can't get into a situation where most of the requests to that upstream are retries, creating a retry storm
  • Have a lower bound on the limit so that retries can continue at low request rates, where a small sample size might otherwise push them over the budget

In that model, it makes very little sense that a budget of 100% still allows retries to overflow. In fact, validation on the percentage type used in configuration prevents you from even specifying a budget above 100%, making it very dangerous to enable the retry budget feature.

Second related problem:
While attempting to write an integration test to capture the above behavior, I discovered an additional counting inconsistency that depends on the upstream protocol version.

When using an HTTP/1 or HTTP/3 upstream, a request is still counted as active while the circuit breaker determination for its retry is being made. This affects the retry budget calculation.

When using an HTTP/2 upstream, the request is not counted as active at that point.

The inconsistency is clearest in the repro code below.

Proposal:
I would propose that we count retries in backoff towards the retry budget limit, i.e. limit = (pending + active + backoff) * budget (sketched below). That would ensure that the outstanding retry count never exceeds a budget of 100%, and better captures the intuitive idea that a retry budget is a proportion of all outstanding requests, including retries.

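A rough sketch of the arithmetic difference between the current and proposed limit, again with illustrative (non-Envoy) names:

#include <cstdint>

// Hypothetical helpers comparing the current and proposed limit calculations;
// names do not correspond to Envoy internals.
double currentLimit(uint64_t pending, uint64_t active, double budget_percent) {
  return (pending + active) * budget_percent / 100.0;
}

double proposedLimit(uint64_t pending, uint64_t active, uint64_t retries_in_backoff,
                     double budget_percent) {
  // Retries in backoff count towards the limit as well, so at a 100% budget
  // every outstanding retry is itself reflected in the limit.
  return (pending + active + retries_in_backoff) * budget_percent / 100.0;
}

// For example, with one retry sitting in backoff and no other active or
// pending requests:
//   currentLimit(0, 0, 100.0)     == 0  -> any further retry overflows
//   proposedLimit(0, 0, 1, 100.0) == 1  -> the retry in backoff is within budget
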
For the upstream protocol inconsistency, I would propose that the just-completed request should not be counted as active when calculating retry budgets, as at that point the request is complete and no longer taking up resources on the upstream.

If folks are amenable to my proposal I can likely PR the changes.

Repro steps:
I have reproduced the issue in a circuit breaker integration test in envoy here.
The relevant code is below:

class RetryBudgetIntegrationTest : public HttpProtocolIntegrationTest {
public:
  void initialize() override { HttpProtocolIntegrationTest::initialize(); }
};

INSTANTIATE_TEST_SUITE_P(Protocols, RetryBudgetIntegrationTest,
                         testing::ValuesIn(HttpProtocolIntegrationTest::getProtocolTestParams()),
                         HttpProtocolIntegrationTest::protocolTestParamsToString);

TEST_P(RetryBudgetIntegrationTest, CircuitBreakerRetryBudgets) {
  // Create a config with a retry budget of 100% and a (very) long retry backoff
  config_helper_.addConfigModifier([](envoy::config::bootstrap::v3::Bootstrap& bootstrap) {
    auto* retry_budget = bootstrap.mutable_static_resources()
                             ->mutable_clusters(0)
                             ->mutable_circuit_breakers()
                             ->add_thresholds()
                             ->mutable_retry_budget();
    retry_budget->mutable_min_retry_concurrency()->set_value(0);
    retry_budget->mutable_budget_percent()->set_value(100);
  });
  config_helper_.addConfigModifier(
      [&](envoy::extensions::filters::network::http_connection_manager::v3::HttpConnectionManager&
              hcm) -> void {
        auto* retry_policy =
            hcm.mutable_route_config()->mutable_virtual_hosts(0)->mutable_retry_policy();
        retry_policy->set_retry_on("5xx");
        retry_policy->mutable_retry_back_off()->mutable_base_interval()->set_seconds(100);
      });
  initialize();

  Http::TestResponseHeaderMapImpl error_response_headers{{":status", "500"}};

  // Send a request that will fail and trigger a retry
  codec_client_ = makeHttpConnection(lookupPort("http"));
  auto response1 = codec_client_->makeHeaderOnlyRequest(default_request_headers_);
  waitForNextUpstreamRequest();
  upstream_request_->encodeHeaders(error_response_headers, true); // trigger retry
  EXPECT_TRUE(upstream_request_->complete());

  // Observe odd behavior of retry overflow even though the budget is set to 100%
  test_server_->waitForCounterEq("cluster.cluster_0.upstream_rq_total", 1);
  if (upstreamProtocol() == Http::CodecType::HTTP2) {
    // For H2 upstreams the observed behavior is that the retry budget max is 0
    // i.e. it doesn't count the request that just failed as active
    EXPECT_EQ(test_server_->counter("cluster.cluster_0.upstream_rq_retry_overflow")->value(), 1);
    // since the retry overflowed we get the error response
    ASSERT_TRUE(response1->waitForEndStream());
  } else {
    // For H1/H3 upstreams the observed behavior is that the retry budget max is 1
    // i.e. it counts the request that just failed as still active when calculating retries
    //
    // Now this retry is in backoff and not counted as an active or pending request,
    // but still counted against the retry limit
    EXPECT_EQ(test_server_->counter("cluster.cluster_0.upstream_rq_retry_overflow")->value(), 0);
  }

  // now we are in a slightly weird protocol-dependent state:
  // - For H2: the request is completed and there is no retry happening
  // - For H1/H3: the request is in a long retry backoff

  // make another request that will fail and trigger another retry
  Envoy::IntegrationCodecClientPtr codec_client2 = makeHttpConnection(lookupPort("http"));
  auto response2 = codec_client2->makeHeaderOnlyRequest(default_request_headers_);

  waitForNextUpstreamRequest();
  upstream_request_->encodeHeaders(error_response_headers, true); // trigger retry
  EXPECT_TRUE(upstream_request_->complete());

  test_server_->waitForCounterEq("cluster.cluster_0.upstream_rq_total", 2);
  // This time behavior is independent of upstream protocol, no matter what the retry overflows
  //
  // In the case of H2, it would overflow regardless of the retry in backoff due to the behavior
  // seen above
  //
  // However, for H1/H3 the retry in backoff is counted against the active retry count, but not
  // counted against the limit (active + pending requests), so we have:
  // - 1 retry in backoff
  // - 1 request counted as active (due to the H1/H3-specific behavior seen above)
  // Because our retry budget is 100%, that gives us a retry limit of 1, since the retry in
  // backoff is not counted in active + pending requests. So we overflow the retry budget on the
  // second request.
  if (upstreamProtocol() == Http::CodecType::HTTP2) {
    EXPECT_EQ(test_server_->counter("cluster.cluster_0.upstream_rq_retry_overflow")->value(), 2);
  } else {
    EXPECT_EQ(test_server_->counter("cluster.cluster_0.upstream_rq_retry_overflow")->value(), 1);
  }

  // since the retry overflowed we get the error response
  ASSERT_TRUE(response2->waitForEndStream());

  if (upstreamProtocol() != Http::CodecType::HTTP2) {
    // For H1/H3: the first request is in a long retry backoff and must be manually abandoned
    // otherwise the codec destructor is annoyed about the outstanding request
    codec_client_->rawConnection().close(Envoy::Network::ConnectionCloseType::Abort);
  }
  codec_client2->close();
}

Labels: area/circuit_breaker, bug, stale
