
Conversation

Contributor

@hawkw hawkw commented Jun 2, 2023

The latest proxy-api release, v0.10.0, adds fields to the
`OutboundPolicies` API for configuring HTTP request timeouts, based on
the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997.

This branch updates the proxy-api dependency to v0.10.0 and adds the new
timeout configuration fields to the proxy's internal client policy
types. In addition, this branch adds a timeout middleware to the HTTP
client policy stack, so that the timeout described by the
`Rule.request_timeout` field is now applied.

Implementing the `RouteBackend.request_timeout` field with semantics as
close as possible to those described in GEP-1742 will be somewhat more
complex, and will be added in a separate PR. The reason for this
complexity is that, as described in GEP-1742, the
`timeouts.backendRequest` timeout is intended to bound only the time it
takes for the backend to respond to the request, rather than bounding
both time spent in the proxy and time spent in the backend (which is the
intention behind the `timeouts.request` timeout). Therefore, the
`backendRequest` timeout should be started as "close" to the actual
backend client as possible, ideally in the endpoint stack after the
request has already made it through the load balancer queue. Starting
this timeout in the logical stack, as the route's `request_timeout`
does, would mean that time spent in the proxy's queues is included in
the timeout, which does not appear to be the intention.

However, if we want these timeouts to start in the endpoint stack, we
can't configure them using the target type, because a backend may be
referenced with different timeouts by different HTTPRoutes, and we would
prefer not to create separate client stacks based on those timeout
values. Instead, the request will need to carry the configured timeout
value as an extension, and a new timeout middleware will apply timeouts
taken from that request extension. This will be implemented in a
follow-up PR.
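
A minimal sketch of that idea, as a plain tower service that reads a timeout from the request's extensions rather than from the stack's target. The `BackendRequestTimeout` extension type and `TimeoutFromExtension` service are hypothetical names for illustration, not the proxy's actual types:

```rust
use std::{
    future::Future,
    pin::Pin,
    task::{Context, Poll},
    time::Duration,
};

use http::Request;
use tower::Service;

type BoxError = Box<dyn std::error::Error + Send + Sync>;

/// Hypothetical extension value: the route stack would insert this into the
/// request once the policy's `RouteBackend.request_timeout` is known.
#[derive(Copy, Clone, Debug)]
pub struct BackendRequestTimeout(pub Duration);

/// Sketch of a middleware that applies a timeout taken from the request's
/// extensions, so a single endpoint stack can honor different per-route
/// timeout values without building separate client stacks per timeout.
#[derive(Clone, Debug)]
pub struct TimeoutFromExtension<S> {
    inner: S,
}

impl<S> TimeoutFromExtension<S> {
    pub fn new(inner: S) -> Self {
        Self { inner }
    }
}

impl<S, B> Service<Request<B>> for TimeoutFromExtension<S>
where
    S: Service<Request<B>>,
    S::Error: Into<BoxError> + Send + 'static,
    S::Future: Send + 'static,
    S::Response: Send + 'static,
{
    type Response = S::Response;
    type Error = BoxError;
    type Future = Pin<Box<dyn Future<Output = Result<S::Response, BoxError>> + Send>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx).map_err(Into::into)
    }

    fn call(&mut self, req: Request<B>) -> Self::Future {
        // The timeout travels with the request, not with the stack's target,
        // so a backend shared by several HTTPRoutes can reuse one client stack.
        let timeout = req.extensions().get::<BackendRequestTimeout>().copied();
        let fut = self.inner.call(req);
        Box::pin(async move {
            match timeout {
                Some(BackendRequestTimeout(t)) => match tokio::time::timeout(t, fut).await {
                    Ok(res) => res.map_err(Into::into),
                    // The deadline elapsed before the backend responded.
                    Err(_elapsed) => Err(BoxError::from("backend request timed out")),
                },
                // No per-backend timeout was configured for this request.
                None => fut.await.map_err(Into::into),
            }
        })
    }
}
```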

@hawkw hawkw requested a review from a team as a code owner June 2, 2023 18:59
Member

@adleong adleong left a comment


Are there any proxy metrics which are incremented when a timeout occurs? Are timeouts distinguishable from application errors? Are request timeouts distinguishable from connect timeouts, detect timeouts, or failfast?

Contributor Author

hawkw commented Jun 12, 2023

@adleong

Are there any proxy metrics which are incremented when a timeout occurs?

With regard to metrics, these timeouts should behave analogously to timeouts configured for ServiceProfiles. When a request times out, we return an HTTP 504 response to the application and increment the response_total metric with that status code as a label.

Are timeouts distinguishable from application errors?

AFAIK, we don't currently have a metric label that distinguishes HTTP 504 error codes resulting from internal proxy timeouts from those returned by an application service. It's probably a good idea to add a label or similar that makes proxy timeouts distinguishable from errors returned by an upstream application, but this probably deserves some additional design work; we could likely make it possible to distinguish between any error status synthesized by the proxy and errors with the same status returned by a remote application service.

Are request timeouts distinguishable from connect timeouts, detect timeouts, or failfast?

Protocol detection timeouts don't result in an error being returned to the application; when protocol detection times out, the proxy handles the connection as opaque. Because the proxy does something different rather than returning an error, these timeouts are distinguishable by behavior.

Regarding failfast, the proxy currently returns an HTTP 504 for requests that were already dispatched to a service when it enters failfast, but returns an HTTP 503 when a new request would be dispatched to a service that is shedding load due to failfast:

```rust
// No available backend can be found for a request.
if errors::is_caused_by::<errors::FailFastError>(&*error) {
    // XXX(ver) This should probably be SERVICE_UNAVAILABLE, because
    // this is basically no different from a LoadShedError, but that
    // would be a change in behavior.
    return Ok(errors::SyntheticHttpResponse::gateway_timeout(error));
}
if errors::is_caused_by::<errors::LoadShedError>(&*error) {
    return Ok(errors::SyntheticHttpResponse::unavailable(error));
}
```

This means that request timeouts and failfast will both sometimes increment the response_total metric with status="504", so they cannot currently always be distinguished in metrics.

In all cases, we will log substantially different error messages for these timeouts, and the value of the l5d-proxy-error header will be different. So these timeouts can be distinguished that way.

Contributor Author

hawkw commented Jun 13, 2023

@adleong Update regarding my previous comment on metrics: I had forgotten about the outbound_http_errors_total metric, which is also incremented by these timeouts. The timeouts added in this PR will behave identically to ServiceProfile timeouts when incrementing this metric: we will increment the metric with the error="response timeout" label. This can be distinguished from failfast and load-shedding timeouts, which have their own labels. Similarly, we can distinguish these timeouts in the metrics from application 504s, since those do not increment the outbound_http_errors_total metric.

@hawkw hawkw requested a review from adleong June 13, 2023 15:54
Member

@mateiidavid mateiidavid left a comment


Legible enough for me to review 🙃 looks good. The opaque route is confusing me a bit, but I think that's more due to the complexity of opaqueness, not because of this PR.

@hawkw hawkw requested review from mateiidavid and olix0r June 14, 2023 16:24
@hawkw hawkw merged commit a512ea6 into main Jun 14, 2023
@hawkw hawkw deleted the eliza/client-policy-timeouts branch June 14, 2023 18:36
hawkw added a commit that referenced this pull request Jun 15, 2023
Depends on #2418 

The latest proxy-api release, v0.10.0, adds fields to the
`OutboundPolicies` API for configuring HTTP request timeouts, based on
the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997.
PR #2418 updates the proxy to depend on the new proxy-api release, and
implements the `Rule.request_timeout` field added to the API. However,
that branch does *not* add a timeout for the
`RouteBackend.request_timeout` field. This branch changes the proxy to
apply the backend request timeout when configured by the policy
controller.

This branch implements `RouteBackend.request_timeout` by adding an
additional timeout layer in the `MatchedBackend` stack. This applies the
per-backend timeout once a backend is selected for a route. I've also
added stack tests for the interaction between the request and backend
request timeouts.

Note that once retries are added to client policy stacks, it may be
necessary to move the backend request timeout to ensure it occurs
"below" retries, depending on where the retry middleware ends up being
located in the proxy stack.
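
For intuition about how the two timeouts compose once both are configured, here is a small, self-contained sketch using plain `tower` timeout layers rather than the proxy's actual `MatchedRoute`/`MatchedBackend` stacks; it assumes tokio and tower with their `timeout` and `util` features enabled:

```rust
use std::time::Duration;

use tower::{ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    // Two nested timeouts, analogous to the route-level `Rule.request_timeout`
    // (outer) and the per-backend `RouteBackend.request_timeout` (inner).
    let svc = ServiceBuilder::new()
        // Outer, route-level timeout: layers added first wrap everything below.
        .timeout(Duration::from_secs(2))
        // Inner, per-backend timeout.
        .timeout(Duration::from_secs(1))
        .service_fn(|_req: ()| async {
            // A "backend" that never responds in time.
            tokio::time::sleep(Duration::from_secs(10)).await;
            Ok::<_, tower::BoxError>(())
        });

    // The tighter inner (1s) deadline elapses first, so its error is what
    // surfaces to the caller; the outer layer passes it through unchanged.
    let err = svc.oneshot(()).await.unwrap_err();
    println!("request failed: {err}");
}
```

In the real stacks either value could be the tighter one; the point is only that whichever deadline elapses first fails the request.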
hawkw added a commit that referenced this pull request Jun 15, 2023
PRs #2418 and #2419 add per-route and per-backend request timeouts
configured by the `OutboundPolicies` API to the `MatchedRoute` and
`MatchedBackend` layers in the outbound `ClientPolicy` stack,
respectively. This means that — unlike in the `ServiceProfile` stack —
two separate request timeouts can be configured in `ClientPolicy`
stacks. However, because both the `MatchedRoute` and `MatchedBackend`
layers are in the HTTP logical stack, the errors emitted by both
timeouts will have a `LogicalError` as their most specific error
metadata, meaning that the log messages and `l5d-proxy-error` headers
recorded for these timeouts do not indicate whether the timeout that
failed the request was the route request timeout or the backend request
timeout.

In order to ensure this information is recorded and exposed to the user,
this branch adds two new error wrapper types, one of which enriches an
error with a `RouteRef`'s metadata, and one of which enriches an error
with a `BackendRef`'s metadata. The `MatchedRoute` stack now wraps all
errors with `RouteRef` metadata, and the `MatchedBackend` stack wraps
errors with `BackendRef` metadata. This way, when the route timeout
fails a request, the error will include the route metadata, while when
the backend request timeout fails a request, the error will include both
the route and backend metadata.

Adding these new error wrappers also has the additional side benefit of
adding this metadata to errors returned by filters, allowing users to
distinguish between errors emitted by a filter on a route rule and
errors emitted by a per-backend filter. Also, any other errors emitted
lower in the stack for requests that are handled by a client policy
stack will now also include this metadata, which seems generally useful.

Example errors, taken from a proxy unit test:

backend request:
```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: backend service.test.test-svc:666: HTTP response timeout after 1s
```
route request:
```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: HTTP response timeout after 2s
```
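
As a rough illustration of the error-wrapper approach described above, the following sketch shows one way an error can be enriched with route metadata through its `Display` and `source()` implementations. The type and field names here are hypothetical, not the proxy's actual `RouteRef` or wrapper types:

```rust
use std::{error::Error, fmt};

/// Hypothetical route metadata; the proxy's real `RouteRef` also carries
/// other identifying fields such as the route's group.
#[derive(Clone, Debug)]
pub struct RouteRef {
    pub kind: String,
    pub namespace: String,
    pub name: String,
}

/// Sketch of an error wrapper that enriches an inner error with route
/// metadata, so that its `Display` output (and therefore the log message and
/// `l5d-proxy-error` header built from it) names the route the failing
/// request matched.
#[derive(Debug)]
pub struct RouteError<E> {
    route: RouteRef,
    source: E,
}

impl<E> RouteError<E> {
    pub fn new(route: RouteRef, source: E) -> Self {
        Self { route, source }
    }
}

impl<E: fmt::Display> fmt::Display for RouteError<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Prefix the inner error, mirroring the
        // `route httproute.test.timeout-route: ...` form in the examples above.
        write!(
            f,
            "route {}.{}.{}: {}",
            self.route.kind, self.route.namespace, self.route.name, self.source
        )
    }
}

impl<E: Error + 'static> Error for RouteError<E> {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}
```

A per-backend wrapper built the same way and applied in the backend stack would contribute the `backend service.test.test-svc:666:` prefix seen in the first example.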
risingspiral pushed a commit to linkerd/linkerd2 that referenced this pull request Jun 15, 2023
PRs #2418 and #2419 add per-route and per-backend request timeouts
configured by the `OutboundPolicies` API to the `MatchedRoute` and
`MatchedBackend` layers in the outbound `ClientPolicy` stack,
respectively. This means that — unlike in the `ServiceProfile` stack —
two separate request timeouts can be configured in `ClientPolicy`
stacks. However, because both the `MatchedRoute` and `MatchedBackend`
layers are in the HTTP logical stack, the errors emitted by both
timeouts will have a `LogicalError` as their most specific error
metadata, meaning that the log messages and `l5d-proxy-error` headers
recorded for these timeouts do not indicate whether the timeout that
failed the request was the route request timeout or the backend request
timeout.

In order to ensure this information is recorded and exposed to the user,
this branch adds two new error wrapper types, one of which enriches an
error with a `RouteRef`'s metadata, and one of which enriches an error
with a `BackendRef`'s metadata. The `MatchedRoute` stack now wraps all
errors with `RouteRef` metadata, and the `MatchedBackend` stack wraps
errors with `BackendRef` metadata. This way, when the route timeout
fails a request, the error will include the route metadata, while when
the backend request timeout fails a request, the error will include both
the route and backend metadata.

Adding these new error wrappers also has the additional side benefit of
adding this metadata to errors returned by filters, allowing users to
distinguish between errors emitted by a filter on a route rule and
errors emitted by a per-backend filter. Also, any other errors emitted
lower in the stack for requests that are handled by a client policy
stack will now also include this metadata, which seems generally useful.

Example errors, taken from a proxy unit test:

backend request:
```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: backend service.test.test-svc:666: HTTP response timeout after 1s
```
route request:
```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: HTTP response timeout after 2s
```

---

* recover: remove unused `mut` (linkerd/linkerd2-proxy#2425)
* outbound: implement `OutboundPolicies` route request timeouts (linkerd/linkerd2-proxy#2418)
* build(deps): bump tj-actions/changed-files from 35.7.7 to 36.2.1 (linkerd/linkerd2-proxy#2427)
* outbound: implement `OutboundPolicies` backend request timeouts (linkerd/linkerd2-proxy#2419)
* outbound: add backend and route metadata to errors (linkerd/linkerd2-proxy#2428)

Signed-off-by: Eric Anderson <[email protected]>
risingspiral pushed a commit to linkerd/linkerd2 that referenced this pull request Jun 15, 2023
Updating proxy version on main
