
Conversation

Contributor

@hawkw hawkw commented Jun 2, 2023

The latest proxy-api release, v0.10.0, adds fields to the
`OutboundPolicies` API for configuring HTTP request timeouts, based on
the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997.

This branch updates the proxy-api dependency to v0.10.0 and adds the new
timeout configuration fields to the proxy's internal client policy
types. In addition, this branch adds a timeout middleware to the HTTP
client policy stack, so that the timeout described by the
`Rule.request_timeout` field is now applied.

Implementing the `RouteBackend.request_timeout` field with semantics as
close as possible to those described in GEP-1742 will be somewhat more
complex, and will be added in a separate PR. The reason for this
complexity is that, as described in GEP-1742, the
`timeouts.backendRequest` timeout is intended to bound only the time it
takes for the backend to respond to the request, rather than bounding
both time spent in the proxy and time spent in the backend (which is the
intention behind the `timeouts.request` timeout). Therefore, the
`backendRequest` timeout should be started as "close" to the actual
backend client as possible, ideally in the endpoint stack after the
request has already made it through the load balancer queue. Starting
this timeout in the logical stack, as the route's `request_timeout`
does, would mean that time spent in the proxy's queues is included in
the timeout, which does not appear to be the intention.

However, if we want these timeouts to start in the endpoint stack, we
can't configure them using the target type, because a backend may be
referenced with different timeouts by different HTTPRoutes, and we would
prefer not to create separate client stacks based on those timeout
values. Instead, the request will need to carry the configured timeout
value as an extension, and a new timeout middleware will apply timeouts
taken from that request extension. This will be implemented in a
follow-up PR.
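
A minimal sketch of that idea, as a plain tower service that reads a timeout from the request's extensions rather than from the stack's target. The `BackendRequestTimeout` extension type and `TimeoutFromExtension` service are hypothetical names for illustration, not the proxy's actual types:

```rust
use std::{
    future::Future,
    pin::Pin,
    task::{Context, Poll},
    time::Duration,
};

use http::Request;
use tower::Service;

type BoxError = Box<dyn std::error::Error + Send + Sync>;

/// Hypothetical extension value: the route stack would insert this into the
/// request once the policy's `RouteBackend.request_timeout` is known.
#[derive(Copy, Clone, Debug)]
pub struct BackendRequestTimeout(pub Duration);

/// Sketch of a middleware that applies a timeout taken from the request's
/// extensions, so a single endpoint stack can honor different per-route
/// timeout values without building separate client stacks per timeout.
#[derive(Clone, Debug)]
pub struct TimeoutFromExtension<S> {
    inner: S,
}

impl<S> TimeoutFromExtension<S> {
    pub fn new(inner: S) -> Self {
        Self { inner }
    }
}

impl<S, B> Service<Request<B>> for TimeoutFromExtension<S>
where
    S: Service<Request<B>>,
    S::Error: Into<BoxError> + Send + 'static,
    S::Future: Send + 'static,
    S::Response: Send + 'static,
{
    type Response = S::Response;
    type Error = BoxError;
    type Future = Pin<Box<dyn Future<Output = Result<S::Response, BoxError>> + Send>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx).map_err(Into::into)
    }

    fn call(&mut self, req: Request<B>) -> Self::Future {
        // The timeout travels with the request, not with the stack's target,
        // so a backend shared by several HTTPRoutes can reuse one client stack.
        let timeout = req.extensions().get::<BackendRequestTimeout>().copied();
        let fut = self.inner.call(req);
        Box::pin(async move {
            match timeout {
                Some(BackendRequestTimeout(t)) => match tokio::time::timeout(t, fut).await {
                    Ok(res) => res.map_err(Into::into),
                    // The deadline elapsed before the backend responded.
                    Err(_elapsed) => Err(BoxError::from("backend request timed out")),
                },
                // No per-backend timeout was configured for this request.
                None => fut.await.map_err(Into::into),
            }
        })
    }
}
```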

@hawkw hawkw requested a review from a team as a code owner June 2, 2023 18:59
Member

@adleong adleong left a comment


Are there any proxy metrics which are incremented when a timeout occurs? Are timeouts distinguishable from application errors? Are request timeouts distinguishable from connect timeouts, detect timeouts, or failfast?

Contributor Author

hawkw commented Jun 12, 2023

@adleong

Are there any proxy metrics which are incremented when a timeout occurs?

With regard to metrics, these timeouts should behave analogously to timeouts configured for ServiceProfiles. When a request times out, we return an HTTP 504 response to the application and increment the response_total metric with that status code as a label.

Are timeouts distinguishable from application errors?

AFAIK, we don't currently have a metric label that distinguishes HTTP 504 error codes resulting from internal proxy timeouts from those returned by an application service. It's probably a good idea to add a label or similar that makes proxy timeouts distinguishable from errors returned by an upstream application, but this probably deserves some additional design work; we could likely make it possible to distinguish between any error status synthesized by the proxy and errors with the same status returned by a remote application service.

Are request timeouts distinguishable from connect timeouts, detect timeouts, or failfast?

Protocol detection timeouts don't result in an error being returned to the application; when protocol detection times out, the proxy handles the connection as opaque. Because the proxy does something different rather than returning an error, these timeouts are distinguishable by behavior.

Regarding failfast, the proxy currently returns an HTTP 504 for requests that were already dispatched to a service when it enters failfast, but returns an HTTP 503 when a new request would be dispatched to a service that is shedding load due to failfast:

```rust
// No available backend can be found for a request.
if errors::is_caused_by::<errors::FailFastError>(&*error) {
    // XXX(ver) This should probably be SERVICE_UNAVAILABLE, because
    // this is basically no different from a LoadShedError, but that
    // would be a change in behavior.
    return Ok(errors::SyntheticHttpResponse::gateway_timeout(error));
}
if errors::is_caused_by::<errors::LoadShedError>(&*error) {
    return Ok(errors::SyntheticHttpResponse::unavailable(error));
}
```

This means that request timeouts and failfast will both sometimes increment the response_total metric with status="504", so they cannot currently always be distinguished in metrics.

In all cases, we will log substantially different error messages for these timeouts, and the value of the l5d-proxy-error header will be different. So these timeouts can be distinguished that way.

Contributor Author

hawkw commented Jun 13, 2023

@adleong Update regarding my previous comment on metrics: I had forgotten about the outbound_http_errors_total metric, which is also incremented by these timeouts. The timeouts added in this PR will behave identically to ServiceProfile timeouts when incrementing this metric: we will increment the metric with the error="response timeout" label. This can be distinguished from failfast and load-shedding timeouts, which have their own labels. Similarly, we can distinguish these timeouts in the metrics from application 504s, since those do not increment the outbound_http_errors_total metric.

@hawkw hawkw requested a review from adleong June 13, 2023 15:54
Member

@mateiidavid mateiidavid left a comment


Legible enough for me to review 🙃 looks good. The opaque route is confusing me a bit, but I think that's more due to the complexity of opaqueness, not because of this PR.

@hawkw hawkw requested review from mateiidavid and olix0r June 14, 2023 16:24
@hawkw hawkw merged commit a512ea6 into main Jun 14, 2023
@hawkw hawkw deleted the eliza/client-policy-timeouts branch June 14, 2023 18:36
hawkw added a commit that referenced this pull request Jun 15, 2023
Depends on #2418 

The latest proxy-api release, v0.10.0, adds fields to the
`OutboundPolicies` API for configuring HTTP request timeouts, based on
the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997.
PR #2418 updates the proxy to depend on the new proxy-api release, and
implements the `Rule.request_timeout` field added to the API. However,
that branch does *not* add a timeout for the
`RouteBackend.request_timeout` field. This branch changes the proxy to
apply the backend request timeout when configured by the policy
controller.

This branch implements `RouteBackend.request_timeout` by adding an
additional timeout layer in the `MatchedBackend` stack. This applies the
per-backend timeout once a backend is selected for a route. I've also
added stack tests for the interaction between the request and backend
request timeouts.

Note that once retries are added to client policy stacks, it may be
necessary to move the backend request timeout to ensure it occurs
"below" retries, depending on where the retry middleware ends up being
located in the proxy stack.
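
For intuition about how the two timeouts compose once both are configured, here is a small, self-contained sketch using plain `tower` timeout layers rather than the proxy's actual `MatchedRoute`/`MatchedBackend` stacks; it assumes tokio and tower with their `timeout` and `util` features enabled:

```rust
use std::time::Duration;

use tower::{ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    // Two nested timeouts, analogous to the route-level `Rule.request_timeout`
    // (outer) and the per-backend `RouteBackend.request_timeout` (inner).
    let svc = ServiceBuilder::new()
        // Outer, route-level timeout: layers added first wrap everything below.
        .timeout(Duration::from_secs(2))
        // Inner, per-backend timeout.
        .timeout(Duration::from_secs(1))
        .service_fn(|_req: ()| async {
            // A "backend" that never responds in time.
            tokio::time::sleep(Duration::from_secs(10)).await;
            Ok::<_, tower::BoxError>(())
        });

    // The tighter inner (1s) deadline elapses first, so its error is what
    // surfaces to the caller; the outer layer passes it through unchanged.
    let err = svc.oneshot(()).await.unwrap_err();
    println!("request failed: {err}");
}
```

In the real stacks either value could be the tighter one; the point is only that whichever deadline elapses first fails the request.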
hawkw added a commit that referenced this pull request Jun 15, 2023
PRs #2418 and #2419 add per-route and per-backend request timeouts
configured by the `OutboundPolicies` API to the `MatchedRoute` and
`MatchedBackend` layers in the outbound `ClientPolicy` stack,
respectively. This means that — unlike in the `ServiceProfile` stack —
two separate request timeouts can be configured in `ClientPolicy`
stacks. However, because both the `MatchedRoute` and `MatchedBackend`
layers are in the HTTP logical stack, the errors emitted by both
timeouts will have a `LogicalError` as their most specific error
metadata, meaning that the log messages and `l5d-proxy-error` headers
recorded for these timeouts do not indicate whether the timeout that
failed the request was the route request timeout or the backend request
timeout.

In order to ensure this information is recorded and exposed to the user,
this branch adds two new error wrapper types, one of which enriches an
error with a `RouteRef`'s metadata, and one of which enriches an error
with a `BackendRef`'s metadata. The `MatchedRoute` stack now wraps all
errors with `RouteRef` metadata, and the `MatchedBackend` stack wraps
errors with `BackendRef` metadata. This way, when the route timeout
fails a request, the error will include the route metadata, while when
the backend request timeout fails a request, the error will include both
the route and backend metadata.

Adding these new error wrappers also has the additional side benefit of
adding this metadata to errors returned by filters, allowing users to
distinguish between errors emitted by a filter on a route rule and
errors emitted by a per-backend filter. Also, any other errors emitted
lower in the stack for requests that are handled by a client policy
stack will now also include this metadata, which seems generally useful.

Example errors, taken from a proxy unit test:

backend request:
```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: backend service.test.test-svc:666: HTTP response timeout after 1s
```
route request:
```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: HTTP response timeout after 2s
```
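
As a rough illustration of the error-wrapper approach described above, the following sketch shows one way an error can be enriched with route metadata through its `Display` and `source()` implementations. The type and field names here are hypothetical, not the proxy's actual `RouteRef` or wrapper types:

```rust
use std::{error::Error, fmt};

/// Hypothetical route metadata; the proxy's real `RouteRef` also carries
/// other identifying fields such as the route's group.
#[derive(Clone, Debug)]
pub struct RouteRef {
    pub kind: String,
    pub namespace: String,
    pub name: String,
}

/// Sketch of an error wrapper that enriches an inner error with route
/// metadata, so that its `Display` output (and therefore the log message and
/// `l5d-proxy-error` header built from it) names the route the failing
/// request matched.
#[derive(Debug)]
pub struct RouteError<E> {
    route: RouteRef,
    source: E,
}

impl<E> RouteError<E> {
    pub fn new(route: RouteRef, source: E) -> Self {
        Self { route, source }
    }
}

impl<E: fmt::Display> fmt::Display for RouteError<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Prefix the inner error, mirroring the
        // `route httproute.test.timeout-route: ...` form in the examples above.
        write!(
            f,
            "route {}.{}.{}: {}",
            self.route.kind, self.route.namespace, self.route.name, self.source
        )
    }
}

impl<E: Error + 'static> Error for RouteError<E> {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}
```

A per-backend wrapper built the same way and applied in the backend stack would contribute the `backend service.test.test-svc:666:` prefix seen in the first example.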
risingspiral pushed a commit to linkerd/linkerd2 that referenced this pull request Jun 15, 2023
PRs #2418 and #2419 add per-route and per-backend request timeouts
configured by the `OutboundPolicies` API to the `MatchedRoute` and
`MatchedBackend` layers in the outbound `ClientPolicy` stack,
respectively. This means that — unlike in the `ServiceProfile` stack —
two separate request timeouts can be configured in `ClientPolicy`
stacks. However, because both the `MatchedRoute` and `MatchedBackend`
layers are in the HTTP logical stack, the errors emitted by both
timeouts will have a `LogicalError` as their most specific error
metadata, meaning that the log messages and `l5d-proxy-error` headers
recorded for these timeouts do not indicate whether the timeout that
failed the request was the route request timeout or the backend request
timeout.

In order to ensure this information is recorded and exposed to the user,
this branch adds two new error wrapper types, one of which enriches an
error with a `RouteRef`'s metadata, and one of which enriches an error
with a `BackendRef`'s metadata. The `MatchedRoute` stack now wraps all
errors with `RouteRef` metadata, and the `MatchedBackend` stack wraps
errors with `BackendRef` metadata. This way, when the route timeout
fails a request, the error will include the route metadata, while when
the backend request timeout fails a request, the error will include both
the route and backend metadata.

Adding these new error wrappers also has the additional side benefit of
adding this metadata to errors returned by filters, allowing users to
distinguish between errors emitted by a filter on a route rule and
errors emitted by a per-backend filter. Also, any other errors emitted
lower in the stack for requests that are handled by a client policy
stack will now also include this metadata, which seems generally useful.

Example errors, taken from a proxy unit test:

backend request:
```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: backend service.test.test-svc:666: HTTP response timeout after 1s
```
route request:
```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: HTTP response timeout after 2s
```

---

* recover: remove unused `mut` (linkerd/linkerd2-proxy#2425)
* outbound: implement `OutboundPolicies` route request timeouts (linkerd/linkerd2-proxy#2418)
* build(deps): bump tj-actions/changed-files from 35.7.7 to 36.2.1 (linkerd/linkerd2-proxy#2427)
* outbound: implement `OutboundPolicies` backend request timeouts (linkerd/linkerd2-proxy#2419)
* outbound: add backend and route metadata to errors (linkerd/linkerd2-proxy#2428)

Signed-off-by: Eric Anderson <[email protected]>
risingspiral pushed a commit to linkerd/linkerd2 that referenced this pull request Jun 15, 2023
Updating proxy version on main
