outbound: implement `OutboundPolicies` route request timeouts #2418
Conversation
adleong left a comment
Are there any proxy metrics which are incremented when a timeout occurs? Are timeouts distinguishable from application errors? Are request timeouts distinguishable from connect timeouts, detect timeouts, or failfast?
With regard to metrics, these timeouts should behave analogously to timeouts configured for ServiceProfiles. When a request times out, we return an HTTP 504 response to the application and increment the `response_total` metric accordingly.

AFAIK, we don't currently have a metric label that distinguishes HTTP 504 error codes that result from internal proxy timeouts from those returned by an application service. It's probably a good idea to add a label or something that makes proxy timeouts distinguishable from those returned by an upstream application, but this may deserve some additional design work; we could probably make it possible to distinguish between any error status synthesized by the proxy and errors with the same status returned by a remote application service.
Protocol detection timeouts don't result in an error being returned to the application; when protocol detection times out, the proxy handles the connection as opaque. Because the proxy behaves differently rather than returning an error, detection timeouts remain distinguishable. Regarding failfast, the proxy currently returns an HTTP 504 for requests already dispatched to a service when it enters failfast, but returns an HTTP 503 when a new request would be dispatched to a service that is shedding load due to failfast (see linkerd2-proxy/linkerd/app/outbound/src/http/server.rs, lines 154 to 163 at 704ef31).

This means both timeouts will sometimes increment the `response_total` metric with status="504", so they cannot always be distinguished in metrics currently.
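Purely as an illustration (this is not linkerd's actual error handling), the 504/503 distinction described above amounts to mapping internally synthesized failures to different statuses; the `ProxyError` enum and `synthesized_status` function below are hypothetical stand-ins:

```rust
/// Hypothetical classification of proxy-synthesized failures; illustrative only.
enum ProxyError {
    /// A request already dispatched to a service timed out waiting for a response.
    ResponseTimeout,
    /// The service is in failfast, so new requests are shed before dispatch.
    LoadShed,
}

/// Maps a synthesized failure to the HTTP status returned to the application.
fn synthesized_status(err: &ProxyError) -> u16 {
    match err {
        ProxyError::ResponseTimeout => 504, // Gateway Timeout
        ProxyError::LoadShed => 503,        // Service Unavailable
    }
}

fn main() {
    assert_eq!(synthesized_status(&ProxyError::ResponseTimeout), 504);
    assert_eq!(synthesized_status(&ProxyError::LoadShed), 503);
}
```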
In all cases, we will log substantially different error messages for these timeouts, and the value of the `l5d-proxy-error` header returned to the application will differ.
@adleong Update regarding my previous comment on metrics: I had forgotten about the …
mateiidavid left a comment
Legible enough for me to review 🙃 looks good. The opaque route is confusing me a bit, but I think that's more due to the complexity of opaqueness, not because of this PR.
Depends on #2418

The latest proxy-api release, v0.10.0, adds fields to the `OutboundPolicies` API for configuring HTTP request timeouts, based on the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997. PR #2418 updates the proxy to depend on the new proxy-api release, and implements the `Rule.request_timeout` field added to the API. However, that branch does *not* add a timeout for the `RouteBackend.request_timeout` field.

This branch changes the proxy to apply the backend request timeout when configured by the policy controller. It implements `RouteBackend.request_timeout` by adding an additional timeout layer in the `MatchedBackend` stack. This applies the per-backend timeout once a backend is selected for a route. I've also added stack tests for the interaction between the request and backend request timeouts.

Note that once retries are added to client policy stacks, it may be necessary to move the backend request timeout to ensure it occurs "below" retries, depending on where the retry middleware ends up being located in the proxy stack.
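For intuition, here is a minimal sketch of applying a per-backend timeout as a layer wrapped around a backend service, assuming the `tower` crate (with the `timeout` and `util` features) and `tokio`; `backend_svc`, the string request type, and the durations are placeholders rather than the actual `MatchedBackend` stack:

```rust
use std::time::Duration;
use tower::{Service, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Stand-in for the selected backend: a service that takes 2s to respond.
    let backend_svc = tower::service_fn(|req: String| async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        Ok::<_, std::convert::Infallible>(format!("response to {req}"))
    });

    // Wrap the backend in a timeout layer, analogous in spirit to the
    // per-backend timeout applied once a backend is selected for a route.
    let mut svc = ServiceBuilder::new()
        .timeout(Duration::from_secs(1))
        .service(backend_svc);

    // The call fails with a timeout error because the backend takes 2s.
    let res = svc.ready().await?.call("GET /".to_string()).await;
    println!("{res:?}");
    Ok(())
}
```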
PRs #2418 and #2419 add per-route and per-backend request timeouts configured by the `OutboundPolicies` API to the `MatchedRoute` and `MatchedBackend` layers in the outbound `ClientPolicy` stack, respectively. This means that, unlike in the `ServiceProfile` stack, two separate request timeouts can be configured in `ClientPolicy` stacks. However, because both the `MatchedRoute` and `MatchedBackend` layers are in the HTTP logical stack, the errors emitted by both timeouts will have a `LogicalError` as their most specific error metadata, meaning that the log messages and `l5d-proxy-error` headers recorded for these timeouts do not indicate whether the timeout that failed the request was the route request timeout or the backend request timeout.

In order to ensure this information is recorded and exposed to the user, this branch adds two new error wrapper types: one enriches an error with a `RouteRef`'s metadata, and the other enriches an error with a `BackendRef`'s metadata. The `MatchedRoute` stack now wraps all errors with `RouteRef` metadata, and the `MatchedBackend` stack wraps errors with `BackendRef` metadata. This way, when the route timeout fails a request, the error will include the route metadata, while when the backend request timeout fails a request, the error will include both the route and backend metadata.

Adding these new error wrappers also has the side benefit of adding this metadata to errors returned by filters, allowing users to distinguish between errors emitted by a filter on a route rule and errors emitted by a per-backend filter. Also, any other errors emitted lower in the stack for requests that are handled by a client policy stack will now include this metadata, which seems generally useful.

Example errors, taken from a proxy unit test:

backend request:

```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: backend service.test.test-svc:666: HTTP response timeout after 1s
```

route request:

```
logical service logical.test.svc.cluster.local:666: route httproute.test.timeout-route: HTTP response timeout after 2s
```
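As a rough sketch of the idea (pure std; `RouteError` and `BackendError` below are illustrative stand-ins, not the actual wrapper types added by this branch), enriching an error with route and backend metadata can be done by wrapping it in types whose `Display` impls prepend that metadata, producing chained messages like the examples above:

```rust
use std::{error::Error, fmt};

/// Hypothetical wrapper adding a route's metadata to an error.
#[derive(Debug)]
struct RouteError {
    route: String, // e.g. "httproute.test.timeout-route"
    source: Box<dyn Error>,
}

/// Hypothetical wrapper adding a backend's metadata to an error.
#[derive(Debug)]
struct BackendError {
    backend: String, // e.g. "service.test.test-svc:666"
    source: Box<dyn Error>,
}

impl fmt::Display for RouteError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "route {}: {}", self.route, self.source)
    }
}

impl fmt::Display for BackendError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "backend {}: {}", self.backend, self.source)
    }
}

impl Error for RouteError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(self.source.as_ref())
    }
}

impl Error for BackendError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(self.source.as_ref())
    }
}

fn main() {
    // A backend request timeout gets wrapped first with backend, then route metadata.
    let timeout: Box<dyn Error> = "HTTP response timeout after 1s".into();
    let err = RouteError {
        route: "httproute.test.timeout-route".into(),
        source: Box::new(BackendError {
            backend: "service.test.test-svc:666".into(),
            source: timeout,
        }),
    };
    // Prints: route httproute.test.timeout-route: backend service.test.test-svc:666:
    //         HTTP response timeout after 1s
    println!("{err}");
}
```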
Updating proxy version on main

* recover: remove unused `mut` (linkerd/linkerd2-proxy#2425)
* outbound: implement `OutboundPolicies` route request timeouts (linkerd/linkerd2-proxy#2418)
* build(deps): bump tj-actions/changed-files from 35.7.7 to 36.2.1 (linkerd/linkerd2-proxy#2427)
* outbound: implement `OutboundPolicies` backend request timeouts (linkerd/linkerd2-proxy#2419)
* outbound: add backend and route metadata to errors (linkerd/linkerd2-proxy#2428)

Signed-off-by: Eric Anderson <[email protected]>
The latest proxy-api release, v0.10.0, adds fields to the `OutboundPolicies` API for configuring HTTP request timeouts, based on the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997.

This branch updates the proxy-api dependency to v0.10.0 and adds the new timeout configuration fields to the proxy's internal client policy types. In addition, this branch adds a timeout middleware to the HTTP client policy stack, so that the timeout described by the `Rule.request_timeout` field is now applied.

Implementing the `RouteBackend.request_timeout` field with semantics as close as possible to those described in GEP-1742 will be somewhat more complex, and will be added in a separate PR. The reason for this complexity is that, as described in GEP-1742, the `timeouts.backendRequest` timeout is intended to bound only the time it takes for the backend to respond to the request, rather than bounding both time spent in the proxy and time spent in the backend (which is the intention behind the `timeouts.request` timeout). Therefore, the timeout for `backendRequest` should be started as "close" to the actual backend client as possible, ideally in the endpoint stack after the request has already made it through the load balancer queue. Starting this timeout in the logical stack, like the route's `request_timeout` does, would mean that time spent in the proxy's queues is included in this timeout, which seems not to be the intention. However, if we want these timeouts to start in the endpoint stack, we can't configure them using the target type, because a backend may be referenced with different timeouts by different HTTPRoutes, and we would prefer not to create separate client stacks based on those timeout values. Instead, we will need the request to carry the configured timeout value as an extension, and a new timeout middleware that applies timeouts from a request extension. This will be implemented in a follow-up PR.
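To make that follow-up design concrete, here is a minimal sketch assuming the `http` and `tokio` crates; `BackendRequestTimeout` and `call_backend` are hypothetical names, not linkerd's actual types. The timeout is carried as a request extension and only starts around the backend call, so time spent queued in the proxy is not counted:

```rust
use std::time::Duration;

/// Hypothetical request extension carrying the timeout chosen for a route backend.
#[derive(Clone, Copy, Debug)]
struct BackendRequestTimeout(Duration);

/// Stand-in for the backend client at the bottom of the endpoint stack.
async fn call_backend(_req: http::Request<()>) -> http::Response<()> {
    // Simulate a backend that takes 2s to respond.
    tokio::time::sleep(Duration::from_secs(2)).await;
    http::Response::new(())
}

/// Applies a timeout read from the request's extensions, so only time spent
/// waiting on the backend (not time spent queued in the proxy) is bounded.
async fn call_with_backend_timeout(
    req: http::Request<()>,
) -> Result<http::Response<()>, &'static str> {
    let timeout = req.extensions().get::<BackendRequestTimeout>().copied();
    match timeout {
        Some(BackendRequestTimeout(t)) => tokio::time::timeout(t, call_backend(req))
            .await
            .map_err(|_elapsed| "backend request timed out"),
        None => Ok(call_backend(req).await),
    }
}

#[tokio::main]
async fn main() {
    let mut req = http::Request::new(());
    // Earlier in the stack, the matched route's policy would insert the
    // configured per-backend timeout as a request extension.
    req.extensions_mut()
        .insert(BackendRequestTimeout(Duration::from_secs(1)));
    // Times out after 1s, since the simulated backend takes 2s.
    println!("{:?}", call_with_backend_timeout(req).await);
}
```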