-
Notifications
You must be signed in to change notification settings - Fork 621
GEP-1742: HTTPRoute Timeouts API #1997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
accce11
0525880
7696daa
09ca210
18b6894
2799811
90606a8
f07a259
ad0e9c7
db6688b
224d342
7a01f3e
29cc9f0
26e451a
7b6315b
bef97ed
7e20c3d
650874d
659d478
3c0e8e6
6c01fa8
4e7c538
2a1c178
f9181ec
12c731a
ab8283d
bad35be
f0ac5ac
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -301,8 +301,100 @@ Could not find any HTTP specific timeouts. PRs welcomed. 😊 | |||||
|
|
||||||
| ## API | ||||||
|
|
||||||
| TBD. | ||||||
| The above diagrams show that there are many different kinds of configurable timeouts | ||||||
| supported by Gateway implementations: connect, idle, request, upstream, downstream. | ||||||
| Although there may be opportunity for the specification of a common API for more of | ||||||
| them in the future, this GEP will focus on the L7 timeouts in HTTPRoutes that are | ||||||
| most valuable to clients. | ||||||
|
Comment on lines
+306
to
+307
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think one of the most important questions here is where the timeout config belongs. This proposes HTTPRouteRule, but I think that both HTTPBackendRef and/or a new policy to extend Service would be compelling alternatives. It would be helpful if this GEP described why HTTPRouteRule is proposed and why the others alternatives were not chosen.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added some text in the Alternatives section. |
||||||
|
|
||||||
| From the above analysis, it appears that most (all?) implementations are capable of | ||||||
| supporting the configuration of simple client downstream request timeouts on HTTPRoute | ||||||
| rules. This is a relatively small addition that would benefit many users. | ||||||
|
|
||||||
| A client HTTP request to a gateway may result in more than one call to the destination | ||||||
|
||||||
| backend service, for example, if automatic retries are supported. Some implementations | ||||||
| support configuring a timeout for the individual backend requests, seperate from the | ||||||
|
||||||
| support configuring a timeout for the individual backend requests, seperate from the | |
| support configuring a timeout for the individual backend requests, separate from the |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/None-Core/non-Core/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify if the client request covers the entire connection from client to backend, or just the downstream request from client to gateway? There is also at least one tunnel timeout (HAProxy) and perhaps some other timeouts that cover the entire connection, not just the downstream and upstream halves of the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a diagram in the "Timeout values" section that shows the relationships, but you need to look at the rendered markdown to see it. https://github.com/kubernetes-sigs/gateway-api/blob/7e20c3d25368b2c91d0952dfbd842f46e3324b36/geps/gep-1742.md#timeout-values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did see the diagram, yet still wanted to get the explanation in words and in the godoc of the implementation section for Request and BackendRequest. I asked about something similar in that section.
I don't see any timeout that would cover the entire transaction, unless Request does.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest having a meaningful zero-value for the timeouts, like Envoy does. "A value of 0 will completely disable the .... timeout", even if other related timeouts are configured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a good idea to me. I'll add that 0 means disable (no timeout). Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also add a name and/or type to the HTTPRouteTimeouts in case there are multiple needed for any given route. For example, maybe an idle timeout would be a different duration than a connect timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what you're suggesting here. If we wanted (in the future) to add another timeout (e.g., idle timeout), wouldn't we just add another field to HTTPRouteTimeouts?
rules:
- backendRef:
...
timeouts:
request: 2s
idle: 10s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, so maybe you need to clarify in Goal and Non-Goals which timeout values you're addressing. If you call it "request", that doesn't mean much. What transactions are covered in "request"?
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although we've done it in some places throughout the API, I'd prefer to avoid adding new YAML snippets in our godocs because they're quite painful to support when generating docs.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NGINX doesn't support request timeouts. However, not sure how many other implementations are in the same situation.
Is it possible to have this feature as Extended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought NGINX's proxy_read_timeout aligned with Request timeout, is it a BackendRequest timeout, instead? If so, maybe BackendRequest should be Core, and Request should be Extended? Is there an NGINX timeout that takes retries into account?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought NGINX's proxy_read_timeout aligned with Request timeout, is it a BackendRequest timeout, instead?
unfortunately proxy_read_timeout applies not to the duration of a request, but to successive read operations from the proxied server :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Request *metav1.Duration `json:"request,omitempty"` | |
| ClientRequest *metav1.Duration `json:"clientRequest,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to prefer keeping it short-and-sweet (Request) but will switch if others also think ClientRequest would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally prefer greater specificity too, but this timeout as designed is broadly similar to what's called a "request timeout" by a few of the proxy implementations at least, so I think in this case, the more general term is okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you have 2 routes with same backend, and define 2 different backend timeouts ?
Will it mean that the backend request will have different timeouts depending on path ? Are all implementations capable of doing this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it mean that the backend request will have different timeouts depending on path ? Are all implementations capable of doing this ?
Yes, the backend request timeout is client request specific. Envoy supports this (perTryTimeout) but all implementations probably not, so that's why it's Extended (instead of Core) support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should document it should be LET Request timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hzxuzhonghu I don't quite understand what you are suggesting. Can you clarify please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be LET
Oh, did you mean to say LESS? If so, good idea. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not ISO 8601 as specified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to say format is as time.ParseDuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats probably my fault, I added a link to the POC PR mentioning ISO 8601 but didnt read much further into the actual format
apimachinery metav1.Duration is pretty light on documentation too, can maybe just rip the field documentation from go if we need to be specific: https://pkg.go.dev/maze.io/x/duration#ParseDuration
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section should go before the API section, to further explain the timeouts that are being added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| There a 2 kinds of timeouts that can be configured in an `HTTPRouteRule`: | |
| There are 2 kinds of timeouts that can be configured in an `HTTPRouteRule`: |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pedantic: we cannot measure the client end-to-end time, as there is latency between the client and us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I originally wrote that description... uh... maybe describe it as "an end-to-end timeout for a single request from the client to the Gateway API implementation", contrasted with "a single backend request from the Gateway API implementation to a workload"?
This would probably benefit from repeating the diagrams above to show where the two timeouts should measure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is not so pedantic. At a high level, the "useful" timeout that we're trying to support here is basically the time it takes to get a response. In the various diagrams above, it seems like there is a timeout that corresponds to the Finishes Response arrow, which is more-or-less what we want, but the question of when the timeout starts measuring is important. John pointed out to me that in the case of Envoy, the implementation route timeout (R timeout in the above diagram) does not start until the entire downstream request stream has been received. Not sure about Nginx and other impls, but seems we need to be more flexible with this definition, to make it implementable.
Maybe something like this:
timeouts.request is the timeout for the Gateway API implementation to send a response to a client HTTP request. Whether the gateway starts the timeout before or after the entire client request stream has been received, is implementation dependent.
Maybe we could first suggest the timeout begins after the entire client request stream has been received, and see if that is a problem for non-Envoy implementations?
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the diagrams I did earlier (needed to recover context myself), I think that it's going to be difficult to define this timeout in such a way that all the implementations can implement it.
I was going to say something like "from when the proxy has finished receiving the request from the client to when it starts sending the response back to the client", but some proxies can't do that, I think.
I'd definitely encourage @frankbu to take the initial source diagram and try and mark where the request timeout would cover on a diagram for this section, and for the backendRequest timeout as well.
It may be that we need to be more vague here and say something like "this timeout covers as close to the whole request->response transaction as is possible for the proxy implementation, given its configurable timeouts." and then leave some of it up to the implementations. This would have the downside of being difficult to test though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the common timeout we want to handle is for when it's taking too long to get the backend response. We should be able to test that, even if we make the spec a little vague about whether the timeout value includes other request overhead or not.
I'll update the wording to make it more clear, including adding the diagram with the proposed timeouts marked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated wording and added diagram, PTAL.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually fields marked as Core support have tighter requirements than this and don't have implementation-specific components. I think this might be OK, but we need to be confident that we can write a reliable conformance test for this that would be compatible with all implementations. Are we sure that all implementations can support this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first thought was to make it a requirement that the timeout start after the client request has been fully received, @youngnick had the same thought apparently (#1997 (comment), but unfortunately it seems that some proxies can't do that so this is the compromise.
I think a conformance test compatible with all instances is one that simply checks that a request times out when a backend is very slow, which btw is the main motivating use case for this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a conformance test compatible with all instances is one that simply checks that a request times out when a backend is very slow, which btw is the main motivating use case for this feature.
rereading this intent, I'm not sure the description and naming of this field really line up? may be nitpicky and discussed before but did we consider naming this something like backendRequest if the intent is to cover the case that a backend is slow seems like the naming should reflect that
the timeout for the Gateway API implementation to send a response to a client HTTP request sounds like something the client itself should be enforcing (as it is written here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we consider naming this something like
backendRequest
This proposal actually has an optional non-Core field named backendRequest 😃 It's used to set a timeout for a single request from the gateway to the backend. The request timeout is controlling how long the client should wait for a response given that the gateway may be making multiple backend requests (e.g., retries) before replying. Note that without retries or other fancy things going on, the two timeouts degenerate to more-or-less the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right, confused that field since it was talking about retries as well so forgot about it, but that field sounds to me to cover more like what the intent of your comment about testing above is
the language The request timeout is controlling how long the client should wait for a response... similarly timeouts.request is the timeout for the Gateway API implementation to send a response to a client HTTP request sounds like a timeout that a client would enforce
If the field is documented like this but really is implemented differently/means something else I think this will be confusing for users, implementers, and those trying to give advice on supporting an implementation (we already have trouble with users of Contour getting confused about timeouts we expose, which are a lot more specific in that they tie in to Envoy timeouts and cover specific hops in the request flow)
(this and other threads may be rehashing why this wasn't done earlier, generalizing timeouts feels like it will lead to confusion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another general option as opposed to trying to generalize timeouts would be to make timeouts for all hops of a proxied request "extended" support and implementations could advertise what they support via conformance profile reporting as well as documentation, which has advantages but also a lot of disadvantages tbh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to describe timeout.request in a way to be clear that it is for the Gateway API implementation to timeout doing it's job (which mostly means while waiting for 1 or more backend requests to finish). Maybe there's a better way to word it? I'll give it some thought, and suggestions are welcome 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which has advantages but also a lot of disadvantages tbh
Yeah, it would be nice to make at least something Core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment in the Go struct for more clarity.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I would set
timeout: PT1M
for a 1 minute timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I noticed that too. It's actually the Golang time.Duration string format, which somewhere I read is "based on ISO 8601", or something like that. In the comment in front of the go struct I have this:
// Timeout values are formatted like 1h/1m/1s/1ms as parsed by Golang time.ParseDuration
Not sure what I should reference where it has the complete syntax documented? The Golang time.ParseDuration doc is pretty vague: https://pkg.go.dev/time#ParseDuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah looks like this was already discussed, so maybe disregard my comment below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first read this I wondered why it only focused on HTTPRoute, but later I realized the original GEP specified that focus. It would be a good idea to change the title to "HTTPRoute Timeouts API", or else mention in Non-Goals that other route types are not covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Title changed