This might be a re-hash of these and other similar issues:
#62
elixir-tesla/tesla#467
However, I do think something strange is happening, along with the other folks who commented on similar issues, some of them very recent and closely matching my experience. The issue is that certain requests fail with an {:error, "socket closed"} error without actually being executed at all, and not due to any real connectivity issue. As a side note, the {:error, "socket closed"} result is just the {:error, %Mint.TransportError{reason: :closed}} returned by Finch, wrapped by the Tesla adapter. But that's beside the point; it only suggests the adapter is an unlikely culprit.
These errors were much more common, roughly 10 times more, before I lowered pool_max_idle_time to 60_000 from the default :infinity. Yet the error still happens. I set up telemetry to try to gather more data on the issue. After capturing a few such cases, this is the pattern I usually see (telemetry events forwarded to logs):
16:16:59.565 [info] Finch telemetry: reused_connection.
16:16:59.566 [info] Finch telemetry: http request sent, idle_time: 115862688122, method: DELETE, path: <redacted>
16:16:59.566 [info] Finch telemetry: receive: start, idle_time: 115862688122, method: DELETE path: <redacted>
16:17:00.749 [info] Finch telemetry: pool idle timeout, host: <redacted>, :pool_max_idle_time_exceeded.
16:17:00.750 [info] Finch telemetry: receive: stop, idle_time: 115862688122, method: DELETE path: <redacted>
16:17:00.750 <socket closed error here in my code>
...
<exponential backoff>
<retry succeeds with fresh connection with a sane idle_time>
Please notice two things. First, the idle_time is always a very large number when this happens; this one is the equivalent of about 3.7 years. Successful requests have an expected idle_time, less than 60 seconds. Second, I noticed (I believe always) that there is a Finch telemetry: pool idle timeout event essentially immediately, a millisecond, before the connection closes. I didn't have telemetry from back when pool_max_idle_time was at its default (:infinity), but the number of such errors was greater then. It's hard for me to revert it to :infinity and test now, because this code is deployed to an environment where it's being used, doing so could increase the downstream error rate, and we need some traffic to make the error happen. If it would help debugging, though, I could try to make it work.
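For completeness, the telemetry forwarding is roughly the following sketch. The event names are Finch's documented telemetry events; the handler id, module name, and log format are my own and only illustrative:

```elixir
defmodule MyApp.FinchTelemetry do
  require Logger

  # The subset of Finch telemetry events that produced the log lines above.
  @events [
    [:finch, :reused_connection],
    [:finch, :send, :stop],
    [:finch, :recv, :start],
    [:finch, :recv, :stop],
    [:finch, :pool_max_idle_time_exceeded]
  ]

  def attach do
    :telemetry.attach_many("finch-debug-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event(event, measurements, metadata, _config) do
    # Just dump everything; the redacted log excerpt above is a prettified version of this.
    Logger.info(
      "Finch telemetry: #{inspect(event)} " <>
        "measurements=#{inspect(measurements)} metadata=#{inspect(metadata)}"
    )
  end
end
```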
If the pool shuts down/restarts because of the idle time, I don't think it should have. There was more than a second between the request being sent and the pool closing itself, so it should have registered that it was still in use and stayed alive with its connections open. On the flip side, I don't have an explanation for the idle_time either, as it was a huge number to begin with.
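For context, the relevant pool configuration looks roughly like this. MyFinch and the size/count values are placeholders; only pool_max_idle_time reflects the setting mentioned above:

```elixir
children = [
  {Finch,
   name: MyFinch,
   pools: %{
     :default => [
       size: 25,
       count: 1,
       # lowered from the default :infinity
       pool_max_idle_time: 60_000
     ]
   }}
]
```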
We see this a few times each day towards different hosts, maybe 2-6 events daily, on a mildly utilised system, and we expect more traffic soon. We used Hackney before migrating to Finch and didn't see similar issues; looking back at it, I don't think Hackney did any retries by itself either. We migrated to Finch because of the more advanced connection pool, but we had to build a lot of defensive logic around this issue which we didn't previously need. Which is kind of fine, connections can break or be unstable and that should be handled, but it still bothers me because these are not real connectivity issues in my opinion :-) . I would rather improve on this and keep Finch than migrate back to Hackney.
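The defensive logic mentioned above boils down to a retry with exponential backoff, roughly along these lines, assuming Tesla's built-in Tesla.Middleware.Retry; the module name, base URL, and the exact option values are illustrative, not our production settings:

```elixir
defmodule MyApp.Client do
  use Tesla

  # Exponential backoff between delay and max_delay, up to max_retries attempts.
  plug Tesla.Middleware.Retry,
    delay: 500,
    max_retries: 5,
    max_delay: 10_000,
    should_retry: fn
      # Transport-level failures, including the "socket closed" case.
      {:error, _reason} -> true
      {:ok, %{status: status}} when status in [429, 500, 502, 503, 504] -> true
      {:ok, _env} -> false
    end

  plug Tesla.Middleware.BaseUrl, "https://example.test"

  adapter Tesla.Adapter.Finch, name: MyFinch
end
```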
Please advise what additional information I can send. I'll continue looking on my side, but I'm not familiar with Finch's source code yet.
Edit: I've just seen issue #269 and it really resembles my case, and we're indeed using OTP 26.1.2, but the OP had issues with every connection to the affected host, whereas mine are haphazard.