Fix stale TCP connections blocking file distribution downloads #35932

Closed
papa99do wants to merge 1 commit into vespa-engine:master from marqo-ai:fix/stale-tcp-connection-file-distribution

@papa99do (Contributor) commented Feb 16, 2026

Problem

When the config-proxy's RPC connection to a config server becomes stale (half-open TCP socket), file distribution downloads hang. The download executor retries on the same dead socket because getTarget() still sees it as "valid". This blocks all file reference downloads on the affected node for ~16 minutes, until the config-proxy's internal JRT reconnect timer fires.

Root cause

  1. No mechanism to force-close a stale connection. FileReferenceDownloader had no way to detect repeated timeouts and proactively tear down a dead connection.

  2. Cross-thread socket visibility. The socket field in jrt/Connection.java is written by the Connector thread and read by the file downloader thread. Without volatile, closeSocket() could silently no-op due to JMM visibility — confirmed in production where ss -tnp showed the same local port and growing Send-Q after force-close attempts.

Fix

Timeout-based force-close (filedistribution/)

  • Replace boolean return from startDownloadRpc() with DownloadResult enum (SUCCESS, TIMEOUT, FAILURE), distinguishing ErrorCode.TIMEOUT from other errors.
  • Track consecutive RPC timeouts; after reaching the threshold, call connection.closeConnection() to synchronously close the TCP socket. The next retry establishes a fresh connection.
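
The counting logic described above can be sketched roughly as follows. This is not the code from the PR: `ConsecutiveTimeoutTracker` and its `Runnable` callback are hypothetical stand-ins for the downloader and for `connection.closeConnection()`, and the sketch assumes that any non-timeout result resets the streak.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Mirrors the three outcomes named in the PR description.
enum DownloadResult { SUCCESS, TIMEOUT, FAILURE }

// Hypothetical helper, not a class from the PR.
final class ConsecutiveTimeoutTracker {
    private final int maxTimeoutsBeforeClose;   // 0 disables force-close
    private final Runnable closeConnection;     // stand-in for connection.closeConnection()
    private final AtomicInteger consecutiveTimeouts = new AtomicInteger();

    ConsecutiveTimeoutTracker(int maxTimeoutsBeforeClose, Runnable closeConnection) {
        this.maxTimeoutsBeforeClose = maxTimeoutsBeforeClose;
        this.closeConnection = closeConnection;
    }

    /** Records one RPC outcome; returns true if the connection was force-closed. */
    boolean record(DownloadResult result) {
        if (result != DownloadResult.TIMEOUT) {
            // Assumption: success or a non-timeout failure breaks the streak.
            consecutiveTimeouts.set(0);
            return false;
        }
        int seen = consecutiveTimeouts.incrementAndGet();
        if (maxTimeoutsBeforeClose > 0 && seen >= maxTimeoutsBeforeClose) {
            consecutiveTimeouts.set(0);
            closeConnection.run();   // the next retry should get a fresh connection
            return true;
        }
        return false;
    }
}
```

With a threshold of 2, the first timeout only increments the counter and the second triggers the close; with a threshold of 0 the callback is never invoked.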

Cross-thread socket visibility (jrt/)

  • Add volatile SocketChannel channelForClose to Connection.java — written on socket creation, read in closeSocket(). This avoids adding volatile overhead to the hot I/O path (where socket is read on every packet).
  • Add closeSocket() as a public no-op on Target (since Connection is package-private), overridden in Connection.
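
A minimal sketch of the two-field pattern described above, under the assumption that the hot-path field stays non-volatile and a second volatile reference is published only for closing. `SketchTarget` and `SketchConnection` are hypothetical stand-ins, not the real jrt classes, and `attach()` is an invented name for wherever the socket is created.

```java
import java.io.IOException;
import java.nio.channels.SocketChannel;

// Hypothetical stand-in for the public jrt Target base class.
class SketchTarget {
    /** No-op by default, so callers holding the base type can always call it. */
    public void closeSocket() { }
}

// Hypothetical stand-in for the package-private jrt Connection.
class SketchConnection extends SketchTarget {
    private SocketChannel socket;                    // hot path: read by the I/O thread only
    private volatile SocketChannel channelForClose;  // published for other threads

    void attach(SocketChannel channel) {
        this.socket = channel;
        this.channelForClose = channel;  // volatile write makes the reference visible
    }

    @Override
    public void closeSocket() {
        SocketChannel channel = channelForClose;  // volatile read
        if (channel == null) return;              // socket not created yet
        try {
            channel.close();
        } catch (IOException e) {
            // Best effort in this sketch: the socket is being discarded anyway.
        }
    }

    boolean isOpen() {
        SocketChannel channel = channelForClose;
        return channel != null && channel.isOpen();
    }
}
```

The volatile write in `attach()` happens-before any later volatile read in `closeSocket()`, so a thread other than the one that created the socket is guaranteed to see the channel it needs to close.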

TCP socket options (jrt/)

  • Enable SO_KEEPALIVE (OS-level dead-peer detection) and SO_LINGER(0) (immediate RST on close) on JRT connections, matching C++ FNET behavior.
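
For illustration, setting these two options on a Java NIO channel might look like the sketch below. `SocketOptionsSketch.configure` is a hypothetical helper, not the code in jrt/Connection.java; only the `StandardSocketOptions` constants and `SocketChannel.setOption` are standard JDK API.

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

// Hypothetical helper, not part of jrt.
final class SocketOptionsSketch {
    /** Enables OS-level keepalive probes and zero linger on the given channel. */
    static SocketChannel configure(SocketChannel channel) throws IOException {
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        // A linger value of 0 seconds asks for an immediate reset on close.
        channel.setOption(StandardSocketOptions.SO_LINGER, 0);
        return channel;
    }
}
```

Two caveats apply. Keepalive probe timing comes from the operating system, and on Linux the default idle time before the first probe is two hours unless tuned, so SO_KEEPALIVE alone may not shorten detection much. The JDK documentation also describes SO_LINGER as intended for blocking-mode sockets and leaves close behavior on non-blocking sockets undefined.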

Connection close API (config/)

  • Add closeConnection() default method to Connection interface, implemented in JRTConnection to call target.closeSocket() (synchronous) then target.close() (async cleanup).
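
The shape of this API can be sketched as an interface default method plus one overriding implementation. All types below are simplified, hypothetical stand-ins (the real config Connection interface and jrt Target have more members); only the call order, synchronous closeSocket() followed by close(), is taken from the description above.

```java
// Hypothetical stand-in for the jrt Target as seen by the config module.
interface SketchRpcTarget {
    void closeSocket();  // synchronous TCP socket close
    void close();        // asynchronous cleanup
}

// Hypothetical stand-in for the config Connection interface.
interface SketchConfigConnection {
    String getAddress();

    /** No-op by default, so existing implementations need no changes. */
    default void closeConnection() { }
}

// Hypothetical stand-in for JRTConnection.
final class SketchJrtConnection implements SketchConfigConnection {
    private final String address;
    private final SketchRpcTarget target;

    SketchJrtConnection(String address, SketchRpcTarget target) {
        this.address = address;
        this.target = target;
    }

    @Override
    public String getAddress() { return address; }

    @Override
    public void closeConnection() {
        target.closeSocket();  // tear down the socket first, on the caller's thread
        target.close();        // then let the regular cleanup path run
    }
}
```

Making the method a default keeps the interface change source-compatible: implementations that do not own a socket inherit the no-op.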

How to activate

The feature is off by default. Two environment variables control it:

# Required: set an aggressive RPC timeout so retries happen within the download budget.
# By default rpc_timeout == download_timeout, so no retries are attempted.
VESPA_FILE_DOWNLOAD_RPC_TIMEOUT=5

# Force-close connection after N consecutive RPC timeouts (0 = disabled).
VESPA_FILE_DOWNLOAD_MAX_TIMEOUTS_BEFORE_CLOSE=2

With these settings: 5s timeout → close after 2 timeouts → fresh connection on 3rd attempt → recovery in ~15s instead of ~16min.
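
The ~15s figure follows from simple arithmetic: the threshold number of attempts each run into the RPC timeout, and one further attempt on the fresh connection takes at most one more timeout. The sketch below works that out. `FileDownloadSettingsSketch` is hypothetical, the timeout is assumed to be in seconds, and how the PR actually reads the variables is not shown in this description, so the settings are taken from a plain map.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical helper, not part of the PR.
final class FileDownloadSettingsSketch {
    /** Reads an integer setting, falling back when it is unset or malformed. */
    static int intSetting(Map<String, String> env, String name, int fallback) {
        String raw = env.get(name);
        if (raw == null || raw.isBlank()) return fallback;
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /** Upper bound: N timed-out attempts plus one attempt on the new connection. */
    static Duration worstCaseRecovery(Duration rpcTimeout, int maxTimeoutsBeforeClose) {
        if (maxTimeoutsBeforeClose <= 0) {
            throw new IllegalArgumentException("force-close is disabled");
        }
        return rpcTimeout.multipliedBy(maxTimeoutsBeforeClose + 1L);
    }
}
```

For the values above, `worstCaseRecovery(Duration.ofSeconds(5), 2)` yields 15 seconds; passing `System.getenv()` as the map would read the real environment.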

Files changed

| File | Change |
| --- | --- |
| jrt/.../Target.java | closeSocket() no-op on base class |
| jrt/.../Connection.java | volatile channelForClose, closeSocket() override, SO_KEEPALIVE, SO_LINGER(0) |
| config/.../Connection.java | closeConnection() default method |
| config/.../JRTConnection.java | closeTarget(): sync socket close + async cleanup |
| filedistribution/.../FileReferenceDownloader.java | DownloadResult enum, timeout counting, force-close logic |
| filedistribution/.../FileDownloader.java | Constructor overload for maxTimeoutsBeforeClose |
| filedistribution/.../FileDownloaderTest.java | 3 new tests + TimeoutResponseHandler mock |
| jrt/.../ConnectionTest.java | Cross-thread closeSocket() visibility test |

Test plan

  • Unit tests for timeout-based close (threshold=1, threshold=2, disabled-by-default)
  • JRT test for cross-thread closeSocket() visibility
  • Observed working in production on Vespa 8.639.59 — stale socket closed, new connection established, downloads resumed immediately

When a config server becomes unreachable (e.g. during rolling upgrades),
JRT connections can enter a half-open state where the TCP socket remains
open but the remote endpoint is gone. Subsequent RPC calls on these stale
connections hang until timeout rather than failing fast, blocking file
distribution downloads for extended periods.

Changes:
- Add closeSocket() to jrt Target/Connection for synchronous TCP socket
  close with cross-thread visibility via volatile field
- Enable SO_KEEPALIVE and SO_LINGER(0) on JRT connections for faster
  detection of dead peers and immediate socket teardown
- Add closeConnection() to config Connection interface and JRTConnection
  to allow callers to force-close stale connections
- Track consecutive RPC timeouts in FileReferenceDownloader; after
  a configurable threshold, force-close the connection so a fresh one
  is established on the next retry
- Feature is off by default (threshold=0); activate via the
  VESPA_FILE_DOWNLOAD_MAX_TIMEOUTS_BEFORE_CLOSE environment variable
- Add unit tests for timeout-based connection close behavior

@hmusum (Member) commented Feb 26, 2026

We did a preliminary review. Basically this PR does 3 things:

  1. Change filedistribution code to check results and keep track of how many times a timeout has occurred, closing the target if it has happened many times. This looks good, but we haven't looked at it in detail. I would prefer a separate PR for this.
  2. The keepalive and socket linger settings also look sane, I would like a separate PR for that as well.
  3. The closeConnection/closeSocket stuff is not going to be accepted. A separate PR is OK and we can discuss there, but it will not be accepted as it is. No details at this time, sorry.

@papa99do (Contributor, Author)

Splitting this PR into 3 separate PRs per review feedback:

  1. SO_KEEPALIVE / SO_LINGER socket options: standalone 2-line change in jrt/Connection.java
  2. Filedistribution timeout tracking + connection close: FileReferenceDownloader, FileDownloader, tests, plus closeConnection() default method on config/Connection.java (no-op until JRT implementation lands)
  3. closeSocket/closeConnection JRT implementation: volatile channelForClose, Target.closeSocket(), Connection.closeSocket() override, JRTConnection.closeTarget(), ConnectionTest.java (for separate discussion)

Will link the new PRs here once created.

@papa99do (Contributor, Author)

The first PR: #36060
The second PR: #36061
The third PR: I will discard it for now and first verify whether the first PR solves my issue.
