Fix stale TCP connections blocking file distribution downloads#35932
Closed
papa99do wants to merge 1 commit intovespa-engine:masterfrom
Closed
Fix stale TCP connections blocking file distribution downloads#35932papa99do wants to merge 1 commit intovespa-engine:masterfrom
papa99do wants to merge 1 commit intovespa-engine:masterfrom
Conversation
When a config server becomes unreachable (e.g. during rolling upgrades), JRT connections can enter a half-open state where the TCP socket remains open but the remote endpoint is gone. Subsequent RPC calls on these stale connections hang until timeout rather than failing fast, blocking file distribution downloads for extended periods. Changes: - Add closeSocket() to jrt Target/Connection for synchronous TCP socket close with cross-thread visibility via volatile field - Enable SO_KEEPALIVE and SO_LINGER(0) on JRT connections for faster detection of dead peers and immediate socket teardown - Add closeConnection() to config Connection interface and JRTConnection to allow callers to force-close stale connections - Track consecutive RPC timeouts in FileReferenceDownloader; after a configurable threshold, force-close the connection so a fresh one is established on the next retry - Feature is off by default (threshold=0); activate via the VESPA_FILE_DOWNLOAD_MAX_TIMEOUTS_BEFORE_CLOSE environment variable - Add unit tests for timeout-based connection close behavior
Member
|
We did a preliminary review. Basically this PR does 3 things:
|
Contributor
Author
|
Splitting this PR into 3 separate PRs per review feedback:
Will link the new PRs here once created. |
Contributor
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When the config-proxy's RPC connection to a config server becomes stale (half-open TCP socket ), file distribution downloads hang indefinitely. The download executor retries on the same dead socket because
getTarget()still sees it as "valid". This blocks all file reference downloads on the affected node for ~16 minutes until the configproxy's internal JRT reconnect timer fires.Root cause
No mechanism to force-close a stale connection.
FileReferenceDownloaderhad no way to detect repeated timeouts and proactively tear down a dead connection.Cross-thread socket visibility. The
socketfield injrt/Connection.javais written by the Connector thread and read by the file downloader thread. Withoutvolatile,closeSocket()could silently no-op due to JMM visibility — confirmed in production wheress -tnpshowed the same local port and growing Send-Q after force-close attempts.Fix
Timeout-based force-close (
filedistribution/)startDownloadRpc()withDownloadResultenum (SUCCESS,TIMEOUT,FAILURE), distinguishingErrorCode.TIMEOUTfrom other errors.connection.closeConnection()to synchronously close the TCP socket. The next retry establishes a fresh connection.Cross-thread socket visibility (
jrt/)volatile SocketChannel channelForClosetoConnection.java— written on socket creation, read incloseSocket(). This avoids adding volatile overhead to the hot I/O path (wheresocketis read on every packet).closeSocket()as a public no-op onTarget(sinceConnectionis package-private), overridden inConnection.TCP socket options (
jrt/)SO_KEEPALIVE(OS-level dead-peer detection) andSO_LINGER(0)(immediate RST on close) on JRT connections, matching C++ FNET behavior.Connection close API (
config/)closeConnection()default method toConnectioninterface, implemented inJRTConnectionto calltarget.closeSocket()(synchronous) thentarget.close()(async cleanup).How to activate
The feature is off by default. Two environment variables control it:
With these settings: 5s timeout → close after 2 timeouts → fresh connection on 3rd attempt → recovery in ~15s instead of ~16min.
Files changed
jrt/.../Target.javacloseSocket()no-op on base classjrt/.../Connection.javavolatile channelForClose,closeSocket()override,SO_KEEPALIVE,SO_LINGER(0)config/.../Connection.javacloseConnection()default methodconfig/.../JRTConnection.javacloseTarget()— sync socket close + async cleanupfiledistribution/.../FileReferenceDownloader.javaDownloadResultenum, timeout counting, force-close logicfiledistribution/.../FileDownloader.javamaxTimeoutsBeforeClosefiledistribution/.../FileDownloaderTest.javaTimeoutResponseHandlermockjrt/.../ConnectionTest.javacloseSocket()visibility testTest plan
closeSocket()visibility