Track consecutive RPC timeouts in file distribution and close stale connections #36061

Merged
hmusum merged 1 commit into vespa-engine:master from marqo-ai:fix/filedistribution-timeout-close
Mar 9, 2026

@papa99do
Contributor

Summary

  • Adds timeout tracking to file distribution RPC downloads: when the number of consecutive timeouts reaches a configurable threshold, the connection is closed via Target.close() so that JRTConnection.getTarget() reconnects automatically
  • Adds closeConnection() to the Connection interface (default no-op) with implementation in JRTConnection
  • Controlled by VESPA_FILE_DOWNLOAD_MAX_TIMEOUTS_BEFORE_CLOSE env var (default 0 = disabled)
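
For illustration, a minimal sketch of how a threshold like this could be read from the environment. The class name, the helper method, and the fallback to 0 on blank, negative, or unparsable values are assumptions made for this example; the PR description only states the variable name and that the default of 0 disables the feature.

```java
import java.util.Map;

public class MaxTimeoutsConfigSketch {

    // Variable name as given in the PR description.
    static final String ENV_NAME = "VESPA_FILE_DOWNLOAD_MAX_TIMEOUTS_BEFORE_CLOSE";

    // Hypothetical helper: takes the environment as a map so it can be
    // exercised without touching the real process environment.
    static int maxTimeoutsBeforeClose(Map<String, String> env) {
        String raw = env.get(ENV_NAME);
        if (raw == null || raw.isBlank()) return 0; // unset: feature disabled
        try {
            // Assumption: negative values are treated like 0 (disabled).
            return Math.max(0, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            // Assumption: unparsable values fall back to the default.
            return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(maxTimeoutsBeforeClose(System.getenv()));
    }
}
```

Passing the environment in as a map keeps the parsing logic testable; the actual change may read and validate the variable differently.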

Details

When a file distribution RPC request times out, the underlying TCP connection may be stale (e.g., broken by a load balancer or network partition). Previously, the downloader would keep retrying on the same dead connection until the overall download timeout expired.

Now, FileReferenceDownloader returns a DownloadResult enum (SUCCESS/TIMEOUT/FAILURE) from startDownloadRpc. In waitUntilDownloadStarted, consecutive timeouts are tracked per connection. When the count reaches maxTimeoutsBeforeClose, the connection is closed and the counter resets. The connection pool's switchConnection on each retry naturally picks up a fresh connection.
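
The tracking described above can be sketched roughly as follows. This is not the code from the PR: the Connection interface here is a cut-down stand-in, and TimeoutTracker, CountingConnection, and the per-address map are hypothetical names introduced for the example. Resetting the counter on any non-timeout result is also an assumption, based on the word "consecutive".

```java
import java.util.HashMap;
import java.util.Map;

public class TimeoutTrackingSketch {

    // The three outcomes named in the PR description.
    enum DownloadResult { SUCCESS, TIMEOUT, FAILURE }

    // Hypothetical stand-in for the real Connection interface; only the
    // default no-op closeConnection() comes from the PR description.
    interface Connection {
        String address();
        default void closeConnection() { }
    }

    // Test double that counts how many times it has been closed.
    static final class CountingConnection implements Connection {
        private final String address;
        int closes = 0;
        CountingConnection(String address) { this.address = address; }
        @Override public String address() { return address; }
        @Override public void closeConnection() { closes++; }
    }

    // Hypothetical helper holding the consecutive-timeout count per connection.
    static final class TimeoutTracker {
        private final int maxTimeoutsBeforeClose;
        private final Map<String, Integer> consecutiveTimeouts = new HashMap<>();

        TimeoutTracker(int maxTimeoutsBeforeClose) {
            this.maxTimeoutsBeforeClose = maxTimeoutsBeforeClose;
        }

        void record(Connection connection, DownloadResult result) {
            if (maxTimeoutsBeforeClose <= 0) return; // 0 = disabled
            if (result != DownloadResult.TIMEOUT) {
                // Assumption: any non-timeout result breaks the streak.
                consecutiveTimeouts.remove(connection.address());
                return;
            }
            int count = consecutiveTimeouts.merge(connection.address(), 1, Integer::sum);
            if (count >= maxTimeoutsBeforeClose) {
                connection.closeConnection();                     // drop the stale connection
                consecutiveTimeouts.remove(connection.address()); // counter resets
            }
        }
    }

    public static void main(String[] args) {
        CountingConnection connection = new CountingConnection("tcp/localhost:19070");
        TimeoutTracker tracker = new TimeoutTracker(2);
        for (int i = 0; i < 6; i++) {
            tracker.record(connection, DownloadResult.TIMEOUT);
        }
        System.out.println("closes: " + connection.closes); // prints "closes: 3"
    }
}
```

With a threshold of 2 and six timeouts in a row, the sketch closes the connection three times, the same arithmetic that testConnectionCloseAfterNTimeouts checks.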

Test plan

  • testConnectionCloseOnTimeout — threshold=1, verifies close called for each timeout
  • testConnectionCloseAfterNTimeouts — threshold=2 with 6 timeouts, verifies 3 closes
  • testNoConnectionCloseOnTimeoutByDefault — threshold=0, verifies no closes
  • All existing FileDownloaderTest tests pass unchanged

…onnections

When file distribution RPC requests time out repeatedly, the underlying
TCP connection may be stale. This adds tracking of consecutive timeouts
per connection in FileReferenceDownloader. When the count reaches a
configurable threshold, the connection is closed via Target.close() so
that JRTConnection.getTarget() will establish a fresh one on next use.

The threshold is controlled by VESPA_FILE_DOWNLOAD_MAX_TIMEOUTS_BEFORE_CLOSE
(default 0 = disabled), consistent with existing env var patterns.

Changes:
- Add closeConnection() default method to Connection interface
- Implement closeConnection() in JRTConnection using target.close()
- Add DownloadResult enum (SUCCESS/TIMEOUT/FAILURE) to FileReferenceDownloader
- Track consecutive timeouts and close connection when threshold is reached
- Add FileDownloader constructor overload accepting maxTimeoutsBeforeClose
- Add 3 tests: threshold=1, threshold=2, and disabled (threshold=0)
Member

@hmusum left a comment

This looks good, thanks for the contribution

@hmusum merged commit 3303b36 into vespa-engine:master on Mar 9, 2026
2 of 3 checks passed