Skip to content

Refactor SSL socket send path and improve test coverage#4909

Open
nanangizz wants to merge 5 commits intomasterfrom
refactor/ssl-send-path
Open

Refactor SSL socket send path and improve test coverage#4909
nanangizz wants to merge 5 commits intomasterfrom
refactor/ssl-send-path

Conversation

@nanangizz
Copy link
Copy Markdown
Member

@nanangizz nanangizz commented Apr 8, 2026

Summary

Major

  • Replace the ring-buffer send mechanism (send_buf + single send_buf_pending slot) with per-op pool-allocated ssl_send_op_t using embedded encrypted data buffer, free list with configurable cap (PJ_SSL_SEND_OP_FREE_LIST_MAX), and true memory release on discard. This eliminates the PJ_ENOMEM crash during renegotiation reported in Fix callback not being called after activesock shutdown (#4864) #4878.
  • Fix pre-existing GnuTLS ssl_write pointer arithmetic bug: ((read_data_t *)data) + total_written advanced by 16 bytes per unit instead of 1, causing data corruption after the first TLS record (~16KB boundary). Fixed to ((char *)data) + total_written.
  • Fix use-after-free in pj_ssl_sock_start_connect2 when TCP connect completes synchronously — user callback could destroy ssock, then code continued accessing it.

Medium

  • Add ssl_do_handshake_and_flush() wrapper unifying flush + error handling across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel, Darwin).

  • Fix flush_delayed_send: pass correct app_key, handle PJ_EPENDING to prevent double-send, invoke callback on synchronous success.

  • Fix ssock_on_data_sent: check app_key (not send_key) for handshake/shutdown detection — was dead code before.

  • Fix ssock_on_connect_complete: return via on_handshake_complete instead of bare PJ_ENOMEM from pj_bool_t function.

  • Fix accepted server-side socket leaks in ssl_sock tests: track accepted sockets in state structs and close explicitly in cleanup. Previously only listener and client sockets were closed, leaking GnuTLS sessions. Affects: stress tests, client_non_ssl, and large_msg_test.

  • Add LSAN suppressions for libgnutls.so and libtasn1.so to handle GnuTLS library-level leaks outside our control.

  • (Pre-existing, found during CI) Fix race condition in OpenSSL init_openssl(): openssl_init_count was set before cipher enumeration completed, causing concurrent callers to see zero ciphers. Fix: three-state flag (0=not started, -1=in progress, 1=done) with spin-wait. Also add NULL checks for SSL_CTX_new/SSL_new.

  • (Pre-existing, found during CI) Fix GnuTLS per-socket tls_deinit() bug: each socket called gnutls_global_deinit() on destroy, unbalancing the init/deinit count. Fix: init once (idempotent), register tls_deinit() via pj_atexit() for cleanup at pj_shutdown().

Minor / Miscellaneous

  • Fix perf_test result aggregation: state_cli[1]state_cli[i].
  • Add alloc_send_op NULL check for OOM safety.
  • Rename circ_buf_output/circ_buf_input to ssl_write_buf/ssl_read_buf for clarity (not always circular — OpenSSL uses memory BIO).
  • Add "Caller must hold write_mutex" documentation to ssl_write() in all 6 backends.
  • Reorder test cleanup: close sockets first, then poll to drain events, then destroy resources.

New tests (ssl_sock_stress.c)

  • send_load_test: 100 rapid sends with async callback verification
  • large_msg_test: 64KB message (multi-TLS-record) byte-for-byte echo verification
  • close_pending_test: close socket with pending sends — verify no crash/UAF
  • bidir_test: bidirectional simultaneous send load
  • mt_send_load_test: 3 worker threads + 3 concurrent SSL clients (closest to production scenario from Fix callback not being called after activesock shutdown (#4864) #4878)

CI

  • Add ioq-no-fast-track matrix job (select+OpenSSL, epoll+GnuTLS) with PJ_IOQUEUE_FAST_TRACK=0 + ASan, running ioqueue, activesock, ssl_sock, and ssl_sock_stress tests.

Test plan

  • make -j3 — zero errors, zero warnings
  • ssl_sock_test — all tests pass (Linux/OpenSSL + ASan)
  • ssl_sock_stress_test — all tests pass (Linux/OpenSSL + ASan)
  • Tested on Windows MSVC: Schannel + IOCP, Schannel + select, OpenSSL + select
  • Tested with PJ_IOQUEUE_FAST_TRACK=0 — async path exercised (pending=200, sent_cb=200)
  • CI: ioq-no-fast-track / select+ossl
  • CI: ioq-no-fast-track / epoll+gtls

Closes #4878 (SSL-layer issues). The ioqueue-layer fix was merged separately in af42c36.

Co-Authored-By: Claude Code

Known issue: FAST_TRACK + GnuTLS stress test

send_load_test with SEND_LOAD_COUNT=200 hangs under GnuTLS + PJ_IOQUEUE_FAST_TRACK=1 (default). The server receives N-1 out of N packets — a TCP partial write from send() causes data loss. The whole_data retry in pj_activesock_send should handle partial writes, but a timing-dependent interaction prevents it. Investigation findings:

  • Reproduces with both select and epoll ioqueue — independent of ioqueue backend
  • Does not reproduce with OpenSSL
  • Without echo (server only receives, no send): passes. With echo send inside on_data_read callback: fails
  • One suspected factor is that GnuTLS is the only backend setting TCP_NODELAY in pj_ssl_sock_param_default, but this has not been confirmed
  • The issue is timing-sensitive — adding fprintf in the code path makes it disappear

Workaround: SEND_LOAD_COUNT reduced to 100. The CI ioq-no-fast-track job uses FAST_TRACK=0 which avoids the issue entirely.

Items for separate PRs (from deep review)

  • GnuTLS cert verification: hostname check, x509 type check, cert leak on error path
  • Schannel: IP SAN matching, shutdown flush
  • C89 compliance fixes across SSL backends
  • Test robustness: echo_test timeout tolerance, echo op_key reuse

@nanangizz nanangizz force-pushed the refactor/ssl-send-path branch from a8dcecc to d750782 Compare April 8, 2026 15:03
Replace the ring-buffer-based send mechanism (send_buf + single
send_buf_pending slot) with per-op pool-allocated ssl_send_op_t
using embedded encrypted data buffer, free list with configurable
cap, and true memory release on discard. This eliminates the
PJ_ENOMEM crash during renegotiation reported in #4878.

Key changes:
- New ssl_send_op_t with per-op pool for true memory release
- ssl_do_handshake_and_flush() wrapper unifying error handling
  across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel,
  Darwin)
- Fix flush_delayed_send: pass correct app_key, handle PJ_EPENDING
  to prevent double-send, invoke callback on synchronous success
- Fix ssock_on_data_sent: check app_key (not send_key) for
  handshake/shutdown detection (was dead code before)
- Fix ssock_on_connect_complete: return via on_handshake_complete
  instead of bare PJ_ENOMEM from pj_bool_t function
- Rename circ_buf_output/input to ssl_write_buf/ssl_read_buf for
  clarity (not always circular — OpenSSL uses memory BIO)
- Add "Caller must hold write_mutex" documentation to ssl_write()
  in all 6 backends

Test improvements:
- send_load_test: 200 rapid sends with async callback verification
- large_msg_test: 64KB message (multi-TLS-record) echo verification
- close_pending_test: close socket with pending sends (no crash)
- bidir_test: bidirectional simultaneous send load
- mt_send_load_test: 3 worker threads + 3 concurrent clients

CI: add ioq-no-fast-track job (PJ_IOQUEUE_FAST_TRACK=0 + ASan)
to force async send path in activesock and ssl_sock tests.

Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the refactor/ssl-send-path branch from d750782 to bcd5049 Compare April 8, 2026 15:10
- Fix pre-existing GnuTLS ssl_write bug: pointer arithmetic used
  (read_data_t *)data + offset instead of (char *)data + offset,
  advancing by 16 bytes per unit instead of 1, causing data
  corruption after the first TLS record (16KB boundary).
- Fix accepted server-side socket leaks in stress tests: track
  accepted sockets in server state and close explicitly in cleanup.
  Previously only the listener and client sockets were closed.
- Reorder test cleanup: close sockets first, then poll to drain
  pending events, then destroy timer/ioqueue/pool.
- Reduce SEND_LOAD_COUNT from 200 to 100 to avoid a pre-existing
  FAST_TRACK + GnuTLS interaction issue where the last packet
  gets stuck when echoing inside on_data_read callback.

Co-Authored-By: Claude Code
The client_non_ssl test creates an SSL server that accepts a plain TCP
client. The accepted server-side SSL socket was never explicitly closed,
leaking the GnuTLS session. Track the accepted socket via test_state
and close it in cleanup.

Co-Authored-By: Claude Code
The client_non_ssl test's accepted socket destruction is deferred via
grp_lock ref counting. Add post-close polling to allow deferred
destruction to complete. Also add LSAN suppressions for libgnutls.so
and libtasn1.so to handle any remaining GnuTLS library-level leaks
that are outside our control.

Co-Authored-By: Claude Code
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors PJLIB SSL socket send/handshake plumbing across backends to fix correctness issues under load (notably renegotiation/OOM paths), and expands automated coverage via new stress and large-message tests plus CI matrix coverage.

Changes:

  • Replaces the SSL socket network-send buffering with per-send ssl_send_op_t allocations (with recycling/free-list) and unifies handshake+flush behavior.
  • Fixes several backend-specific issues (e.g., GnuTLS write pointer arithmetic) and hardens library init/shutdown behavior.
  • Adds new SSL socket stress/large-message tests and a CI job to exercise non-fast-track ioqueue paths under ASan/LSan.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/sanitizers/lsan.supp Adds LSAN suppressions for GnuTLS/tasn1.
pjlib/src/pjlib-test/test.h Registers ssl_sock_stress_test() in test declarations.
pjlib/src/pjlib-test/test.c Adds ssl_sock_stress_test to the pjlib test suite.
pjlib/src/pjlib-test/ssl_sock.c Refactors test shared defs into a header, fixes cleanup leaks, adds 64KB large-message echo test, fixes perf aggregation.
pjlib/src/pjlib-test/ssl_sock_test.h New shared helpers for SSL socket tests (cert load + server creation).
pjlib/src/pjlib-test/ssl_sock_stress.c New stress tests (send load, close-with-pending, bidir, MT load).
pjlib/src/pj/ssl_sock_imp_common.h Introduces ssl_send_op_t, replaces old send buffer fields, renames circ-buf fields to ssl read/write buffers.
pjlib/src/pj/ssl_sock_imp_common.c Implements send-op allocation/freeing, new flush path, handshake+flush wrapper, delayed-send fixes, connect completion UAF fix.
pjlib/src/pj/ssl_sock_ossl.c Improves OpenSSL init robustness, renames flush helper, documents locking expectations.
pjlib/src/pj/ssl_sock_gtls.c Makes GnuTLS global init idempotent, fixes write pointer arithmetic, updates buffer naming, adds write-mutex doc.
pjlib/src/pj/ssl_sock_mbedtls.c Updates to new ssl read/write buffer naming and handshake flush behavior.
pjlib/src/pj/ssl_sock_schannel.c Updates to new ssl read/write buffer naming and flush helper rename; adds locking doc.
pjlib/src/pj/ssl_sock_darwin.c Updates to new ssl read/write buffer naming and removes backend-local flush calls.
pjlib/src/pj/ssl_sock_apple.m Updates to new ssl read/write buffer naming and adds locking doc.
pjlib/CMakeLists.txt Adds ssl_sock_stress.c to CMake test build.
pjlib/build/pjlib_test.vcxproj(.filters) Adds ssl_sock_stress.c to Visual Studio test project.
pjlib/build/Makefile Adds ssl_sock_stress.o to autotools test objects.
.github/workflows/ci-linux.yml Adds ioq-no-fast-track matrix job running SSL/ioqueue tests under ASan/LSan with FAST_TRACK disabled.

OpenSSL:
- Fix race in init_openssl(): openssl_init_count was set to 1 before
  cipher enumeration completed. A concurrent thread could see the flag,
  skip init, and find ssl_cipher_num still zero. Fix: use three-state
  flag (0=not started, -1=in progress, 1=done) with spin-wait.
- Add NULL checks for SSL_CTX_new/SSL_new in init_openssl().
- Check init_openssl() return in ssl_ciphers_populate() and ssl_create().

GnuTLS:
- Fix race in tls_init(): add same three-state guard as OpenSSL.
- Fix per-socket tls_deinit() bug: each socket called
  gnutls_global_deinit() on destroy, but with the new global init
  guard, gnutls_global_init() was only called once — causing unbalanced
  deinit. Fix: remove per-socket init/deinit, call tls_init() once
  (idempotent) and register tls_deinit() via pj_atexit() for cleanup
  at pj_shutdown().
- Check tls_init() return in ssl_ciphers_populate().

Co-Authored-By: Claude Code
@azertyfun
Copy link
Copy Markdown
Contributor

Seems like a pretty good improvement.

Things that I had a hard time to reason about that have not been addressed by this PR:

@nanangizz
Copy link
Copy Markdown
Member Author

@azertyfun Thanks for the thorough feedback. We investigated all three points. Here's what we found and our plan.

1. write_mutex thread safety

We agree the current lock juggling (write_mutexasock_send_mutex → release write_mutex) is hard to reason about. We also found a data serialization gap: in ssl_send, write_mutex is released after ssl_write but before flush_ssl_write_buf — allowing another thread to interleave ssl_write into the same ssl_write_buf, merging encrypted data from different sends into one send operation and losing the other's callback. Furthermore, asock_send_mutex only protects the pj_activesock_send() call, not the reading from ssl_write_buf — so TLS records from different threads can arrive at the peer out of order.

Our plan: replace the current lock juggling with a producer-consumer pattern (similar to the CALLBACK_NO_LOCK mechanism). The encrypt + queue step is atomic under write_mutex, but write_mutex is released before pj_activesock_send(). Only one thread does the actual sending (controlled by a sending_in_progress flag); other threads just queue their ops and return PJ_EPENDING. The sending thread drains all queued ops, and ssock_on_data_sent continues draining on async completion. Ops that were queued by other callers (who received PJ_EPENDING) get their callbacks fired during the drain. A fully synchronous send (caller is the sending thread, send completes immediately) returns PJ_SUCCESS without callback — same as current behavior. This eliminates all lock ordering concerns — safe with any CALLBACK_NO_LOCK setting.

2. PJ_EPENDING code paths and lost callbacks

We audited all flush_ssl_write_buf and flush_delayed_send call sites. Confirmed issues:

  • ssock_on_data_sent line 882: flush_delayed_send return value is ignored. If renegotiation flush fails, the app never gets a callback for that data — exactly the resource leak pattern from Fix callback not being called after activesock shutdown (#4864) #4878.
  • OpenSSL/Schannel ssl_reset_sock_state: close_notify flush return ignored, socket closed immediately. Async send orphaned.
  • ssl_on_destroy: active send_ops released without firing callbacks.

The producer-consumer redesign in point 1 addresses most of these — the drain loop ensures every queued op gets a callback. The shutdown path issues will be addressed in the same follow-up.

3. TLS renegotiation test coverage

Agree this is the key gap. We checked all backends — renegotiation is actually implemented in OpenSSL (SSL_renegotiate), GnuTLS (gnutls_rehandshake), and Darwin (SSLReHandshake). mbedTLS is a no-op and Apple returns PJ_ENOTSUP. We'll add renegotiation-during-transfer variants of the stress tests using TLS 1.2 (TLS 1.3 replaced renegotiation with key update). This would have caught the original #4878 PJ_ENOMEM.

We plan to address all three points in a follow-up PR — the send path redesign is a significant change that warrants its own review cycle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants