Refactor SSL socket send path and improve test coverage#4909
Refactor SSL socket send path and improve test coverage#4909
Conversation
a8dcecc to
d750782
Compare
Replace the ring-buffer-based send mechanism (send_buf + single send_buf_pending slot) with per-op pool-allocated ssl_send_op_t using embedded encrypted data buffer, free list with configurable cap, and true memory release on discard. This eliminates the PJ_ENOMEM crash during renegotiation reported in #4878. Key changes: - New ssl_send_op_t with per-op pool for true memory release - ssl_do_handshake_and_flush() wrapper unifying error handling across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel, Darwin) - Fix flush_delayed_send: pass correct app_key, handle PJ_EPENDING to prevent double-send, invoke callback on synchronous success - Fix ssock_on_data_sent: check app_key (not send_key) for handshake/shutdown detection (was dead code before) - Fix ssock_on_connect_complete: return via on_handshake_complete instead of bare PJ_ENOMEM from pj_bool_t function - Rename circ_buf_output/input to ssl_write_buf/ssl_read_buf for clarity (not always circular — OpenSSL uses memory BIO) - Add "Caller must hold write_mutex" documentation to ssl_write() in all 6 backends Test improvements: - send_load_test: 200 rapid sends with async callback verification - large_msg_test: 64KB message (multi-TLS-record) echo verification - close_pending_test: close socket with pending sends (no crash) - bidir_test: bidirectional simultaneous send load - mt_send_load_test: 3 worker threads + 3 concurrent clients CI: add ioq-no-fast-track job (PJ_IOQUEUE_FAST_TRACK=0 + ASan) to force async send path in activesock and ssl_sock tests. Co-Authored-By: Claude Code
d750782 to
bcd5049
Compare
- Fix pre-existing GnuTLS ssl_write bug: pointer arithmetic used (read_data_t *)data + offset instead of (char *)data + offset, advancing by 16 bytes per unit instead of 1, causing data corruption after the first TLS record (16KB boundary). - Fix accepted server-side socket leaks in stress tests: track accepted sockets in server state and close explicitly in cleanup. Previously only the listener and client sockets were closed. - Reorder test cleanup: close sockets first, then poll to drain pending events, then destroy timer/ioqueue/pool. - Reduce SEND_LOAD_COUNT from 200 to 100 to avoid a pre-existing FAST_TRACK + GnuTLS interaction issue where the last packet gets stuck when echoing inside on_data_read callback. Co-Authored-By: Claude Code
The client_non_ssl test creates an SSL server that accepts a plain TCP client. The accepted server-side SSL socket was never explicitly closed, leaking the GnuTLS session. Track the accepted socket via test_state and close it in cleanup. Co-Authored-By: Claude Code
The client_non_ssl test's accepted socket destruction is deferred via grp_lock ref counting. Add post-close polling to allow deferred destruction to complete. Also add LSAN suppressions for libgnutls.so and libtasn1.so to handle any remaining GnuTLS library-level leaks that are outside our control. Co-Authored-By: Claude Code
9d9d19d to
d34eb8a
Compare
There was a problem hiding this comment.
Pull request overview
Refactors PJLIB SSL socket send/handshake plumbing across backends to fix correctness issues under load (notably renegotiation/OOM paths), and expands automated coverage via new stress and large-message tests plus CI matrix coverage.
Changes:
- Replaces the SSL socket network-send buffering with per-send
ssl_send_op_tallocations (with recycling/free-list) and unifies handshake+flush behavior. - Fixes several backend-specific issues (e.g., GnuTLS write pointer arithmetic) and hardens library init/shutdown behavior.
- Adds new SSL socket stress/large-message tests and a CI job to exercise non-fast-track ioqueue paths under ASan/LSan.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/sanitizers/lsan.supp | Adds LSAN suppressions for GnuTLS/tasn1. |
| pjlib/src/pjlib-test/test.h | Registers ssl_sock_stress_test() in test declarations. |
| pjlib/src/pjlib-test/test.c | Adds ssl_sock_stress_test to the pjlib test suite. |
| pjlib/src/pjlib-test/ssl_sock.c | Refactors test shared defs into a header, fixes cleanup leaks, adds 64KB large-message echo test, fixes perf aggregation. |
| pjlib/src/pjlib-test/ssl_sock_test.h | New shared helpers for SSL socket tests (cert load + server creation). |
| pjlib/src/pjlib-test/ssl_sock_stress.c | New stress tests (send load, close-with-pending, bidir, MT load). |
| pjlib/src/pj/ssl_sock_imp_common.h | Introduces ssl_send_op_t, replaces old send buffer fields, renames circ-buf fields to ssl read/write buffers. |
| pjlib/src/pj/ssl_sock_imp_common.c | Implements send-op allocation/freeing, new flush path, handshake+flush wrapper, delayed-send fixes, connect completion UAF fix. |
| pjlib/src/pj/ssl_sock_ossl.c | Improves OpenSSL init robustness, renames flush helper, documents locking expectations. |
| pjlib/src/pj/ssl_sock_gtls.c | Makes GnuTLS global init idempotent, fixes write pointer arithmetic, updates buffer naming, adds write-mutex doc. |
| pjlib/src/pj/ssl_sock_mbedtls.c | Updates to new ssl read/write buffer naming and handshake flush behavior. |
| pjlib/src/pj/ssl_sock_schannel.c | Updates to new ssl read/write buffer naming and flush helper rename; adds locking doc. |
| pjlib/src/pj/ssl_sock_darwin.c | Updates to new ssl read/write buffer naming and removes backend-local flush calls. |
| pjlib/src/pj/ssl_sock_apple.m | Updates to new ssl read/write buffer naming and adds locking doc. |
| pjlib/CMakeLists.txt | Adds ssl_sock_stress.c to CMake test build. |
| pjlib/build/pjlib_test.vcxproj(.filters) | Adds ssl_sock_stress.c to Visual Studio test project. |
| pjlib/build/Makefile | Adds ssl_sock_stress.o to autotools test objects. |
| .github/workflows/ci-linux.yml | Adds ioq-no-fast-track matrix job running SSL/ioqueue tests under ASan/LSan with FAST_TRACK disabled. |
OpenSSL: - Fix race in init_openssl(): openssl_init_count was set to 1 before cipher enumeration completed. A concurrent thread could see the flag, skip init, and find ssl_cipher_num still zero. Fix: use three-state flag (0=not started, -1=in progress, 1=done) with spin-wait. - Add NULL checks for SSL_CTX_new/SSL_new in init_openssl(). - Check init_openssl() return in ssl_ciphers_populate() and ssl_create(). GnuTLS: - Fix race in tls_init(): add same three-state guard as OpenSSL. - Fix per-socket tls_deinit() bug: each socket called gnutls_global_deinit() on destroy, but with the new global init guard, gnutls_global_init() was only called once — causing unbalanced deinit. Fix: remove per-socket init/deinit, call tls_init() once (idempotent) and register tls_deinit() via pj_atexit() for cleanup at pj_shutdown(). - Check tls_init() return in ssl_ciphers_populate(). Co-Authored-By: Claude Code
d34eb8a to
7c21a98
Compare
|
Seems like a pretty good improvement. Things that I had a hard time to reason about that have not been addressed by this PR:
|
|
@azertyfun Thanks for the thorough feedback. We investigated all three points. Here's what we found and our plan. 1. write_mutex thread safety We agree the current lock juggling ( Our plan: replace the current lock juggling with a producer-consumer pattern (similar to the 2. PJ_EPENDING code paths and lost callbacks We audited all
The producer-consumer redesign in point 1 addresses most of these — the drain loop ensures every queued op gets a callback. The shutdown path issues will be addressed in the same follow-up. 3. TLS renegotiation test coverage Agree this is the key gap. We checked all backends — renegotiation is actually implemented in OpenSSL ( We plan to address all three points in a follow-up PR — the send path redesign is a significant change that warrants its own review cycle. |
Summary
Major
send_buf+ singlesend_buf_pendingslot) with per-op pool-allocatedssl_send_op_tusing embedded encrypted data buffer, free list with configurable cap (PJ_SSL_SEND_OP_FREE_LIST_MAX), and true memory release on discard. This eliminates thePJ_ENOMEMcrash during renegotiation reported in Fix callback not being called after activesock shutdown (#4864) #4878.ssl_writepointer arithmetic bug:((read_data_t *)data) + total_writtenadvanced by 16 bytes per unit instead of 1, causing data corruption after the first TLS record (~16KB boundary). Fixed to((char *)data) + total_written.pj_ssl_sock_start_connect2when TCP connect completes synchronously — user callback could destroy ssock, then code continued accessing it.Medium
Add
ssl_do_handshake_and_flush()wrapper unifying flush + error handling across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel, Darwin).Fix
flush_delayed_send: pass correctapp_key, handlePJ_EPENDINGto prevent double-send, invoke callback on synchronous success.Fix
ssock_on_data_sent: checkapp_key(notsend_key) for handshake/shutdown detection — was dead code before.Fix
ssock_on_connect_complete: return viaon_handshake_completeinstead of barePJ_ENOMEMfrompj_bool_tfunction.Fix accepted server-side socket leaks in ssl_sock tests: track accepted sockets in state structs and close explicitly in cleanup. Previously only listener and client sockets were closed, leaking GnuTLS sessions. Affects: stress tests,
client_non_ssl, andlarge_msg_test.Add LSAN suppressions for
libgnutls.soandlibtasn1.soto handle GnuTLS library-level leaks outside our control.(Pre-existing, found during CI) Fix race condition in OpenSSL
init_openssl():openssl_init_countwas set before cipher enumeration completed, causing concurrent callers to see zero ciphers. Fix: three-state flag (0=not started, -1=in progress, 1=done) with spin-wait. Also add NULL checks forSSL_CTX_new/SSL_new.(Pre-existing, found during CI) Fix GnuTLS per-socket
tls_deinit()bug: each socket calledgnutls_global_deinit()on destroy, unbalancing the init/deinit count. Fix: init once (idempotent), registertls_deinit()viapj_atexit()for cleanup atpj_shutdown().Minor / Miscellaneous
perf_testresult aggregation:state_cli[1]→state_cli[i].alloc_send_opNULL check for OOM safety.circ_buf_output/circ_buf_inputtossl_write_buf/ssl_read_buffor clarity (not always circular — OpenSSL uses memory BIO).write_mutex" documentation tossl_write()in all 6 backends.New tests (
ssl_sock_stress.c)send_load_test: 100 rapid sends with async callback verificationlarge_msg_test: 64KB message (multi-TLS-record) byte-for-byte echo verificationclose_pending_test: close socket with pending sends — verify no crash/UAFbidir_test: bidirectional simultaneous send loadmt_send_load_test: 3 worker threads + 3 concurrent SSL clients (closest to production scenario from Fix callback not being called after activesock shutdown (#4864) #4878)CI
ioq-no-fast-trackmatrix job (select+OpenSSL,epoll+GnuTLS) withPJ_IOQUEUE_FAST_TRACK=0+ ASan, running ioqueue, activesock, ssl_sock, and ssl_sock_stress tests.Test plan
make -j3— zero errors, zero warningsssl_sock_test— all tests pass (Linux/OpenSSL + ASan)ssl_sock_stress_test— all tests pass (Linux/OpenSSL + ASan)PJ_IOQUEUE_FAST_TRACK=0— async path exercised (pending=200, sent_cb=200)ioq-no-fast-track / select+osslioq-no-fast-track / epoll+gtlsCloses #4878 (SSL-layer issues). The ioqueue-layer fix was merged separately in af42c36.
Co-Authored-By: Claude Code
Known issue: FAST_TRACK + GnuTLS stress test
send_load_testwithSEND_LOAD_COUNT=200hangs under GnuTLS +PJ_IOQUEUE_FAST_TRACK=1(default). The server receives N-1 out of N packets — a TCP partial write fromsend()causes data loss. Thewhole_dataretry inpj_activesock_sendshould handle partial writes, but a timing-dependent interaction prevents it. Investigation findings:selectandepollioqueue — independent of ioqueue backendon_data_readcallback: failsTCP_NODELAYinpj_ssl_sock_param_default, but this has not been confirmedfprintfin the code path makes it disappearWorkaround:
SEND_LOAD_COUNTreduced to 100. The CIioq-no-fast-trackjob usesFAST_TRACK=0which avoids the issue entirely.Items for separate PRs (from deep review)
echo_testtimeout tolerance, echo op_key reuse