Skip to content

Fix callback not being called after activesock shutdown (#4864)#4878

Merged
nanangizz merged 3 commits intopjsip:masterfrom
azertyfun:activesock-txdata-leak
Apr 8, 2026
Merged

Fix callback not being called after activesock shutdown (#4864)#4878
nanangizz merged 3 commits intopjsip:masterfrom
azertyfun:activesock-txdata-leak

Conversation

@azertyfun
Copy link
Copy Markdown
Contributor

@azertyfun azertyfun commented Mar 19, 2026

This is critical to avoid a reference leak. The callback is responsible for calling pjsip_tx_data_dec_ref.


Background: When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks are silently skipped. This causes tdata reference leaks since upper layers rely on these callbacks to call pjsip_tx_data_dec_ref(). The bug only manifests when the ioqueue fast-track fails (high load, full send buffer), making it nearly invisible in normal testing. Exists with or without PJ_IOQUEUE_CALLBACK_NO_LOCK.

Approach: Drain pending write callbacks during key unregistration, handling two lists with different semantics: write_cb_list (completed ops, deferred callback) invoked with success status respecting write_callback_thread serialization, then write_list (pending unsent ops) invoked with -PJ_ECANCELLED.

Maintainer rework addressing review feedback:

  • No deadlock: callbacks invoked without holding key/ioqueue locks. Key lifetime protected with grp_lock ref counting.
  • All maintained backends: epoll, kqueue, select, IOCP. Skipped deprecated symbian/uwp.
  • IOCP: cancelled overlapped ops invoke write callback with -PJ_ECANCELLED; write_callback_thread properly cleared in closing path.
  • activesock: forward on_data_sent when SHUT_TX set (not on_data_read).
  • PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing async path.
  • Regression test: concurrent sends + close before poll, with SO_SNDBUF shrink to force async path.

Note: SSL socket send_buf_pending fixes (flush_delayed_send, ssock_on_data_sent error path) have been deferred to a separate PR. The SSL send path has pre-existing issues that require a more thorough refactor — see discussion in review comments. The ioqueue-level fix in this PR is sufficient to address the original #4864 tdata leak.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 19, 2026

CLA assistant check
All committers have signed the CLA.

@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from c60348f to e608796 Compare March 19, 2026 17:27
@azertyfun
Copy link
Copy Markdown
Contributor Author

This fixes the issue as described & reproduced in #4864.

I am moderately confident that this is a necessary and sufficient fix, however the whole architecture around TLS transport is quite difficult to get my head around. There are many points where a tdata can be delayed or cancelled. Hopefully someone a lot more familiar than me with this codebase can weigh in on this.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a shutdown edge-case in PJLIB’s active socket write-completion path so higher layers still receive the send-completion callback, preventing reference leaks (e.g., callers that decrement pjsip_tx_data refs in on_data_sent).

Changes:

  • Invoke on_data_sent when SHUT_TX is set in ioqueue_on_write_complete() (instead of returning silently).

azertyfun pushed a commit to azertyfun/pjproject that referenced this pull request Mar 20, 2026
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 20, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 226b170 to af16a17 Compare March 20, 2026 12:14
@azertyfun
Copy link
Copy Markdown
Contributor Author

Actually with further testing I encountered another case where the callback still isn't called. This requires more thorough troubleshooting.

One major pain point is that the issue is only reproducible in high-load environments or with the ioqueue "fast track" commented out; else pj_ioqueue_sendto always returns PJ_SUCCESS in which case the caller is responsible for invoking the callback. When the fast-track fails, pj_ioqueue_sendto always returns PJ_EPENDING in which case the callee is responsible for invoking the callback. However, since this case is not normally covered by the test suite, it is evidently not guaranteed to be correct.

Marking this as draft until I figure out a more thorough solution.

@azertyfun azertyfun marked this pull request as draft March 20, 2026 13:28
azertyfun pushed a commit to azertyfun/pjproject that referenced this pull request Mar 20, 2026
The leak happens when

1. send_buf_pending is already in use
2. ssock_on_data_sent calls flush_circ_buf_output
3. send_buf_pending gets overwritten -> previous data in
   send_buf_pending is lost forever, its callback is never called, and
   therefore the reference to the tdata it contains is never
   decremented, preventing transport destruction
@azertyfun
Copy link
Copy Markdown
Contributor Author

azertyfun commented Mar 20, 2026

Fixed a second leak. At this point I've lost all confidence that the original implementation has any understanding of the importance of not losing track of ioqueue keys. I'll make a thorough review next week.

azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 24, 2026
The fast-track remains enabled by default, but this is useful for
troubleshooting.

The ioqueue has two widly different behaviors:

1. In the fast-track (almost always used in low-load scenarios), sending
   a write_op always immediately returns with PJ_SUCCESS. The semantics
   in this case are that the _caller_ is responsible for calling any
   relevant callbacks.
2. Outside the fast track (relevant in high-load scenarios, which
   unfortunately do not seem to be extensively covered by regression
   tests), sending a write_op returns PJ_EPENDING in the happy path
   which changes the semantics; now the _ioqueue_ becomes responsible
   for the data owned by write_op and for calling the relevant
   callbacks.
   See pjsip#4864, pjsip#4878 for why this is a tricky behavior that can easily be
   missed and cause bugs.

Disabling the fast track allows for more easily testing the second case.
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 24, 2026
The leak happens when
1. send_buf_pending is already in use
2. ssock_on_data_sent calls flush_circ_buf_output
3. send_buf_pending gets overwritten -> previous data in
   send_buf_pending is lost forever, its callback is never called, and
   therefore the reference to the tdata it contains is never
   decremented, preventing transport destruction
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 4a54cd5 to b3440a7 Compare March 24, 2026 16:09
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 24, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from b3440a7 to 47cead4 Compare March 25, 2026 09:46
@azertyfun
Copy link
Copy Markdown
Contributor Author

Alright, I think I have reviewed all the relevant code paths I could think of for this issue. This led me to add a couple additional fixes compared to before (same logic of always invoking the callback in on_data_read + fix for pj_activesock_close()`. I also took the opportunity to add a regression test.

This is the mind map I used to review the possible journeys taken by a tdata object sent from pjsip_transport_send().

tdata journey

@azertyfun azertyfun marked this pull request as ready for review March 25, 2026 09:55
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
The fast-track remains enabled by default, but this is useful for
troubleshooting.

The ioqueue has two widly different behaviors:

1. In the fast-track (almost always used in low-load scenarios), sending
   a write_op always immediately returns with PJ_SUCCESS. The semantics
   in this case are that the _caller_ is responsible for calling any
   relevant callbacks.
2. Outside the fast track (relevant in high-load scenarios, which
   unfortunately do not seem to be extensively covered by regression
   tests), sending a write_op returns PJ_EPENDING in the happy path
   which changes the semantics; now the _ioqueue_ becomes responsible
   for the data owned by write_op and for calling the relevant
   callbacks.
   See pjsip#4864, pjsip#4878 for why this is a tricky behavior that can easily be
   missed and cause bugs.

Disabling the fast track allows for more easily testing the second case.
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
The leak happens when
1. send_buf_pending is already in use
2. ssock_on_data_sent calls flush_circ_buf_output
3. send_buf_pending gets overwritten -> previous data in
   send_buf_pending is lost forever, its callback is never called, and
   therefore the reference to the tdata it contains is never
   decremented, preventing transport destruction
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
azertyfun pushed a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 47cead4 to 31f22f4 Compare March 25, 2026 14:05
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
@azertyfun azertyfun force-pushed the activesock-txdata-leak branch from 31f22f4 to bc22897 Compare March 25, 2026 14:06
azertyfun added a commit to azertyfun/pjproject that referenced this pull request Mar 25, 2026
Before bugfixes in pjsip#4878, the test fails:

16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

pjlib/src/pj/ssl_sock_imp_common.c:604

  • This new branch returns PJ_ENOMEM when alloc_send_data() fails while send_buf_pending already contains data. That status can now propagate to ssl_send()/flush_delayed_send() and be treated as a hard error (e.g., renegotiation path treats anything other than SUCCESS/EPENDING as failure). Consider returning a backpressure status that callers already treat as non-fatal (e.g., PJ_EBUSY/PJ_EPENDING), or updating the relevant callers to handle this specific condition without tearing down the connection.
    wdata = alloc_send_data(ssock, needed_len);
    if (wdata == NULL && ssock->send_buf_pending.data_len) {
        /* Buffer full and there's already pending data. Cannot overwrite
         * send_buf_pending as that would lose the previous data and its
         * callback reference (causing a tdata leak). See #4878.
         */
        pj_lock_release(ssock->write_mutex);
        return PJ_ENOMEM;
    } else if (wdata == NULL) {
        /* Oops, the send buffer is full, let's just
         * queue it for sending and return PJ_EPENDING.
         */
        ssock->send_buf_pending.data_len = needed_len;
        ssock->send_buf_pending.app_key = send_key;
        ssock->send_buf_pending.flags = flags;
        ssock->send_buf_pending.plain_data_len = orig_len;
        pj_lock_release(ssock->write_mutex);
        return PJ_EPENDING;
    }

nanangizz added a commit to azertyfun/pjproject that referenced this pull request Apr 1, 2026
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes,
on_write_complete callbacks were silently skipped. Upper layers (e.g.,
SIP transport) rely on these callbacks to release resources such as
tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with
  op->written (success status), respecting write_callback_thread
  serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select,
IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- ssl_sock: prevent send_buf_pending overwrite (tdata leak)
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the activesock-txdata-leak branch from 60e1f69 to 45d60d4 Compare April 1, 2026 04:40
nanangizz added a commit to azertyfun/pjproject that referenced this pull request Apr 1, 2026
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes,
on_write_complete callbacks were silently skipped. Upper layers (e.g.,
SIP transport) rely on these callbacks to release resources such as
tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with
  op->written (success status), respecting write_callback_thread
  serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select,
IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- ssl_sock: prevent send_buf_pending overwrite (tdata leak)
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the activesock-txdata-leak branch from 45d60d4 to d395b2b Compare April 1, 2026 05:03
nanangizz added a commit to azertyfun/pjproject that referenced this pull request Apr 1, 2026
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes,
on_write_complete callbacks were silently skipped. Upper layers (e.g.,
SIP transport) rely on these callbacks to release resources such as
tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with
  op->written (success status), respecting write_callback_thread
  serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select,
IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- ssl_sock: prevent send_buf_pending overwrite (tdata leak)
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the activesock-txdata-leak branch from d395b2b to bc60815 Compare April 1, 2026 05:23
@nanangizz nanangizz requested a review from sauwming April 1, 2026 06:04
@sauwming sauwming linked an issue Apr 2, 2026 that may be closed by this pull request
@nanangizz
Copy link
Copy Markdown
Member

@azertyfun We've tried to address the issues mentioned earlier. Please let us know if you notice any further issues or have additional comments.

@azertyfun
Copy link
Copy Markdown
Contributor Author

We deployed the earlier fix (which sends garbage over the wire – but that doesn't really matter in our use-case). However we had to roll it back because we unexpectedly encountered crashes.

I am currently investigating the cause. I still think the fix is sound in principle, but it may be triggering other existing race conditions. In particular this bit seems faulty: if the ioqueue is busy, ssl_send returns PJ_EPENDING, and wp is not removed from ssock->write_pending. However it has already been added to write_list, so it will be duplicated. I still need to look further to understand why this is only happening now.

@nanangizz
Copy link
Copy Markdown
Member

You're right. Asked the AI to check that area, and it found several issues!!!

1. Wrong app_key passed to ssl_send
flush_delayed_send passes &wp->key (the internal SSL write_data_t key) instead of wp->app_key (the original caller's op_key). When the async callback fires, the TLS transport casts op_key to pjsip_tx_data_op_key* — but &wp->key is NOT a valid pjsip_tx_data_op_key, so accessing tdata_op_key->tdata and tdata_op_key->callback accesses invalid memory. Potential crash/UB.

2. PJ_EPENDING not handled — double-send
When ssl_send returns PJ_EPENDING (data encrypted + queued), wp stays in write_pending. Next flush_delayed_send call re-processes wp: ssl_write encrypts the same plaintext again → duplicate data on the wire, potential TLS record corruption.

3. Missing callback on PJ_SUCCESS — tdata leak
When delayed data is sent synchronously (PJ_SUCCESS), no on_data_sent callback fires. The app (which received PJ_EPENDING from the original pj_ssl_sock_send) expects a callback → tdata ref never decremented → leak.

4. send_buf_pending not cancelled on error in ssock_on_data_sent
When sent < 0 (error/cancel) and the app callback returns PJ_TRUE, ssock_on_data_sent tries to flush_circ_buf_output for send_buf_pending on a dead socket. This fails, and the send_buf_pending callback is never invoked → tdata leak.

Here is the draft patch (compile tested only, will test further next week).

@azertyfun
Copy link
Copy Markdown
Contributor Author

After further investigation, the crash I described above was one of the outcomes, but the more common one was that we saw ssl0x7fa1caca6a00 Renegotiation failed: Not enough memory (PJ_ENOMEM) sometimes followed by a deadlock. In every case where there was a deadlock, there was this PJ_ENOMEM error. Unfortunately I do not have a workable core dump of those occurrences.

Now it is clearly related to the return PJ_ENOMEM this PR adds. What's unclear for now is how.

Thanks to the error message we know t hat ssl_do_handshake failed with PJ_ENOMEM due to the change I made. Before in the same situation it was silently failing with PJ_EPENDING instead, which essentially a no-op.

From there I can only speculate why there was a deadlock, but on_error calls on_handshake_complete which calls on_connect_complete which triggers a bunch of things in asterisk... so who knows what may have happened from there.

The whole logic for SSL handshaking is spread out, has multiple redundancies, and probably just as buggy as the rest of the code touched by this PR. Another thorough review is needed, but that's going to be quite a few additional man-hours which I'm starting to run short on as other priorities are catching up after several weeks working on this already.

And anyway, how can we know when we've squashed the last critical bug? How can we verify our fix when the interfaces are so complex and behave in radically different ways under load vs not, including in edge-cases like SSL renegotiation under load ? Unfortunately given that I caused a major incident by pushing a partial fix, we won't get approval for another attempt unless we have a complete explanation for all issues encountered and a way to reliably reproduce our race conditions and verify the correctness of any fixes.

Perhaps the solution for us will be to wait until PJ_IOQUEUE_CALLBACK_NO_LOCK is stabilized&released, and just rely on the TLS connection never breaking so we never have to encounter these buggy and untested edge-cases. Or perhaps switch to UDP/TCP through a VPN. Which I realize aren't real solutions and they do not help you. Given the scope of the architural and implementation issues in the SSL subsystem, short of a complete rewrite I honestly lack confidence that pjproject can ensure correctness.

…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes,
on_write_complete callbacks were silently skipped. Upper layers (e.g.,
SIP transport) rely on these callbacks to release resources such as
tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with
  op->written (success status), respecting write_callback_thread
  serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select,
IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
@nanangizz nanangizz force-pushed the activesock-txdata-leak branch from bc60815 to aacb3bc Compare April 8, 2026 06:26
@nanangizz
Copy link
Copy Markdown
Member

@azertyfun Thanks for the detailed crash report and analysis. You're right — the PJ_ENOMEM guard in flush_circ_buf_output is problematic. It propagates through ssl_do_handshake() (all 5 SSL backends call flush_circ_buf_output directly with handshake_op_key), and the renegotiation handler in ssock_on_data_read only accepts PJ_SUCCESS/PJ_EPENDING — anything else triggers on_erroron_handshake_complete, which explains both the error message and the downstream deadlock in your application.

We've removed the SSL change from this PR. The ioqueue-level fix (draining write_list/write_cb_list on pj_ioqueue_unregister) is sufficient to address the original #4864 tdata leak — it catches all cases where the socket is closed with pending writes, regardless of whether SSL is involved.

The SSL send_buf_pending issues you identified (overwrite, flush_delayed_send wrong app_key, missing callbacks) are real but pre-existing. They deserve a proper fix, but the send path has deeper architectural limitations (single send_buf_pending slot, mixed handshake/app data in circ_buf_output, inconsistent error handling across backends) that make targeted patches risky — as your deployment experience showed. We plan to address these in a separate PR with a more thorough refactor of the SSL send path.

Could you try deploying this updated version (without the SSL change)? It should fix the tdata leak without triggering the renegotiation crash.

@azertyfun
Copy link
Copy Markdown
Contributor Author

azertyfun commented Apr 8, 2026

We already had a major customer-impacting incident which unfortunately means at this point I won't be able to get approval to push further fixes in production without very strong guarantees. We are now mitigating the TLS issues by staying on version 2.15.1 but reducing the number of registrations per TLS connection.

We will need proper assurances (refactor to enable thorough unit test coverage of the SSL subsystem, and probably additional integration testing with asterisk) before I will be allowed to push a fix through.

(To be clear: for the purposes of this PR I have no problem if it gets merged as-is, I just won't be able to verify that it fixes our specific production issues)

- Add PJ_UNUSED_ARG for status/sent variables in pj_ioqueue_send(),
  pj_ioqueue_sendto(), and pj_ioqueue_accept() when fast-track is
  disabled, fixing MSVC C4101 warnings.
- Fix activesock UDP echo test: accept PJ_EPENDING as non-error from
  sendto() since async path always returns PJ_EPENDING when fast-track
  is disabled. Also fix error message to print actual send status
  instead of the recvfrom callback status.

Co-Authored-By: Claude Code
@nanangizz nanangizz merged commit af42c36 into pjsip:master Apr 8, 2026
5 checks passed
nanangizz added a commit that referenced this pull request Apr 8, 2026
Replace the ring-buffer-based send mechanism (send_buf + single
send_buf_pending slot) with per-op pool-allocated ssl_send_op_t
using embedded encrypted data buffer, free list with configurable
cap, and true memory release on discard. This eliminates the
PJ_ENOMEM crash during renegotiation reported in #4878.

Key changes:
- New ssl_send_op_t with per-op pool for true memory release
- ssl_do_handshake_and_flush() wrapper unifying error handling
  across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel,
  Darwin)
- Fix flush_delayed_send: pass correct app_key, handle PJ_EPENDING
  to prevent double-send, invoke callback on synchronous success
- Fix ssock_on_data_sent: check app_key (not send_key) for
  handshake/shutdown detection (was dead code before)
- Fix ssock_on_connect_complete: return via on_handshake_complete
  instead of bare PJ_ENOMEM from pj_bool_t function
- Rename circ_buf_output/input to ssl_write_buf/ssl_read_buf for
  clarity (not always circular — OpenSSL uses memory BIO)
- Add "Caller must hold write_mutex" documentation to ssl_write()
  in all 6 backends

Test improvements:
- send_load_test: 200 rapid sends with async callback verification
- large_msg_test: 64KB message (multi-TLS-record) echo verification
- close_pending_test: close socket with pending sends (no crash)
- bidir_test: bidirectional simultaneous send load
- mt_send_load_test: 3 worker threads + 3 concurrent clients

CI: add ioq-no-fast-track job (PJ_IOQUEUE_FAST_TRACK=0 + ASan)
to force async send path in activesock and ssl_sock tests.

Co-Authored-By: Claude Code
nanangizz added a commit that referenced this pull request Apr 8, 2026
Replace the ring-buffer-based send mechanism (send_buf + single
send_buf_pending slot) with per-op pool-allocated ssl_send_op_t
using embedded encrypted data buffer, free list with configurable
cap, and true memory release on discard. This eliminates the
PJ_ENOMEM crash during renegotiation reported in #4878.

Key changes:
- New ssl_send_op_t with per-op pool for true memory release
- ssl_do_handshake_and_flush() wrapper unifying error handling
  across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel,
  Darwin)
- Fix flush_delayed_send: pass correct app_key, handle PJ_EPENDING
  to prevent double-send, invoke callback on synchronous success
- Fix ssock_on_data_sent: check app_key (not send_key) for
  handshake/shutdown detection (was dead code before)
- Fix ssock_on_connect_complete: return via on_handshake_complete
  instead of bare PJ_ENOMEM from pj_bool_t function
- Rename circ_buf_output/input to ssl_write_buf/ssl_read_buf for
  clarity (not always circular — OpenSSL uses memory BIO)
- Add "Caller must hold write_mutex" documentation to ssl_write()
  in all 6 backends

Test improvements:
- send_load_test: 200 rapid sends with async callback verification
- large_msg_test: 64KB message (multi-TLS-record) echo verification
- close_pending_test: close socket with pending sends (no crash)
- bidir_test: bidirectional simultaneous send load
- mt_send_load_test: 3 worker threads + 3 concurrent clients

CI: add ioq-no-fast-track job (PJ_IOQUEUE_FAST_TRACK=0 + ASan)
to force async send path in activesock and ssl_sock tests.

Co-Authored-By: Claude Code
nanangizz added a commit that referenced this pull request Apr 8, 2026
Replace the ring-buffer-based send mechanism (send_buf + single
send_buf_pending slot) with per-op pool-allocated ssl_send_op_t
using embedded encrypted data buffer, free list with configurable
cap, and true memory release on discard. This eliminates the
PJ_ENOMEM crash during renegotiation reported in #4878.

Key changes:
- New ssl_send_op_t with per-op pool for true memory release
- ssl_do_handshake_and_flush() wrapper unifying error handling
  across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel,
  Darwin)
- Fix flush_delayed_send: pass correct app_key, handle PJ_EPENDING
  to prevent double-send, invoke callback on synchronous success
- Fix ssock_on_data_sent: check app_key (not send_key) for
  handshake/shutdown detection (was dead code before)
- Fix ssock_on_connect_complete: return via on_handshake_complete
  instead of bare PJ_ENOMEM from pj_bool_t function
- Rename circ_buf_output/input to ssl_write_buf/ssl_read_buf for
  clarity (not always circular — OpenSSL uses memory BIO)
- Add "Caller must hold write_mutex" documentation to ssl_write()
  in all 6 backends

Test improvements:
- send_load_test: 200 rapid sends with async callback verification
- large_msg_test: 64KB message (multi-TLS-record) echo verification
- close_pending_test: close socket with pending sends (no crash)
- bidir_test: bidirectional simultaneous send load
- mt_send_load_test: 3 worker threads + 3 concurrent clients

CI: add ioq-no-fast-track job (PJ_IOQUEUE_FAST_TRACK=0 + ASan)
to force async send path in activesock and ssl_sock tests.

Co-Authored-By: Claude Code
@nanangizz
Copy link
Copy Markdown
Member

@azertyfun Thanks again for the detailed report and analysis on this issue. Sorry for the problems it caused in your production environment.

Building on the ioqueue fix merged here, we've done a broader SSL socket send path refactor in #4909 — replacing the ring-buffer mechanism with per-op pool allocation, fixing several related bugs found during the process, and adding stress tests that cover the high-load/concurrent scenarios you described. Unfortunately we don't currently have an Asterisk test setup, so we haven't been able to validate against that environment. Would appreciate any feedback or suggestions, especially if there are specific patterns from your environment we should add to the test coverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tdata refcounting issue prevents SIPTLS transport shutdown

6 participants