
cgrp: Fix NULL dereference in LeaveGroup when coordinator unavailable #5327

Open
MAlostaz wants to merge 1 commit into confluentinc:master from MAlostaz:fix-leavegroup-null-broker



MAlostaz commented Feb 24, 2026

Summary

Issue: #5347

Fix SIGSEGV crash in rd_kafka_cgrp_handle_LeaveGroup() when the coordinator broker becomes unavailable during consumer close.

Problem

When a consumer is destroyed (rd_kafka_destroy()) and the group coordinator is unavailable, rd_kafka_cgrp_leave() calls rd_kafka_cgrp_handle_LeaveGroup() with a NULL broker pointer (rkb = rkcg->rkcg_coord). The error path then dereferences this NULL pointer:

`rd_kafka_dbg(rkb->rkb_rk, CGRP, "LEAVEGROUP", ...);  // CRASH: rkb is NULL`

This crash requires the coordinator to become unavailable at the exact moment the consumer is shutting down. Typical triggers:

  • Rolling broker upgrades where the coordinator broker restarts
  • Coordinator failover during consumer shutdown
  • Network partition isolating the coordinator

Solution

Replace rkb->rkb_rk with rk in the rd_kafka_dbg() calls. The rk parameter is always valid (passed directly to the function), and is semantically equivalent to rkb->rkb_rk when rkb is non-NULL.
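The pattern can be illustrated with a small stand-alone model. This is only a sketch: `mock_rk_t`, `mock_broker_t`, and `mock_handle_leave_group()` are hypothetical stand-ins invented for illustration, not librdkafka types or functions, and the log text is made up. The point it shows is that the error path reads only the `rk` argument, so a NULL `rkb` is harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the client handle and broker handle. */
typedef struct mock_rk {
        const char *name;
} mock_rk_t;

typedef struct mock_broker {
        mock_rk_t *rkb_rk; /* back-pointer to the owning client handle */
} mock_broker_t;

/*
 * Model of the fixed error path. The message is formatted from `rk`,
 * which the caller always passes, instead of `rkb->rkb_rk`, which would
 * dereference NULL when no coordinator is known.
 * Returns the snprintf() result (length of the formatted message).
 */
static int mock_handle_leave_group(mock_rk_t *rk, mock_broker_t *rkb,
                                   int err, char *buf, size_t size) {
        assert(rk != NULL); /* rk is the one argument assumed always valid */

        /* rkb is only tested against NULL here, never dereferenced. */
        return snprintf(buf, size,
                        "[%s] LEAVEGROUP: failed (err=%d, coordinator %s)",
                        rk->name, err, rkb ? "known" : "unavailable");
}
```

Calling `mock_handle_leave_group(&rk, NULL, err, buf, sizeof(buf))` in this model formats the message normally, whereas a version that read `rkb->rkb_rk->name` would fault on the same arguments.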

Without the fix (crashes):

  • Consumer process dies with SIGSEGV
  • Local resources may not be cleaned up (memory, file handles, etc.)
  • Broker doesn't get LeaveGroup
  • Broker waits for session timeout → rebalance

With the fix (graceful exit):

  • Consumer process exits cleanly
  • Local resources are properly cleaned up
  • Broker doesn't get LeaveGroup
  • Broker waits for session timeout → rebalance

The broker-side outcome is identical: in both cases the broker never receives LeaveGroup and must wait for the session timeout before rebalancing. That is unavoidable while the coordinator is unavailable. The fix only ensures that the consumer exits without a crash or core dump.

Backtrace

    #0  rd_kafka_cgrp_handle_LeaveGroup (rk=0x..., rkb=0x0, err=RD_KAFKA_RESP_ERR__WAIT_COORD, ...)
        at rdkafka_cgrp.c:984
    #1  rd_kafka_cgrp_leave (rkcg=0x...) at rdkafka_cgrp.c:1158
    #2  rd_kafka_cgrp_terminate (rkcg=0x...) at rdkafka_cgrp.c:...
    #3  rd_kafka_destroy_internal (rk=0x...) at rdkafka.c:...

  • We observed intermittent SIGSEGV crashes in production during consumer shutdown
  • We captured the core dump and analyzed it with gdb
  • The above backtrace shows rkb=0x0 (NULL) in rd_kafka_cgrp_handle_LeaveGroup()
  • We traced the call site in rd_kafka_cgrp_leave() (line 1158):

        } else
                rd_kafka_cgrp_handle_LeaveGroup(rkcg->rkcg_rk, rkcg->rkcg_coord,  // <-- rkcg_coord is NULL here
                                                RD_KAFKA_RESP_ERR__WAIT_COORD,
                                                NULL, NULL, rkcg);

  • This else branch is taken when no coordinator is available (rkcg->rkcg_coord == NULL)
  • The function then attempts to log using rkb->rkb_rk at line 984, causing the NULL dereference
  • We confirmed that the rk parameter is always valid and is equivalent to rkb->rkb_rk when rkb is non-NULL
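The call-site behaviour traced above can be modelled in a few lines. This is a simplified sketch under stated assumptions: every `sketch_*` name is hypothetical and exists only for illustration, and the branch with a coordinator is collapsed (the real code sends a request and handles the response later). It shows why the client handle taken from the group is usable on both branches, while the broker pointer is not.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not librdkafka types. */
typedef struct sketch_rk {
        int id;
} sketch_rk_t;

typedef struct sketch_broker {
        sketch_rk_t *rkb_rk; /* back-pointer to the client handle */
} sketch_broker_t;

typedef struct sketch_cgrp {
        sketch_rk_t *rkcg_rk;        /* client handle, set for the group's lifetime */
        sketch_broker_t *rkcg_coord; /* NULL while no coordinator is known */
} sketch_cgrp_t;

enum sketch_leave_path { SKETCH_LEAVE_SENT, SKETCH_LEAVE_LOCAL_ERROR };

/* Handle the error handler would log against: always rk, never rkb->rkb_rk. */
static sketch_rk_t *sketch_log_handle(sketch_rk_t *rk, sketch_broker_t *rkb) {
        (void)rkb; /* may be NULL, so it is deliberately not dereferenced */
        return rk;
}

/* Model of the leave step: reports which branch ran and which handle was used. */
static enum sketch_leave_path sketch_leave(sketch_cgrp_t *rkcg,
                                           sketch_rk_t **logged_on) {
        if (rkcg->rkcg_coord != NULL) {
                /* Coordinator known: a request would be sent to it here. */
                *logged_on = sketch_log_handle(rkcg->rkcg_rk, rkcg->rkcg_coord);
                return SKETCH_LEAVE_SENT;
        }

        /* No coordinator: the handler runs locally with a NULL broker. */
        *logged_on = sketch_log_handle(rkcg->rkcg_rk, NULL);
        return SKETCH_LEAVE_LOCAL_ERROR;
}
```

In this model the handle returned on the coordinator branch is the same pointer as the broker's `rkb_rk`, which mirrors the equivalence claimed in the last bullet.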


confluent-cla-assistant bot commented Feb 24, 2026

🎉 All Contributor License Agreements have been signed. Ready to merge.
✅ MAlostaz
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

rd_kafka_cgrp_handle_LeaveGroup() would crash with SIGSEGV when logging
errors because it dereferenced rkb->rkb_rk when rkb was NULL. This can
occur when the coordinator becomes unavailable during consumer shutdown.

Use the always-valid `rk` parameter instead of `rkb->rkb_rk` in the
rd_kafka_dbg() calls in the error path.