Skip to content

Check if len in headers to avoid KeyError in supervisor-proc-exit-listener#23419

Merged
qiluo-msft merged 3 commits intosonic-net:masterfrom
ZhaohuiS:ZhaohuiS/handle_exception_supervisor
Jul 25, 2025
Merged

Check if len in headers to avoid KeyError in supervisor-proc-exit-listener#23419
qiluo-msft merged 3 commits intosonic-net:masterfrom
ZhaohuiS:ZhaohuiS/handle_exception_supervisor

Conversation

@ZhaohuiS
Copy link
Contributor

@ZhaohuiS ZhaohuiS commented Jul 22, 2025

Why I did it

It found the following KeyError in syslog, not only for lldp, but also for snmp and bgp.

2025 Jul 19 18:13:00.240397 vlab-01 ERR lldp#supervisor-proc-exit-listener: Exception: 'len', trace: Traceback (most recent call last):
  File "/usr/bin/supervisor-proc-exit-listener", line 249, in <module>
    main(sys.argv[1:])
  File "/usr/bin/supervisor-proc-exit-listener", line 182, in main
    payload = sys.stdin.read(int(headers['len']))
KeyError: 'len'

The context syslog is:

2025 Jul 19 18:12:59.505711 vlab-01 INFO lldp#supervisord 2025-07-19 18:12:59,504 INFO waiting for supervisor-proc-exit-listener, rsyslogd, lldpd, lldp-syncd, lldpmgrd to die
2025 Jul 19 18:12:59.761223 vlab-01 INFO containerd[684]: time="2025-07-19T18:12:59.759992163Z" level=info msg="shim disconnected" id=cd6e41a2cc82aae25d2d65801984943311b3f025c98ca865ea79be95194abc95
2025 Jul 19 18:12:59.762463 vlab-01 INFO containerd[684]: time="2025-07-19T18:12:59.760103279Z" level=warning msg="cleaning up after shim disconnected" id=cd6e41a2cc82aae25d2d65801984943311b3f025c98ca865ea79be95194abc95 namespace=moby
2025 Jul 19 18:12:59.765745 vlab-01 INFO containerd[684]: time="2025-07-19T18:12:59.760116062Z" level=info msg="cleaning up dead shim"
2025 Jul 19 18:12:59.767134 vlab-01 INFO dockerd[752]: time="2025-07-19T18:12:59.760554606Z" level=info msg="ignoring event" container=cd6e41a2cc82aae25d2d65801984943311b3f025c98ca865ea79be95194abc95 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
2025 Jul 19 18:12:59.784436 vlab-01 INFO containerd[684]: time="2025-07-19T18:12:59.783563921Z" level=warning msg="cleanup warnings time=\"2025-07-19T18:12:59Z\" level=info msg=\"starting signal loop\" namespace=moby pid=42053 runtime=io.containerd.runc.v2\n"
2025 Jul 19 18:12:59.826676 vlab-01 INFO systemd[1]: var-lib-docker-overlay2-472b96da162023c3bc1e0d4132486ad7c122b23acf07f93d0e5b0a9538d7cebe-merged.mount: Deactivated successfully.
2025 Jul 19 18:12:59.840815 vlab-01 INFO container: docker cmd: wait for teamd
2025 Jul 19 18:12:59.843934 vlab-01 INFO container: docker cmd: stop for teamd
2025 Jul 19 18:12:59.861044 vlab-01 DEBUG container: container_stop: END
2025 Jul 19 18:12:59.906677 vlab-01 NOTICE admin: Stopped teamd service...
2025 Jul 19 18:12:59.938168 vlab-01 INFO systemd[1]: teamd.service: Deactivated successfully.
2025 Jul 19 18:12:59.938548 vlab-01 INFO systemd[1]: Stopped teamd.service - TEAMD container.
2025 Jul 19 18:12:59.939901 vlab-01 NOTICE rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-host:event-stopped-ctr":{"ctr_name":"TEAMD","timestamp":"2025-07-19T18:12:59.939561Z"}}
2025 Jul 19 18:13:00.196745 vlab-01 INFO dockerd[752]: time="2025-07-19T18:13:00.196382391Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=2188a8952aa8d602224b78e325295831aceb8f14d1b6ec8869cc153b7eafef6a
2025 Jul 19 18:13:00.240397 vlab-01 ERR lldp#supervisor-proc-exit-listener: Exception: 'len', trace: Traceback (most recent call last):#012  File "/usr/bin/supervisor-proc-exit-listener", line 249, in <module>#012    main(sys.argv[1:])#012  File "/usr/bin/supervisor-proc-exit-listener", line 182, in main#012    payload = sys.stdin.read(int(headers['len']))#012                                 ~~~~~~~^^^^^^^#012KeyError: 'len'

During shutdown, supervisor is sending termination events to the event listener, but the shutdown process is interrupting the event stream.
The container is being forcibly killed (Container failed to exit within 10s of signal 15 - using the force), which can interrupt the supervisor event protocol mid-stream.

Supervisor starts sending an event header
Before it can complete sending the full header (including len: field), the process gets interrupted
The listener receives a partial/malformed header without the len field

Work item tracking
  • Microsoft ADO 33409727:

How I did it

Check if 'len' exists before using it, if there is no len, it can't process the further steps.

How to verify it

Which release branch to backport (provide reason below if selected)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Tested branch (Please provide the tested image version)

Master image

  • SONiC.master.903923-5879963b4
  • SONiC.master.903791-260dff5a1

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@ZhaohuiS ZhaohuiS requested a review from lguohan as a code owner July 22, 2025 09:35
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ZhaohuiS ZhaohuiS changed the title Add exception handling while getting header length in supervisor-proc-exit-listener Check if len in headers to avoid KeyError in supervisor-proc-exit-listener Jul 22, 2025
@ZhaohuiS ZhaohuiS requested a review from liuh-80 July 22, 2025 09:38
@ZhaohuiS
Copy link
Contributor Author

@qiluo-msft could you please help review?

headers = childutils.get_headers(line)
# Check if 'len' exists before using it
if 'len' not in headers:
syslog.syslog(syslog.LOG_ERR, "WARNING: Missing 'len' header in: {}".format(headers))
Copy link
Collaborator

@qiluo-msft qiluo-msft Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG_ERR

do you want to use LOG_WARNING? then it will not impact LogAnalyzer result. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, update code, please review it again, thanks.

headers = childutils.get_headers(line)
# Check if 'len' exists before using it
if 'len' not in headers:
syslog.syslog(syslog.LOG_ERR, "WARNING: Missing 'len' header in: {}".format(headers))
Copy link
Collaborator

@qiluo-msft qiluo-msft Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

headers

Could you also print line in this message for future troubleshooting purpose? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, update code, please review it again, thanks.

Signed-off-by: Zhaohui Sun <[email protected]>
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft qiluo-msft merged commit dc53d2d into sonic-net:master Jul 25, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants