
[T2][202405] Zebra process consuming a large amount of memory resulting in OOM kernel panics #20337

@arista-nwolfe

Description


On full T2 devices running 202405, Arista is seeing the zebra process in FRR consume roughly 10x the memory it uses on 202205.

202405:

root@cmp206-4:~# docker exec -it bgp0 bash
root@cmp206-4:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.2  38116 32024 pts/0    Ss+  21:23   0:01 /usr/bin/python3 /usr/local/bin/supervisord
root          44  0.1  0.2 131684 31888 pts/0    Sl   21:23   0:06 python3 /usr/bin/supervisor-proc-exit-listener --container-name bgp
root          47  0.0  0.0 230080  4164 pts/0    Sl   21:23   0:00 /usr/sbin/rsyslogd -n -iNONE
frr           51 27.5  8.1 2018736 1283692 pts/0 Sl   21:23  16:57 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp

202205:

root@cmp210-3:~# docker exec -it bgp bash
root@cmp210-3:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.1  30524 26232 pts/0    Ss+  21:59   0:00 /usr/bin/python3 /usr/local/bin/supervisord
root          26  0.0  0.1  30808 25712 pts/0    S    21:59   0:00 python3 /usr/bin/supervisor-proc-exit-listener --container-name bgp
root          27  0.0  0.0 220836  3764 pts/0    Sl   21:59   0:00 /usr/sbin/rsyslogd -n -iNONE
frr           31  9.7  0.7 730360 128852 pts/0   Sl   21:59   2:32 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M fpm -M snmp

This leaves the system with very little free memory:

> free -m
               total        used        free      shared  buff/cache   available
Mem:           15388       15304         158         284         481          83
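To track how zebra's resident memory grows over time, a small helper that reads VmRSS from /proc can be polled in a loop (a minimal sketch assuming a Linux /proc filesystem; the `pgrep` pattern below is illustrative, not a confirmed process name):

```shell
#!/bin/sh
# Print the resident set size (kB) of a process, read from /proc/<pid>/status.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Example polling loop (commented out; zebra PID lookup is an assumption):
# while sleep 60; do
#     pid=$(pgrep -o zebra) && echo "$(date -u +%FT%TZ) zebra rss_kb=$(rss_kb "$pid")"
# done
```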

Running a command that causes zebra to consume even more memory, such as `show ip route`, can then trigger a kernel panic due to OOM:

[74531.234009] Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled
[74531.260707] CPU: 1 PID: 735 Comm: auditd Kdump: loaded Tainted: G           OE      6.1.0-11-2-amd64 #1  Debian 6.1.38-4
[74531.313431] Call Trace:
[74531.365891]  <TASK>
[74531.418342]  dump_stack_lvl+0x44/0x5c
[74531.470844]  panic+0x118/0x2ed
[74531.523334]  out_of_memory.cold+0x67/0x7e

Looking at `show memory` in FRR, the maximum number of Nexthop allocations is significantly higher on 202405 than on 202205.
202405:

show memory
Memory statistics for zebra:
  Total heap allocated:  > 2GB
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Nexthop                       :     1669    160      280536  8113264 1363218720    # ASIC0
Nexthop                       :     1535    160      258120  2097270 352476288     # ASIC1

202205:

show memory
Memory statistics for zebra:
  Total heap allocated:  72 MiB
--- qmem libfrr ---
Type                          : Current#   Size       Total     Max#  MaxBytes
Nexthop                       :     1173    152      178312    36591   5563080

NOTES:
- both 202205 and 202405 have the same number of routes installed
- we have also seen an increase on t2-min topologies, but the absolute memory usage there is no more than half of what T2 is seeing, so we aren't hitting OOMs on t2-min
- the FRR version changed between 202205 (FRRouting 8.2.2) and 202405 (FRRouting 8.5.4)
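To quantify the Nexthop blow-up without eyeballing the tables, the Max# and MaxBytes columns of the `show memory` output can be summed with awk (a sketch; on multi-ASIC SONiC the per-ASIC zebra instances would be queried with something like `docker exec bgp0 vtysh -c "show memory"`, where the container naming is an assumption):

```shell
#!/bin/sh
# Sum the Max# (field 6) and MaxBytes (field 7) columns across all
# "Nexthop" rows of zebra's "show memory" output, read from stdin.
sum_nexthop_max() {
    awk '$1 == "Nexthop" {maxn += $6; maxb += $7}
         END {printf "max_nexthops=%d max_bytes=%d\n", maxn, maxb}'
}
```

Piping the two 202405 ASIC rows above through this yields a combined peak of over 10 million nexthop allocations and ~1.7 GB, versus ~5.5 MB on 202205.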

Metadata

Labels: Chassis (Modular chassis support), P0 (Priority of the issue), Triaged (this issue has been triaged)

Status: Done