[Mellanox] Fix lua/orchagent storm detection race by andriymoroz-mlnx · Pull Request #697 · sonic-net/sonic-swss

andriymoroz-mlnx · 2018-11-20T15:16:22Z

Signed-off-by: Andriy Moroz [email protected]

What I did
Fixed a race condition between the Lua detection script and orchagent that sometimes caused the same storm to be detected twice.

Why I did it
The script now handles the *_last counters it uses for storm detection more strictly.

How I verified it
PFC_WD test case for _last counters should pass

Details if related


lguohan · 2018-11-21T10:38:30Z

Does this race condition also happen in other platforms' Lua scripts?

lguohan · 2018-11-21T10:39:55Z

can you explain the nature of the race condition more clearly?

wendani · 2018-11-21T22:35:56Z

When an execution of the detection script detects a storm, the script publishes the storm event to pfcwdorch. The immediately following execution of the detection script is expected to see pfc_wd_status ~= 'operational', exit the detection logic, and return. This assumes that pfcwdorch finishes processing the storm event and takes all actions within one polling interval. With bad timing, however, pfcwdorch may not yet have processed the storm event and set the queue's PFC_WD_STATUS in COUNTERS_DB by the time the next execution of the detection script runs.

By resetting SAI_PORT_STAT_PFC_?_RX_PKTS_last and _RX_PAUSE_DURATION_last in the execution that detects the storm, the immediately following execution skips the main detection logic. This gives pfcwdorch two polling intervals, rather than one, to finish processing the storm event. With the wider time window, double detection should be largely reduced, though possibly not fully eliminated in cases where pfcwdorch is extremely unresponsive.

#697 (comment)

@andriymoroz-mlnx is this the correct understanding of the idea?
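The timing argument in the comment above can be illustrated with a small model. The following Python sketch is not the Lua detection script. It is a simplified simulation under stated assumptions: the storm condition holds on every poll, the detection timer is ignored, and pfcwdorch needs a fixed number of polling intervals (the hypothetical parameter `orch_delay_polls`) to update PFC_WD_STATUS. All function and parameter names are invented for illustration.

```python
def count_storm_events(orch_delay_polls, reset_last_on_detect, num_polls=6):
    """Count storm events published by a modeled detection script.

    Assumptions (not taken from the real script): the storm persists on
    every poll, and pfcwdorch updates the status exactly
    orch_delay_polls polls after the first unprocessed event.
    """
    status = "operational"   # stands in for PFC_WD_STATUS in COUNTERS_DB
    has_last = True          # whether the *_last fields currently exist
    orch_done_at = None      # poll index at which pfcwdorch has reacted
    events = 0

    for poll in range(num_polls):
        # pfcwdorch finishes processing the pending storm event, if any.
        if orch_done_at is not None and poll >= orch_done_at:
            status = "stormed"
            orch_done_at = None

        # The script exits early once the queue is no longer operational.
        if status != "operational":
            continue

        # Without *_last values there is nothing to compare against:
        # record fresh values and skip detection for this poll.
        if not has_last:
            has_last = True
            continue

        # Storm detected: publish an event to pfcwdorch.
        events += 1
        if orch_done_at is None:
            orch_done_at = poll + orch_delay_polls
        if reset_last_on_detect:
            has_last = False  # models the HDEL of the *_last fields

    return events


if __name__ == "__main__":
    for delay in (1, 2, 3):
        print(delay,
              count_storm_events(delay, reset_last_on_detect=False),
              count_storm_events(delay, reset_last_on_detect=True))
```

In this model, a pfcwdorch that needs two polling intervals produces two events without the reset and one event with it. A pfcwdorch that needs three intervals still produces a second event even with the reset, which matches the remark that the fix reduces the window without closing it completely.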

wendani · 2018-11-21T23:52:25Z

orchagent/pfc_detect_mellanox.lua

                        -- DEBUG CODE END.
                        (occupancy_bytes == 0 and packets - packets_last == 0 and (pfc_duration - pfc_duration_last) > poll_time * 0.8) then
                        if time_left <= poll_time then
+                            redis.call('HDEL', counters_table_name .. ':' .. port_id, pfc_rx_pkt_key .. '_last')


We need to understand how this change affects the restore logic, because restore relies on SAI_PORT_STAT_PFC_x_RX_PKTS_last, which is now reset on each storm detection. Because the field is reset when a storm is detected, the restore script can see pfc_rx_packets_last == 0 in the polling interval immediately following the detection and skip the restore logic. https://github.com/Azure/sonic-swss/blob/6007e7f68cc103784208f69f8e6a0f12d4f2a193/orchagent/pfc_restore.lua#L42

So if the restore condition can be met in the polling interval right after the one in which the storm was detected, the restore signal is now delayed to two polling intervals. Does this situation happen in practice, or is it only theoretical?

The restoration script resets this field, but it does so at the proper time, after orchagent has reacted to the storm:

    if not ... pfc_wd_status ~= 'operational' ...
        ...
        redis.call('HSET', counters_table_name .. ':' .. port_id, pfc_rx_pkt_key .. '_last', pfc_rx_packets)
    end

If _RX_PKTS_last and _RX_PAUSE_DURATION_last are deleted (HDEL) on a storm signal, the earliest restore signal will be two polling intervals after the detect signal is published.
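The two-interval claim can be checked with a small model. This Python sketch is a simplification and not pfc_restore.lua: it assumes the restore condition is simply "no new PFC frames since the previous poll", ignores the restoration timer, and treats a missing _last value as "record and skip". The function name and the `rx_pkts_by_poll` list are hypothetical.

```python
def earliest_restore_poll(last_deleted_on_detect, rx_pkts_by_poll):
    """Return the first poll after detection at which restore would fire.

    rx_pkts_by_poll[0] is the PFC RX packet counter at the detection poll;
    later entries are the counter values at the following polls.
    Returns None if restore never fires within the given polls.
    """
    # If the detection script deleted *_last, the restore script starts
    # with nothing to compare against.
    last = None if last_deleted_on_detect else rx_pkts_by_poll[0]

    for poll in range(1, len(rx_pkts_by_poll)):
        current = rx_pkts_by_poll[poll]
        if last is None:
            # Modeled skip: store the current value, decide next poll.
            last = current
            continue
        if current == last:
            # No new PFC frames arrived: the storm is considered over.
            return poll
        last = current
    return None


if __name__ == "__main__":
    # The storm stops right after detection: the counter stays at 100.
    counters = [100, 100, 100]
    print(earliest_restore_poll(False, counters))
    print(earliest_restore_poll(True, counters))
```

With a counter that stops increasing right after detection, the model restores at the first poll when _last is kept and at the second poll when it is deleted. Whether a restore that early can occur on real hardware is the open question raised in the review.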

…nters operation (sonic-net#697) Immediately after a clear counter operation, the difference between new counter and old counter is negative. Returning 0 in this situation

Signed-off-by: Andriy Moroz <[email protected]>

[Mellanox] Fix lua/orchagent storm detection race

60ce32a

Signed-off-by: Andriy Moroz <[email protected]>

stcheng added the Bug 🐛 label Nov 20, 2018

wendani self-requested a review November 21, 2018 18:43

wendani reviewed Nov 21, 2018


wendani approved these changes Nov 29, 2018


lguohan merged commit 95c3739 into sonic-net:master Nov 30, 2018

wendani mentioned this pull request Apr 2, 2019

[orchagent]: Added support of PFC WD for BFN platform #823

Merged

Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025

[Mellanox] Fix lua/orchagent storm detection race (sonic-net#697)

abc0469

Signed-off-by: Andriy Moroz <[email protected]>

lguohan merged 1 commit into sonic-net:master from andriymoroz-mlnx:pfcwd_double_detect_fix

andriymoroz-mlnx Nov 27, 2018

wendani Apr 2, 2019
