Skip to content

[Mellanox] Fix lua/orchagent storm detection race#697

Merged
lguohan merged 1 commit intosonic-net:masterfrom
andriymoroz-mlnx:pfcwd_double_detect_fix
Nov 30, 2018
Merged

[Mellanox] Fix lua/orchagent storm detection race#697
lguohan merged 1 commit intosonic-net:masterfrom
andriymoroz-mlnx:pfcwd_double_detect_fix

Conversation

@andriymoroz-mlnx
Copy link
Copy Markdown
Contributor

Signed-off-by: Andriy Moroz [email protected]

What I did
Fixed race condition between lua script and orchagent which sometimes caused double storm detection

Why I did it
Stricter usage of *_last counters used in script to detect storm

How I verified it
PFC_WD test case for _last counters should pass

Details if related

@lguohan
Copy link
Copy Markdown
Contributor

lguohan commented Nov 21, 2018

does this race condition happen on other platform's lua script?

@lguohan
Copy link
Copy Markdown
Contributor

lguohan commented Nov 21, 2018

can you explain the nature of the race condition more clearly?

@wendani wendani self-requested a review November 21, 2018 18:43
@wendani
Copy link
Copy Markdown
Contributor

wendani commented Nov 21, 2018

At the execution that the detection script detects a storm, the script publishes the storm event to pfcwdorch. The immediate next execution of the detection script is expected to get pfc_wd_status ~= 'operational' to exit the detection logic and return. This expects the pfcwdorch to finish processing the storm occurrence event and get all actions taken within one polling interval. In a bad timing, however, pfcwdorch may not have processed the storm event to set the queue PFC_WD_STATUS in COUNTERS_DB properly by the time the immediate next execution of the detection script is called.

By resetting SAI_PORT_STAT_PFC_?_RX_PKTS_last and _RX_PAUSE_DURATION_last at the storm detection execution, the immediate next execution of the detection logic will skip the main detection logic. This allows the pfcwdorch to get the storm event processing done within a timeline of two polling intervals, not the previous one polling interval. By increasing the time window, the double detection case is expected to be largely reduced if not fully mitigated in some pfcwdorch extremely non-repsonsive cases.

#697 (comment)

@andriymoroz-mlnx is this the correct understanding of the idea?

-- DEBUG CODE END.
(occupancy_bytes == 0 and packets - packets_last == 0 and (pfc_duration - pfc_duration_last) > poll_time * 0.8) then
if time_left <= poll_time then
redis.call('HDEL', counters_table_name .. ':' .. port_id, pfc_rx_pkt_key .. '_last')
Copy link
Copy Markdown
Contributor

@wendani wendani Nov 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to understand how the change impacts the restore logic because restore relies on SAI_PORT_STAT_PFC_x_RX_PKTS_last, which is now reset on each storm detection. Because SAI_PORT_STAT_PFC_x_RX_PKTS_last is reset at a storm detection, the immediate next polling interval following the storm detection, the restore script can see pfc_rx_packets_last == 0, and skips the restore logic. https://github.com/Azure/sonic-swss/blob/6007e7f68cc103784208f69f8e6a0f12d4f2a193/orchagent/pfc_restore.lua#L42

So if there exists a situation that the storm restore is detected at the next polling interval of the storm detection occurrence, the restore signal is now delayed to 2 * polling intervals. Does this situation happen in realistic life, or its occurrence is just theorical?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restoration script resets this field but will do it in proper time after orchagent reacted to the storm

if not ... pfc_wd_status ~= 'operational' ...
    ...
    redis.call('HSET', counters_table_name .. ':' .. port_id, pfc_rx_pkt_key .. '_last', pfc_rx_packets)
end

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If _RX_PKTS_last and _RX_PAUSE_DURATION_last is hdeleted at a storm signal, the earliest restore signal will be two polling intervals away from the time a detect signal is published.

@lguohan lguohan merged commit 95c3739 into sonic-net:master Nov 30, 2018
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
…nters operation (sonic-net#697)

Immediately after a clear counter operation, the difference between new
counter and old counter is negative. Returning 0 in this situation
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants