[Mellanox] Fix lua/orchagent storm detection race#697
[Mellanox] Fix lua/orchagent storm detection race#697lguohan merged 1 commit intosonic-net:masterfrom
Conversation
Signed-off-by: Andriy Moroz <[email protected]>
|
does this race condition happen on other platform's lua script? |
|
can you explain the nature of the race condition more clearly? |
|
At the execution that the detection script detects a storm, the script publishes the storm event to pfcwdorch. The immediate next execution of the detection script is expected to get pfc_wd_status ~= 'operational' to exit the detection logic and return. This expects the pfcwdorch to finish processing the storm occurrence event and get all actions taken within one polling interval. In a bad timing, however, pfcwdorch may not have processed the storm event to set the queue PFC_WD_STATUS in COUNTERS_DB properly by the time the immediate next execution of the detection script is called. By resetting SAI_PORT_STAT_PFC_?_RX_PKTS_last and _RX_PAUSE_DURATION_last at the storm detection execution, the immediate next execution of the detection logic will skip the main detection logic. This allows the pfcwdorch to get the storm event processing done within a timeline of two polling intervals, not the previous one polling interval. By increasing the time window, the double detection case is expected to be largely reduced if not fully mitigated in some pfcwdorch extremely non-repsonsive cases. @andriymoroz-mlnx is this the correct understanding of the idea? |
| -- DEBUG CODE END. | ||
| (occupancy_bytes == 0 and packets - packets_last == 0 and (pfc_duration - pfc_duration_last) > poll_time * 0.8) then | ||
| if time_left <= poll_time then | ||
| redis.call('HDEL', counters_table_name .. ':' .. port_id, pfc_rx_pkt_key .. '_last') |
There was a problem hiding this comment.
Need to understand how the change impacts the restore logic because restore relies on SAI_PORT_STAT_PFC_x_RX_PKTS_last, which is now reset on each storm detection. Because SAI_PORT_STAT_PFC_x_RX_PKTS_last is reset at a storm detection, the immediate next polling interval following the storm detection, the restore script can see pfc_rx_packets_last == 0, and skips the restore logic. https://github.com/Azure/sonic-swss/blob/6007e7f68cc103784208f69f8e6a0f12d4f2a193/orchagent/pfc_restore.lua#L42
So if there exists a situation that the storm restore is detected at the next polling interval of the storm detection occurrence, the restore signal is now delayed to 2 * polling intervals. Does this situation happen in realistic life, or its occurrence is just theorical?
There was a problem hiding this comment.
restoration script resets this field but will do it in proper time after orchagent reacted to the storm
if not ... pfc_wd_status ~= 'operational' ...
...
redis.call('HSET', counters_table_name .. ':' .. port_id, pfc_rx_pkt_key .. '_last', pfc_rx_packets)
end
There was a problem hiding this comment.
If _RX_PKTS_last and _RX_PAUSE_DURATION_last is hdeleted at a storm signal, the earliest restore signal will be two polling intervals away from the time a detect signal is published.
…nters operation (sonic-net#697) Immediately after a clear counter operation, the difference between new counter and old counter is negative. Returning 0 in this situation
Signed-off-by: Andriy Moroz <[email protected]>
Signed-off-by: Andriy Moroz [email protected]
What I did
Fixed race condition between lua script and orchagent which sometimes caused double storm detection
Why I did it
Stricter usage of *_last counters used in script to detect storm
How I verified it
PFC_WD test case for _last counters should pass
Details if related