[PFCWD]: Add alert mode broadcom support#343
Conversation
|
Depends on sonic-net/sonic-swss-common#141 and sonic-net/sonic-sairedis#241 |
d9d6fde to
d5f5463
Compare
orchagent/orchdaemon.cpp
Outdated
| queueStatIds, | ||
| queueAttrIds)); | ||
| } | ||
| else |
There was a problem hiding this comment.
else if (platform == BROADCOM_PLATFORM_SUBSTRING)
There was a problem hiding this comment.
getenv("platform") is empty for broadcom platform. Created an issue sonic-net/sonic-buildimage#1031. Before we fix that issue, let's do this way.
There was a problem hiding this comment.
I see. In this case, maybe some kind of "TODO" comment would be appropriate with the explanation why Broadcom is a default?
| time_left = time_left - poll_time | ||
| end | ||
| else | ||
| if pfc_wd_action == 'alert' and pfc_wd_status ~= 'operational' then |
There was a problem hiding this comment.
Plugins do not distinguish between actions. Two types of notifications are "storm" and "restore". Orchagent decides what to do based on configuration.
There was a problem hiding this comment.
Currently, the detect plugin checks queues in operational status, and restore plugin checks queues in stormed status. This is not true for the alert mode. In alert mode, the detect plugin will check all the queues regardless of the queue status, and restore plugin never check anything. Currently the logic of when/which queue to check is included in the plugin itself. That's the reason why plugin also needs to know the action type to decide which queue to check for alert-mode and normal case. In #342, it still runs restore plugins for stormed queues in alert mode, which is not correct.
There was a problem hiding this comment.
According to the document, definition of alert mode is following:
PFC watchdog logs when the action of dropping packets when pause storm occurs and resume forwarding traffic when pause storm stops. With the alert mode, PFC watchdog only need to log that pause storm detected and pause storm stopped.
So as I understand, alert mode is same as other actions, except it doesn't apply any configuration, but only logs events.
@lguohan can you please take a look? Maybe I'm missing something?
There was a problem hiding this comment.
The restore plugin criteria is that the stormed queue receives no more PFC. This criteria actually bases on the fact that we have taken the recovery action. After the recovery action, no more PFC should be received if the queue has recovered. In alert mode however, we don't take any action but log the events. In such case, restore criteria is too strong, as it's OK for the queue to occasionally receive PFC when no recovery action has been taken.
Detection plugin is the criteria to determine whether the queue is stormed when no recovery action has been taken. So in alert mode, we would use detection plugin to determine whether the queue storm is stopped.
| SAI_QUEUE_ATTR_PAUSE_STATUS, | ||
| }; | ||
|
|
||
| m_orchList.push_back(new PfcWdSwOrch<PfcWdActionHandler, PfcWdActionHandler>( |
There was a problem hiding this comment.
I think you can reuse forward handler
There was a problem hiding this comment.
Currently both drop and forward uses base class for simplicity. I agree we might reuse forward handler. This PR is for alert mode. So I would prefer to update drop/forward handler together in a separate PR.
| @@ -0,0 +1,66 @@ | |||
| -- KEYS - queue IDs | |||
There was a problem hiding this comment.
Is there any difference from mlnx plugin? If not, maybe it's better to rename the original to some general name and create symlinks?
There was a problem hiding this comment.
I think we will combine the detect/restore plugin into a single file in future. It might be better to separate the lua scripts for different platforms for now.
There was a problem hiding this comment.
Well, now that we see that restoration criteria are common, it might be a good idea to leave them apart. @lguohan what do you think?
There was a problem hiding this comment.
I see your point. Will use pfc_restore.lua for both platforms.
| return actionMap.at(key); | ||
| } | ||
|
|
||
| template <typename DropHandler, typename ForwardHandler> |
There was a problem hiding this comment.
Plugins don't have to know which action type it is. It's orchagent's responsibility to make such decision.
There was a problem hiding this comment.
As above comments, detect plugin need to detect stormed queues in alert mode instead of restore plugin. This logic need to be added into lua script based on current implementation.
orchagent/pfcwdorch.cpp
Outdated
| return; | ||
| SWSS_LOG_NOTICE("Current platform is mlnx"); | ||
| } | ||
| else |
There was a problem hiding this comment.
Why else? We must not default to any platform.
And I don't really understand why we need this kind of notice message.
There was a problem hiding this comment.
Removed notice message.
| string detectPluginName = "pfc_detect_" + platform + ".lua"; | ||
| string restorePluginName = "pfc_restore_" + platform + ".lua"; | ||
|
|
||
| try |
There was a problem hiding this comment.
Do you want not to handle it?
If we'll fail to load .lua script due to, for example, syntax error orchagent will crash.
There was a problem hiding this comment.
I agree it might be better to catch the exception to prevent orchagent from crashing. Update the change.
|
@sihuihan88 In general, if PFCWD does not support some action for a specific platform, you can just use the base class which does nothing except logging. |
732ab66 to
06b5345
Compare
orchagent/pfcwdorch.cpp
Outdated
| return; | ||
| SWSS_LOG_NOTICE("Current platform is mlnx"); | ||
| } | ||
| else |
There was a problem hiding this comment.
This is not a good assumption. What needs to be done here is to export the ASIC instead of the platform inside the orchagent.sh file in the sonic-buildimage repository. Here we could leverage the ASIC env then.
orchagent/pfcwdorch.cpp
Outdated
| if (entry->second.action == PfcWdAction::PFC_WD_ACTION_DROP) | ||
| if (entry->second.action == PfcWdAction::PFC_WD_ACTION_ALERT) | ||
| { | ||
| if(entry->second.handler == nullptr) |
| void createEntry(const string& key, const vector<FieldValueTuple>& data); | ||
| static string serializeAction(const PfcWdAction &action); | ||
| private: | ||
| void createEntry(const string& key, const vector<FieldValueTuple>& data); |
orchagent/pfcwdorch.cpp
Outdated
| entry->second.index, | ||
| PfcWdOrch<DropHandler, ForwardHandler>::getCountersTable()); | ||
| } | ||
|
|
|
@stcheng are you aware of any github bot that will do such style checks for every new PR? It might significantly simplify a reviewer's job. |
|
@marian-pritsak we might need to add some checks during the build time. e.g. cpplint |
b069578 to
15aa344
Compare
15aa344 to
9e90bfa
Compare
|
retest this please |
Signed-off-by: Wenda <[email protected]>
ecnconfig check against invalid argument value (sonic-net#343)
…-net#343) We have a memory leak in pyext. As result ledd, lddpd, and pmon could use a lot of memory. I read swig documentation in many places + some source code, but not everything. As I understood from the documentation SWIG needs to have a mechanism to garbage collect objects, which are created inside of C++ method, were given outside, and it is user responsibility to remove the objects. To address such objects with python swig introduced *_OWN flag. So by documentation and sources I read I think we definitely don't need *_OWN flag, because we don't want the selectabl object to be removed, after we receive it from Select.
Signed-off-by: Sihui Han [email protected]
What I did
Enhance alert mode for PFCWD: in alert mode, pfcwd will keep using PFC storm detection criteria to see whether storm happens. If storm happens, pfcwd sends the syslog message, and keeps using detection criteria to monitor the queue status, and send syslog message when the storm stops.
Add initial support for broadcom PFCWD: current broadcom PFCWD will only support alert mode and rely on software implementation to detect storm. Will add support for drop and storm recovery in future.
Why I did it
Enhance PFCWD feature.
How I verified it
Test on the broadcom DUT, it behaves as desired.
Details if related