Skip to content

[PFCWD]: Add alert mode broadcom support#343

Merged
stcheng merged 4 commits intosonic-net:masterfrom
sihuihan88:dev/sihan/pfc
Oct 13, 2017
Merged

[PFCWD]: Add alert mode broadcom support#343
stcheng merged 4 commits intosonic-net:masterfrom
sihuihan88:dev/sihan/pfc

Conversation

@sihuihan88
Copy link
Contributor

Signed-off-by: Sihui Han [email protected]

What I did

  1. Enhance alert mode for PFCWD: in alert mode, pfcwd will keep using PFC storm detection criteria to see whether storm happens. If storm happens, pfcwd sends the syslog message, and keeps using detection criteria to monitor the queue status, and send syslog message when the storm stops.

  2. Add initial support for broadcom PFCWD: current broadcom PFCWD will only support alert mode and rely on software implementation to detect storm. Will add support for drop and storm recovery in future.

Why I did it
Enhance PFCWD feature.
How I verified it
Test on the broadcom DUT, it behaves as desired.
Details if related

@sihuihan88
Copy link
Contributor Author

queueStatIds,
queueAttrIds));
}
else
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else if (platform == BROADCOM_PLATFORM_SUBSTRING)

Copy link
Contributor Author

@sihuihan88 sihuihan88 Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getenv("platform") is empty for broadcom platform. Created an issue sonic-net/sonic-buildimage#1031. Before we fix that issue, let's do this way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In this case, maybe some kind of "TODO" comment would be appropriate with the explanation why Broadcom is a default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

time_left = time_left - poll_time
end
else
if pfc_wd_action == 'alert' and pfc_wd_status ~= 'operational' then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plugins do not distinguish between actions. Two types of notifications are "storm" and "restore". Orchagent decides what to do based on configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the detect plugin checks queues in operational status, and restore plugin checks queues in stormed status. This is not true for the alert mode. In alert mode, the detect plugin will check all the queues regardless of the queue status, and restore plugin never check anything. Currently the logic of when/which queue to check is included in the plugin itself. That's the reason why plugin also needs to know the action type to decide which queue to check for alert-mode and normal case. In #342, it still runs restore plugins for stormed queues in alert mode, which is not correct.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the document, definition of alert mode is following:
PFC watchdog logs when the action of dropping packets when pause storm occurs and resume forwarding traffic when pause storm stops. With the alert mode, PFC watchdog only need to log that pause storm detected and pause storm stopped.

So as I understand, alert mode is same as other actions, except it doesn't apply any configuration, but only logs events.

@lguohan can you please take a look? Maybe I'm missing something?

Copy link
Contributor Author

@sihuihan88 sihuihan88 Oct 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restore plugin criteria is that the stormed queue receives no more PFC. This criteria actually bases on the fact that we have taken the recovery action. After the recovery action, no more PFC should be received if the queue has recovered. In alert mode however, we don't take any action but log the events. In such case, restore criteria is too strong, as it's OK for the queue to occasionally receive PFC when no recovery action has been taken.

Detection plugin is the criteria to determine whether the queue is stormed when no recovery action has been taken. So in alert mode, we would use detection plugin to determine whether the queue storm is stopped.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

SAI_QUEUE_ATTR_PAUSE_STATUS,
};

m_orchList.push_back(new PfcWdSwOrch<PfcWdActionHandler, PfcWdActionHandler>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can reuse forward handler

Copy link
Contributor Author

@sihuihan88 sihuihan88 Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently both drop and forward uses base class for simplicity. I agree we might reuse forward handler. This PR is for alert mode. So I would prefer to update drop/forward handler together in a separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -0,0 +1,66 @@
-- KEYS - queue IDs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference from mlnx plugin? If not, maybe it's better to rename the original to some general name and create symlinks?

Copy link
Contributor Author

@sihuihan88 sihuihan88 Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will combine the detect/restore plugin into a single file in future. It might be better to separate the lua scripts for different platforms for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, now that we see that restoration criteria are common, it might be a good idea to leave them apart. @lguohan what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Will use pfc_restore.lua for both platforms.

return actionMap.at(key);
}

template <typename DropHandler, typename ForwardHandler>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plugins don't have to know which action type it is. It's orchagent's responsibility to make such decision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above comments, detect plugin need to detect stormed queues in alert mode instead of restore plugin. This logic need to be added into lua script based on current implementation.

return;
SWSS_LOG_NOTICE("Current platform is mlnx");
}
else
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why else? We must not default to any platform.
And I don't really understand why we need this kind of notice message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed notice message.

string detectPluginName = "pfc_detect_" + platform + ".lua";
string restorePluginName = "pfc_restore_" + platform + ".lua";

try
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want not to handle it?
If we'll fail to load .lua script due to, for example, syntax error orchagent will crash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it might be better to catch the exception to prevent orchagent from crashing. Update the change.

@marian-pritsak
Copy link
Collaborator

@sihuihan88 In general, if PFCWD does not support some action for a specific platform, you can just use the base class which does nothing except logging.

Copy link
Contributor

@stcheng stcheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

various misc changes.

return;
SWSS_LOG_NOTICE("Current platform is mlnx");
}
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good assumption. What needs to be done here is to export the ASIC instead of the platform inside the orchagent.sh file in the sonic-buildimage repository. Here we could leverage the ASIC env then.

if (entry->second.action == PfcWdAction::PFC_WD_ACTION_DROP)
if (entry->second.action == PfcWdAction::PFC_WD_ACTION_ALERT)
{
if(entry->second.handler == nullptr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add space between if

void createEntry(const string& key, const vector<FieldValueTuple>& data);
static string serializeAction(const PfcWdAction &action);
private:
void createEntry(const string& key, const vector<FieldValueTuple>& data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

four spaces

entry->second.index,
PfcWdOrch<DropHandler, ForwardHandler>::getCountersTable());
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra empty line

@marian-pritsak
Copy link
Collaborator

@stcheng are you aware of any github bot that will do such style checks for every new PR? It might significantly simplify a reviewer's job.

@stcheng
Copy link
Contributor

stcheng commented Oct 12, 2017

@marian-pritsak we might need to add some checks during the build time. e.g. cpplint

@sihuihan88 sihuihan88 force-pushed the dev/sihan/pfc branch 2 times, most recently from b069578 to 15aa344 Compare October 12, 2017 18:25
@sihuihan88
Copy link
Contributor Author

retest this please

@stcheng stcheng merged commit e07d86b into sonic-net:master Oct 13, 2017
@sihuihan88 sihuihan88 deleted the dev/sihan/pfc branch January 3, 2018 00:57
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
ecnconfig check against invalid argument value (sonic-net#343)
lukasstockner pushed a commit to genesiscloud/sonic-swss that referenced this pull request Apr 2, 2023
jianyuewu pushed a commit to jianyuewu/sonic-swss that referenced this pull request Dec 24, 2025
…-net#343)

We have a memory leak in pyext. As result ledd, lddpd, and pmon could use a lot of memory.

I read swig documentation in many places + some source code, but not everything. As I understood from the documentation SWIG needs to have a mechanism to garbage collect objects, which are created inside of C++ method, were given outside, and it is user responsibility to remove the objects. To address such objects with python swig introduced *_OWN flag. So by documentation and sources I read I think we definitely don't need *_OWN flag, because we don't want the selectabl object to be removed, after we receive it from Select.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants