[portmgrd] regression: prevent runtime exception (crash) in configuring portchannel at boot #3432

Merged
prsunny merged 7 commits into sonic-net:master from bradh352:bradh352/portchannel-crash on Apr 10, 2025

bradh352 (Contributor) commented Dec 19, 2024

What I did
Prevent setting a default port MTU on PortChannel member ports, as doing so fails during boot (at least on the Dell S5248F) and causes portmgrd to exit. The current code in portmgr.cpp sets a default MTU (9100) even when the port is a PortChannel member, so this patch prevents that default from being applied.

Also, if a user were to incorrectly specify an MTU in config_db.json on a port that is a member of a PortChannel, this too would bring down portmgrd, so catch that case and emit a warning instead. (NOTE: the YANG model does NOT support checking for or preventing an MTU set on a PORT that is part of a PORTCHANNEL, so this secondary issue should be caught and handled gracefully.)
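
A minimal sketch of the warn-instead-of-throw behaviour described above, not the code in this patch. `trySetMtu`, `MtuResult`, and the two callbacks are hypothetical names used only for illustration. The command string mirrors the one in the log output quoted in this description, and the real daemon would report through its own logging macros rather than `std::cerr`.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical result type, for illustration only.
enum class MtuResult { Applied, SkippedLagMember };

// Hypothetical helper: try to set the MTU; if the command fails on a port
// that is a PortChannel member, warn and carry on instead of throwing.
// `run` returns the command's exit status; `isLagMember` reports membership.
MtuResult trySetMtu(const std::string &port, const std::string &mtu,
                    const std::function<int(const std::string &)> &run,
                    const std::function<bool(const std::string &)> &isLagMember)
{
    assert(run && isLagMember); // both callbacks must be callable

    const std::string cmd =
        "/sbin/ip link set dev \"" + port + "\" mtu \"" + mtu + "\"";

    if (run(cmd) == 0)
    {
        return MtuResult::Applied;
    }
    if (isLagMember(port))
    {
        // Expected failure: report it, but keep the daemon alive.
        std::cerr << "WARN: cannot set MTU " << mtu << " on " << port
                  << ": port is a PortChannel member" << std::endl;
        return MtuResult::SkippedLagMember;
    }
    // Any other failure is still treated as fatal by the caller.
    throw std::runtime_error(cmd + " : command failed");
}
```

With a `run` callback that fails and an `isLagMember` callback that returns true for the port, the call returns `MtuResult::SkippedLagMember` rather than terminating the process; the same failure on a non-member port still throws.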

To avoid adding much overhead on systems with large port counts, we also lazily cache the PortChannel members (in a variable local to doTask(), so it is short-lived) and only use that cache when a new port is brought up or when setting an MTU fails.

This code is only called if the port is not in the PORT_TABLE. I'm not aware of any instances where ports are added to PORT_TABLE after startup.

During startup, doTask() is not expected to be invoked once per port; rather, it receives events for multiple (likely all) ports at once. I optimized for that case by adding a hashtable cache so that each port doesn't have to pull the member list from the db; it is pulled at most once per doTask() call.

Therefore, I expect the overhead of this patch to be exactly one db query per switch startup (not per port, and not after startup), followed by a very cheap hashtable lookup once per port, again only during startup.
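
The lazy, per-call cache described above could look roughly like the sketch below. This is an illustration under assumptions, not the code in this patch: `LagMemberCache`, `KeyFetcher`, and `wantsDefaultMtu` are hypothetical names, and the fetch callback stands in for whatever db query lists the PORTCHANNEL_MEMBER keys (in the `PortChannel0001|Ethernet0` form shown in the config in this description).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the db query that lists PORTCHANNEL_MEMBER keys.
using KeyFetcher = std::function<std::vector<std::string>()>;

// Lazily built set of PortChannel member ports. Meant to be created as a
// local variable for one batch of port events, so it never goes stale.
class LagMemberCache
{
public:
    explicit LagMemberCache(KeyFetcher fetch) : m_fetch(std::move(fetch))
    {
        assert(m_fetch); // a fetch callback is required
    }

    // True if `port` is the member part of some "<PortChannel>|<port>" key.
    bool contains(const std::string &port)
    {
        if (!m_loaded)
        {
            // First lookup in this batch: query once and index the result.
            for (const auto &key : m_fetch())
            {
                const auto sep = key.find('|');
                if (sep != std::string::npos)
                {
                    m_members.insert(key.substr(sep + 1));
                }
            }
            m_loaded = true;
        }
        return m_members.count(port) > 0;
    }

private:
    KeyFetcher m_fetch;
    bool m_loaded = false;
    std::unordered_set<std::string> m_members;
};

// Hypothetical helper: only ports seen for the first time consult the cache.
bool wantsDefaultMtu(bool firstTimeSeen, const std::string &port,
                     LagMemberCache &cache)
{
    // Short-circuit: already-known ports never trigger the db query.
    return firstTimeSeen && !cache.contains(port);
}
```

Because the cache object is local to one batch, the query runs at most once no matter how many new ports the batch contains, and not at all when every port in the batch is already known.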

Why I did it
The current code always attempts to set an MTU on the PORT by setting a default here:

/* If this is the first time we set port settings
 * assign default admin status and mtu
 */
if (!configured)
{
    admin_status = DEFAULT_ADMIN_STATUS_STR;
    mtu = DEFAULT_MTU_STR;
    m_portList.insert(alias);
}

Then applies it here:

if (!mtu.empty())
{
    setPortMtu(alias, mtu);
    SWSS_LOG_NOTICE("Configure %s MTU to %s", alias.c_str(), mtu.c_str());
}

So it isn't crashing because the user configured the MTU in the PORT config, but because the MTU is set by default (in portmgr.cpp) when the port is created. (It would also crash if a user explicitly set an MTU on a member port, which is bad since YANG does nothing to prevent this.)

NOTE: this only appears to crash on a freshly loaded config at boot. If you take an existing running configuration and modify it to add a portchannel, it works, since the port already exists in PORT_TABLE and the default-MTU path in the code referenced above isn't taken.

This regression was caused by 8b99543 (Oct 2024), but just reverting that patch isn't the right solution. The startup logic does not appear to be correct, as it tries to set a default MTU regardless of whether it is valid to do so for the port.

Logs show this issue, which is a critical failure causing the entire switch to go down:

2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)

It is possible this won't happen on all switches, as I'm guessing this is a race condition between PortChannel creation and Port creation, since they both listen to the configdb separately. At least on my Dell S5248F switches, the PortChannel is triggered first by teammgrd, before the physical Ports are added to PORT_TABLE by portmgrd.

How I verified it

Apply the patch and verify that this config no longer causes a crash on Dell S5248F (Broadcom Trident3) during startup/boot.

Note that the configuration below does not contain an mtu at all, because the primary issue is code internal to SONiC setting a default mtu.

Tested on 202411 and master.

{
    "PORT": {
        "Ethernet0": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/1",
            "autoneg": "off",
            "description": "PortChannel1 mgmt",
            "fec": "rs",
            "index": "1",
            "lanes": "49",
            "speed": "25000"
        },
        "Ethernet1": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/2",
            "autoneg": "off",
            "description": "PortChannel1 mgmt",
            "fec": "rs",
            "index": "2",
            "lanes": "50",
            "speed": "25000"
        }
    },
    "PORTCHANNEL": {
        "PortChannel0001": {
            "admin_status": "up",
            "description": "management interface",
            "lacp_key": "auto",
            "min_links": "1"
        }
    },
    "PORTCHANNEL_INTERFACE": {
        "PortChannel0001": {
            "ipv6_use_link_local_only": "enable",
            "mac_addr": "02:d3:ab:fe:fd:c4"
        },
        "PortChannel0001|10.0.0.11/24": {}
    },
    "PORTCHANNEL_MEMBER": {
        "PortChannel0001|Ethernet0": {},
        "PortChannel0001|Ethernet1": {}
    }
}

Details if related
Signed-off-by: Brad House (@bradh352)

bradh352 requested a review from prsunny as a code owner on December 19, 2024 00:37

mssonicbld (Collaborator) commented

/azp run

azure-pipelines commented

Azure Pipelines successfully started running 1 pipeline(s).

bradh352 changed the title from "portmgrd: prevent runtime failure in setting MTU on portchannel member" to "[portmgrd] prevent runtime exception (crash) in setting MTU on portchannel member" on Dec 19, 2024

bradh352 (Contributor, Author) commented

@prsunny please review

bradh352 added 6 commits to bradh352/sonic-swss that referenced this pull request (4 on Dec 24, 2024 and 2 on Jan 2, 2025), each with the same message:

…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
prsunny requested review from dgsudharsan and prgeor on January 6, 2025 18:36

github-actions bot pushed a commit with the same message to bradh352/sonic-swss that referenced this pull request on Jan 7, 2025

dgsudharsan (Collaborator) left a comment

Please add UT to cover this scenario.

dgsudharsan (Collaborator) commented

@bradh352 We do have this check at CLI level https://github.com/sonic-net/sonic-utilities/blob/80d469886f120bfe9bc60024f608c039dce06646/config/main.py#L4948

Why do we need such checks at multiple places? @prsunny what are your thoughts on this?

bradh352 (Contributor, Author) commented Jan 7, 2025

> @bradh352 We do have this check at CLI level https://github.com/sonic-net/sonic-utilities/blob/80d469886f120bfe9bc60024f608c039dce06646/config/main.py#L4948
>
> Why do we need such checks at multiple places? @prsunny what are your thoughts on this?

People using tools like Ansible don't use the CLI to set configuration. They modify /etc/sonic/config_db.json directly, which does nothing to prevent this. ALSO, in this case, as you can see from the /etc/sonic/config_db.json example I provided, no MTU is specified at all in the PORT configuration. It's being autopopulated somewhere as a default. I didn't try to track that down.

@bradh352 bradh352 force-pushed the bradh352/portchannel-crash branch from 277e0ae to 201d5b1 Compare January 7, 2025 02:32
@bradh352 bradh352 force-pushed the bradh352/portchannel-crash branch from 84669f9 to d4b4b98 Compare January 7, 2025 02:55

bradh352 (Contributor, Author) commented Jan 7, 2025

> Please add UT to cover this scenario.

I committed one, no idea if it's right.

@bradh352 bradh352 requested a review from dgsudharsan January 7, 2025 09:45
bradh352 (Contributor, Author) commented Jan 7, 2025

coverage looks good, any other comments?

bradh352 (Contributor, Author) commented

@prsunny is the current commit what you want?

prsunny (Collaborator) commented Mar 28, 2025

> @prsunny is the current commit what you want?

Not really, keep the old code as-is and log WARN instead of 'throw'

This reverts commit 5cd5d27.

Nope, @prsunny didn't accept.

bradh352 (Contributor, Author) commented

@prsunny ok, figured less code was better by removing another conditional that is irrelevant, but if you want it there, so be it. Please see latest commit.

prsunny (Collaborator) commented Apr 1, 2025

Suggest following Sonic github community guidelines for commit messages as reviewers can have better understanding of the code that's added/removed in each commit.

dgsudharsan (Collaborator) left a comment

Please modify commit messages as suggested.

bhouse-nexthop (Contributor) commented

> Suggest following Sonic github community guidelines for commit messages as reviewers can have better understanding of the code that's added/removed in each commit.

Do you just want them all squashed into 1 commit with the relevant message? The first commit in the series has a proper commit message, but it is no longer relevant to the current overall patch set. The rest of the commits obviously do not have properly formed messages since I was more concerned with coming up with something that was acceptable to you from a code standpoint.

prsunny (Collaborator) commented Apr 10, 2025

> > Suggest following Sonic github community guidelines for commit messages as reviewers can have better understanding of the code that's added/removed in each commit.
>
> Do you just want them all squashed into 1 commit with the relevant message? The first commit in the series has a proper commit message, but it is no longer relevant to the current overall patch set. The rest of the commits obviously do not have properly formed messages since I was more concerned with coming up with something that was acceptable to you from a code standpoint.

Leave it for this PR and address for future PRs to have meaningful commits. Will merge once the PR checkers pass.

prsunny (Collaborator) commented Apr 10, 2025

Bypassed coverage temporarily as this is an exception path.

bradh352 (Contributor, Author) commented

Please tag for backport to 202411 as well, or I can make a PR for that if needed.
