Fix race condition between networking service and interface-config service#10573
Merged
yxieca merged 3 commits intosonic-net:masterfrom May 5, 2022
Merged
Fix race condition between networking service and interface-config service#10573yxieca merged 3 commits intosonic-net:masterfrom
yxieca merged 3 commits intosonic-net:masterfrom
Conversation
…rvice Conflicts: files/image_config/interfaces/interfaces-config.sh
keboliu
previously approved these changes
Apr 14, 2022
Collaborator
Author
|
/azpw run Azure.sonic-buildimage |
Collaborator
|
/AzurePipelines run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
Author
|
/azpw run Azure.sonic-buildimage |
Collaborator
|
/AzurePipelines run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
Author
|
/azpw run Azure.sonic-buildimage |
Collaborator
|
/AzurePipelines run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
@qiluo-msft could you please help to review or suggest someone who can? |
Collaborator
Author
|
Hi @yxieca , could you please help review and sign-off? |
yxieca
reviewed
Apr 25, 2022
Collaborator
Author
|
/azpw run Azure.sonic-buildimage |
Collaborator
|
/AzurePipelines run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
Author
|
/azpw run Azure.sonic-buildimage |
Collaborator
|
/AzurePipelines run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
@yxieca kindly reminder to review fixes and signoff |
yxieca
reviewed
May 2, 2022
Collaborator
Author
|
@yxieca kindly reminder. |
yxieca
approved these changes
May 5, 2022
Collaborator
Author
|
Hi @judyjoseph , could you please help cherry-pick to 202111? For 202012, there is no clean cherry-pick, I will add separate PR. |
Junchao-Mellanox
added a commit
to Junchao-Mellanox/sonic-buildimage
that referenced
this pull request
May 6, 2022
…rvice (sonic-net#10573) Why I did it The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like: Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp. Systemd starts interface-config service who depends on networking service Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down. Interface-config service updates /etc/networking/interface to static configuration. Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP) When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces. The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3. How I did it Check networking service state before running "ifdown –force eth0", wait for it done if it is activating. How to verify it Manual test. Conflicts: files/image_config/interfaces/interfaces-config.sh
6 tasks
judyjoseph
pushed a commit
that referenced
this pull request
May 8, 2022
…rvice (#10573) Why I did it The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like: Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp. Systemd starts interface-config service who depends on networking service Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down. Interface-config service updates /etc/networking/interface to static configuration. Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP) When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces. The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3. How I did it Check networking service state before running "ifdown –force eth0", wait for it done if it is activating. How to verify it Manual test.
qiluo-msft
pushed a commit
that referenced
this pull request
May 14, 2022
…rvice (#10573) (#10766) Backport #10573 to 202012. #### Why I did it The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like: 1. Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp. 2. Systemd starts interface-config service who depends on networking service 3. Interface-config service runs command “ifdown –force eth0”, check [line](https://github.com/Azure/sonic-buildimage/blob/16717d2dc51f74fa711ed7b4392ce5e4f7e71c29/files/image_config/interfaces/interfaces-config.sh#L4). but networking service is still running so that this [line](https://github.com/CumulusNetworks/ifupdown2/blob/ac32bec0e24d64c583778f387050a7b6f4269db0/ifupdown2/ifupdown/main.py#L74) failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down. 4. Interface-config service updates /etc/networking/interface to static configuration. 5. Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP) 6. When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces. The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3. #### How I did it Check networking service state before running "ifdown –force eth0", wait for it done if it is activating. #### How to verify it Manual test.
liushilongbuaa
pushed a commit
to liushilongbuaa/sonic-buildimage
that referenced
this pull request
Jun 20, 2022
…rvice (sonic-net#10573) Why I did it The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like: Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp. Systemd starts interface-config service who depends on networking service Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down. Interface-config service updates /etc/networking/interface to static configuration. Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP) When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces. The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3. How I did it Check networking service state before running "ifdown –force eth0", wait for it done if it is activating. How to verify it Manual test.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.
How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.
How to verify it
Manual test.
Which release branch to backport (provide reason below if selected)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)