Skip to content

Fix race condition between networking service and interface-config service#10573

Merged
yxieca merged 3 commits intosonic-net:masterfrom
Junchao-Mellanox:fix-mgmt-ip-loss-1
May 5, 2022
Merged

Fix race condition between networking service and interface-config service#10573
yxieca merged 3 commits intosonic-net:masterfrom
Junchao-Mellanox:fix-mgmt-ip-loss-1

Conversation

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator

Why I did it

The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

  1. Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
  2. Systemd starts interface-config service who depends on networking service
  3. Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
  4. Interface-config service updates /etc/networking/interface to static configuration.
  5. Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
  6. When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.

The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

How I did it

Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

How to verify it

Manual test.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

…rvice

Conflicts:
	files/image_config/interfaces/interfaces-config.sh
keboliu
keboliu previously approved these changes Apr 14, 2022
@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Copy link
Copy Markdown
Collaborator

@qiluo-msft could you please help to review or suggest someone who can?

@keboliu keboliu requested a review from prsunny April 19, 2022 01:10
@qiluo-msft qiluo-msft requested a review from yxieca April 21, 2022 21:28
@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator Author

Hi @yxieca , could you please help review and sign-off?

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Copy link
Copy Markdown
Collaborator

@yxieca kindly reminder to review fixes and signoff

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator Author

@yxieca kindly reminder.

@yxieca yxieca merged commit 681c248 into sonic-net:master May 5, 2022
@Junchao-Mellanox Junchao-Mellanox deleted the fix-mgmt-ip-loss-1 branch May 6, 2022 01:05
@Junchao-Mellanox Junchao-Mellanox added Request for 202012 Branch Request for 202111 Branch For PRs being requested for 202111 branch labels May 6, 2022
@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator Author

Hi @judyjoseph , could you please help cherry-pick to 202111?

For 202012, there is no clean cherry-pick, I will add separate PR.

Junchao-Mellanox added a commit to Junchao-Mellanox/sonic-buildimage that referenced this pull request May 6, 2022
…rvice (sonic-net#10573)

Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
Systemd starts interface-config service who depends on networking service
Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
Interface-config service updates /etc/networking/interface to static configuration.
Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

How to verify it
Manual test.
Conflicts:
	files/image_config/interfaces/interfaces-config.sh
judyjoseph pushed a commit that referenced this pull request May 8, 2022
…rvice (#10573)

Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
Systemd starts interface-config service who depends on networking service
Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
Interface-config service updates /etc/networking/interface to static configuration.
Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

How to verify it
Manual test.
qiluo-msft pushed a commit that referenced this pull request May 14, 2022
…rvice (#10573) (#10766)

Backport #10573 to 202012.

#### Why I did it

The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

1.	Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
2.	Systemd starts interface-config service who depends on networking service
3.	Interface-config service runs command  “ifdown –force eth0”, check [line](https://github.com/Azure/sonic-buildimage/blob/16717d2dc51f74fa711ed7b4392ce5e4f7e71c29/files/image_config/interfaces/interfaces-config.sh#L4). but networking service is still running so that this [line](https://github.com/CumulusNetworks/ifupdown2/blob/ac32bec0e24d64c583778f387050a7b6f4269db0/ifupdown2/ifupdown/main.py#L74) failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
4.	Interface-config service updates /etc/networking/interface to static configuration.
5.	Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
6.	When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.

The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

#### How I did it

Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

#### How to verify it

Manual test.
liushilongbuaa pushed a commit to liushilongbuaa/sonic-buildimage that referenced this pull request Jun 20, 2022
…rvice (sonic-net#10573)

Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
Systemd starts interface-config service who depends on networking service
Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
Interface-config service updates /etc/networking/interface to static configuration.
Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

How to verify it
Manual test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Included in 202111 Branch Request for 202111 Branch For PRs being requested for 202111 branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants