Improve the cleanup of processes and interfaces before stopping PTF container by wangxin · Pull Request #10069 · sonic-net/sonic-mgmt

wangxin · 2023-09-20T01:52:11Z

Description of PR

Summary:
Fixes # (issue)

Type of change

Bug fix
Testbed and Framework(new/improvement)
Test case(new/improvement)

Back port request

201911
202012
202205

Approach

What is the motivation for this PR?

We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

The server may crash and run into a CPU soft lockup.
Some exabgp processes cannot be fully stopped, so "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf always succeed and, of course, never crash the server.

Possible reasons for the server crash:

Some exabgp processes are still running in the PTF container when we remove the container. This could cause a server crash.
Some network interfaces are still in the PTF container's network namespace when we remove the container.

How did you do it?

Added a custom module "ptf_control" to stop and kill processes running in the PTF container in a more aggressive and reliable way.
Improved the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step to the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully comply with the syntax shown in "ip link help".
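
The ordering described above (kill processes, detach interfaces, then stop the container) can be sketched as a small command builder. This is only an illustration, not the code in this PR: the function name `build_cleanup_commands`, the use of `nsenter`, and moving interfaces to the namespace of PID 1 are assumptions made for the example. The `kill -9` command mirrors the one in ptf_control.py, and the `ip link set dev DEVICE netns PID` form follows "ip link help".

```python
from typing import List


def build_cleanup_commands(ctn_name: str, ctn_pid: int,
                           pids: List[int], interfaces: List[str]) -> List[str]:
    """Return shell commands in a safe order for tearing down a container.

    Hypothetical helper: it only builds command strings, it runs nothing.
    """
    commands = []
    # 1. Kill leftover processes inside the container first.
    for pid in pids:
        commands.append(
            'docker exec -t {} bash -c "kill -9 {}"'.format(ctn_name, pid)
        )
    # 2. Move each interface out of the container's network namespace.
    #    Assumption: the interface is moved to the namespace of PID 1 (the host).
    for ifname in interfaces:
        commands.append(
            "nsenter -t {} -n ip link set dev {} netns 1".format(ctn_pid, ifname)
        )
    # 3. Only after the namespace is empty, stop the container.
    commands.append("docker stop {}".format(ctn_name))
    return commands
```

Building the commands separately from running them keeps the ordering easy to review and to log before anything touches the server.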

How did you verify/test it?

Tested add-topo/remove-topo on both physical and KVM testbeds.
Tested restart-ptf on a physical testbed.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

mssonicbld · 2023-09-20T01:53:17Z

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

ansible/roles/vm_set/library/ptf_control.py:113:1: E302 expected 2 blank lines, found 1

flake8...............................................(no files to check)Skipped
check conditional mark sort..............................................Passed

To run the pre-commit checks locally, follow the steps below:

Ensure that the default python is python3. In the sonic-mgmt docker container, the default python is python2. You can run
the check either by activating the python3 virtual environment in the sonic-mgmt docker container or by running it outside
of the sonic-mgmt docker container.
Ensure that the pre-commit package is installed:

sudo pip install pre-commit

Go to repository root folder
Install the pre-commit hooks:

pre-commit install

Use pre-commit to check staged files:

pre-commit

Alternatively, you can check committed files using:

pre-commit run --from-ref <commit_id> --to-ref <commit_id>

mssonicbld · 2023-09-21T06:18:28Z

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

ansible/roles/vm_set/library/ptf_control.py:75:121: E501 line too long (123 > 120 characters)

flake8...............................................(no files to check)Skipped
check conditional mark sort..............................................Passed

yejianquan

LGTM

lizhijianrd · 2023-09-20T05:58:47Z

ansible/roles/vm_set/library/ptf_control.py

+        self.cmd('docker exec -t {} bash -c "kill -9 {}"'.format(self.ctn_name, pid), ignore_failure=True)
+
+    def kill_processes(self):
+        self.cmd('docker exec -t {} bash -c "ps -ef"'.format(self.ctn_name))


What is this command used for? (I have the same confusion at line 108.)

It is for debugging purposes. By default, this module can generate logs like /tmp/ptf_control_xxx.log.
With this command, we can clearly see the processes running in the PTF docker container before and after the kill. If a server crash happens again, we may be able to get some clues from the log.

Recently we noticed that some tests may create exabgp processes like "exabgp-psudoswitch1" in the PTF container, so a more aggressive way of killing the processes is required. It is also better to collect more information for debugging in case the issue happens again.
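
To make the "log before, kill, log after" idea concrete, here is a minimal sketch. It is not the ptf_control implementation: `run_shell`, `parse_pids`, `kill_matching`, and the logger name are hypothetical, and matching processes by a keyword in the `ps -ef` output is an assumption made for the example.

```python
import logging
import subprocess
from typing import Callable, List


def run_shell(command: str) -> str:
    """Run a shell command and return its stdout; failures are ignored."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout


def parse_pids(ps_output: str, keyword: str) -> List[int]:
    """Return PIDs (second column of `ps -ef`) for lines containing keyword."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) > 1 and keyword in line and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


def kill_matching(ctn_name: str, keyword: str,
                  run: Callable[[str], str] = run_shell,
                  log: logging.Logger = logging.getLogger("ptf_control_sketch")) -> List[int]:
    """Log the process table, kill matching processes, then log it again."""
    ps_cmd = 'docker exec {} bash -c "ps -ef"'.format(ctn_name)
    before = run(ps_cmd)
    log.debug("processes before kill:\n%s", before)
    pids = parse_pids(before, keyword)
    for pid in pids:
        run('docker exec {} bash -c "kill -9 {}"'.format(ctn_name, pid))
    log.debug("processes after kill:\n%s", run(ps_cmd))
    return pids
```

Passing the command runner as a parameter lets the logic be exercised without a real container. Matching on the substring "exabgp" would also catch names such as "exabgp-psudoswitch1".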

mssonicbld · 2023-10-30T06:00:32Z

@wangxin PR conflicts with 202205 branch

mssonicbld · 2023-10-30T06:00:34Z

@wangxin PR conflicts with 202012 branch

mssonicbld · 2023-10-30T06:00:40Z

Cherry-pick PR to 202305: #10521

Cherry-pick #10069 and #10286 to 202205 branch.

* Improve the cleanup of processes and interfaces before stopping PTF container (#10244

What is the motivation for this PR?

We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.

Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?

Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?

Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on physical testbed.

* Avoid running command in exited ptf docker container (#10286)

While stopping PTF container, "ptf_control" module is executed to kill all processes in the PTF container. The original code checks if the PTF container's Pid exists before running command in the PTF container. Unfortunately, this check is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid. When PTF container is not in "running" status, always return None for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
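
The #10286 rule above ("not running means no Pid") can be sketched as a small pure function. This is an illustration under stated assumptions, not the actual ptf_control code: `running_container_pid` is a hypothetical name, and its input is assumed to be the dictionary shape returned by `docker inspect` for one container, which has a `State` object with `Status` and `Pid` fields.

```python
from typing import Any, Dict, Optional


def running_container_pid(inspect_result: Optional[Dict[str, Any]]) -> Optional[int]:
    """Return the container's Pid only when the container is running.

    Hypothetical helper: `inspect_result` is the parsed `docker inspect`
    output for a single container, or None if the container does not exist.
    """
    if not inspect_result:
        return None
    state = inspect_result.get("State") or {}
    # A Pid value alone is not enough: check the status explicitly.
    if state.get("Status") != "running":
        return None
    pid = state.get("Pid")
    return pid if isinstance(pid, int) and pid > 0 else None
```

A caller can then skip every `docker exec` step whenever this returns None, instead of trying to run commands in an exited container.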

* Improve the cleanup of processes and interfaces before stopping PTF container (#10069)

What is the motivation for this PR?

We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.

Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?

Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?

Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on physical testbed.

* Avoid running command in exited ptf docker container (#10286)

While stopping PTF container, "ptf_control" module is executed to kill all processes in the PTF container. The original code checks if the PTF container's Pid exists before running command in the PTF container. Unfortunately, this check is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid. When PTF container is not in "running" status, always return None for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>

* Install right version of docker.py for ubuntu22.04

---------

Signed-off-by: Xin Wang <[email protected]>

wangxin added 3 commits September 18, 2023 17:48

Safe remove PTF container

e294603

Add ptf_control

774a4cf

Aggressively kill

6a615be

wangxin requested review from lizhijianrd and yejianquan September 20, 2023 01:52

wangxin added 2 commits September 20, 2023 09:55

Fix pre-commit style

292fe5d

Kill all possible exabgp processes in ptf

f7b5eb9

Fix pre-commit style issue

55537be

yejianquan approved these changes Sep 21, 2023

lizhijianrd reviewed Sep 21, 2023

lizhijianrd approved these changes Sep 21, 2023

wangxin merged commit 5658ac9 into sonic-net:master Sep 21, 2023

wangxin added the Request for 202205 branch, Request for 202305 branch, Request for 202012 branch, Approved for 202012 Branch, Approved for 202205 branch, and Approved for 202305 branch labels Oct 30, 2023

mssonicbld added the Cherry Pick Conflict_202205 label Oct 30, 2023

mssonicbld added the Cherry Pick Conflict_202012 label Oct 30, 2023

mssonicbld added the Created PR to 202305 branch label Oct 30, 2023

mssonicbld mentioned this pull request Oct 30, 2023

[action] [PR:10069] Improve the cleanup of processes and interfaces before stopping PTF container #10521

Merged

6 tasks

wangxin mentioned this pull request Oct 30, 2023

[202205] Improve the cleanup of PTF container #10523

Merged

7 tasks

mssonicbld added Included in 202305 branch and removed Created PR to 202305 branch labels Oct 30, 2023

wangxin added the Included in 202205 branch label Oct 30, 2023

wangxin mentioned this pull request Oct 31, 2023

[202012] Improve the cleanup of PTF container #10533

Merged

7 tasks

wangxin added the Included in 202012 branch label Dec 18, 2023

wangxin merged 6 commits into sonic-net:master from wangxin:safe-stop-ptf


4 participants
