Improve the cleanup of processes and interfaces before stopping PTF container #10069
wangxin merged 6 commits into sonic-net:master
The pre-commit check detected issues in the files touched by this pull request. Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
        self.cmd('docker exec -t {} bash -c "kill -9 {}"'.format(self.ctn_name, pid), ignore_failure=True)

    def kill_processes(self):
        self.cmd('docker exec -t {} bash -c "ps -ef"'.format(self.ctn_name))
What is this command used for? (I have the same question about line 108.)
It is for debugging purposes. By default, this module generates logs such as /tmp/ptf_control_xxx.log.
With this command, we can clearly see the processes running in the PTF docker container before and after the kill. If a server crash happens again, we may be able to get some clues from the log.
Recently we noticed that some tests may create exabgp processes such as "exabgp-psudoswitch1" in the PTF container, so a more aggressive way of killing the processes is required. It would also be better to collect more information for debugging in case the issue happens again.
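To make the "list, then kill" idea above concrete, here is a minimal sketch of how the `ps -ef` output could be turned into `kill -9` commands of the same shape as the one in the diff. It is not the code from this PR: the helper names `pids_to_kill` and `kill_commands`, the `keep_pids` parameter, the sample process list, and the container name are all assumptions made for illustration.

```python
def pids_to_kill(ps_output, keep_pids=(1,)):
    """Return the PIDs found in `ps -ef` output that should be killed.

    Hypothetical helper: skips the header line, any PID in keep_pids,
    and the `ps` command that produced the listing.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:  # first line is the header
        # Columns of `ps -ef`: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue
        pid, cmd = int(fields[1]), fields[7]
        if pid in keep_pids or cmd.startswith("ps "):
            continue
        pids.append(pid)
    return pids


def kill_commands(ctn_name, pids):
    """Build one `docker exec ... kill -9` command string per PID."""
    return [
        'docker exec -t {} bash -c "kill -9 {}"'.format(ctn_name, pid)
        for pid in pids
    ]


# Made-up sample of what `ps -ef` might print inside a container.
SAMPLE_PS = """UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 10:00 ?        00:00:01 supervisord
root        42     1  0 10:01 ?        00:00:00 exabgp exabgp-psudoswitch1
root        99     0  0 10:05 pts/0    00:00:00 ps -ef
"""

if __name__ == "__main__":
    for command in kill_commands("ptf_example", pids_to_kill(SAMPLE_PS)):
        print(command)
```

Keeping PID 1 out of the list by default is a design choice of this sketch only; whether the real module spares the container's init process is not stated in this conversation.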
@wangxin PR conflicts with 202205 branch
@wangxin PR conflicts with 202012 branch
…ontainer (sonic-net#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":
- The server may crash and run into a CPU soft lockup issue.
- Some exabgp processes cannot be fully stopped, and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf always succeed and, of course, that the server never crashes.

Possible reasons for the server crash:
- Some exabgp processes are still running in the PTF container while we remove the container. This could cause a server crash.
- Some network interfaces are still in the PTF container's network namespace while we remove the container.

How did you do it?
- Added a customized module "ptf_control" to stop and kill processes running in the PTF container in a more aggressive and reliable way.
- Improved the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
- Added a vm_topology "unbind" step to the "testbed-cli.sh restart-ptf" procedure.
- Updated some "ip link" commands to fully comply with the syntax shown in "ip link help".

How did you verify/test it?
- Tested add-topo/remove-topo on both physical and KVM testbeds.
- Tested restart-ptf on a physical testbed.
Cherry-pick PR to 202305: #10521
Cherry-pick #10069 and #10286 to 202205 branch.

* Improve the cleanup of processes and interfaces before stopping PTF container (#10244)

* Avoid running command in exited ptf docker container (#10286)

While stopping the PTF container, the "ptf_control" module is executed to kill all processes in the PTF container. The original code checks whether the PTF container's Pid exists before running commands in the container. Unfortunately, this check is not enough: a PTF docker container in exited status still has a Pid. This change improves the code for getting the PTF container's Pid. When the PTF container is not in "running" status, it always returns None for the PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
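The status check described for #10286 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from that change: the function name `running_container_pid` is hypothetical, and the sketch assumes the caller has already captured the JSON text printed by `docker inspect <container>`, which is a JSON array whose entries carry `State.Status` and `State.Pid`.

```python
import json


def running_container_pid(inspect_json):
    """Return the container's Pid only when its status is "running".

    Hypothetical helper: `inspect_json` is the text printed by
    `docker inspect <container>`. Any other status, or unusable
    input, yields None.
    """
    try:
        data = json.loads(inspect_json)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return None
    if not isinstance(data, list) or not data:
        return None
    state = data[0].get("State", {})
    if state.get("Status") != "running":
        # Do not trust the Pid field unless the container is running.
        return None
    pid = state.get("Pid")
    return pid if isinstance(pid, int) and pid > 0 else None


# Made-up inspect output for a running and an exited container.
RUNNING = '[{"State": {"Status": "running", "Pid": 1234}}]'
EXITED = '[{"State": {"Status": "exited", "Pid": 1234}}]'

if __name__ == "__main__":
    print(running_container_pid(RUNNING))
    print(running_container_pid(EXITED))
```

A caller would then skip every "run a command in the container" step whenever the function returns None, which is the behavior the commit message asks for.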
* Improve the cleanup of processes and interfaces before stopping PTF container (#10069)

* Avoid running command in exited ptf docker container (#10286)

* Install right version of docker.py for ubuntu22.04

Signed-off-by: Xin Wang <[email protected]>
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
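We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":
- The server may crash and run into a CPU soft lockup issue.
- Some exabgp processes cannot be fully stopped, and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf always succeed and, of course, that the server never crashes.
Possible reasons for the server crash:
- Some exabgp processes are still running in the PTF container while we remove the container. This could cause a server crash.
- Some network interfaces are still in the PTF container's network namespace while we remove the container.
How did you do it?
- Added a customized module "ptf_control" to stop and kill processes running in the PTF container in a more aggressive and reliable way.
- Improved the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
- Added a vm_topology "unbind" step to the "testbed-cli.sh restart-ptf" procedure.
- Updated some "ip link" commands to fully comply with the syntax shown in "ip link help".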
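As a rough illustration of removing interfaces from the PTF container's network namespace during "unbind", the sketch below builds commands that move each interface back to the host namespace, using the explicit `dev DEVICE` form from "ip link help". It is only one possible approach and is not taken from the vm_topology module: the function name `unbind_commands`, the `keep` parameter, the sample interface names, and the choice of `nsenter` plus `netns 1` are all assumptions.

```python
def unbind_commands(ctn_pid, ifnames, keep=("lo",)):
    """Build commands that move interfaces out of a container's netns.

    Hypothetical helper: `ctn_pid` is the Pid of a running container
    (None means "not running", so nothing is generated). Each command
    enters the container's network namespace with `nsenter -t PID -n`
    and hands the interface to the namespace of host PID 1.
    """
    if ctn_pid is None:
        return []
    commands = []
    for ifname in ifnames:
        if ifname in keep:
            continue  # leave interfaces such as loopback alone
        commands.append(
            "nsenter -t {} -n ip link set dev {} netns 1".format(ctn_pid, ifname)
        )
    return commands


if __name__ == "__main__":
    # Made-up Pid and interface names, printed rather than executed.
    for command in unbind_commands(1234, ["lo", "eth1", "eth2"]):
        print(command)
```

The sketch only prints command strings; running them would require root privileges on the test server.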
How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on physical testbed.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation