Skip to content

Improve the cleanup of processes and interfaces before stopping PTF container#10069

Merged
wangxin merged 6 commits intosonic-net:masterfrom
wangxin:safe-stop-ptf
Sep 21, 2023
Merged

Improve the cleanup of processes and interfaces before stopping PTF container#10069
wangxin merged 6 commits intosonic-net:masterfrom
wangxin:safe-stop-ptf

Conversation

@wangxin
Copy link
Copy Markdown
Collaborator

@wangxin wangxin commented Sep 20, 2023

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205

Approach

What is the motivation for this PR?

We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

  1. Server may crash and run into CPU softlock issue.
  2. Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
    The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.

Possible reason of server crash:

  1. Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
  2. Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?

  1. Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
  2. Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
  3. Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
  4. Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?

Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on physical testbed.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Copy link
Copy Markdown
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

ansible/roles/vm_set/library/ptf_control.py:113:1: E302 expected 2 blank lines, found 1

flake8...............................................(no files to check)Skipped
check conditional mark sort..............................................Passed

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
    the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

@mssonicbld
Copy link
Copy Markdown
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

ansible/roles/vm_set/library/ptf_control.py:75:121: E501 line too long (123 > 120 characters)

flake8...............................................(no files to check)Skipped
check conditional mark sort..............................................Passed

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
    the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

Copy link
Copy Markdown
Collaborator

@yejianquan yejianquan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

self.cmd('docker exec -t {} bash -c "kill -9 {}"'.format(self.ctn_name, pid), ignore_failure=True)

def kill_processes(self):
self.cmd('docker exec -t {} bash -c "ps -ef"'.format(self.ctn_name))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this command used for? (I have same confusion at line 108)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for debugging purpose. By default, this module can generate logs like /tmp/ptf_control_xxx.log.
With this command, we can clearly see the processes running in PTF docker before and after the killing. Next time if server crash happens again, we may be able to get some clue from the log.

Recently we noticed some tests may create exabgp processes like "exabgp-psudoswitch1" in PTF container. So, a more aggressive way is required to kill the processes. Also, it would be better to collect more information for debugging if the issue happens again in the future.

@mssonicbld
Copy link
Copy Markdown
Collaborator

@wangxin PR conflicts with 202205 branch

@mssonicbld
Copy link
Copy Markdown
Collaborator

@wangxin PR conflicts with 202012 branch

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Oct 30, 2023
…ontainer (sonic-net#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.
@mssonicbld
Copy link
Copy Markdown
Collaborator

Cherry-pick PR to 202305: #10521

wangxin added a commit to wangxin/sonic-mgmt that referenced this pull request Oct 30, 2023
…ontainer (sonic-net#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.
mssonicbld pushed a commit that referenced this pull request Oct 30, 2023
…ontainer (#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.
wangxin added a commit that referenced this pull request Oct 30, 2023
Cherry-pick #10069 and #10286 to 202205 branch.

* Improve the cleanup of processes and interfaces before stopping PTF container (#10244 

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.

* Avoid running command in exited ptf docker container (#10286)

While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
wangxin added a commit to wangxin/sonic-mgmt that referenced this pull request Oct 31, 2023
…ontainer (sonic-net#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.
yejianquan pushed a commit that referenced this pull request Nov 9, 2023
* Improve the cleanup of processes and interfaces before stopping PTF container (#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.

* Avoid running command in exited ptf docker container (#10286)

While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>

* Install right version of docker.py for ubuntu22.04

---------

Signed-off-by: Xin Wang <[email protected]>
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Jan 25, 2024
…ontainer (sonic-net#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants