Skip to content

Avoid running command in exited ptf docker container#10286

Merged
wangxin merged 1 commit intosonic-net:masterfrom
wangxin:fix-kill-ptf-exabgp
Oct 11, 2023
Merged

Avoid running command in exited ptf docker container#10286
wangxin merged 1 commit intosonic-net:masterfrom
wangxin:fix-kill-ptf-exabgp

Conversation

@wangxin
Copy link
Copy Markdown
Collaborator

@wangxin wangxin commented Oct 11, 2023

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205

Approach

What is the motivation for this PR?

While stopping PTF container, "ptf_control" module is executed to kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before running command in the PTF container. Unfortunately, this check is not enough. PTF docker container in exited status still has Pid.

How did you do it?

This change improved the code for getting PTF container's Pid. When PTF container is not in "running" status, always return None for PTF container's Pid.

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
@wangxin wangxin requested a review from yejianquan October 11, 2023 03:34
Copy link
Copy Markdown
Collaborator

@yejianquan yejianquan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mssonicbld
Copy link
Copy Markdown
Collaborator

@wangxin PR conflicts with 202305 branch

wangxin added a commit to wangxin/sonic-mgmt that referenced this pull request Oct 30, 2023
While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
@mssonicbld
Copy link
Copy Markdown
Collaborator

Cherry-pick PR to 202305: #10524

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Oct 30, 2023
While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
wangxin added a commit that referenced this pull request Oct 30, 2023
Cherry-pick #10069 and #10286 to 202205 branch.

* Improve the cleanup of processes and interfaces before stopping PTF container (#10244 

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.

* Avoid running command in exited ptf docker container (#10286)

While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
mssonicbld pushed a commit that referenced this pull request Oct 30, 2023
While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
wangxin added a commit to wangxin/sonic-mgmt that referenced this pull request Oct 31, 2023
While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>
yejianquan pushed a commit that referenced this pull request Nov 9, 2023
* Improve the cleanup of processes and interfaces before stopping PTF container (#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

Server may crash and run into CPU softlock issue.
Some exabgp process cannot be fully stopped and "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf can always be successful. And of course, no server crash.
Possible reason of server crash:

Some exabgp processes are still running in PTF container while we remove the container. This could cause server crash.
Some network interfaces are in the PTF container's network namespace while we remove the container.

How did you do it?
Added a customized module "ptf_control" to stop&kill processes running in PTF container in a more aggressive and reliable way.
Improve the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step in the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully compliant with the syntax in "ip link help".

How did you verify/test it?
Tested the add-topo/remove-topo on both physical and KVM testbed.
Tested restart-ptf on phsycial testbed.

* Avoid running command in exited ptf docker container (#10286)

While stopping PTF container, "ptf_control" module is executed to
kill all processes in the PTF container.
The original code checks if the PTF container's Pid exists before
running command in the PTF container. Unfortunately, this check
is not enough. PTF docker container in exited status still has Pid.

This change improved the code for getting PTF container's Pid.
When PTF container is not in "running" status, always return None
for PTF container's Pid.

Signed-off-by: Xin Wang <[email protected]>

* Install right version of docker.py for ubuntu22.04

---------

Signed-off-by: Xin Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants