[202012] Improve the cleanup of PTF container#10533

Merged
yejianquan merged 3 commits into sonic-net:202012 from wangxin:cherry-pick-stop-ptf-202012
Nov 9, 2023

wangxin commented Oct 31, 2023

Cherry-pick #10069 and #10286 to the 202012 branch.

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

  • The server may crash and run into a CPU soft lockup.
  • Some exabgp processes cannot be fully stopped, so "restart-ptf" may fail.

The expectation is that remove-topo and restart-ptf always succeed and, of course, that the server never crashes.

Possible reasons for the server crash:

  • Some exabgp processes are still running in the PTF container when we remove the container.
  • Some network interfaces are still in the PTF container's network namespace when we remove the container.

How did you do it?
  • Added a custom module "ptf_control" to stop and kill processes running in the PTF container more aggressively and reliably.
  • Improved the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
  • Added a vm_topology "unbind" step to the "testbed-cli.sh restart-ptf" procedure.
  • Updated some "ip link" commands to fully comply with the syntax shown by "ip link help".
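
To illustrate the "stop, then kill" idea behind the first item, here is a minimal Python sketch of a graceful-then-forced shutdown. It is not the code of the "ptf_control" module: the names `pid_alive` and `stop_then_kill`, the grace period, and the polling interval are all assumptions made for this example, and the PIDs are assumed to be visible from wherever the function runs.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_then_kill(pids, grace_seconds=5.0, poll_interval=0.2,
                   send=os.kill, alive=pid_alive, sleep=time.sleep):
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL any survivors.

    Hypothetical helper for illustration. Returns the PIDs that needed SIGKILL.
    """
    pending = [pid for pid in pids if alive(pid)]
    for pid in pending:
        try:
            send(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # exited between the liveness check and the signal

    # Poll until every process is gone or the grace period runs out.
    waited = 0.0
    while waited < grace_seconds:
        pending = [pid for pid in pending if alive(pid)]
        if not pending:
            return []
        sleep(poll_interval)
        waited += poll_interval

    # Escalate: force-kill whatever ignored SIGTERM.
    killed = []
    for pid in pending:
        if not alive(pid):
            continue
        try:
            send(pid, signal.SIGKILL)
            killed.append(pid)
        except ProcessLookupError:
            pass
    return killed
```

The `send`, `alive`, and `sleep` parameters exist only so the escalation logic can be exercised without real processes; a caller would normally leave them at their defaults.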

How did you verify/test it?
  • Tested add-topo/remove-topo on both physical and KVM testbeds.
  • Tested restart-ptf on a physical testbed.

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205
  • 202305

…ontainer (sonic-net#10069)

What is the motivation for this PR?
We still observe issues with "testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf":

The server may crash and run into a CPU soft lockup.
Some exabgp processes cannot be fully stopped, so "restart-ptf" may fail.
The expectation is that remove-topo and restart-ptf always succeed and, of course, that the server never crashes.
Possible reasons for the server crash:

Some exabgp processes are still running in the PTF container when we remove the container.
Some network interfaces are still in the PTF container's network namespace when we remove the container.

How did you do it?
Added a custom module "ptf_control" to stop and kill processes running in the PTF container more aggressively and reliably.
Improved the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
Added a vm_topology "unbind" step to the "testbed-cli.sh restart-ptf" procedure.
Updated some "ip link" commands to fully comply with the syntax shown by "ip link help".

How did you verify/test it?
Tested add-topo/remove-topo on both physical and KVM testbeds.
Tested restart-ptf on a physical testbed.

While stopping the PTF container, the "ptf_control" module is executed
to kill all processes in the container.
The original code checks whether the PTF container's Pid exists before
running commands in the container. Unfortunately, this check is not
enough: a PTF docker container in exited status still has a Pid.

This change improves the code that gets the PTF container's Pid.
When the PTF container is not in "running" status, it always returns
None for the container's Pid.
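
The check described above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the code from the change: the function name `running_container_pid` is hypothetical, and its input is assumed to be the "State" mapping reported by `docker inspect` (with its "Status" and "Pid" fields).

```python
def running_container_pid(state):
    """Return the container's init PID, or None unless it is running.

    `state` is assumed to be the "State" mapping from `docker inspect`.
    The Pid field is ignored for any status other than "running".
    """
    if not state or state.get("Status") != "running":
        return None
    pid = state.get("Pid")
    # Treat a missing or non-positive PID as "no usable process".
    return pid if isinstance(pid, int) and pid > 0 else None
```

A caller would then skip any "run a command in the container" step whenever the function returns None, instead of trusting the Pid field alone.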

Signed-off-by: Xin Wang <[email protected]>
@wangxin wangxin requested review from wsycqyz and yejianquan October 31, 2023 10:25
@yejianquan yejianquan left a comment

LGTM

wangxin commented Nov 8, 2023

/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@yejianquan
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@yejianquan yejianquan merged commit 4473f6d into sonic-net:202012 Nov 9, 2023