
[action] [PR:10069] Improve the cleanup of processes and interfaces before stopping PTF container #10521

Merged
mssonicbld merged 1 commit into sonic-net:202305 from mssonicbld:cherry/202305/10069
Oct 30, 2023

@mssonicbld
Collaborator

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205

Approach

What is the motivation for this PR?

"testbed-cli.sh remove-topo" and "testbed-cli.sh restart-ptf" still fail intermittently:

  1. The server may crash or hit a CPU soft lockup.
  2. Some exabgp processes cannot be fully stopped, so "restart-ptf" may fail.

The expectation is that remove-topo and restart-ptf always succeed and never crash the server.

Possible causes of the server crash:

  1. Some exabgp processes are still running in the PTF container when the container is removed.
  2. Some network interfaces are still in the PTF container's network namespace when the container is removed.
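
To check whether the second condition holds on a given server, one could list the interfaces that are still visible inside the container before removing it. The following is a minimal sketch, not code from this PR: the helper names are hypothetical, it assumes the `ip` tool is available inside the container, and the set of interfaces that are expected to remain (here only `lo`) is an assumption.

```python
import subprocess


def parse_link_names(ip_output):
    """Extract interface names from the output of `ip -o link show`."""
    names = []
    for line in ip_output.splitlines():
        # Each one-line record looks like "7: eth0@if8: <FLAGS> mtu ...".
        parts = line.split(": ", 2)
        if len(parts) < 2:
            continue
        # veth peers are printed as "eth0@if8"; keep only the part before "@".
        names.append(parts[1].split("@", 1)[0])
    return names


def leftover_interfaces(ip_output, ignore=("lo",)):
    """Return interfaces that are present but not in the ignore list.

    The default ignore list is an assumption; adjust it to whatever
    the container is expected to keep.
    """
    return [name for name in parse_link_names(ip_output) if name not in ignore]


def list_container_links(container):
    """Run `ip -o link show` inside a container and return its stdout.

    `container` is a hypothetical container name supplied by the caller.
    """
    result = subprocess.run(
        ["docker", "exec", container, "ip", "-o", "link", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A call such as `leftover_interfaces(list_container_links(name))` would then report what is still attached to the namespace; an empty list suggests the interfaces were already moved out or deleted.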

How did you do it?

  1. Added a custom module "ptf_control" to stop and kill processes running in the PTF container more aggressively and reliably.
  2. Improved the vm_topology module to remove network interfaces from the PTF container in the "unbind" procedure.
  3. Added a vm_topology "unbind" step to the "testbed-cli.sh restart-ptf" procedure.
  4. Updated some "ip link" commands to be fully compliant with the syntax shown in "ip link help".
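
The "stop, then kill" idea in step 1 can be illustrated with a generic escalation loop: send SIGTERM, wait, and fall back to SIGKILL for anything that survives. This is only a sketch of the general technique and is not the actual `ptf_control` implementation; the function names are hypothetical, and it assumes `pgrep` and `pkill` are installed in the container.

```python
import subprocess
import time


def docker_exec(container, argv):
    """Run argv inside the container and return its exit code."""
    completed = subprocess.run(
        ["docker", "exec", container] + list(argv),
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def stop_then_kill(pattern, run, retries=3, delay=1.0, sleep=time.sleep):
    """Ask matching processes to exit, then force-kill any survivors.

    `run` takes an argv list and returns an exit code, so the same logic
    works on the host or through `docker exec`.
    Returns True when no matching process remains.
    """
    def alive():
        # pgrep exits with 0 when at least one process matches.
        return run(["pgrep", "-f", pattern]) == 0

    for signal_name in ("TERM", "KILL"):
        if not alive():
            return True
        run(["pkill", "-" + signal_name, "-f", pattern])
        # Give the processes a few chances to disappear before escalating.
        for _ in range(retries):
            sleep(delay)
            if not alive():
                return True
    return not alive()
```

For example, `stop_then_kill("exabgp", lambda argv: docker_exec("ptf_example", argv))` would target exabgp processes in a container named `ptf_example` (a made-up name). Passing the runner as a parameter keeps the escalation logic testable without a live container.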

How did you verify/test it?

Tested add-topo/remove-topo on both physical and KVM testbeds.
Tested restart-ptf on a physical testbed.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator Author

Original PR: #10069

@mssonicbld mssonicbld merged commit 110257c into sonic-net:202305 Oct 30, 2023
@mssonicbld mssonicbld deleted the cherry/202305/10069 branch February 4, 2024 09:12