Skip to content

[pfcwd] add an option to support C code send_packets function in pfc_gen.py#14540

Closed
lipxu wants to merge 10 commits intosonic-net:masterfrom
lipxu:20240912_publicMaster_pfcwd_C
Closed

[pfcwd] add an option to support C code send_packets function in pfc_gen.py#14540
lipxu wants to merge 10 commits intosonic-net:masterfrom
lipxu:20240912_publicMaster_pfcwd_C

Conversation

@lipxu
Copy link
Contributor

@lipxu lipxu commented Sep 12, 2024

Description of PR

Summary:
Fixes # (issue)
27262192

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

Currently, the test uses the python script to send the PFC frames, but the performance does not meet the requirements sometimes fails to trigger the PFC storm immediately.

How did you do it?

Add an option to support using C code send_packets function in pfc_gen.py

How did you verify/test it?

https://elastictest.org/scheduler/testplan/66e12cd41b0209cf5f97a7ad
https://elastictest.org/scheduler/testplan/66e11e9ef90c6cd947f4ea3d

Any platform specific information?

Arista fanout device

Supported testbed topology if it's a new test case?

Documentation

@bingwang-ms
Copy link
Collaborator

Thanks for the improvement!
I believe the change can improve the performance of pfc_gen script. There can be another factor that leads to unstable test result of pfc_wd, that is the CPU scheduler.
If the process can't be isolated from kernel scheduler, then it will be schedule out sometimes, resulting in PFC pause not being sent out. I haven't figured out a solution for this issue as some fundamental library is missing from EOS.

@lipxu
Copy link
Contributor Author

lipxu commented Sep 13, 2024

Thanks for the improvement! I believe the change can improve the performance of pfc_gen script. There can be another factor that leads to unstable test result of pfc_wd, that is the CPU scheduler. If the process can't be isolated from kernel scheduler, then it will be schedule out sometimes, resulting in PFC pause not being sent out. I haven't figured out a solution for this issue as some fundamental library is missing from EOS.

Thanks for review @bingwang-ms , Yes, you are correct, based on the lua script log, the root cause should be the PFC frames not evenly then caused on2off value flaps and not trigger the PFC storm immediately. This should be caused by poor performance of fanout, we have no finial solution so far. I tried several workaround to reduce the failure rate, this one should be helpful, based on 7260 (which failed many times in nightly test) test result, we can see the following log with this PR, the detect time is much better than previous.
sorted all_detect_time [446, 452, 517, 543, 566, 572, 574, 620, 628, 663, 669, 731, 745, 750, 771, 1178, 1366, 1968, 2090]

@bingwang-ms
Copy link
Collaborator

The code logic LGTM. Wondering can we reuse the C code in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/common/helpers/pfc_gen_t2.py ? The code is leveraging sendmsg to send packet, which should be more efficient.

bingwang-ms
bingwang-ms previously approved these changes Sep 13, 2024
@lipxu
Copy link
Contributor Author

lipxu commented Sep 14, 2024

The code logic LGTM. Wondering can we reuse the C code in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/common/helpers/pfc_gen_t2.py ? The code is leveraging sendmsg to send packet, which should be more efficient.

Thanks for your suggestion @bingwang-ms , I will check and leverage this code if possible, meanwhile I created another PR #14589 to add half of polling time as compensation for real detect time to reduce the failure rate, thanks.

@lipxu
Copy link
Contributor Author

lipxu commented Sep 19, 2024

The code logic LGTM. Wondering can we reuse the C code in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/common/helpers/pfc_gen_t2.py ? The code is leveraging sendmsg to send packet, which should be more efficient.

Thanks for your suggestion @bingwang-ms , I will check and leverage this code if possible, meanwhile I created another PR #14589 to add half of polling time as compensation for real detect time to reduce the failure rate, thanks.

@bingwang-ms Modify the function and use _sendmmsg for sending packages, Please review, thanks.
The test results on 7260 should be OK.
https://elastictest.org/scheduler/testplan/66eac165a4dd840e43c5b97c
https://elastictest.org/scheduler/testplan/66eac1260e2a7168ba84a218

@lipxu lipxu closed this Mar 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants