Fix Issue with Checking for Active ACL Rules by aclshow -a (#21947)
Conversation
/azp run

Azure Pipelines failed to run 1 pipeline(s).

Force-pushed 44ef981 to 2168b67.

/azp run

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

/azp run

Azure Pipelines successfully started running 1 pipeline(s).
```python
res = duthost.shell("aclshow -a")['stdout_lines']
if len(res) <= 2 or [line for line in res if 'N/A' in line]:
    res = duthost.shell("show acl rule")['stdout_lines']
```
The CLI `show acl rule` returns all ACL rules, so it may not reflect the status of the specific rule we are trying to check.
Maybe we can read the rule status from STATE_DB directly?
This verifies that all rules are in the active state. If any become inactive due to loading a new ACL rule, that is something we should flag.
Even before this change, the check verified that all counters were not N/A. But if we want to alter the workflow to only check the rules we added, we can definitely do that.
I see. That makes sense.
But we also need to exclude control-plane ACLs; for control-plane ACLs, the status is always N/A.
The suggestion is to only check ACL rule status for one specific table, with the CLI `show acl rule TABLE_NAME`.
Updated this check to take the table as an input. It now only checks rules for the provided table.
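The revised per-table check could look roughly like the sketch below. This is a hypothetical reconstruction, not the PR's actual code: the helper name `all_rules_active`, the sample output, and the assumption that `show acl rule <table>` prints two header lines before the rule rows are all illustrative.

```python
def all_rules_active(stdout_lines):
    """Return True iff every rule row in `show acl rule <table>` output
    reports Active status. Assumes two header lines precede the rules
    and that the status is the last column."""
    rules = [line for line in stdout_lines[2:] if line.strip()]
    return bool(rules) and all(line.split()[-1] == 'Active' for line in rules)

# Illustrative output shape (column layout is an assumption):
sample = [
    "Table     Rule    Priority  Action  Match             Status",
    "--------  ------  --------  ------  ----------------  ------",
    "EVERFLOW  RULE_1  9999      MIRROR  SRC_IP: 10.0.0.1  Active",
    "EVERFLOW  RULE_2  9998      MIRROR  DST_IP: 10.0.0.2  Active",
]
print(all_rules_active(sample))  # True
```

Comparing the last whitespace-separated token to the exact string `Active` (rather than a substring test) avoids a false positive on a hypothetical `Inactive` status, and an empty table is treated as a failure.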
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Head branch was pushed to by a user without write access.

Force-pushed c251d4d to 8b51228.

/azp run

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
This PR adds test cases and supporting utilities to measure route programming time under high-scale BGP IPv6 scenarios, building on the refactoring in PR sonic-net#21335. It focuses on quantifying how long it takes for routes to be fully programmed after BGP/connection events (e.g., convergence, admin flaps), and on verifying that the measured route-programming times stay within expected limits and close to the convergence time.

Signed-off-by: Priyansh Tratiya <[email protected]>
Signed-off-by: Andoni Sanguesa <[email protected]>
Signed-off-by: vikumarks <[email protected]>
Summary: Health check sometimes loads the wrong inventory admin/password.

Fixes # (issue) 36307349

From investigating, I can see that this issue happens intermittently. Diving deeper, it is heavily dependent on how Ansible processes and uses memory internally. It only happens when there are two fanout hosts, one using SONiC and one using non-SONiC.

In a happy scenario, comparing the fanouthost.vm.extra_vars of the two fanouts, we can see that they have different memory addresses:
memory id 140619746693120 host XXXX <----- DIFFERENT ID HERE
2026-01-08 11:31:14,402 testbed_health_check.py#185 INFO - {'hostname': 'XXXX', 'reachable': True, 'failed': True, 'module_stdout': '', 'module_stderr': '/bin/sh: /usr/bin/python3: No such file or directory\n', 'msg': 'The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error', 'rc': 127, 'ansible_facts': {'discovered_interpreter_python': '/usr/bin/python3'}, '_ansible_no_log': False, 'changed': False}
memory id 140619740737472 host YYYY <----- DIFFERENT ID HERE
2026-01-08 11:31:15,404 testbed_health_check.py#185 INFO - {'hostname': 'YYYY', 'reachable': True, 'failed': False, 'ping': 'pong', 'invocation': {'module_args': {'data': 'pong'}}, 'ansible_facts': {'discovered_interpreter_python': '/usr/bin/python3.9'}, '_ansible_no_log': False, 'changed': False}
In some scenarios, however, if Ansible decides to re-use the memory address when initialising its VariableManager, the issue occurs:
memory id 139728659566400 host XXXX <---- SAME ID HERE
2026-01-08 11:31:43,750 testbed_health_check.py#185 INFO - {'hostname': 'XXXX', 'reachable': True, 'failed': False, 'ping': 'pong', 'invocation': {'module_args': {'data': 'pong'}}, 'ansible_facts': {'discovered_interpreter_python': '/usr/bin/python3.9'}, '_ansible_no_log': False, 'changed': False}
memory id 139728659566400 host YYYY <---- SAME ID HERE
2026-01-08 11:31:44,384 testbed_health_check.py#185 INFO - {'hostname': 'YYYY', 'reachable': False, 'failed': True, 'unreachable': True, 'msg': "Invalid/incorrect password: Warning: Permanently added '10.150.22.30' (ED25519) to the list of known hosts.\r\nNOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE\n\nUnauthorized access and/or use prohibited. All access and/or use subject to monitoring.\n\nNOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE\nPermission denied, please try again.", 'changed': False}
Since we're overwriting ansible_ssh_user and ansible_ssh_password in the extra_vars:

fanouthost.vm.extra_vars.update({"ansible_ssh_user": fanout_sonic_user, "ansible_ssh_password": fanout_sonic_password})

if the two extra_vars objects are the same dict, this overwrites ansible_ssh_user and ansible_ssh_password for the other fanout as well. Everything in extra_vars takes top priority over inventory-defined variables, so the wrong username and password end up being used.
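The failure mode described above is ordinary Python dict aliasing. A minimal sketch of the pitfall, with hypothetical names (`FakeHost`, `shared`); this is not the actual testbed code:

```python
# If two hosts are handed the same extra_vars dict object, updating
# one host's credentials silently mutates the other host's as well.
class FakeHost:
    def __init__(self, extra_vars):
        self.extra_vars = extra_vars

shared = {"ansible_ssh_user": "inventory_user"}
sonic_fanout = FakeHost(shared)
eos_fanout = FakeHost(shared)   # same object, as when memory is re-used

sonic_fanout.extra_vars.update({"ansible_ssh_user": "sonic_user",
                                "ansible_ssh_password": "sonic_pass"})

# The non-SONiC fanout now sees the SONiC credentials too,
# because both names alias the one dict.
print(eos_fanout.extra_vars["ansible_ssh_user"])  # sonic_user
```

A defensive fix is to give each host an independent copy before overwriting anything, e.g. `fanouthost.vm.extra_vars = dict(fanouthost.vm.extra_vars)` (or `copy.deepcopy` if the values are nested).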
Signed-off-by: Austin Pham <[email protected]>
Signed-off-by: Andoni Sanguesa <[email protected]>
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

/azp run

Azure Pipelines successfully started running 1 pipeline(s).
hi @AndoniSanguesa, do you mind helping to bring this change to 202412?
Cherry-pick PR to 202511: #22226
Description of PR
Some platforms appear to treat everflow ACLs differently from typical dataplane ACLs, in that counters in `aclshow -a` will always appear as N/A before and after traffic, even if the rules are correctly formed. The previous version of the IPv6 Everflow test used the results of `aclshow -a` to determine whether ACLs were configured and ready, but on some Arista devices the N/A behaviour caused the test to fail at that check. This PR simply changes the logic to check that all rules are 'active' in `show acl rule`.

Type of change
Approach
What is the motivation for this PR?
We saw unexpected test failures after the previous version of the test was deployed.
How did you do it?
Updated the failing check to use a source of truth that is reliable across platforms.
How did you verify/test it?
Ran it against a known working platform (sn4600) and verified it against platforms known to have the N/A issue (Arista 7060x6 and 7050cx3).
Documentation
acl_counters_fix_7060x6.txt
acl_counters_fix_sn2700.txt