[ansible] Fix Ansible extra_vars cache pollution causing test failures on EOS fanout testbeds #23339
Merged
Xichen96 merged 1 commit into sonic-net:master on Mar 28, 2026
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
b1e77b0 to 09260ca
Ansible-core 2.15 changed load_extra_vars() to return a cached shared dict. Device classes (EosHost, SonicHost, etc.) mutate this dict via .extra_vars.update() to inject per-device credentials, which now pollutes the cache and causes subsequent connections to use wrong credentials. This breaks 7+ test suites on testbeds with EOS fanouts.

Patch load_extra_vars in base.py to return a copy of the cached dict so each VariableManager gets an independent copy.

Signed-off-by: Xichen Lin <lukelin0907@gmail.com>
09260ca to bb2ecd0
bingwang-ms approved these changes Mar 27, 2026
Collaborator
Please verify that the test plan still runs normally, since some of our variables depend on these leaked variables. Thanks.
Description of PR
Summary:
Fix Ansible `extra_vars` cache pollution caused by device classes (EosHost, SonicHost, etc.) mutating a shared cached dict returned by `load_extra_vars()`. This causes widespread test failures on testbeds with EOS or mixed fanout devices after the `docker-sonic-mgmt` ansible upgrade from 6.7.0 to 11.10.0.

Ansible-core 2.15 changed `load_extra_vars()` to cache its result and return the same dict on every call. Our device classes call `.extra_vars.update()` to inject per-device connection credentials, which worked fine before 2.15 (each call got a fresh dict) but now pollutes a shared cache. Once polluted, all subsequent Ansible connections inherit wrong credentials, e.g., EOS fanout credentials instead of DUT credentials.

The fix monkey-patches `load_extra_vars` in `base.py` to return a copy of the cached dict, so each `VariableManager` gets its own independent copy.
Fixes # (issue)
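The copy-on-return idea in the summary can be illustrated with plain dicts. This is a minimal sketch with hypothetical keys and values (`ansible_user`, `tags`, and the credentials are made up for illustration, not taken from the test framework). It shows that `dict(...)` makes a shallow copy: top-level `.update()` calls stay local to the copy, while nested mutable values would still be shared with the cache.

```python
# Hypothetical cached extra_vars; keys and values are illustrative only.
cached = {"ansible_user": "dut_admin", "tags": ["dut"]}

# Shallow copy, the same kind of copy dict(original_result) produces.
per_device = dict(cached)

# A top-level update only touches the copy, not the cached dict.
per_device.update({"ansible_user": "fanout_admin"})

# A nested list is still the same object in both dicts, so this leaks back.
per_device["tags"].append("fanout")
```

For the top-level `.update()` calls described in this PR a shallow copy is sufficient; the nested case is shown only to make the boundary of the technique visible.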
Type of change
Back port request
Approach
What is the motivation for this PR?
7 test suites fail consistently on 720dt nightly runs (and any testbed with EOS fanout devices): `bgp_stress_link_flap`, `fdb`, `fdb_flush`, `link_flap`, `iface_namingmode`, `static_dns`, `dns_resolv_conf`. Also reproduced on 8220 console switch testbed with `show: not found` errors.

Root cause: ansible-core 2.15 (commit `95236c5`) added a caching optimization to `load_extra_vars()` that returns the same dict every time. Our device classes (8 files in `tests/common/devices/`) call `variable_manager.extra_vars.update(evars)` which mutates this shared dict. After any fanout device is initialized, the cache contains fanout credentials that override DUT credentials for all subsequent connections.
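The root cause above can be reproduced without Ansible. The sketch below is a simplified model, not the ansible-core implementation: `load_extra_vars`, `FakeVariableManager`, and the credential values are hypothetical stand-ins that only mimic the "build once, return the same dict" behaviour described here.

```python
_cached_extra_vars = None  # module-level cache, standing in for the one in ansible-core


def load_extra_vars():
    """Hypothetical simplified loader: builds the dict once, then returns the same object."""
    global _cached_extra_vars
    if _cached_extra_vars is None:
        _cached_extra_vars = {"ansible_user": "dut_admin"}
    return _cached_extra_vars


class FakeVariableManager:
    """Hypothetical stand-in for VariableManager; keeps whatever the loader returns."""

    def __init__(self):
        self.extra_vars = load_extra_vars()


# A fanout device injects its own credentials via update(), as the device classes do.
fanout_vm = FakeVariableManager()
fanout_vm.extra_vars.update({"ansible_user": "fanout_admin"})

# A later DUT connection builds a fresh manager but receives the already mutated dict.
dut_vm = FakeVariableManager()
polluted_user = dut_vm.extra_vars["ansible_user"]
```

In this model both managers hold the same dict object, so the DUT manager ends up with the fanout user even though it never called `update()` itself.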
How did you do it?
Added a monkey-patch at import time in `tests/common/devices/base.py` that wraps `ansible.vars.manager.load_extra_vars` to return `dict(original_result)`, a copy instead of the shared reference. Each `VariableManager` now gets its own dict; `.update()` calls only affect that copy.

The patch targets `ansible.vars.manager.load_extra_vars` (the import site) rather than `ansible.utils.vars.load_extra_vars` (the definition site), because Python's `from X import Y` creates a direct reference that isn't affected by patching the original module.

How did you verify/test it?
`testbed-bjw2-can-720dt-3`: `iface_namingmode` went from 62 errors to 18 passed/44 skipped; `bgp_stress_link_flap` 4 passed; `fdb_flush` 4 passed
`testbed-bjw3-can-8220-1` (c0 topology, Cisco-8220): `show: not found` error eliminated

Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation