[ansible] Fix Ansible extra_vars cache pollution causing test failures on EOS fanout testbeds#23339

Merged
Xichen96 merged 1 commit into sonic-net:master from Xichen96:dev/xichenlin/fix-ansible on Mar 28, 2026

Conversation

@Xichen96
Contributor

@Xichen96 Xichen96 commented Mar 26, 2026

Description of PR

Summary:
Fix Ansible extra_vars cache pollution caused by device classes (EosHost, SonicHost, etc.) mutating a shared cached dict returned by load_extra_vars(). This causes widespread test failures on testbeds with EOS or mixed fanout devices after the docker-sonic-mgmt ansible upgrade from 6.7.0 to 11.10.0.

Ansible-core 2.15 changed load_extra_vars() to cache its result and return the same dict on every call. Our device classes call .extra_vars.update() to inject per-device connection credentials, which worked fine before 2.15 (each call got a fresh dict) but now pollutes a shared cache. Once polluted, all subsequent Ansible connections inherit wrong credentials — e.g., EOS fanout credentials instead of DUT credentials.

The fix monkey-patches load_extra_vars in base.py to return a copy of the cached dict, so each VariableManager gets its own independent copy.

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Seven test suites fail consistently on 720dt nightly runs (and on any testbed with EOS fanout devices): bgp_stress_link_flap, fdb, fdb_flush, link_flap, iface_namingmode, static_dns, dns_resolv_conf. The failure was also reproduced on the 8220 console switch testbed, where it surfaces as "show: not found" errors.

Root cause: ansible-core 2.15 (commit 95236c5) added a caching optimization to load_extra_vars() that returns the same dict every time. Our device classes (8 files in tests/common/devices/) call variable_manager.extra_vars.update(evars), which mutates this shared dict. After any fanout device is initialized, the cache contains fanout credentials that override DUT credentials for all subsequent connections.

How did you do it?

Added a monkey-patch at import time in tests/common/devices/base.py that wraps ansible.vars.manager.load_extra_vars to return dict(original_result) — a copy instead of the shared reference. Each VariableManager now gets its own dict; .update() calls only affect that copy.

The patch targets ansible.vars.manager.load_extra_vars (the import site) rather than ansible.utils.vars.load_extra_vars (the definition site), because Python's from X import Y creates a direct reference that isn't affected by patching the original module.
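The import-site point can be illustrated with two throwaway modules built in memory. The module names `definition_site` and `import_site` and the function names are hypothetical and stand in for ansible.utils.vars and ansible.vars.manager respectively; this is a sketch of the Python name-binding behaviour, not of Ansible itself.

```python
import sys
import types

# Module that defines the function (plays the role of the definition site).
definition = types.ModuleType("definition_site")
exec("def load():\n    return 'original'\n", definition.__dict__)
sys.modules["definition_site"] = definition

# Module that does `from definition_site import load` (the import site).
consumer = types.ModuleType("import_site")
exec(
    "from definition_site import load\n"
    "def run():\n"
    "    return load()\n",
    consumer.__dict__,
)


def patched():
    return "patched"


# Patching the definition site does not change the name already bound
# inside the consumer module.
definition.load = patched
after_definition_patch = consumer.run()

# Patching the import site rebinds the name that run() actually looks up.
consumer.load = patched
after_import_site_patch = consumer.run()

print(after_definition_patch, after_import_site_patch)
```

The first call still reaches the original function because `from X import Y` copied the reference into the consumer's namespace at import time; only rebinding the name in that namespace takes effect.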

How did you verify/test it?

  • Isolated test runs on testbed-bjw2-can-720dt-3: iface_namingmode went from 62 errors to 18 passed/44 skipped; bgp_stress_link_flap 4 passed; fdb_flush 4 passed
  • Full nightly pipeline: unpatched had 9 failing suites, patched resolved 7 of them (597 tests passed)
  • Independently verified on testbed-bjw3-can-8220-1 (c0 topology, Cisco-8220): show: not found error eliminated

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Xichen96 Xichen96 force-pushed the dev/xichenlin/fix-ansible branch from b1e77b0 to 09260ca on March 26, 2026 05:33
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

 Ansible-core 2.15 changed load_extra_vars() to return a cached shared
 dict. Device classes (EosHost, SonicHost, etc.) mutate this dict via
 .extra_vars.update() to inject per-device credentials, which now
 pollutes the cache and causes subsequent connections to use wrong
 credentials. This breaks 7+ test suites on testbeds with EOS fanouts.

 Patch load_extra_vars in base.py to return a copy of the cached dict
 so each VariableManager gets an independent copy.

Signed-off-by: Xichen Lin <lukelin0907@gmail.com>
@Xichen96 Xichen96 changed the title from "[ansible] bug fix" to "[ansible] Fix Ansible extra_vars cache pollution causing test failures on EOS fanout testbeds" on Mar 27, 2026
@Xichen96 Xichen96 force-pushed the dev/xichenlin/fix-ansible branch from 09260ca to bb2ecd0 on March 27, 2026 11:23
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Xichen96 Xichen96 added Request for 202505 branch Request for 202511 branch Request to backport a change to 202511 branch labels Mar 27, 2026
@bingwang-ms
Collaborator

Please verify that the test plan can still be executed normally, since we do have some variables that depend on these leaked variables. Thanks

@Xichen96 Xichen96 merged commit ed6658f into sonic-net:master Mar 28, 2026
20 of 21 checks passed
Labels

Request for 202505 branch Request for 202511 branch Request to backport a change to 202511 branch

3 participants