
[sonic-mgmt][dualtor-aa] Fix fdb/test_fdb_mac_learning.py failures #15675

Merged
StormLiangMS merged 4 commits into sonic-net:master from vkjammala-arista:fix-test-fdb-mac-learning
Nov 28, 2024

@vkjammala-arista vkjammala-arista commented Nov 21, 2024

Description of PR

Summary: [dualtor-aa] Fix "fdb/test_fdb_mac_learning.py" failures
Fixes # https://github.com/aristanetworks/sonic-qual.msft/issues/329

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

The test currently fails on dualtor-aa topologies for three reasons:

  1. Because the topology is active-active, packets sometimes go to the unselected DUT, which leads to MAC learning failure.

  2. After the interfaces are brought up from the shutdown state, the test calls time.sleep for 30 seconds, which does not seem to be enough for the muxcable status on the duthost to become consistent with the mux server status (SERVER_STATUS is shown as unknown below). We need to wait for SERVER_STATUS to match the STATUS field before MAC learning can happen.

PORT       STATUS    SERVER_STATUS    HEALTH     HWSTATUS      LAST_SWITCHOVER_TIME
---------  --------  ---------------  ---------  ------------  ----------------------
Ethernet0  active    unknown          unhealthy  inconsistent

  3. Because the test brings down all the interfaces (including portchannels), the error "ERR swss#tunnel_packet_handler.py: All portchannels failed to come up within 3 minutes, exiting." is logged during the test and causes a test failure (log_analyzer flags it).
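
The consistency condition described above (SERVER_STATUS matching STATUS) can be sketched as a small parser over the table output. This is only an illustration that assumes the column layout shown above; `parse_mux_status` and `is_mux_consistent` are hypothetical names, not functions from sonic-mgmt.

```python
def parse_mux_status(output):
    """Parse the table text into {port: {"STATUS": ..., "SERVER_STATUS": ...}}."""
    rows = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(fields) < 3 or fields[0] == "PORT" or set(fields[0]) == {"-"}:
            continue
        rows[fields[0]] = {"STATUS": fields[1], "SERVER_STATUS": fields[2]}
    return rows


def is_mux_consistent(output, ports):
    """True only if every requested port is listed and STATUS == SERVER_STATUS."""
    rows = parse_mux_status(output)
    return all(
        port in rows and rows[port]["STATUS"] == rows[port]["SERVER_STATUS"]
        for port in ports
    )


# Sample copied from the output quoted in this description.
SAMPLE = """\
PORT       STATUS    SERVER_STATUS    HEALTH     HWSTATUS      LAST_SWITCHOVER_TIME
---------  --------  ---------------  ---------  ------------  ----------------------
Ethernet0  active    unknown          unhealthy  inconsistent
"""

print(is_mux_consistent(SAMPLE, ["Ethernet0"]))  # False: active != unknown
```

A port that is missing from the output is treated as inconsistent, which is the conservative choice when the result gates traffic generation.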

How did you do it?

  1. Add a fixture to set up the topology in active-standby mode. This is needed to make sure packets go to the selected DUT (so that MAC learning happens correctly).
  2. Introduce logic to wait for the mux status to become consistent before sending traffic (instead of relying on a time.sleep delay).
  3. Ignore the "All port channels failed to come up ..." syslog message, which seems to be expected because the test brings down all the portchannels.
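
The second step replaces a fixed delay with polling. A minimal sketch of that pattern follows. `poll_until` is a hypothetical stand-in written for this example; the test itself uses the `wait_until` helper visible in the reviewed snippet, and the argument order here (timeout, interval, initial delay, condition) is assumed to mirror that call.

```python
import time


def poll_until(timeout, interval, delay, condition, *args, **kwargs):
    """Return True as soon as condition(*args, **kwargs) is truthy, False on timeout."""
    if delay > 0:
        time.sleep(delay)
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args, **kwargs):
            return True
        # Give up if another full interval would overrun the deadline.
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)


# Usage example with a fake condition that becomes true on the third check.
calls = {"count": 0}


def ready_after_three_checks():
    calls["count"] += 1
    return calls["count"] >= 3


print(poll_until(1.0, 0.01, 0, ready_after_three_checks))  # True
```

Compared with a fixed sleep, polling returns as soon as the condition holds and fails explicitly when it never does, which is what lets the test assert on the result.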

How did you verify/test it?

Stressed the test on the Arista-7260CX3-D108C8 platform with dualtor-aa[-56] deployed; the test passes.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

1) Add a fixture to set up the topology in active-standby mode. This is
   needed to make sure packets go to the selected DUT (for MAC learning
   to happen correctly).
2) Introduce logic to wait for the mux status to become consistent
   before sending traffic (instead of relying on a time.sleep delay).
3) Ignore the "...All port channels failed to come up within 3 minutes"
   syslog, as the test brings down the portchannels and restores them
   at the end.
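
The third item amounts to filtering one expected error out of the log check. The sketch below illustrates the idea with plain regular expressions. The pattern is built from the message quoted in the PR description, the sample log lines other than that message are made up, and `unexpected_errors` is a hypothetical helper; none of this is the actual log analyzer API.

```python
import re

# Illustrative pattern built from the message quoted in the PR description;
# the exact expression used by the test may differ.
IGNORE_PATTERNS = [
    re.compile(
        r"ERR swss#tunnel_packet_handler\.py: "
        r"All portchannels failed to come up within 3 minutes, exiting"
    ),
]


def unexpected_errors(log_lines, ignore_patterns=IGNORE_PATTERNS):
    """Return ERR lines that match none of the ignore patterns."""
    return [
        line
        for line in log_lines
        if "ERR" in line.split()
        and not any(pattern.search(line) for pattern in ignore_patterns)
    ]


# The first line is the message from the description; the others are hypothetical.
lines = [
    "ERR swss#tunnel_packet_handler.py: All portchannels failed to come up "
    "within 3 minutes, exiting.",
    "ERR swss#orchagent: hypothetical unrelated failure",
    "INFO swss#orchagent: hypothetical informational message",
]

print(unexpected_errors(lines))
```

Only the unrelated ERR line is reported, so a genuine failure would still be caught while the expected portchannel message is tolerated.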

@mssonicbld (Collaborator)

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

tests/fdb/test_fdb_mac_learning.py:17:1: E302 expected 2 blank lines, found 1
tests/fdb/test_fdb_mac_learning.py:29:1: E302 expected 2 blank lines, found 1
tests/fdb/test_fdb_mac_learning.py:195:43: E225 missing whitespace around operator
tests/fdb/test_fdb_mac_learning.py:235:121: E501 line too long (128 > 120 characters)

flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped

To run the pre-commit checks locally, follow the steps below:

  1. Ensure that the default python is python3. In the sonic-mgmt docker container, the default python is python2. You can run
    the check by activating the python3 virtual environment in the sonic-mgmt docker container or outside of the sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
    sudo pip install pre-commit
  3. Go to the repository root folder.
  4. Install the pre-commit hooks:
    pre-commit install
  5. Use pre-commit to check staged files:
    pre-commit
  6. Alternatively, you can check committed files using:
    pre-commit run --from-ref <commit_id> --to-ref <commit_id>

time.sleep(30)
target_ports = [target_ports_to_ptf_mapping[0][0]]
duthost.shell("sudo config interface startup {}".format(target_ports[0]))
pytest_assert(wait_until(150, 5, 0, self.check_mux_status_consistency, duthost, target_ports))

@lolyu lolyu Nov 22, 2024

Do we need to check whether this is a dualtor testbed first? What if this is a t0?

Contributor Author

Thanks @lolyu for catching this. Yes, for t0 the mux status is irrelevant (muxcable is specific to dualtor). I will update the check_mux_status_consistency method to handle this case.

Contributor Author

Hi @lolyu, I have updated the fix to take care of non-dualtor topologies. Please review.

Muxcable is irrelevant for non-dualtor topologies, so add a condition
that checks mux status consistency only on dualtor; otherwise, add a
delay using time.sleep (which is the existing behavior).
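
The branching described in this commit message can be sketched as follows. The numbers (150-second timeout, 5-second interval, 30-second fallback delay) come from the snippet and description in this PR; `wait_for_ports_ready`, its parameters, and the injectable `sleep` argument are hypothetical and exist only to keep the example self-contained.

```python
import time


def wait_for_ports_ready(is_dualtor, check_consistency, timeout=150, interval=5,
                         fallback_delay=30, sleep=time.sleep):
    """Poll mux consistency on dualtor; fall back to a fixed delay elsewhere."""
    if not is_dualtor:
        # Muxcable does not exist on non-dualtor topologies, so just wait.
        sleep(fallback_delay)
        return True
    waited = 0
    while True:
        if check_consistency():
            return True
        if waited + interval > timeout:
            return False
        sleep(interval)
        waited += interval


# Non-dualtor path, using a recording stub instead of a real sleep.
slept = []
print(wait_for_ports_ready(False, lambda: False, sleep=slept.append))  # True
print(slept)  # [30]
```

Passing the sleep function as a parameter is a choice made for this sketch so that the logic can be exercised without real delays.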

For active-active dualtor, the NIC simulator doesn't install OVS flows
for downlink ports until the link status becomes consistent, which seems
to happen only if upstream connectivity is restored.

@StormLiangMS StormLiangMS left a comment

LGTM

@StormLiangMS StormLiangMS merged commit 017cad2 into sonic-net:master Nov 28, 2024
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Nov 28, 2024
…onic-net#15675)

* [sonic-mgmt][dualtor-aa] Fix fdb/test_fdb_mac_learning.py failures

1) Add a fixture to set up the topology in active-standby mode. This is
   needed to make sure packets go to the selected DUT (for MAC learning
   to happen correctly).
2) Introduce logic to wait for the mux status to become consistent
   before sending traffic (instead of relying on a time.sleep delay).
3) Ignore the "...All port channels failed to come up within 3 minutes"
   syslog, as the test brings down the portchannels and restores them
   at the end.

* Fix pre-commit check failures.

* Update fix to handle non-dualtor case.

Muxcable is irrelevant for non-dualtor topologies, so add a condition
that checks mux status consistency only on dualtor; otherwise, add a
delay using time.sleep (which is the existing behavior).

* [dualtor-aa] Bring up upstream connectivity for MAC learning to happen

For active-active dualtor, the NIC simulator doesn't install OVS flows
for downlink ports until the link status becomes consistent, which seems
to happen only if upstream connectivity is restored.

@mssonicbld (Collaborator)

Cherry-pick PR to 202405: #15784

mssonicbld pushed a commit that referenced this pull request Nov 28, 2024
@vkjammala-arista vkjammala-arista deleted the fix-test-fdb-mac-learning branch April 7, 2025 04:54