Improve "Verify all interfaces are up" error handling #749
romankachur-mlnx wants to merge 3 commits into sonic-net:master from romankachur-mlnx:master
Conversation
… case 'Verify interfaces are up' fails.
2. Added check_testbed_interfaces.yml to gather interfaces status on:
   - DUT
   - Fanout (for MLNX only, filtered by check_interfaces_status tag)
   - Testbed server (by calling the testbed_vm_status.yml playbook)
   - relevant VMs
3. Added the testbed_vm_status.yml playbook to enter the Testbed server and gather VMs status.
4. Added show_int_portchannel_status.j2 to enter each relevant VM and gather Port-Channel status.
This is a useful feature for debugging, which we found to be working for us.
pavel-shirshov
left a comment
Where does most of the information from the added steps go?
We run them, ignore errors, and miss all the output (we can see the output only in debug mode).
- name: Check Fanout interfaces
  local_action: shell ansible-playbook -i lab fanout.yml -l {{ fanout_switch }} --tags check_interfaces_status
  ignore_errors: yes
Where does the output go?
All gathered info goes to the execution console, from where it can be extracted if needed.
Can we save the output into a variable and output it as
debug: var=output.stdout_lines
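The suggested pattern could look like this (a sketch based on the diff above; the registering of the output and the follow-up debug task are illustrative, not the PR's actual code):

```yaml
- name: Check Fanout interfaces
  local_action: shell ansible-playbook -i lab fanout.yml -l {{ fanout_switch }} --tags check_interfaces_status
  register: fanout_check          # save the nested playbook's output
  ignore_errors: yes

- name: Show Fanout check output
  debug:
    var: fanout_check.stdout_lines  # printed even without verbose mode
```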
Actually, it does register and debug the output:
When interfaces.yml fails,
it calls check_testbed_interfaces.yml with check_fanout: true.
check_testbed_interfaces.yml plays another playbook, fanout.yml, with the tag check_interfaces_status:
local_action: shell ansible-playbook -i lab fanout.yml -l {{ fanout_switch }} --tags check_interfaces_status
which in turn calls the fanout main.yml,
which contains our custom block (which was not upstreamed):
###################################################################
# Check Fanout interfaces status #
###################################################################
- block:
    - name: Check Fanout interfaces status
      action: apswitch template=roles/fanout/templates/mlnx_interfaces_status.j2
      connection: switch
      register: fanout_interfaces
      args:
        login: "{{ switch_login['MLNX-OS'] }}"
    - debug:
        msg: "{{ fanout_interfaces.stdout.split('\n') }}"
  when: peer_hwsku == "MLNX-OS"
  tags: check_interfaces_status
Here we expect that each team may have their own block, relevant to their specific vendor, invoked by
tags: check_interfaces_status
That is why you might not have gotten any information when you ran the test.
- name: Get teamd dump
  shell: teamdctl '{{ item }}' state dump
Where does the output go?
All gathered info goes to the execution console, from where it can be extracted if needed.
Can we save the output into a variable and output it as
debug: var=output.stdout_lines
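For the looped teamd task, the same idea would look roughly like this (a sketch; `portchannels` is a placeholder for whatever list the original task iterates over, and the output of a registered loop ends up as a list under `.results`):

```yaml
- name: Get teamd dump
  shell: teamdctl '{{ item }}' state dump
  register: teamd_dump
  ignore_errors: yes
  with_items: "{{ portchannels }}"        # placeholder loop list

- name: Show teamd dumps
  debug:
    msg: "{{ item.stdout_lines }}"        # each iteration's captured output
  with_items: "{{ teamd_dump.results }}"
```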
Yes, it's a simple improvement, but it will be done as a new PR (this source branch doesn't exist anymore).
To handle this case, I opened another PR that includes these changes as well:
#815
- name: Gather vm list from Testbed server
  local_action: shell ansible-playbook testbed_vm_status.yml -i veos -l "{{ testbed_facts['server'] }}"
Where does the output go?
All gathered info goes to the execution console, from where it can be extracted if needed.
Can we save the output into a variable and output it as
debug: var=output.stdout_lines
The same explanation as in the first conversation applies here.
When interfaces.yml fails,
it calls check_testbed_interfaces.yml with check_vms: true.
check_testbed_interfaces.yml plays another playbook, testbed_vm_status.yml:
local_action: shell ansible-playbook testbed_vm_status.yml -i veos -l "{{ testbed_facts['server'] }}"
testbed_vm_status.yml registers and debugs the output:
- hosts: servers:&vm_host
  tasks:
    - name: Get VM statuses from Testbed server
      shell: virsh list
      register: virsh_list
    - name: Show VM statuses
      debug: msg="{{ virsh_list['stdout_lines'] }}"
but it will only be shown in the verbose mode (-vvvvv) of the outer playbook.
  connection: switch
  ignore_errors: yes
  when: vms["{{ item }}"]['hwsku'] == 'Arista-VM'
  with_items: vms
Where does the output go?
All gathered info goes to the execution console, from where it can be extracted if needed.
Can we save the output into a variable and output it as
debug: var=output.stdout_lines
Yes, it's a simple improvement, but it will be done as a new PR (this source branch doesn't exist anymore).
To handle this case, I opened another PR that includes these changes as well:
#815
Description of PR
Summary:
when 'Verify interfaces are up' fails.
Type of change
Approach
How did you do it?
I changed the interfaces.yml flow to gather more information from the Testbed (DUT / Fanout / Server / VMs)
about the statuses of interfaces and Port-Channels.
Now, when the 'Verify all interfaces are up' step fails, the playbook executes additional steps
to gather the relevant information (interface statuses) from the Testbed Server / DUT / Fanout / VMs.
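Wired together, the flow might look roughly like this (a sketch, not the exact diff; the verify task and the failure condition are assumptions based on this description, only the check_testbed_interfaces.yml playbook name comes from the PR):

```yaml
- name: Verify all interfaces are up
  # the real step checks gathered interface facts; a shell check stands in here
  shell: show interfaces status
  register: intf_status
  ignore_errors: yes

- name: Gather testbed diagnostics on failure
  local_action: shell ansible-playbook check_testbed_interfaces.yml -i lab -l {{ inventory_hostname }}
  ignore_errors: yes
  when: intf_status|failed
```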
How did you verify/test it?
I ran, for example, the LLDP test (or any other test using interfaces.yml).
I used two cases:
- when all interfaces are up — in this case nothing changes;
- when some relevant VM has its Port-Channel deliberately down.
In the latter case, the test fails as usual, but gathers additional information through the Testbed.
Any platform specific information?
The current implementation can enter the Fanout switch only if the Fanout is of MLNX type.
The 'Check Fanout interfaces' step in check_testbed_interfaces.yml calls fanout.yml (actually the fanout role) with the tag 'check_interfaces_status'.
In this case, anyone can modify roles/fanout/tasks/main.yml to execute a Fanout-specific step with
when: peer_hwsku == "<specific_fanout_type>"
tags: check_interfaces_status
Example:
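An Arista fanout block of the same shape could be added to roles/fanout/tasks/main.yml (a hypothetical sketch; the template name eos_interfaces_status.j2 and the hwsku string are placeholders — only the when/tags pattern comes from the PR):

```yaml
- block:
    - name: Check Fanout interfaces status
      action: apswitch template=roles/fanout/templates/eos_interfaces_status.j2  # placeholder template
      connection: switch
      register: fanout_interfaces
    - debug:
        msg: "{{ fanout_interfaces.stdout.split('\n') }}"
  when: peer_hwsku == "Arista-EOS"   # placeholder hwsku value
  tags: check_interfaces_status
```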
Supported testbed topology if it's a new test case?
N/A
Documentation
N/A