Enhancing core_dump_and_config_check to be multi-asic aware #6527
SuvarnaMeenakshi merged 6 commits into sonic-net:master
The pre-commit check detected issues in the files touched by this pull request. For old issues, it is not mandatory to fix them because they were not caused by this change. It is unfair to blame Detailed pre-commit check results: To run the pre-commit checks locally, you can follow below steps:
@SuvarnaMeenakshi - I will fix the merge conflict. Can you please review?
@SuvarnaMeenakshi @judyjoseph are you expecting the PR owners to fix the pre-commit errors, even though they are in lines of code that the PR owner didn't touch?
@sanmalho-git The log mentions that new issues must be fixed.
SuvarnaMeenakshi
left a comment
To ensure changes are uniform, changes should be done here as well, where golden config db is created and removed:
https://github.com/sonic-net/sonic-mgmt/blob/master/ansible/config_sonic_basedon_testbed.yml#L584
https://github.com/sonic-net/sonic-mgmt/blob/master/tests/test_pretest.py#L291
Thanks for pointing this out.
@anamehra fyi
@sanmalho-git - can you resolve the conflict and fix the minor comments?
In our pipeline runs, autorestart tests against a multi-asic DUT were failing with the error:
'Feature 'x' auto-restart is not consistent across namespaces'
The reason was that pretest disables autorestart on all the containers. The ACL tests then
change the config_db's and, to restore them in cleanup, run a config load_minigraph.
This sets the autorestart state of the containers back to the default of 'enabled'.
Now, when check_dut_health_status kicks in, it takes a snapshot of config_db.json before the ACL
suite, which has the autorestart state as 'disabled'. After the ACL suite, it detects that the
autorestart state has changed to 'enabled', so it tries to restore it.
However, when it restores, it only restores config_db.json, and not the config_db's of the other asics.
This leaves the autorestart state inconsistent across namespaces: the global namespace
has autorestart 'disabled', while the per-asic namespaces have autorestart 'enabled'.
The fix is to enhance check_dut_health_status to be multi-asic aware:
- It compares not just config_db.json, but also the config_db's of all the asics.
- If it finds that the config has changed, it restores not just config_db.json, but also the config_db's of all the asics before rebooting the DUT.
This will make sure that critical services and ports are up when proceeding to the next suite. The default wait time of 120 for pizza box and 240 for modular chassis is sometimes not sufficient, especially when all 400G ports have SFPs.
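The multi-asic aware compare-and-restore idea described above can be sketched roughly as follows. This is only an illustration and not the code from this PR: the function names, the use of plain dicts in place of config_db contents, and the namespace labels ("global", "asic0", "asic1") are all assumptions made for the example.

```python
import copy

# Illustrative sketch only. Plain dicts stand in for config_db contents, and the
# namespace labels ("global", "asic0", ...) are hypothetical, not taken from the PR.


def changed_namespaces(pre_configs, cur_configs):
    """Return the namespaces whose config differs from the pre-suite snapshot."""
    return sorted(
        ns for ns in pre_configs if pre_configs[ns] != cur_configs.get(ns)
    )


def restore_all(pre_configs, cur_configs):
    """Restore every namespace if any of them changed.

    Restoring only the global config is what left the namespaces
    inconsistent, so all snapshots are put back together.
    Returns True if a restore was performed.
    """
    if not changed_namespaces(pre_configs, cur_configs):
        return False
    for ns, cfg in pre_configs.items():
        # Stand-in for copying the saved config_db file back into place.
        cur_configs[ns] = copy.deepcopy(cfg)
    return True
```

In a real run the restore step would be followed by the DUT reboot mentioned above; the sketch stops at the point where every namespace again matches its snapshot.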
@SuvarnaMeenakshi - I have addressed the comments and rebased.
@sanmalho-git Thank you for addressing the comments. PR checks are failing:
==================================== ERRORS ====================================
E KeyError: None
- Forgot to replace 'host' with None in one of the spots
…nfig (#6918)
What is the motivation for this PR?
Some test cases failed at teardown when comparing the pre config and the current config:
for key in cur_config_extra_keys:
> cur_only_config[duthost.hostname].update({key: cur_running_config[key]})
E KeyError: u'MUX_LINKMGR'
The issue was introduced by #6527.
How did you do it?
Get the previous and current config keys after removing the exclusive keys. Also add [cfg_context] for cur_only_config and pre_only_config.
How did you verify/test it?
Run a dualtor case, such as dualtor/test_tor_ecn.py::test_dscp_to_queue_during_encap_on_standby.
Signed-off-by: Zhaohui Sun <[email protected]>
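One way to read the #6918 fix is that the key sets are computed only after the excluded tables have been dropped, so every key that is later looked up is guaranteed to exist in the config it is read from. The sketch below shows that ordering. The function name, the sample tables, and the assumption that MUX_LINKMGR is one of the excluded keys are hypothetical and are not taken from the sonic-mgmt code.

```python
# Illustrative sketch only; names and sample data are hypothetical.


def config_key_diff(pre_config, cur_config, exclude_keys):
    """Return (pre_only, cur_only) top-level tables, ignoring excluded keys.

    The excluded keys are removed before the set difference is taken, so
    each remaining key is always present in the dict it is read from.
    """
    excluded = set(exclude_keys)
    pre_keys = set(pre_config) - excluded
    cur_keys = set(cur_config) - excluded
    pre_only = {key: pre_config[key] for key in pre_keys - cur_keys}
    cur_only = {key: cur_config[key] for key in cur_keys - pre_keys}
    return pre_only, cur_only
```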
…ic. (#8884)
What is the motivation for this PR?
PR #6527 enhanced the function core_dump_and_config_check to be multi-asic aware. But in the single-asic scenario, it simply set the key to None, which does not make sense. In this PR, I set the key to "asic0" in the single-asic scenario to keep it consistent with the key values of the multi-asic scenario.
How did you do it?
Change the key in the single-asic scenario from None to asic0.
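The key naming described in #8884 can be pictured with a small sketch. The helper name and the "asicN" naming for multi-asic DUTs are assumptions for illustration; the PR text only states that the single-asic key changed from None to asic0.

```python
# Illustrative sketch only; the helper and constant names are hypothetical.

DEFAULT_ASIC_KEY = "asic0"


def asic_config_keys(num_asics):
    """Return the per-asic keys used to index saved configs.

    A single-asic DUT uses "asic0" rather than None, so callers can treat
    single-asic and multi-asic results the same way.
    """
    if num_asics <= 1:
        return [DEFAULT_ASIC_KEY]
    return ["asic{}".format(index) for index in range(num_asics)]
```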
…ic-mgmt into internal-202205
Fix merge conflicts.
- Fix verify_no_packet_any call in fib_test (sonic-net#6461)
- Fix the test case test_TSA failure when check the routes on the eos host (sonic-net#6483)
- Use conditional mark to skip testcase instead of required_mocked_dualtor (sonic-net#6766)
- [tagged_arp] fix issue 'fixture ports_list not found' (sonic-net#6773)
- [QoS] fixes after moving to python3 (sonic-net#6786)
- update parse funciton for image url (sonic-net#6848)
- Fix typo in get_queue_counter (sonic-net#6852)
- Revert "Fix loganalyzer.py UnicodeDecodeError (sonic-net#6524)" (sonic-net#6858)
- Enhancing core_dump_and_config_check to be multi-asic aware (sonic-net#6527)
- Adding support for calculating balancing in multi-lc/multi-asic case (Test_fib.py) (sonic-net#6391)
- Support different RC in case of pre or post sanity check failed (sonic-net#6860)
- Update getbuild.py to support pass an empty access_token
- [202205] Fixing auto_techsupport (sonic-net#6882)
- Merge branch 'azure-202205' into dev/yaqiangzhu/202205_manually_merge
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
In our pipeline runs, autorestart tests against a multi-asic DUT were failing with the error 'Feature 'x' auto-restart is not consistent across namespaces'.
The reason was that pretest disables autorestart on all the containers. The ACL tests then change the config_db's and, to restore them in cleanup, run a config load_minigraph. This sets the autorestart state of the containers back to the default of 'enabled'.
Now, when check_dut_health_status kicks in, it takes a snapshot of config_db.json before the ACL suite, which has the autorestart state as 'disabled'. After the ACL suite, it detects that the autorestart state has changed to 'enabled', so it tries to restore it.
However, when it restores, it only restores config_db.json, and not the config_db's of the other asics.
This leaves the autorestart state inconsistent across namespaces: the global namespace has autorestart 'disabled', while the per-asic namespaces have autorestart 'enabled'.
How did you do it?
The fix is to enhance check_dut_health_status to be multi-asic aware.
How did you verify/test it?
Ran check_dut_health_status with changes to the config_db's and validated that they are restored to what they were before the suite was run.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation