Collecting dump during SAI failure #1212
Merged
14 commits
ea4c064 Create dump_on_sai_failure.md (dgsudharsan)
e16b074 Update dump_on_sai_failure.md (dgsudharsan)
5f06966 Update dump_on_sai_failure.md (dgsudharsan)
c11c4ff Adding images (dgsudharsan)
b325d09 Update dump_on_sai_failure.md (dgsudharsan)
072697d Update dump_on_sai_failure.md (dgsudharsan)
530a9a7 Update dump_on_sai_failure.md (dgsudharsan)
c547ccc Update dump_on_sai_failure.md (dgsudharsan)
249cd3e Update dump_on_sai_failure.md (dgsudharsan)
124eb87 Add files via upload (dgsudharsan)
9e793a6 Updating flow diagram (dgsudharsan)
fa0a6ee Merge branch 'master' into sai_failure_dump (dgsudharsan)
e479234 Update dump_on_sai_failure.md (dgsudharsan)
b403182 Merge branch 'master' into sai_failure_dump (prsunny)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,90 @@ | ||
| # Dump on SAI failure # | ||
|
|
||
| ## Table of Contents | ||
|
|
||
| - [Revision](#revision) | ||
| - [Scope](#scope) | ||
| - [Definitions/Abbreviations](#definitionsabbreviations) | ||
| - [Overview](#overview) | ||
| - [Requirements](#requirements) | ||
| - [High-Level Design](#high-level-design) | ||
| - [SAI API Requirements](#sai-api-requirements) | ||
| - [Configuration and management](#configuration-and-management) | ||
| - [Config command](#config-command) | ||
| - [Show command](#show-command) | ||
| - [DB Migrator](#db-migrator) | ||
| - [YANG Model changes](#yang-model-changes) | ||
| - [Warmboot and Fastboot Considerations](#warmboot-and-fastboot-considerations) | ||
| - [Testing Design](#testing-design) | ||
| - [Unit Tests](#unit-tests) | ||
| - [System tests](#system-tests) | ||
|
|
||
|
|
||
| ### Revision | ||
|
|
||
| | Rev | Date | Author | Change Description | | ||
| |:---:|:-----------:|:-------------------:|--------------------------------------------| | ||
| | 0.1 | | Sudharsan | Initial version | | ||
|
|
||
| ### Scope | ||
| This document describes the design for taking a dump when a SAI failure occurs. | ||
|
|
||
| ### Definitions/Abbreviations | ||
|
|
||
|
|
||
| ### Overview | ||
| In the existing design, when a SAI failure occurs, orchagent aborts and all dependent services, including syncd, restart. As a result, the SAI, SDK, and lower-layer state at the time of the failure is never captured, and the information needed to debug the problem is lost. | ||
| To solve this issue, whenever there is a SAI failure, orchagent requests syncd to take the relevant dumps and proceeds with the abort only once they are done. | ||
|
|
||
| ### Requirements | ||
|
|
||
| The primary requirements for taking a dump during a SAI failure are: | ||
| - The dump needs to be taken synchronously before the abort. | ||
|
||
| - The infrastructure for taking the dump should be flexible enough to allow platform-specific calls, similar to techsupport. | ||
| - The dumps should be accessible on the host so that they can then be collected by techsupport. | ||
| - The number of dumps should be limited through rotation. | ||
|
|
||
|
|
||
| ### High-Level Design | ||
| A new enum value, SAI_REDIS_NOTIFY_SYNCD_INVOKE_DUMP, is defined for SAI_REDIS_SWITCH_ATTR_NOTIFY_SYNCD. When there is a SAI failure, orchagent sets the switch attribute SAI_REDIS_SWITCH_ATTR_NOTIFY_SYNCD to SAI_REDIS_NOTIFY_SYNCD_INVOKE_DUMP before calling abort. On receiving this attribute, syncd calls the generic dump script, /usr/bin/syncd_dump.sh. This script checks for a platform-specific dump script, which should be located at /usr/bin/platform_syncd_dump.sh, and invokes it if present to take the necessary dumps. Vendors that intend to take dumps during a SAI failure can define this script in their syncd docker. The dumps collected by the platform-specific script should be stored in /var/log/sai_failure_dump/, which is exposed to the host. Only one file should be stored per dump in order to facilitate the rotation logic. Once the dump is finished, the generic syncd dump script performs rotation on /var/log/sai_failure_dump/ to restrict the number of dumps. The limit is held in a variable named SAI_MAX_FAILURE_DUMPS, defined in the generic script and set to 10 by default. This variable can be overridden in the platform-specific script if the platform wants a different number of dumps. | ||
|
|
||
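The dispatch and rotation steps described above can be illustrated with a minimal sketch. This is not the actual `syncd_dump.sh`, which is a shell script; it is a Python approximation under stated assumptions. The function names `rotate_dumps` and `take_dump` are hypothetical, "oldest" is assumed to mean oldest modification time, the executable-bit check is an added assumption, and skipping rotation when no platform script exists is also an assumption, since the design text does not specify that case.

```python
import os
import subprocess
from pathlib import Path

# Paths and the default limit are taken from the design text.
PLATFORM_DUMP_SCRIPT = "/usr/bin/platform_syncd_dump.sh"
DUMP_DIR = "/var/log/sai_failure_dump"
SAI_MAX_FAILURE_DUMPS = 10


def rotate_dumps(dump_dir=DUMP_DIR, max_dumps=SAI_MAX_FAILURE_DUMPS):
    """Keep the max_dumps newest files in dump_dir; return the names removed."""
    if max_dumps < 0:
        raise ValueError("max_dumps must be non-negative")
    files = [p for p in Path(dump_dir).iterdir() if p.is_file()]
    # Newest first; the file name breaks ties between equal timestamps.
    files.sort(key=lambda p: (p.stat().st_mtime, p.name), reverse=True)
    removed = []
    for stale in files[max_dumps:]:
        stale.unlink()
        removed.append(stale.name)
    return removed


def take_dump(platform_script=PLATFORM_DUMP_SCRIPT, dump_dir=DUMP_DIR,
              max_dumps=SAI_MAX_FAILURE_DUMPS):
    """Run the platform script if present, then rotate. Return True if it ran."""
    # Assumption: "present" is treated here as "exists and is executable".
    if not (os.path.isfile(platform_script)
            and os.access(platform_script, os.X_OK)):
        return False
    # Blocks until the platform script exits, so the dump is synchronous.
    subprocess.run([platform_script], check=False)
    rotate_dumps(dump_dir, max_dumps)
    return True
```

Because each dump is a single file, the rotation only has to count and delete files; it never needs to group several files belonging to one dump.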
| Later, when techsupport is invoked, either manually or through auto techsupport, these dumps are collected and then cleared from /var/log/sai_failure_dump/. | ||
|
|
||
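The collect-then-clear behaviour can be sketched as follows. This is only an illustration of the idea, not how techsupport is implemented; the function name `collect_and_clear` and the destination directory argument are hypothetical.

```python
import shutil
from pathlib import Path


def collect_and_clear(dump_dir, dest_dir):
    """Copy every dump file into dest_dir, then delete it from dump_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    collected = []
    for dump in sorted(Path(dump_dir).iterdir()):
        if not dump.is_file():
            continue
        shutil.copy2(dump, dest / dump.name)  # preserves timestamps
        dump.unlink()  # clear only after the copy succeeded
        collected.append(dump.name)
    return collected
```

Deleting each file only after its copy completes means an interrupted collection leaves the remaining dumps in place for the next run.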
| The diagram below shows the sequence of events when a SAI failure happens: | ||
|  | ||
|
|
||
| The flow inside syncd is shown below: | ||
|  | ||
|
||
|
|
||
| ### SAI API Requirements | ||
| None | ||
|
|
||
| ### Configuration and management | ||
|
|
||
| #### Config command | ||
|
|
||
| No new commands are introduced as part of this design. | ||
|
|
||
| #### Show command | ||
|
|
||
| No new commands are introduced as part of this design. | ||
|
|
||
| #### DB Migrator | ||
| N/A | ||
|
|
||
| ### YANG model changes | ||
| N/A | ||
|
|
||
| ### Warmboot and Fastboot Considerations | ||
| N/A | ||
|
|
||
| ### Testing Design | ||
|
|
||
| #### Unit tests | ||
| 1) Gtest for the syncd infrastructure to test handling of the SAI_REDIS_SWITCH_ATTR_NOTIFY_SYNCD attribute. | ||
| 2) Gtest in orchagent to test the SAI failure scenario. | ||
|
|
||
| #### System tests | ||
| 1) Simulate a SAI failure and verify that a SAI failure dump is created. | ||
| 2) Verify that the SAI failure dump is collected in the techsupport dump. | ||
|
|
||