[process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot#3880
Conversation
…ned when warm reboot follows a hardware caused reboot 1. check whether /proc/cmdline indicates warm/fast reboot. if yes the software reboot cause file will be treated as the reboot cause. finish 2. check whether platform api returns a reboot cause. if yes it is treated as the reboot cause. finish. 3. check whether /hosts/reboot-cause contains a cause. if yes it is treated as the cause otherwise return unknown.
| FIRST_BOOT_PLATFORM_FILE = "/tmp/notify_firstboot_to_platform" | ||
| REBOOT_CAUSE_KEXEC = "/proc/cmdline" | ||
| # The following SONIC_BOOT_TYPEs come from the warm/fast reboot script which is in sonic-utilities | ||
| REBOOT_CAUSE_KEXEC_PATTERN = ".*SONIC_BOOT_TYPE=(fast-reboot|warm|fastfast)$" |
There was a problem hiding this comment.
According to new design for fast reboot - "SONIC_BOOT_TYPE=fast", but also need to check for "fast-reboot" in case reboot was done from 201803
There was a problem hiding this comment.
Regular expression will match if SONIC_BOOT_TYPE is the last argument, lets have '.*' instead of '$'
There was a problem hiding this comment.
@stephenxs IMHO:
REBOOT_CAUSE_KEXEC_PATTERN = ".*SONIC_BOOT_TYPE=(fast-reboot|warm|fastfast)$" ->
REBOOT_CAUSE_KEXEC_PATTERN = "SONIC_BOOT_TYPE=(fast-reboot|warm|fastfast)"
re.match(REBOOT_CAUSE_KEXEC_PATTERN, cause_file_kexec) ->
re.search(REBOOT_CAUSE_KEXEC_PATTERN, cause_file_kexec)
| # ============================= Functions ============================= | ||
| def is_warmfast_reboot_from_proc_cmdline(): | ||
| if os.path.isfile(REBOOT_CAUSE_KEXEC): | ||
| cause_file = open(REBOOT_CAUSE_KEXEC, "r") |
There was a problem hiding this comment.
files was not closed, try to use with open(REBOOT_CASE_KEXEC, "r"):
There was a problem hiding this comment.
fixed. use "with" for all open file cases.
| # The reboot was not caused by hardware. If there is a REBOOT_CAUSE_FILE, it will | ||
| # contain any software-related reboot info. We will use it as the previous cause. | ||
| if os.path.isfile(REBOOT_CAUSE_FILE): | ||
| cause_file = open(REBOOT_CAUSE_FILE, "r") |
1. use "with" statement 2. update fast/warm reboot BOOT_ARG
| FIRST_BOOT_PLATFORM_FILE = "/tmp/notify_firstboot_to_platform" | ||
| REBOOT_CAUSE_KEXEC = "/proc/cmdline" | ||
| # The following SONIC_BOOT_TYPEs come from the warm/fast reboot script which is in sonic-utilities | ||
| REBOOT_CAUSE_KEXEC_PATTERN = ".*SONIC_BOOT_TYPE=(fast-reboot|warm|fastfast|fast).*" |
There was a problem hiding this comment.
Will not work for 201803->master upgrade. Please take a look at https://github.com/Azure/sonic-utilities/blob/201803/scripts/fast-reboot#L30
There was a problem hiding this comment.
thank you for pointing out. will fix it.
Btw, we don't need to consider versions earlier than 201803, right?
@jleveque as Stepan isn't in working hours, can you help ensure that
There was a problem hiding this comment.
@stephenxs: You are correct; we don't need to consider versions earlier than 201803.
| prev_cause_file.write(previous_reboot_cause) | ||
| prev_cause_file.close() | ||
| with open(PREVIOUS_REBOOT_CAUSE_FILE, "w") as prev_cause_file: | ||
| prev_cause_file.write(previous_reboot_cause) |
There was a problem hiding this comment.
This is not an issue for this PR or change request for this PR. Just want to start the discussion here.
This change is the result of a team brainstorm of various combinations of hard/soft reboot reason. We fell that we've covered all the cases for now. But in case we didn't. We could consider change the previous reboot cause file to 3 lines:
Reboot cause: (calculation result)
Detected hardware cause: (hardware cause)
Detected software cause: (software cause)
Not sure if any other part of our code has dependency on this file being single line?
There was a problem hiding this comment.
personally I agree your point.
in addition, if reboot causes from different sources are conflict, it's better to have it in a detailed way.
meanwhile, the drawback is it can confuse user sometimes. for example, if the following info displayed:
calculated reboot cause: warm reboot
hardware cause: power off
software cause: warm reboot
probably it's better to only display in a 3-line way in case of we known for sure there is conflicts among causes from different sources?
There was a problem hiding this comment.
You got a good point. However, if we know there is a conflict, then we would have figured it out and fixed it in code. We cannot predict what we don't know yet. :-)
yxieca
left a comment
There was a problem hiding this comment.
Thanks for working on this change!
| else: | ||
| previous_reboot_cause = hardware_reboot_cause | ||
| # 2. Check if the previous reboot was caused by hardware | ||
| # If yes, the hardware reboot cause will be treated as teh reboot cause |
There was a problem hiding this comment.
teh [](start = 70, length = 3)
-> the
| if os.path.isfile(FIRST_BOOT_PLATFORM_FILE): | ||
| if (previous_reboot_cause == UNKNOWN_REBOOT_CAUSE): | ||
| previous_reboot_cause = UNKNOWN_REBOOT_CAUSE | ||
| os.remove(FIRST_BOOT_PLATFORM_FILE) |
There was a problem hiding this comment.
line 121 to line 132 are same as line 140 to line 151.
suggest to consolidate them as a single function.
previous_reboot_cause = find_software_reboot_cause()
| else: | ||
| # /proc/cmdline says it's a warm/fast reboot but /host/reboot-cause.txt doesn't exist. | ||
| # report an error. | ||
| log_error("/proc/cmdline indicates a fast/warm reboot but {} doesn't exist".format(REBOOT_CAUSE_DIR)) |
There was a problem hiding this comment.
other than report error,
I think we should still assign the
proc_cmdline_reboot_cause = warm or fast
we should treat proc cmdline as truth.
|
retest vs please |
…ned when warm reboot follows a hardware caused reboot (#3880) * [process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot 1. check whether /proc/cmdline indicates warm/fast reboot. if yes the software reboot cause file will be treated as the reboot cause. finish 2. check whether platform api returns a reboot cause. if yes it is treated as the reboot cause. finish. 3. check whether /hosts/reboot-cause contains a cause. if yes it is treated as the cause otherwise return unknown. * [process-reboot-cause]Fix review comments * [process-reboot-cause]address comments 1. use "with" statement 2. update fast/warm reboot BOOT_ARG * [process-reboot-cause]address comments * refactor the code flow * Remove escape * Remove extra ':'
…ned when warm reboot follows a hardware caused reboot (#3880) * [process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot 1. check whether /proc/cmdline indicates warm/fast reboot. if yes the software reboot cause file will be treated as the reboot cause. finish 2. check whether platform api returns a reboot cause. if yes it is treated as the reboot cause. finish. 3. check whether /hosts/reboot-cause contains a cause. if yes it is treated as the cause otherwise return unknown. * [process-reboot-cause]Fix review comments * [process-reboot-cause]address comments 1. use "with" statement 2. update fast/warm reboot BOOT_ARG * [process-reboot-cause]address comments * refactor the code flow * Remove escape * Remove extra ':'
…atically (#22614) #### Why I did it src/sonic-utilities ``` * 6ce5257 - (HEAD -> master, origin/master, origin/HEAD) Update ubuntu version in azure pipeline (#3891) (5 hours ago) [Vasundhara Volam] * c78e0f7 - (origin/202505) VNET CLI- ADD/DEL VNET, ADD/DEL VNET ROUTE, VRF BIND/UNBIND to consider VNET + ADDITIONAL MODIFICATIONS TO SHOW VNET CLIs (#3826) (7 days ago) [KavyaVaniBedida] * 1d6e050 - Revert "Vnet_name added in create_only patterns (#3878)" (#3879) (7 days ago) [miatttao] * 5db13c2 - skip pfcwd if disabled in golden_config (#3880) (8 days ago) [Dashuai Zhang] * 07b232a - Add GCU Support for SKU Mellanox-SN4280-O8C80 (#3871) (8 days ago) [Sai Rama Mohan Reddy S] ``` #### How I did it #### How to verify it #### Description for the changelog
…atically (#22664) #### Why I did it src/sonic-utilities ``` * 6964f652 - (HEAD -> 202505, origin/202505) add TH5-512 hwsku into gcu support list (#3898) (16 hours ago) [mssonicbld] * 1ec19bc3 - feat: support namespace arg for show bfd commands (#3892) (6 days ago) [mssonicbld] * 6f93c7ff - feat: support namespace arg for show mac (#3893) (6 days ago) [mssonicbld] * c78e0f7 - VNET CLI- ADD/DEL VNET, ADD/DEL VNET ROUTE, VRF BIND/UNBIND to consider VNET + ADDITIONAL MODIFICATIONS TO SHOW VNET CLIs (#3826) (13 days ago) [KavyaVaniBedida] * 1d6e050 - Revert "Vnet_name added in create_only patterns (#3878)" (#3879) (13 days ago) [miatttao] * 5db13c2 - skip pfcwd if disabled in golden_config (#3880) (2 weeks ago) [Dashuai Zhang] * 07b232a - Add GCU Support for SKU Mellanox-SN4280-O8C80 (#3871) (2 weeks ago) [Sai Rama Mohan Reddy S] ``` #### How I did it #### How to verify it #### Description for the changelog
…lly (#23997) #### Why I did it src/sonic-swss ``` * badf36fa - (HEAD -> 202505, origin/202505) fpmsyncd crashes during execution of sonic-mgmt script vxlan/test_vnet_bgp_route_precedence.py (#3880) (15 hours ago) [mssonicbld] ``` #### How I did it #### How to verify it #### Description for the changelog
- What I did
Fix issue reboot cause test failed if warm-reboot follows any kind of hardware-caused reboot #3871
- How I did it
if yes the software reboot cause file will be treated as the reboot cause.
finish
if yes it is treated as the reboot cause.
finish.
if yes it is treated as the cause otherwise return unknown.
Detailed information can be found in the issue description.
- How to verify it
reboot the DUT in the following sequence
reboot-cause-test.txt
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)