[sysctl] Increase hung_task_timeout_secs to 300 #6312

Merged
lguohan merged 1 commit into sonic-net:master from daall:increase_kernel_timeout
Dec 30, 2020

daall (Contributor) commented Dec 29, 2020

Depending on the performance characteristics of a given hardware platform, it's possible to exceed the default 120-second kernel hung-task timeout (hung_task_timeout_secs) during I/O-intensive operations like image installation. This can cause a kernel panic like so:

kernel:[ 852.441781] Kernel panic - not syncing: hung_task: blocked tasks

If this happens during image installation, the install can become corrupted and leave the device in an unreachable state that requires a power cycle to resolve. The risk grows as image sizes continue to grow. We therefore need to increase the timeout so that we don't encounter kernel panics on devices with lower disk throughput.
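As a rough illustration of the setting discussed above, the following Python sketch reads the current hung-task timeout from procfs and checks it against the 300-second value named in the PR title. The helper names `read_timeout` and `meets_target` are hypothetical, and the sketch assumes a Linux kernel built with hung-task detection, which exposes `/proc/sys/kernel/hung_task_timeout_secs`. It only inspects the value; it is not how the PR applies the change.

```python
from pathlib import Path

# Hung-task timeout knob on Linux (present only when the kernel is built
# with hung-task detection).
HUNG_TASK_TIMEOUT = Path("/proc/sys/kernel/hung_task_timeout_secs")


def read_timeout(path: Path = HUNG_TASK_TIMEOUT) -> int:
    """Return the timeout in seconds; 0 means no checking is done."""
    return int(path.read_text().strip())


def meets_target(current: int, target: int = 300) -> bool:
    """True if the timeout is at least `target`, or checking is off (0)."""
    return current == 0 or current >= target


if __name__ == "__main__":
    if HUNG_TASK_TIMEOUT.exists():
        secs = read_timeout()
        print(f"hung_task_timeout_secs={secs}, ok={meets_target(secs)}")
    else:
        print("hung-task detection is not available on this kernel")
```

Passing the path as a parameter keeps the helper usable against a plain file, for example when checking a captured copy of the value from another device.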

Signed-off-by: Danny Allen [email protected]

Depending on the performance characteristics of a given hardware platform,
it's possible to exceed the default 120 second kernel timeout during I/O
intensive operations like image installation. This risk increases as image
size continues to increase. So, we need to increase the timeout so that we
don't encounter kernel panics on devices with lower disk throughput.

Signed-off-by: Danny Allen <[email protected]>
daall requested review from jleveque, lguohan and yxieca on December 29, 2020 02:02
lguohan (Collaborator) commented Dec 29, 2020

I feel we should not change the default value here. Instead, we should nice the installer. If the kernel is in such a bad state, user-space applications might hang as well, and the hardware watchdog might time out and reboot the box.
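For readers unfamiliar with the suggestion, here is a minimal Python sketch of what "nicing" a child process looks like. The helper `run_niced` is hypothetical, and the child command is a stand-in that prints its own niceness rather than the real installer. Note that niceness governs CPU scheduling priority; it does not by itself limit how quickly written data accumulates in the page cache.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=10):
    """Run `cmd` with its CPU niceness raised by `increment` (POSIX only)."""
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for the installer: a child that reports its own niceness.
    child = [sys.executable, "-c", "import os; print(os.nice(0))"]
    print(run_niced(child).stdout.strip())
```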

lguohan (Collaborator) left a comment

As per my comment above.

yxieca (Contributor) commented Dec 29, 2020

> I feel we should not change the default value here. Instead, we should nice the installer. If the kernel is in such a bad state, user-space applications might hang as well, and the hardware watchdog might time out and reboot the box.

I think the issue is that the kernel takes too long to flush I/O to the hard drive. I am not sure that nicing the user application would help, unless we throttle the amount of data written to the hard drive. Even then, because there is a cache in between, we don't know when the kernel will flush data to the hard drive, or how much. So I don't think we have control over this issue in user space.
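To make the throttling idea above concrete, here is a hedged Python sketch of a copy loop that calls `fsync` at fixed intervals, which bounds how much of this one file's data can sit unflushed in the page cache. The function name and the chunk and interval sizes are assumptions for illustration only. Nothing in this thread establishes that such a loop would avoid the hung-task panic on the affected platforms, and the PR itself raises the timeout instead.

```python
import os


def copy_with_periodic_fsync(src, dst, chunk=1 << 20, sync_every=64 << 20):
    """Copy src to dst, calling fsync after every `sync_every` bytes written."""
    unsynced = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
            unsynced += len(buf)
            if unsynced >= sync_every:
                fout.flush()             # push Python's buffer to the kernel
                os.fsync(fout.fileno())  # block until the kernel writes it out
                unsynced = 0
        # Flush whatever remains before the file is closed.
        fout.flush()
        os.fsync(fout.fileno())
```

Each `fsync` call can itself block for a long time on slow storage, so this trades one large flush for several smaller ones rather than removing the wait.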

lguohan (Collaborator) left a comment

Can you add more context for this change, so that when we later trace back to this PR, we know exactly why we made it?

daall (Contributor, Author) commented Dec 30, 2020

retest vsimage please

lguohan merged commit a64994e into sonic-net:master on Dec 30, 2020

4 participants