[sysctl] Increase hung_task_timeout_secs to 300 #6312
lguohan merged 1 commit into sonic-net:master
Depending on the performance characteristics of a given hardware platform, it's possible to exceed the default 120-second kernel timeout during I/O-intensive operations like image installation. This risk grows as image sizes continue to increase, so we need to raise the timeout to avoid kernel panics on devices with lower disk throughput.
Signed-off-by: Danny Allen <[email protected]>
I feel we should not change the default value here. Instead, we should nice the installer. If the kernel is in such a bad state, user-space applications might hang as well, and the hardware watchdog might time out and reboot the box.
I think the issue is that the kernel takes too much time to flush I/O to the hard drive. I am not sure that nicing the user application would help, unless we throttle the amount of data written to the hard drive. Even then, because there is a cache in between, we don't know when the kernel will flush data to the hard drive, or how much. So I think we don't have control over this issue in user space.
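To make the point about the cache concrete, here is a minimal Python sketch that reports how much data the kernel is still holding for writeback. It assumes a Linux system that exposes the standard `Dirty` and `Writeback` fields in `/proc/meminfo`; the helper name `pending_writeback_kib` is hypothetical and not part of this PR.

```python
def pending_writeback_kib(meminfo_text):
    """Return (dirty_kib, writeback_kib) parsed from /proc/meminfo-style text.

    Fields that are missing are reported as 0.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields.get("Dirty", 0), fields.get("Writeback", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            dirty, writeback = pending_writeback_kib(f.read())
        print(f"Dirty: {dirty} kB, Writeback: {writeback} kB")
    except OSError:
        print("/proc/meminfo is not available on this system")
```

Sampling these values during an install only shows how much data is queued. It does not say when the kernel will flush it, which is the concern raised above.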
lguohan left a comment
Can you add more context for this change, so that later, when we trace back to this PR, we know exactly why we made it?
retest vsimage please
Depending on the performance characteristics of a given hardware platform, it's possible to exceed the default 120-second kernel timeout during I/O-intensive operations like image installation. This can cause a kernel panic like so:
kernel:[ 852.441781] Kernel panic - not syncing: hung_task: blocked tasks
If this happens during image installation, the install can become corrupted and leave the device in an unreachable state that requires a power cycle to resolve. This risk grows as image sizes continue to increase, so we need to raise the timeout to avoid kernel panics on devices with lower disk throughput.
Signed-off-by: Danny Allen [email protected]
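For reference, a minimal Python sketch of how the setting described above could be inspected and expressed as a sysctl entry. The sysctl key `kernel.hung_task_timeout_secs` and its `/proc/sys` path are the standard Linux names for this knob. The helper names are hypothetical, and the file this PR actually edits is not shown in this conversation.

```python
from pathlib import Path

# Standard procfs location of the hung-task timeout (in seconds).
HUNG_TASK_TIMEOUT = Path("/proc/sys/kernel/hung_task_timeout_secs")


def read_hung_task_timeout(path=HUNG_TASK_TIMEOUT):
    """Return the current timeout in seconds, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def sysctl_line(seconds=300):
    """Format the entry as it would appear in a sysctl configuration file."""
    return f"kernel.hung_task_timeout_secs={seconds}"


if __name__ == "__main__":
    current = read_hung_task_timeout()
    if current is None:
        print("hung_task_timeout_secs is not available on this system")
    else:
        print(f"current timeout: {current}s")
    print(f"proposed entry:  {sysctl_line()}")
```

Reading the value does not require root. Changing it at runtime does, for example with `sysctl -w kernel.hung_task_timeout_secs=300`.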