can the hardware watchdog reboot a hung kernel?

Fri Nov 15 04:24:19 UTC 2019

> On 15 Nov 2019, at 14:29, Eugene Grosbein <eugen at grosbein.net> wrote:
> 
> 15.11.2019 1:19, Ian Lepore wrote:
> 
>> One thing to be careful of here is multicore systems.  If you have a
>> critical app running on a multicore system, that app can hang (maybe it
>> tries to read from a device that has malfunctioned and essentially gets
>> hung forever in a device driver that doesn't implement timeouts very
>> well or something).  In that case, only one core is hung, so watchdogd
>> will be able to keep petting the dog to prevent a reboot, but since
>> your app is hung on a different core, you aren't really getting the
>> protection you need.
>> 
>> The fix for that is to either turn your app into watchdogd (have it make
>> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
>> with watchdogd, and make 'cmd' be a script that somehow verifies that
>> your critical application is still running properly.
> 
> I have not tried it myself, but there may be an easier way
> if the app is single-process and single-threaded: use cpuset(1) to bind
> both the app and watchdogd to the same core.

You can get watchdogd to run a script (the '-e cmd' option Ian mentioned), so you could have that script check that your app is still alive. If the check fails, watchdogd stops petting the dog and it will bite.
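
As a rough illustration of such a liveness check, here is a minimal sketch in Python. It assumes the critical app touches a heartbeat file every few seconds; the file path, the age threshold, and the function name is_alive are all made up for the example and are not anything watchdogd provides. The script exits 0 when the heartbeat is fresh and non-zero otherwise, which is the signal watchdogd uses to decide whether to keep petting the dog.

```python
#!/usr/bin/env python3
"""Liveness check meant to be run by watchdogd through its -e option.

Assumption: the critical app updates HEARTBEAT_PATH periodically.
Both constants below are hypothetical and must be adapted.
"""
import os
import sys
import time

HEARTBEAT_PATH = "/var/run/myapp.heartbeat"  # hypothetical heartbeat file
MAX_AGE_SECONDS = 10.0                       # hypothetical staleness limit


def is_alive(path, max_age, now=None):
    """Return True if path exists and was modified within max_age seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat file counts as "not alive".
        return False
    return (now - mtime) <= max_age


def main():
    # Exit status 0 means healthy; any non-zero status means the check failed.
    return 0 if is_alive(HEARTBEAT_PATH, MAX_AGE_SECONDS) else 1


if __name__ == "__main__":
    sys.exit(main())
```

You would then point watchdogd at it with something like "watchdogd -e /path/to/check_myapp.py" (the script path is again a placeholder). The check has to finish well inside the watchdog timeout, so keep it to something cheap like a stat() rather than anything that could itself block on the hung device.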

--
Daniel O'Connor
"The nice thing about standards is that there
are so many of them to choose from."
 -- Andrew Tanenbaum