can the hardware watchdog reboot a hung kernel?

Fri Nov 15 16:58:28 UTC 2019

> On 14 Nov 2019, at 20:19, Ian Lepore <ian at freebsd.org> wrote:
> 
> On Thu, 2019-11-14 at 20:10 +0200, Daniel Braniss wrote:
>>> On 14 Nov 2019, at 18:02, Ian Lepore <ian at freebsd.org> wrote:
>>> 
>>> On Thu, 2019-11-14 at 17:35 +0200, Daniel Braniss wrote:
>>>>> On 14 Nov 2019, at 17:28, Eugene Grosbein <eugen at grosbein.net>
>>>>> wrote:
>>>>> 
>>>>> 14.11.2019 21:52, Daniel Braniss wrote:
>>>>> 
>>>>>> hi,
>>>>>> I have several hundred NanoPi NEO boards running, and sometimes they
>>>>>> hang. Since there is no console
>>>>>> available, the only solution is to do a power cycle - not so easy
>>>>>> since they are distributed in three buildings :-)
>>>>>> 
>>>>>> I am looking at the watchdog stuff, but it seems that what I want
>>>>>> is not supported, i.e.
>>>>>> 	reboot the kernel when hung 
>>>>>> 
>>>>>> wishful thinking?
>>>>> 
>>>>> It's possible if the hardware has such a watchdog and kernel
>>>>> subsystem watchdog(4) supports it.
>>>>> rc.conf(5) manual page describes watchdogd_enable option.
>>>>> 
>>>> 
>>>> yes, but it relies on userland, what if the kernel is hung?
>>>> 
>>> 
>>> It relies on the userland daemon to issue the ioctl() calls to pet the
>>> dog.  If the kernel is hung, then userland code isn't going to run
>>> either, and the watchdog petting won't happen, and eventually the
>>> hardware reboots.
>>> 
>>> We use this at $work specifically to reboot if the kernel hangs, using
>>> this config:
>>> 
>>> watchdogd_enable=YES
>>> watchdogd_flags="-s 16 -t 64 -x 64"
>>> 
>>> That says the daemon should pet the dog every 16 seconds, and the
>>> hardware is programmed to reboot if 64 seconds elapse without petting.
>>> In addition, when watchdogd is shut down normally (like during a normal
>>> system reboot) it doesn't disable the watchdog hardware, it sets the
>>> timeout to 64s to protect against any kind of hang during the reboot. 
>>> The -t and -x times can be different, 64s just happens to work well for
>>> us in both cases.
>>> 
>>> -- Ian
>>> 
>> 
>> ok, that is very encouraging, now a last question
>> how can I hang the kernel to test that the watchdog kicks in? Apart from writing a kernel module :-)
>> 
> 
> One thing to be careful of here is multicore systems.  If you have a
> critical app running on a multicore system, that app can hang (maybe it
> tries to read from a device that has malfunctioned and essentially gets
> hung forever in a device driver that doesn't implement timeouts very
> well or something).  In that case, only one core is hung, so watchdogd
> will be able to keep petting the dog to prevent a reboot, but since
> your app is hung on a different core, you aren't really getting the
> protection you need.
> 
> The fix for that is to either turn your app into watchdogd (have it make
> the periodic ioctl() calls to pet the dog), or use the '-e cmd' option
> with watchdogd, and make 'cmd' be a script that somehow verifies that
> your critical application is still running properly.
> 
> -- Ian
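
A minimal sketch of the kind of check script the '-e cmd' option could run. It rests on an assumption that is not in the thread: the critical app periodically writes the current epoch time (seconds) into a heartbeat file. The function name, the script name and the file path are hypothetical. The script exits 0 only while the heartbeat is recent, so watchdogd would stop petting the dog once the app stops updating it.

```shell
#!/bin/sh
# Hypothetical health check for use with "watchdogd -e".
# Assumption: the critical app writes `date +%s` into the heartbeat file
# at regular intervals while it is working properly.

heartbeat_fresh() {
    # $1 = heartbeat file, $2 = maximum allowed age in seconds
    [ -r "$1" ] || return 1
    last=$(cat "$1") || return 1
    # Reject empty or non-numeric content instead of doing arithmetic on it.
    case $last in
        ''|*[!0-9]*) return 1 ;;
    esac
    now=$(date +%s)
    [ $((now - last)) -le "$2" ]
}

# When called with arguments, act as the check command:
# exit 0 if the heartbeat is fresh, non-zero otherwise.
if [ $# -ge 1 ]; then
    heartbeat_fresh "$1" "${2:-30}"
    exit $?
fi
```

Invoked as, say, `app_alive.sh /var/run/myapp.heartbeat 30`, it succeeds only if the app updated the file within the last 30 seconds. A plain "is the process still there" test would not be enough here, since a process stuck inside a driver still exists.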

in my case the kernel is hung, probably by my app, which is using 2 i2c devices. BTW, this does not happen very often,
maybe once a month, but it is annoying.

now the watchdog stuff:
1- the Allwinner/NanoPi NEO can only handle up to an 8 sec timeout (the next step is 16 sec (2^34 ns))
    watchdogd complains if >8 sec:
	aw_wdog0: Can't arm, timeout is more than 16 sec
   and continues trying - IMHO it should exit.

2- this is a bit more annoying:
	entering the debugger will trigger the timeout and it will then perform a clean reboot (*)
	doing a shutdown -r leaves the watchdog in some weird state so the reboot hangs when starting the watchdog
	  win some, lose some :-)

*: IMHO, entering the debugger should stop the hardware timeout - or at least that should be optional
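
To put numbers on item 1: as far as I understand the watchdog(4) interface, timeouts are passed to the driver as a power of two in nanoseconds, so the "16 sec" step is really 2^34 ns, a bit over 17 sec, which would explain why the driver rejects it as more than 16 sec. A small sketch of the arithmetic (the function name is made up, and the power-of-two-nanoseconds encoding is my reading of the interface, not something confirmed in this thread):

```shell
#!/bin/sh
# Sketch: convert a watchdog timeout exponent n into milliseconds,
# assuming the timeout handed to the driver is 2^n nanoseconds.

pow2_timeout_ms() {
    # $1 = exponent n; prints 2^n ns as whole milliseconds (truncated)
    echo $(( (1 << $1) / 1000000 ))
}

for n in 32 33 34 35; do
    echo "2^$n ns = $(pow2_timeout_ms "$n") ms"
done
```

With 2^33 ns (about 8.6 sec) as the largest step that fits, the flags quoted earlier would have to shrink to something like -t 8 -x 8 on this board, with a correspondingly shorter -s interval.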

cheers and thanks

	danny