Re: watchdog timer programming (progress)

From: Stephane Rochoy <stephane.rochoy_at_stormshield.eu>
Date: Wed, 02 Oct 2024 09:13:34 UTC

mike tancsa <mike@sentex.net> writes:

> On 10/1/2024 5:02 PM, mike tancsa wrote:
>> On 10/1/2024 4:03 PM, mike tancsa wrote:
>>> On 10/1/2024 2:07 AM, Stephane Rochoy wrote:
>>>>
>>>> mike tancsa <mike@sentex.net> writes:
>>>>
>>>>> On 9/30/2024 3:18 AM, Stephane Rochoy wrote:
>>>>>>
>>>>>> mike tancsa <mike@sentex.net> writes:
>>>>>>
>>>>>>> Do you know offhand how to set the system to just reboot? The ddb
>>>>>>> man page seems to imply I need options DDB as well, which is not
>>>>>>> in GENERIC, in order to set script actions.
>>>>>>
>>>>>> I would try the following:
>>>>>>
>>>>>>  ddb script kdb.enter.default=reset
>>>>>>
>>>>> If I build a custom kernel then that will work. But with GENERIC
>>>>> (I am tracking project via freebsd-update), it fails
>>>>>
>>>>> # ddb script kdb.enter.default=reset
>>>>> ddb: sysctl: debug.ddb.scripting.scripts: No such file or directory
>>>>>
>>>>> With a custom kernel, adding
>>>>>
>>>>> options DDB
>>>>>
>>>>> it works perfectly.
>>>>>
>>>>> Is there any way to get this to work without having ddb custom
>>>>> compiled in?
>>>>
>>>> I don't understand what's happening here. AFAIK, the code
>>>> corresponding to the soft watchdog being triggered is the
>>>> following:
>>>>
>>>>  static void
>>>>  wd_timeout_cb(void *arg)
>>>>  {
>>>>    const char *type = arg;
>>>>
>>>>  #ifdef DDB
>>>>    if ((wd_pretimeout_act & WD_SOFT_DDB)) {
>>>>      char kdb_why[80];
>>>>      snprintf(kdb_why, sizeof(kdb_why), "watchdog %s-timeout", type);
>>>>      kdb_backtrace();
>>>>      kdb_enter(KDB_WHY_WATCHDOG, kdb_why);
>>>>    }
>>>>  #endif
>>>>    if ((wd_pretimeout_act & WD_SOFT_LOG))
>>>>      log(LOG_EMERG, "watchdog %s-timeout, WD_SOFT_LOG\n", type);
>>>>    if ((wd_pretimeout_act & WD_SOFT_PRINTF))
>>>>      printf("watchdog %s-timeout, WD_SOFT_PRINTF\n", type);
>>>>    if ((wd_pretimeout_act & WD_SOFT_PANIC))
>>>>      panic("watchdog %s-timeout, WD_SOFT_PANIC set", type);
>>>>  }
>>>>
>>>> So without DDB, it should call panic. But in your case, it
>>>> called kdb_backtrace, so my initial hypothesis was wrong. What I
>>>> missed is that panic itself can call kdb_backtrace if gently
>>>> asked to do so:
>>>>
>>>>  #ifdef KDB
>>>>    if ((newpanic || trace_all_panics) && trace_on_panic)
>>>>      kdb_backtrace();
>>>>    if (debugger_on_panic)
>>>>      kdb_enter(KDB_WHY_PANIC, "panic");
>>>>    else if (!newpanic && debugger_on_recursive_panic)
>>>>      kdb_enter(KDB_WHY_PANIC, "re-panic");
>>>>  #endif
>>>>    /*thread_lock(td); */
>>>>    td->td_flags |= TDF_INPANIC;
>>>>    /* thread_unlock(td); */
>>>>    if (!sync_on_panic)
>>>>      bootopt |= RB_NOSYNC;
>>>>    if (poweroff_on_panic)
>>>>      bootopt |= RB_POWEROFF;
>>>>    if (powercycle_on_panic)
>>>>      bootopt |= RB_POWERCYCLE;
>>>>    kern_reboot(bootopt);
>>>>
>>>> So it definitely should reboot, but as it doesn't, maybe playing
>>>> with kern.powercycle_on_panic would help?
>>>>
>>>>
>>>
>>> Thank you for your continued help on this. Still no luck with the
>>> GENERIC kernel
>>>
>>> 0{p9999}# sysctl -w kern.powercycle_on_panic=1
>>> kern.powercycle_on_panic: 0 -> 1
>>> 0{p9999}# ps -auxwww | grep dog
>>> root     4752   0.0  0.2   12820  12916  -  S<s  15:38 0:00.01 watchdogd --softtimeout-action panic -t 10
>>> root     4792   0.0  0.0   12808   2644 u0  S+   15:39 0:00.00 grep dog
>>> 0{p9999}# kill -9 4752
>>> 0{p9999}# KDB: stack backtrace:
>>> #0 0xffffffff80b7fefd at kdb_backtrace+0x5d
>>> #1 0xffffffff80abec93 at hardclock+0x103
>>> #2 0xffffffff80abfe8b at handleevents+0xab
>>> #3 0xffffffff80ac0b7c at timercb+0x24c
>>> #4 0xffffffff810d0ebb at lapic_handle_timer+0xab
>>> #5 0xffffffff80fd8a71 at Xtimerint+0xb1
>>> #6 0xffffffff804b3685 at acpi_cpu_idle+0x2c5
>>> #7 0xffffffff80fc48f6 at cpu_idle_acpi+0x46
>>> #8 0xffffffff80fc49ad at cpu_idle+0x9d
>>> #9 0xffffffff80b67bb6 at sched_idletd+0x576
>>> #10 0xffffffff80aecf7f at fork_exit+0x7f
>>> #11 0xffffffff80fd7dae at fork_trampoline+0xe
>>>
>>> 0{p9999}#
>>>
>>> Where would be the best place to hack in something like this in the
>>> driver?
>>>  sysctl -w debug.kdb.panic_str="Watchdog Panic"
>>>
>>> which actually does panic the box
>>>
>>>
>>
>> One other datapoint. It seems starting
>>
>> watchdogd --softtimeout-action panic --softtimeout -t 10
>>
>> After kill -9
>> it eventually prints out
>>
>> watchdog soft-timeout, WD_SOFT_LOG
>>
>> to dmesg.  But after that, I cannot start a new watchdogd with just
>>
>> watchdogd --softtimeout-action panic -t 10
>>
>> I get
>>
>> watchdogd: setting WDIOC_SETSOFT 1: Invalid argument
>> watchdogd: patting the dog: Invalid argument
>
>
> I made these 2 changes to the driver
>
> --- watchdog.c  2024-10-01 20:37:28.667869000 -0400
> +++ /tmp/watchdog.c     2024-10-01 20:36:59.764330000 -0400
> @@ -61,7 +61,8 @@
>  static struct callout wd_softtimeo_handle;
>  static int wd_softtimer;       /* true = use softtimer instead of hardware
>                                    watchdog */
> -static int wd_softtimeout_act = WD_SOFT_LOG;   /* action for the software timeout */
> +// static int wd_softtimeout_act = WD_SOFT_LOG;        /* action for the software timeout */
> +static int wd_softtimeout_act = WD_SOFT_PANIC; /* action for the software timeout */
>
>  static struct cdev *wd_dev;
>  static volatile u_int wd_last_u;    /* last timeout value set by kern_do_pat */
> @@ -241,6 +242,7 @@
>  wd_timeout_cb(void *arg)
>  {
>         const char *type = arg;
> +       panic("mdt watchdog %s-timeout, WD_SOFT_PANIC set", type);
>
>  #ifdef DDB
>         if ((wd_pretimeout_act & WD_SOFT_DDB)) {
>
>
> and it works now
>
> KDB: stack backtrace:
> #0 0xffffffff80b8943d at kdb_backtrace+0x5d
> #1 0xffffffff80b3bfd1 at vpanic+0x131
> #2 0xffffffff80b3be93 at panic+0x43
> #3 0xffffffff8098b585 at wd_timeout_cb+0x15
> #4 0xffffffff80b59fcc at softclock_call_cc+0x12c
> #5 0xffffffff80b5b815 at softclock_thread+0xe5
> #6 0xffffffff80af61df at fork_exit+0x7f
> #7 0xffffffff80ff76ce at fork_trampoline+0xe
> Uptime: 1m13s
>
> it seems the soft timeout value action is never overridden for some
> reason.
>
> This kinda feels like a bug / pr ?

Well, honestly I'm puzzled:
- on the one hand, watchdog.c doesn't seem to use wd_softtimeout_act;
- on the other hand, hardclock seems to call watchdog_fire directly,
  and that just does kdb_enter or panic.

Note that wd_timeout_cb seems to handle both the pretimeout and the
timeout.
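
To make the first point concrete, here is a minimal user-space sketch
(not kernel code) of what I think is going on. Everything named sk_* is
made up for illustration, the flag values are placeholders rather than
the real WD_SOFT_* constants, and the "pre" tag is my assumption; only
"soft" is confirmed by the "watchdog soft-timeout" message above. The
first helper mirrors the quoted callback, which only ever tests
wd_pretimeout_act; the second is a hypothetical variant that picks the
mask from the type string.

```c
#include <string.h>

/* Placeholder flag values for illustration only. */
#define SK_SOFT_PANIC	0x01
#define SK_SOFT_LOG	0x04

/* Stand-ins for the two action masks kept by the driver. */
static int sk_pretimeout_act = SK_SOFT_LOG;	/* left at its default */
static int sk_softtimeout_act = SK_SOFT_PANIC;	/* what the user asked for */

/* Mirrors the quoted callback: the pretimeout mask is used for any type. */
static int
sk_action_as_quoted(const char *type)
{
	(void)type;
	return (sk_pretimeout_act);
}

/* Hypothetical variant: choose the mask that matches the timeout type. */
static int
sk_action_by_type(const char *type)
{
	if (strcmp(type, "soft") == 0)
		return (sk_softtimeout_act);
	return (sk_pretimeout_act);
}

/* Human-readable name of the strongest action set in the mask. */
static const char *
sk_describe(int act)
{
	if (act & SK_SOFT_PANIC)
		return ("panic");
	if (act & SK_SOFT_LOG)
		return ("log");
	return ("none");
}
```

With these values, sk_describe(sk_action_as_quoted("soft")) gives "log"
even though the soft-timeout action was set to panic, which would match
the WD_SOFT_LOG line you saw in dmesg. If that is really what happens,
it does look like a candidate for a PR.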

Regards,
-- 
Stéphane Rochoy
O: Stormshield