Re: watchdog timer programming (progress)

Reply: Stephane Rochoy : "Re: watchdog timer programming (progress)"
In reply to: mike tancsa : "Re: watchdog timer programming"
From: mike tancsa <mike_at_sentex.net>
Date: Wed, 02 Oct 2024 00:40:17 UTC
On 10/1/2024 5:02 PM, mike tancsa wrote:
> On 10/1/2024 4:03 PM, mike tancsa wrote:
>> On 10/1/2024 2:07 AM, Stephane Rochoy wrote:
>>>
>>> mike tancsa <mike@sentex.net> writes:
>>>
>>>>
>>>> On 9/30/2024 3:18 AM, Stephane Rochoy wrote:
>>>>>
>>>>> mike tancsa <mike@sentex.net> writes:
>>>>>
>>>>>> Do you know offhand how to set the system to just reboot? The ddb man
>>>>>> page seems to imply I need options DDB as well, which is not in GENERIC,
>>>>>> in order to set script actions.
>>>>>
>>>>> I would try the following:
>>>>>
>>>>>  ddb script kdb.enter.default=reset
>>>>>
>>>> If I build a custom kernel then that will work. But with GENERIC (I am
>>>> tracking the project via freebsd-update), it fails:
>>>>
>>>> # ddb script kdb.enter.default=reset
>>>> ddb: sysctl: debug.ddb.scripting.scripts: No such file or directory
>>>>
>>>> With a custom kernel, adding
>>>>
>>>> options DDB
>>>>
>>>> it works perfectly.
>>>>
>>>> Is there any way to get this to work without having ddb custom
>>>> compiled in ?
>>>
>>> I don't understand what's happening here. AFAIK, the code
>>> corresponding to the soft watchdog being triggered is the
>>> following:
>>>
>>>  static void
>>>  wd_timeout_cb(void *arg)
>>>  {
>>>    const char *type = arg;
>>>
>>>  #ifdef DDB
>>>    if ((wd_pretimeout_act & WD_SOFT_DDB)) {
>>>      char kdb_why[80];
>>>      snprintf(kdb_why, sizeof(kdb_why), "watchdog %s-timeout", type);
>>>      kdb_backtrace();
>>>      kdb_enter(KDB_WHY_WATCHDOG, kdb_why);
>>>    }
>>>  #endif
>>>    if ((wd_pretimeout_act & WD_SOFT_LOG))
>>>      log(LOG_EMERG, "watchdog %s-timeout, WD_SOFT_LOG\n", type);
>>>    if ((wd_pretimeout_act & WD_SOFT_PRINTF))
>>>      printf("watchdog %s-timeout, WD_SOFT_PRINTF\n", type);
>>>    if ((wd_pretimeout_act & WD_SOFT_PANIC))
>>>      panic("watchdog %s-timeout, WD_SOFT_PANIC set", type);
>>>  }
>>>
>>> So without DDB, it should call panic. But in your case, it
>>> called kdb_backtrace, so my initial hypothesis was wrong. What I
>>> missed is that panic itself can call kdb_backtrace if asked
>>> to do so:
>>>
>>>  #ifdef KDB
>>>    if ((newpanic || trace_all_panics) && trace_on_panic)
>>>      kdb_backtrace();
>>>    if (debugger_on_panic)
>>>      kdb_enter(KDB_WHY_PANIC, "panic");
>>>    else if (!newpanic && debugger_on_recursive_panic)
>>>      kdb_enter(KDB_WHY_PANIC, "re-panic");
>>>  #endif
>>>    /*thread_lock(td); */
>>>    td->td_flags |= TDF_INPANIC;
>>>    /* thread_unlock(td); */
>>>    if (!sync_on_panic)
>>>      bootopt |= RB_NOSYNC;
>>>    if (poweroff_on_panic)
>>>      bootopt |= RB_POWEROFF;
>>>    if (powercycle_on_panic)
>>>      bootopt |= RB_POWERCYCLE;
>>>    kern_reboot(bootopt);
>>>
>>> So it definitely should reboot, but since it doesn't, maybe playing with
>>> kern.powercycle_on_panic would help?
>>>
>>>
>>
>> Thank you for your continued help on this. Still no luck with the 
>> GENERIC kernel
>>
>> 0{p9999}# sysctl -w kern.powercycle_on_panic=1
>> kern.powercycle_on_panic: 0 -> 1
>> 0{p9999}# ps -auxwww | grep dog
>> root     4752   0.0  0.2   12820  12916  -  S<s  15:38 0:00.01 watchdogd --softtimeout-action panic -t 10
>> root     4792   0.0  0.0   12808   2644 u0  S+   15:39 0:00.00 grep dog
>> 0{p9999}# kill -9 4752
>> 0{p9999}# KDB: stack backtrace:
>> #0 0xffffffff80b7fefd at kdb_backtrace+0x5d
>> #1 0xffffffff80abec93 at hardclock+0x103
>> #2 0xffffffff80abfe8b at handleevents+0xab
>> #3 0xffffffff80ac0b7c at timercb+0x24c
>> #4 0xffffffff810d0ebb at lapic_handle_timer+0xab
>> #5 0xffffffff80fd8a71 at Xtimerint+0xb1
>> #6 0xffffffff804b3685 at acpi_cpu_idle+0x2c5
>> #7 0xffffffff80fc48f6 at cpu_idle_acpi+0x46
>> #8 0xffffffff80fc49ad at cpu_idle+0x9d
>> #9 0xffffffff80b67bb6 at sched_idletd+0x576
>> #10 0xffffffff80aecf7f at fork_exit+0x7f
>> #11 0xffffffff80fd7dae at fork_trampoline+0xe
>>
>> 0{p9999}#
>>
>> Where would be the best place in the driver to hack in something like this?
>>  sysctl -w debug.kdb.panic_str="Watchdog Panic"
>>
>> which actually does panic the box
>>
>>
>
> One other data point. If I start
>
> watchdogd --softtimeout-action panic --softtimeout -t 10
>
> and then kill -9 it, it eventually prints
>
> watchdog soft-timeout, WD_SOFT_LOG
>
> to dmesg.  But after that, I cannot start a new watchdogd with just
>
> watchdogd --softtimeout-action panic -t 10
>
> I get
>
> watchdogd: setting WDIOC_SETSOFT 1: Invalid argument
> watchdogd: patting the dog: Invalid argument


I made these two changes to the driver:

--- watchdog.c  2024-10-01 20:37:28.667869000 -0400
+++ /tmp/watchdog.c     2024-10-01 20:36:59.764330000 -0400
@@ -61,7 +61,8 @@
  static struct callout wd_softtimeo_handle;
  static int wd_softtimer;       /* true = use softtimer instead of hardware
                                    watchdog */
-static int wd_softtimeout_act = WD_SOFT_LOG;   /* action for the software timeout */
+// static int wd_softtimeout_act = WD_SOFT_LOG;        /* action for the software timeout */
+static int wd_softtimeout_act = WD_SOFT_PANIC; /* action for the software timeout */

  static struct cdev *wd_dev;
  static volatile u_int wd_last_u;    /* last timeout value set by kern_do_pat */
@@ -241,6 +242,7 @@
  wd_timeout_cb(void *arg)
  {
         const char *type = arg;
+       panic("mdt watchdog %s-timeout, WD_SOFT_PANIC set", type);

  #ifdef DDB
         if ((wd_pretimeout_act & WD_SOFT_DDB)) {


and it works now

KDB: stack backtrace:
#0 0xffffffff80b8943d at kdb_backtrace+0x5d
#1 0xffffffff80b3bfd1 at vpanic+0x131
#2 0xffffffff80b3be93 at panic+0x43
#3 0xffffffff8098b585 at wd_timeout_cb+0x15
#4 0xffffffff80b59fcc at softclock_call_cc+0x12c
#5 0xffffffff80b5b815 at softclock_thread+0xe5
#6 0xffffffff80af61df at fork_exit+0x7f
#7 0xffffffff80ff76ce at fork_trampoline+0xe
Uptime: 1m13s

It seems the soft timeout action value is never overridden, for some reason.
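
For what it's worth, here is a small userland model of one possible
explanation. It is only a sketch, not the driver code: the quoted
wd_timeout_cb tests wd_pretimeout_act, while the action requested with
--softtimeout-action presumably ends up in wd_softtimeout_act, so a soft
timeout would never see the requested action. The flag values, the default
of wd_pretimeout_act, and the helper names actions_as_quoted and
actions_by_type are assumptions made up for illustration.

```c
#include <string.h>

/* Illustrative values only; the real definitions are in <sys/watchdog.h>. */
#define WD_SOFT_PANIC	0x01
#define WD_SOFT_LOG	0x04

/* Userland stand-ins for the two driver variables. */
static int wd_pretimeout_act = WD_SOFT_LOG;	/* assumed default */
static int wd_softtimeout_act = WD_SOFT_LOG;	/* default shown in the diff */

/*
 * Hypothetical helper: the mask consulted by the callback as quoted in
 * this thread. It looks at wd_pretimeout_act whatever the timeout type is.
 */
static int
actions_as_quoted(const char *type)
{
	(void)type;		/* type is only used for the message text */
	return (wd_pretimeout_act);
}

/*
 * Hypothetical alternative: pick the mask that matches the timeout type,
 * so a "soft" timeout honours wd_softtimeout_act.
 */
static int
actions_by_type(const char *type)
{
	if (strcmp(type, "soft") == 0)
		return (wd_softtimeout_act);
	return (wd_pretimeout_act);
}
```

In this model, after setting wd_softtimeout_act = WD_SOFT_PANIC,
actions_as_quoted("soft") still yields WD_SOFT_LOG, which would match the
"watchdog soft-timeout, WD_SOFT_LOG" message, while actions_by_type("soft")
yields WD_SOFT_PANIC.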

This kinda feels like a bug / PR?

     ---Mike