From: Stephane Rochoy
To: mike tancsa
CC: Chris6 via freebsd-hardware
Subject: Re: watchdog timer programming (progress)
Date: Wed, 2 Oct 2024 11:13:34 +0200
In-Reply-To: <82dc6dbf-8aa7-45ef-8fe9-08dc54973c2c@sentex.net>
Message-ID: <86r08ydfb5.fsf@cthulhu.stephaner.labo.int>
List-Id: General discussion of FreeBSD hardware
List-Archive: https://lists.freebsd.org/archives/freebsd-hardware

mike tancsa writes:

> On 10/1/2024 5:02 PM, mike tancsa wrote:
>> On 10/1/2024 4:03 PM, mike tancsa wrote:
>>> On 10/1/2024 2:07 AM, Stephane Rochoy wrote:
>>>>
>>>> mike tancsa writes:
>>>>
>>>>> On 9/30/2024 3:18 AM, Stephane Rochoy wrote:
>>>>>>
>>>>>> mike tancsa writes:
>>>>>>
>>>>>>> Do you know off hand how to set the system to just reboot ?
>>>>>>> The ddb man page seems to imply I need options DDB as well,
>>>>>>> which is not in GENERIC in order to set script actions.
>>>>>>
>>>>>> I would try the following:
>>>>>>
>>>>>> ddb script kdb.enter.default=reset
>>>>>>
>>>>> If I build a custom kernel then that will work.
>>>>> But with GENERIC (I am tracking project via freebsd-update),
>>>>> it fails
>>>>>
>>>>> # ddb script kdb.enter.default=reset
>>>>> ddb: sysctl: debug.ddb.scripting.scripts: No such file or directory
>>>>>
>>>>> With a customer kernel, adding
>>>>>
>>>>> options     DDB
>>>>>
>>>>> it works perfectly.
>>>>>
>>>>> Is there any way to get this to work without having ddb custom
>>>>> compiled in ?
>>>>
>>>> I don't understand what's happening here. AFAIK, the code
>>>> corresponding to the soft watchdog being triggered is the
>>>> following:
>>>>
>>>> static void
>>>> wd_timeout_cb(void *arg)
>>>> {
>>>>     const char *type = arg;
>>>>
>>>> #ifdef DDB
>>>>     if ((wd_pretimeout_act & WD_SOFT_DDB)) {
>>>>         char kdb_why[80];
>>>>         snprintf(kdb_why, sizeof(kdb_why), "watchdog %s-timeout", type);
>>>>         kdb_backtrace();
>>>>         kdb_enter(KDB_WHY_WATCHDOG, kdb_why);
>>>>     }
>>>> #endif
>>>>     if ((wd_pretimeout_act & WD_SOFT_LOG))
>>>>         log(LOG_EMERG, "watchdog %s-timeout, WD_SOFT_LOG\n", type);
>>>>     if ((wd_pretimeout_act & WD_SOFT_PRINTF))
>>>>         printf("watchdog %s-timeout, WD_SOFT_PRINTF\n", type);
>>>>     if ((wd_pretimeout_act & WD_SOFT_PANIC))
>>>>         panic("watchdog %s-timeout, WD_SOFT_PANIC set", type);
>>>> }
>>>>
>>>> So without DDB, it should call panic. But in your case, it
>>>> called kdb_backtrace. So initial hypothesis was wrong.
>>>> What I missed is that panic was natively able to kdb_backtrace
>>>> if gently asked to do so:
>>>>
>>>> #ifdef KDB
>>>>     if ((newpanic || trace_all_panics) && trace_on_panic)
>>>>         kdb_backtrace();
>>>>     if (debugger_on_panic)
>>>>         kdb_enter(KDB_WHY_PANIC, "panic");
>>>>     else if (!newpanic && debugger_on_recursive_panic)
>>>>         kdb_enter(KDB_WHY_PANIC, "re-panic");
>>>> #endif
>>>>     /*thread_lock(td); */
>>>>     td->td_flags |= TDF_INPANIC;
>>>>     /* thread_unlock(td); */
>>>>     if (!sync_on_panic)
>>>>         bootopt |= RB_NOSYNC;
>>>>     if (poweroff_on_panic)
>>>>         bootopt |= RB_POWEROFF;
>>>>     if (powercycle_on_panic)
>>>>         bootopt |= RB_POWERCYCLE;
>>>>     kern_reboot(bootopt);
>>>>
>>>> So it definitely should reboot but as it don't, maybe playing
>>>> with kern.powercycle_on_panic would help?
>>>>
>>>>
>>>
>>> Thank you for your continued help on this. Still no luck with the
>>> GENERIC kernel
>>>
>>> 0{p9999}# sysctl -w kern.powercycle_on_panic=1
>>> kern.powercycle_on_panic: 0 -> 1
>>> 0{p9999}# ps -auxwww | grep dog
>>> root 4752 0.0 0.2 12820 12916 - S
>>> watchdogd --softtimeout-action panic -t 10
>>> root 4792 0.0 0.0 12808 2644 u0 S+ 15:39 0:00.00 grep dog
>>> 0{p9999}# kill -9 4752
>>> 0{p9999}# KDB: stack backtrace:
>>> #0 0xffffffff80b7fefd at kdb_backtrace+0x5d
>>> #1 0xffffffff80abec93 at hardclock+0x103
>>> #2 0xffffffff80abfe8b at handleevents+0xab
>>> #3 0xffffffff80ac0b7c at timercb+0x24c
>>> #4 0xffffffff810d0ebb at lapic_handle_timer+0xab
>>> #5 0xffffffff80fd8a71 at Xtimerint+0xb1
>>> #6 0xffffffff804b3685 at acpi_cpu_idle+0x2c5
>>> #7 0xffffffff80fc48f6 at cpu_idle_acpi+0x46
>>> #8 0xffffffff80fc49ad at cpu_idle+0x9d
>>> #9 0xffffffff80b67bb6 at sched_idletd+0x576
>>> #10 0xffffffff80aecf7f at fork_exit+0x7f
>>> #11 0xffffffff80fd7dae at fork_trampoline+0xe
>>>
>>> 0{p9999}#
>>>
>>> Where would be the best place to hack in something like this in
>>> the driver ?
>>> sysctl -w debug.kdb.panic_str="Watchdog Panic"
>>>
>>> which actually does panic the box
>>>
>>>
>>
>> One other datapoint. It seems starting
>>
>> watchdogd --softtimeout-action panic --softtimeout -t 10
>>
>> After kill -9
>> it eventually prints out
>>
>> watchdog soft-timeout, WD_SOFT_LOG
>>
>> to dmesg. But after that, I cannot start a new watchdogd with just
>>
>> watchdogd --softtimeout-action panic -t 10
>>
>> I get
>>
>> watchdogd: setting WDIOC_SETSOFT 1: Invalid argument
>> watchdogd: patting the dog: Invalid argument
>
>
> I made these 2 changes to the driver
>
> --- watchdog.c  2024-10-01 20:37:28.667869000 -0400
> +++ /tmp/watchdog.c     2024-10-01 20:36:59.764330000 -0400
> @@ -61,7 +61,8 @@
>  static struct callout wd_softtimeo_handle;
>  static int wd_softtimer;       /* true = use softtimer instead of hardware
>                                    watchdog */
> -static int wd_softtimeout_act = WD_SOFT_LOG;   /* action for the software timeout */
> +// static int wd_softtimeout_act = WD_SOFT_LOG;        /* action for the software timeout */
> +static int wd_softtimeout_act = WD_SOFT_PANIC; /* action for the software timeout */
>
>  static struct cdev *wd_dev;
>  static volatile u_int wd_last_u;    /* last timeout value set by kern_do_pat */
> @@ -241,6 +242,7 @@
>  wd_timeout_cb(void *arg)
>  {
>         const char *type = arg;
> +       panic("mdt watchdog %s-timeout, WD_SOFT_PANIC set", type);
>
>  #ifdef DDB
>         if ((wd_pretimeout_act & WD_SOFT_DDB)) {
>
>
> and it works now
>
> KDB: stack backtrace:
> #0 0xffffffff80b8943d at kdb_backtrace+0x5d
> #1 0xffffffff80b3bfd1 at vpanic+0x131
> #2 0xffffffff80b3be93 at panic+0x43
> #3 0xffffffff8098b585 at wd_timeout_cb+0x15
> #4 0xffffffff80b59fcc at softclock_call_cc+0x12c
> #5 0xffffffff80b5b815 at softclock_thread+0xe5
> #6 0xffffffff80af61df at fork_exit+0x7f
> #7 0xffffffff80ff76ce at fork_trampoline+0xe
> Uptime: 1m13s
>
> it seems the soft timeout value action is never overridden for
> some reason.
>
> This kinda feels like a bug / pr ?

Well, honestly I'm puzzled:

- on one hand, watchdog.c doesn't seem to use
  wd_softtimeout_act,
- and on the other hand, hardclock seems to directly call
  watchdog_fire, which just calls kdb_enter or panic.

Note that wd_timeout_cb seems to handle both the pretimeout
and the timeout.

Regards,
-- 
Stéphane Rochoy
O: Stormshield
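
PS: the puzzle above can be illustrated with a small userland
model. This is only a sketch under stated assumptions, not the
kernel code: it mirrors the tests in the wd_timeout_cb quoted
earlier in the thread, which all read wd_pretimeout_act, and
returns the actions that would fire instead of performing them.
The names wd_model, model_timeout_cb and SKETCH_SOFT_* are
hypothetical, the flag values are illustrative, and have_ddb
stands in for "#ifdef DDB".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative flag bits (assumption); the authoritative values are
 * the WD_SOFT_* definitions in sys/watchdog.h.
 */
enum {
	SKETCH_SOFT_PANIC  = 0x01,
	SKETCH_SOFT_DDB    = 0x02,
	SKETCH_SOFT_LOG    = 0x04,
	SKETCH_SOFT_PRINTF = 0x08
};

/* Hypothetical stand-in for the two static variables in watchdog.c. */
struct wd_model {
	int pretimeout_act;	/* the only field the callback reads */
	int softtimeout_act;	/* requested action, never read below */
};

/*
 * Mirror the tests of the quoted wd_timeout_cb() and return the
 * bitmask of actions it would take instead of performing them.
 * have_ddb plays the role of "#ifdef DDB".
 */
static int
model_timeout_cb(const struct wd_model *m, bool have_ddb)
{
	int fired = 0;

	assert(m != NULL);
	if (have_ddb && (m->pretimeout_act & SKETCH_SOFT_DDB))
		fired |= SKETCH_SOFT_DDB;
	if (m->pretimeout_act & SKETCH_SOFT_LOG)
		fired |= SKETCH_SOFT_LOG;
	if (m->pretimeout_act & SKETCH_SOFT_PRINTF)
		fired |= SKETCH_SOFT_PRINTF;
	if (m->pretimeout_act & SKETCH_SOFT_PANIC)
		fired |= SKETCH_SOFT_PANIC;
	return (fired);
}
```

With pretimeout_act set to the log bit and softtimeout_act set
to the panic bit, the model reports only the log action. That
would be consistent with the "watchdog soft-timeout,
WD_SOFT_LOG" message reported above, but whether the real
kernel state matches this model is an assumption that needs
checking against watchdog.c.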