stop_cpus_hard when multiple CPUs are panicking from an NMI
Attilio Rao
attilio at freebsd.org
Fri Nov 16 14:41:55 UTC 2012
On Fri, Nov 16, 2012 at 1:18 PM, Andriy Gapon <avg at freebsd.org> wrote:
> on 16/11/2012 14:30 Attilio Rao said the following:
>> On Fri, Nov 16, 2012 at 7:54 AM, Andriy Gapon <avg at freebsd.org> wrote:
>>> on 16/11/2012 00:58 Ryan Stone said the following:
>>>> At work we have some custom watchdog hardware that sends an NMI upon
>>>> expiry. We've modified the kernel to panic when it receives the watchdog
>>>> NMI. I've been trying the "stop_scheduler_on_panic" mode, and I've
>>>> discovered that when my watchdog expires, the system gets completely
>>>> wedged. After some digging, I've discovered that I have multiple CPUs
>>>> getting the watchdog NMI and trying to panic concurrently. One of the CPUs
>>>> wins, and the rest spin forever in this code:
>>>>
>>>>     /*
>>>>      * We don't want multiple CPU's to panic at the same time, so we
>>>>      * use panic_cpu as a simple spinlock. We have to keep checking
>>>>      * panic_cpu if we are spinning in case the panic on the first
>>>>      * CPU is canceled.
>>>>      */
>>>>     if (panic_cpu != PCPU_GET(cpuid))
>>>>         while (atomic_cmpset_int(&panic_cpu, NOCPU,
>>>>             PCPU_GET(cpuid)) == 0)
>>>>             while (panic_cpu != NOCPU)
>>>>                 ; /* nothing */
>>>>
>>>> The system wedges when stop_cpus_hard() is called, which sends NMIs to all
>>>> of the other CPUs and waits for them to acknowledge that they are stopped
>>>> before returning. However, an NMI will not be delivered to a CPU that is
>>>> already handling an NMI, so the other CPUs that got a watchdog NMI and are
>>>> spinning will never go into the NMI handler and acknowledge that they are
>>>> stopped.
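
For reference, the waiting side of stop_cpus_hard() boils down to roughly
the following (a simplified sketch, not the exact code from
sys/kern/subr_smp.c):

    u_int cpu;

    /*
     * The panicking CPU sends IPI_STOP_HARD (delivered as an NMI on x86)
     * to all other CPUs and then spins until each of them has acknowledged
     * the request by setting its bit in stopped_cpus, which is done in
     * cpustop_handler().
     */
    ipi_all_but_self(IPI_STOP_HARD);
    CPU_FOREACH(cpu) {
        if (cpu == PCPU_GET(cpuid))
            continue;
        while (!CPU_ISSET(cpu, &stopped_cpus))
            cpu_spinwait();
    }

A CPU that is stuck spinning on panic_cpu never runs cpustop_handler(), so
its bit in stopped_cpus is never set and the wait above never terminates.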
>>>
>>> I thought about this issue and fixed it (in my tree) in a different way:
>>> http://people.freebsd.org/~avg/cpu_stop-race.diff
>>>
>>> The need for spinlock_enter in the patch is not entirely clear.
>>> The main idea is that a CPU which calls cpu_stop and loses a race should
>>> voluntarily enter cpustop_handler.
>>> I am also not sure about the MI-cleanness of this patch.
>>
>> It is similar to what I proposed, but with some differences:
>> - It is not clean from an MI perspective
>
> OK.
>
>> - I don't think we need to treat it specially; I would just
>> unconditionally stop all the CPUs entering the "spinlock zone",
>> making the patch simpler.
>
> I couldn't understand this part.
I'm sorry, I think I misread your patch.
I was basically suggesting to fix Ryan's case by calling
cpustop_handler() (or the new MI interface) in the panic() function,
in the case where the CPU doesn't win the race for panic_cpu.
Basically doing:
Index: sys/kern/kern_shutdown.c
===================================================================
--- sys/kern/kern_shutdown.c    (revision 243154)
+++ sys/kern/kern_shutdown.c    (working copy)
@@ -568,15 +568,11 @@ panic(const char *fmt, ...)
 #ifdef SMP
     /*
      * We don't want multiple CPU's to panic at the same time, so we
-     * use panic_cpu as a simple spinlock. We have to keep checking
-     * panic_cpu if we are spinning in case the panic on the first
-     * CPU is canceled.
+     * use panic_cpu as a simple lock.
      */
     if (panic_cpu != PCPU_GET(cpuid))
-        while (atomic_cmpset_int(&panic_cpu, NOCPU,
-            PCPU_GET(cpuid)) == 0)
-            while (panic_cpu != NOCPU)
-                ; /* nothing */
+        if (atomic_cmpset_int(&panic_cpu, NOCPU, PCPU_GET(cpuid)) == 0)
+            cpustop_handler();
 
     if (stop_scheduler_on_panic) {
         if (panicstr == NULL && !kdb_active) {
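
For clarity, here is roughly what cpustop_handler() does on the CPU that
loses the race (a simplified sketch of the amd64 handler; the details
differ per architecture):

    void
    cpustop_handler(void)
    {
        u_int cpu = PCPU_GET(cpuid);

        savectx(&stoppcbs[cpu]);

        /* Tell the CPU driving the stop request that we are parked. */
        CPU_SET_ATOMIC(cpu, &stopped_cpus);

        /* Spin until somebody restarts us (e.g. the debugger). */
        while (!CPU_ISSET(cpu, &started_cpus))
            cpu_spinwait();

        CPU_CLR_ATOMIC(cpu, &started_cpus);
        CPU_CLR_ATOMIC(cpu, &stopped_cpus);
    }

So the CPU that loses the panic_cpu race simply acknowledges the stop
request and parks itself, and the winner's stop_cpus_hard() can complete
even though no further NMI can be delivered to the losers.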
In fact, it seems to me that the comment is outdated and no longer
represents the truth.
Attilio
--
Peace can only be achieved by understanding - A. Einstein