hwpmc: NMI cpuxx ... going to debugger

Tue Sep 3 18:03:47 UTC 2019

Hi,

I'm running a profiler which utilizes hwpmc and am frequently running into
this situation: after a process context switch out I get an NMI. The NMI
handler is unable to service it due to the fact that the pmc hardware is
marked as free at the end of the process_csw_out() (hwpmc_mod.c) routine.

Here are some logs I've collected when this happens:

 1750 231   49360676470980 MDP:CFG:1: cpu=231 ri=0 pm=0xfffff801e7b2ea00
  1749 231   49360676470360 CSW:SWI:1: cpu=231 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1748 249   49360676421260 CSW:SWO:3: cpu=249 ri=33
  1748 249   49360676421260 CSW:SWO:3: cpu=249 ri=33
  1747 249   49360676420320 MDP:SWO:1: pc=0xfffff8010a06c400
pp=0xfffff81077179000 enable-msr=0
  1746 249   49360676419000 CSW:SWO:1: cpu=249 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1745 231   49360675789440 MDP:INT:2: retval=0 isnull=16 ovrflw=0
  1744 231   49360675789240 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
  1743 231   49360675788260 CSW:SWO:3: cpu=231 ri=33
  1742 231   49360675788080 MDP:SWO:1: pc=0xfffff80106fe1400
pp=0xfffff81077179000 enable-msr=0
  1741 231   49360675787880 MDP:CFG:1: cpu=231 ri=0 pm=0x0
  1740 231   49360675787020 CSW:SWO:2: cpu=231 ri=17 val=-164 (samp)
  1739 231   49360675786940 MDP:REA:2: amd-read (post-munge) id=0 -> -164
  1738 231   49360675786880 MDP:REA:2: amd-read (pre-munge) id=0 -> 164
  1737 231   49360675786720 MDP:REA:1: amd-read id=0 class=2
  1736 231   49360675785180 MDP:STO:1: amd-stop ri=0
  1735 231   49360675784160 CSW:SWO:1: cpu=231 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1734 231   49360675617080 MDP:INT:2: retval=1 isnull=0 ovrflw=0
  1733 231   49360675615240 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
  1732 231   49360675505160 MDP:SWI:1: pc=0xfffff80106fe1400
pp=0xfffff81077179000 enable-msr=0
  1731 231   49360675503600 MDP:STA:2: amd-start config=0x530076
  1730 231   49360675503060 MDP:STA:1: amd-start cpu=231 ri=0
  1729 231   49360675500720 MDP:WRI:1: amd-write cpu=231 ri=0
v=fffffffffffd97d4
  1728 231   49360675500240 CSW:SWI:1: cpu=231 ri=17 new=157740
  1727 231   49360675498680 MDP:CFG:1: cpu=231 ri=0 pm=0xfffff801e7b2ea00
  1726 231   49360675498060 CSW:SWI:1: cpu=231 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1725 154   49360675363180 CSW:SWO:3: cpu=154 ri=33
  1724 154   49360675362020 MDP:SWO:1: pc=0xfffff80106d59600
pp=0xfffff81077179000 enable-msr=0
  1723 154   49360675360220 CSW:SWO:1: cpu=154 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1722 231   49360675009200 CSW:SWO:3: cpu=231 ri=33
  1721 231   49360675009020 MDP:SWO:1: pc=0xfffff80106fe1400
pp=0xfffff81077179000 enable-msr=0
  1720 231   49360675008320 MDP:CFG:1: cpu=231 ri=0 pm=0x0
  1719 231   49360675006220 CSW:SWO:2: cpu=231 ri=17 val=157740 (samp)
  1718 231   49360675005700 MDP:REA:2: amd-read (post-munge) id=0 -> 157740
  1717 231   49360675005240 MDP:REA:2: amd-read (pre-munge) id=0 ->
281474976552916
  1716 231   49360675004140 MDP:REA:1: amd-read id=0 class=2
  1715 231   49360675002660 MDP:STO:1: amd-stop ri=0
  1714 231   49360675001460 CSW:SWO:1: cpu=231 proc=0xfffff813e80ff000
(5930, java) pp=0xfffff81077179000
  1713 231   49360674941580 MDP:INT:2: retval=1 isnull=0 ovrflw=0
  1712 231   49360674939500 MDP:INT:1: cpu=231 tf=0xfffffe0001577f30 um=0
  1711 231   49360674793500 MDP:SWI:1: pc=0xfffff80106fe1400
pp=0xfffff81077179000 enable-msr=0
  1710 231   49360674792680 MDP:STA:2: amd-start config=0x530076
  1709 220   49360674792340 CSW:SWO:3: cpu=220 ri=33
  1708 220   49360674791060 MDP:SWO:1: pc=0xfffff80106f8a600
pp=0xfffff81077179000 enable-msr=0
  1707 231   49360674790840 MDP:STA:1: amd-start cpu=231 ri=0

The stray NMI is entry 1744 & 1745 (isnull=16). In entry 1741 when the
context switch out occurs, the hardware is marked free (pm=0x0). The NMI
handler is unable to find the hardware and ends up returning 0. Entry 1740
reports the val=-164, I'm not sure about the significance but the value
always seems to be negative when this problem occurs.

Once i get a stray NMI, I do not get any further interrupts which is
results in loss of samples in the generated pmclog.

I'm using hwpmc from the most recent stable/12 branch which is same as
current (with a stable/12 kernel). I'm running it on a java application
with a single event in sampling mode.

Any idea on how to resolve/avoid this problem?
Any pointers or suggestion are appreciated.

Thanks
Shreyank