high cpu irq load and slow boot after update from 10.4 to 11.2

Mon Nov 26 12:35:07 UTC 2018

26.11.2018 15:46, Gerrit Kühn wrote:

> A couple of weeks ago, I updated an older storage server (2 CPUs, 4 cores
> each, 48GB RAM, 36x4GB HDDs, 3 LSI-based mps controllers) from 10.4 to
> 11.2. The first thing I noticed was that booting takes much longer now. The
> system probes each HDD (there are 36 of them, attached to mps controllers)
> very slowly multiple times (I can see the light of each disk blinking,
> it takes seconds to go on to the next disk), the whole process takes
> several minutes (was much faster before).
> 
> A more nasty issue appears after a couple of weeks of operation (so far,
> roughly between 15 and 30 days):
> Suddenly there is a very high irq load on one of the CPU cores
> (cpu<n>:timer), causing high system load and high cpu load (top easily
> shows average load over 10, whereas it was always below 1 before). I cannot
> find any process or device as a culprit. First I thought this problem can
> only be made to go away by rebooting, but now I managed to get rid of it
> (at least for some time, don't know if or when it will be back) while
> checking out the latest source in background (I actually intended to fiddle
> with some kernel settings, but suddenly the issue was gone after
> persisting permanently over the weekend).
> 
> Looking around, I found a couple of vaguely similar reports (like
> https://lists.freebsd.org/pipermail/freebsd-current/2017-January/064419.html),
> but these all appear to be fixed by now.
> I have a couple of other storage machines (mostly mps-based, but always
> slightly different hardware) that show no such issue after updating to
> 11.2.
> 
> Any ideas?

Maybe this box has some clocking problems incompatible with the tickless
kernel. Try going back to the old periodic ticking with
sysctl kern.eventtimer.periodic=1 instead of the current default of 0.

Or, if you are curious, run ntpd if it is not already running, wait about an
hour, then look at its /var/db/ntpd.drift file to see whether the system clock
is good or not.
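
A rough sketch of that check, assuming the drift file holds a single
frequency offset in ppm on its first line. The function name drift_check
and the 500 ppm limit are only illustrative: 500 ppm is about the largest
frequency error ntpd can correct, so a value near it suggests a bad clock,
while a healthy clock usually sits far below it.

```shell
#!/bin/sh
# Hypothetical helper: classify the frequency offset recorded by ntpd.
# $1 = drift file (defaults to /var/db/ntpd.drift)
# $2 = limit in ppm (assumed default of 500, not an official threshold)
drift_check() {
    file="${1:-/var/db/ntpd.drift}"
    limit="${2:-500}"
    if [ ! -r "$file" ]; then
        echo "no drift file at $file (ntpd may not have written it yet)"
        return 2
    fi
    # Compare the absolute value of the first field against the limit.
    awk -v limit="$limit" 'NR == 1 {
        ppm = $1 + 0
        mag = (ppm < 0) ? -ppm : ppm
        verdict = (mag >= limit + 0) ? "suspicious" : "ok"
        printf "drift %.3f ppm: %s\n", ppm, verdict
    }' "$file"
}
```

Calling drift_check with no arguments reads /var/db/ntpd.drift; a different
file or limit can be passed as the first and second argument.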

Perhaps you can get better behaviour by changing the default value
of kern.timecounter.hardware to another one listed in kern.timecounter.choice;
the same goes for kern.eventtimer.timer and kern.eventtimer.choice.
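
Before changing anything, it may help to record what the box currently
uses. The following is only a sketch: the function name show_timer_settings
is made up for illustration, and it does nothing beyond reading the sysctl
OIDs mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the current timer-related settings and the
# alternatives the kernel offers. Read-only, so it does not need root.
show_timer_settings() {
    for oid in \
        kern.eventtimer.periodic \
        kern.eventtimer.timer \
        kern.eventtimer.choice \
        kern.timecounter.hardware \
        kern.timecounter.choice
    do
        # sysctl -n prints only the value; fall back if the OID is missing.
        value=$(sysctl -n "$oid" 2>/dev/null) || value="unavailable"
        echo "$oid = $value"
    done
}
```

After picking a value from one of the choice lists, it can be set at
runtime as root in the same way as kern.eventtimer.periodic=1 above, and
added to /etc/sysctl.conf if it should survive a reboot.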