My problems with stability on -current
Doug Barton
dougb at FreeBSD.org
Tue May 10 02:13:33 UTC 2011
New symptom: today (still running r221566) I compiled a small port, which
worked without any freezes or interactivity problems. Then I tried
compiling a larger port (java/openjdk6 if anyone cares) and still saw no
interactivity problems, but I got the "system wedge requiring power
cycle" problem that I was seeing previously and had tracked to the
one-shot timer update.
More below.
On 05/07/2011 02:43, Alexander Motin wrote:
> Doug Barton wrote:
>> On 05/05/2011 13:55, Alexander Motin wrote:
>>> I see several possibly unrelated problems there:
>>> - crashes are always crashes. They should be debugged.
>>> - calcru going backwards could have the same roots as lost wall clock
>>> time.
>>
>> I think you're right about that. What usually happens when the load
>> maxes out is that the system visibly freezes for a minute or 2, and when
>> it comes back to life the log is flooded with calcru messages. If it
>> stays up long enough after that the wall clock drift becomes noticeable.
>> This is in spite of running ntpd.
>
> These system freezes are very suspicious. Most time counters need only
> a few seconds to overflow, some even less. So a freeze of a few minutes
> will easily overflow most of them. The freezes are therefore probably
> the cause of the time problems, but the question now is what causes the
> freezes. You should try to investigate what is going on during them.
> Does the system do anything, are any interrupts working (`vmstat -i`
> just before and after), are there any interrupt storms, etc?
Here is the output on a mostly-idle system, shortly after reboot:
vmstat -i
interrupt                          total       rate
irq1: atkbd0                        1784          0
irq9: acpi0                            1          0
irq14: ata0                       213355         89
irq15: ata1                           58          0
irq17: wpi0                        74331         31
irq20: hpet0 uhci0+               787767        331
irq22: uhci2                       21453          9
irq256: hdac0                         11          0
Total                            1098760        462
At a more opportune time I'll try crashing it again and get another result.
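
For comparing a "before" and an "after" snapshot, a small helper along
these lines might do. This is only a sketch: irq_delta is a made-up name,
the /tmp file names in the usage comment are placeholders, and it assumes
both files hold saved `vmstat -i` output in the format shown above (a
header line, one line per source ending in total and rate, and a Total
line).

```shell
# irq_delta BEFORE AFTER
# Print how much each interrupt counter grew between two saved
# `vmstat -i` outputs. (Hypothetical helper, not part of FreeBSD.)
irq_delta() {
    awk '
        # Ignore the column header and the summary line in both files.
        $1 == "interrupt" || $1 == "Total" { next }
        {
            # The last two fields are total and rate; everything before
            # them is the source name, which may contain spaces.
            name = $1
            for (i = 2; i <= NF - 2; i++) name = name " " $i
            if (FNR == NR)
                before[name] = $(NF - 1)      # first file: remember total
            else
                printf "%-24s %10d\n", name, $(NF - 1) - before[name]
        }
    ' "$1" "$2"
}

# Intended use on the affected machine:
#   vmstat -i > /tmp/irq.before
#   (start the build, wait for the freeze to pass)
#   vmstat -i > /tmp/irq.after
#   irq_delta /tmp/irq.before /tmp/irq.after
```

A source that appears only in the second snapshot is reported with its
full count, since its "before" value is treated as zero.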
>>> If there are some problems with timer interrupts, timecounters
>>> could wrap unnoticed, which would cause random time jumps.
>>> - interactivity problems. I can't prove it is unrelated, but have no
>>> real ideas now.
>>>
>>> I would start from most obvious problems. I need to know more about
>>> crashes. As usual: how to trigger, stack backtraces, etc.
>>
>> Triggering is easy, I can start a buildworld with -j2, and a build of
>> ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
>> will reboot. I posted a panic message relative to r220282, (-current
>> archives, 4/4) but kib said it didn't make any sense. Usually I don't
>> get a panic at all.
>
> Could you point me to the thread?
Go to http://www.FreeBSD.org/
Click 'mailing lists'
Click 'listed in the FreeBSD Handbook.'
Click freebsd-current
Click freebsd-current Archives
Click April 2011
search for r220282
Voila! :)
>>> As for the time problems, I would try to collect more data:
>>> - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
>>> dmesg outputs;
>>
>> http://people.freebsd.org/~dougb/dougb-current-r221566.txt
>>
>>> - which eventtimer is used now, and does it help to switch to another
>>> one with the kern.eventtimer.timer sysctl?
>>
>> When I was trying to track down the problems last summer I vaguely
>> remember trying RTC, but eventually we realized that the real problem
>> was throttling, so I stopped specifying RTC and let it go back to the
>> default. What do you suggest I try?
>
> As I see it, you are now using HPET (chosen automatically). I would try
> switching to the LAPIC. Just make sure to disable C-states, if you have
> enabled them, so that the LAPIC timer won't stop.
Ok, so kern.eventtimer.timer="LAPIC" in /boot/loader.conf should do
that, right?
I don't use C-states (in part as a result of previous investigation) but
I do use powerd as such:
powerd_flags="-a adaptive -b adaptive -n adaptive"
>>> - does the timer run in periodic or one-shot mode, and does it help to
>>> switch to the other one?
>>
>> How could I tell, and how would I switch?
>
> `sysctl kern.eventtimer.periodic`.
kern.eventtimer.periodic: 0
> And read eventtimers(4) please.
I did that, but I don't see anything in there as to which choice is
one-shot, and how to change to periodic. I assume 0 is the default,
which I also assume is one-shot. Does setting that to 1 change to
periodic? Also, can I safely do this while the system is running, or
should it be in /boot/loader.conf as well?
>>> - if full CPU load makes time stop, try to track what is going on
>>> with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
>>> CPU load in one-shot mode you should have a stable timer interrupt rate
>>> of about hz+stathz.
>>
>> Ok, I'll do that tomorrow, tired now.
>>
>>> - if timer interrupts are not working well, you can build kernel with
>>> options KTR
>>> options ALQ
>>> options KTR_ALQ
>>> options KTR_COMPILE=(KTR_SPARE2)
>>> options KTR_ENTRIES=131072
>>> options KTR_MASK=(KTR_SPARE2)
>>> to track event timers operation and use ktrdump to save the trace when
>>> problem exist (preferably when it begins).
>>>
>>> And let's experiment with fresh CURRENT.
>>
>> Done and done. I'm up to r221566, and I added those options to my kernel
>> config. I ran ktrdump -cH -o ktrdumpfile and posted the results here:
>> http://people.freebsd.org/~dougb/ktrdumpfile.txt This was shortly after
>> boot, with no load. Not sure if it helps, but there you go.
>
> The dump looks fine, but I need a dump specifically from the time of the
> problem. Since time probably can't be trusted here, it would be
> nice to make the dump as localized as possible: clear the buffer with
> `sysctl debug.ktr.clear=1`, trigger a freeze for a few seconds, stop
> collecting with `sysctl debug.ktr.mask=0` and do the dump.
Ok, I'll give that a try after work.
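
The sequence described above could be scripted roughly as follows. This
is a sketch under assumptions, not a tested procedure: ktr_capture, the
RUN dry-run hook, the default file name and the sleep window are all
made up for illustration; only the two sysctl settings and the ktrdump
flags come from this thread. The freeze still has to be triggered by
hand while the script sleeps.

```shell
# RUN is a hypothetical dry-run hook: set RUN=echo to print the
# commands instead of executing them. Empty means "really run".
: "${RUN:=}"

# ktr_capture [OUTFILE] [SECONDS]
ktr_capture() {
    out=${1:-ktrdumpfile}   # where to save the trace (placeholder name)
    secs=${2:-5}            # how long to leave the buffer collecting

    # Empty the trace buffer so the dump covers only the freeze.
    $RUN sysctl debug.ktr.clear=1 || return 1
    # Reproduce the freeze during this window.
    $RUN sleep "$secs" || return 1
    # Stop collecting so later events do not overwrite the trace.
    $RUN sysctl debug.ktr.mask=0 || return 1
    # Save the buffer, with the same flags used for the earlier dump.
    $RUN ktrdump -cH -o "$out"
}

# Dry run, prints the four commands without touching the system:
#   RUN=echo; ktr_capture ktrdumpfile 5
```

Note that this leaves debug.ktr.mask at 0 afterwards, so tracing stays
off until the mask is set again or the machine is rebooted.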
Thanks,
Doug
--
Nothin' ever doesn't change, but nothin' changes much.
-- OK Go
Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price. :) http://SupersetSolutions.com/