Intermittent system hangs on 7.2-RELEASE-p1

Thu Sep 10 16:46:46 UTC 2009

On Thu, Aug 27, 2009 at 5:29 PM, John Baldwin<jhb at freebsd.org> wrote:
> Ah, cool, what you want to do is use KTR with KTR_SCHED and then use
> schedgraph.py (src/tools/sched) to get a visual picture of what the box does
> during a hang.  The timestamps in KTR are TSC cycle counts rather than an
> actual wall time which is why they look off.  If you have a lot of events you
> may want to use a larger KTR_ENTRIES size btw (I use 1048576 (2 ^ 20) here at
> work to get large (multiple-second) traces).

I'm still working on this.

I enabled KTR and set it up to log KTR_SCHED events.  Then, I wrote a
script to exercise the HTTP server that actually ran on that machine,
and set it to issue "sysctl debug.ktr.cpumask=0" and abort if a
request took over 2 seconds.  28,613 requests later, it tripped over
one that took 2007ms.
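
For anyone who wants to reproduce this, a minimal sketch of that kind
of watchdog loop follows.  It is not the script I actually ran: the
URL and the names fetch, freeze_ktr and run are made up for
illustration.  The only parts taken from the description above are
the 2-second threshold and the sysctl command.

```python
import subprocess
import time
import urllib.request

URL = "http://127.0.0.1/static.html"   # hypothetical test URL
THRESHOLD = 2.0                        # seconds, as in the test above


def fetch(url):
    """Fetch the URL once and return the elapsed wall time in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()
    return time.monotonic() - start


def freeze_ktr():
    """Stop KTR logging so the buffer still holds the stall (needs root)."""
    subprocess.check_call(["sysctl", "debug.ktr.cpumask=0"])


def run(timed_fetch, on_slow, threshold=THRESHOLD, limit=None):
    """Issue requests until one exceeds threshold.

    Returns (request_count, elapsed_seconds); elapsed_seconds is None
    if `limit` requests completed without a slow one.
    """
    count = 0
    while limit is None or count < limit:
        elapsed = timed_fetch()
        count += 1
        if elapsed > threshold:
            on_slow()
            return count, elapsed
    return count, None
```

Run as root, run(lambda: fetch(URL), freeze_ktr) loops until a request
exceeds the threshold, freezes the trace, and returns the request
count and the elapsed time of the slow request.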

(Just a refresher: this is a static file being served by an Apache
process that has nothing else to do but serve this file on a
relatively unloaded machine.)

I don't have access to any machines that can run X, so I did the best
I could to examine it from the shell.

First, this machine has two CPUs, so I split the KTR results up on a
per-CPU basis so I could look at each one individually.
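
A rough sketch of that split, assuming the dump was taken as text with
CPU and timestamp columns (for example "ktrdump -ct") so that each
record reads "<index> <cpu> <timestamp> <trace text>".  That column
layout is an assumption here, and split_by_cpu is a made-up name.

```python
from collections import defaultdict


def split_by_cpu(lines):
    """Group trace records by CPU number.

    Assumes each record looks like "<index> <cpu> <timestamp> <text>".
    Header and separator lines that do not fit that shape are skipped.
    Returns {cpu: [(timestamp, text), ...]} in input order.
    """
    per_cpu = defaultdict(list)
    for line in lines:
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        index, cpu, stamp, text = fields
        if not (index.isdigit() and cpu.isdigit() and stamp.isdigit()):
            continue
        per_cpu[int(cpu)].append((int(stamp), text.rstrip()))
    return dict(per_cpu)
```

Feeding it the lines of a saved dump file gives one list per CPU that
can then be paged through or written out separately.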

With KTR_ENTRIES set to 1048576, I got about 53 seconds of data with
just KTR_SCHED enabled.  Since I was interested in a 2.007 second
period of time right at the end, I hacked it down to the last 3.795
seconds.
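
Since the KTR timestamps are TSC cycle counts, trimming to a wall-time
window means dividing by the TSC frequency.  A sketch of that
arithmetic, assuming records are (timestamp, text) pairs and that the
frequency comes from something like sysctl machdep.tsc_freq.  The
function names are hypothetical.

```python
def last_seconds(records, seconds, tsc_freq):
    """Keep records whose timestamp is within `seconds` of the newest.

    records  -- iterable of (tsc_timestamp, text) tuples, in any order
    tsc_freq -- TSC ticks per second (assumed known for the machine)
    """
    records = list(records)
    if not records:
        return []
    newest = max(stamp for stamp, _ in records)
    cutoff = newest - seconds * tsc_freq
    return [(stamp, text) for stamp, text in records if stamp >= cutoff]


def span_seconds(records, tsc_freq):
    """Wall-clock span covered by the records, in seconds."""
    stamps = [stamp for stamp, _ in records]
    return (max(stamps) - min(stamps)) / tsc_freq
```

span_seconds on the full buffer gives the total wall time it covers,
and last_seconds(records, 3.795, tsc_freq) keeps only the tail.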

In the 3.795 seconds of trace, which include the entire 2.007-second
stall, CPU 0 was idle for 3.175 seconds.

In the same period, CPU 1 was idle for 3.2589 seconds.

I did the best I could to manually page through all the scheduling
activity on both CPUs during that 3.795-second window, and I didn't
see anything really disruptive.  Both CPUs were mainly idle, with
jumps into the clock and ethernet kernel threads, as well as httpd.

If I understand that correctly and have done everything right, that
means that whatever happened, it wasn't related to CPU contention or
scheduling issues of any sort.

So, a couple of follow-up questions:

First, what else should I be looking at?  I built the kernel with kind
of a lot of KTR flags
(KTR_LOCK|KTR_SCHED|KTR_PROC|KTR_INTR|KTR_CALLOUT|KTR_UMA|KTR_SYSC)
but enabling them all produces enough output that even 1048576 entries
don't always reach back two seconds; the volume of data is all but
unmanageable.

Second, is there any way to correlate the process address reported by
the KTR scheduler entries back to a PID?  It'd be nice to be able to
view the scheduler activity just for the process I'm interested in,
but I can't figure out which one it is. :)

Thanks!