My problems with stability on -current
Doug Barton
dougb at dougbarton.us
Thu May 5 08:06:26 UTC 2011
This is long, sorry. I wish I could condense things down to just the
answer, or even just the question, but here goes. I've used HEAD on my
main workstation(s) for many years. It's common for there to be ups and
downs, and that's fine. Lately, however, the problems have been debilitating.
First, a timeline. Since sometime before January 2008 I've been using a
Dell Latitude D620 laptop as my primary system. It has a Core 2 Duo
running at 2.33 GHz and 2 GB of RAM. I quad-boot it with Windows XP, FreeBSD
-current (amd64), another FreeBSD (usually 8.N-RELEASE i386), and Ubuntu.
Obviously I don't do a lot of compiling on the first and last of those, but
even under heavy load on 8.2-RELEASE I'm not seeing problems, so the problems
I _am_ seeing are not hardware-related.
I keep my system very close to stock. My kernel config is GENERIC minus
the devices I don't have, plus the following:
options EXT2FS
options IEEE80211_DEBUG # enable debug msgs
options VESA
device atapicam
device sound
device snd_hda
device snp
I was building with clang for a while, but when the problems started I
went back to gcc. I still have INVARIANTS on, but I disabled WITNESS
because with all the known and unfixed LORs it's kind of pointless. There's
nothing interesting in make.conf or src.conf either; the latter is just a
list of stuff not to build, KERNCONF, and MODULES_OVERRIDE.
Around December 2009 I started having problems with -current under
load. I reported them often; sometimes problems were found, sometimes
not. While trying to debug those problems I disabled throttling, which
helped. Switching to SCHED_4BSD also helped quite a bit with
interactivity under load, although it was still worse than on 8.x.
In October 2010 I was lucky enough to receive a donated Dell Optiplex
desktop, which I started using as my primary workstation. Around that
same time some work was being done on the scheduler(s) and various
related subsystems, and my desktop (which had a slightly faster Core 2
Duo and 4 GB of RAM) was running great. I assumed that the problems
were solved.
Then, two months ago, I packed up the desktop system and pulled out the
laptop again. I updated it to the latest -current, and all heck broke
loose. I couldn't do anything on the laptop that created even a
moderate load without it crashing. Trying to do something like a
buildworld (even without -j) would make the system absolutely crawl.
I'd get tons of the dreaded "calcru" messages about time going
backwards, and the system clock would lose literally minutes of wall
clock time. At one point, when I could keep it up long enough to build
world without crashing, it had lost 40 minutes of wall clock time by
the time it finished. I think that specific problem was introduced
sometime between March 15 and r220282.
While trying to find that problem, I uncovered another, deeper problem
with the "one-shot timers" from r212541. To make my binary search for
the problem described above easier, I was using a -current snapshot CD
from August 2010 that I had lying around. With it I could easily build
world with -j2, run X, and do normal desktop stuff (firefox, thunderbird,
pidgin, etc.) all at the same time. As I got closer to more recent
-current, the system would crash as soon as I put a load on it. I
eventually bisected down to that exact commit. I've been running
r212540 for over a week now without any problems, including lots of
port builds with FORCE_MAKE_JOBS, etc.
Alexander suggested some timer knobs to tweak, and I'll be glad to try
them once he gets back to me with more concrete suggestions, now that
he knows more about my specific problems.
Doug
--
Nothin' ever doesn't change, but nothin' changes much.
-- OK Go
Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price. :) http://SupersetSolutions.com/
More information about the freebsd-current mailing list