FreeBSD Crash without Errors, Warnings, or Panics
Julian Elischer
julian at elischer.org
Thu Apr 13 19:15:41 UTC 2006
Matthew Hagerty wrote:
> John Baldwin wrote:
>
>> On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
>>
>>
>>> Greetings,
>>>
>>> I'm running 6.0-RELEASE-p5 on a Toshiba built server: dual Xeon
>>> Intel motherboard with an LSI Logic MegaRAID (amr0) controller. This
>>> machine has been running for about 2 years now, and was very stable
>>> until I updated from 5.3 to 5.4, and now 6.0. The crashing seems to
>>> be totally random and I have had it crash in as little as 12 hours
>>> and as long as 143 days.
>>>
>>> When the box goes down it does so in a strange way. First, it still
>>> responds to network probes like ping (usually), however, all console
>>> access is ignored. Also, some network ports still respond, like a
>>> telnet to port 22 to test SSH will yield an SSH banner, but trying
>>> to connect with SSH just hangs. Sometimes this is also true of the
>>> SMTP server, but not always. This also makes it impossible for me
>>> to use CARP to swap to the recently purchased spare machine, since
>>> the network interface is generally still responding so CARP does not
>>> detect a problem.
>>>
>>> My biggest problem with this is that there are *never* any console
>>> messages or log entries in any logs, no warnings about disk failure,
>>> buffer exhaustion, system failures, etc. The machine simply seems
>>> to stop responding and the only way to correct the problem is a hard
>>> reboot.
>>>
>>> A strange thing did happen yesterday though, I believe I caught the
>>> box on the verge of failure. I was SSH'd in and did a ps to check
>>> things out. There were about 100 of these entries:
>>>
>>> 55050 ?? D 0:00.00 postmaster: ipa ipa ::1(63061) startup
>>> (postgres)
>>>
>>> The box runs a web-based app and connects to a local Postgres DB
>>> which seemed to be unable to start new connections being requested
>>> by the PHP scripts. At any rate, I stopped Apache and then tried to
>>> stop Postgres which resulted in (or just happened to coincide with)
>>> the box locking up and no longer responding to my SSH commands or
>>> attempts to reconnect with SSH. I hardly think this is a Postgres
>>> problem, but even if it was, a userland app should *not* be able to
>>> bring down a box...
>>>
>>> Can anyone shed some light on this, give me some options to try?
>>> What happened to kernel panics and such when there were serious
>>> errors going on? The only glimmer of information I have is that
>>> *one* time there was an error on the console about there not being
>>> any RAID controller available. I did purchase a spare controller
>>> and I'm about to swap it out and see if it helps, but for some
>>> reason I doubt it. If a controller like that was failing, I would
>>> certainly hope to see some serious error messages or panics going on.
>>>
>>> I have been running FreeBSD since version 1.01 and have never had a
>>> box so unstable in the last 12 or so years, especially one that is
>>> supposed to be "server" quality instead of the make-shift ones I put
>>> together with desktop hardware. And last, I'm getting sick of my
>>> Linux admin friends telling me "told you so! should have run
>>> Linux...", please give me something to stick in their pie holes!
>>>
>>
>>
>> It sounds like a livelock (or deadlock) more than a crash. Can you add
>> 'DDB' in your kernel config and break into the debugger when it hangs
>> and grab the output of 'ps'?
>>
>>
>
> I can probably figure out how to compile in DDB (I've never done it
> before though), but just two questions:
Add
    options DDB
to your kernel config file.
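
A minimal sketch of what that looks like on a 6.x kernel, as far as I can tell: since the KDB framework went in (5.3 and later), DDB is a backend of KDB, so I believe you want both options. MYKERNEL is just a placeholder for your own config file name, the i386 path is an assumption about your architecture, and the usual rebuild commands are shown as comments.

```
# /usr/src/sys/i386/conf/MYKERNEL   (hypothetical name and arch; use your own)
options         KDB     # kernel debugger framework (assumed necessary on 5.3 and later)
options         DDB     # the interactive in-kernel debugger itself

# Then rebuild and install as usual:
#   cd /usr/src
#   make buildkernel KERNCONF=MYKERNEL
#   make installkernel KERNCONF=MYKERNEL
```

You need to reboot onto the new kernel before the break key will do anything.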
>
> 1. How do I break into DDB and grab the ps output?
On the console, hit the <CTRL><ALT><ESC> keys (all at once);
that should put you into the debugger.
Then 'ps' will give you some output.
It's a lot to write down, but I've found a camera phone makes good enough
snapshots :-)
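
Once you are at the db> prompt, a session for a hang like this might go roughly as follows. This is only a sketch of the commands, not real output. The text after each # is my annotation and is not something to type, and 'show lockedvnods' is just my guess at something worth checking given the processes stuck in disk wait.

```
db> ps                  # list all processes with their state and wait channel
db> trace               # stack trace of the thread that was running when you broke in
db> show lockedvnods    # vnodes currently locked; useful if things are stuck on the filesystem
db> continue            # let the machine run again (if it can)
```

The 'ps' output is the part worth photographing: the wait channel column usually shows what everything is sleeping on.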
Alternatively you can use a serial console, but getting into the
debugger is harder:
you have to have compiled ALT_BREAK_TO_DEBUGGER
into your kernel by adding
# Solaris implements a new BREAK which is initiated by a character
# sequence CR ~ ^b which is similar to a familiar pattern used on
# Sun servers by the Remote Console.
options ALT_BREAK_TO_DEBUGGER
to the kernel config file you are using.
At the boot prompt (where the 10 second delay is) type
    set console="comconsole"
(from memory) to make the serial port the console.
Then you can do console stuff from another window/machine and capture
the output easily.
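
To avoid typing that at every boot, the same setting can go in /boot/loader.conf. A sketch, assuming the first serial port and the default 9600 baud:

```
# /boot/loader.conf
console="comconsole"    # use the serial port as the system console
```

From the other machine, something like 'cu -l /dev/cuad0 -s 9600' run inside script(1) should let you capture everything; the device name is an assumption and depends on that machine's OS and port. I believe cu and tip treat ~ at the start of a line as their own escape character, so you may need to type ~~ to get a single ~ through for the CR ~ ^b sequence.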
>
> 2. How can I login if the box is not responding to SSH or the
> console? It was only by sheer luck that I caught it yesterday just
> before the lockup, I have never been able to do that before.
>
> Thanks,
> Matthew
>
> _______________________________________________
> freebsd-hackers at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to
> "freebsd-hackers-unsubscribe at freebsd.org"