Kernel panic and hard disk failure
Will Senn
will.senn at gmail.com
Mon Nov 16 15:23:02 UTC 2015
Hi,
Disclosure: I am a FreeBSD newbie coming from Mac OS X and Linux, even
some Windows, so please be gentle. My questions are listed at the
bottom; here is the background.
I have a Dell OptiPlex 755 configured as follows:
Two SATA disks - 240GB SSD + 750GB HDD
8 GB RAM
Quad-core Intel 2.83 GHz CPU
FreeBSD 10.2-RELEASE
I came into my home office yesterday and the console was displaying a
disk error and the system was prompting for a shell. I entered shell and
a core dump was generated and saved in /var/crash. Since this was my
first experience with such an event, I just merrily went about my day
after a reboot. It happened again, later in the day. I figured it was a
bad hard drive and replaced it with a spare and restored from rsync
backup. After thinking a bit more about the situation, I decided to look
at the crash directory to see if there was anything to be learned there.
Apparently, there is quite a bit for me to learn yet :).
In /var/crash, there were 12 files and two symlinks:
bounds
core.txt.0
core.txt.1
core.txt.2
info.0
info.1
info.2
info.last
minfree
vmcore.0
vmcore.1
vmcore.2
vmcore.last
Three dumps? Hmm... I ran file on them to see if any were ASCII,
and sure enough, bounds, core.txt.X, info.X, and minfree were.
bounds contained the single number 3.
minfree contained the number 2048.
info.X contained basic crash dump information. The first had Panic
String: page fault; the other two had Panic String:
softdep_deallocate_dependencies: dangling deps.
The core.txt.X files look like the output of many different system tools,
concatenated together.
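For anyone who wants to pull the panic strings out without opening each file by hand, here is a minimal sketch in Python. It assumes only what I saw in my own info.X files, namely a line of the form "Panic String: ...". The helper names panic_string and summarize are made up for illustration, and the sample text is a stand-in rather than a full info file.

```python
import glob


def panic_string(info_text):
    """Return the text after 'Panic String:' in info.X content, or None."""
    for line in info_text.splitlines():
        # Split on the first colon only, since the panic message itself
        # may contain further colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Panic String":
            return value.strip()
    return None


def summarize(crash_dir="/var/crash"):
    """Print the panic string of every numbered info file in crash_dir."""
    for path in sorted(glob.glob(crash_dir + "/info.[0-9]*")):
        with open(path) as f:
            print(path, "->", panic_string(f.read()))


# Stand-in for one info.X file; only the Panic String line mirrors the
# real crash, the other line is a placeholder.
sample = (
    "  Some Other Field: placeholder\n"
    "  Panic String: softdep_deallocate_dependencies: dangling deps\n"
)
print(panic_string(sample))
```

Calling summarize() on the affected machine would list one panic string per dump; it will likely need to run as root, since the files in /var/crash are not world-readable on my system.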
Next, I looked at the vmcore.X files using kgdb /boot/kernel/kernel
/var/crash/vmcore.X, which produced yet more information (overload? not
yet, but getting there):
--- the first crash, snip
Unread portion of the kernel message buffer:
<118>Oct 28 19:33:05 freebird syslogd: exiting on signal 15
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80906fa9
stack pointer = 0x28:0xfffffe0231eb8830
frame pointer = 0x28:0xfffffe0231eb8a20
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 961 (kwin)
trap number = 12
panic: page fault
cpuid = 2
--- the second and third crashes, snip
Unread portion of the kernel message buffer:
Device ada1p1 went missing before all of the data could be written to
it; expect data loss.
panic: softdep_deallocate_dependencies: dangling deps
cpuid = 0
I didn't know what signal 15 was, so I did kill -l and figured out it
was SIGTERM. I got the feeling the reason I didn't know about the first
crash was that I probably killed/reset a reboot process or something.
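The same signal lookup can be done programmatically. This is a small sketch using Python's standard signal module; the function name signal_name is my own, and the mapping it returns is whatever the platform running the script defines.

```python
import signal


def signal_name(number):
    """Map a signal number to its symbolic name on this platform."""
    return signal.Signals(number).name


print(signal_name(15))  # SIGTERM
```

An unknown signal number raises ValueError, so a caller scanning arbitrary log lines would want to catch that.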
Out of this exercise, I have the following questions that I hope someone
can help with:
1. Is bounds the number of crashes in /var/crash, or what?
2. What is minfree?
3. What does it mean that the device went missing?
4. Does the information above sound like a faulty hard drive or are
there additional tests that will tell me more about the failure?
The device in question is the 750GB HDD. It is formatted UFS and is the
target of rsync jobs run from another FreeBSD machine and a Mac
through rsyncd. I replaced it out of an abundance of caution, but haven't
thrown away the drive yet.
Thanks,
Will
More information about the freebsd-questions
mailing list