Kernel panic and hard disk failure

Mon Nov 16 15:23:02 UTC 2015

Hi,

Disclosure: I am a FreeBSD newbie coming from Mac OS X and Linux (and even
some Windows), so please be gentle. My questions are listed at the
bottom; here is the background.

I have a Dell OptiPlex 755 configured as follows:
Two SATA disks - 240GB SSD + 750GB HDD
8 GB RAM
Quad core Intel 2.83 GHz CPU
FreeBSD 10.2-RELEASE

I came into my home office yesterday and the console was displaying a
disk error, and the system was prompting for a shell. I entered the shell,
and a core dump was generated and saved in /var/crash. Since this was my
first experience with such an event, I just merrily went about my day
after a reboot. It happened again later in the day. I figured it was a
bad hard drive, so I replaced it with a spare and restored from an rsync
backup. After thinking a bit more about the situation, I decided to look
at the crash directory to see if there was anything to be learned there.
Apparently, there is quite a bit for me to learn yet :).

In /var/crash, there were 11 files and two symlinks (info.last and vmcore.last):
bounds
core.txt.0
core.txt.1
core.txt.2
info.0
info.1
info.2
info.last
minfree
vmcore.0
vmcore.1
vmcore.2
vmcore.last

Three dumps? Hmm... I ran file on them to see whether any were ASCII text,
and sure enough, bounds, core.txt.X, info.X, and minfree were.

bounds contained the single number 3.
minfree contained the number 2048.
info.X contained basic crash dump information. The first had "Panic
String: page fault"; the other two had "Panic String:
softdep_deallocate_dependencies: dangling deps".
The core.txt.X files look like the output of many different system tools,
concatenated together.
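The manual check of the info.X files above can be scripted. This is a minimal Python sketch, assuming the dumps live in /var/crash and that each numbered info.N file contains a line beginning with "Panic String:", as described above. The function name panic_strings is hypothetical, not part of any FreeBSD tool.

```python
import glob
import os
import re


def panic_strings(crash_dir="/var/crash"):
    """Return {info file name: panic string} for each numbered info.N file."""
    results = {}
    for path in sorted(glob.glob(os.path.join(crash_dir, "info.*"))):
        name = os.path.basename(path)
        # Keep only numbered files (info.0, info.1, ...); skip info.last.
        if not re.fullmatch(r"info\.\d+", name):
            continue
        with open(path, errors="replace") as f:
            for line in f:
                # Assumed format: "Panic String: <text>", possibly indented.
                stripped = line.strip()
                if stripped.startswith("Panic String:"):
                    results[name] = stripped[len("Panic String:"):].strip()
                    break
    return results


if __name__ == "__main__":
    for name, msg in panic_strings().items():
        print(f"{name}: {msg}")
```

Skipping info.last avoids reporting the most recent dump twice, since it points at one of the numbered files.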

Next, I looked at the vmcore.X files using kgdb /boot/kernel/kernel
/var/crash/vmcore.X, which produced yet more information (Overload? Not
yet, but getting there.):

--- the first crash, snip
Unread portion of the kernel message buffer:
<118>Oct 28 19:33:05 freebird syslogd: exiting on signal 15

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address    = 0x18
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff80906fa9
stack pointer            = 0x28:0xfffffe0231eb8830
frame pointer            = 0x28:0xfffffe0231eb8a20
code segment        = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 961 (kwin)
trap number        = 12
panic: page fault
cpuid = 2

--- the second and third crashes, snip
Unread portion of the kernel message buffer:
Device ada1p1 went missing before all of the data could be written to 
it; expect data loss.
panic: softdep_deallocate_dependencies: dangling deps
cpuid = 0

I didn't know what signal 15 was, so I ran kill -l and figured out it
was SIGTERM. I suspect the reason I didn't know about the first crash
is that I killed or reset a reboot process, or something similar.
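The signal lookup done with kill -l above can also be done programmatically. This is a small sketch using Python's standard signal module; the helper name signal_name is hypothetical. Unlike kill -l, which prints the name without the SIG prefix, it returns the full symbolic name.

```python
import signal


def signal_name(num):
    """Translate a signal number into its symbolic name, e.g. 15 -> SIGTERM."""
    return signal.Signals(num).name


print(signal_name(15))  # SIGTERM
```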

Out of this exercise, I have the following questions that I hope someone 
can help with:

1. Is bounds the number of crashes in /var/crash, or something else?
2. What is minfree?
3. What does it mean that the device went missing?
4. Does the information above sound like a faulty hard drive, or are
there additional tests that will tell me more about the failure?

The device in question is the 750GB HDD. It is formatted UFS and is the
target of rsync jobs running on another FreeBSD machine and a Mac,
through rsyncd. I replaced it out of caution, but haven't thrown
away the drive yet.

Thanks,

Will