ATA problems again ... This time system froze!

Johan Ström johan at stromnet.org
Wed Aug 16 01:29:50 UTC 2006


On Jul 28, 2006, at 13:15 , Johan Ström wrote:

>
> On 17 jul 2006, at 17.40, Miroslav Lachman wrote:
>
>> Mike Tancsa wrote:
>> [..]
>>> Install the smartmontools from
>>> /usr/ports/sysutils/smartmontools/
>>> and post the output of
>>> smartctl -a /dev/ad8
>>
>> smartmontools was previously installed and running as daemon  
>> without any bad reports.
>> I cannot run "smartctl -a /dev/ad8" now, because my server
>> housing provider replaced the HDD with a new one, and after an hour
>> of synchronization I got "ad8: FAILURE - device detached". So the
>> provider replaced the whole server; only ad4 is an original piece of HW.
>> On the new server, synchronization was much faster than on the
>> previous server (1.5 hours compared to 5 hours), so I think it was
>> a HW problem.
>> Now I am running a stress test, copying /usr/ports to another
>> partition in an infinite loop.
>> I will post results later. (On the bad server, the test failed after
>> about 30 minutes. On the other server the test has been running fine
>> for a second day, so I think if the disk does not fail after 1 day,
>> the problem is solved.)
>>
>> Finally, I now think this was not GEOM/gmirror related. I tried
>> removing the ad8 provider from gmirror (gm0), booting the system
>> from gm0 with one provider (ad4), and testing ad8 mounted
>> separately; ad8 failed again.
>
> Just got another one..
>
> Jul 25 13:30:47 elfi kernel: ad4: FAILURE - device detached
> Jul 25 13:30:47 elfi kernel: subdisk4: detached
> Jul 25 13:30:47 elfi kernel: ad4: detached
> Jul 25 13:30:47 elfi kernel: GEOM_MIRROR: Device gm0s1: provider  
> ad4s1 disconnected.
> Jul 25 13:30:47 elfi kernel: g_vfs_done():mirror/gm0s1f[READ 
> (offset=46318008320, length=2048)]error = 6
> Jul 25 13:30:47 elfi kernel: g_vfs_done():mirror/gm0s1f[READ 
> (offset=77269614592, length=16384)]error = 6
>
> 6 days of uptime when this occurred... Both disks have been tested
> with PowerMax without a single problem (same with smartctl), and both
> SATA cables are new. So the only hardware problem that I can't rule
> out would be the mobo, but that is quite new too...
>
> Solutions? Try RELENG_6 as recommended earlier?

Okay, still on 6.1-RELEASE:

FreeBSD elfi.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue  
May  9 20:40:23 CEST 2006     johan at elfi.stromnet.org:/usr/obj/usr/ 
src/sys/GENERIC  i386

Uptime was approx 12 days since the last reboot for the RAID fix... I
just got home to find a box which doesn't respond to SSH; the monitor
tells me it has crashed totally. From /var/log/messages:

Aug 16 00:58:37 elfi kernel: ad4: FAILURE - device detached
Aug 16 00:58:37 elfi kernel: subdisk4: detached
Aug 16 00:58:37 elfi kernel: ad4: detached
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Cannot write metadata on  
ad4s1 (device=gm0s1, error=6).
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Cannot update metadata on  
disk ad4s1 (error=6).
Aug 16 00:58:37 elfi last message repeated 2 times
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Device gm0s1: provider  
ad4s1 disconnected.
Aug 16 00:58:37 elfi kernel: g_vfs_done():mirror/gm0s1f[READ 
(offset=112910630912, length=32768)]error = 6
Aug 16 00:58:37 labdator kernel: nfs: server 192.168.1.2 not  
responding, still trying
Aug 16 00:58:37 labdator kernel: nfs: server 192.168.1.2 OK
Aug 16 03:04:21 elfi syslogd: kernel boot file is /boot/kernel/kernel
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE 
(offset=2325168128, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE 
(offset=2325184512, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE 
(offset=2325200896, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE 
(offset=2325217280, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE 
(offset=2325233664, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE 
(offset=2325250048, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE 
(offset=2319169536, length=2048)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE 
(offset=2312404992, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: Copyright (c) 1992-2006 The FreeBSD  
Project.
Aug 16 03:04:21 elfi kernel: Copyright (c) 1979, 1980, 1983, 1986,  
1988, 1989, 1991, 1992, 1993, 1994
Aug 16 03:04:21 elfi kernel: The Regents of the University of  
California. All rights reserved.
Aug 16 03:04:21 elfi kernel: FreeBSD 6.1-RELEASE #3: Tue May  9  
20:40:23 CEST 2006
...(regular boot stuff)...

(labdator is a box with an NFS export from elfi mounted.)

dmesg shows me some other stuff not in messages:

ad4: FAILURE - device detached
subdisk4: detached
ad4: detached
GEOM_MIRROR: Cannot write metadata on ad4s1 (device=gm0s1, error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)] 
error = 6
ad6: FAILURE - device detached
subdisk6: detached
ad6: detached
GEOM_MIRROR: Cannot write metadata on ad6s1 (device=gm0s1, error=6).
GEOM_MIRROR: Cannot update metadata on disk ad6s1 (error=6).
GEOM_MIRROR: Device gm0s1: provider ad6s1 disconnected.
GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 destroyed.
GEOM_MIRROR: Device gm0s1 destroyed.
g_vfs_done():mirror/gm0s1f[READ(offset=27868381184, length=32768)] 
error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324807680, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324824064, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324840448, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324856832, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324873216, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1f[READ(offset=17173594112, length=32768)] 
error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325184512, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325200896, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325217280, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325233664, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325250048, length=16384)] 
error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2319169536, length=2048)] 
error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2312404992, length=16384)] 
error = 6
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
         The Regents of the University of California. All rights  
reserved.
FreeBSD 6.1-RELEASE #3: Tue May  9 20:40:23 CEST 2006
(...boot..)
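
To see at a glance which disks dropped out in output like the above, a small filter can count the "FAILURE - device detached" lines per device. This is only a sketch: the function name detached_disks is made up for illustration, and it assumes the adN device naming seen in these logs.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): count "device detached"
# failures per ATA disk in kernel log text.
# Reads the files given as arguments, or stdin if there are none.
detached_disks() {
    awk '/FAILURE - device detached/ {
        # The device name is the field shaped like "ad4:", whether or not
        # a syslog prefix ("Aug 16 ... kernel:") precedes it.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ad[0-9]+:$/) {
                sub(/:$/, "", $i)
                count[$i]++
            }
    }
    END { for (d in count) print d, count[d] }' "$@" | sort
}
```

It could be used as "dmesg | detached_disks" or "detached_disks /var/log/messages"; each output line is a device name followed by the number of matching lines.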

03:04 was when I got home; from other sources I've been told the box
died around 01:21 (IRC pinged out; maybe it was just logs failing to
write to disk, which froze irssi or something).

OK, so this time it didn't just fail the RAID (which it has done
before; a reboot and it started to rebuild). This time it took the
whole box down with it. This is the first time it has happened since
I got that new motherboard (see the earlier thread).

Later in boot:

Aug 16 03:04:21 elfi kernel: ad4: 286188MB <Maxtor 7L300S0 BANC1G10>  
at ata2-master SATA150
Aug 16 03:04:21 elfi kernel: ad6: 286188MB <Maxtor 7L300S0 BANC1G10>  
at ata3-master SATA150
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1 created  
(id=4118114647).
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider  
ad4s1 detected.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider  
ad6s1 detected.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Component ad4s1 (device  
gm0s1) broken, skipping.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider  
ad6s1 activated.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider  
mirror/gm0s1 launched.

Usually when the box has been rebooted before, the failed component
has been rebuilt automatically. This time it was not, so I solved it with:

$ gmirror forget gm0s1
$ gmirror insert gm0s1 ad4s1

And now it's rebuilding ad4 again...
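
The recovery steps can be wrapped in a small function so they are easy to review before touching the mirror. This is a sketch, not a tested procedure: the function name readd_component and the DRY_RUN variable are invented for illustration, and the defaults assume the gm0s1/ad4s1 names from this post. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical wrapper: re-add a dropped component to a gmirror device.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 and
# run as root to actually execute them.
readd_component() {
    mirror=${1:-gm0s1}   # mirror device name (assumed default)
    comp=${2:-ad4s1}     # component to re-add (assumed default)
    run=echo
    [ "${DRY_RUN:-1}" = "0" ] && run=
    # Forget components that are no longer connected to the mirror.
    $run gmirror forget "$mirror"
    # Add the component back to the mirror.
    $run gmirror insert "$mirror" "$comp"
    # Show the mirror state, including synchronization progress.
    $run gmirror status "$mirror"
}
```

Calling "readd_component gm0s1 ad4s1" prints the three gmirror commands in order; rerunning it with DRY_RUN=0 executes them.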

Any new hints? Should I try RELENG_6 instead?

Johan

