Bad sector on a gstripe

Sat Feb 9 14:03:34 UTC 2008

Hi all,

I'm having trouble locating a bad sector on a gstriped file system.
Smartd has been nagging about this single bad sector for months now;
there don't appear to be any new ones. It's about time I looked
into this...

I've got as far as working out the sector number within the partition
involved; my attempts are detailed after the problem description. I
tried newfs-ing the file system (it's my /tmp, so there's nothing of
relevance on it), but newfs-ing doesn't seem to have marked the sector
bad. Is there anything wrong with: newfs -U -o time /dev/stripe/tmp ? I
ran that from single-user mode after unmounting all file systems.

I tried opening the filesystem with fsdb, but it can't open the  
partition, only the striped file-system - how do I determine which  
sector I'm dealing with on a striped fs? And how do I write to it to  
have it marked as a bad sector?

I'm not sure whether this error means my disk is at the end of its
life. Smartd has been spamming me with this single error about the
same sector for months now (every half hour!), and it's only the
third error in the disk's SMART log. If I understand the
smartmontools docs correctly, this could well be caused by the sector
not having been written to all this time, which seems plausible to me;
it's near the end of a mostly empty /tmp...

From the power-on lifetime it appears the disk is a little over two
years old already, and it's been on pretty much 24/7. Maybe it is time
to replace it (probably with a server-grade model).

Time for some data.

The disk is an:
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model:     ST3200822A
Serial Number:    3LJ020SJ
Firmware Version: 3.01

smartctl says:

Error 3 occurred at disk power-on lifetime: 18356 hours (764 days + 20 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   40 51 00 30 ed 61 40  Error: UNC at LBA = 0x0061ed30 = 6417712

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   25 00 20 1f ed 61 40 00      15:42:14.650  READ DMA EXT
   25 00 40 9f e6 61 40 00      15:42:14.419  READ DMA EXT
   25 00 40 df f1 61 40 00      15:42:14.293  READ DMA EXT
   25 00 40 5f e6 61 40 00      15:42:14.049  READ DMA EXT
   25 00 40 5f e9 61 40 00      15:42:13.795  READ DMA EXT
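
For reference, the LBA that smartctl prints can be reproduced from the
register dump. The sketch below assumes the usual 28-bit taskfile
layout (SN, CL and CH hold LBA bits 0-23, and the low nibble of DH
holds bits 24-27); for the 48-bit READ DMA EXT commands it further
assumes the high-order bytes, which the log does not show, are zero.
The helper name lba28 is made up for illustration.

```python
def lba28(sn, cl, ch, dh):
    """Combine taskfile bytes into an LBA (assumed 28-bit register layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Error registers: SN=30 CL=ed CH=61 DH=40
error_lba = lba28(0x30, 0xED, 0x61, 0x40)

# Most recent READ DMA EXT: SC=20 (sector count), SN=1f CL=ed CH=61 DH=40
read_start = lba28(0x1F, 0xED, 0x61, 0x40)
read_count = 0x20
read_end = read_start + read_count - 1

print("error LBA:", error_lba)
print("last read covered LBAs", read_start, "to", read_end)
print("error inside last read:", read_start <= error_lba <= read_end)
```

Under those assumptions the failing LBA falls inside the range of the
last read command, which is consistent with that read being the one
that hit the unreadable sector.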

According to fdisk and bsdlabel that's on partition e of slice 1:

# fdisk -s /dev/ad0
/dev/ad0: 387621 cyl 16 hd 63 sec
Part        Start        Size Type Flags
    1:          63   390716802 0xa5 0x80

So the bad sector is at 6417712 - 63 = 6417649 in /dev/ad0s1.

# bsdlabel /dev/ad0s1
# /dev/ad0s1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
   a:   524288        0    4.2BSD     2048 16384 32776
   b:  4194304   524288      swap
   c: 390716802        0    unused        0     0         # "raw" part, don't edit
   d:  1048576  4718592    4.2BSD     2048 16384     8
   e:  1048576  5767168    4.2BSD     2048 16384     8
   f: 20971520  6815744    4.2BSD     2048 16384 28552
   g: 362929538 27787264    4.2BSD     2048 16384 28552

So the bad sector is 6417649 - 5767168 = 650481 in partition
/dev/ad0s1e, at around 62% of its total size. This is where I started
to get lost...
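
The arithmetic so far can be double-checked with a short script. This
is only a sketch: the sizes and offsets are copied from the fdisk and
bsdlabel output above (the "c" partition is left out because it spans
the whole slice), and the function name locate is made up for
illustration.

```python
SLICE_START = 63  # start of ad0s1 on ad0, in sectors (from fdisk)

# partition: (size, offset) in sectors relative to ad0s1 (from bsdlabel)
PARTITIONS = {
    "a": (524288, 0),
    "b": (4194304, 524288),
    "d": (1048576, 4718592),
    "e": (1048576, 5767168),
    "f": (20971520, 6815744),
    "g": (362929538, 27787264),
}

def locate(disk_lba):
    """Return (partition, sector within partition, fraction of partition)."""
    slice_sector = disk_lba - SLICE_START
    for name, (size, offset) in PARTITIONS.items():
        if offset <= slice_sector < offset + size:
            rel = slice_sector - offset
            return name, rel, rel / size
    raise ValueError("sector is outside every listed partition")

name, rel, frac = locate(6417712)
print(name, rel, round(frac, 2))
```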

I set up partition ad0s1e to be used in /dev/stripe/tmp:

# gstripe list tmp
Geom name: tmp
State: UP
Status: Total=2, Online=2
Type: AUTOMATIC
Stripesize: 4096
ID: 1982480573
Providers:
1. Name: stripe/tmp
    Mediasize: 1073733632 (1.0G)
    Sectorsize: 512
    Mode: r1w1e1
Consumers:
1. Name: ad0s1e
    Mediasize: 536870912 (512M)
    Sectorsize: 512
    Mode: r1w1e2
    Number: 0
2. Name: ad1s1e
    Mediasize: 536870912 (512M)
    Sectorsize: 512
    Mode: r1w1e2
    Number: 1
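
One way to translate a sector on a consumer into a sector on
stripe/tmp is sketched below. It assumes gstripe lays data out as a
plain round-robin RAID0 (stripe 0 on consumer 0, stripe 1 on consumer
1, and so on) and keeps its metadata in the last sector of each
consumer. Neither assumption is confirmed by the output above,
although the second one does reproduce the reported Mediasize. The
function names are made up for illustration.

```python
SECTOR_SIZE = 512
STRIPE_SIZE = 4096   # "Stripesize" from gstripe list
NDISKS = 2           # number of consumers
PER_STRIPE = STRIPE_SIZE // SECTOR_SIZE  # sectors per stripe

def consumer_to_provider(sector, disk_number):
    """Map a sector on one consumer to a sector on the striped provider
    (assumes a simple round-robin layout)."""
    row, within = divmod(sector, PER_STRIPE)
    return (row * NDISKS + disk_number) * PER_STRIPE + within

def provider_to_consumer(sector):
    """Inverse mapping: provider sector -> (disk_number, consumer sector)."""
    stripe, within = divmod(sector, PER_STRIPE)
    row, disk_number = divmod(stripe, NDISKS)
    return disk_number, row * PER_STRIPE + within

# Sanity check of the assumed layout against the reported Mediasize:
# each 1048576-sector consumer loses one metadata sector and is then
# rounded down to a whole number of stripes.
usable = (1048576 - 1) // PER_STRIPE * PER_STRIPE
print(usable * NDISKS * SECTOR_SIZE)      # compare with 1073733632

# Sector 650481 on ad0s1e, which is consumer Number 0.
stripe_sector = consumer_to_provider(650481, 0)
print(stripe_sector, stripe_sector * SECTOR_SIZE)
print(provider_to_consumer(stripe_sector))
```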

I tried (using -r to keep it from marking my file systems dirty while
I was testing):

# fsdb -r /dev/ad0s1e
** /dev/ad0s1e (NO WRITE)
Cannot find file system superblock

LOOK FOR ALTERNATE SUPERBLOCKS? no

fsdb: cannot set up file system `/dev/ad0s1e'
Exit 1

and:

# fsdb -r /dev/stripe/tmp
** /dev/stripe/tmp (NO WRITE)
Examining file system `/dev/stripe/tmp'
Last Mounted on /tmp
current inode: directory
I=2 MODE=40777 SIZE=512
         BTIME=Feb  9 12:01:18 2008 [0 nsec]
         MTIME=Feb  9 12:54:41 2008 [0 nsec]
         CTIME=Feb  9 12:54:41 2008 [0 nsec]
         ATIME=Feb  9 13:23:07 2008 [0 nsec]
OWNER=root GRP=wheel LINKCNT=7 FLAGS=0 BLKCNT=4 GEN=7a46458d
fsdb (inum: 2)>

I figured the findblk command would give me the inode of the problem  
area (although there won't be one if there are no files in that  
sector I think?), but I'm dealing with sectors striped across two  
disks... I have no idea which "block number" would be appropriate.  
The disk containing the bad sector is apparently the first in the  
stripe, that much I gathered.
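
A possible way to arrive at a candidate block number is sketched
below. It rests on two unconfirmed assumptions: that the stripe is a
plain round-robin layout across the two consumers, and that findblk
expects 512-byte disk block numbers counted from the start of the
device fsdb has open (here /dev/stripe/tmp). The fragment and block
sizes come from the bsdlabel output (fsize 2048, bsize 16384); the
variable names are made up for illustration.

```python
SECTOR_SIZE = 512
STRIPE_SECTORS = 4096 // SECTOR_SIZE  # Stripesize / Sectorsize
NDISKS = 2
FSIZE = 2048      # fragment size from bsdlabel
BSIZE = 16384     # block size from bsdlabel

consumer_sector = 650481   # bad sector within ad0s1e
disk_number = 0            # ad0s1e is consumer Number 0

# Assumed round-robin mapping from the consumer to the striped device.
row, within = divmod(consumer_sector, STRIPE_SECTORS)
stripe_sector = (row * NDISKS + disk_number) * STRIPE_SECTORS + within

# The same location expressed in file system units.
fragment, sector_in_fragment = divmod(stripe_sector, FSIZE // SECTOR_SIZE)
block, sector_in_block = divmod(stripe_sector, BSIZE // SECTOR_SIZE)

print("sector on stripe/tmp:", stripe_sector)
print("fragment:", fragment, "sector within fragment:", sector_in_fragment)
print("block:", block, "sector within block:", sector_in_block)
```

If findblk turns out to want fragment numbers rather than 512-byte
disk blocks, the fragment value would be the one to try instead.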

So, how to continue?

Regards,
Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
