sanity check: is 9211-8i, on 8.3, with IT firmware still "the one"

Sat Jan 21 03:43:18 UTC 2012

Data points update:

I thought this problem might be related to the specific RAID controller
(LSI 9211-8i - "R") that the disks were first used on, so I tried that
controller with a new, different set of disks. Those disks work fine
afterwards:

ada3 at ata0 bus 0 scbus6 target 0 lun 0
ada3: <ST31000524AS JC4B> ATA-8 SATA 3.x device
ada3: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
ada3: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada3: Previously was known as ad0

ada4: <ST31000524AS JC4B> ATA-8 SATA 3.x device
ada4: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
ada4: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada4: Previously was known as ad1

bd3# dd if=/dev/zero of=/dev/ada3 count=8
8+0 records in
8+0 records out
4096 bytes transferred in 0.006000 secs (682662 bytes/sec)

bd3# dd if=/dev/zero of=/dev/ada4 count=8
8+0 records in
8+0 records out
4096 bytes transferred in 0.001953 secs (2097408 bytes/sec)

I ran SeaTools on one of the disks from the first set
(ST1000DL002-9TT153). On the long test the tool reported errors that
it could not fix. I didn't see much point in testing the second disk.

So, two separately purchased disks from the same vendor (TigerDirect)
are both bad? What are the odds of that? Hmm...

On Fri, 2012-01-20 at 10:18 -0800, Jeremy Chadwick wrote:
> On Fri, Jan 20, 2012 at 08:31:34AM -0800, Dennis Glatting wrote:
> > On Fri, 2012-01-20 at 07:31 -0800, Jeremy Chadwick wrote:
> > 
> > > On Fri, Jan 20, 2012 at 06:22:11AM -0800, Dennis Glatting wrote:
> > > > I am having a problem with Seagate ST1000DL002 disks but I haven't yet
> > > > determined whether it is the disks themselves (they -- two of them, new
> > > > -- fail under a MB controller too).
> > > 
> > > Assuming the disks are seen directly on the bus (e.g. show up as daX,
> > > adaX, or whatever), please install ports/sysutils/smartmontools (make
> > > sure you're using version 5.42 or newer) and please provide output from
> > > the following command: "smartctl -a /dev/XXX" where XXX is the device
> > > name of the ST1000DL002 disk(s).  Please be sure to state which device
> > > name is associated with which smartctl output.  You can delete or
> > > remove the disk serial numbers from the output (for privacy) if you
> > > wish.  I'll be happy to review the data and tell you whether or not the
> > > disks themselves are showing problems or if the issue is elsewhere.
> > 
> > That is the motivation I needed to reboot that system, which was 50%
> > through a task. That said, as remains the case today, for the last 20
> > years I haven't been able to find that "Any Key" on reboot. :)
> > 
> > Regardless...
> 
> First off, let's start with the full picture.  Readers need to know
> exactly what is going on within your controller setup, what disks are
> connected to what, etc.  Taken from your full dmesg below, and turned
> into something easy-to-read (mostly):
> 
> Controller mps0
>   --> LSI SAS2008
>   --> IRQ 19 on pci1
>   --> Firmware 12.00.00.00
>   --> Disks attached:
>       --> da0  --> WDC WD25EZRS, SATA300
>       --> da1  --> WDC WD25EZRS, SATA300
>       --> da2  --> WDC WD25EZRS, SATA300
>       --> da3  --> WDC WD25EZRS, SATA300
>       --> da4  --> WDC WD25EZRS, SATA300
>       --> da5  --> WDC WD25EZRS, SATA300
>       --> da6  --> WDC WD25EZRS, SATA300
>       --> da7  --> WDC WD25EZRS, SATA300
> 
> Controller mps1
>   --> LSI SAS2008
>   --> IRQ 19 on pci5
>   --> Firmware 12.00.00.00
>   --> Disks attached:
>       --> None
> 
> Controller mps2
>   --> LSI SAS2008
>   --> IRQ 16 on pci6
>   --> Firmware 12.00.00.00
>   --> Disks attached:
>       --> da8  --> WDC WD25EZRS, SATA300
>       --> da9  --> WDC WD25EZRS, SATA300
>       --> da10 --> WDC WD25EZRS, SATA300
>       --> da11 --> WDC WD25EZRS, SATA300
>       --> da12 --> ST1000DL002, SATA300
> 
> Controller ahci0
>   --> ATI IXP700 AHCI (4-port)
>   --> IRQ 19 on pci0
>   --> Disks attached:
>       --> ahcich0 --> ada0 --> Corsair Force 3 SSD, SATA600
>       --> ahcich1 --> ada1 --> OCZ-AGILITY2 SSD, SATA300
>       --> ahcich2 --> ada2 --> ST31000333AS, SATA300
> 
> Controller ata0
>   --> ATI IXP700/800 ATA133 (2-port/4-device, PATA)
>   --> IRQ <???> on pci0
>   --> I would assume this would be on IRQ 14 or 15, sigh...
>   --> Disks attached:
>       --> None
> 
> Now that we have a full picture, let's continue.
> 
> > An attempt to write to it:
> > 
> > bd3# dd if=/dev/zero of=/dev/da12
> > dd: /dev/da12: Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes transferred in 0.378153 secs (0 bytes/sec)
> 
> The dd command you executed to write zeros to the disk, 512 bytes at a
> time, starting at LBA 0, failed when writing the first 512 bytes.  So,
> from my perspective, writing to LBA 0 is failing.
> 
> You should also keep in mind that this dd command to zero the disk (if
> it were to work) would take a very long time to complete.  If you used a
> larger block size (bs=64k or maybe larger), it would be a lot faster.
> Just a tip.  Starting with bs=512 (default) is fine, or in this case
> using 4096 would probably be better (see below), but whatever.
> 
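To put the block-size tip in numbers: a minimal Python sketch that counts
how many writes dd would have to issue to zero the whole ST1000DL002 at a
few bs values. The capacity is the 1,000,204,886,016 bytes that smartctl
reports for the drive in this thread; the function name dd_write_calls is
made up for illustration, and the sketch says nothing about actual
throughput.

```python
# User capacity of the ST1000DL002 as reported by smartctl in this thread.
DISK_BYTES = 1_000_204_886_016


def dd_write_calls(disk_bytes: int, block_size: int) -> int:
    """Number of blocks dd must write to cover the disk (last one may be short)."""
    # Integer ceiling division, so a trailing partial block is counted too.
    return (disk_bytes + block_size - 1) // block_size


for bs in (512, 4096, 64 * 1024, 1024 * 1024):
    print(f"bs={bs:>7}: {dd_write_calls(DISK_BYTES, bs):>13,} writes")
```

Going from bs=512 to bs=64k cuts the number of writes by a factor of 128,
which is where most of the speed-up would be expected to come from.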
> > The disk is presently connected  to this device (LSI 9211-8i) but I have
> > also had it connected to the devices on the MB and I think to a
> > SuperMicro board. I have also tried a different LSI board.
> 
> Thanks for sharing this -- this is important information, but let's not
> start moving the drive around any more, okay?  There's no point.  The
> information you've given is enough, and I'll explain it in detail.
> 
> > {snipping for brevity}
> > 
> > bd3# smartctl -a /dev/da12
> > smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-STABLE amd64] (local build)
> > Copyright (C) 2002-11 by Bruce Allen,
> > http://smartmontools.sourceforge.net
> > 
> > === START OF INFORMATION SECTION ===
> > Model Family:     Seagate Barracuda Green (Adv. Format)
> > Device Model:     ST1000DL002-9TT153
> > Serial Number:    W1V06SLR
> > LU WWN Device Id: 5 000c50 037e11be9
> > Firmware Version: CC32
> > User Capacity:    1,000,204,886,016 bytes [1.00 TB]
> > Sector Size:      512 bytes logical/physical
> > Device is:        In smartctl database [for details use: -P show]
> > ATA Version is:   8
> > ATA Standard is:  ATA-8-ACS revision 4
> > Local Time is:    Fri Jan 20 08:22:34 2012 PST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > {snipping for brevity}
> > 
> > SMART Attributes Data Structure revision number: 10
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >   1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always -       241488
> >   3 Spin_Up_Time            0x0003   087   070   000    Pre-fail  Always -       0
> >   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always -       28
> >   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always -       0
> >   7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always -       136324
> >   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always -       576
> >  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always -       0
> >  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always -       29
> > 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always -       0
> > 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always -       0
> > 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always -       0
> > 188 Command_Timeout         0x0032   100   100   000    Old_age   Always -       0
> > 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always -       0
> > 190 Airflow_Temperature_Cel 0x0022   073   062   045    Old_age   Always -       27 (Min/Max 21/27)
> > 191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always -       0
> > 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always -       23
> > 193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always -       29
> > 194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always -       27 (0 21 0 0 0)
> > 195 Hardware_ECC_Recovered  0x001a   027   008   000    Old_age   Always -       241488
> > 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always -       0
> > 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline -      0
> > 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always -       0
> > 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline -      265544943010369
> > 241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline -      3746932548
> > 242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline -      3212957483
> > 
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > {snipping more}
> 
> Your SMART attributes here appear perfectly fine.  There is no
> indication of bad LBAs (sectors) on the drive, or even "suspect" LBAs on
> the drive.  If LBA 0, for example, was actually bad (meaning the sector
> itself), that would show up in the SMART error log (most likely), and if
> not there, at bare minimum as some form of incremented RAW_VALUE field
> in one of many attributes (either 5, 197, or 198; possibly 187, I forget).
> 
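The check described above (look at the raw values of attributes 5, 197,
198 and possibly 187) can be sketched in Python. This assumes the
10-column row layout of the smartctl 5.42 output quoted in this thread;
the sample rows are copied from that output, and the function name
suspect_attributes is made up here.

```python
# Attribute IDs that count bad or suspect sectors (per the paragraph above).
SECTOR_HEALTH_IDS = {5, 187, 197, 198}

# A few rows copied from the smartctl output quoted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always -       0
195 Hardware_ECC_Recovered  0x001a   027   008   000    Old_age   Always -       241488
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline -      0
"""


def suspect_attributes(table: str) -> list:
    """Return (id, name, raw) for sector-health attributes with raw > 0."""
    hits = []
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in SECTOR_HEALTH_IDS and raw.isdigit() and int(raw) > 0:
            hits.append((attr_id, name, int(raw)))
    return hits


print(suspect_attributes(SAMPLE) or "no bad or suspect sectors reported")
```

Attribute 195 is deliberately not in the set: its large raw value is
vendor-encoded and is not a count of bad sectors.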
> SMART attributes 1, 7, and 195 on Seagate drives are always "crazy";
> that is to say, they are not incremental counters, they are
> vendor-encoded.  smartmontools does not know how to decode some of these
> attributes (on SOME Seagate drives it does, on others it doesn't).  I
> state this because people read SMART attributes wrong ~70% of the time;
> they see non-zero numbers and go "oh my god, it's broken!"  No it isn't.
> SMART attribute values/decoding are not part of the ATA specification
> (even working draft), so it's all proprietary more or less.
> 
> I also want to assume attribute 240 is vendor-encoded as well, probably
> as multiple data sets stored within the full 6-byte attribute field;
> again, smartmontools doesn't know how to decode this.  I wouldn't worry
> about this, again even though the number is huge.  :-)
> 
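The guess that attribute 240 packs more than one field into its 6-byte
raw value can be checked with a little arithmetic. A minimal Python
sketch, assuming (nothing in this thread confirms it) that the low 32
bits and the high 16 bits are separate fields; split_raw48 is a name made
up here.

```python
RAW_240 = 265544943010369  # Head_Flying_Hours raw value from the smartctl output
POWER_ON_HOURS = 576       # attribute 9 raw value, for comparison


def split_raw48(raw: int) -> tuple:
    """Split a 48-bit raw value into (high 16 bits, low 32 bits)."""
    return raw >> 32, raw & 0xFFFFFFFF


high, low = split_raw48(RAW_240)
print(f"raw    = 0x{RAW_240:012x}")
print(f"high16 = {high}")
print(f"low32  = {low} (Power_On_Hours is {POWER_ON_HOURS})")
```

The low 32 bits come out to 577, right next to the Power_On_Hours raw
value of 576, which fits the idea of an hour counter in the low bytes.
What the upper 16 bits (61827) encode is not something this thread
establishes.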
> SMART attribute 184 keeps track of errors occurring between the drive
> controller (on the PCB) and the drive cache; there are no cache errors.
> That's good, and I'm glad to see vendors implementing this.
> 
> SMART attribute 188 indicates the drive itself has not counted any
> command timeouts (these would be ATA commands sent from the OS through
> the SATA/SAS controller to the drive controller, which timed out at the
> phase when the drive attempted to read/write data from a sector).
> 
> SMART attribute 199 indicates there are no cabling problems or "physical
> issues between the disk and the SATA/SAS controller" (bad connectors,
> dust in the connectors, shoddy hot-swap plane, bad port, etc.).
> 
> SMART attribute 183 is something I haven't seen before (I'm more
> familiar with Western Digital disks), but it also looks fine.
> 
> So again: your drive looks perfectly healthy per SMART stats.  But
> there's something amusing about this situation that a lot of people
> overlook...
> 
> > {snipping dmesg for brevity, but here's the URL for readers so they
> > can see it themselves:
> > http://lists.freebsd.org/pipermail/freebsd-fs/2012-January/013481.html
> > }
> >
> > {simplify the SCSI errors shown}
> >
> > (da12:mps2:0:5:0): READ(6). CDB: 8 0 0 1 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): READ(10). CDB: 28 0 74 70 6d af 0 0 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): WRITE(6). CDB: a 0 0 0 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> 
> Based on this, we know the following:
> 
> - The da12 disk is doing something weird when it comes to reads AND
>   writes.
> - The da12 disk is not timing out; it receives an immediate error on
>   reads and writes (coming back from the controller; whether or not the
>   ATA command block makes it to the disk is unknown, but I have to
>   assume it does).
> - The da12 disk, at one time, was working/usable as indicated by some
>   SMART attributes.
> - The da12 disk is the only ST1000DL002 disk in the system.
> - The da12 disk is on the same controller as 4 other disks.
> - The da8 through da11 disks (WD25EZRS) on the mps2 controller are
>   performing fine with no issues (I have to assume this).
> - The ST1000DL002 disk is an Advanced Format disk (4096-byte sectors).
> - All the WD25EZRS disks are Advanced Format disks (4096-byte sectors).
> - The ST1000DL002 disk behaves badly when used on the on-board AHCI
>   controller as well as a completely different motherboard (presumably).
> 
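The CDB bytes in the quoted kernel messages can be decoded by hand to see
which LBAs the failing commands were aimed at. A minimal Python sketch
covering only the READ/WRITE (6) and (10) forms that appear in the log;
decode_rw_cdb is a name made up here, and the input is assumed to be the
space-separated hex bytes exactly as the kernel printed them.

```python
def decode_rw_cdb(cdb_text: str) -> dict:
    """Decode a READ/WRITE (6) or (10) CDB printed as space-separated hex."""
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    opcode = cdb[0]
    names = {0x08: "READ(6)", 0x0A: "WRITE(6)", 0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if opcode in (0x08, 0x0A) and len(cdb) == 6:
        # 6-byte form: 21-bit LBA in bytes 1-3, length in byte 4 (0 means 256).
        lba = ((cdb[1] & 0x1F) << 16) | (cdb[2] << 8) | cdb[3]
        blocks = cdb[4] or 256
    elif opcode in (0x28, 0x2A) and len(cdb) == 10:
        # 10-byte form: 32-bit LBA in bytes 2-5, 16-bit length in bytes 7-8.
        lba = int.from_bytes(bytes(cdb[2:6]), "big")
        blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    else:
        raise ValueError(f"unsupported CDB: {cdb_text!r}")
    return {"op": names[opcode], "lba": lba, "blocks": blocks}


# The three read/write CDBs from the da12 errors quoted above.
for text in ("8 0 0 1 1 0", "28 0 74 70 6d af 0 0 1 0", "a 0 0 0 1 0"):
    print(decode_rw_cdb(text))
```

The WRITE(6) is a one-block write to LBA 0, which matches the dd failure
on the very first sector. The READ(10) targets LBA 1953525167, which is
the last sector of a drive with 1,000,204,886,016 / 512 = 1,953,525,168
sectors, so the errors hit both ends of the LBA range.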
> Here's the fun part:
> 
> ATA commands being submitted from the OS to the disk (specifically the
> controller on the disk itself) are working fine.  SMART attributes are
> obtained via an ATA command that, internally on mechanical drives,
> fetches data from the HPA (Host Protected Area) region of the drive (see
> Wikipedia if you don't know about this), and returns that data.  AFAIK
> this data is not cached in any way, it's almost always read straight
> from the HPA.
> 
> So this means we know I/O communication between the OS and controller,
> and the controller and the disk, works fine.  And we also know, at least
> with regard to the HPA region, that the heads can read data from the HPA
> region successfully.  Great.
> 
> Could this be a controller problem (e.g. a firmware bug that affects 
> compatibility with ST1000DL002 drives)?  I'm about 95% certain the
> answer is no.  The reason is that the ST1000DL002 drive behaved the same
> when put on other controllers.
> 
> What all this means is that the drive, in effect, refuses to read data
> from non-HPA regions of the disk -- that means LBA 0 to <last LBA>.  Why
> or how could this happen?  Unknown, because there's a *ton* of
> possibilities -- way more than I care to speculate.  :-)
> 
> Have I seen this problem before?  Yes -- many times, but only once with
> a SATA drive:
> 
> - I see this on rare occasion with Fujitsu SCSI disks at my workplace,
>   where the drives flat out refuse to do I/O any longer.  However, these
>   return a vendor-specific ASC + ASCQ that indicate the drive is in a
>   "locked" or "frozen" state, requiring Fujitsu to investigate.  I've seen
>   it happen a good 10, maybe 20 times over the past few years on drives
>   manufactured from 2001 to 2007.  Thankfully Fujitsu provides full docs
>   on their SCSI drives so I was able to look up the ASC/ASCQ and figure
>   out it was an internal drive failure.  We disposed of the disks
>   properly/securely.
> 
> - In the SATA case, the end-user's drive behaved the same as yours.  I
>   do not remember what brand (it really doesn't matter though).  In their
>   case, however, the HPA region was corrupt; the drive spit out weird
>   errors during SMART attribute fetch, and those attributes which it did
>   fetch were *completely* garbled.  My guess was a bad HPA region of the
>   drive, combined with either a firmware bug or something mechanical or
>   head problems.  The end-user RMA'd the drive and the replacement worked
>   fine.
> 
> My advice at this point (#1 is optional):
> 
> 1. If you're curious and just interested in learning: put the
> ST1000DL002 disk on a system where it's the only disk, and hooked
> directly to the motherboard (and not in AHCI mode), and boot SeaTools
> from a CD or USB stick.
> 
> I'm willing to bet you get back an error code on the quick/short test
> (which does more than just a SMART short test).  If that does pass, try
> doing a long test (which reads all the LBAs on the drive).  I'll be
> very, VERY surprised if that passes.
> 
> 2. File an RMA with Seagate.  The simple version is that all LBA I/O
> (standard read/write) is being rejected by the drive for unknown
> reasons.
> 
> Good luck, and hope this sheds some light on the "fun" (or not so fun)
> world of hard disk troubleshooting.  And don't ask me to troubleshoot an
> SSD.  ;-)
>