sanity check: is 9211-8i, on 8.3, with IT firmware still "the one"
Dennis Glatting
freebsd at pki2.com
Sat Jan 21 03:43:18 UTC 2012
Data points update:
I thought this problem might be related to the specific RAID controller
(LSI 9211-8i - "R") first used on the disks, so I used it on a new,
different set of disks. Those disks work fine afterwards:
ada3 at ata0 bus 0 scbus6 target 0 lun 0
ada3: <ST31000524AS JC4B> ATA-8 SATA 3.x device
ada3: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
ada3: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada3: Previously was known as ad0
ada4: <ST31000524AS JC4B> ATA-8 SATA 3.x device
ada4: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
ada4: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada4: Previously was known as ad1
bd3# dd if=/dev/zero of=/dev/ada3 count=8
8+0 records in
8+0 records out
4096 bytes transferred in 0.006000 secs (682662 bytes/sec)
bd3# dd if=/dev/zero of=/dev/ada4 count=8
8+0 records in
8+0 records out
4096 bytes transferred in 0.001953 secs (2097408 bytes/sec)
I used SeaTools on one of the disks from the first set
(ST1000DL002-9TT153). On a long test the tool reported errors that it
could not fix. I didn't see much point in trying the second disk.
So, two separately purchased disks from the same vendor (TigerDirect)
are both bad? What are the odds of that? Hmm...
On Fri, 2012-01-20 at 10:18 -0800, Jeremy Chadwick wrote:
> On Fri, Jan 20, 2012 at 08:31:34AM -0800, Dennis Glatting wrote:
> > On Fri, 2012-01-20 at 07:31 -0800, Jeremy Chadwick wrote:
> >
> > > On Fri, Jan 20, 2012 at 06:22:11AM -0800, Dennis Glatting wrote:
> > > > I am having a problem with Seagate ST1000DL002 disks but I haven't yet
> > > > determined whether it is the disks themselves (they -- two of them, new
> > > > -- fail under a MB controller too).
> > >
> > > Assuming the disks are seen directly on the bus (e.g. show up as daX,
> > > adaX, or whatever), please install ports/sysutils/smartmontools (make
> > > sure you're using version 5.42 or newer) and please provide output from
> > > the following command: "smartctl -a /dev/XXX" where XXX is the device
> > > name of the ST1000DL002 disk(s). Please be sure to state which device
> > > name is associated with which smartctl output. You can delete or
> > > remove the disk serial numbers from the output (for privacy) if you
> > > wish. I'll be happy to review the data and tell you whether or not the
> > > disks themselves are showing problems or if the issue is elsewhere.
> >
> > That is the motivation I needed to reboot that system, which was 50%
> > through a task. That said, as remains the case today, for the last 20
> > years I haven't been able to find that "Any Key" on reboot. :)
> >
> > Regardless...
>
> First off, let's start with the full picture. Readers need to know
> exactly what is going on within your controller setup, what disks are
> connected to what, etc. Taken from your full dmesg below, and turned
> into something easy-to-read (mostly):
>
> Controller mps0
> --> LSI SAS2008
> --> IRQ 19 on pci1
> --> Firmware 12.00.00.00
> --> Disks attached:
> --> da0 --> WDC WD25EZRS, SATA300
> --> da1 --> WDC WD25EZRS, SATA300
> --> da2 --> WDC WD25EZRS, SATA300
> --> da3 --> WDC WD25EZRS, SATA300
> --> da4 --> WDC WD25EZRS, SATA300
> --> da5 --> WDC WD25EZRS, SATA300
> --> da6 --> WDC WD25EZRS, SATA300
> --> da7 --> WDC WD25EZRS, SATA300
>
> Controller mps1
> --> LSI SAS2008
> --> IRQ 19 on pci5
> --> Firmware 12.00.00.00
> --> Disks attached:
> --> None
>
> Controller mps2
> --> LSI SAS2008
> --> IRQ 16 on pci6
> --> Firmware 12.00.00.00
> --> Disks attached:
> --> da8 --> WDC WD25EZRS, SATA300
> --> da9 --> WDC WD25EZRS, SATA300
> --> da10 --> WDC WD25EZRS, SATA300
> --> da11 --> WDC WD25EZRS, SATA300
> --> da12 --> ST1000DL002, SATA300
>
> Controller ahci0
> --> ATI IXP700 AHCI (4-port)
> --> IRQ 19 on pci0
> --> Disks attached:
> --> ahcich0 --> ada0 --> Corsair Force 3 SSD, SATA600
> --> ahcich1 --> ada1 --> OCZ-AGILITY2 SSD, SATA300
> --> ahcich2 --> ada2 --> ST31000333AS, SATA300
>
> Controller ata0
> --> ATI IXP700/800 ATA133 (2-port/4-device, PATA)
> --> IRQ <???> on pci0
> --> I would assume this would be on IRQ 14 or 15, sigh...
> --> Disks attached:
> --> None
>
> Now that we have a full picture, let's continue.
>
> > An attempt to write to it:
> >
> > bd3# dd if=/dev/zero of=/dev/da12
> > dd: /dev/da12: Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes transferred in 0.378153 secs (0 bytes/sec)
>
> The dd command you executed to write zeros to the disk, 512 bytes at a
> time, starting at LBA 0, failed when writing the first 512 bytes. So,
> from my perspective, writing to LBA 0 is failing.
>
> You should also keep in mind that this dd command to zero the disk (if
> it were to work) would take a very long time to complete. If you used a
> larger block size (bs=64k or maybe larger), it would be a lot faster.
> Just a tip. Starting with bs=512 (default) is fine, or in this case
> using 4096 would probably be better (see below), but whatever.
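To put rough numbers on the block-size tip above, here is a small sketch that counts how many write calls dd would need to cover the whole disk. `dd_records` is a hypothetical helper name, and the byte count is the "User Capacity" smartctl reports for da12 further down. It only does arithmetic and never touches a device.

```sh
#!/bin/sh
# Number of dd records (write calls) needed to cover a device.
# Usage: dd_records <device_bytes> <block_size_bytes>
# Hypothetical helper for illustration; it does not run dd.
dd_records() {
    # Integer ceiling division: a final partial block still costs one write.
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 1000204886016 bytes is the User Capacity reported for the ST1000DL002.
dd_records 1000204886016 512      # dd's default block size
dd_records 1000204886016 4096     # one 4096-byte sector per write
dd_records 1000204886016 65536    # bs=64k

# The corresponding (destructive!) command would look something like:
#   dd if=/dev/zero of=/dev/da12 bs=64k
```

With the default block size that is close to two billion separate writes; at 64k it drops to roughly fifteen million, which is where the speedup comes from.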
>
> > The disk is presently connected to this device (LSI 9211-8i) but I have
> > also had it connected to the devices on the MB and I think to a
> > SuperMicro board. I have also tried a different LSI board.
>
> Thanks for sharing this -- this is important information, but let's not
> start moving the drive around any more, okay? There's no point. The
> information you've given is enough, and I'll explain it in detail.
>
> > {snipping for brevity}
> >
> > bd3# smartctl -a /dev/da12
> > smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-STABLE amd64] (local build)
> > Copyright (C) 2002-11 by Bruce Allen,
> > http://smartmontools.sourceforge.net
> >
> > === START OF INFORMATION SECTION ===
> > Model Family: Seagate Barracuda Green (Adv. Format)
> > Device Model: ST1000DL002-9TT153
> > Serial Number: W1V06SLR
> > LU WWN Device Id: 5 000c50 037e11be9
> > Firmware Version: CC32
> > User Capacity: 1,000,204,886,016 bytes [1.00 TB]
> > Sector Size: 512 bytes logical/physical
> > Device is: In smartctl database [for details use: -P show]
> > ATA Version is: 8
> > ATA Standard is: ATA-8-ACS revision 4
> > Local Time is: Fri Jan 20 08:22:34 2012 PST
> > SMART support is: Available - device has SMART capability.
> > SMART support is: Enabled
> >
> > {snipping for brevity}
> >
> > SMART Attributes Data Structure revision number: 10
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> > 1 Raw_Read_Error_Rate 0x000f 108 099 006 Pre-fail Always - 241488
> > 3 Spin_Up_Time 0x0003 087 070 000 Pre-fail Always - 0
> > 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 28
> > 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
> > 7 Seek_Error_Rate 0x000f 100 253 030 Pre-fail Always - 136324
> > 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 576
> > 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
> > 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 29
> > 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
> > 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
> > 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
> > 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
> > 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
> > 190 Airflow_Temperature_Cel 0x0022 073 062 045 Old_age Always - 27 (Min/Max 21/27)
> > 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
> > 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 23
> > 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 29
> > 194 Temperature_Celsius 0x0022 027 040 000 Old_age Always - 27 (0 21 0 0 0)
> > 195 Hardware_ECC_Recovered 0x001a 027 008 000 Old_age Always - 241488
> > 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
> > 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
> > 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
> > 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 265544943010369
> > 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3746932548
> > 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3212957483
> >
> > SMART Error Log Version: 1
> > No Errors Logged
> >
> > {snipping more}
>
> Your SMART attributes here appear perfectly fine. There is no
> indication of bad LBAs (sectors) on the drive, or even "suspect" LBAs on
> the drive. If LBA 0, for example, was actually bad (meaning the sector
> itself), that would show up in the SMART error log (most likely), and if
> not there, at bare minimum as some form of incremented RAW_VALUE field
> in one of many attributes (either 5, 197, or 198; possibly 187, I forget).
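As a minimal sketch of the check described above, the following filter prints attributes 5, 187, 197 and 198 only when their RAW_VALUE is non-zero. `bad_sector_attrs` is a hypothetical name, and the filter assumes RAW_VALUE is the tenth whitespace-separated column, as in the smartctl 5.42 output quoted in this thread. Other smartctl versions or output formats may need a different column.

```sh
#!/bin/sh
# Print SMART attributes 5, 187, 197 and 198 whose RAW_VALUE is non-zero.
# Reads smartctl attribute output on stdin; prints nothing if all are zero.
# Assumes RAW_VALUE is field 10, as in the table quoted in this thread.
bad_sector_attrs() {
    awk '$1 ~ /^(5|187|197|198)$/ && $10 + 0 != 0 { print $1, $2, $10 }'
}

# Example use (not run here):
#   smartctl -A /dev/da12 | bad_sector_attrs
```

For the da12 output below, all four raw values are 0, so the filter would print nothing, which matches the "no indication of bad LBAs" reading.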
>
> SMART attributes 1, 7, and 195 on Seagate drives are always "crazy";
> that is to say, they are not incremental counters, they are
> vendor-encoded. smartmontools does not know how to decode some of these
> attributes (on SOME Seagate drives it does, on others it doesn't). I
> state this because people read SMART attributes wrong ~70% of the time;
> they see non-zero numbers and go "oh my god, it's broken!" No it isn't.
> SMART attribute values/decoding are not part of the ATA specification
> (even working draft), so it's all proprietary more or less.
>
> I also want to assume attribute 240 is vendor-encoded as well, probably
> as multiple data sets stored within the full 6-byte attribute field;
> again, smartmontools doesn't know how to decode this. I wouldn't worry
> about this, again even though the number is huge. :-)
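One way to test the "multiple data sets in a 6-byte field" guess is to split the raw value into its high 16 bits and low 32 bits. That particular 16+32 split is an assumption for illustration, not something this thread or the ATA specification confirms, and `raw_hi16` / `raw_lo32` are hypothetical helper names.

```sh
#!/bin/sh
# Split a 48-bit SMART raw value into two assumed sub-fields.
# The 16+32 bit layout is a guess, not a documented encoding.
raw_hi16() { echo $(( ($1 >> 32) & 0xFFFF )); }
raw_lo32() { echo $(( $1 & 0xFFFFFFFF )); }

raw_lo32 265544943010369   # attribute 240 raw value; prints 577
raw_hi16 265544943010369   # prints 61827
raw_hi16 241488            # attributes 1 and 195; prints 0
```

Under that assumption the low 32 bits of attribute 240 come out as 577, within one of the Power_On_Hours raw value of 576, which at least suggests the huge number is a packed field rather than a sign of trouble.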
>
> SMART attribute 184 keeps track of errors occurring between the drive
> controller (on the PCB) and the drive cache; there are no cache errors.
> That's good, and I'm glad to see vendors implementing this.
>
> SMART attribute 188 indicates the drive itself has not counted any
> command timeouts (these would be ATA commands sent from the OS through
> the SATA/SAS controller to the drive controller, which timed out at the
> phase when the drive attempted to read/write data from a sector).
>
> SMART attribute 199 indicates there are no cabling problems or "physical
> issues between the disk and the SATA/SAS controller" (bad connectors,
> dust in the connectors, shoddy hot-swap plane, bad port, etc.).
>
> SMART attribute 183 is something I haven't seen before (I'm more
> familiar with Western Digital disks), but it also looks fine.
>
> So again: your drive looks perfectly healthy per SMART stats. But
> there's something amusing about this situation that a lot of people
> overlook...
>
> > {snipping dmesg for brevity, but here's the URL for readers so they
> > can see it themselves:
> > http://lists.freebsd.org/pipermail/freebsd-fs/2012-January/013481.html
> > }
> >
> > {simplify the SCSI errors shown}
> >
> > (da12:mps2:0:5:0): READ(6). CDB: 8 0 0 1 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): READ(10). CDB: 28 0 74 70 6d af 0 0 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
> > (da12:mps2:0:5:0): WRITE(6). CDB: a 0 0 0 1 0
> > (da12:mps2:0:5:0): CAM status: SCSI Status Error
> > (da12:mps2:0:5:0): SCSI status: Check Condition
> > (da12:mps2:0:5:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
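The CDB bytes in these errors show which LBAs were being touched. A rough sketch of the decoding, using the standard 6-byte and 10-byte READ/WRITE CDB layouts (the function names are hypothetical; arguments are the hex bytes as printed in the log, without a 0x prefix):

```sh
#!/bin/sh
# LBA from a 10-byte READ/WRITE CDB: bytes 2..5, big-endian.
cdb10_lba() { echo $(( (0x$1 << 24) | (0x$2 << 16) | (0x$3 << 8) | 0x$4 )); }
# LBA from a 6-byte READ/WRITE CDB: low 5 bits of byte 1, then bytes 2 and 3.
cdb6_lba() { echo $(( ((0x$1 & 0x1f) << 16) | (0x$2 << 8) | 0x$3 )); }

cdb10_lba 74 70 6d af   # READ(10) CDB: 28 0 74 70 6d af 0 0 1 0; prints 1953525167
cdb6_lba 0 0 1          # READ(6)  CDB: 8 0 0 1 1 0; prints 1
cdb6_lba 0 0 0          # WRITE(6) CDB: a 0 0 0 1 0; prints 0
```

The disk's 1,000,204,886,016 bytes divided by 512 gives 1953525168 sectors, so the READ(10) was aimed at the very last LBA, while the 6-byte commands hit LBA 1 and LBA 0. Single-sector requests at both ends of the disk are failing, not just one spot.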
>
> Based on this, we know the following:
>
> - The da12 disk is doing something weird when it comes to reads AND
> writes.
> - The da12 disk is not timing out; it receives an immediate error on
> reads and writes (coming back from the controller; whether or not the
> ATA command block makes it to the disk is unknown, but I have to
> assume it does).
> - The da12 disk, at one time, was working/usable as indicated by some
> SMART attributes.
> - The da12 disk is the only ST1000DL002 disk in the system.
> - The da12 disk is on the same controller as 4 other disks.
> - The da8 through da11 disks (WD25EZRS) on the mps2 controller are
> performing fine with no issues (I have to assume this).
> - The ST1000DL002 disk is an Advanced Format disk (4096-byte sectors).
> - All the WD25EZRS disks are Advanced Format disks (4096-byte sectors).
> - The ST1000DL002 disk behaves badly when used on the on-board AHCI
> controller as well as a completely different motherboard (presumably).
>
> Here's the fun part:
>
> ATA commands being submitted from the OS to the disk (specifically the
> controller on the disk itself) are working fine. SMART attributes are
> obtained via an ATA command that, internally on mechanical drives,
> fetches data from the HPA (Host Protected Area) region of the drive (see
> Wikipedia if you don't know about this), and returns that data. AFAIK
> this data is not cached in any way, it's almost always read straight
> from the HPA.
>
> So this means we know I/O communication between the OS and controller,
> and the controller and the disk, works fine. And we also know, at least
> with regard to the HPA region, that the heads can read data from the HPA
> region successfully. Great.
>
> Could this be a controller problem (e.g. a firmware bug that affects
> compatibility with ST1000DL002 drives)? I'm about 95% certain the
> answer is no. The reason is that the ST1000DL002 drive behaved the same
> when put on other controllers.
>
> What all this means is that the drive, in effect, refuses to read data
> from non-HPA regions of the disk -- that means LBA 0 to <last LBA>. Why
> or how could this happen? Unknown, because there's a *ton* of
> possibilities -- way more than I care to speculate. :-)
>
> Have I seen this problem before? Yes -- many times, but only once with
> a SATA drive:
>
> - I see this on rare occasion with Fujitsu SCSI disks at my workplace,
> where the drives flat out refuse to do I/O any longer. However, these
> return a vendor-specific ASC + ASCQ that indicate the drive is in a
> "locked" or "frozen" state, requiring Fujitsu to investigate. I've seen
> it happen a good 10, maybe 20 times over the past few years on drives
> manufactured from 2001 to 2007. Thankfully Fujitsu provides full docs
> on their SCSI drives so I was able to look up the ASC/ASCQ and figure
> out it was an internal drive failure. We disposed of the disks
> properly/securely.
>
> - In the SATA case, the end-user's drive behaved the same as yours. I
> do not remember what brand (it really doesn't matter though). In their
> case, however, the HPA region was corrupt; the drive spit out weird
> errors during SMART attribute fetch, and those attributes which it did
> fetch were *completely* garbled. My guess was a bad HPA region of the
> drive, combined with either a firmware bug or something mechanical or
> head problems. The end-user RMA'd the drive and the replacement worked
> fine.
>
> My advice at this point (#1 is optional):
>
> 1. If you're curious and just interested in learning: put the
> ST1000DL002 disk on a system where it's the only disk, and hooked
> directly to the motherboard (and not in AHCI mode), and boot SeaTools
> from a CD or USB stick.
>
> I'm willing to bet you get back an error code on the quick/short test
> (which does more than just a SMART short test). If that does pass, try
> doing a long test (which reads all the LBAs on the drive). I'll be
> very, VERY surprised if that passes.
>
> 2. File an RMA with Seagate. The simple version is that all LBA I/O
> (standard read/write) is being rejected by the drive for unknown
> reasons.
>
> Good luck, and hope this sheds some light on the "fun" (or not so fun)
> world of hard disk troubleshooting. And don't ask me to troubleshoot an
> SSD. ;-)
>