ZFS raidz2, errors in file?

Thu Oct 18 01:28:10 UTC 2012

On Wed, 2012-10-17 at 20:39 +0300, Heikki Suonsivu wrote:
> Summary:  This could indicate that in some situations ZFS or FreeBSD 
> ignore disk errors in some situations, or disks do not pass them to 
> FreeBSD.
> 
> There are three SCSI timeouts all on different disks reported, though it 
> seems unlikely they hit all three redundancies of a single block of 
> single file, so the more likely scenario would be that there are other 
> errors causing this, which are not reported or are ignored.  Also, 
> timeouts should trigger retries.
> 
> SMART data indicates problems on two other disks, but no indication of 
> those are seen in logs (the disks work, but SMART information indicates 
> problems).  The fs is 8*3T pool on raidz2.
> 

I can't say about your disks but I had nine 4TB Hitachi disks suffering
a 30% failure rate in their first year and consistently, and randomly,
being marked off-line by ZFS in a RAIDz2 pool. I eventually stripped
those disks from that role, replacing them with 3TB Seagates (the
Hitachies appear to work find in a ZFS mirror). 

I haven't experienced the same failure since I replaced the Hitachies
with Seagate.

> Further data below:
> 
> ------------------------------------------------------------------------
> zpool status -v
>    pool: raid6
>   state: ONLINE
> status: One or more devices has experienced an error resulting in data
>          corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>          entire pool from backup.
>     see: http://www.sun.com/msg/ZFS-8000-8A
>   scan: none requested
> config:
> 
>          NAME        STATE     READ WRITE CKSUM
>          raid6       ONLINE       0     0     0
>            raidz2-0  ONLINE       0     0     0
>              da0     ONLINE       0     0     0
>              da1     ONLINE       0     0     0
>              da2     ONLINE       0     0     0
>              da3     ONLINE       0     0     0
>              da4     ONLINE       0     0     0
>              da5     ONLINE       0     0     0
>              da6     ONLINE       0     0     0
>              da7     ONLINE       0     0     0
> 
> errors: Permanent errors have been detected in the following files:
> 
>          /raid6/backup/rsnapshot/daily.1/foobar.net/users/mail/foo
> ------------------------------------------------------------------------
> 
> All disks online, so this is kind of odd.
> 
> One disk indeed has pending sector, not unusual and should be survivable:
> 
> ------------------------------------------------------------------------
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE 
> UPDATED  WHEN_FAILED RAW_VALUE
>    1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail 
> Always       -       0
>    3 Spin_Up_Time            0x0027   162   154   021    Pre-fail 
> Always       -       8891
>    4 Start_Stop_Count        0x0032   100   100   000    Old_age 
> Always       -       46
>    5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail 
> Always       -       0
>    7 Seek_Error_Rate         0x002e   100   253   000    Old_age 
> Always       -       0
>    9 Power_On_Hours          0x0032   094   094   000    Old_age 
> Always       -       4776
>   10 Spin_Retry_Count        0x0032   100   253   000    Old_age 
> Always       -       0
>   11 Calibration_Retry_Count 0x0032   100   253   000    Old_age 
> Always       -       0
>   12 Power_Cycle_Count       0x0032   100   100   000    Old_age 
> Always       -       43
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always 
>        -       40
> 193 Load_Cycle_Count        0x0032   196   196   000    Old_age   Always 
>        -       13595
> 194 Temperature_Celsius     0x0022   119   106   000    Old_age   Always 
>        -       33
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always 
>        -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always 
>        -       1
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age 
> Offline      -       1
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always 
>        -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age 
> Offline      -       1
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> ------------------------------------------------------------------------
> 
> But there is no corresponding read error in /var/log/messages, which 
> would indicate that FreeBSD never saw the error, or ignored it.
> 
> In addition, there seems to be ICRC DMA errors on da0.  Looks nasty, but 
> only show up in SMART log, not in /var/log/messages.
> 
> ------------------------------------------------------------------------
> smartctl 5.43 2012-06-30 r3573 [FreeBSD 9.0-RELEASE-p3 amd64] (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Hitachi Deskstar 7K3000
> Device Model:     Hitachi HDS723030ALA640
> Serial Number:    MK0331YHGVDK6A
> LU WWN Device Id: 5 000cca 225cc00c7
> Firmware Version: MKAOA5C0
> User Capacity:    3,000,592,982,016 bytes [3.00 TB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 4
> Local Time is:    Wed Oct 17 20:03:05 2012 EEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x84) Offline data collection activity
>                                          was suspended by an 
> interrupting command from host.
>                                          Auto Offline Data Collection: 
> Enabled.
> Self-test execution status:      (   0) The previous self-test routine 
> completed
>                                          without error or no self-test 
> has ever
>                                          been run.
> Total time to complete Offline
> data collection:                (29509) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                          Auto Offline data collection 
> on/off support.
>                                          Suspend Offline collection upon new
>                                          command.
>                                          Offline surface scan supported.
>                                          Self-test supported.
>                                          No Conveyance Self-test supported.
>                                          Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                          power-saving mode.
>                                          Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                          General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        ( 492) minutes.
> SCT capabilities:              (0x003d) SCT Status supported.
>                                          SCT Error Recovery Control 
> supported.
>                                          SCT Feature Control supported.
>                                          SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE 
> UPDATED  WHEN_FAILED RAW_VALUE
>    1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail 
> Always       -       0
>    2 Throughput_Performance  0x0005   136   136   054    Pre-fail 
> Offline      -       83
>    3 Spin_Up_Time            0x0007   125   125   024    Pre-fail 
> Always       -       617 (Average 617)
>    4 Start_Stop_Count        0x0012   100   100   000    Old_age 
> Always       -       62
>    5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail 
> Always       -       0
>    7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail 
> Always       -       0
>    8 Seek_Time_Performance   0x0005   135   135   020    Pre-fail 
> Offline      -       26
>    9 Power_On_Hours          0x0012   100   100   000    Old_age 
> Always       -       4570
>   10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail 
> Always       -       0
>   12 Power_Cycle_Count       0x0032   100   100   000    Old_age 
> Always       -       30
> 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always 
>        -       129
> 193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always 
>        -       129
> 194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always 
>        -       36 (Min/Max 25/50)
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always 
>        -       0
> 197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always 
>        -       0
> 198 Offline_Uncorrectable   0x0008   100   100   000    Old_age 
> Offline      -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always 
>        -       112
> 
> SMART Error Log Version: 1
> ATA Error Count: 112 (device log contains only the most recent five errors)
>          CR = Command Register [HEX]
>          FR = Features Register [HEX]
>          SC = Sector Count Register [HEX]
>          SN = Sector Number Register [HEX]
>          CL = Cylinder Low Register [HEX]
>          CH = Cylinder High Register [HEX]
>          DH = Device/Head Register [HEX]
>          DC = Device Command Register [HEX]
>          ER = Error register [HEX]
>          ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> 
> Error 112 occurred at disk power-on lifetime: 52 hours (2 days + 4 hours)
>    When the command that caused the error occurred, the device was 
> active or idle.
> 
>    After command completion occurred, registers were:
>    ER ST SC SN CL CH DH
>    -- -- -- -- -- -- --
>    84 51 20 28 c5 ae 06  Error: ICRC, ABRT 32 sectors at LBA = 
> 0x06aec528 = 112117032
> 
>    Commands leading to the command that caused the error were:
>    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>    -- -- -- -- -- -- -- --  ----------------  --------------------
>    35 03 f0 58 c4 ae 40 00   2d+04:00:39.416  WRITE DMA EXT
>    35 03 20 38 c4 ae 40 00   2d+04:00:39.409  WRITE DMA EXT
>    35 03 f0 48 c3 ae 40 00   2d+04:00:39.402  WRITE DMA EXT
>    35 03 f0 58 c2 ae 40 00   2d+04:00:39.394  WRITE DMA EXT
>    35 03 60 f8 c1 ae 40 00   2d+04:00:39.388  WRITE DMA EXT
> 
> Error 111 occurred at disk power-on lifetime: 52 hours (2 days + 4 hours)
>    When the command that caused the error occurred, the device was 
> active or idle.
> 
>    After command completion occurred, registers were:
>    ER ST SC SN CL CH DH
>    -- -- -- -- -- -- --
>    84 51 30 08 c4 90 06  Error: ICRC, ABRT 48 sectors at LBA = 
> 0x0690c408 = 110150664
> 
>    Commands leading to the command that caused the error were:
>    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>    -- -- -- -- -- -- -- --  ----------------  --------------------
>    35 03 f0 48 c3 90 40 00   2d+03:59:29.925  WRITE DMA EXT
>    35 03 f0 58 c2 90 40 00   2d+03:59:29.917  WRITE DMA EXT
>    35 03 60 f8 c1 90 40 00   2d+03:59:29.910  WRITE DMA EXT
>    35 03 b0 48 c1 90 40 00   2d+03:59:29.907  WRITE DMA EXT
>    35 03 f0 58 c0 90 40 00   2d+03:59:29.899  WRITE DMA EXT
> 
> Error 110 occurred at disk power-on lifetime: 52 hours (2 days + 4 hours)
>    When the command that caused the error occurred, the device was 
> active or idle.
> 
>    After command completion occurred, registers were:
>    ER ST SC SN CL CH DH
>    -- -- -- -- -- -- --
>    84 51 20 28 47 90 05  Error: ICRC, ABRT 32 sectors at LBA = 
> 0x05904728 = 93341480
> 
>    Commands leading to the command that caused the error were:
>    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>    -- -- -- -- -- -- -- --  ----------------  --------------------
>    35 da f0 58 46 90 40 00   2d+03:49:28.919  WRITE DMA EXT
>    35 da 20 38 46 90 40 00   2d+03:49:28.917  WRITE DMA EXT
>    35 da f0 48 45 90 40 00   2d+03:49:28.909  WRITE DMA EXT
>    35 da f0 58 44 90 40 00   2d+03:49:28.901  WRITE DMA EXT
>    35 da 20 38 44 90 40 00   2d+03:49:28.899  WRITE DMA EXT
> 
> Error 109 occurred at disk power-on lifetime: 52 hours (2 days + 4 hours)
>    When the command that caused the error occurred, the device was 
> active or idle.
> 
>    After command completion occurred, registers were:
>    ER ST SC SN CL CH DH
>    -- -- -- -- -- -- --
>    84 51 10 38 e7 41 05  Error: ICRC, ABRT 16 sectors at LBA = 
> 0x0541e738 = 88205112
> 
>    Commands leading to the command that caused the error were:
>    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>    -- -- -- -- -- -- -- --  ----------------  --------------------
>    35 03 f0 58 e6 41 40 00   2d+03:46:20.939  WRITE DMA EXT
>    35 03 20 38 e6 41 40 00   2d+03:46:20.933  WRITE DMA EXT
>    35 03 f0 48 e5 41 40 00   2d+03:46:20.926  WRITE DMA EXT
>    35 03 f0 58 e4 41 40 00   2d+03:46:20.921  WRITE DMA EXT
>    35 03 20 38 e4 41 40 00   2d+03:46:20.916  WRITE DMA EXT
> 
> Error 108 occurred at disk power-on lifetime: 52 hours (2 days + 4 hours)
>    When the command that caused the error occurred, the device was 
> active or idle.
> 
>    After command completion occurred, registers were:
>    ER ST SC SN CL CH DH
>    -- -- -- -- -- -- --
>    84 51 10 28 0a 2f 05  Error: ICRC, ABRT 16 sectors at LBA = 
> 0x052f0a28 = 86968872
> 
>    Commands leading to the command that caused the error were:
>    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>    -- -- -- -- -- -- -- --  ----------------  --------------------
>    35 03 f0 48 09 2f 40 00   2d+03:45:06.250  WRITE DMA EXT
>    35 03 f0 58 08 2f 40 00   2d+03:45:06.242  WRITE DMA EXT
>    35 03 20 38 08 2f 40 00   2d+03:45:06.238  WRITE DMA EXT
>    35 03 f0 48 07 2f 40 00   2d+03:45:06.231  WRITE DMA EXT
>    35 03 08 40 bc 25 40 00   2d+03:45:06.225  WRITE DMA EXT
> 
> SMART Self-test log structure revision number 1
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> 
> 
> SMART Selective self-test log data structure revision number 1
>   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>      1        0        0  Not_testing
>      2        0        0  Not_testing
>      3        0        0  Not_testing
>      4        0        0  Not_testing
>      5        0        0  Not_testing
> Selective self-test flags (0x0):
>    After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> ------------------------------------------------------------------------
> 
> These only show up in disk's SMART log, not as errors in other logs 
> (logs exists for all the history of the system).  These do not seem to 
> have passed to the FreeBSD, or FreeBSD has ignored the errors.
> 
>  From logs, I can find da3, da4, da6 all have reported scsi timeout once 
> (different times, within one month in July.
> 
> They all look like this, one case for each of the disks:
> 
> ------------------------------------------------------------------------
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 699
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 419
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 359
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 472
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 124
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 762
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 243
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 890
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 204
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 2
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI command timeout on 
> device handle 0x000d SMID 291
> Jul 24 12:16:55 tila kernel: mps0: (0:6:0) terminated ioc 804b scsi 0 
> state c xfer 0
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 699 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 419
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 419 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 359
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 359 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 472
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 472 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 124
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 124 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 762
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 762 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 243
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 243 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 890
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 890 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 204
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 204 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 2
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 2 complete
> Jul 24 12:16:55 tila kernel: mps0: mpssas_complete_tm_request: sending 
> deferred task management request for handle 0x0d SMID 291
> Jul 24 12:16:55 tila kernel: mps0: mpssas_abort_complete: abort request 
> on handle 0x0d SMID 291 complete
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): WRITE(16). CDB: 8a 0 0 0 
> 0 1 4a 17 dc 5e 0 0 0 57 0 0
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): CAM status: SCSI Status Error
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI status: Check Condition
> Jul 24 12:16:55 tila kernel: (da6:mps0:0:6:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jul 24 12:16:56 tila kernel: (da6:mps0:0:6:0): WRITE(10). CDB: 2a 0 28 0 
> 85 fb 0 0 1 0
> Jul 24 12:16:56 tila kernel: (da6:mps0:0:6:0): CAM status: SCSI Status Error
> Jul 24 12:16:56 tila kernel: (da6:mps0:0:6:0): SCSI status: Check Condition
> Jul 24 12:16:56 tila kernel: (da6:mps0:0:6:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> ------------------------------------------------------------------------
> 
> If the da0 ICRC errors would have been seen by ZFS, it should have made 
> a) note of that in some log?  b) retried write?  c) Something else?  If 
> we assume that the disk firmware is broken and does not report these to 
> OS, so da0 might be corrupt.  But that should still be ok with raidz2.
> 
> We do have 3 random SCSI timeouts, which were seen by FreeBSD, and thus 
> should have prompted ZFS do handle the errors, and one read error on 
> data, which is not reported as read error in any log, other than disk's 
> SMART info says so.
> 
> Is it possible to get more information, which block of the file is 
> corrupt, on which disks the offending blocks reside, and similar 
> information?
> 
> 
> 
> 
> 
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"