ZFS checksum errors on USB attach (Was: ZFS data error without
reasons)
Damian Gerow
dgerow at afflictions.org
Tue Mar 31 21:40:28 PDT 2009
Mark Powell wrote:
: > the problem can be solved - The weird thing is that it will give CRC
: > errors (and permanent errors) in blocks that have not been touched (or at
: > least I think so)
:
: Can you be a little clearer? Perhaps some zpool status output with the
: steps you've taken?
I've run into this problem four times in the past week or so. I haven't
reliably been able to reproduce it (which isn't /that/ upsetting; I don't
much like data loss, even if I can predict it), but my current hunch is that
it has something to do with uptime: the longer the system is up, the more
likely the bug is to trigger.
Here's what happened an hour ago:
I walked over to my laptop (Lenovo X200), and plugged in a Cowon D2 to
charge up the battery. When I tried to do other things, I received a large
number of input/output errors, so I knew I'd triggered the bug. Two things
that I tried to do:
- start sshd (this machine doesn't normally run sshd)
- open a new urxvt session
Appropriate snippets from /var/log/messages (note: nothing shows up in
dmesg):
-----
Mar 31 23:29:56 plebeian kernel: ugen7.2: <COWON Systems, Inc.> at usbus7
Mar 31 23:29:56 plebeian kernel: umass0: <COWON Systems, Inc. COWON D2 @-e" 3.57, class 0/0, rev 2.00/1.00, addr 2> on usbus7
Mar 31 23:29:56 plebeian kernel: umass0: SCSI over Bulk-Only; quirks = 0x0000
Mar 31 23:29:56 plebeian root: Unknown USB device: vendor 0x0e21 product 0x0800 bus uhub7
Mar 31 23:29:57 plebeian kernel: umass0:0:0:-1: Attached to scbus0
Mar 31 23:29:57 plebeian kernel: da0 at umass-sim0 bus 0 target 0 lun 0
Mar 31 23:29:57 plebeian kernel: da0: <COWON D2 0100> Removable Direct Access SCSI-0 device
Mar 31 23:29:57 plebeian kernel: da0: 40.000MB/s transfers
Mar 31 23:29:57 plebeian kernel: da0: 7808MB (15990784 512 byte sectors: 255H 63S/T 995C)
Mar 31 23:29:57 plebeian kernel: da1 at umass-sim0 bus 0 target 0 lun 1
Mar 31 23:29:57 plebeian kernel: da1: <COWON D2 0100> Removable Direct Access SCSI-0 device
Mar 31 23:29:57 plebeian kernel: da1: 40.000MB/s transfers
Mar 31 23:29:57 plebeian kernel: da1: 15359MB (31456320 512 byte sectors: 255H 63S/T 1958C)
Mar 31 23:29:57 plebeian kernel: GEOM_LABEL: Label for provider da0 is msdosfs/D2.
Mar 31 23:29:57 plebeian kernel: GEOM: da1: partition 1 does not start on a track boundary.
Mar 31 23:29:57 plebeian kernel: GEOM: da1: partition 1 does not end on a track boundary.
Mar 31 23:30:50 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=297610772480 size=131072
Mar 31 23:30:50 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=297610772480 size=131072
Mar 31 23:30:50 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:20 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=23159373824 size=131072
Mar 31 23:31:20 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=23159373824 size=131072
Mar 31 23:31:20 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:34 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18063163392 size=131072
Mar 31 23:31:34 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18063163392 size=131072
Mar 31 23:31:34 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:34 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18062901248 size=131072
Mar 31 23:31:34 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18062901248 size=131072
Mar 31 23:31:34 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:35 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=17453809664 size=131072
Mar 31 23:31:35 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=17453809664 size=131072
Mar 31 23:31:35 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:35 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=17453809664 size=131072
Mar 31 23:31:35 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=17453809664 size=131072
Mar 31 23:31:35 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:40 plebeian sudo: dgerow : TTY=pts/4 ; PWD=/home/dgerow ; USER=root ; COMMAND=/etc/rc.d/sshd start
Mar 31 23:31:50 plebeian sudo: dgerow : TTY=pts/4 ; PWD=/home/dgerow ; USER=root ; COMMAND=/etc/rc.d/sshd onestart
Mar 31 23:31:51 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18045468672 size=131072
Mar 31 23:31:51 plebeian dgerow: /etc/rc.d/sshd: WARNING: failed to start sshd
Mar 31 23:31:51 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18045468672 size=131072
Mar 31 23:31:51 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
-----
I explicitly received errors reading from /etc/termcap after this. So I
shut everything down -- there is a strong tendency for applications to core
dump at this point -- and rebooted into single-user mode. Then I checked
the status of the zfs pool (zpool status -v):
-----
pool: storage
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          ad4s1d.eli  ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
storage/usr:/local/share/zsh/4.3.9/functions/Completion/Debian.zwc
storage/usr:/local/share/zsh/4.3.9/functions/Completion/Linux.zwc
storage/usr:<0xf02d>
storage/usr:/local/bin/mutt
storage/usr:/sbin/sshd
storage/usr:/local/sbin/cupsd
storage/usr:/share/games/fortune/fortunes
storage/usr:/local/lib/libgtk-x11-2.0.so.0
storage/usr:/local/lib/firefox3/chrome/browser.jar
storage/usr:/share/misc/termcap
storage/usr:/share/misc/termcap.db
storage/usr:/local/bin/transmission
storage/usr:/local/bin/openbox
storage/home:<0x331d>
storage/home:/dgerow/X-Files/402 - Home.mp4
storage/home:/dgerow/.mozilla/firefox/bk0ibcxu.default/urlclassifier3.sqlite
storage/home:/dgerow/X-Files/705 - Rush.mp4
storage/home:/dgerow/.procmail.log
-----
There doesn't seem to be a pattern as to which files are affected: mutt,
openbox, cupsd, and transmission were all running before the checksum errors,
whereas fortune, sshd, and procmail were all run post-checksum errors. zsh
and firefox were running, of course, both before and after.
Oddly, and as noted in the text quoted above, I'm not sure what 0x331d would be.
I didn't explicitly delete anything after plugging in the D2, though it is
possible that a program removed a temporary file.
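
One way to look for a pattern is to check whether the failing offsets from
the /var/log/messages excerpt cluster anywhere on the provider. What follows
is a minimal Python sketch, not something that was run as part of this report.
The function name `summarize` and the `SAMPLE` lines are illustrative
assumptions, and the regular expression only matches the message format shown
in the excerpt above.

```python
import re
from collections import Counter

# Matches only the "checksum mismatch" format seen in the log excerpt;
# other ZFS message formats are ignored (assumption: format is stable).
MISMATCH = re.compile(
    r"ZFS: checksum mismatch, zpool=(?P<pool>\S+) path=(?P<path>\S+) "
    r"offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def summarize(lines):
    """Count checksum-mismatch reports per (path, offset) pair."""
    counts = Counter()
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            counts[(match.group("path"), int(match.group("offset")))] += 1
    return counts


# Two lines copied from the excerpt, plus one non-matching line.
SAMPLE = [
    "Mar 31 23:30:50 plebeian root: ZFS: checksum mismatch, zpool=storage "
    "path=/dev/ad4s1d.eli offset=297610772480 size=131072",
    "Mar 31 23:31:20 plebeian root: ZFS: checksum mismatch, zpool=storage "
    "path=/dev/ad4s1d.eli offset=23159373824 size=131072",
    "Mar 31 23:30:50 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86",
]

# Print offsets in ascending order, also expressed in GiB for readability.
for (path, offset), reports in sorted(summarize(SAMPLE).items(), key=lambda kv: kv[0][1]):
    print(f"{path} offset={offset} ({offset / 2**30:.1f} GiB) reports={reports}")
```

Passing an open file object for /var/log/messages to `summarize` instead of
`SAMPLE` would cover the whole log; every mismatch in the excerpt is logged
twice, so a count of 2 per offset is the expected baseline there.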
I then ran a scrub, which found an additional five errors. Post-scrub, my
pool now looks like this:
-----
pool: storage
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: scrub completed after 0h34m with 6 errors on Wed Apr 1 00:22:25 2009
config:
        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     6
          ad4s1d.eli  ONLINE       0     0    12
errors: Permanent errors have been detected in the following files:
storage/home:/dgerow/.mozilla/firefox/bk0ibcxu.default/urlclassifier3.sqlite
storage/home:/dgerow/.config/transmission/resume/X-Files.2a82218f000bc93d.resume
storage/home:/dgerow/.mozilla/firefox/bk0ibcxu.default/Cache/_CACHE_001_
storage/home:/dgerow/mutt.core
storage/home:/dgerow/X-Files/707 - Orison.mp4
-----
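
To compare the two error lists more easily, the "errors:" section of the
`zpool status -v` output can be grouped by dataset. The following is a rough
Python sketch under stated assumptions: the function name `permanent_errors`
is hypothetical, and it assumes every entry has the `dataset:object` shape
seen above, including the `<0x...>` entries, which as far as I know are object
numbers printed when no path can be resolved.

```python
from collections import defaultdict


def permanent_errors(status_text):
    """Group entries after the 'errors:' line by dataset name.

    Assumes each entry looks like 'dataset:/path' or 'dataset:<0xNNNN>',
    as in the output quoted above.
    """
    by_dataset = defaultdict(list)
    in_errors = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("errors:"):
            in_errors = True  # everything after this line is an entry
            continue
        if not in_errors or not line:
            continue
        # Split on the first colon only; file names may contain more.
        dataset, sep, obj = line.partition(":")
        if sep:
            by_dataset[dataset].append(obj)
    return dict(by_dataset)


# A few entries copied from the first status output above.
EXAMPLE = """errors: Permanent errors have been detected in the following files:
storage/usr:/sbin/sshd
storage/usr:<0xf02d>
storage/home:/dgerow/.procmail.log
"""

for dataset, objects in permanent_errors(EXAMPLE).items():
    print(dataset, len(objects), objects)
```

Running it on the pre-scrub and post-scrub outputs and comparing the two
results would show which entries are new and which persisted.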
To finish fixing this, I've already deleted all these files, and another
scrub should clean things up, with one final command to tell zfs to ignore
remaining errors (I forget what it is; it shows up in 'action').
Thankfully, I maintain regular backups, and nothing affected (this time
around) was important.
: I expect this is a red herring, but do you not have some kind of
: kernel/module sync problem?
I don't. I'm running a GENERIC kernel from this morning in this case, with
no special modules loaded:
-----
plebeian% sysctl kern.osreldate kern.osrevision
kern.osreldate: 800074
kern.osrevision: 199506
plebeian% uname -a
FreeBSD plebeian.afflictions.org 8.0-CURRENT FreeBSD 8.0-CURRENT #0: Tue Mar 31 08:41:28 EDT 2009 dgerow at plebeian.afflictions.org:/usr/obj/usr/src/sys/GENERIC amd64
plebeian% kldstat
Id Refs Address            Size     Name
 1   20 0xffffffff80100000 e48440   kernel
 2    1 0xffffffff81022000 a5c7     geom_eli.ko
 3    1 0xffffffff8102d000 1b446    crypto.ko
 4    1 0xffffffff81049000 a192     zlib.ko
 5    1 0xffffffff81054000 f0a64    zfs.ko
 6    1 0xffffffff81145000 1914     opensolaris.ko
 7    1 0xffffffff81147000 771a     i915.ko
 8    1 0xffffffff8114f000 111a8    drm.ko
plebeian%
-----
- Damian
More information about the freebsd-current
mailing list