Problems Terminating zpool scrub...

Tue Apr 26 17:40:00 UTC 2011

On 26 April 2011 14:49, Jeremy Chadwick <freebsd at jdc.parodius.com> wrote:
> On Tue, Apr 26, 2011 at 02:25:00PM +0100, Conall O'Brien wrote:
>> On 26 April 2011 13:15, ambrosehuang ambrose <ambrosehua at gmail.com> wrote:
>> > Could you post your PR number? I was curious about the driver used by
>> > the Western Digital disks, because I use the WD10EARS.
>>
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=156647
>>
>> I chalked it up to the SATA controller, since only 2 of my 5 identical
>> WD20EARS disks were reporting DMA issues.
>>
>> >
>> > 2011/4/25 Conall O'Brien <conall at conall.net>
>> >>
>> >> On 15 April 2011 15:59, Conall O'Brien <conall at conall.net> wrote:
>> >> > Hello,
>> >> >
>> >> >
>> >> > I've got a NAS box running 8-STABLE [1] which I'm running with 5x
>> >> > Western Digital 2TB disks.
>> >> >
>> >> >
>> >> > One of the disks was having DMA issues as reported in dmesg, so I
>> >> > began the usual ZFS workflow: "zpool offline pool dev", physically
>> >> > removing the disk, then "zpool replace pool dev". My attempts at the
>> >> > replace fail: the zpool command keeps ending up in uninterruptible
>> >> > wait (the D state). A zpool scrub was already in progress before I
>> >> > started replacing the disk, and I can't stop it with "zpool scrub -s
>> >> > pool" either; that command also ends up in the D state.
>> >> >
>> >> >
>> >> > Is there another way than "zpool scrub -s pool" to terminate a scrub
>> >> > process, so I can proceed with the disk replacement? I care more about
>> >> > resilvering my pool than about scrubbing it right now.
>> >> >
>> >> >
>> >> > Thanks!
>> >> >
>> >> >
>> >> > [1] For completeness, uname -a reports FreeBSD galvatron.taku.ie
>> >> > 8.2-STABLE FreeBSD 8.2-STABLE #1: Sat Mar 19 13:18:46 UTC 2011
>> >> > root at galvatron.taku.ie:/usr/src/obj/usr/src/sys/GALVATRON amd64
>> >>
>> >> I worked out the problem. There's a regression in one of the drivers
>> >> between the kernel I was running and my previous kernel:
>> >>
>> >> FreeBSD galvatron.taku.ie 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #0:
>> >> Wed Dec 29 04:00:27 UTC 2010
>> >> root at galvatron.taku.ie:/usr/src/obj/usr/src/sys/GALVATRON amd64
>> >>
>> >>
>> >> I'll file a PR to get it fixed.
>
> The PR is extremely terse and of sub-par quality.  There isn't actual evidence
> of the problem being a driver regression.  What needs to be provided in
> the PR:

Yeah, I wasn't sure what specifics would be needed, but I wanted to
open a PR and go from there. It was the first time I've run into a
kernel-related issue; PRs for bugs in the ports collection are so much
easier to describe.

> - Relevant dmesg output (pertaining to ataX and adX devices and anything
>  else seen around that time; stuff from /var/log/messages might be more
>  useful since it contains timestamps)
> - Full dmesg seen during a fresh reboot
> - vmstat -i
> - atacontrol cap ataX (for each ataX channel.  You can XXX out the
>  serial number if desired)
> - smartctl -a /dev/adX (for each disk, be sure to label which disk
>  is associated with what data.  You can XXX out the serial number if
>  desired)
>
> What really needs to be shown are the actual errors themselves, and in
> sequential order / with timestamps.  "DMA errors" is too vague; I want
> to assume READ_DMA48 but I cannot assume that.

Now that my RAID array is healthy again, I'm happy to reboot into my
suspect kernel and collect better diagnostic reports.
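
A rough sketch of how the items in that list could be gathered in one
pass. The channel and disk names (ata2, ata3, ad4, ad6) are placeholders
and not taken from this system, and the atacontrol argument form simply
follows the list above. By default it only prints the commands it would
run; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Placeholder device names: adjust to whatever dmesg reports locally.
CHANNELS="${CHANNELS:-ata2 ata3}"
DISKS="${DISKS:-ad4 ad6}"
# Default to a dry run so nothing is executed by accident.
DRY_RUN="${DRY_RUN:-1}"

# Print the diagnostic commands, one per line.
diag_commands() {
    echo "dmesg"
    echo "vmstat -i"
    for ch in $CHANNELS; do
        echo "atacontrol cap $ch"
    done
    for d in $DISKS; do
        echo "smartctl -a /dev/$d"
    done
}

# Label each command with a header line; run it unless DRY_RUN is 1.
collect() {
    diag_commands | while read -r cmd; do
        echo "=== $cmd ==="
        if [ "$DRY_RUN" != "1" ]; then
            # stdin is redirected so the command cannot eat the pipe.
            $cmd </dev/null 2>&1
        fi
    done
}

collect
```

Redirecting the output to a file (for example "sh collect.sh > diag.txt",
where the script name is arbitrary) keeps each section labelled by disk
or channel, which makes it easier to XXX out serial numbers before
attaching it to the PR.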

> Next:
>
> I'm not sure if your system supports it, but can you run the controller
> in AHCI mode (BIOS setting) and load ahci.ko instead (ahci_load="yes" in
> /boot/loader.conf, your disks will change to /dev/adaX)?  If so, this
> would allow you to narrow down whether or not the issue is truly a
> driver problem.  You should try this *before* attempting the below.

I actually intended to convert my disks over to AHCI anyway, to make
hot swapping easier. I assume I can do a "zpool import" to get my ZFS
pool working with the new device names.
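
A minimal sketch of that switch, assuming a pool named "tank" (a
placeholder) and assuming the pool is exported before the reboot and
imported afterwards. The helper name enable_ahci is made up for
illustration; it only adds the loader.conf line mentioned above if no
ahci_load entry exists yet.

```shell
#!/bin/sh
# Append ahci_load="YES" to a loader.conf-style file unless an
# ahci_load entry is already present. Defaults to /boot/loader.conf.
enable_ahci() {
    conf="${1:-/boot/loader.conf}"
    grep -q '^ahci_load=' "$conf" 2>/dev/null || echo 'ahci_load="YES"' >> "$conf"
}

# Intended order (left commented out; "tank" is a placeholder pool name):
#   zpool export tank
#   enable_ahci          # then set the BIOS to AHCI mode and reboot
#   zpool import tank    # disks are expected to show up as /dev/adaX
```

The function takes the file path as an argument so it can be tried on a
scratch copy of loader.conf first.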

> Try updating your source to something newer than March 19th.  There have
> been ata(4) changes since then that might pertain to your issue.  If the
> same issue happens on a present-day build of RELENG_8 then we can start
> by trying to narrow it down to commits between, roughly, late December
> 2010 and mid-March 2011.  Since you follow RELENG_8, you will need to
> follow commits.  src/sys/dev/ata is what's relevant here, as well as the
> chipsets/ directory under that.

Agreed, I probably shouldn't have left it so long between kernel
rebuilds. I guess I was hoping there weren't too many changes related
to my SATA controller, but that does naively assume the problem is the
SATA controller driver.
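
If it does come to narrowing down the commit window, the search can be
treated as a bisection over source dates. A small sketch, assuming the
known-good and known-bad points are expressed as epoch seconds; the
function names are hypothetical and it only does the bookkeeping, not
the checkout or the build.

```shell
#!/bin/sh
# Midpoint between two epoch timestamps: the next source date to test.
midpoint() {
    echo $(( ($1 + $2) / 2 ))
}

# Given the current good/bad range and the verdict for the midpoint
# build ("good" or "bad"), print the narrowed "good bad" range.
narrow() {
    good=$1
    bad=$2
    verdict=$3
    mid=$(midpoint "$good" "$bad")
    if [ "$verdict" = "good" ]; then
        # Midpoint works, so the regression came in after it.
        echo "$mid $bad"
    else
        # Midpoint fails, so the regression came in at or before it.
        echo "$good $mid"
    fi
}
```

On FreeBSD, "date -u -r <seconds>" turns a midpoint back into a readable
date for the source checkout. Each round halves the window, so the
roughly eleven weeks between the two kernels above should need only a
handful of builds.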

> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/chipsets/
>
> Let's get this figured out before other users start correlating their
> problems with whatever this is.

Agreed.

-- 

Conall O'Brien