zfs, raidz, spare and jbod
Jeremy Chadwick
koitsu at FreeBSD.org
Fri Jul 25 09:45:16 UTC 2008
On Fri, Jul 25, 2008 at 09:46:34AM +0200, Claus Guttesen wrote:
> Hi.
>
> I installed FreeBSD 7 a few days ago and upgraded to the latest stable
> release using the GENERIC kernel. I also added these entries to
> /boot/loader.conf:
>
> vm.kmem_size="1536M"
> vm.kmem_size_max="1536M"
> vfs.zfs.prefetch_disable=1
>
> Initially prefetch was enabled and I would experience hangs, but after
> disabling prefetch, copying large amounts of data went along without
> problems. To see if FreeBSD 8 (current) had better (copy) performance
> I upgraded to current as of yesterday. After upgrading and rebooting
> the server responded fine.
With regards to RELENG_7, I completely agree with disabling prefetch.
The overall performance (of the system and disk I/O) is significantly
"smoother" when prefetch is disabled, e.g. fewer hard lock-ups and
stalls.
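For anyone following along, the tunable is the one Claus set in
/boot/loader.conf above. As far as I know you can confirm it took
effect after boot with sysctl (I believe it is a loader tunable, so
setting it at runtime may not work):

# sysctl vfs.zfs.prefetch_disable
vfs.zfs.prefetch_disable: 1

A value of 1 means prefetch is disabled.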
I have not tried CURRENT. I'm told the ZFS code in CURRENT is the same
as RELENG_7, so I'm not sure what you were trying to test by switching
from RELENG_7 to CURRENT.
> The server is a Supermicro with a quad-core Harpertown E5405, two
> internal SATA drives and 8 GB of RAM. I installed an Areca ARC-1680
> SAS controller and configured it in JBOD mode. I attached an external
> SAS cabinet with 16 SAS disks at 1 TB (931 binary GB) each.
>
> I created a raidz2 pool with 10 disks and added one spare. I copied
> approx. 1 TB of small files (each approx. 1 MB) and during the copy I
> simulated a disk-crash by pulling one of the disks out of the cabinet.
> ZFS did not activate the spare, and the copying stalled; I rebooted
> after 5-10 minutes. When I performed a 'zpool status' the command
> would not complete. I did not see any messages in /var/log/messages.
> State in top showed 'ufs-'.
>
> A similar test on Solaris Express Developer Edition b79 activated the
> spare after ZFS tried to write to the missing disk enough times and
> then marked it as faulted. Has anyone else tried to simulate a
> disk-crash in raidz(2) and succeeded?
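(As an aside: when a spare does not kick in automatically, my
understanding is that it can be attached by hand with 'zpool replace',
assuming the pool still responds to commands. For example, with a
hypothetical pool 'tank' where da5 failed and da10 is the hot spare:

# zpool replace tank da5 da10

In your case the pool apparently hung outright, so this probably would
not have helped, but it is worth knowing about.)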
Is there any way to confirm the behaviour is specific to raidz2, or
would it affect raidz1 as well? I have a raidz1 pool at home which I
could pull a disk from (only 3 disks, so pulling one will probably
result in bad things), though it's on an onboard ICHx controller.
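Before physically yanking anything, I may first try simulating the
failure administratively with 'zpool offline', using my own pool and
device names (shown further down):

# zpool offline storage ad8
# zpool status storage
# zpool online storage ad8

That obviously does not exercise the same code path as a disk vanishing
mid-I/O, but it should show whether the pool degrades and recovers
sanely when a vdev goes away cleanly.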
I have no experience with Areca controllers or their driver, but I do
have experience with standard onboard Intel ICHx chips. WRT those
chips, "pulling disks" without administratively downing the ATA channel
will cause a kernel panic. If the Areca controller/driver handles
things better, great.
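For reference, on my box the way I would administratively down the
channel before pulling a disk is with atacontrol; ad8 sits on ata4
according to dmesg, so something like:

# atacontrol detach ata4
  (physically swap the disk here)
# atacontrol attach ata4

That is from memory, so verify the channel numbering with 'atacontrol
list' before trying it.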
What I'm trying to say is that I can offer to help test raidz1, but not
on Areca controllers. The hardware is similar to yours: Supermicro
PDSMi+, Intel E6600 (C2D), 4GB RAM, running RELENG_7 amd64. The system
contains 4 disks; ad6, ad8, and ad10 are in a ZFS pool, and ad4 is the
OS disk:
ad4: 190782MB <WDC WD2000JD-00HBB0 08.02D08> at ata2-master SATA150
ad6: 476940MB <WDC WD5000AAKS-00YGA0 12.01C02> at ata3-master SATA300
ad8: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata4-master SATA300
ad10: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata5-master SATA300
        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
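I do not have a hot spare configured in this pool, but my understanding
is that one could be included at creation time or added afterwards along
these lines; da0 through da4 here are hypothetical devices, not my
actual disks:

# zpool create tank raidz2 da0 da1 da2 da3 spare da4
or, for an existing pool:
# zpool add tank spare da4

Since there is no spare here, I cannot speak to whether spare activation
itself works on FreeBSD; I can only test the degraded/resilver side.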
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |