zfs mirror reads only on one disk
Jeremy Chadwick
freebsd at jdc.parodius.com
Tue Aug 9 13:52:14 UTC 2011
On Tue, Aug 09, 2011 at 03:10:57PM +0200, Jeremie Le Hen wrote:
> Please Cc: me when replying, as I've not subscribed. Thanks.
>
> I'm using FreeBSD 8.2-STABLE, with a mirrored ZFS pool v15:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             ad10s1  ONLINE       0     0     0
>             ad6s1   ONLINE       0     0     0
>
> ad6: 1907729MB <Hitachi HDS723020BLA642 MN6OA180> at ata3-master UDMA100 SATA 3Gb/s
> ad10: 1907729MB <WDC WD2002FAEX-007BA0 05.01D05> at ata5-master UDMA100 SATA 3Gb/s
>
> (For those wondering why I use sliced disks: the two disks are not
> identical, and slicing lets me get the same usable size on each side.
> Also, since ZFS v15 lacks the autoexpand property, this serves as a
> workaround.)
>
> The mirror is correctly synchronized and when I write on it, I get the
> following iostat(8) output (3 seconds interval):
>
>                   extended device statistics
> device     r/s    w/s     kr/s     kw/s  wait  svc_t  %b
> ad6        0.0  682.8      0.0  41593.3    16   18.7  77
> ad10       0.3  686.8     21.3  41465.4    19   19.4  80
>                   extended device statistics
> device     r/s    w/s     kr/s     kw/s  wait  svc_t  %b
> ad6        0.0  680.9      0.0  41910.7    16   17.3  78
> ad10       0.0  671.2      0.0  41228.1    16   19.6  80
>
>
> However, when I read on the mirror, only ad10 is being used:
>
>                   extended device statistics
> device     r/s    w/s     kr/s   kw/s  wait  svc_t  %b
> ad6        0.0    0.0      0.0    0.0     0    0.0   0
> ad10     762.7    0.0  48796.7    0.0     2    1.8  82
>                   extended device statistics
> device     r/s    w/s     kr/s   kw/s  wait  svc_t  %b
> ad6        0.0    0.0      0.0    0.0     0    0.0   0
> ad10     740.2    0.0  47373.1    0.0     1    1.9  81
>                   extended device statistics
> device     r/s    w/s     kr/s   kw/s  wait  svc_t  %b
> ad6        0.0    0.3      0.0    1.3     0    0.2   0
> ad10     716.2    0.3  45836.0    1.3     2    1.9  82
>
>
> One of my colleagues suggested this might be a ZFS optimization for
> sequential reads, so I tried running two reading processes in
> parallel, with the same unfortunate outcome.
>
> I also tried running "cat *" in a heavily populated Maildir, so I am
> sure the reads were not sequential; same outcome.
>
> Do you have any idea why this happens?
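The parallel-read experiment described above can be sketched as follows
(file paths are placeholders for large files not already cached); watching
gstat or iostat in another terminal should then show whether both mirror
members serve reads:

```shell
# Two concurrent readers; with two independent streams, a mirror should
# be able to serve each stream from a different disk.
# FILE1/FILE2 are placeholders -- substitute real, large, uncached files.
FILE1=/data/big1.bin
FILE2=/data/big2.bin
dd if="$FILE1" of=/dev/null bs=64k &
dd if="$FILE2" of=/dev/null bs=64k &
wait   # block until both readers finish
```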
Since I have a ZFS mirror setup, I can test this. Let's take a look:
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD1002FAEX-00Z3A0 05.01D05> ATA-8 SATA 3.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <WDC WD1001FALS-00J7B1 05.00K05> ATA-8 SATA 2.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
Each of these disks can push about 140MByte/sec (sequential), but I don't
expect to see that kind of I/O here. I do expect to see around
100MByte/sec per disk (you'll just have to trust me; I know my disks! :-) ).
Zpool-wise, absolutely nothing special (note that I am using ZFSv28 on
RELENG_8, however), and *VERY* little tuning is done in loader.conf:
icarus# zpool status data
  pool: data
 state: ONLINE
  scan: scrub repaired 0 in 0h54m with 0 errors on Tue Jun 14 10:24:49 2011
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: No known data errors
icarus# egrep ^vfs.zfs /boot/loader.conf
vfs.zfs.arc_max="5120M"
So let's test. I have some pretty big files on the data/storage
filesystem, so let's try dd'ing one of those while simultaneously using
"gstat -I500ms -f 'ada1|ada3'" to watch disk I/O. It's *extremely*
important that I dd a file which isn't already in the ARC (the ARC
currently takes up about 6GB of RAM for me, so I'll pick a CD image I
haven't accessed since the machine was last rebooted).
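As an aside, one way to gauge the current ARC size on FreeBSD before
picking a test file (a sketch; the sysctl name below is how I recall it
and may differ between releases):

```shell
# Report the current ARC size in bytes; pick a test file you know has
# not been read since boot, so reads must actually hit the disks.
sysctl kstat.zfs.misc.arcstats.size
```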
icarus# cd /storage/CD_Images/FreeBSD/7.4-STABLE/
icarus# ls -l *disc1*
-rwxr--r-- 1 storage storage 663519232 Mar 4 06:54 FreeBSD-7.4-RELEASE-amd64-disc1.iso
icarus# dd if=FreeBSD-7.4-RELEASE-amd64-disc1.iso of=/dev/null bs=64k
10124+1 records in
10124+1 records out
663519232 bytes transferred in 3.965980 secs (167302715 bytes/sec)
And in another window:
dT: 0.504s  w: 0.500s  filter: ada1|ada3
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   10    750    750  94557   13.4      0      0    0.0  100.4| ada1
   10    631    631  80771   15.9      0      0    0.0  100.2| ada3
Looks to me like both disks were getting utilised. Let's double-check
with "zpool iostat -v data 1", using another file which isn't in the
ARC:
icarus# cd ../8.2-STABLE/
icarus# ls -l *memstick*
-rwxr--r-- 1 storage storage 1087774720 Mar 4 06:17 FreeBSD-8.2-RELEASE-amd64-memstick.img
icarus# dd if=FreeBSD-8.2-RELEASE-amd64-memstick.img of=/dev/null bs=64k
16598+1 records in
16598+1 records out
1087774720 bytes transferred in 6.802677 secs (159903917 bytes/sec)
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G      0      0      0      0
  mirror     278G   650G      0      0      0      0
    ada1        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.09K      0   138M      0
  mirror     278G   650G  1.09K      0   138M      0
    ada1        -      -    595      0  74.2M      0
    ada3        -      -    519      0  63.7M      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.10K      0   140M      0
  mirror     278G   650G  1.10K      0   140M      0
    ada1        -      -    542      0  66.8M      0
    ada3        -      -    584      0  73.1M      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.32K      0   168M      0
  mirror     278G   650G  1.32K      0   168M      0
    ada1        -      -    724      0  89.3M      0
    ada3        -      -    626      0  78.3M      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.26K      0   161M      0
  mirror     278G   650G  1.26K      0   161M      0
    ada1        -      -    655      0  80.7M      0
    ada3        -      -    637      0  79.7M      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.23K      0   156M      0
  mirror     278G   650G  1.23K      0   156M      0
    ada1        -      -    635      0  78.2M      0
    ada3        -      -    625      0  78.2M      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.17K      0   148M      0
  mirror     278G   650G  1.17K      0   148M      0
    ada1        -      -    600      0  73.8M      0
    ada3        -      -    595      0  74.4M      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G    955      0   119M      0
  mirror     278G   650G    955      0   119M      0
    ada1        -      -    411      0  50.8M      0
    ada3        -      -    544      0  68.1M      0
----------  -----  -----  -----  -----  -----  -----
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G      0      0      0      0
  mirror     278G   650G      0      0      0      0
    ada1        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
Performance was a little less than I estimated (not that I really care,
to be honest), but this double-confirms that yes, reads do get split
across mirror members.
Therefore I cannot explain what you're seeing. Maybe consider upgrading
to a newer RELENG_8 with ZFSv28 and see if things improve? I wish I had
a way to confirm that this would fix your problem, but I do not.
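If you do go that route, here is a minimal sketch of checking and bumping
the on-disk pool version once you're on a ZFSv28-capable system (pool
name "data" taken from your output; note that upgrading the pool format
is one-way, so older kernels can no longer import it afterwards):

```shell
# List pools whose on-disk format is older than the kernel supports:
zpool upgrade

# Then, once you're sure you won't need to boot an older kernel:
# zpool upgrade data        # IRREVERSIBLE format upgrade
# zpool get version data    # verify the new pool version
```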
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |