Uneven load on drives in ZFS RAIDZ1
Dan Nelson
dnelson at allantgroup.com
Mon Dec 19 21:53:22 UTC 2011
In the last episode (Dec 19), Stefan Esser said:
> Am 19.12.2011 17:22, schrieb Dan Nelson:
> > In the last episode (Dec 19), Stefan Esser said:
> >> for quite some time I have observed an uneven distribution of load
> >> between drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt
> >> of a longer log of 10 second averages logged with gstat:
> >>
> >> dT: 10.001s  w: 10.000s  filter: ^a?da?.$
> >>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
> >>     0    130    106   4134    4.5     23   1033    5.2    48.8| ada0
> >>     0    131    111   3784    4.2     19   1007    4.0    47.6| ada1
> >>     0     90     66   2219    4.5     24   1031    5.1    31.7| ada2
> >>     1     81     58   2007    4.6     22   1023    2.3    28.1| ada3
> > [...]
>
> This is a ZFS only system. The first partition on each drive holds just
> the gptzfsloader.
>
>                 capacity     operations    bandwidth
> pool         alloc   free   read  write   read  write
> ----------   -----  -----  -----  -----  -----  -----
> raid1        4.41T  2.21T    139     72  12.3M   818K
>   raidz1     4.41T  2.21T    139     72  12.3M   818K
>     ada0p2       -      -    114     17  4.24M   332K
>     ada1p2       -      -    106     15  3.82M   305K
>     ada2p2       -      -     65     20  2.09M   337K
>     ada3p2       -      -     58     18  2.18M   329K
>
> The same difference of read operations per second as shown by gstat ...
I was under the impression that the parity blocks were scattered evenly
across all disks, but from reading vdev_raidz.c, it looks like that isn't
always the case. See the comment at the bottom of the
vdev_raidz_map_alloc() function; it looks like it will toggle parity between
the first two disks in a stripe every 1MB. It's not necessarily the first
two disks assigned to the zvol, since stripes don't have to span all disks
as long as there's one parity block (a small sync write may just hit two
disks, essentially being written mirrored). The imbalance is only visible
if you're writing full-width stripes in sequence, so if you write a 1TB file
in one long stream, chances are that that file's parity blocks will be
concentrated on just two disks, so those two disks will get less I/O on
later reads. I don't know why the code toggles parity between just the
first two columns; rotating it between all columns would give you an even
balance.
Is it always the last two disks that have less load, or does it slowly
rotate to different disks depending on the data that you are reading? An
interesting test would be to idle the system, run a "tar cvf /dev/null
/raidz1" in one window, and watch iostat output on another window. If the
load moves from disk to disk as tar reads different files, then my parity
guess is probably right.  If ada0 and ada1 are always busier, then you can
ignore me :)
Since it looks like the algorithm ends up creating two half-cold parity
disks instead of one cold disk, I bet a 3-disk RAIDZ would exhibit even
worse balancing, and a 5-disk set would be more even.
--
Dan Nelson
dnelson at allantgroup.com