Hours of tiny transfers at the end of a ZFS resilver?
Andrew Reilly
areilly at bigpond.net.au
Mon Feb 15 10:19:09 UTC 2016
Hi Filesystem experts,
I have a question about the nature of ZFS and the resilvering
that occurs after a drive replacement in a raidz array.
I have a fairly simple home file server that (by way of
gradually replaced pieces and upgrades) has effectively been
doing great service since, well, forever, but its last re-build
replaced its main UFS file systems with a four-drive ZFS raidz
pool. It's been going very nicely over the years, and now it's
full, so I've nearly finished replacing its 1TB drives with new
4TB ones. I'm doing that the slow way, replacing one at a time
and resilvering before going on to the next, because that only
requires a minute or two of down-time for the drive swaps each
time. Replacing the whole array and restoring from backup would
have had the system off-line for many hours (I guess).
Now, one thing that I didn't realise at the start of this
process was that the zpool has the original 512B sector size
baked in at a fairly low level, so it is using some sort of
work-around for the fact that the new drives actually have 4096B
sectors (they emulate 512B logical sectors, although smartctl -i
does report the 4096B physical size):
The four new drives appear to smartctl as:
Model Family: HGST Deskstar NAS
Device Model: HGST HDN724040ALE640
Serial Number: PK1334PEHYSZ6S
LU WWN Device Id: 5 000cca 250dba043
Firmware Version: MJAOA5E0
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Feb 15 20:57:30 2016 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
They show up in zpool status as:
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Feb 15 07:19:45 2016
3.12T scanned out of 3.23T at 67.4M/s, 0h29m to go
798G resilvered, 96.48% done
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
ada0p1 ONLINE 0 0 0 block size: 512B configured, 4096B native
ada1p1 ONLINE 0 0 0 block size: 512B configured, 4096B native
ada3p1 ONLINE 0 0 0 block size: 512B configured, 4096B native
replacing-3 DEGRADED 0 0 0
17520966596084542745 UNAVAIL 0 0 0 was /dev/ada4p1/old
ada4p1 ONLINE 0 0 0 block size: 512B configured, 4096B native (resilvering)
errors: No known data errors
While clearly sub-optimal, I expect that the performance will
still be good enough for my purposes: I can build a new,
properly aligned file system when I do the next re-build.
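To get a feel for why the 512B-configured pool is sub-optimal on these drives, the cost is read-modify-write: any write smaller than (or misaligned with) a 4096B physical sector forces the drive to read the whole sector, patch it, and write it back. A toy calculation of my own (not ZFS code, just illustrating the alignment arithmetic):

```python
SECTOR = 4096  # physical sector size of the new drives

def sectors_touched(offset, length, sector=SECTOR):
    """Number of physical sectors a logical write of `length` bytes
    starting at byte `offset` overlaps."""
    first = offset // sector
    last = (offset + length - 1) // sector
    return last - first + 1

# A 512B logical write still dirties a whole 4K physical sector,
# so the drive must read-modify-write it:
print(sectors_touched(0, 512))      # 1 sector rewritten for 512B of payload
# An aligned 4K write replaces its sector outright, no read needed:
print(sectors_touched(4096, 4096))  # 1
# A 4K write misaligned by 512B straddles two sectors:
print(sectors_touched(512, 4096))   # 2
```

So the emulation works, it just wastes bandwidth on small or misaligned writes, which is consistent with "good enough until the next re-build".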
The odd thing is that after charging through the resilver using
large blocks (around 64k according to systat), when they get to
the end, as this one is now, the process drags on for hours with
millions of tiny, sub-2K transfers:
Here's the systat -vmstat output right now.
19 users Load 0.28 0.30 0.27 15 Feb 21:01
Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 832452 38920 5151576 111272 258776 count
All 859068 52388 5349268 249964 pages 3
Proc: Interrupts
r p d s w Csw Trp Sys Int Sof Flt ioflt 1026 total
174 4417 126 4775 1026 45 90 cow atkbd0 1
88 zfod hdac1 16
2.8%Sys 0.5%Intr 3.9%User 0.0%Nice 92.9%Idle ozfod ehci0 ehci
| | | | | | | | | | %ozfod 249 siis0 ohci
=+>> daefr 147 hpet0:t0
4 dtbuf prcfr 165 hpet0:t1
Namei Name-cache Dir-cache 213520 desvn 3 totfr 114 hpet0:t2
Calls hits % hits % 12132 numvn react hdac0 259
77130 77093 100 4853 frevn pdwak xhci1 261
32 pdpgs ahci0:ch0
Disks da0 ada0 ada1 ada2 ada3 ada4 pass0 intrn 170 ahci0:ch1
KB/t 0.00 1.41 1.46 0.00 1.39 1.47 0.00 6013640 wire 151 ahci0:3
tps 0 173 138 0 176 151 0 77140 act 30 re0 266
MB/s 0.00 0.24 0.20 0.00 0.24 0.22 0.00 1732676 inact
%busy 0 18 15 0 17 99 0 cache
258776 free
29632 buf
So there's a problem with the zpool status output: it's
predicting half an hour to go based on the 67.4M/s averaged over
the whole resilver, not the <2MB/s that it's actually doing now,
and it will probably keep doing that for several hours, if
tonight goes the same way as last night. Last night zpool status
said "0h05m to go" for more than three hours, before I gave up
waiting to start the next drive.
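My guess (an assumption on my part, not something I've confirmed in the code) is that the estimate is simply (total - scanned) / long-run-average-rate, so once the tail of tiny transfers starts, the instantaneous rate collapses but the long-run average barely moves, and the ETA freezes. Plugging in the numbers from the zpool status above:

```python
scanned   = 3.12e12   # bytes scanned so far, per zpool status
total     = 3.23e12
avg_rate  = 67.4e6    # bytes/s averaged over the whole resilver
tail_rate = 2e6       # roughly what systat shows during the small-block tail

remaining = total - scanned

# The optimistic estimate, using the whole-run average:
print(f"naive ETA: {remaining / avg_rate / 60:.0f} min")
# What the tail actually implies, if it stays this slow:
print(f"tail ETA:  {remaining / tail_rate / 3600:.1f} h")
```

The first number matches the "0h29m to go" in the status output; the second is why the real finish time is measured in hours, not minutes.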
Is this expected behaviour, or something bad and peculiar about
my system?
I'm confused about how ZFS really works, given this state. I
had thought that the zpool layer did parity calculation in big
256k-ish stripes across the drives, and the zfs filesystem layer
coped with that large block size because it had lots of caching
and wrote everything in log-structure. Clearly that mental
model must be incorrect, because then it would only ever be
doing large transfers. Anywhere I could go to find a nice
write-up of how ZFS is working?
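For what it's worth, my current (possibly wrong) understanding is that raidz has no fixed stripe width at all: each filesystem block becomes its own variable-width stripe, split across the data drives with its own parity column, so a small metadata block turns into correspondingly tiny per-drive I/Os. A toy sketch of that layout for a 4-drive raidz1 (function name and rounding are my own, worth checking against a real ZFS reference):

```python
import math

def raidz1_per_drive_io(block_size, ndrives=4, sector=512):
    """Rough per-drive transfer size for one block on raidz1:
    the data is spread over (ndrives - 1) data columns, each
    rounded up to the logical sector size (parity is one more
    column of the same width)."""
    data_drives = ndrives - 1
    return math.ceil(block_size / data_drives / sector) * sector

print(raidz1_per_drive_io(128 * 1024))  # 128KiB data block -> 44032B (~43KiB) per drive
print(raidz1_per_drive_io(2048))        # small metadata block -> 1024B per drive
print(raidz1_per_drive_io(512))         # tiny block -> 512B per drive
```

If that's right, it would explain both phases of the resilver: the bulk of the data lives in large blocks (big per-drive transfers), and the long tail is millions of small metadata blocks, each costing a seek for a sub-2K transfer.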
Cheers,
--
Andrew