quantifying zpool performance with number of vdevs
Paul Kraus
paul at kraus-haus.org
Fri Jan 29 21:11:15 UTC 2016
On Jan 29, 2016, at 13:06, Graham Allan <allan at physics.umn.edu> wrote:
> In many of the storage systems I built to date I was slightly conservative (?) in wanting to keep any one pool confined to a single JBOD chassis. In doing this I've generally been using the Supermicro 45-drive chassis with pools made of 4x (8+2) raidz2, other slots being kept for spares, ZIL and L2ARC.
> Obviously theory says that iops should scale with number of vdevs but it would be nice to try and quantify.
>
> Getting relevant data out of iperf seems problematic on machines with 128GB+ RAM - it's hard to blow out the ARC.
In a previous life, where I was responsible for over 200 TB of storage (in 2008, back when that was a lot), I did some testing for both reliability and performance before committing to a configuration for our new storage system. It was not FreeBSD but Solaris, and we had 5 x J4400 chassis (each with 24 drives), all dual SAS attached across four HBA ports.
This link https://docs.google.com/spreadsheets/d/13sLzYKkmyi-ceuIlUS2q0oxcmRnTE-BRvBYHmEJteAY/edit?usp=sharing has some of the performance testing I did. I did not look at Sequential Read as that was not in our workload; in hindsight I should have. By limiting the ARC (the entire ARC) to 4 GB I was able to get reasonably accurate results. The number of vdevs made very little difference to Sequential Writes, but Random Reads and Writes scaled nearly linearly with the number of top-level vdevs.
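(For reference, capping the ARC for this kind of testing is a single tunable; the 4 GB value below matches the figure above, everything else is illustrative. On Solaris it goes in /etc/system, and the FreeBSD equivalent is vfs.zfs.arc_max in /boot/loader.conf; both take effect at the next boot.)

    # Solaris: cap the ARC at 4 GB (value in bytes)
    # /etc/system
    set zfs:zfs_arc_max = 4294967296

    # FreeBSD equivalent (loader tunable)
    # /boot/loader.conf
    vfs.zfs.arc_max="4294967296"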
Our eventual config was RAIDz2 based because we could not meet the space requirements with mirrors, especially as we would have had to go with 3-way mirrors to get the same MTTDL as with the RAIDz2. The production pool consisted of 22 top-level vdevs, each a 5-drive RAIDz2 with each drive in a different disk chassis. So all of the drives in slots 0 and 1 were hot spares, all of the drives in slot 2 made up one vdev, all of the drives in slot 3 made up another, and so on; we were striping data across 22 vdevs. During pre-production testing we completely lost connectivity to 2 of the 5 disk chassis and had no loss of data or availability. When those chassis came back, they resilvered and went along their merry way (just as they should).
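(To make the layout concrete, a rough sketch of how such a pool would be created. Device names and the pool name are hypothetical here: c<chassis>t<slot>d0 across five chassis c0-c4; the real pool repeated the raidz2 pattern for every data slot to reach 22 vdevs.)

    # each raidz2 vdev takes one slot across all five chassis;
    # slots 0 and 1 in every chassis are hot spares
    zpool create tank \
        raidz2 c0t2d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 \
        raidz2 c0t3d0 c1t3d0 c2t3d0 c3t3d0 c4t3d0 \
        spare  c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 \
               c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0
    # ...the actual pool continued the raidz2 lines for slots 4 through 23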
Once the system went live we took hourly snapshots and replicated them both locally and remotely for backup purposes. We estimated that it would have taken over 3 weeks to restore all the data from tape if we had to, and that was unacceptable. The only issue we ran into related to resilvering after a drive failure. Due to the large number of snapshots and the ongoing snapshot creation, a resilver could take over a week.
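(The replication side is not spelled out above; assuming it was the usual incremental zfs send piped into zfs receive, one hourly cycle would look roughly like this. Pool, host, and snapshot names are made up.)

    # take the hourly recursive snapshot
    zfs snapshot -r tank@hourly-2016012913
    # replicate incrementally against the previous hour's snapshot,
    # once to a local backup pool and once to a remote host
    zfs send -R -i tank@hourly-2016012912 tank@hourly-2016012913 | \
        zfs receive -du backuppool
    zfs send -R -i tank@hourly-2016012912 tank@hourly-2016012913 | \
        ssh backuphost zfs receive -du remotepool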
--
Paul Kraus
paul at kraus-haus.org