From: Alan Somers
Date: Fri, 18 Jun 2021 17:39:55 -0600
Subject: Re: ZFS config question
To: joe mcguckin
Cc: freebsd-fs

You definitely don't want 60 drives in the same RAIDZ vdev, and this is why: RAIDZ1 does not use the same layout as traditional RAID5 (ditto for RAIDZ2 and RAID6).

With RAID5, each set of data+parity chunks is distributed over all of the disks. For example, an 8+1 array is composed of identical rows that each have 8 data chunks and 1 parity chunk, of perhaps a few dozen KB per chunk. But with RAIDZ, each set of data+parity chunks is distributed over only as many disks as are needed for _a_single_record_. For example, in that same 8+1 array, a 32KB record would be divided into 8 data chunks and 1 parity chunk of 4KB apiece. But assuming ashift=12, a 16KB record would be divided into only _4_ data chunks and 1 parity chunk of 4KB apiece. So small records are less space efficient to store on RAIDZ, and the problem gets worse the larger the RAIDZ vdev.
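To put rough numbers on that, here is a quick back-of-the-envelope Python sketch (my own simplification of the layout described above, assuming ashift=12, i.e. 4KB sectors, and one set of parity sectors per row of data sectors; it is not code taken from ZFS itself):

import math

def raidz_sectors(record_size, ndisks, parity, ashift=12):
    """Data and parity sectors one record consumes on a RAIDZ vdev
    (ignoring padding sectors)."""
    sector = 1 << ashift
    data = math.ceil(record_size / sector)        # data sectors needed
    rows = math.ceil(data / (ndisks - parity))    # parity is added per row
    return data, rows * parity

for recsize_kb in (4, 16, 32, 128):
    d, p = raidz_sectors(recsize_kb * 1024, ndisks=9, parity=1)
    print(f"{recsize_kb:3d} KB record on 8+1 RAIDZ1: {d} data + {p} parity "
          f"sectors ({d / (d + p):.0%} space efficiency)")

For the 8+1 case that prints roughly 50% efficiency for a 4KB record, 80% for 16KB, and about 89% for 32KB and up, so the penalty is concentrated on small records, and relative to the ideal data/parity ratio it gets worse the wider the vdev.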
In fact, the problem is a little bit worse than this example shows, due to padding blocks. I won't go into those right now.

But it's not just space efficiency, it's IOPs too. In our 8+1 RAID5 array, if the chunk size is 64KB or larger, then randomly reading a 64KB record requires just a single operation on a single disk. But reading a 64KB record from an 8+1 RAIDZ array requires an operation on each of _8_ disks. So RAIDZ has worse random read IOPs than RAID5. Basically, if a single disk has X read IOPs, then an n+m RAID5 group provides n * X read IOPs, but an n+m RAIDZ vdev provides only X (there's a quick numeric sketch of this in the P.S. below the quoted message).

But it's not just space efficiency and IOPs, it's rebuild time too. When rebuilding a failed disk, whether it's RAID5 or RAIDZ, you basically have to read the full contents of every other disk in the RAID group (slightly less for RAIDZ, for the reasons discussed above). For large RAID groups, this takes a lot of IOPs and CPU cycles away from servicing user-facing requests. ZFS's dRAID is a partial improvement, but only a partial one.

The best RAIDZ size for you depends on the typical record size you're going to have, your random read IOPs requirement, the ashift of your drives, and how much of a performance hit you're willing to accept during a rebuild. But 60 is way too many.

-Alan

On Fri, Jun 18, 2021 at 5:21 PM joe mcguckin wrote:

> If I have a box with 60 SAS drives - why not make it one big RAID volume?
>
> Is there a benefit to a filesystem composed of multiple, smaller vdevs vs one giant 40-50 drive zpool?
>
> Are there guidelines or rules-of-thumb for sizing vdevs and zpools?
>
> Thanks,
>
> Joe
>
> Joe McGuckin
> ViaNet Communications
>
> joe@via.net
> 650-207-0372 cell
> 650-213-1302 office
> 650-969-2124 fax
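P.S. Here is the IOPs arithmetic above as a tiny Python sketch. It is a toy model of my own, not a benchmark: it just assumes each disk serves X random read IOPs, that a RAID5 read of a record no bigger than the chunk size touches one disk, and that a RAIDZ read touches all n data disks. The 200 IOPs figure and the group sizes are made-up examples.

def raid5_read_iops(n, per_disk_iops):
    # Small random reads each land on a single disk, so the n data disks
    # can serve roughly n independent reads concurrently.
    return n * per_disk_iops

def raidz_read_iops(n, per_disk_iops):
    # Every record is striped across all n data disks, so each read ties
    # up the whole vdev and it delivers about one disk's worth of IOPs.
    return per_disk_iops

X = 200  # hypothetical random read IOPs for one SAS spindle
for n, m in ((8, 1), (19, 1), (58, 2)):
    print(f"{n}+{m}: RAID5 ~{raid5_read_iops(n, X)} read IOPs, "
          f"RAIDZ ~{raidz_read_iops(n, X)} read IOPs")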