svn commit: r44084 - projects/zfsupdate-201307/en_US.ISO8859-1/books/handbook/zfs
Warren Block
wblock at FreeBSD.org
Wed Feb 26 23:49:38 UTC 2014
Author: wblock
Date: Wed Feb 26 23:49:37 2014
New Revision: 44084
URL: http://svnweb.freebsd.org/changeset/doc/44084
Log:
ZFS tuning content additions by Allan Jude <freebsd at allanjude.com>.
Submitted by: Allan Jude <freebsd at allanjude.com>
Modified:
projects/zfsupdate-201307/en_US.ISO8859-1/books/handbook/zfs/chapter.xml
Modified: projects/zfsupdate-201307/en_US.ISO8859-1/books/handbook/zfs/chapter.xml
==============================================================================
--- projects/zfsupdate-201307/en_US.ISO8859-1/books/handbook/zfs/chapter.xml Wed Feb 26 23:44:33 2014 (r44083)
+++ projects/zfsupdate-201307/en_US.ISO8859-1/books/handbook/zfs/chapter.xml Wed Feb 26 23:49:37 2014 (r44084)
@@ -675,7 +675,11 @@ errors: No known data errors</screen>
ideally at least once every three months. The
<command>scrub</command> operation is very disk-intensive and
will reduce performance while running. Avoid high-demand
- periods when scheduling <command>scrub</command>.</para>
+ periods when scheduling <command>scrub</command>, or use <link
+ linkend="zfs-advanced-tuning-scrub_delay"><varname>vfs.zfs.scrub_delay</varname></link>
+ to adjust the relative priority of the
+ <command>scrub</command> to prevent it from interfering
+ with other workloads.</para>
<screen>&prompt.root; <userinput>zpool scrub <replaceable>mypool</replaceable></userinput>
&prompt.root; <userinput>zpool status</userinput>
@@ -890,7 +894,8 @@ errors: No known data errors</screen>
<para>After the scrub operation has completed and all the data
has been synchronized from <filename>ada0</filename> to
- <filename>ada1</filename>, the error messages can be cleared
+ <filename>ada1</filename>, the error messages can be <link
+ linkend="zfs-zpool-clear">cleared</link>
from the pool status by running <command>zpool
clear</command>.</para>
@@ -2014,7 +2019,258 @@ mypool/compressed_dataset logicalused
<sect2 xml:id="zfs-advanced-tuning">
<title><acronym>ZFS</acronym> Tuning</title>
- <para></para>
+ <para>There are a number of tunables that can be adjusted to
+ make <acronym>ZFS</acronym> perform best for different
+ workloads.</para>
+
+ <itemizedlist>
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-arc_max">
+ <emphasis><varname>vfs.zfs.arc_max</varname></emphasis> -
+ Sets the maximum size of the <link
+ linkend="zfs-term-arc"><acronym>ARC</acronym></link>.
+ The default is all <acronym>RAM</acronym> less 1 GB,
+ or one half of <acronym>RAM</acronym>, whichever is more.
+ However, a lower value should be used if the system will
+ be running any other daemons or processes that may
+ require memory.  This value can only be adjusted at boot
+ time, and is set in
+ <filename>/boot/loader.conf</filename>.</para>
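+
+	    <para>As a minimal sketch, a system that also runs other
+	      memory-hungry daemons might cap the
+	      <acronym>ARC</acronym> at a hypothetical 4 GB by
+	      adding an entry like this to
+	      <filename>/boot/loader.conf</filename>:</para>
+
+	    <programlisting>vfs.zfs.arc_max="4G"</programlisting>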
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-arc_meta_limit">
+ <emphasis><varname>vfs.zfs.arc_meta_limit</varname></emphasis>
+ - Limits the portion of the <link
+ linkend="zfs-term-arc"><acronym>ARC</acronym></link>
+ that can be used to store metadata. The default is 1/4 of
+ <varname>vfs.zfs.arc_max</varname>. Increasing this value
+ will improve performance if the workload involves
+ operations on a large number of files and directories, or
+ frequent metadata operations, at the cost of less file
+ data fitting in the <link
+ linkend="zfs-term-arc"><acronym>ARC</acronym></link>.
+ This value can only be adjusted at boot time, and is set
+ in <filename>/boot/loader.conf</filename>.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-arc_min">
+ <emphasis><varname>vfs.zfs.arc_min</varname></emphasis> -
+ Sets the minimum size of the <link
+ linkend="zfs-term-arc"><acronym>ARC</acronym></link>.
+ The default is 1/2 of
+ <varname>vfs.zfs.arc_meta_limit</varname>. Adjust this
+ value to prevent other applications from pressuring out
+ the entire <link
+ linkend="zfs-term-arc"><acronym>ARC</acronym></link>.
+ This value can only be adjusted at boot time, and is set
+ in <filename>/boot/loader.conf</filename>.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-vdev-cache-size">
+ <emphasis><varname>vfs.zfs.vdev.cache.size</varname></emphasis>
+ - A preallocated amount of memory reserved as a cache for
+ each device in the pool. The total amount of memory used
+ will be this value multiplied by the number of devices.
+ This value can only be adjusted at boot time, and is set
+ in <filename>/boot/loader.conf</filename>.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-prefetch_disable">
+ <emphasis><varname>vfs.zfs.prefetch_disable</varname></emphasis>
+ - Toggles prefetch; a value of 0 is enabled and 1 is
+ disabled.  The default is 0, unless the system has less
+ than 4 GB of <acronym>RAM</acronym>.  Prefetch works
+ by reading larger blocks than were requested into the
+ <link linkend="zfs-term-arc"><acronym>ARC</acronym></link>
+ in hopes that the data will be needed soon. If the
+ workload has a large number of random reads, disabling
+ prefetch may actually improve performance by reducing
+ unnecessary reads. This value can be adjusted at any time
+ with &man.sysctl.8;.</para>
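+
+	    <para>For example, on a workload dominated by random
+	      reads, prefetch could be turned off at runtime
+	      with:</para>
+
+	    <screen>&prompt.root; <userinput>sysctl vfs.zfs.prefetch_disable=1</userinput></screen>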
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-vdev-trim_on_init">
+ <emphasis><varname>vfs.zfs.vdev.trim_on_init</varname></emphasis>
+ - Controls whether new devices added to the pool have the
+ <literal>TRIM</literal> command run on them. This ensures
+ the best performance and longevity for
+ <acronym>SSD</acronym>s, but takes extra time. If the
+ device has already been securely erased, disabling this
+ setting will make the addition of the new device faster.
+ This value can be adjusted at any time with
+ &man.sysctl.8;.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-write_to_degraded">
+ <emphasis><varname>vfs.zfs.write_to_degraded</varname></emphasis>
+ - Controls whether new data is written to a vdev that is
+ in the <link linkend="zfs-term-degraded">DEGRADED</link>
+ state.  Defaults to 0, preventing writes to any top-level
+ vdev that is in a degraded state.  The administrator may
+ wish to allow writing to degraded vdevs to prevent the
+ amount of free space across the vdevs from becoming
+ unbalanced, which will reduce read and write performance.
+ This value can be adjusted at any time with
+ &man.sysctl.8;.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-vdev-max_pending">
+ <emphasis><varname>vfs.zfs.vdev.max_pending</varname></emphasis>
+ - Limits the number of pending I/O requests per device.
+ A higher value will keep the device command queue full
+ and may give higher throughput. A lower value will reduce
+ latency. This value can be adjusted at any time with
+ &man.sysctl.8;.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-top_maxinflight">
+ <emphasis><varname>vfs.zfs.top_maxinflight</varname></emphasis>
+ - The maximum number of outstanding I/Os per top-level
+ <link linkend="zfs-term-vdev">vdev</link>. Limits the
+ depth of the command queue to prevent high latency. The
+ limit is per top-level vdev, meaning the limit applies to
+ each <link linkend="zfs-term-vdev-mirror">mirror</link>,
+ <link linkend="zfs-term-vdev-raidz">RAID-Z</link>, or
+ other vdev independently.  This value can be adjusted at
+ any time with &man.sysctl.8;.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-l2arc_write_max">
+ <emphasis><varname>vfs.zfs.l2arc_write_max</varname></emphasis>
+ - Limits the amount of data written to the <link
+ linkend="zfs-term-l2arc"><acronym>L2ARC</acronym></link>
+ per second. This tunable is designed to extend the
+ longevity of <acronym>SSD</acronym>s by limiting the
+ amount of data written to the device. This value can be
+ adjusted at any time with &man.sysctl.8;.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-l2arc_write_boost">
+ <emphasis><varname>vfs.zfs.l2arc_write_boost</varname></emphasis>
+ - The value of this tunable is added to <link
+ linkend="zfs-advanced-tuning-l2arc_write_max"><varname>vfs.zfs.l2arc_write_max</varname></link>
+ and increases the write speed to the
+ <acronym>SSD</acronym> until the first block is evicted
+ from the <link
+ linkend="zfs-term-l2arc"><acronym>L2ARC</acronym></link>.
+ This "Turbo Warmup Phase" is designed to reduce the
+ performance loss from an empty <link
+ linkend="zfs-term-l2arc"><acronym>L2ARC</acronym></link>
+ after a reboot. This value can be adjusted at any time
+ with &man.sysctl.8;.</para>
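+
+	    <para>As a sketch only, both limits could be raised at
+	      runtime with &man.sysctl.8;; the byte values shown here
+	      (16 MB and 32 MB) are examples, not
+	      recommendations:</para>
+
+	    <screen>&prompt.root; <userinput>sysctl vfs.zfs.l2arc_write_max=16777216</userinput>
+&prompt.root; <userinput>sysctl vfs.zfs.l2arc_write_boost=33554432</userinput></screen>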
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-no_scrub_io">
+ <emphasis><varname>vfs.zfs.no_scrub_io</varname></emphasis>
+ - Disable <link
+ linkend="zfs-term-scrub"><command>scrub</command></link>
+ I/O.  Causes <command>scrub</command> to skip reading
+ the data blocks and verifying their checksums,
+ effectively turning any <command>scrub</command> in
+ progress into a no-op.  This may be useful if a
+ <command>scrub</command> is interfering with other
+ operations on the pool.  This value can be adjusted at
+ any time with
+ &man.sysctl.8;.</para>
+
+ <warning><para>If this tunable is set to cancel an
+ in-progress <command>scrub</command>, be sure to unset
+ it afterwards or else all future
+ <link linkend="zfs-term-scrub">scrub</link> and <link
+ linkend="zfs-term-resilver">resilver</link> operations
+ will be ineffective.</para></warning>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-scrub_delay">
+ <emphasis><varname>vfs.zfs.scrub_delay</varname></emphasis>
+ - Determines the milliseconds of delay inserted between
+ each I/O during a <link
+ linkend="zfs-term-scrub"><command>scrub</command></link>.
+ To ensure that a <command>scrub</command> does not
+ interfere with the normal operation of the pool, if any
+ other I/O is happening the <command>scrub</command> will
+ delay between each command.  This value allows limiting
+ the total <acronym>IOPS</acronym> (I/Os Per Second)
+ generated by the <command>scrub</command>.  The default
+ value is 4, resulting in a limit of: 1000 ms / 4 =
+ 250 <acronym>IOPS</acronym>.  Using a value of
+ <replaceable>20</replaceable> would give a limit of:
+ 1000 ms / 20 = 50 <acronym>IOPS</acronym>.  The
+ speed of <command>scrub</command> is only limited when
+ there has been recent activity on the pool, as
+ determined by <link
+ linkend="zfs-advanced-tuning-scan_idle"><varname>vfs.zfs.scan_idle</varname></link>.
+ This value can be adjusted at any time with
+ &man.sysctl.8;.</para>
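+
+	    <para>For example, to slow an in-progress
+	      <command>scrub</command> on a busy pool to roughly
+	      50 <acronym>IOPS</acronym>, the delay could be raised
+	      at runtime:</para>
+
+	    <screen>&prompt.root; <userinput>sysctl vfs.zfs.scrub_delay=20</userinput></screen>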
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-resilver_delay">
+ <emphasis><varname>vfs.zfs.resilver_delay</varname></emphasis>
+ - Determines the milliseconds of delay inserted between
+ each I/O during a <link
+ linkend="zfs-term-resilver">resilver</link>. To ensure
+ that a <literal>resilver</literal> does not interfere with
+ the normal operation of the pool, if any other I/O is
+ happening the <literal>resilver</literal> will delay
+ between each command.  This value allows limiting the
+ total <acronym>IOPS</acronym> (I/Os Per Second) generated
+ by the <literal>resilver</literal>.  The default value is
+ 2, resulting in a limit of: 1000 ms / 2 =
+ 500 <acronym>IOPS</acronym>. Returning the pool to
+ an <link linkend="zfs-term-online">Online</link> state may
+ be more important if another device failing could <link
+ linkend="zfs-term-faulted">Fault</link> the pool, causing
+ data loss. A value of 0 will give the
+ <literal>resilver</literal> operation the same priority as
+ other operations, speeding the healing process. The speed
+ of <literal>resilver</literal> is only limited when there
+ has been other recent activity on the pool, as determined
+ by <link
+ linkend="zfs-advanced-tuning-scan_idle"><varname>vfs.zfs.scan_idle</varname></link>.
+ This value can be adjusted at any time with
+ &man.sysctl.8;.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-scan_idle">
+ <emphasis><varname>vfs.zfs.scan_idle</varname></emphasis>
+ - Number of milliseconds since the last operation before
+ the pool is considered idle.  When the pool is idle, the
+ rate limiting for <link
+ linkend="zfs-term-scrub"><command>scrub</command></link>
+ and <link
+ linkend="zfs-term-resilver">resilver</link> is disabled.
+ This value can be adjusted at any time with
+ &man.sysctl.8;.</para>
+ </listitem>
+
+ <listitem>
+ <para xml:id="zfs-advanced-tuning-txg-timeout">
+ <emphasis><varname>vfs.zfs.txg.timeout</varname></emphasis>
+ - Maximum seconds between <link
+ linkend="zfs-term-txg">transaction group</link>s. The
+ current transaction group will be written to the pool and
+ a fresh transaction group started if this amount of time
+ has elapsed since the previous transaction group.  A
+ transaction group may be triggered earlier if enough data
+ is written.  The default value is 5 seconds.  A larger
+ value may improve read performance by delaying
+ asynchronous writes, but this may cause uneven performance
+ when the transaction group is written. This value can be
+ adjusted at any time with &man.sysctl.8;.</para>
+ </listitem>
+ </itemizedlist>
</sect2>
<sect2 xml:id="zfs-advanced-booting">
@@ -2356,6 +2612,76 @@ vfs.zfs.vdev.cache.size="5M"</programlis
</row>
<row>
+ <entry xml:id="zfs-term-txg">Transaction Group
+ (<acronym>TXG</acronym>)</entry>
+
+ <entry>Transaction Groups are the way changed blocks are
+ grouped together and eventually written to the pool.
+ Transaction groups are the atomic unit that
+ <acronym>ZFS</acronym> uses to assert consistency. Each
+ transaction group is assigned a unique, consecutive
+ 64-bit identifier.  There can be up to three active
+ transaction groups at a time, one in each of these three
+ states:
+
+ <itemizedlist>
+ <listitem>
+ <para><emphasis>Open</emphasis> - When a new
+ transaction group is created, it is in the open
+ state, and accepts new writes. There is always
+ a transaction group in the open state; however, the
+ transaction group may refuse new writes if it has
+ reached a limit. Once the open transaction group
+ has reached a limit, or the <link
+ linkend="zfs-advanced-tuning-txg-timeout"><varname>vfs.zfs.txg.timeout</varname></link>
+ has been reached, the transaction group advances
+ to the next state.</para>
+ </listitem>
+
+ <listitem>
+ <para><emphasis>Quiescing</emphasis> - A short state
+ that allows any pending operations to finish while
+ not blocking the creation of a new open
+ transaction group. Once all of the transactions
+ in the group have completed, the transaction group
+ advances to the final state.</para>
+ </listitem>
+
+ <listitem>
+ <para><emphasis>Syncing</emphasis> - All of the data
+ in the transaction group is written to stable
+ storage. This process will in turn modify other
+ data, such as metadata and space maps, that will
+ also need to be written to stable storage. The
+ process of syncing involves multiple passes.  The
+ first and biggest pass writes all of the changed data
+ blocks; it is followed by the metadata, which may take
+ multiple passes to complete.  Since allocating
+ space for the data blocks generates new metadata,
+ the syncing state cannot finish until a pass
+ completes that does not allocate any additional
+ space. The syncing state is also where
+ <literal>synctasks</literal> are completed.
+ <literal>Synctasks</literal> are administrative
+ operations, such as creating or destroying
+ snapshots and datasets, that modify the uberblock.
+ Once the syncing state is complete,
+ the transaction group in the quiescing state is
+ advanced to the syncing state.</para>
+ </listitem>
+ </itemizedlist>
+
+ All administrative functions, such as <link
+ linkend="zfs-term-snapshot"><command>snapshot</command></link>,
+ are written as part of the transaction group.  When a
+ <literal>synctask</literal> is created, it is added to
+ the currently open transaction group, and that group is
+ advanced as quickly as possible to the syncing state in
+ order to reduce the latency of administrative
+ commands.</entry>
+ </row>
+
+ <row>
<entry xml:id="zfs-term-arc">Adaptive Replacement
Cache (<acronym>ARC</acronym>)</entry>
@@ -2419,12 +2745,13 @@ vfs.zfs.vdev.cache.size="5M"</programlis
room), writing to the <acronym>L2ARC</acronym> is
limited to the sum of the write limit and the boost
limit, then after that limited to the write limit. A
- pair of sysctl values control these rate limits;
- <literal>vfs.zfs.l2arc_write_max</literal> controls how
- many bytes are written to the cache per second, while
- <literal>vfs.zfs.l2arc_write_boost</literal> adds to
- this limit during the "Turbo Warmup Phase" (Write
- Boost).</entry>
+ pair of sysctl values control these rate limits; <link
+ linkend="zfs-advanced-tuning-l2arc_write_max"><varname>vfs.zfs.l2arc_write_max</varname></link>
+ controls how many bytes are written to the cache per
+ second, while <link
+ linkend="zfs-advanced-tuning-l2arc_write_boost"><varname>vfs.zfs.l2arc_write_boost</varname></link>
+ adds to this limit during the "Turbo Warmup Phase"
+ (Write Boost).</entry>
</row>
<row>
@@ -2682,7 +3009,7 @@ vfs.zfs.vdev.cache.size="5M"</programlis
(zero length encoding) is a special compression
algorithm that only compresses continuous runs of
zeros. This compression algorithm is only useful
- when the dataset contains large, continous runs of
+ when the dataset contains large, continuous runs of
zeros.</para>
</listitem>
</itemizedlist></entry>
@@ -2746,7 +3073,12 @@ vfs.zfs.vdev.cache.size="5M"</programlis
but a <command>scrub</command> makes sure even
infrequently used blocks are checked for silent
corruption. This improves the security of the data,
- especially in archival storage situations.</entry>
+ especially in archival storage situations. The relative
+ priority of <command>scrub</command> can be adjusted
+ with <link
+ linkend="zfs-advanced-tuning-scrub_delay"><varname>vfs.zfs.scrub_delay</varname></link>
+ to prevent the scrub from degrading the performance of
+ other workloads on the pool.</entry>
</row>
<row>