docs/159897: [patch] improve HAST section of Handbook

Sun Aug 21 01:56:10 UTC 2011

On Thu, 18 Aug 2011, Warren Block wrote:

> FreeBSD lightning 8.2-STABLE FreeBSD 8.2-STABLE #0: Wed Aug 17 19:31:39 MDT 2011     root@lightning:/usr/obj/usr/src/sys/LIGHTNING  i386
>> Description:
> Edit and polish the HAST section of the Handbook with an eye to conciseness and clarity.
"concision" is two characters shorter :) (though the OED lists
"conciseness" as the older word)
>> How-To-Repeat:
>
>> Fix:
> Apply patch.
>
> Patch attached with submission follows:
>
> --- en_US.ISO8859-1/books/handbook/disks/chapter.sgml.orig	2011-08-18 15:22:56.000000000 -0600
> +++ en_US.ISO8859-1/books/handbook/disks/chapter.sgml	2011-08-18 16:35:46.000000000 -0600
> @@ -4038,7 +4038,7 @@
>     <sect2>
>       <title>Synopsis</title>
>
> -      <para>High-availability is one of the main requirements in serious
> +      <para>High availability is one of the main requirements in serious
> 	business applications and highly-available storage is a key
> 	component in such environments.  Highly Available STorage, or
> 	<acronym>HAST<remark role="acronym">Highly Available
> @@ -4109,7 +4109,7 @@
> 	  drives.</para>
> 	</listitem>
> 	<listitem>
> -	  <para>File system agnostic, thus allowing to use any file
> +	  <para>File system agnostic, thus allowing use of any file

I think "allowing the use" is better here.

> 	    system supported by &os;.</para>
> 	</listitem>
> 	<listitem>
> @@ -4152,7 +4152,7 @@
> 	total.</para>
>       </note>
>
> -      <para>Since the <acronym>HAST</acronym> works in
> +      <para>Since <acronym>HAST</acronym> works in

This needs an article: "works in a primary-secondary configuration".

> 	primary-secondary configuration, it allows only one of the
> 	cluster nodes to be active at any given time.  The
> 	<literal>primary</literal> node, also called
> @@ -4334,51 +4334,51 @@
> 	  available.</para>
>       </note>
>
> -      <para>HAST is not responsible for selecting node's role
> -	(<literal>primary</literal> or <literal>secondary</literal>).
> -	Node's role has to be configured by an administrator or other
> -	software like <application>Heartbeat</application> using the
> +      <para>A HAST node's role (<literal>primary</literal> or
> +        <literal>secondary</literal>) is selected by an administrator
> +        or other
> +        software like <application>Heartbeat</application> using the
> 	&man.hastctl.8; utility.  Move to the primary node
> 	(<literal><replaceable>hasta</replaceable></literal>) and
> -	issue the following command:</para>
> +	issue this command:</para>
>
>       <screen>&prompt.root; <userinput>hastctl role primary test</userinput></screen>
>
> -      <para>Similarly, run the following command on the secondary node
> +      <para>Similarly, run this command on the secondary node
> 	(<literal><replaceable>hastb</replaceable></literal>):</para>
>
>       <screen>&prompt.root; <userinput>hastctl role secondary test</userinput></screen>
>
>       <caution>
> -	<para>It may happen that both of the nodes are not able to
> -	  communicate with each other and both are configured as
> -	  primary nodes; the consequence of this condition is called
> -	  <literal>split-brain</literal>.  In order to troubleshoot
> +	<para>When the nodes are unable to
> +	  communicate with each other, and both are configured as
> +	  primary nodes, the condition is called
> +	  <literal>split-brain</literal>.  To troubleshoot
> 	  this situation, follow the steps described in <xref
> 	  linkend="disks-hast-sb">.</para>
>       </caution>
>
> -      <para>It is possible to verify the result with the
> +      <para>Verify the result with the
> 	&man.hastctl.8; utility on each node:</para>
>
>       <screen>&prompt.root; <userinput>hastctl status test</userinput></screen>
>
> -      <para>The important text is the <literal>status</literal> line
> -	from its output and it should say <literal>complete</literal>
> +      <para>The important text is the <literal>status</literal> line,
> +	which should say <literal>complete</literal>
> 	on each of the nodes.  If it says <literal>degraded</literal>,
> 	something went wrong.  At this point, the synchronization
> 	between the nodes has already started.  The synchronization
> -	completes when the <command>hastctl status</command> command
> +	completes when <command>hastctl status</command>
> 	reports 0 bytes of <literal>dirty</literal> extents.</para>
>
>
> -      <para>The last step is to create a filesystem on the
> +      <para>The next step is to create a filesystem on the
> 	<devicename>/dev/hast/<replaceable>test</replaceable></devicename>
> -	GEOM provider and mount it.  This has to be done on the
> -	<literal>primary</literal> node (as the
> +	GEOM provider and mount it.  This must be done on the
> +	<literal>primary</literal> node, as
> 	<filename>/dev/hast/<replaceable>test</replaceable></filename>
> -	appears only on the <literal>primary</literal> node), and
> -	it can take a few minutes depending on the size of the hard
> +	appears only on the <literal>primary</literal> node.
> +	It can take a few minutes depending on the size of the hard

The pronoun "it" may be confusing here. I would probably just say
"Creating the filesystem".

> 	drive:</para>
>
>       <screen>&prompt.root; <userinput>newfs -U /dev/hast/test</userinput>
> @@ -4387,9 +4387,9 @@
>
>       <para>Once the <acronym>HAST</acronym> framework is configured
> 	properly, the final step is to make sure that
> -	<acronym>HAST</acronym> is started during the system boot time
> -	automatically.  The following line should be added to the
> -	<filename>/etc/rc.conf</filename> file:</para>
> +	<acronym>HAST</acronym> is started automatically during the system
> +	boot.  This line is added to
> +	<filename>/etc/rc.conf</filename>:</para>

"This line is added" is an unusual construction for what is being
conveyed.  I think "To do so, add this line to" says it more clearly.

>
>       <programlisting>hastd_enable="YES"</programlisting>
>
> @@ -4397,26 +4397,25 @@
> 	<title>Failover Configuration</title>
>
> 	<para>The goal of this example is to build a robust storage
> -	  system which is resistant from the failures of any given node.
> -	  The key task here is to remedy a scenario when a
> -	  <literal>primary</literal> node of the cluster fails.  Should
> -	  it happen, the <literal>secondary</literal> node is there to
> +	  system which is resistant to failures of any given node.

The plural "failures" does not agree with the singular "node".  I think
"resistant to the failure of any given node" is the conventional way to
say this (note that the original also had the incorrect plural "failures").

> +	  The scenario is that a
> +	  <literal>primary</literal> node of the cluster fails.  If
> +	  this happens, the <literal>secondary</literal> node is there to
> 	  take over seamlessly, check and mount the file system, and
> 	  continue to work without missing a single bit of data.</para>
>
> -	<para>In order to accomplish this task, it will be required to
> -	  utilize another feature available under &os; which provides
> +	<para>To accomplish this task, another &os; feature provides
> 	  for automatic failover on the IP layer —
> -	  <acronym>CARP</acronym>.  <acronym>CARP</acronym> stands for
> -	  Common Address Redundancy Protocol and allows multiple hosts
> +	  <acronym>CARP</acronym>.  <acronym>CARP</acronym> (Common Address
> +	  Redundancy Protocol) allows multiple hosts
> 	  on the same network segment to share an IP address.  Set up
>  	  <acronym>CARP</acronym> on both nodes of the cluster according
> 	  to the documentation available in <xref linkend="carp">.
> -	  After completing this task, each node should have its own
> +	  After setup, each node will have its own
> 	  <devicename>carp0</devicename> interface with a shared IP
> 	  address <replaceable>172.16.0.254</replaceable>.
> -	  Obviously, the primary <acronym>HAST</acronym> node of the
> -	  cluster has to be the master <acronym>CARP</acronym>
> +	  The primary <acronym>HAST</acronym> node of the
> +	  cluster must be the master <acronym>CARP</acronym>
> 	  node.</para>
>
> 	<para>The <acronym>HAST</acronym> pool created in the previous
> @@ -4430,17 +4429,17 @@
>
> 	<para>In the event of <acronym>CARP</acronym> interfaces going
> 	  up or down, the &os; operating system generates a &man.devd.8;
> -	  event, which makes it possible to watch for the state changes
> +	  event, making it possible to watch for the state changes
> 	  on the <acronym>CARP</acronym> interfaces.  A state change on
> 	  the <acronym>CARP</acronym> interface is an indication that
> -	  one of the nodes failed or came back online.  In such a case,
> -	  it is possible to run a particular script which will
> +	  one of the nodes failed or came back online.  These state change
> +	  events make it possible to run a script which will
> 	  automatically handle the failover.</para>

I think "handle HAST failover" would be an improvement.

>
> -	<para>To be able to catch the state changes on the
> -	  <acronym>CARP</acronym> interfaces, the following
> -	  configuration has to be added to the
> -	  <filename>/etc/devd.conf</filename> file on each node:</para>
> +	<para>To be able to catch state changes on the
> +	  <acronym>CARP</acronym> interfaces, add this
> +	  configuration to
> +	  <filename>/etc/devd.conf</filename> on each node:</para>
>
> 	<programlisting>notify 30 {
> 	match "system" "IFNET";
> @@ -4456,12 +4455,12 @@
> 	action "/usr/local/sbin/carp-hast-switch slave";
> };</programlisting>
>
> -	<para>To put the new configuration into effect, run the
> -	  following command on both nodes:</para>
> +	<para>Restart &man.devd.8; on both nodes o put the new configuration

Typo: "o" should be "to".

> +	  into effect:</para>
>
> 	<screen>&prompt.root; <userinput>/etc/rc.d/devd restart</userinput></screen>
>
> -	<para>In the event that the <devicename>carp0</devicename>
> +	<para>When the <devicename>carp0</devicename>
> 	  interface goes up or down (i.e. the interface state changes),
> 	  the system generates a notification, allowing the &man.devd.8;
> 	  subsystem to run an arbitrary script, in this case
> @@ -4615,41 +4614,40 @@
>       <sect3>
> 	<title>General Troubleshooting Tips</title>
>
> -	<para><acronym>HAST</acronym> should be generally working
> -	  without any issues, however as with any other software
> +	<para><acronym>HAST</acronym> should generally work
> +	  without issues.  However, as with any other software
> 	  product, there may be times when it does not work as
> 	  supposed.  The sources of the problems may be different, but
> 	  the rule of thumb is to ensure that the time is synchronized
> 	  between all nodes of the cluster.</para>
>
> -	<para>The debugging level of the &man.hastd.8; should be
> -	  increased when troubleshooting <acronym>HAST</acronym>
> -	  problems.  This can be accomplished by starting the
> +	<para>When troubleshooting <acronym>HAST</acronym> problems,
> +	  the debugging level of &man.hastd.8; should be increased
> +	  by starting the
> 	  &man.hastd.8; daemon with the <literal>-d</literal>
> -	  argument.  Note, that this argument may be specified
> +	  argument.  Note that this argument may be specified
> 	  multiple times to further increase the debugging level.  A
> -	  lot of useful information may be obtained this way.  It
> -	  should be also considered to use <literal>-F</literal>
> -	  argument, which will start the &man.hastd.8; daemon in
> +	  lot of useful information may be obtained this way.  Consider
> +	  also using the <literal>-F</literal>
> +	  argument, which starts the &man.hastd.8; daemon in the
> 	  foreground.</para>
>      </sect3>
>
>       <sect3 id="disks-hast-sb">
> 	<title>Recovering from the Split-brain Condition</title>
>
> -	<para>The consequence of a situation when both nodes of the
> -	  cluster are not able to communicate with each other and both
> -	  are configured as primary nodes is called
> -	  <literal>split-brain</literal>.  This is a dangerous
> +	<para><literal>Split-brain</literal> is when the nodes of the
> +	  cluster are unable to communicate with each other, and both
> +	  are configured as primary.  This is a dangerous
> 	  condition because it allows both nodes to make incompatible
> -	  changes to the data.  This situation has to be handled by
> -	  the system administrator manually.</para>
> +	  changes to the data.  This problem must be corrected
> +	  manually by the system administrator.</para>
>
> -	<para>In order to fix this situation the administrator has to
> +	<para>The administrator must
> 	  decide which node has more important changes (or merge them
> -	  manually) and let the <acronym>HAST</acronym> perform
> +	  manually) and let <acronym>HAST</acronym> perform
> 	  the full synchronization of the node which has the broken

Just "full synchronization" (without "the"), I think.

Thanks for spotting these grammar rough edges and putting together a 
patch!

-Ben Kaduk

> -	  data.  To do this, issue the following commands on the node
> +	  data.  To do this, issue these commands on the node
> 	  which needs to be resynchronized:</para>
>
>         <screen>&prompt.root; <userinput>hastctl role init <resource></userinput>
>
>