HAST instability

Mikolaj Golub trociny at freebsd.org
Mon May 30 18:42:28 UTC 2011

On Mon, 30 May 2011 17:43:04 +0300 Daniel Kalchev wrote:

 DK> Some further investigation:

 DK> The HAST nodes do not disconnect when checksum is enabled (either
 DK> crc32 or sha256).

 DK> One strange thing is that there is never established TCP connection
 DK> between both nodes:

 DK> tcp4       0      0       FIN_WAIT_2
 DK> tcp4       0   1288       CLOSE_WAIT
 DK> tcp4       0      0       FIN_WAIT_2
 DK> tcp4       0  90648       CLOSE_WAIT
 DK> tcp4       0      0       *.*                    LISTEN

This is normal. hastd uses each connection in only one direction, so it calls
shutdown() to close the unused direction.
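
To illustrate the half-close behaviour described above, here is a minimal
sketch in Python (not hastd's actual code, which is C). It assumes a loopback
TCP connection; the function name half_close_demo is made up for the example.
The side that calls shutdown(SHUT_WR) ends up in FIN_WAIT_2 once its FIN is
acknowledged, and its peer sits in CLOSE_WAIT, while data still flows in the
other direction.

```python
import socket


def half_close_demo():
    """Half-close one direction of a TCP connection and keep using the other."""
    # Listening socket on loopback; port 0 lets the OS pick a free port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()

    # The receiver never sends anything, so it closes its outgoing direction.
    # Its end of the connection moves towards FIN_WAIT_2, the peer's to CLOSE_WAIT.
    receiver.shutdown(socket.SHUT_WR)

    # The sender sees end-of-file on its (unused) incoming direction...
    eof = sender.recv(1)

    # ...but can still send in the direction that is in use.
    sender.sendall(b"payload")
    sender.shutdown(socket.SHUT_WR)  # finish, so the receiver can read to EOF

    chunks = []
    while True:
        chunk = receiver.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)

    sender.close()
    receiver.close()
    return eof, b"".join(chunks)


if __name__ == "__main__":
    print(half_close_demo())
```

While such a connection is alive, netstat shows no ESTABLISHED entry for it,
only the FIN_WAIT_2/CLOSE_WAIT pair.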

 DK> When using sha256 one CPU core is 100% utilized by each hastd process,
 DK> while 70-80MB/sec per HAST resource is being transferred (total of up
 DK> to 140 MB/sec traffic for both);

 DK> When using crc32 each CPU core is at 22% utilization;

 DK> When using none as checksum, CPU usage is under 10%

I suppose that when checksum is enabled the bottleneck is the CPU, so the
traffic rate is lower and the problem is not triggered.
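
A rough way to see how much a checksum can limit throughput on a given machine
is to time it over blocks of the size seen in the log (131072 bytes). This is
only a sketch: it uses Python's zlib.crc32 and hashlib.sha256 rather than the
implementations hastd links against, and throughput_mb_s is a name made up for
the example, so the absolute numbers will differ from what hastd achieves.

```python
import hashlib
import time
import zlib


def throughput_mb_s(checksum, block, rounds=200):
    """Return how many MB/s a single core can push through `checksum`."""
    start = time.perf_counter()
    for _ in range(rounds):
        checksum(block)
    # Guard against a zero interval on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(block) * rounds / elapsed / 1e6


# 128 KiB of zeros, matching the WRITE request size in the log.
block = bytes(131072)

crc32_rate = throughput_mb_s(zlib.crc32, block)
sha256_rate = throughput_mb_s(lambda b: hashlib.sha256(b).digest(), block)

if __name__ == "__main__":
    print("crc32:  %.0f MB/s" % crc32_rate)
    print("sha256: %.0f MB/s" % sha256_rate)
```

If the sha256 figure is close to the 70-80MB/sec observed per resource, that
would be consistent with the CPU being the limit.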

 DK> Eventually after many hours, got corrupted communication:

 DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.

The "Hash mismatch" message suggests that you were actually using a checksum
then, weren't you?

 DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive
 DK> request data: No such file or directory.
 DK> May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process
 DK> exited ungracefully (pid=9827, exitcode=75).

 DK> and

 DK> May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive
 DK> reply header: Operation timed out.
 DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from
 DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send
 DK> request (Broken pipe): WRITE(99128470016, 131072).

It looks a little different than in your first message.

Are the clocks in sync on both nodes?

I would like to look at full logs for some rather large period, with several
cases, from both primary and secondary (and be sure about synchronized time).

Also, it might be worth checking that there is no network packet corruption
(look for anything strange in netstat -di and netstat -s, and maybe copy large
files over the network and compare checksums).
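
For the file comparison, something along these lines would do. It is a sketch
in Python with made-up helper names (file_sha256, same_content); running
sha256(1) on both nodes and comparing the output by eye works just as well.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same SHA-256 digest."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Compute the digest of the source file before the copy and of the received file
after it; a mismatch on repeated large transfers would point at corruption
somewhere on the path between the nodes.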

Mikolaj Golub

More information about the freebsd-stable mailing list