svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Conrad Meyer
cem at freebsd.org
Tue Jan 31 18:48:17 UTC 2017
On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde at optusnet.com.au> wrote:
> On Tue, 31 Jan 2017, Bruce Evans wrote:
> Unrolling (or not) may be helpful or harmful for entry and exit code.
Helpful, per my earlier benchmarks.
> I
> think there should be no alignment on entry -- just assume the buffer is
> aligned in the usual case, and only run 4% slower when it is misaligned.
Please write such a patch and demonstrate the improvement.
> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
> bytes or up to 3 times that like simpler algorithms. 768 is quite
> large, and the exit code is quite slow. It reduces 8 or 4 bytes at a
> time without any dependency reduction, and then 1 byte at a time.
Yes, this is the important loop to unroll for small inputs. Somehow
with the unrolling, it is only ~19% slower than the by-3 algorithm on
my system — not 66%. Clang 3.9.1 unrolls both of these trailing
loops; here is the first:
0x0000000000401b88 <+584>: cmp $0x38,%rbx
0x0000000000401b8c <+588>: jae 0x401b93 <sse42_crc32c+595>
0x0000000000401b8e <+590>: mov %rsi,%rdx
0x0000000000401b91 <+593>: jmp 0x401be1 <sse42_crc32c+673>
0x0000000000401b93 <+595>: lea -0x1(%rdi),%rbx
0x0000000000401b97 <+599>: sub %rdx,%rbx
0x0000000000401b9a <+602>: mov %rsi,%rdx
0x0000000000401b9d <+605>: nopl (%rax)
0x0000000000401ba0 <+608>: crc32q (%rdx),%rax
0x0000000000401ba6 <+614>: crc32q 0x8(%rdx),%rax
0x0000000000401bad <+621>: crc32q 0x10(%rdx),%rax
0x0000000000401bb4 <+628>: crc32q 0x18(%rdx),%rax
0x0000000000401bbb <+635>: crc32q 0x20(%rdx),%rax
0x0000000000401bc2 <+642>: crc32q 0x28(%rdx),%rax
0x0000000000401bc9 <+649>: crc32q 0x30(%rdx),%rax
0x0000000000401bd0 <+656>: crc32q 0x38(%rdx),%rax
0x0000000000401bd7 <+663>: add $0x40,%rdx
0x0000000000401bdb <+667>: add $0x8,%rbx
0x0000000000401bdf <+671>: jne 0x401ba0 <sse42_crc32c+608>
> I
> don't understand the algorithm for joining crcs -- why doesn't it work
> to reduce to 12 or 24 bytes in the main loop?
It would, but I haven't implemented or tested that. You're welcome to
do so and demonstrate an improvement. It does add more lookup table
bloat, but perhaps we could just remove the 3x8k table — I'm not sure
it adds any benefit over the 3x256 table.
> Your benchmarks mainly give results for the <= 768 bytes where most of
> the manual optimizations don't apply.
0x000400: asm:68 intrins:62 multitable:684 (ns per buf)
0x000800: asm:132 intrins:133 (ns per buf)
0x002000: asm:449 intrins:446 (ns per buf)
0x008000: asm:1501 intrins:1497 (ns per buf)
0x020000: asm:5618 intrins:5609 (ns per buf)
(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)
> Compiler optimizations are more
> likely to help there. So I looked more closely at the last 2 loops.
> clang indeed only unrolls the last one,
Not in 3.9.1.
> only for the unreachable case
> with more than 8 bytes on amd64.
How is it unreachable?
Best,
Conrad
More information about the svn-src-all mailing list