svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Conrad Meyer
cem at freebsd.org
Tue Jan 31 18:48:17 UTC 2017
On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde at optusnet.com.au> wrote:
> On Tue, 31 Jan 2017, Bruce Evans wrote:
> Unrolling (or not) may be helpful or harmful for entry and exit code.
Helpful, per my earlier benchmarks.
> I
> think there should be no alignment on entry -- just assume the buffer is
> aligned in the usual case, and only run 4% slower when it is misaligned.
Please write such a patch and demonstrate the improvement.
> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
> bytes or up to 3 times that like simpler algorithms. 768 is quite
> large, and the exit code is quite slow. It reduces 8 or 4 bytes at a
> time without any dependency reduction, and then 1 byte at a time.
Yes, this is the important loop to unroll for small inputs. Somehow
with the unrolling, it is only ~19% slower than the by-3 algorithm on
my system — not 66%. Clang 3.9.1 unrolls both of these trailing
loops; here is the first:
0x0000000000401b88 <+584>: cmp $0x38,%rbx
0x0000000000401b8c <+588>: jae 0x401b93 <sse42_crc32c+595>
0x0000000000401b8e <+590>: mov %rsi,%rdx
0x0000000000401b91 <+593>: jmp 0x401be1 <sse42_crc32c+673>
0x0000000000401b93 <+595>: lea -0x1(%rdi),%rbx
0x0000000000401b97 <+599>: sub %rdx,%rbx
0x0000000000401b9a <+602>: mov %rsi,%rdx
0x0000000000401b9d <+605>: nopl (%rax)
0x0000000000401ba0 <+608>: crc32q (%rdx),%rax
0x0000000000401ba6 <+614>: crc32q 0x8(%rdx),%rax
0x0000000000401bad <+621>: crc32q 0x10(%rdx),%rax
0x0000000000401bb4 <+628>: crc32q 0x18(%rdx),%rax
0x0000000000401bbb <+635>: crc32q 0x20(%rdx),%rax
0x0000000000401bc2 <+642>: crc32q 0x28(%rdx),%rax
0x0000000000401bc9 <+649>: crc32q 0x30(%rdx),%rax
0x0000000000401bd0 <+656>: crc32q 0x38(%rdx),%rax
0x0000000000401bd7 <+663>: add $0x40,%rdx
0x0000000000401bdb <+667>: add $0x8,%rbx
0x0000000000401bdf <+671>: jne 0x401ba0 <sse42_crc32c+608>
> I
> don't understand the algorithm for joining crcs -- why doesn't it work
> to reduce to 12 or 24 bytes in the main loop?
It would, but I haven't implemented or tested that. You're welcome to
do so and demonstrate an improvement. It does add more lookup table
bloat, but perhaps we could just remove the 3x8k table — I'm not sure
it adds any benefit over the 3x256 table.
> Your benchmarks mainly give results for the <= 768 bytes where most of
> the manual optimizations don't apply.
0x000400: asm:68 intrins:62 multitable:684 (ns per buf)
0x000800: asm:132 intrins:133 (ns per buf)
0x002000: asm:449 intrins:446 (ns per buf)
0x008000: asm:1501 intrins:1497 (ns per buf)
0x020000: asm:5618 intrins:5609 (ns per buf)
(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)
> Compiler optimizations are more
> likely to help there. So I looked more closely at the last 2 loops.
> clang indeed only unrolls the last one,
Not in 3.9.1.
> only for the unreachable case
> with more than 8 bytes on amd64.
How is it unreachable?
Best,
Conrad
More information about the svn-src-all mailing list