Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)
Bruce Evans
bde at zeta.org.au
Sun Mar 30 22:31:55 PST 2003
On Fri, 28 Mar 2003, Peter Jeremy wrote:
> On Fri, Mar 28, 2003 at 05:04:21PM +1100, Bruce Evans wrote:
> >"i686" basically means "second generation Pentium" (PentiumPro/PII/Celeron)
> >(later x86's are mostly handled better using CPU features instead of
> >a 1-dimensional class number). Hand-"optimized" bzero's are especially
> >pessimal for this class of CPU.
>
> That matches my memory of my test results as well. The increasing
> clock multipliers mean that it doesn't matter how slow "rep stosl" is
> in clock cycle terms - main memory is always going to be slower.
There are still some surprising differences (see attached timings for
some examples), but I think they are more for how the code affects
caches and write buffers. The exact behaviour is very machine-dependent
so it is hard to optimize in general-purpose production code.
> >Benefits from SSE for bzeroing and bcopying, if any, would probably
> >come more from bypassing caches and/or not doing read-before-write
> >(SSE instructions give control over this) than from operating on wider
> >data. I'm dubious about practical benefits. Obviously it is not useful
> >to bust the cache when bzeroing 8MB of data, but real programs and OS's
> >mostly operate on smaller buffers. It is negatively useful not to put
> >bzero'ed data in the (L[1-2]) cache if the data will be used soon, and
> >generally hard to predict if it will be used soon.
>
> Unless Intel have fixed the P4 caches, you definitely don't want to
> use the L1 cache for page sized bzero/bcopy.
Athlons have many similarities to Celerons here.
> Avoiding read-before-write should roughly double bzero speed and give
> you about 50% speedup on bcopy - this should be worthwhile. Caching
It actually gives a 66% speedup for bzero on my AthlonXP. For some
reason, at least for very large buffers, read accesses through the
cache can use only 1/2 of the memory bandwidth, and write accesses can
use only 1/3 of it (and this is after tuning for bank organization --
I get a 33% speedup for the write benchmark and 0% for real work by
including a bit for the bank number in the page color in a deterministic
way, and almost as much for including the bit in a random way). Using
SSE instructions (mainly movntps) gives the full bandwidth for at
least bzero for large buffers (3x better), but it reduces bandwidth
for small already cached buffers (more than 3x worse):
%%%
Times on an AthlonXP-1600 overclocked by 146/133, with 1024MB of PC2700
memory and all memory timings tuned as low as possible (CAS2, but 2T cmds):
4K buffer (almost always cached):
zero0: 5885206293 B/s (6959824 us) (stosl)
zero1: 7842053086 B/s (5223122 us) (unroll 16)
zero2: 7049051312 B/s (5810711 us) (unroll 16 preallocate)
zero3: 9377720907 B/s (4367799 us) (unroll 32)
zero4: 7803040290 B/s (5249236 us) (unroll 32 preallocate)
zero5: 9802682719 B/s (4178448 us) (unroll 64)
zero6: 8432350664 B/s (4857483 us) (unroll 64 preallocate)
zero7: 5957318200 B/s (6875577 us) (fstl)
zero8: 3007928933 B/s (13617343 us) (movl)
zero9: 4011348905 B/s (10211029 us) (unroll 8)
zeroA: 5835984056 B/s (7018525 us) (generic_bzero)
zeroB: 8334888325 B/s (4914283 us) (i486_bzero)
zeroC: 2545022700 B/s (16094159 us) (i586_bzero)
zeroD: 7650723550 B/s (5353742 us) (i686_pagezero)
zeroE: 5755535593 B/s (7116627 us) (bzero (stosl))
zeroF: 2282741753 B/s (17943335 us) (movntps)
movntps is the SSE method. It's significantly slower for this case.
400MB buffer (never cached):
zero0: 714045391 B/s ( 573633 us) (stosl)
zero1: 705180737 B/s ( 580844 us) (unroll 16)
zero2: 670897998 B/s ( 610525 us) (unroll 16 preallocate)
zero3: 690538809 B/s ( 593160 us) (unroll 32)
zero4: 661854647 B/s ( 618867 us) (unroll 32 preallocate)
zero5: 670525682 B/s ( 610864 us) (unroll 64)
zero6: 663334877 B/s ( 617486 us) (unroll 64 preallocate)
zero7: 781025057 B/s ( 524439 us) (fstl)
zero8: 608491547 B/s ( 673140 us) (movl)
zero9: 696489665 B/s ( 588092 us) (unroll 8)
zeroA: 713958268 B/s ( 573703 us) (generic_bzero)
zeroB: 689875870 B/s ( 593730 us) (i486_bzero)
zeroC: 721477338 B/s ( 567724 us) (i586_bzero)
zeroD: 746453616 B/s ( 548728 us) (i686_pagezero)
zeroE: 714016763 B/s ( 573656 us) (bzero (stosl))
zeroF: 2240602162 B/s ( 182808 us) (movntps)
Now movntps is about 3 times faster than everything else. This is the
first time I've seen a bandwidth number near 2100 for memory named with a
magic number near 2100. This machine used to use PC2100 memory with the
same timings, but it developed errors (burnt out?). Now it has PC2700
memory, so it is within spec and can reasonably be expected to run a little
faster than PC2100 should.
%%%
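The movntps technique above can be sketched in userland with the SSE
intrinsics (the kernel patch further below does the same thing in raw
assembly, plus FP-state juggling that doesn't apply here). sse_bzero is
my name for the sketch, not anything from the tree; the intrinsics
(_mm_stream_ps compiles to movntps, _mm_sfence to sfence) are the real
ones:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>	/* SSE: _mm_stream_ps (movntps), _mm_sfence */

/*
 * Zero a 16-byte-aligned buffer with non-temporal stores, bypassing
 * the caches entirely.  A big win for large, uncached buffers; a loss
 * for small, already-cached buffers (see the timings above).
 */
static void
sse_bzero(void *buf, size_t len)
{
	__m128 zero = _mm_setzero_ps();
	float *p = buf;

	assert(((uintptr_t)buf & 15) == 0 && (len & 63) == 0);
	for (size_t i = 0; i < len; i += 64) {
		_mm_stream_ps(p + 0, zero);	/* movntps 0(%edx) */
		_mm_stream_ps(p + 4, zero);	/* movntps 16(%edx) */
		_mm_stream_ps(p + 8, zero);	/* movntps 32(%edx) */
		_mm_stream_ps(p + 12, zero);	/* movntps 48(%edx) */
		p += 16;
	}
	_mm_sfence();	/* order the NT stores before later loads */
}
```

The sfence at the end matters: non-temporal stores are weakly ordered,
so without it a subsequent reader could see stale data.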
> is more dubious - placing a slow-zeroed page in L1 cache is very
> probably a waste of time. At least part of an on-demand zeroed page
> is likely to be used in the near future - but probably not all of it.
> Copying is even harder to predict - at least one word of a COW page is
> going to be used immediately, but bcopy() won't be able to tell which
> word.
For makeworld, using movntps in i686_pagezero() gives a whole 14 seconds
(0.7%) improvement:
%%%
Before:
bde-current with ... + KSEIII + idlezero_enable + pmap - even coloring
async mounted /c
my-Makefile
after perl removal and new gcc and ufs2 and aout utilities removal
with 2 fairly new drives
1532 MHz AthlonXP 1600
1024MB
make catches SIGCHLD
i686_bzero not used
--------------------------------------------------------------
>>> elf make world completed on Mon Mar 31 02:10:47 EST 2003
(started on Mon Mar 31 01:38:24 EST 2003)
--------------------------------------------------------------
1943.14 real 1575.25 user 218.88 sys
40204 maximum resident set size
2166 average shared memory size
1988 average unshared data size
128 average unshared stack size
13039568 page reclaims
11639 page faults
0 swaps
20008 block input operations
6265 block output operations
0 messages sent
0 messages received
33037 signals received
207588 voluntary context switches
518358 involuntary context switches
After:
bde-current with ... + KSEIII + idlezero_enable + pmap - even coloring
async mounted /c
my-Makefile
after perl removal and new gcc and ufs2 and aout utilities removal
with 2 fairly new drives
1532 MHz AthlonXP 1600
1024MB
make catches SIGCHLD
i686_bzero used and replaced by one that uses SSE (movntps)
--------------------------------------------------------------
>>> elf make world completed on Mon Mar 31 02:46:43 EST 2003
(started on Mon Mar 31 02:14:35 EST 2003)
--------------------------------------------------------------
1929.02 real 1576.67 user 205.30 sys
40204 maximum resident set size
2166 average shared memory size
1990 average unshared data size
128 average unshared stack size
13039590 page reclaims
11645 page faults
0 swaps
20014 block input operations
6416 block output operations
0 messages sent
0 messages received
33037 signals received
208376 voluntary context switches
512820 involuntary context switches
%%%
Whether 14 seconds is a lot depends on your viewpoint. It is a lot
out of the kernel time of 218 seconds considering that only one function
was optimized and some of the optimization doesn't affect the real
time since it is done at idle priority in pagezero. pagezero's time
was reduced from 57 seconds to 28 seconds.
Code for the above (no warranties; only works for !SMP and I didn't
check that the FP context switching is safe...):
%%%
Index: support.s
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
retrieving revision 1.93
diff -u -2 -r1.93 support.s
--- support.s 22 Sep 2002 04:45:20 -0000 1.93
+++ support.s 31 Mar 2003 02:37:02 -0000
@@ -66,4 +68,9 @@
.space 3
#endif
+#define HACKISH_SSE_PAGEZERO
+#ifdef HACKISH_SSE_PAGEZERO
+zero:
+ .long 0, 0, 0, 0
+#endif
.text
@@ -333,70 +342,101 @@
movl %edx,%edi
xorl %eax,%eax
- shrl $2,%ecx
cld
+ shrl $2,%ecx
rep
stosl
movl 12(%esp),%ecx
andl $3,%ecx
- jne 1f
- popl %edi
- ret
-
-1:
+ je 1f
rep
stosb
+1:
popl %edi
ret
-#endif /* I586_CPU && defined(DEV_NPX) */
+#endif /* I586_CPU && DEV_NPX */
+#ifdef I686_CPU
ENTRY(i686_pagezero)
- pushl %edi
- pushl %ebx
+ movl 4(%esp),%edx
+ movl $PAGE_SIZE, %ecx
- movl 12(%esp), %edi
- movl $1024, %ecx
- cld
+#ifdef HACKISH_SSE_PAGEZERO
+ pushfl
+ cli
+ movl %cr0,%eax
+ clts
+ subl $16,%esp
+ movups %xmm0,(%esp)
+ movups zero,%xmm0
+ ALIGN_TEXT
+1:
+ movntps %xmm0,(%edx)
+ movntps %xmm0,16(%edx)
+ movntps %xmm0,32(%edx)
+ movntps %xmm0,48(%edx)
+ addl $64,%edx
+ subl $64,%ecx
+ jne 1b
+ movups (%esp),%xmm0
+ addl $16,%esp
+ movl %eax,%cr0
+ popfl
+ ret
+2:
+#endif /* HACKISH_SSE_PAGEZERO */
ALIGN_TEXT
1:
- xorl %eax, %eax
- repe
- scasl
- jnz 2f
+ movl (%edx), %eax
+ orl 4(%edx), %eax
+ orl 8(%edx), %eax
+ orl 12(%edx), %eax
+ orl 16(%edx), %eax
+ orl 20(%edx), %eax
+ orl 24(%edx), %eax
+ orl 28(%edx), %eax
+ jne 2f
+ movl 32(%edx), %eax
+ orl 36(%edx), %eax
+ orl 40(%edx), %eax
+ orl 44(%edx), %eax
+ orl 48(%edx), %eax
+ orl 52(%edx), %eax
+ orl 56(%edx), %eax
+ orl 60(%edx), %eax
+ jne 3f
+
+ addl $64, %edx
+ subl $64, %ecx
+ jne 1b
- popl %ebx
- popl %edi
ret
ALIGN_TEXT
-
2:
- incl %ecx
- subl $4, %edi
-
- movl %ecx, %edx
- cmpl $16, %ecx
-
- jge 3f
-
- movl %edi, %ebx
- andl $0x3f, %ebx
- shrl %ebx
- shrl %ebx
- movl $16, %ecx
- subl %ebx, %ecx
-
+ movl $0, (%edx)
+ movl $0, 4(%edx)
+ movl $0, 8(%edx)
+ movl $0, 12(%edx)
+ movl $0, 16(%edx)
+ movl $0, 20(%edx)
+ movl $0, 24(%edx)
+ movl $0, 28(%edx)
3:
- subl %ecx, %edx
- rep
- stosl
-
- movl %edx, %ecx
- testl %edx, %edx
- jnz 1b
+ movl $0, 32(%edx)
+ movl $0, 36(%edx)
+ movl $0, 40(%edx)
+ movl $0, 44(%edx)
+ movl $0, 48(%edx)
+ movl $0, 52(%edx)
+ movl $0, 56(%edx)
+ movl $0, 60(%edx)
+
+ addl $64, %edx
+ subl $64, %ecx
+ jne 1b
- popl %ebx
- popl %edi
ret
+#endif /* I686_CPU */
/* fillw(pat, base, cnt) */
%%%
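For reference, the non-SSE i686_pagezero path in the patch (the orl
chains) amounts to the following check-before-write loop: scan the page
a chunk at a time and only store to chunks that aren't already zero, so
a mostly-zero page dirties as few cache lines as possible. This is a
userland C sketch under assumed names (pagezero_check is mine), not the
kernel code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/*
 * Zero a page, skipping 32-byte chunks that are already zero.
 * Mirrors the orl/jne chains in the assembly above: OR eight words
 * together and write back only if the result is nonzero.
 */
static void
pagezero_check(void *page)
{
	uint32_t *p = page;

	for (size_t i = 0; i < PAGE_SIZE / sizeof(*p); i += 8) {
		uint32_t acc = 0;

		for (int j = 0; j < 8; j++)	/* the orl chain */
			acc |= p[i + j];
		if (acc != 0)			/* chunk is dirty: zero it */
			memset(&p[i], 0, 8 * sizeof(*p));
	}
}
```

On an already-zeroed page this does only reads, which is exactly the
case idle-priority pre-zeroing makes common.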
> I don't know how much control SSE gives you over caching - is it just
> cache/no-cache, or can you control L1+L2/L2-only/none? In the latter
> case, telling bzero and bcopy destination to use L2-only is probably a
> reasonable compromise. The bcopy source should probably not evict
> cache data - if data is cached, use it, otherwise fetch from main
> memory and bypass caches.
There seems to be control in individual instructions for reads, but only
a complete bypass for writes (movntps from an SSE register to memory).
Writing can still be tuned with explicit reads or prefetches after
writes. I've only looked briefly at 3-year-old Intel manuals.
> Finally, how many different bcopy/bzero variants do we want? A
I don't want many :-).
Bruce