Checksum/copy
Bruce Evans
bde at zeta.org.au
Thu Mar 27 23:44:32 PST 2003
On Thu, 27 Mar 2003, Dag-Erling Smørgrav wrote:
> David Malone <dwmalone at maths.tcd.ie> writes:
> > On Thu, Mar 27, 2003 at 09:57:35AM +0100, des at ofug.org wrote:
> > > Might it be a good idea to have separate b{copy,zero} implementations
> > > for special purposes like pmap_{copy,zero}_page?
> > We do have a i686_pagezero already, which seems to be used in
> > pmap_zero_page - I guess it may not be well tuned to modern processors,
> > as it is almost 5 years old.
Indeed.
> i686_pagezero uses 'rep stosl' after an initial 'rep scasl' to check
> if the page was already zero (which is a pessimization unless we zero
> a lot of pages that are already zeroed). SSE can do far better than
> that.
Even integer instructions can do significantly better than scasl/stosl
on "686"s (PentiumPro and similar CPUs). Implementation bugs in
i686_pagezero() include:
- scasl is one of the slowest ways to read memory, at least on old
Celerons (I believe PPro's have similar timing for string operations).
It is a bit slower than lodsl, which is about 3.5 times slower than
a lightly unrolled movl loop for the L1-cached case and about 2 times
slower for the uncached case.
- the code apparently intends to check 16 words at a time, but due to
getting a comparison backwards it actually zeros everything else as
soon as it finds a nonzero word, with extra obfuscations and
pessimizations when it is within 16 words of the end.
Implementation non-bugs include using stosl to do the zeroing. Unlike
lodsl and scasl, stosl is actually useful for its intended purpose on
"686"s.
Instead of fixing the comparison and any other logic bugs, I rewrote the
function using orl instead of scasl, and simpler logic (ignore the changes
for the previous function in the same hunk).
%%%
Index: support.s
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
retrieving revision 1.93
diff -u -2 -r1.93 support.s
--- support.s 22 Sep 2002 04:45:20 -0000 1.93
+++ support.s 22 Sep 2002 09:51:27 -0000
@@ -333,70 +337,58 @@
movl %edx,%edi
xorl %eax,%eax
- shrl $2,%ecx
cld
+ shrl $2,%ecx
rep
stosl
movl 12(%esp),%ecx
andl $3,%ecx
- jne 1f
- popl %edi
- ret
-
-1:
+ je 1f
rep
stosb
+1:
popl %edi
ret
-#endif /* I586_CPU && defined(DEV_NPX) */
+#endif /* I586_CPU && DEV_NPX */
+#ifdef I686_CPU
ENTRY(i686_pagezero)
- pushl %edi
- pushl %ebx
-
- movl 12(%esp), %edi
+ movl 4(%esp), %edx
movl $1024, %ecx
- cld
ALIGN_TEXT
1:
- xorl %eax, %eax
- repe
- scasl
- jnz 2f
+ movl (%edx), %eax
+ orl 4(%edx), %eax
+ orl 8(%edx), %eax
+ orl 12(%edx), %eax
+ orl 16(%edx), %eax
+ orl 20(%edx), %eax
+ orl 24(%edx), %eax
+ orl 28(%edx), %eax
+ jne 2f
+
+ addl $32, %edx
+ subl $32/4, %ecx
+ jne 1b
- popl %ebx
- popl %edi
ret
ALIGN_TEXT
-
2:
- incl %ecx
- subl $4, %edi
+ movl $0, (%edx)
+ movl $0, 4(%edx)
+ movl $0, 8(%edx)
+ movl $0, 12(%edx)
+ movl $0, 16(%edx)
+ movl $0, 20(%edx)
+ movl $0, 24(%edx)
+ movl $0, 28(%edx)
+
+ addl $32, %edx
+ subl $32/4, %ecx
+ jne 1b
- movl %ecx, %edx
- cmpl $16, %ecx
-
- jge 3f
-
- movl %edi, %ebx
- andl $0x3f, %ebx
- shrl %ebx
- shrl %ebx
- movl $16, %ecx
- subl %ebx, %ecx
-
-3:
- subl %ecx, %edx
- rep
- stosl
-
- movl %edx, %ecx
- testl %edx, %edx
- jnz 1b
-
- popl %ebx
- popl %edi
ret
+#endif /* I686_CPU */
/* fillw(pat, base, cnt) */
%%%
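For readers who find C easier to follow than the assembly, the logic of
the rewritten i686_pagezero() can be paraphrased roughly as below. This
is only a sketch and not part of the patch: the name pagezero_sketch, the
two constants, and the returned count of written blocks are hypothetical
and exist only to make the behaviour easy to check. A compiler is free to
generate code quite different from the hand-written orl/movl sequences,
so this says nothing about timing.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS  1024  /* 4096-byte page in 32-bit words, as in the patch */
#define BLOCK_WORDS 8     /* 32 bytes examined per iteration, as in the patch */

/*
 * Hypothetical C paraphrase of the patched routine.  Each 32-byte block
 * is read by OR-ing its 8 words together; the block is written only if
 * some word was nonzero.  The return value (number of blocks written)
 * is an addition for illustration; the assembly version returns nothing.
 */
static unsigned
pagezero_sketch(uint32_t *page)
{
	unsigned written = 0;
	size_t i, j;

	for (i = 0; i < PAGE_WORDS; i += BLOCK_WORDS) {
		uint32_t *p = page + i;
		uint32_t acc;

		/* Corresponds to the movl followed by seven orl's. */
		acc = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
		if (acc == 0)
			continue;	/* block already zero: skip the stores */

		/* Corresponds to the eight "movl $0" stores at label 2. */
		for (j = 0; j < BLOCK_WORDS; j++)
			p[j] = 0;
		written++;
	}
	return (written);
}
```

Unlike the old code, a nonzero word only causes its own 32-byte block to
be written; checking resumes with the next block.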
Implementation notes: using orl might not be best (due to pipelining issues).
Using movl instead of stosl might not be best (I used it to simplify the
logic and reduce initialization overheads).
This hasn't been tested recently. I've had it disabled in pmap.c for
as long as I can remember, to prepare for complete testing (my pmap.c
just uses bzero()).
The importance of optimizing this function can be gauged by the number of
people who have noticed that it never worked right and the number of
commits to make it work right.
Zeroing pages is not completely unimportant, however. The pagezero task
takes about 5% of the time for a makeworld here. Most of this time is
"free" here since pagezero can run while the system is waiting for disks,
and I don't run much else while doing makeworld benchmarks. However, it
is not free time under different/heavier loads.
Bruce