Assembly string functions in i386 libc
Sean C. Farley
scf at FreeBSD.org
Wed Jul 11 23:46:30 UTC 2007
On Thu, 12 Jul 2007, Peter Jeremy wrote:
> On 2007-Jul-11 15:24:01 -0500, "Sean C. Farley" <scf at freebsd.org> wrote:
>> libc compared to the version I was writing. After more testing, I
>> found it was only the assembly version that is really slow. The C
>> version is fairly quick. Is there a need to continue to use the
>> assembly versions of string functions on i386? Does it mainly help
>> slower systems such as those with i386 or i486 CPUs?
>
> The performance of string instructions has varied wildly across
> various x86 implementations. Definitely, for short strings, the
> overhead in initialising the various registers outweighs any actual
> difference in loop performance. For any recent CPU, the location of
> the string in the memory hierarchy far outweighs implementation
> issues. bde@ has done various testing in the past and posted results.
>
> Some comments:
> - comparing the strlen() in a shared libc with a statically linked one
> is unfair - especially on the i386.
I had been testing with strlen.S linked into the test program, but the
results were the same (at least for me) as linking against libc.
> - Your results don't include non-aligned inputs
I ran the tests again, this time starting at the second byte of each
string. The results are in a results-non-aligned directory. Each string
given to the program was one byte longer than before so that the
measured lengths match between the aligned and non-aligned runs.
> - Your results don't include non-power-of-2 lengths
I have tested various lengths. The Makefile in the main directory
shows the other values I have tried. I can post more results,
including ones with the assembly file compiled directly into the program.
>> I would appreciate it if anyone could see if strlen and strlen2
>> perform any better on an amd64. Although the current C version of
>> strlen() in 7-CURRENT is faster than mine for smaller values, they
>> perform better for larger strings.
>
> I've tested on:
> FreeBSD 6.2-STABLE #28: Fri Jun 22 11:44:13 EST 2007
> root at turion.vk2pj.dyndns.org:/usr/obj/usr/src/sys/turion
> CPU: AMD Turion(tm) 64 Mobile ML-40 (2194.52-MHz K8-class CPU)
> Origin = "AuthenticAMD" Id = 0x20f42 Stepping = 2
> Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
> Features2=0x1<SSE3>
> AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow!+,3DNow!>
> AMD Features2=0x1<LAHF>
>
> There is no asm strlen so libcstrlen and basestrlen should be
> identical (and disassembling [x]strlen() shows that the code _is_
> identical) but there are significant differences for short strings and
> measurable differences for all lengths except 32 bytes. This
> indicates that your program is not able to accurately compare strlen()
> performance.
I am not sure I understand. The 32-byte test results show a measurable
difference in your output and mine.
I just switched the program from gettimeofday() to getrusage(). This
should show more accurate results for 32 bytes and the 4- and 8-byte
tests below.
> I've tried statically linking all the test programs and this removes
> the libcstrlen/basestrlen differences. The very poor results for 4
> and 8 byte strings are unexpected but (as expected), your unrolled
> strlen() implementations behave better for longer strings.
>
> The attached results all reflect your code with '-static' added to
> every gcc/link step.
I redid my tests with everything compiled statically. Also, getrusage()
was used instead of gettimeofday().
Sean
--
scf at FreeBSD.org
More information about the freebsd-arch mailing list