Assembly string functions in i386 libc

Wed Jul 11 23:46:30 UTC 2007

On Thu, 12 Jul 2007, Peter Jeremy wrote:

> On 2007-Jul-11 15:24:01 -0500, "Sean C. Farley" <scf at freebsd.org> wrote:
>> libc compared to the version I was writing.  After more testing, I
>> found it was only the assembly version that is really slow.  The C
>> version is fairly quick.  Is there a need to continue to use the
>> assembly versions of string functions on i386?  Does it mainly help
>> slower systems such as those with i386 or i486 CPUs?
>
> The performance of string instructions has varied wildly across
> various x86 implementations.  Definitely, for short strings, the
> overhead in initialising the various registers outweighs any actual
> difference in loop performance.  For any recent CPU, the location of
> the string in the memory hierarchy far outweighs implementation
> issues.  bde@ has done various testing in the past and posted results.
>
> Some comments:
> - comparing the strlen() in a shared libc with a statically linked one
>   is unfair - especially on the i386.

I had been testing with strlen.S linked into the test program, but the
results were the same (at least for me) as linking against libc.

> - Your results don't include non-aligned inputs

I ran the tests again, this time starting one byte into the given
string.  The results are in the results-non-aligned directory.  The
string passed to the program was always one byte longer than before so
that the aligned and non-aligned results match up.
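For anyone who wants to reproduce the non-aligned case without my test
program, here is a minimal sketch of the idea.  The helper name
make_unaligned_string() is made up for illustration and is not in the
test code; it only assumes that malloc() returns suitably aligned
memory, so that skipping one byte yields a misaligned start.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: allocate a string one byte longer than `len`
 * and return a pointer one byte into it, so the caller sees a string
 * of exactly `len` characters that starts at a misaligned address.
 * The aligned base pointer is stored in *base_out so it can be freed.
 */
static char *make_unaligned_string(size_t len, char **base_out)
{
    char *base = malloc(len + 2);   /* 1 skipped byte + len chars + NUL */

    if (base == NULL)
        return NULL;
    memset(base, 'x', len + 1);     /* fill the skipped byte and the text */
    base[len + 1] = '\0';
    *base_out = base;
    return base + 1;                /* skip to the next byte */
}
```

The returned pointer is passed to the strlen() under test, and the
caller frees *base_out afterwards.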

> - Your results don't include non-power-of-2 lengths

I have tested various lengths.  The Makefile in the main directory
shows the other values I have tried.  I can post more results,
including runs with the assembly file compiled directly into the
program.

>> I would appreciate it if anyone could see if strlen and strlen2
>> perform any better on an amd64.  Although the current C version of
>> strlen() in 7-CURRENT is faster than mine for smaller values, they
>> perform better for larger strings.
>
> I've tested on:
> FreeBSD 6.2-STABLE #28: Fri Jun 22 11:44:13 EST 2007
>    root at turion.vk2pj.dyndns.org:/usr/obj/usr/src/sys/turion
> CPU: AMD Turion(tm) 64 Mobile ML-40                  (2194.52-MHz K8-class CPU)
>  Origin = "AuthenticAMD"  Id = 0x20f42  Stepping = 2
>  Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
>  Features2=0x1<SSE3>
>  AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow!+,3DNow!>
>  AMD Features2=0x1<LAHF>
>
> There is no asm strlen so libcstrlen and basestrlen should be
> identical (and disassembling [x]strlen() shows that the code _is_
> identical) but there are significant differences for short strings and
> measurable differences for all lengths except 32 bytes.  This
> indicates that your program is not able to accurately compare strlen()
> performance.

I am not sure I understand.  The 32-byte test results show a measurable
difference in your output and mine.

I just switched the program from gettimeofday() to getrusage().  This
should give more accurate results for 32 bytes and for the 4- and 8-byte
tests below.
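As a rough illustration of the difference, the sketch below times a
function by user CPU time from getrusage() rather than by wall-clock
time.  The names user_cpu_usec() and time_strlen() are made up here and
this is not the code from the test program; treat it as one possible
way to do it.

```c
#include <stddef.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

/* User CPU time used by this process so far, in microseconds (-1 on error). */
static long long user_cpu_usec(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return (long long)ru.ru_utime.tv_sec * 1000000LL + ru.ru_utime.tv_usec;
}

/*
 * Call fn(s) `iters` times and return the user CPU microseconds spent.
 * The volatile function pointer and volatile sink are there to keep the
 * compiler from hoisting or discarding the calls.
 */
static long long time_strlen(size_t (*fn)(const char *), const char *s,
    long iters)
{
    size_t (*volatile call)(const char *) = fn;
    volatile size_t sink = 0;
    long long start, end;
    long i;

    start = user_cpu_usec();
    for (i = 0; i < iters; i++)
        sink += call(s);
    end = user_cpu_usec();
    (void)sink;
    return end - start;
}
```

Unlike gettimeofday(), this does not count time the process spent
waiting to be scheduled, though the resolution of ru_utime depends on
the system.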

> I've tried statically linking all the test programs and this removes
> the libcstrlen/basestrlen differences.  The very poor results for 4
> and 8 byte strings are unexpected but (as expected), your unrolled
> strlen() implementations behave better for longer strings.
>
> The attached results all reflect your code with '-static' added to
> every gcc/link step.

I redid my tests with everything compiled statically.  Also, getrusage()
was used instead of gettimeofday().
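For readers who have not looked at the test code, the sketch below
shows what "unrolled" means in this context.  It is not the strlen or
strlen2 from my test program, only a simple illustration with a
made-up name: four bytes are checked per loop iteration, in order, so
it never reads past the terminating NUL.

```c
#include <stddef.h>

/*
 * Illustrative 4x-unrolled byte-at-a-time strlen.  Each iteration tests
 * four consecutive bytes in order and returns at the first NUL found.
 */
static size_t strlen_unrolled4(const char *s)
{
    const char *p = s;

    for (;;) {
        if (p[0] == '\0')
            return (size_t)(p - s);
        if (p[1] == '\0')
            return (size_t)(p - s) + 1;
        if (p[2] == '\0')
            return (size_t)(p - s) + 2;
        if (p[3] == '\0')
            return (size_t)(p - s) + 3;
        p += 4;
    }
}
```

Unrolling cuts loop overhead per byte, which is why this style tends to
pay off on longer strings rather than on 4- or 8-byte ones.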

Sean
-- 
scf at FreeBSD.org