Assembly string functions in i386 libc

Bruce Evans brde at optusnet.com.au
Thu Jul 12 11:32:36 UTC 2007


On Thu, 12 Jul 2007, Bruce Evans wrote:

> On Wed, 11 Jul 2007, Sean C. Farley wrote:
>
>> While looking at increasing the speed of strlen(), I noticed that on
>> i386 platforms (PIII, P4 and Athlon XP) the performance is abysmal in
>> libc compared to the version I was writing.  After more testing, I found
>> it was only the assembly version that is really slow.  The C version is
>> fairly quick.  Is there a need to continue to use the assembly versions
>> of string functions on i386?  Does it mainly help slower systems such as
>> those with i386 or i486 CPU's?
>
> I think you are mistaken about the asm version being slow.  In my tests
> ...

Partly.

>> I have the results from my P4 (Id = 0xf24 Stepping = 4) system and the
>> test program here[1].  strlen.tar.bz2 is the archive of it for anyone's
>> testing.  In the strlen/results subdirectory, there are the results for
>> strings of increasing lengths.
>
> Sorry, I didn't look at this.  I just wrote a quick re-test and ran it

Now I've looked at it.  I think it is not testing strlen() at all, except
for the libc case, because __pure lets the compiler make just 1 call to
strlen() and reuse the result for every later iteration.
(The existence of __pure is also a bug.  __pure was the FreeBSD spelling
of the __const__ attribute in gcc-1.  It was removed when special support
for gcc-1 was dropped, and should not have been recycled.)  __pure is a
syntax error in the old version of FreeBSD that I tested on.  I first
tried __pure2, which is the FreeBSD spelling of the __const__ attribute
in gcc-2.  I think it is weaker than the __pure__ attribute in gcc-3.
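
To illustrate the hoisting problem, here is a minimal sketch of a timing
loop that cannot be collapsed into a single call.  It is not code from the
test archive: byte_strlen, call_repeatedly, strlen_fn and HYPOTHETICAL_PURE
are names made up for this example, and the attribute spelling assumes a
gcc-compatible compiler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical macro for this sketch; not a FreeBSD <sys/cdefs.h> name. */
#if defined(__GNUC__)
#define HYPOTHETICAL_PURE __attribute__((__pure__))
#else
#define HYPOTHETICAL_PURE
#endif

/* Hypothetical function-pointer type for the function under test. */
typedef size_t (*strlen_fn)(const char *);

/*
 * A byte-at-a-time strlen declared pure.  Calling it directly in a loop
 * with an unchanging argument lets the compiler keep only one call.
 */
static size_t byte_strlen(const char *s) HYPOTHETICAL_PURE;

static size_t
byte_strlen(const char *s)
{
	const char *p = s;

	while (*p != '\0')
		p++;
	return ((size_t)(p - s));
}

/*
 * Call fn(s) `iterations` times and return the sum of the results.
 * The function pointer is re-read from a volatile object on every pass,
 * so the compiler cannot tell which function is being called and has to
 * emit every call, whatever attributes the target was declared with.
 */
static size_t
call_repeatedly(strlen_fn fn, const char *s, unsigned long iterations)
{
	strlen_fn volatile vfn = fn;
	size_t total = 0;
	unsigned long i;

	for (i = 0; i < iterations; i++)
		total += vfn(s);
	return (total);
}
```

Timing call_repeatedly(byte_strlen, string, n) instead of a direct loop
should measure n real calls.  The indirect call adds a little overhead of
its own, but it is the same for every implementation being compared.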

After removing __pure* and adding -static -g to CFLAGS, with gcc-3.3.3:

On an old Celeron (400MHz) (all P2's probably behave like this):

%%%
libcstrlen:	time spent executing strlen(string) = 64:	7.786868
basestrlen:	time spent executing strlen(string) = 64:	3.816736
strlen:		time spent executing strlen(string) = 64:	3.364313
strlen2:	time spent executing strlen(string) = 64:	2.662973
%%%

rep scasb is apparently very slow on P2's.

On an A64 in i386 mode:

%%%
libcstrlen:	time spent executing strlen(string) = 64:	0.709657
basestrlen:	time spent executing strlen(string) = 64:	0.691397
strlen:		time spent executing strlen(string) = 64:	0.527339
strlen2:	time spent executing strlen(string) = 64:	0.441090
%%%

Now rep scasb is only slightly slower than the simple C loop (since all
small loops take 2 cycles on AXP and A64...).  strlen and strlen2 are
marginally faster since their loops do more.
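
As a rough illustration of a loop that "does more", compare a simple
byte loop with one unrolled 4 times.  The sources of basestrlen, strlen
and strlen2 are not shown in this message, so the functions below are only
assumed stand-ins with made-up names, not the versions that were timed.

```c
#include <stddef.h>

/* One byte tested per iteration: the simple C loop. */
static size_t
simple_strlen(const char *s)
{
	const char *p = s;

	while (*p != '\0')
		p++;
	return ((size_t)(p - s));
}

/*
 * Four bytes tested per iteration, so the pointer increment and the
 * backward branch are paid once per 4 bytes.  Bytes are checked in order,
 * so nothing past the terminating NUL is ever read.
 */
static size_t
unrolled_strlen(const char *s)
{
	const char *p = s;

	for (;;) {
		if (p[0] == '\0')
			return ((size_t)(p - s));
		if (p[1] == '\0')
			return ((size_t)(p - s) + 1);
		if (p[2] == '\0')
			return ((size_t)(p - s) + 2);
		if (p[3] == '\0')
			return ((size_t)(p - s) + 3);
		p += 4;
	}
}
```

The unrolled form has more setup and more code, which is one plausible
reason for a simple loop to win on very short strings.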

basestrlen is fastest for lengths <= 5 on the Celeron.

basestrlen is fastest for lengths <= 9 on the A64.

Bruce

