Assembly string functions in i386 libc
Bruce Evans
brde at optusnet.com.au
Thu Jul 12 11:32:36 UTC 2007
On Thu, 12 Jul 2007, Bruce Evans wrote:
> On Wed, 11 Jul 2007, Sean C. Farley wrote:
>
>> While looking at increasing the speed of strlen(), I noticed that on
>> i386 platforms (PIII, P4 and Athlon XP) the performance is abysmal in
>> libc compared to the version I was writing. After more testing, I found
>> it was only the assembly version that is really slow. The C version is
>> fairly quick. Is there a need to continue to use the assembly versions
>> of string functions on i386? Does it mainly help slower systems such as
>> those with i386 or i486 CPU's?
>
> I think you are mistaken about the asm version being slow. In my tests
> ...
Partly.
>> I have the results from my P4 (Id = 0xf24 Stepping = 4) system and the
>> test program here[1]. strlen.tar.bz2 is the archive of it for anyone's
>> testing. In the strlen/results subdirectory, there are the results for
>> strings of increasing lengths.
>
> Sorry, I didn't look at this. I just wrote a quick re-test and ran it
Now I've looked at it. I think it is not testing strlen() at all, except
for the libc case, because __pure allows the compiler to drop all but 1
of the repeated calls to strlen().
(The existence of __pure is also a bug. __pure was the FreeBSD spelling
of the __const__ attribute in gcc-1. It was removed when special support
for gcc-1 was dropped, and should not have been recycled.) __pure is a
syntax error in the old version of FreeBSD that I tested on. I first
tried __pure2, which is the FreeBSD spelling of the __const__ attribute
in gcc-2. I think it is weaker than the __pure__ attribute in gcc-3.
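One way to keep a timing loop honest regardless of such attributes is to
call the function under test through a volatile function pointer, so that
the compiler cannot merge or hoist the calls. The following is only a
minimal sketch under that assumption: time_calls(), strlen_fn and
base_strlen() are hypothetical names for illustration and are not taken
from strlen.tar.bz2.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical type for any strlen()-compatible function. */
typedef size_t (*strlen_fn)(const char *);

/* Plain byte-at-a-time loop, standing in for a "base" C version. */
size_t
base_strlen(const char *s)
{
	const char *p = s;

	while (*p != '\0')
		p++;
	return ((size_t)(p - s));
}

/*
 * Call fn(s) iters times and return the elapsed CPU time in seconds.
 * The pointer is re-read from a volatile object on every iteration, so
 * the compiler must emit a real call each time even if fn is declared
 * with a pure or const attribute.  The volatile sink keeps every result
 * live; the last one is passed back through *last.
 */
double
time_calls(strlen_fn fn, const char *s, unsigned long iters, size_t *last)
{
	strlen_fn volatile vfn = fn;
	volatile size_t sink = 0;
	clock_t start, end;
	unsigned long i;

	start = clock();
	for (i = 0; i < iters; i++)
		sink = vfn(s);
	end = clock();
	*last = sink;
	return ((double)(end - start) / CLOCKS_PER_SEC);
}
```

A caller would do something like time_calls(base_strlen, buf, 10000000UL,
&len) for each candidate. The indirect call adds a small constant cost to
every iteration, so it shifts all results equally rather than favouring
one version.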
After removing __pure* and adding -static -g to CFLAGS, with gcc-3.3.3:
On an old Celeron (400MHz) (all P2's probably behave like this):
%%%
libcstrlen: time spent executing strlen(string) = 64: 7.786868
basestrlen: time spent executing strlen(string) = 64: 3.816736
strlen: time spent executing strlen(string) = 64: 3.364313
strlen2: time spent executing strlen(string) = 64: 2.662973
%%%
rep scasb (the libc asm version) is apparently very slow on P2's: about
twice the time of the simple C loop here.
On an A64 in i386 mode:
%%%
libcstrlen: time spent executing strlen(string) = 64: 0.709657
basestrlen: time spent executing strlen(string) = 64: 0.691397
strlen: time spent executing strlen(string) = 64: 0.527339
strlen2: time spent executing strlen(string) = 64: 0.441090
%%%
Now rep scasb is only slightly slower than the simple C loop (since all
small loops take 2 cycles per iteration on AXP and A64...). strlen and
strlen2 are marginally faster since their loops do more per iteration.
basestrlen is fastest for lengths <= 5 on the Celeron.
basestrlen is fastest for lengths <= 9 on the A64.
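To illustrate what "doing more" in a loop can mean, here is a rough sketch
of a simple byte loop next to a version unrolled by 4. These are
hypothetical examples (simple_strlen() and unrolled_strlen() are made-up
names), not the basestrlen, strlen or strlen2 from the test program, which
may use a different technique. The unrolled version has extra code to get
through before its first result, which is one plausible reason a simple
loop can win for very short strings.

```c
#include <stddef.h>

/* Byte-at-a-time: one load, one test and one branch per byte. */
size_t
simple_strlen(const char *s)
{
	const char *p;

	for (p = s; *p != '\0'; p++)
		continue;
	return ((size_t)(p - s));
}

/*
 * Unrolled by 4: one backward branch and one pointer update per four
 * bytes.  Bytes are still tested in order and the function returns at
 * the first NUL, so it never reads past the terminator.
 */
size_t
unrolled_strlen(const char *s)
{
	const char *p = s;

	for (;;) {
		if (p[0] == '\0')
			return ((size_t)(p - s));
		if (p[1] == '\0')
			return ((size_t)(p - s) + 1);
		if (p[2] == '\0')
			return ((size_t)(p - s) + 2);
		if (p[3] == '\0')
			return ((size_t)(p - s) + 3);
		p += 4;
	}
}
```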
Bruce