[Bug 199587] libc strncmp() performance

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 27 Nov 2024 13:27:30 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199587

Robert Clausecker <fuz@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
           Assignee|bugs@FreeBSD.org            |fuz@FreeBSD.org
             Status|New                         |Closed
                 CC|                            |fuz@FreeBSD.org

--- Comment #2 from Robert Clausecker <fuz@FreeBSD.org> ---
strncmp has been rewritten in assembly for better performance in FreeBSD 14.1.

Do the performance issues still occur?  If yes, please reopen.

Note that your benchmark favours your own implementation: you always use the
same inputs to strncmp(), allowing the branch predictor to predict the control
flow of your very branchy implementation perfectly.  This is not reflective of
real-world input where each pair of strings passed to strncmp() is likely
different.

I have designed a more comprehensive benchmark here:

    https://github.com/clausecker/strperf

Comparing your function with the SIMD-enhanced implementation I wrote, we get:

os: FreeBSD
arch: amd64
cpu: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz
                      │ strncmp.simple.out │          strncmp.scalar.out       
  │        strncmp.baseline.out         │
                      │       sec/op       │    sec/op     vs base             
  │   sec/op     vs base                │
StrncmpShortAligned           144.11µ ± 0%   141.34µ ± 1%   -1.92% (p=0.000
n=20)   63.05µ ± 0%  -56.25% (p=0.000 n=20)
StrncmpMidAligned              83.93µ ± 0%    64.46µ ± 1%  -23.20% (p=0.000
n=20)   21.51µ ± 1%  -74.37% (p=0.000 n=20)
StrncmpLongAligned            59.528µ ± 1%   38.266µ ± 0%  -35.72% (p=0.000
n=20)   6.154µ ± 1%  -89.66% (p=0.000 n=20)
StrncmpShortUnaligned         143.07µ ± 0%   142.14µ ± 1%   -0.65% (p=0.003
n=20)   69.63µ ± 0%  -51.33% (p=0.000 n=20)
StrncmpMidUnaligned            83.61µ ± 1%    64.62µ ± 0%  -22.71% (p=0.000
n=20)   26.60µ ± 1%  -68.19% (p=0.000 n=20)
StrncmpLongUnaligned          59.416µ ± 0%   38.392µ ± 1%  -35.39% (p=0.000
n=20)   6.356µ ± 1%  -89.30% (p=0.000 n=20)
StrncmpShortQsort              1.000m ± 1%    1.053m ± 0%   +5.28% (p=0.000
n=20)   1.221m ± 1%  +22.11% (p=0.000 n=20)
StrncmpMidQsort                243.2µ ± 0%    243.4µ ± 0%        ~ (p=0.968
n=20)   271.7µ ± 0%  +11.71% (p=0.000 n=20)
geomean                        137.1µ         115.4µ       -15.78%             
    48.88µ       -64.33%

                      │ strncmp.simple.out │          strncmp.scalar.out       
  │          strncmp.baseline.out          │
                      │        B/s         │     B/s       vs base             
  │      B/s       vs base                 │
StrncmpShortAligned           867.4Mi ± 0%   884.4Mi ± 1%   +1.95% (p=0.000
n=20)   1982.6Mi ± 0%  +128.57% (p=0.000 n=20)
StrncmpMidAligned             1.454Gi ± 0%   1.894Gi ± 1%  +30.20% (p=0.000
n=20)    5.675Gi ± 1%  +290.14% (p=0.000 n=20)
StrncmpLongAligned            2.051Gi ± 1%   3.190Gi ± 0%  +55.56% (p=0.000
n=20)   19.835Gi ± 1%  +867.25% (p=0.000 n=20)
StrncmpShortUnaligned         873.7Mi ± 0%   879.4Mi ± 1%   +0.66% (p=0.003
n=20)   1795.2Mi ± 0%  +105.47% (p=0.000 n=20)
StrncmpMidUnaligned           1.460Gi ± 1%   1.889Gi ± 0%  +29.39% (p=0.000
n=20)    4.589Gi ± 1%  +214.35% (p=0.000 n=20)
StrncmpLongUnaligned          2.054Gi ± 0%   3.180Gi ± 1%  +54.76% (p=0.000
n=20)   19.207Gi ± 1%  +834.86% (p=0.000 n=20)
StrncmpShortQsort             125.0Mi ± 1%   118.7Mi ± 0%   -5.01% (p=0.000
n=20)    102.4Mi ± 1%   -18.10% (p=0.000 n=20)
StrncmpMidQsort               514.0Mi ± 0%   513.6Mi ± 0%        ~ (p=0.968
n=20)    460.1Mi ± 0%   -10.49% (p=0.000 n=20)
geomean                       912.1Mi        1.058Gi       +18.74%             
     2.497Gi       +180.37%

where "simple" is your implementation (compiled with -O3), "scalar" is our
non-SIMD implementation (see simd(7), somewhat similar to yours, but manually
unrolled) and "baseline" is our SIMD implementation.  As you can see, the
simple implementation is only faster on the qsort benchmark (sorting an array
of random strings) and there only slightly.

-- 
You are receiving this mail because:
You are the assignee for the bug.