[Bug 199587] libc strncmp() performance
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Wed, 27 Nov 2024 13:27:30 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199587 Robert Clausecker <fuz@FreeBSD.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Resolution|--- |FIXED Assignee|bugs@FreeBSD.org |fuz@FreeBSD.org Status|New |Closed CC| |fuz@FreeBSD.org --- Comment #2 from Robert Clausecker <fuz@FreeBSD.org> --- strncmp has been rewritten in assembly for better performance in FreeBSD 14.1. Do the performance issues still occur? If yes, please reopen. Note that your benchmark favours your own implementation: you always use the same inputs to strncmp(), allowing the branch predictor to predict the control flow of your very branchy implementation perfectly. This is not reflective of real-world input where each pair of strings passed to strncmp() is likely different. I have designed a more comprehensive benchmark here: https://github.com/clausecker/strperf Comparing your function with the SIMD-enhanced implementation I wrote, we get: os: FreeBSD arch: amd64 cpu: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz │ strncmp.simple.out │ strncmp.scalar.out │ strncmp.baseline.out │ │ sec/op │ sec/op vs base │ sec/op vs base │ StrncmpShortAligned 144.11µ ± 0% 141.34µ ± 1% -1.92% (p=0.000 n=20) 63.05µ ± 0% -56.25% (p=0.000 n=20) StrncmpMidAligned 83.93µ ± 0% 64.46µ ± 1% -23.20% (p=0.000 n=20) 21.51µ ± 1% -74.37% (p=0.000 n=20) StrncmpLongAligned 59.528µ ± 1% 38.266µ ± 0% -35.72% (p=0.000 n=20) 6.154µ ± 1% -89.66% (p=0.000 n=20) StrncmpShortUnaligned 143.07µ ± 0% 142.14µ ± 1% -0.65% (p=0.003 n=20) 69.63µ ± 0% -51.33% (p=0.000 n=20) StrncmpMidUnaligned 83.61µ ± 1% 64.62µ ± 0% -22.71% (p=0.000 n=20) 26.60µ ± 1% -68.19% (p=0.000 n=20) StrncmpLongUnaligned 59.416µ ± 0% 38.392µ ± 1% -35.39% (p=0.000 n=20) 6.356µ ± 1% -89.30% (p=0.000 n=20) StrncmpShortQsort 1.000m ± 1% 1.053m ± 0% +5.28% (p=0.000 n=20) 1.221m ± 1% +22.11% (p=0.000 n=20) StrncmpMidQsort 243.2µ ± 0% 243.4µ ± 0% ~ (p=0.968 n=20) 271.7µ ± 0% +11.71% (p=0.000 n=20) geomean 137.1µ 115.4µ -15.78% 48.88µ -64.33% │ strncmp.simple.out │ strncmp.scalar.out │ strncmp.baseline.out │ │ B/s │ B/s vs base │ B/s vs base │ StrncmpShortAligned 867.4Mi ± 0% 884.4Mi ± 1% +1.95% (p=0.000 n=20) 1982.6Mi ± 0% +128.57% (p=0.000 n=20) StrncmpMidAligned 1.454Gi ± 0% 1.894Gi ± 1% +30.20% (p=0.000 n=20) 5.675Gi ± 1% +290.14% (p=0.000 n=20) StrncmpLongAligned 2.051Gi ± 1% 3.190Gi ± 0% +55.56% (p=0.000 n=20) 19.835Gi ± 1% +867.25% (p=0.000 n=20) StrncmpShortUnaligned 873.7Mi ± 0% 879.4Mi ± 1% +0.66% (p=0.003 n=20) 1795.2Mi ± 0% +105.47% (p=0.000 n=20) StrncmpMidUnaligned 1.460Gi ± 1% 1.889Gi ± 0% +29.39% (p=0.000 n=20) 4.589Gi ± 1% +214.35% (p=0.000 n=20) StrncmpLongUnaligned 2.054Gi ± 0% 3.180Gi ± 1% +54.76% (p=0.000 n=20) 19.207Gi ± 1% +834.86% (p=0.000 n=20) StrncmpShortQsort 125.0Mi ± 1% 118.7Mi ± 0% -5.01% (p=0.000 n=20) 102.4Mi ± 1% -18.10% (p=0.000 n=20) StrncmpMidQsort 514.0Mi ± 0% 513.6Mi ± 0% ~ (p=0.968 n=20) 460.1Mi ± 0% -10.49% (p=0.000 n=20) geomean 912.1Mi 1.058Gi +18.74% 2.497Gi +180.37% where "simple" is your implementation (compiled with -O3), "scalar" is our non-SIMD implementation (see simd(7), somewhat similar to yours, but manually unrolled) and "baseline" is our SIMD implementation. As you can see, the simple implementation is only faster on the qsort benchmark (sorting an array of random strings) and there only slightly. -- You are receiving this mail because: You are the assignee for the bug.