Performance improvement to strnlen().

Sun May 26 06:30:40 UTC 2013

On 05/25/2013 10:27 PM, Lee Thomas wrote:
> Hello FreeBSD devs,
> 
> I have found a performance improvement to libc's strnlen().
> lib/libc/string/strnlen.c is a trivial byte-by-byte implementation,
> where strlen.c has a smarter word-by-word implementation. I have
> implemented strnlen similarly to strlen. It runs about 4x as fast, at
> the cost of a binary codesize increase from 30 bytes to 221.
> 
> In writing this I needed a test, and there isn't one in
> tools/regression/lib/libc/string, so I wrote a unit test for strnlen.
> This really only makes sense for a word-by-word implementation, as it
> tests each combination of string and limit length, overread characters,
> and starting alignment.
> 
> Could someone please review these patches? I am not very experienced in
> C, and am even less experienced with FreeBSD's style guidelines, so they
> likely have a bunch of style and idiom issues. Even if one or both of
> them prove not worth committing, it would still be useful for my learning.
> 
> If/When these patches prove worthy of submitting, what is the next step
> after that? Should I submit a PR, or is there some other process? This
> code is quite similar to the existing strlen.c, and doesn't do anything
> fancy with e.g. endianness, but I haven't tested it for correctness or
> performance on anything other than x86...
> 
> And finally, there is some other low-hanging fruit in the other strn*
> functions. Would it be worth it for me to give those the same treatment?
> 
> Thanks,
> Lee Thomas
> 
> Test platform:
>     $uname -a
>         FreeBSD LeeDesktop 9.1-STABLE FreeBSD 9.1-STABLE #15 r250831:
> Mon May 20 17:31:28 EDT 2013
>             lthomas at LeeDesktop:/usr/obj/usr/src/sys/NOSOUND  amd64
>     $dmesg | grep CPU:
>         CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
> (2666.81-MHz K8-class CPU)
>     $clang --version
>         FreeBSD clang version 3.2 (tags/RELEASE_32/final 170710) 20121221
>         Target: x86_64-unknown-freebsd9.1
>         Thread model: posix
> 
> My timing test was a file of 10000 strings, of uniformly random length,
> 50% between 0 and 20 bytes long, and 50% between 21 and 1000 bytes long.
> Each was filled with random generated bytes in the range [1, 255]. The
> test executables run their strlen on each string in the file 1000 times,
> and a shell script ran the test programs alternately 100 times. Here are
> the clang results; gcc results are roughly the same. I will share the
> actual timing code if someone wants it, but it is pretty ugly C++ and
> shell and I don't guarantee it working :-).
> 
> Results:
> 
> x ./times_bsd_strnlen.txt
> + ./times_my_strnlen.txt
> +--------------------------------------------------------------------------+
> 
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |A                                                                       A|
> +--------------------------------------------------------------------------+
> 
>     N           Min           Max        Median           Avg        Stddev
> x 101     1.8685486     1.8693889     1.8687506     1.8687894  0.0001484903
> + 101    0.42704683    0.42779486    0.42712813    0.42715597 0.00010681494
> Difference at 95.0% confidence
>     -1.44163 +/- 3.56739e-05
>     -77.1426% +/- 0.00190893%
>     (Student's t, pooled s = 0.000129342)
> 
> Note: I manually shortened the histogram, as it was 100 lines long of
> exactly the same.
> 
> 
> _______________________________________________
> freebsd-hackers at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe at freebsd.org"
> 

> Index: strnlen.c
> ===================================================================
> diff --git a/head/lib/libc/string/strnlen.c b/head/lib/libc/string/strnlen.c
> --- a/head/lib/libc/string/strnlen.c	(revision 250951)
> +++ b/head/lib/libc/string/strnlen.c	(working copy)
> @@ -1,5 +1,6 @@
>  /*-
> - * Copyright (c) 2009 David Schultz <das at FreeBSD.org>
> + * Copyright (c) 2009, 2010 Xin LI <delphij at FreeBSD.org>
> + * Copyright (c) 2013 Lee Thomas <lee_thomas at AslanTools.com>
>   * All rights reserved.
>   *
>   * Redistribution and use in source and binary forms, with or without
> @@ -27,16 +28,91 @@
>  #include <sys/cdefs.h>
>  __FBSDID("$FreeBSD$");
>  
> +#include <sys/limits.h>
> +#include <sys/types.h>
>  #include <string.h>
>  
> +/*
> + * Portable strnlen() for 32-bit and 64-bit systems.
> + *
> + * Rationale: it is generally much more efficient to do word length
> + * operations and avoid branches on modern computer systems, as
> + * compared to byte-length operations with a lot of branches.
> + *
> + * The expression:
> + *
> + *	((x - 0x01....01) & ~x & 0x80....80)
> + *
> + * would evaluate to a non-zero value iff any of the bytes in the
> + * original word is zero.
> + *
> + * On multi-issue processors, we can divide the above expression into:
> + *	a)  (x - 0x01....01)
> + *	b) (~x & 0x80....80)
> + *	c) a & b
> + *
> + * Where, a) and b) can be partially computed in parallel.
> + *
> + * The algorithm above is found on "Hacker's Delight" by
> + * Henry S. Warren, Jr.
> + */
> +
> +/* Magic numbers for the algorithm */
> +#if LONG_BIT == 32
> +static const unsigned long mask01 = 0x01010101;
> +static const unsigned long mask80 = 0x80808080;
> +#elif LONG_BIT == 64
> +static const unsigned long mask01 = 0x0101010101010101;
> +static const unsigned long mask80 = 0x8080808080808080;
> +#else
> +#error Unsupported word size
> +#endif
> +
> +#define	LONGPTR_MASK (sizeof(long) - 1)
> +
>  size_t
> -strnlen(const char *s, size_t maxlen)
> +strnlen(const char *str, size_t maxlen)
>  {
> -	size_t len;
> +	const char *stop, *short_stop;
> +	const char *p;
> +	const unsigned long *lp;
> +	long va, vb;
>  
> -	for (len = 0; len < maxlen; len++, s++) {
> -		if (!*s)
> -			break;
> +	if (maxlen==0) return 0;
> +
> +	stop=str+maxlen;
> +	/*
> +	 * Before trying the hard (unaligned byte-by-byte access) way
> +	 * to figure out whether there is a nul character, try to see
> +	 * if there is a nul character is within this accessible word
> +	 * first.
> +	 *
> +	 * p and (p & ~LONGPTR_MASK) must be equally accessible since
> +	 * they always fall in the same memory page, as long as page
> +	 * boundaries is integral multiple of word size.
> +	 */
> +	lp = (const unsigned long *)((uintptr_t)str & ~LONGPTR_MASK);
> +	va = (*lp - mask01);
> +	vb = ((~*lp) & mask80);
I do not think that this correct C. This is type punning violating the
rules of the language.

One way you can rewrite this is to use memcpy() into a temporary. Do not
be afraid of the memcpy(), modern GCC, and I hope even Clang, can
optimize it into a single move instruction:

{
  long tmp;
  memcpy(&tmp, lp, sizeof(tmp));
  va = (tmp - mask01);
  vb = ((~tmp) & mask80);
}

Another way could be using the union trick.

> +	lp++;
> +	if (va & vb) {
> +		/* Check if we have \0 in the first part */
> +		short_stop=(const char *)lp;
> +		if (stop<short_stop) short_stop=stop;
> +		for (p=str; p != short_stop; p++)
> +			if (*p == '\0')
> +				return (p-str);
>  	}
> -	return (len);
> +	/* Scan the rest of the string using word sized operation */
> +	for (; (const char *)lp < stop; lp++) {
> +		va = (*lp - mask01);
> +		vb = ((~*lp) & mask80);
Ditto.

> +		if (va & vb) {
> +			for (p=(const char *)lp; p != stop; p++)
> +				if (*p == '\0')
> +					break;
> +			return (p-str);
> +		}
> +	}
> +	return (maxlen);
>  }

-- 
VZ

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 293 bytes
Desc: OpenPGP digital signature
URL: <http://lists.freebsd.org/pipermail/freebsd-hackers/attachments/20130526/11781461/attachment.sig>