Performance improvement to strnlen().

Mon May 27 18:26:22 UTC 2013

> 
> Index: strnlen.c
> ===================================================================
> diff --git a/head/lib/libc/string/strnlen.c b/head/lib/libc/string/strnlen.c
> --- a/head/lib/libc/string/strnlen.c	(revision 250951)
> +++ b/head/lib/libc/string/strnlen.c	(working copy)
> @@ -1,5 +1,6 @@
>  /*-
> - * Copyright (c) 2009 David Schultz <das at FreeBSD.org>
> + * Copyright (c) 2009, 2010 Xin LI <delphij at FreeBSD.org>
> + * Copyright (c) 2013 Lee Thomas <lee_thomas at AslanTools.com>
>   * All rights reserved.
>   *
>   * Redistribution and use in source and binary forms, with or without
> @@ -27,16 +28,91 @@
>  #include <sys/cdefs.h>
>  __FBSDID("$FreeBSD$");
>  
> +#include <sys/limits.h>
> +#include <sys/types.h>
>  #include <string.h>
>  
> +/*
> + * Portable strnlen() for 32-bit and 64-bit systems.
> + *
> + * Rationale: it is generally much more efficient to do word length
> + * operations and avoid branches on modern computer systems, as
> + * compared to byte-length operations with a lot of branches.
> + *
> + * The expression:
> + *
> + *	((x - 0x01....01) & ~x & 0x80....80)
> + *
> + * would evaluate to a non-zero value iff any of the bytes in the
> + * original word is zero.
> + *
> + * On multi-issue processors, we can divide the above expression into:
> + *	a)  (x - 0x01....01)
> + *	b) (~x & 0x80....80)
> + *	c) a & b
> + *
> + * Where, a) and b) can be partially computed in parallel.
> + *
> + * The algorithm above is found on "Hacker's Delight" by
> + * Henry S. Warren, Jr.
> + */
> +
> +/* Magic numbers for the algorithm */
> +#if LONG_BIT == 32
> +static const unsigned long mask01 = 0x01010101;
> +static const unsigned long mask80 = 0x80808080;
> +#elif LONG_BIT == 64
> +static const unsigned long mask01 = 0x0101010101010101;
> +static const unsigned long mask80 = 0x8080808080808080;
> +#else
> +#error Unsupported word size
> +#endif
> +
> +#define	LONGPTR_MASK (sizeof(long) - 1)

I would not use this at all; (sizeof(long) - 1) is a common
expression that your audience probably understands
more quickly than "LONGPTR_MASK".

> +
>  size_t
> -strnlen(const char *s, size_t maxlen)
> +strnlen(const char *str, size_t maxlen)
>  {
> -	size_t len;
> +	const char *stop, *short_stop;
> +	const char *p;
> +	const unsigned long *lp;
> +	long va, vb;
>  
> -	for (len = 0; len < maxlen; len++, s++) {
> -		if (!*s)
> -			break;
> +	if (maxlen==0) return 0;
> +
> +	stop=str+maxlen;
> +	/*
> +	 * Before trying the hard (unaligned byte-by-byte access) way
> +	 * to figure out whether there is a nul character, try to see
> +	 * if there is a nul character is within this accessible word
> +	 * first.
> +	 *
> +	 * p and (p & ~LONGPTR_MASK) must be equally accessible since
> +	 * they always fall in the same memory page, as long as page
> +	 * boundaries is integral multiple of word size.
> +	 */
> +	lp = (const unsigned long *)((uintptr_t)str & ~LONGPTR_MASK);
> +	lp++;

Have you tested to see whether the extra work you're
doing for the initial segment really gains you much?  I
would have used a simpler byte-by-byte check as
follows:

  p = s;
  while (maxlen-- > 0 && p < (const char *)lp) {
     if (*p == '\0')
       return (p - s);
  }

  while (maxlen  > 0) {
     … complicated test of *lp …
     maxlen -= sizeof(*lp);
  }

Few points here:
  * This form of the initial segment test doesn't get surprised
     by a NUL byte just before the string.  Yours does a bunch of
     extra work in that case.
  * Duff's device might help unroll the first loop; would require
     testing to see if it was faster.  For something this simple, it
     might not be.
  * Your version tests the first word twice (once in the
     initial check and then again at the start of the word-sized
     pass).  This version doesn't.
  * Comparison with zero is often free, so a countdown to zero
     is often faster than a count up to a computed limit.

This assumes that 'lp' is pointing to the first aligned word
at or after the beginning of the string.
Your version does not leave lp pointing to
the beginning of the string when the string is initially
already aligned.  If the string is already fully aligned, my
version avoids the initial check entirely.

To compute 'lp' as a pointer to the first full word
at or after the beginning of the string:

lp = (const unsigned long *)
       (
          (
              (uintptr_t)str + sizeof(*lp) - 1
          )
          &
          (
               sizeof(*lp) - 1
          )
      );

I've broken this onto multiple lines to illustrate the structure;
the final code would of course be a little more compact.

Also note the use of sizeof(*lp) rather than
sizeof(long) or sizeof(unsigned long); this way,
you guarantee that the sizeof() matches lp even if someone
comes along later and changes the declaration of lp.

> +	va = (*lp - mask01);
> +	vb = ((~*lp) & mask80);
> +	if (va & vb) {
> +		/* Check if we have \0 in the first part */
> +		short_stop=(const char *)lp;
> +		if (stop<short_stop) short_stop=stop;
> +		for (p=str; p != short_stop; p++)
> +			if (*p == '\0')
> +				return (p-str);
>  	}
> -	return (len);
> +	/* Scan the rest of the string using word sized operation */
> +	for (; (const char *)lp < stop; lp++) {
> +		va = (*lp - mask01);
> +		vb = ((~*lp) & mask80);
> +		if (va & vb) {

I don't think the extra variables are helpful.  Just write it directly:

     if ((*lp - mask01) & ~*lp & mask80) {

That matches the comment you put at the beginning.

> +			for (p=(const char *)lp; p != stop; p++)
> +				if (*p == '\0')
> +					break;
> +			return (p-str);
> +		}
> +	}
> +	return (maxlen);
>  }

You should, of course, compare my suggestions above
against your version to see which is faster.  (Preferably run
such tests with both clang and gcc, on at least two
architectures, and with a couple of different optimization
flags.)

Tim