About the memory barrier in BSD libc
John Baldwin
jhb at freebsd.org
Tue Apr 24 15:10:43 UTC 2012
On Tuesday, April 24, 2012 10:03:48 am Konstantin Belousov wrote:
> On Tue, Apr 24, 2012 at 02:43:40PM +0100, Martin Simmons wrote:
> > >>>>> On Mon, 23 Apr 2012 16:03:43 +0300, Konstantin Belousov said:
> > >
> > > On Mon, Apr 23, 2012 at 08:33:05PM +0800, Fengwei yin wrote:
> > > > On Mon, Apr 23, 2012 at 8:07 PM, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > > > > On Mon, Apr 23, 2012 at 07:44:34PM +0800, Fengwei yin wrote:
> > > > >> On Mon, Apr 23, 2012 at 7:38 PM, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
> > > > >> > On Mon, Apr 23, 2012 at 07:26:54PM +0800, Fengwei yin wrote:
> > > > >> >
> > > > >> >> On Mon, Apr 23, 2012 at 5:40 PM, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
> > > > >> >> > On Mon, Apr 23, 2012 at 05:32:24PM +0800, Fengwei yin wrote:
> > > > >> >> >
> > > > >> >> >> On Mon, Apr 23, 2012 at 4:41 PM, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
> > > > >> >> >> > On Mon, Apr 23, 2012 at 02:56:03PM +0800, Fengwei yin wrote:
> > > > >> >> >> >
> > > > >> >> >> >> Hi list,
> > > > >> >> >> >> If this is not a correct question for the list, please let me
> > > > >> >> >> >> know, and sorry for the noise.
> > > > >> >> >> >>
> > > > >> >> >> >> I have a question regarding the BSD libc on SMP architectures.
> > > > >> >> >> >> I didn't see any memory barriers used in libc.
> > > > >> >> >> >> How can we make sure it's safe on SMP?
> > > > >> >> >> >
> > > > >> >> >> > /usr/include/machine/atomic.h:
> > > > >> >> >> >
> > > > >> >> >> > #define mb()    __asm __volatile("lock; addl $0,(%%esp)" : : : "memory")
> > > > >> >> >> > #define wmb()   __asm __volatile("lock; addl $0,(%%esp)" : : : "memory")
> > > > >> >> >> > #define rmb()   __asm __volatile("lock; addl $0,(%%esp)" : : : "memory")
> > > > >> >> >> >
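> > > > >> >> >> > A hypothetical sketch of how the pair would be used (made-up
> > > > >> >> >> > names, assuming the macros are usable from userland):
> > > > >> >> >> >
> > > > >> >> >> > #include <machine/atomic.h>
> > > > >> >> >> >
> > > > >> >> >> > static volatile int data, ready;
> > > > >> >> >> >
> > > > >> >> >> > void
> > > > >> >> >> > producer(void)
> > > > >> >> >> > {
> > > > >> >> >> >         data = 42;
> > > > >> >> >> >         wmb();          /* data store visible before flag store */
> > > > >> >> >> >         ready = 1;
> > > > >> >> >> > }
> > > > >> >> >> >
> > > > >> >> >> > int
> > > > >> >> >> > consumer(void)
> > > > >> >> >> > {
> > > > >> >> >> >         while (ready == 0)
> > > > >> >> >> >                 ;       /* spin until producer sets the flag */
> > > > >> >> >> >         rmb();          /* flag load ordered before data load */
> > > > >> >> >> >         return (data);
> > > > >> >> >> > }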
> > > > >> >> >>
> > > > >> >> >> Thanks for the information. But it looks like nobody uses it in
> > > > >> >> >> libc.
> > > > >> >> >
> > > > >> >> > I think nothing in libc needs a memory barrier: libc doesn't work
> > > > >> >> > with peripherals, and different macros are used for atomic
> > > > >> >> > operations.
> > > > >> >>
> > > > >> >> If we check the usage of __sinit(), it is a typical singleton
> > > > >> >> pattern, which needs a memory barrier to avoid potential SMP
> > > > >> >> issues.
> > > > >> >>
> > > > >> >> Or did I miss something here?
> > > > >> >
> > > > >> > Which architecture with cache incoherency does FreeBSD support?
> > > > >>
> > > > >> I suppose it's not related to cache incoherency (I could be wrong).
> > > > >> It's related to the reordering of instructions by the CPU.
> > > > >>
> > > > >> Here is a link explaining why a memory barrier is needed for a
> > > > >> singleton:
> > > > >> http://www.oaklib.org/docs/oak/singleton.html
> > > > >>
> > > > >> x86 has a strict memory model and may not suffer from this kind of
> > > > >> issue, but ARM needs to take care of it IMHO.
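> > > > >>
> > > > >> To illustrate (a hypothetical sketch, not code from libc; lock(),
> > > > >> unlock() and make_state() stand in for whatever locking and
> > > > >> initialization is used), the double-checked singleton needs the
> > > > >> barriers like this:
> > > > >>
> > > > >> struct state *make_state(void);
> > > > >> void lock(void), unlock(void);
> > > > >>
> > > > >> static struct state *instance;
> > > > >>
> > > > >> struct state *
> > > > >> get_instance(void)
> > > > >> {
> > > > >>         struct state *p = instance;
> > > > >>
> > > > >>         if (p == NULL) {
> > > > >>                 lock();
> > > > >>                 p = instance;
> > > > >>                 if (p == NULL) {
> > > > >>                         p = make_state();       /* fill in all fields */
> > > > >>                         wmb();  /* publish the fields before the pointer */
> > > > >>                         instance = p;
> > > > >>                 }
> > > > >>                 unlock();
> > > > >>         } else
> > > > >>                 rmb();  /* order the pointer load before field loads */
> > > > >>         return (p);
> > > > >> }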
> > > > >
> > > > > Please note that __sinit is idempotent, so double-initialization is not
> > > > > an issue there. The only possible problematic case would be another
> > > > > thread executing exit() and not noticing the non-NULL value of __cleanup
> > > > > that the current thread just set.
> > > > >
> > > > > I am not sure how real this race is. Each call to __sinit() is
> > > > > immediately followed by a lock acquire, typically FLOCKFILE(), which
> > > > > enforces full barrier semantics due to the pthread_mutex_lock() call.
> > > > > exit() performs the __cxa_finalize() call before checking the __cleanup
> > > > > value, and __cxa_finalize() itself locks atexit_mutex. So the race
> > > > > window is tiny and probably reachable only by somewhat buggy
> > > > > applications that call exit() while stdio operations are in progress.
> > > > >
> > > > > Also note that some functions assign to __cleanup unconditionally.
> > > > >
> > > > > Do you see any real issue due to non-synchronized access to __cleanup?
> > > >
> > > > No, I didn't see a real issue. I am just reviewing the code.
> > > >
> > > > If you don't think __sinit has an issue, let's check some other code:
> > > > line 68 in libc/stdio/fclose.c and
> > > > line 133 in libc/stdio/findfp.c (function __sfp())
> > > >
> > > > That code tries to free an fp slot by assigning 0 to fp->_flags. But if
> > > > the instructions can be reordered, another CPU could see fp->_flags
> > > > assigned 0 before the cleanup from line 57 to 67.
> > > >
> > > > Let's say another CPU is at line 133 of __sfp(): it could see fp->_flags
> > > > become 0 before it's aware that the cleanup (line 57 to line 67 in
> > > > libc/stdio/fclose.c) has happened.
> > > >
> > > > Note: the mutex of FUNLOCKFILE(fp) at line 69 of libc/stdio/fclose.c can
> > > > only make sure line 70 happens after line 68. It can't prevent the CPU
> > > > from reordering lines 57 ~ 68.
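> > > >
> > > > A minimal sketch of the concern (hypothetical, with simplified fields):
> > > >
> > > >         /* CPU A, in fclose(), program order: */
> > > >         fp->_file = -1;
> > > >         fp->_r = fp->_w = 0;
> > > >         fp->_flags = 0;         /* this store may become visible first */
> > > >
> > > >         /* CPU B, in __sfp(), scanning for a free slot: */
> > > >         if (fp->_flags == 0) {
> > > >                 fp->_flags = 1; /* claims the slot... */
> > > >                 /* ...while possibly still seeing stale _file/_r/_w */
> > > >         }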
> > >
> > > Yes, FUNLOCKFILE() there would have no effect on the potential CPU
reordering
> > > of the writes. But does the order of these writes matter at all ?
> > >
> > > Please note that __sfp() reinitializes all fields written by fclose().
> > > Only if the CPU executing fclose() is allowed to reorder operations so
> > > that the external effect of the _flags = 0 assignment can be observed
> > > before that CPU executes the other operations from fclose() could there
> > > be a problem.
> > >
> > > This is definitely impossible on Intel, and I do not know enough about
> > > other architectures to rule out the possibility. The _flags member is a
> > > short, so atomics cannot be used there. The easier solution, if this is
> > > indeed an issue, is to take thread_lock around the _flags = 0 assignment
> > > in fclose().
> >
> > This can be a problem, even on Intel, because the compiler can reorder the
> > stores. E.g. if I compile the following with gcc -O4 on amd64:
> >
> > struct foo { int x, y; };
> >
> > int bar(void);
> > int baz(void);
> >
> > void foo(struct foo *p)
> > {
> >         int x = bar();
> >         p->y = baz();   /* the compiler is free to reorder these stores */
> >         p->x = x;
> > }
> >
> > then I get the following assembly language, which sets p->x before p->y:
> >
> > movq %rdi, %rbx
> > call bar
> > movl %eax, %ebp
> > xorl %eax, %eax
> > call baz
> > movl %ebp, (%rbx)
> > movl %eax, 4(%rbx)
> >
> > __Martin
> Ok, as I already said, I think that the reordering is safe there.
>
> Anyway, the change below should remove all concerns.
>
> diff --git a/lib/libc/stdio/fclose.c b/lib/libc/stdio/fclose.c
> index f0629e8..383040e 100644
> --- a/lib/libc/stdio/fclose.c
> +++ b/lib/libc/stdio/fclose.c
> @@ -41,9 +41,12 @@ __FBSDID("$FreeBSD$");
> #include <stdio.h>
> #include <stdlib.h>
> #include "un-namespace.h"
> +#include <spinlock.h>
> #include "libc_private.h"
> #include "local.h"
>
> +extern spinlock_t __stdio_thread_lock;
> +
> int
> fclose(FILE *fp)
> {
> @@ -65,7 +68,11 @@ fclose(FILE *fp)
> FREELB(fp);
> fp->_file = -1;
> fp->_r = fp->_w = 0; /* Mess up if reaccessed. */
> + if (__isthreaded)
> + _SPINLOCK(&__stdio_thread_lock);
> fp->_flags = 0; /* Release this FILE for reuse. */
> + if (__isthreaded)
> + _SPINUNLOCK(&__stdio_thread_lock);
> FUNLOCKFILE(fp);
> return (r);
> }
> diff --git a/lib/libc/stdio/findfp.c b/lib/libc/stdio/findfp.c
> index 89c0536..bcd6f62 100644
> --- a/lib/libc/stdio/findfp.c
> +++ b/lib/libc/stdio/findfp.c
> @@ -82,9 +82,9 @@ static struct glue *lastglue = &uglue;
>
> static struct glue * moreglue(int);
>
> -static spinlock_t thread_lock = _SPINLOCK_INITIALIZER;
> -#define THREAD_LOCK() if (__isthreaded) _SPINLOCK(&thread_lock)
> -#define THREAD_UNLOCK() if (__isthreaded) _SPINUNLOCK(&thread_lock)
> +spinlock_t __stdio_thread_lock = _SPINLOCK_INITIALIZER;
> +#define THREAD_LOCK() if (__isthreaded) _SPINLOCK(&__stdio_thread_lock)
> +#define THREAD_UNLOCK() if (__isthreaded) _SPINUNLOCK(&__stdio_thread_lock)
>
> #if NOT_YET
> #define SET_GLUE_PTR(ptr, val) atomic_set_rel_ptr(&(ptr), (uintptr_t)(val))
Could you move the extern and THREAD_LOCK/UNLOCK macros into the stdio private
header file?
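
Something like this minimal sketch, perhaps (assuming lib/libc/stdio/local.h
is the right private header; untested):

        /* lib/libc/stdio/local.h */
        #include <spinlock.h>

        extern spinlock_t __stdio_thread_lock;

        #define THREAD_LOCK()   if (__isthreaded) _SPINLOCK(&__stdio_thread_lock)
        #define THREAD_UNLOCK() if (__isthreaded) _SPINUNLOCK(&__stdio_thread_lock)

Then fclose.c could use THREAD_LOCK()/THREAD_UNLOCK() directly instead of
open-coding the _SPINLOCK() calls.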
--
John Baldwin