threads/128180: pthread_cond_broadcast() lost wakeup

Sat Oct 18 03:00:18 UTC 2008

The following reply was made to PR threads/128180; it has been noted by GNATS.

From: Kurt Miller <kurt at intricatesoftware.com>
To: freebsd-gnats-submit at freebsd.org
Cc:  
Subject: Re: threads/128180: pthread_cond_broadcast() lost wakeup
Date: Fri, 17 Oct 2008 22:54:11 -0400

 Hi Daniel,

 Thanks for the review of the test program.

 On Friday 17 October 2008 7:44:58 pm Daniel Eischen wrote:
 > On Fri, 17 Oct 2008, Kurt Miller wrote:
 > 
 > > The test program outputs periodic printf's indicating
 > > progress is being made. When it stops the process is
 > > deadlocked. The lost wakeup can be confirmed by inspecting
 > > the saved_waiters local var in main(). Each time the
 > > deadlock occurs I see that saved_waiters is 8 which tells
 > > me all eight worker threads were waiting on the condition
 > > variable when the broadcast was sent. Then switch to the
 > > thread that is still waiting on the condition variable,
 > > and you can see that the last_cycle local var is one behind
 > > the cycles global var which indicates it didn't receive the
 > > last wakeup.
 > 
 > The test program doesn't look correct to me.  It seems possible
 > for only a few of the threads (as little as 2) to do all the
 > work.  Thread 1 can start doing work, then wait for a broadcast.
 > Thread 2 can start doing his work, then broadcast waking thread 1.

 I didn't fully describe why the design is the way it is. I
 understand some of the reasons why it was designed like this,
 but to fully understand it I would need to study the concurrent
 mark sweep garbage collector far more. I can explain a bit
 more of what I do understand.

 The controlling thread in jvm corresponds to the primordial
 thread in my test program. In the jvm the controlling thread
 is not in a loop. It just kicks off the worker threads and
 waits for them to complete, then returns back to the calling
 function. The jvm creates one worker thread per cpu; the workers
 wait around for the controlling thread to kick them off. The
 garbage collection work is divided amongst them.
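
 As a rough illustration of this kick-off-and-wait design, here
 is a minimal sketch in which the controlling thread and the
 workers share one mutex and one condition variable, which is
 what the discussion below suggests the jdk does. It is neither
 the jvm code nor the test program attached to the PR: struct
 monitor, mon_worker() and run_shared_cond() are made-up names,
 and only cycles and last_cycle echo the variables mentioned
 above. Since both kinds of waiter sleep on the same condition
 variable, every wait rechecks its own predicate and every
 wakeup is a broadcast.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Illustrative names only; not the jvm code or the PR's test program. */
struct monitor {
    pthread_mutex_t lock;
    pthread_cond_t  cond;      /* one condition variable shared by all threads */
    int  nworkers;
    long cycles;               /* bumped by the controller to start a round */
    int  finished;             /* workers done with the current round */
    int  shutdown;             /* set by the controller to end the workers */
    long work_done;            /* rounds completed, summed over all workers */
};

static void *mon_worker(void *arg)
{
    struct monitor *m = arg;
    long last_cycle = 0;

    pthread_mutex_lock(&m->lock);
    for (;;) {
        /* Recheck the predicate after every wakeup. */
        while (m->cycles == last_cycle && !m->shutdown)
            pthread_cond_wait(&m->cond, &m->lock);
        if (m->shutdown)
            break;
        last_cycle = m->cycles;
        pthread_mutex_unlock(&m->lock);
        /* this worker's share of the work would run here */
        pthread_mutex_lock(&m->lock);
        m->work_done++;
        /* Broadcast, not signal: a signal could wake another worker
         * instead of the controller, since they share the condvar. */
        if (++m->finished == m->nworkers)
            pthread_cond_broadcast(&m->cond);
    }
    pthread_mutex_unlock(&m->lock);
    return NULL;
}

/* Runs ncycles rounds on nworkers threads; returns total work units, -1 on error. */
long run_shared_cond(int nworkers, int ncycles)
{
    struct monitor m = { .nworkers = nworkers };
    pthread_t *tids;
    int i, c;

    if (nworkers <= 0 || (tids = malloc(nworkers * sizeof(*tids))) == NULL)
        return -1;
    pthread_mutex_init(&m.lock, NULL);
    pthread_cond_init(&m.cond, NULL);
    for (i = 0; i < nworkers; i++)
        if (pthread_create(&tids[i], NULL, mon_worker, &m) != 0)
            abort();

    pthread_mutex_lock(&m.lock);
    for (c = 0; c < ncycles; c++) {
        m.finished = 0;
        m.cycles++;                          /* kick off the workers */
        pthread_cond_broadcast(&m.cond);
        while (m.finished < m.nworkers)      /* wait for the last one */
            pthread_cond_wait(&m.cond, &m.lock);
    }
    m.shutdown = 1;
    pthread_cond_broadcast(&m.cond);
    pthread_mutex_unlock(&m.lock);

    for (i = 0; i < nworkers; i++)
        pthread_join(tids[i], NULL);
    pthread_cond_destroy(&m.cond);
    pthread_mutex_destroy(&m.lock);
    free(tids);
    assert(m.work_done == (long)nworkers * ncycles);
    return m.work_done;
}
```

 Calling run_shared_cond(8, 1000) from main() would mirror the
 eight-worker setup; if no wakeup is lost it returns 8000.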

 My test program has 8 worker threads because the problem was
 first reported to me on a dual quad-core amd64 system. My test
 systems are just dual core.

 > I think you need separate condition variables, one to wake up
 > the main thread when the last worker goes to sleep/finishes,
 > and one to wake up the workers.

 Indeed. In my first attempts to reproduce the lost wakeup
 problem I wrote the test program with a separate condition
 variable for letting the main thread know when the last worker 
 finished. However, that didn't reproduce the deadlock the
 jdk was experiencing. Only when I fully mimicked the underlying
 design of the jdk, did the deadlock get reproduced by the test
 program. Note that the jdk is written in C++ and the abstraction
 it provides makes for some pretty ugly code when translated
 into plain C.
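
 For reference, the two-condition-variable arrangement can be
 sketched roughly as follows. This is only an illustration under
 my own assumptions, not my earlier test program and not a patch
 for the jvm; struct gang, gang_worker() and run_cycles() are
 hypothetical names. One condition variable (start) wakes the
 workers, the other (done) wakes the main thread when the last
 worker finishes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names throughout; not taken from the jdk or the PR. */
struct gang {
    pthread_mutex_t lock;
    pthread_cond_t  start;     /* main -> workers: a new cycle has begun */
    pthread_cond_t  done;      /* last worker -> main: cycle is complete */
    int  nworkers;
    long cycle;                /* incremented by main for each round */
    int  finished;             /* workers finished in the current round */
    int  shutdown;             /* set by main to end the workers */
    long work_done;            /* rounds completed, summed over all workers */
};

static void *gang_worker(void *arg)
{
    struct gang *g = arg;
    long last_cycle = 0;

    pthread_mutex_lock(&g->lock);
    for (;;) {
        /* Predicate loop: also guards against spurious wakeups. */
        while (g->cycle == last_cycle && !g->shutdown)
            pthread_cond_wait(&g->start, &g->lock);
        if (g->shutdown)
            break;
        last_cycle = g->cycle;
        pthread_mutex_unlock(&g->lock);
        /* this worker's share of the work would run here */
        pthread_mutex_lock(&g->lock);
        g->work_done++;
        /* Only main waits on done, so a single signal is enough. */
        if (++g->finished == g->nworkers)
            pthread_cond_signal(&g->done);
    }
    pthread_mutex_unlock(&g->lock);
    return NULL;
}

/* Runs ncycles rounds on nworkers threads; returns total work units, -1 on error. */
long run_cycles(int nworkers, int ncycles)
{
    struct gang g = { .nworkers = nworkers };
    pthread_t *tids;
    int i, c;

    if (nworkers <= 0 || (tids = malloc(nworkers * sizeof(*tids))) == NULL)
        return -1;
    pthread_mutex_init(&g.lock, NULL);
    pthread_cond_init(&g.start, NULL);
    pthread_cond_init(&g.done, NULL);
    for (i = 0; i < nworkers; i++)
        if (pthread_create(&tids[i], NULL, gang_worker, &g) != 0)
            abort();

    pthread_mutex_lock(&g.lock);
    for (c = 0; c < ncycles; c++) {
        g.finished = 0;
        g.cycle++;
        pthread_cond_broadcast(&g.start);    /* wake every worker */
        while (g.finished < g.nworkers)      /* sleep until the last one is done */
            pthread_cond_wait(&g.done, &g.lock);
    }
    g.shutdown = 1;
    pthread_cond_broadcast(&g.start);
    pthread_mutex_unlock(&g.lock);

    for (i = 0; i < nworkers; i++)
        pthread_join(tids[i], NULL);
    pthread_cond_destroy(&g.done);
    pthread_cond_destroy(&g.start);
    pthread_mutex_destroy(&g.lock);
    free(tids);
    assert(g.work_done == (long)nworkers * ncycles);
    return g.work_done;
}
```

 With separate condition variables the workers are never woken
 by a completion notice and the main thread is never woken by a
 start notice, so fewer threads wake up just to go back to sleep.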

 I could make adjustments to the jvm code to introduce the
 second condition variable and incorporate that in future
 releases of the jdk. The problem is that the binary release
 of the jdk, Diablo, can't be changed without a new formal
 release process being followed. 

 While the test program's and the jdk's use of condition variables
 may not be ideal and may be somewhat unexpected, I do believe it
 is valid. It does work on Solaris, Linux and Windows without
 losing wakeups.

 With the 6.4 release coming soon, it would be great if the lost
 wakeup problem (which is rather serious) could be looked at and
 fixed before 6.4 is released.

 Regards,
 -Kurt