pthread_cond_timedwait() broken in 9-stable? (from JAN 10)
Jung-uk Kim
jkim at FreeBSD.org
Fri Feb 17 16:17:55 UTC 2012
On Friday 17 February 2012 06:28 am, David Xu wrote:
> On 2012/2/17 16:06, Julian Elischer wrote:
> > On 2/16/12 11:41 PM, Julian Elischer wrote:
> >> adding jkim as he seems to be the last person working with TSC.
> >>
> >> On 2/16/12 6:42 PM, David Xu wrote:
> >>> On 2012/2/17 10:19, Julian Elischer wrote:
> >>>> On 2/16/12 5:56 PM, David Xu wrote:
> >>>>> On 2012/2/17 8:42, Julian Elischer wrote:
> >>>>>> Adding David Xu for his thoughts since he reqrote the code
> >>>>>> in quesiton in revision 213098
> >>>>>>
> >>>>>> On 2/16/12 2:57 PM, Julian Elischer wrote:
> >>>>>>> On 2/16/12 1:06 PM, Julian Elischer wrote:
> >>>>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote:
> >>>>>>>>> on 15/02/2012 23:41 Julian Elischer said the following:
> >>>>>>>>>> The program fio (an IO test in ports) uses pthreads
> >>>>>>>>>>
> >>>>>>>>>> the following code (from fio-2.0.3, but its in earlier
> >>>>>>>>>> code too) has suddenly started misbehaving.
> >>>>>>>>>>
> >>>>>>>>>> clock_gettime(CLOCK_REALTIME,&t);
> >>>>>>>>>> t.tv_sec += seconds + 10;
> >>>>>>>>>>
> >>>>>>>>>> pthread_mutex_lock(&mutex->lock);
> >>>>>>>>>>
> >>>>>>>>>> while (!mutex->value&& !ret) {
> >>>>>>>>>> mutex->waiters++;
> >>>>>>>>>> ret =
> >>>>>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t);
> >>>>>>>>>> mutex->waiters--;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> if (!ret) {
> >>>>>>>>>> mutex->value--;
> >>>>>>>>>> pthread_mutex_unlock(&mutex->lock);
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> It turns out that 'ret' sometimes comes back instantly
> >>>>>>>>>> (on my machine) with a
> >>>>>>>>>> value of 60 (ETIMEDOUT)
> >>>>>>>>>> despite the fact that we set the timeout 10 seconds into
> >>>>>>>>>> the future.
> >>>>>>>>>>
> >>>>>>>>>> Has anyone else seen anything like this?
> >>>>>>>>>> (and yes the condition variable attribute have been set
> >>>>>>>>>> to use the REALTIME clock).
> >>>>>>>>>
> >>>>>>>>> But why?
> >>>>>>>>>
> >>>>>>>>> Just a hypothesis that maybe there is some issue with
> >>>>>>>>> time keeping on that system.
> >>>>>>>>> How would that code work out for you with MONOTONIC?
> >>>>>>>>
> >>>>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and
> >>>>>>>> CLOCK_MONOTONIC, and they both had the same problem..
> >>>>>>>> i.e. random early returns with ETIMEDOUT.
> >>>>>>>>
> >>>>>>>> I think we will try move out machine forward to a newer
> >>>>>>>> -stable to see if it resolves.
> >>>>>>>
> >>>>>>> Kan upgraded the machine today to today's 9.x branch tip
> >>>>>>> and the problem still occurs.
> >>>>>>> 8.x does not have this problem.
> >>>>>>>
> >>>>>>> I have not got a 9-RELEASE machine to test on.. so I can
> >>>>>>> not tell if this came in with the burst of stuff
> >>>>>>> that came in after the 9.x branch was unfrozen after the
> >>>>>>> release of 9.0.
> >>>>>
> >>>>> I am trying to reproduce the problem, do you have complete
> >>>>> sample code to test ?
> >>>>
> >>>> I'm still looking the exact set
> >>>> but on my machine (4 cpus) the program from ports sysutils/fio
> >>>> exhibits the problem when used with
> >>>> kern.timecounter.hardware=TSC-low and with the following
> >>>> config file:
> >>>>
> >>>> pu05 # cat config.fio
> >>>>
> >>>> [global]
> >>>> #clocksource=cpu
> >>>> direct=1
> >>>> rw=randread
> >>>> bs=4096
> >>>> fill_device=1
> >>>> numjobs=16
> >>>> iodepth=16
> >>>> #ioengine=posixaio
> >>>> #ioengine=psync
> >>>> ioengine=psync
> >>>> group_reporting
> >>>> norandommap
> >>>> time_based
> >>>> runtime=60000
> >>>> randrepeat=0
> >>>>
> >>>> [file1]
> >>>> filename=/dev/ada0
> >>>>
> >>>> pu05 #
> >>>> pu05 # fio config.fio
> >>>> fio: this platform does not support process shared mutexes,
> >>>> forcing use of threads. Use the 'thread' option to get rid of
> >>>> this warning. file1: (g=0): rw=randread, bs=4K-4K/4K-4K,
> >>>> ioengine=psync, iodepth=16 ...
> >>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync,
> >>>> iodepth=16 fio 2.0.3
> >>>> Starting 15 threads and 1 process
> >>>> fio: job startup hung? exiting.
> >>>> fio: 5 jobs failed to start
> >>>> Segmentation fault (core dumped)
> >>>> pu05#
> >>>>
> >>>>
> >>>> The reason 5 jobs failed to start is because the parent timed
> >>>> out on them immediately.
> >>>> It didn't time out on 10 of them apparently.
> >>>>
> >>>>
> >>>> if I set the timer to ACPI-fast it works as expected..
> >>>
> >>> maybe following code can check to see if TSC-LOW works by let
> >>> the thread run
> >>> on each cpu.
> >>>
> >>> gettimeofday(&prev, NULL);
> >>> int cpu = 0;
> >>> for (;;) {
> >>> cpuset_t set;
> >>> cpu = ++cpu % 4;
> >>> CPU_ZERO(&set);
> >>> CPU_SET(cpu, &set);
> >>> pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
> >>> gettimeofday(&cur, NULL);
> >>> if ( timercmp(&prev, &cur, >=)) {
> >>> abort();
> >>> }
> >>> }
> >
> > pu05# sysctl kern.timecounter.hardware=TSC-low
> > kern.timecounter.hardware: ACPI-fast -> TSC-low
> > pu05# ./test
> > ^C
> > pu05# cat test.c
> >
> > #include <stdlib.h>
> > #include <sys/param.h>
> > #include <sys/cpuset.h>
> > #include <pthread_np.h>
> >
> > #include <sys/time.h>
> >
> > main()
> > {
> > int cpu = 0;
> > struct timeval prev, cur;
> >
> > gettimeofday(&prev, NULL);
> > for (;;) {
> > cpuset_t set;
> > cpu = ++cpu % 4;
> > CPU_ZERO(&set);
> > CPU_SET(cpu, &set);
> > pthread_setaffinity_np(pthread_self(), sizeof(set),
> > &set); gettimeofday(&cur, NULL);
> > if ( timercmp(&prev, &cur, >)) {
> > abort();
> > }
> > prev = cur;
> > }
> > }
> >
> > pu05# ./test
> >
> > minutes pass.......
> >
> > ^C
> > pu05#
> >
> > so it looks as if the TSC is working ok..
> > I'm just going to check that the program is actually moving
> > CPU... yes it is moving around but I can't tell at what speed.
> > (according to top).
> >
> > so we are still left with a question of "where is the problem?"
> >
> > kernel TSC driver?
> > generic gettimeofday() code?
> > pthreads cond code?
> > the application?
>
> I am running the fio test on my notebook which is using TSC-low,
> it is on 9.0-RC3, I can not reproduce the problem for
> minutes, then I interrupt it with ctrl-c:
>
> http://people.freebsd.org/~davidxu/tsc_pthread/dmesg.txt
> http://people.freebsd.org/~davidxu/tsc_pthread/tc.txt
> http://people.freebsd.org/~davidxu/tsc_pthread/fio.txt
Your CPU is single-package, dual-core, and SMT-enabled. All cores
should be in perfect sync.
Jung-uk Kim
More information about the freebsd-threads
mailing list