pthread_cond_timedwait() broken in 9-stable? (from JAN 10)
Julian Elischer
julian at freebsd.org
Fri Feb 17 18:04:51 UTC 2012
On 2/17/12 3:28 AM, David Xu wrote:
> On 2012/2/17 16:06, Julian Elischer wrote:
>> On 2/16/12 11:41 PM, Julian Elischer wrote:
>>> adding jkim as he seems to be the last person working with TSC.
>>>
>>>
>>> On 2/16/12 6:42 PM, David Xu wrote:
>>>> On 2012/2/17 10:19, Julian Elischer wrote:
>>>>> On 2/16/12 5:56 PM, David Xu wrote:
>>>>>> On 2012/2/17 8:42, Julian Elischer wrote:
>>>>>>> Adding David Xu for his thoughts since he rewrote the code in
>>>>>>> question in revision 213098
>>>>>>>
>>>>>>> On 2/16/12 2:57 PM, Julian Elischer wrote:
>>>>>>>> On 2/16/12 1:06 PM, Julian Elischer wrote:
>>>>>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote:
>>>>>>>>>> on 15/02/2012 23:41 Julian Elischer said the following:
>>>>>>>>>>> The program fio (an IO test in ports) uses pthreads
>>>>>>>>>>>
>>>>>>>>>>> the following code (from fio-2.0.3, but it's in earlier
>>>>>>>>>>> code too)
>>>>>>>>>>> has suddenly started misbehaving.
>>>>>>>>>>>
>>>>>>>>>>> clock_gettime(CLOCK_REALTIME, &t);
>>>>>>>>>>> t.tv_sec += seconds + 10;
>>>>>>>>>>>
>>>>>>>>>>> pthread_mutex_lock(&mutex->lock);
>>>>>>>>>>>
>>>>>>>>>>> while (!mutex->value && !ret) {
>>>>>>>>>>>     mutex->waiters++;
>>>>>>>>>>>     ret = pthread_cond_timedwait(&mutex->cond,
>>>>>>>>>>>                                  &mutex->lock, &t);
>>>>>>>>>>>     mutex->waiters--;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> if (!ret) {
>>>>>>>>>>>     mutex->value--;
>>>>>>>>>>>     pthread_mutex_unlock(&mutex->lock);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It turns out that 'ret' sometimes comes back instantly (on
>>>>>>>>>>> my machine) with a
>>>>>>>>>>> value of 60 (ETIMEDOUT)
>>>>>>>>>>> despite the fact that we set the timeout 10 seconds into
>>>>>>>>>>> the future.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone else seen anything like this?
>>>>>>>>>>> (and yes the condition variable attributes have been set to
>>>>>>>>>>> use the REALTIME clock).
>>>>>>>>>> But why?
>>>>>>>>>>
>>>>>>>>>> Just a hypothesis that maybe there is some issue with time
>>>>>>>>>> keeping on that system.
>>>>>>>>>> How would that code work out for you with MONOTONIC?
>>>>>>>>>
>>>>>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and
>>>>>>>>> CLOCK_MONOTONIC, and they both had the same problem..
>>>>>>>>> i.e. random early returns with ETIMEDOUT.
>>>>>>>>>
>>>>>>>>> I think we will try moving our machine forward to a newer
>>>>>>>>> -stable to see if it resolves.
>>>>>>>> Kan upgraded the machine today to today's 9.x branch tip and
>>>>>>>> the problem still occurs.
>>>>>>>> 8.x does not have this problem.
>>>>>>>>
>>>>>>>> I have not got a 9-RELEASE machine to test on.. so I can not
>>>>>>>> tell if this came in with the burst of stuff
>>>>>>>> that came in after the 9.x branch was unfrozen after the
>>>>>>>> release of 9.0.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> I am trying to reproduce the problem, do you have complete
>>>>>> sample code to test ?
>>>>>
>>>>> I'm still looking for the exact set
>>>>> but on my machine (4 cpus) the program from ports sysutils/fio
>>>>> exhibits the problem when used with
>>>>> kern.timecounter.hardware=TSC-low and with the following config
>>>>> file:
>>>>>
>>>>> pu05 # cat config.fio
>>>>>
>>>>> [global]
>>>>> #clocksource=cpu
>>>>> direct=1
>>>>> rw=randread
>>>>> bs=4096
>>>>> fill_device=1
>>>>> numjobs=16
>>>>> iodepth=16
>>>>> #ioengine=posixaio
>>>>> #ioengine=psync
>>>>> ioengine=psync
>>>>> group_reporting
>>>>> norandommap
>>>>> time_based
>>>>> runtime=60000
>>>>> randrepeat=0
>>>>>
>>>>> [file1]
>>>>> filename=/dev/ada0
>>>>>
>>>>> pu05 #
>>>>> pu05 # fio config.fio
>>>>> fio: this platform does not support process shared mutexes,
>>>>> forcing use of threads. Use the 'thread' option to get rid of
>>>>> this warning.
>>>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync,
>>>>> iodepth=16
>>>>> ...
>>>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync,
>>>>> iodepth=16
>>>>> fio 2.0.3
>>>>> Starting 15 threads and 1 process
>>>>> fio: job startup hung? exiting.
>>>>> fio: 5 jobs failed to start
>>>>> Segmentation fault (core dumped)
>>>>> pu05#
>>>>>
>>>>>
>>>>> The reason 5 jobs failed to start is because the parent timed
>>>>> out on them immediately.
>>>>> It didn't time out on 10 of them apparently.
>>>>>
>>>>>
>>>>> if I set the timer to ACPI-fast it works as expected..
>>>> maybe the following code can check whether TSC-low works by letting
>>>> the thread run on each cpu.
>>>>
>>>> gettimeofday(&prev, NULL);
>>>> int cpu = 0;
>>>> for (;;) {
>>>>     cpuset_t set;
>>>>     /* rotate over cpus 0..3 (4 cpus assumed) */
>>>>     cpu = (cpu + 1) % 4;
>>>>     CPU_ZERO(&set);
>>>>     CPU_SET(cpu, &set);
>>>>     pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
>>>>     gettimeofday(&cur, NULL);
>>>>     /* time must never go backwards between cpus */
>>>>     if (timercmp(&prev, &cur, >)) {
>>>>         abort();
>>>>     }
>>>>     prev = cur;
>>>> }
>>>>
>>>>
>>
>> pu05# sysctl kern.timecounter.hardware=TSC-low
>> kern.timecounter.hardware: ACPI-fast -> TSC-low
>> pu05# ./test
>> ^C
>> pu05# cat test.c
>>
>> #include <stdlib.h>
>> #include <sys/param.h>
>> #include <sys/cpuset.h>
>> #include <pthread.h>
>> #include <pthread_np.h>
>>
>> #include <sys/time.h>
>>
>> int
>> main(void)
>> {
>>     int cpu = 0;
>>     struct timeval prev, cur;
>>
>>     gettimeofday(&prev, NULL);
>>     for (;;) {
>>         cpuset_t set;
>>         /* rotate over cpus 0..3 (this machine has 4) */
>>         cpu = (cpu + 1) % 4;
>>         CPU_ZERO(&set);
>>         CPU_SET(cpu, &set);
>>         pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
>>         gettimeofday(&cur, NULL);
>>         /* abort if time went backwards after the move */
>>         if (timercmp(&prev, &cur, >)) {
>>             abort();
>>         }
>>         prev = cur;
>>     }
>> }
>>
>> pu05# ./test
>>
>> minutes pass.......
>>
>> ^C
>> pu05#
>>
>> so it looks as if the TSC is working ok..
>> I'm just going to check that the program is actually moving CPU...
>> yes it is moving around but I can't tell at what speed. (according
>> to top).
>>
>> so we are still left with a question of "where is the problem?"
>>
>> kernel TSC driver?
>> generic gettimeofday() code?
>> pthreads cond code?
>> the application?
>>
>>
> I am running the fio test on my notebook which is using TSC-low,
> it is on 9.0-RC3, I can not reproduce the problem for
> minutes, then I interrupt it with ctrl-c: looks mot
>
> http://people.freebsd.org/~davidxu/tsc_pthread/dmesg.txt
> http://people.freebsd.org/~davidxu/tsc_pthread/tc.txt
> http://people.freebsd.org/~davidxu/tsc_pthread/fio.txt
>
>
looks normal to me..
I have not been able to test this on a 9-RELEASE machine.. just 9-stable..