svn commit: r236744 - in projects/calloutng/sys: kern sys

Sat Jun 9 23:26:02 UTC 2012

On Sat, 9 Jun 2012, Attilio Rao wrote:

> 2012/6/9 Alexander Motin <mav at freebsd.org>:
>> On 06/09/12 18:54, Attilio Rao wrote:
>>>
>>> 2012/6/9 Alexander Motin<mav at freebsd.org>:
>>>>
>>>> On 06/08/12 15:52, Attilio Rao wrote:
>>>>>
>>>>> 2012/6/8 Davide Italiano<davide at freebsd.org>:
>>>>>>
> [... excessive quoting deleting]
>>>>>> [... most lines with binary characters deleted]
>>>>>>
>>>>>> + \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 sleepq_s
et_timeout_bt(&nanowait,bt);
>>>>>> + \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 error =
sleepq_timedwait_sig(&nanowait,catch);
>>>>>> + \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 binuptim
e(&bt2);
>>>>>> + \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 if (catc
h) {
>>>>>> + \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0 \xc2\xa0
  \xc2\xa0 \xc2\xa0 \xc2\xa0 if (error != EWOULDBLOCK) {

Some lines with binary characters translated to ASCII by vi and vidcontrol.

>>>>> For this, I'd rather prefer that you patch sleepq_set_timeout() directly
>>>>> to be:
>>>>> void sleepq_set_timeout(void *wchan, int timo, struct bintime *bt);
>>>>>
>>>>> Then, if you pass a NULL ptr to bt the 'timo' is used, otherwise bt is
>>>>> given preference in the logic. You will need to patch the current few
>>>>> callers of sleepq_set_timo() to get NULL, but it is a small patch.
>>>>> I would really like that you do the same also in the callout KPI,
>>>>> possibly, because this avoids a lot of code duplication.
>>>>
>>>> As opposite opinion, I don't like an idea of functions with both kinds of
>>>> time arguments. It will break existing API/ABI (yes, rarely used, but
>>>> already existing and documented) without giving any real benefits.
>>>> Duplication of few code lines IMHO is not an argument at all if we are
>>>> speaking about choosing API. Speaking about such a popular API as
>>>> callout(9), I believe API breakage is just not an option.

I strongly agree.  However, ...

>>> I respectfully disagree.
>>> We can break KPI between major version and this is not a problem. You
>>> can consider to add some compatibility churn like this one for MFC, if
>>> you need, but for HEAD we should really have cleaner code rather than
>>> code duplication which brings to less mainteneability. For other
>>> example of double-loaded arguments, please look at the interrupt
>>> filter/interrupt threads interface.
>>> I really think having a specific function just for passing timo/bt is
>>> overkill. And you are really making a major change which is expected
>>> to bring to changed KPI.

I like separate functions even without the API considerations.

>> That is a question of comparing prices. In my opinion, the fact that we can
>> change KPI between major releases does not mean that we should do it without
>> really strong reason. IMO open source world too often tends to change the
>> rules and APIs. That may work for in-tree code, when person changing API
>> will catch up all consequences. But each change of that kind can make sad
>> some vendor you never heard about, when his code will no longer compile and
>> he will have to either add some more #ifdef's or drop FreeBSD support. If
>> keeping compatibility would cost us some performance or hundreds lines of
>> extra code, I would agree, but IMHO this case with few lines of code is
>> different.
>
> Well, I spoke privately about that with Davide too and already spent a
> lot of words for this too, which is an issue which just doesn't
> deserve that.
> I just wanted to show you that it is true that we already have code
> doing that (ifilter), that it doesn't violate any rule and that it
> keeps the KPI smaller and avoids code duplication.
> You are free to ignore my opinion, this is your project, but for an
> example of the "unmaintenability" please look at the newer Davide's
> code about adding direct dispatching of callouts on ISR directly. It
> patches correctly the _bt() variant of callout but completely leaves
> out the timo variant.
> While this is not a problem because of code committed to a project
> branch I think this perfectly explains what I mean.
>
> Please feel free to ignore the whole thread at you convenience, though.

... however, both APIs seem to be inadequate.  I started by pointing out
some style bugs in this commit privately, and then noticed that it had
some non-style bugs (mainly, breaking of signal handling by making its
`catch' clause unreachable, and then in connection with old bugs in
nanosleep(), I noticed that it seems to be necessary to be able to sleep
on any clock id -- especially CLOCK_REALTIME and CLOCK_MONOTONIC.  It
seems to be necessary to pass the clock id to sleepq code (if not to
callout code), and then you can easily handle the old ticks arg as a
special clock id).

One of the old bugs in nanosleep() is that it is specified to sleep
on CLOCK_REALTIME, but in FreeBSD it sleeps on CLOCK_MONOTONIC.  Bugs
in CLOCK_MONOTONIC make this bug very large.  POSIX specifies that
CLOCK_MONOTONIC is the time since some fixed time in the past, but in
FreeBSD it is the time since a highly variable time in the past
(boottimebin).  Sleeping on CLOCK_MONOTONIC is what is normally wanted,
but sleep() and nanosleep() must sleep on CLOCK_REALTIME for
compatibility.  POSIX specifies clock_nanosleep() (which is like
nanosleep() except it takes a clock id and a TIMER_ABSTIME flags
argument which can be used to specify that the time is absolute instead
of relative).  FreeBSD doesn't implement this.

Timeouts and their APIs have related bugs.  They only support relative
times on a monotonic clock that is similar to CLOCK_MONOTONIC (ticks++
is stopped during deep sleeps, while CLOCK_MONOTONIC may be stopped
or broken during deep sleeps, depending on whether the hardware behind
CLOCK_MONOTONIC was stopped or wrapped around during deep sleeps).

Fixing nanosleep() using old APIs seems to require something like the
following:
- break up long sleeps to short sleeps, to give more chance of not sleeping
   for much longer than necessary.  E.g., if the ticks counter loses 8 hours
   per night, then the error for sleep(7 * 86400) (7 days) would be 56 hours
   for a single sleep.  By sleeping for 1 day at a time, this error can be
   reduced to only 8 hours.  By sleeping for 1 hour at a time, this error
   can be reduced to only 1 hour, with large errors occurring only during the
   the hour after the system wakes up (when timeouts are bogusly extended
   acrors this hour).

   This is also needed to fix some overflow bugs (when userlands asks for
   sleeps of almost TIME_T_MAX seconds).

   This is also needed to fix cases where hardclock has a large calibration
   error.  Most x86 FreeBSD cluster machines had a calibration error of
   +-10% when they used the lapic timecounter.
- after each wakeup, check the time using CLOCK_REALTIME (nanotime())
   instead of using CLOCK_MONOTONIC_FAST_N_BROKEN (getnanouptime()).

With the new API, the above can still be done, but requires heavier
semantic fudging:
- still break up long sleeps.  Now the sleepq_set_timeout_bt() takes an
   absolute uptime where sleepq_set_timeout() takes a relative tick count.
   We still want a relative realtime, so the semantic mismatch is larger.
   But we can fudge it to essentially the same as before.  We break up
   the relative realtime so that the absolute uptime is not very far
   in the future.  Now we get errors from CLOCK_MONOTONIC being stopped
   instead of the ticks counter being stopped.
- after each wakeup, check the time using CLOCK_REALTIME (bintime())
   not quite as before instead of using CLOCK_MONOTONIC_FAST
   (binuptime()) not quite as before.  Note that the new code has
   already switched from CLOCK_MONOTONIC_FAST_N_BROKEN to
   CLOCK_MONOTONIC.  This seems to be because the new timeout code uses
   CLOCK_MONOTONIC internally since CLOCK_MONOTONIC_FAST_N_BROKEN is
   not accurate enough for some cases, and all external uses should use
   the same clock.  nanosleep() should use a precise clock for the
   check, but it can use an imprecise one for setting up the timeout
   except for very short timeouts.

   Suppose that the timeout hasn't expired, so that we sleep again.
   With the old API, we had to calculate the new sleep in ticks.  The
   code for this has been removed, and any new sleep would be to the
   same absolute uptime that was calculated for the first sleep.  This
   completed turning the whole loop into nonsense -- the check and
   possible re-sleep used to be needed because the ticks clock is not
   precisely calibrated relative to the CLOCK_MONOTONIC_FAST_N_BROKEN
   clock, but now CLOCK_MONOTONIC is used for everything.  CLOCK_MONOTONIC
   is consistently wrong, so sleepq_timewait_sig() shouldn't have
   returned unless the timeout has expired (or a signal).  However, we
   are supposed to sleep on CLOCK_REALTIME, and have broken up long
   sleeps, so we much calculate a new absolute uptime using the correct
   clocks (new relative realtime fudged to absolute uptime) and loop.

For both:
- wait for someone to fix CLOCK_MONOTONIC and timeouts so that we get
   woken up as soon as possible after the system wakes up.

Some of the above CLOCK_MONOTONICs should be CLOCK_UPTIME to match
"uptime" in function names and descriptions, but I wanted to use a
standard clock to get POSIX's specification of its non-brokenness
(it must count deep sleep times).  CLOCK_UPTIME can remain broken
and become different from CLOCK_MONOTONIC when the latter is fixed,
but this isn't very useful.  CLOCK_UPTIME doesn't actually give uptimes
when the boottime is moved.

Bruce