PSA: If you run -current, beware!
Luigi Rizzo
rizzo at iet.unipi.it
Thu Feb 5 07:48:38 UTC 2015
On Thursday, February 5, 2015, Peter Wemm <peter at wemm.org> wrote:
> On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
> > On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> > > Sometime in the Dec 10th through Jan 7th timeframe a timing bug was
> > > introduced to 11.x/head/-current. With HZ=1000 (the default for bare
> > > metal, not for a vm), the clocks stop just after 24 days of uptime. This
> > > means things like cron, sleep, timeouts etc stop working. TCP/IP won't
> > > time out or retransmit, etc etc. It can get ugly.
> > >
> > > The problem is NOT in 10.x/-stable.
> > >
> > > We hit this in the freebsd.org cluster, the builds that we used are:
> > > FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> > > FreeBSD 11.0-CURRENT #0 r276779: Wed Jan 7 18:47:09 UTC 2015 - broken
> > >
> > > If you are running -current in a situation where it'll accumulate uptime,
> > > you may want to take precautions. A reboot prior to 24 days uptime (as
> > > horrible a workaround as that is) will avoid it.
> > >
> > > Yes, this is being worked on.
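
(For reference, the "just after 24 days" figure follows from the width of the
tick counter. A minimal sketch, assuming a 32-bit int and a counter that starts
at 0; seconds_until_int_wrap is a name made up for this illustration, not a
kernel function.)

```c
#include <assert.h>
#include <limits.h>

/*
 * Seconds of uptime before a signed int tick counter that starts at 0
 * and advances hz times per second reaches INT_MAX.
 * Hypothetical helper, for illustration only.
 */
static long long
seconds_until_int_wrap(int hz)
{
	assert(hz > 0);
	return ((long long)INT_MAX / hz);
}
```

With hz = 1000 this gives 2147483 seconds, which is 24 days and roughly 20.5
hours. With hz = 100 the same computation gives about 248 days.
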
> >
> > So the issue is reproducible in 3 minutes after boot with the following
> > change in kern_clock.c:
> > volatile int ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
> >
> > It is fixed (in the proper meaning of the word, not like worked around,
> > covered by paper) by the patch at the end of the mail.
> >
> > We already have a history of trying to enable the much less ambitious
> > option -fno-strict-overflow, see r259045 and the revert in r259422. I do
> > not see any other way than to try one more time. Too many places in the
> > kernel depend on correctly wrapping two's-complement arithmetic, among
> > them the callwheel and the scheduler.
>
>
Rather than depending on a compiler option, wouldn't it be better/more
robust to change ticks to unsigned, which has specified wrapping behavior?
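
As an illustration of that idea, here is a minimal sketch. It is not the actual
kernel code, and the helper names (ticks_elapsed, deadline_reached, wrap_demo)
are hypothetical. It only shows that differences of unsigned counters stay
well defined across the wrap, as long as the two values are less than half the
counter range apart:

```c
#include <assert.h>
#include <limits.h>

/*
 * Ticks elapsed from 'then' to 'now'. Unsigned subtraction is
 * performed modulo 2^N, so the result is correct across a wrap.
 */
static unsigned int
ticks_elapsed(unsigned int now, unsigned int then)
{
	return (now - then);
}

/*
 * Nonzero once 'now' has reached or passed 'deadline'. Only valid
 * while the two values are less than half the counter range apart.
 */
static int
deadline_reached(unsigned int now, unsigned int deadline)
{
	return (now - deadline <= (unsigned int)INT_MAX);
}

static void
wrap_demo(void)
{
	unsigned int then = UINT_MAX - 4u;	/* 5 ticks before the wrap */
	unsigned int now = then + 10u;		/* wraps around to 5 */

	assert(now == 5u);
	assert(ticks_elapsed(now, then) == 10u);
	assert(deadline_reached(now, then));
	assert(!deadline_reached(then, now));
}
```

The comparison avoids converting the difference back to a signed type, so no
step relies on behavior the C standard leaves undefined.
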
Cheers
Luigi
> Ugh.
>
> I believe I have a smoking gun that suggests that the clock-stop problem is
> caused by the clang-3.5 import on Dec 31st.
>
> Backstory:
> http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
> http://www.airs.com/blog/archives/120
>
> I suspect that what has happened is that clang's optimizer got better at
> seeing the direct or indirect effects of integer overflow and clang (and
> gcc) take advantage of that.
>
> I have used a slightly different change for about 10 years:
>
> --- kern/kern_clock.c 2014-12-01 15:42:21.707911656 -0800
> +++ kern/kern_clock.c 2014-12-01 15:42:21.707911656 -0800
> @@ -410,6 +415,11 @@
> #ifdef SW_WATCHDOG
> EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
> #endif
> + /*
> + * Arrange for ticks to go negative just 5 minutes after boot
> + * to help catch sign problems sooner.
> + */
> + ticks = INT_MAX - (hz * 5 * 60);
> }
>
> /*
>
> This came about from when we had problems with integer overflow arithmetic
> in the tcp stack.
>
> In any case, I'm in the process of adding -fwrapv and the early wraparound
> to the freebsd.org cluster builds to give it some wider exercise.
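
(For readers wondering why -fwrapv matters here: signed integer overflow is
undefined behavior in C, so an optimizer may assume it never happens, while
-fwrapv asks GCC and clang to treat it as wrapping. The sketch below is a
generic textbook-style illustration with hypothetical function names, not code
from the kernel. What the signed version returns at INT_MAX depends on the
compiler and flags, so that case is deliberately not exercised.)

```c
#include <assert.h>
#include <limits.h>

/*
 * Signed: x + 1 overflows when x == INT_MAX, which is undefined
 * behavior. A compiler is allowed to assume that never happens and
 * reduce this function to "return 1".
 */
static int
next_is_larger_signed(int x)
{
	return (x + 1 > x);
}

/*
 * Unsigned: x + 1u wraps to 0 at UINT_MAX by definition, so the
 * comparison has to be evaluated as written.
 */
static int
next_is_larger_unsigned(unsigned int x)
{
	return (x + 1u > x);
}

static void
overflow_demo(void)
{
	assert(next_is_larger_signed(41));
	assert(next_is_larger_unsigned(41u));
	assert(!next_is_larger_unsigned(UINT_MAX));
}
```
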
>
> --
> Peter Wemm - peter at wemm.org; peter at FreeBSD.org;
> peter at yahoo-inc.com; KI6FJV
> UTF-8: for when a ' or ... just won’t do…
--
-----------------------------------------+-------------------------------
Prof. Luigi RIZZO, rizzo at iet.unipi.it . Dip. di Ing. dell'Informazione
http://www.iet.unipi.it/~luigi/ . Universita` di Pisa
TEL +39-050-2211611 . via Diotisalvi 2
Mobile +39-338-6809875 . 56122 PISA (Italy)
-----------------------------------------+-------------------------------