kern.sync_on_panic

Mon Jun 27 10:08:24 UTC 2011

on 26/06/2011 08:51 Warner Losh said the following:
> 
> On Jun 25, 2011, at 8:49 AM, Andriy Gapon wrote:
>> Does anybody actually use the kern.sync_on_panic tunable/sysctl? If yes,
>> then in what circumstances do you need it? That is, why doesn't any of
>> the alternatives work for you? For example:
>> 1. remounting filesystems R/O before the panic if you knowingly provoke
>>    it for testing
>> 2. using netboot for your test system
>> 3. using su+j, gjournal or a different filesystem altogether
>> 4. using fsck after reboot
>> 
>> It seems to me that syncing filesystems in panic context is an adventure.
>> And it may become even more of an adventure if we introduce code that
>> completely stops the scheduler during and after a panic.
> 
> I've used it in the past when I was developing a device driver that was in
> the late stages of maturing.  Since all the panics in the system happened
> when that driver dereferenced NULL, sync was safe because all the data
> structures were sane except those of the aforementioned driver.
> 
> (1) It was a production system, and everything that could be was already
> mounted r/o.  However, some small, but very critical, amount of data was
> still r/w and it was very important not to lose this data.  Production here
> likely should be in quotes, because it was in the late stages of
> testing/validation.  The problem was that without this, the saved state of
> the GPS receiver and other hardware would sometimes wind up being zero,
> which meant that we'd have to do a cold start, which cost us a few hours of
> time.  At the time I was doing this, we saw zero files a couple of times a
> day without this turned on.
>
> (2) Netbooting wasn't an option since we were qualifying a non-netbooting
> system.
>
> (3) These weren't available at the time, but the goal was to prevent data
> loss, not necessarily to avoid fsck on boot.
>
> (4) Data loss without it.
> 
> Now, I'll be the first to admit this has been a few years, and I haven't done
> a fresh evaluation to see if things are still safe.  I'll also be the first
> to admit that this was a useful debugging setting late in development, and
> not in production.  I'm also the first to admit this isn't what I'd call a
> very widespread case.  But it did come in very handy when chasing a few bugs
> to be able to do 10 panic/reboot cycles an hour rather than 2 a day.

A fine enough use-case for me.  I guess the problem ultimately boiled down to
peculiarities of UFS behavior, but still...
However, please be aware that sync_on_panic might get broken when/if we start
stopping the scheduler in panic.

-- 
Andriy Gapon