RFC: root mount enhancement (round 2)

Wed Aug 25 21:51:59 UTC 2010

On Aug 25, 2010, at 2:02 PM, M. Warner Losh wrote:
> : Let me mention a problem with the currently implemented root mount
> : logic as a reminder that something needs to be fixed, even if we
> : don't want to enhance: A USB disk cannot always be used as a root
> : file system by virtue of the USB stack releasing the root mount
> : lock after creating the umass device, but before CAM has created
> : the corresponding da device. The kernel will try mounting from
> : /dev/da0 before the device exists, fails and then drops into the
> : root mount prompt. Often the story ends here -- with failure.
> 
> Actually, the problem isn't the locking at all.  The problem is that
> the umass SIMs arrive 'late' in the game.  by the time they arrive,
> CAM has already released the root lock.  But as phk points out, this
> is a bug in the usb/cam interaction and should be fixed there and
> completely irrelevant for your root mounting system.

I perceive the problem differently, because I see no value in waiting
for *all* devices to appear when the root device is already there.
That just slows down the boot. 

I prefer mounting the root file system as soon as the device appears
and enhancing the fstab mounting to deal with the device not being
there yet.

Consequently: the bug is with root_mount_hold() and root_mount_rel()
as a means to do the right thing...
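
As a rough user-space illustration of "mount as soon as the device
appears" (not kernel code, and not the proposed implementation), the
idea amounts to waiting for the one device that matters, bounded by a
timeout. The function name, the timeout and the polling interval are
assumptions made for this sketch only:

```python
import os
import time


def wait_for_root_device(path, timeout=10.0, interval=0.1, exists=os.path.exists):
    """Return True as soon as `path` exists, False if `timeout` seconds pass first.

    Only the named device is waited for; all other devices are ignored.
    `exists` is injectable so the logic can be exercised without real devices.
    """
    deadline = time.monotonic() + timeout
    while True:
        if exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Under this model the boot continues the moment /dev/da0 shows up,
regardless of which other devices are still being probed.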

> : Here md# refers to the md unit created by the last .md directive.
> : Since the logic is for mounting the root file system only, a .md
> : directive implicitly detaches and releases the previously created
> : md device before creating a new one. In other words: the
> : enhancement is not for creating a bunch of md devices.
> :
> : Should this be relaxed so that any number of md devices can be
> : created before we try a root mount?
> 
> I guess I'm having trouble understanding why you'd need this given
> that ram disk information is already passed from the boot loader
> (/boot/loader or in the board's init code (although the latter I don't
> think is done by any in-tree code)) to the kernel...

You're fixating on the preloaded or compiled-in ramdisk. The
.md directive is there for vnode-backed images -- the root
file system image is stored on a file system and memory is
only used for buffering and caching.

> read-write compressed works?  Also, is compression a property of the
> md device, or the GEOM that tastes it to see that it is compressed...
> What does cluster do anyway?  I see that as an option for mdconfig,
> but there's no explanation of it there or in the md man page.

The options are as useful as the md implementation is. The options
are listed because they appeared in mdconfig. Semantics is not to
be argued when syntax is discussed :-)

> How do you differentiate between these two roots:
> 
> 	mdconfig -a -t file -f /gerbil.ram
> and
> 	mdconfig -a -t swap -s 4m
> 	dd if=/gerbil.rom of=/dev/md0 bs=1m

The first is supported, the second isn't. The .md directive only
supports vnode-backed md devices. There's no point trying to mount
a malloc- or swap-backed md device because they instantiate empty
and are useless for root file systems, unless you construct them
first (using dd is a way to construct them). Supporting the
construction of a root file system is where things get complicated
and where I personally don't want to go.

>  But in that case, you're better off going through
> /boot/loader for this stuff, which leads me to my next question: Would
> any md device passed by the boot loader (or compiled into the kernel)
> would effectively be the second one and you'd not need any .md
> directives at all?

You can start off with a preloaded or compiled-in ramdisk, and then
recursively mount root, including from vnode-backed md devices, so
the .md directive is not rendered useless by preloading or compiling
in. You can even end the root mount recursion with the preloaded
ramdisk last -- this gives you premounted file systems under /.mount
without having to run /etc/rc (if you want to)...

> : 
> : To re-iterate: the logic is recursive. After mounting some file system
> : as root, the kernel will follow the directives in /.mount.conf (if the
> : file exists) for remounting the root file system. At each iteration the
> : kernel will remount devfs under /dev and remount the current root file
> : system under /.mount within the new root file system.
> : 
> : Thoughts?
> 
> How is init handled at each stage?  forked after the last one, I assume?

No, init is only spawned after the root mount recursion ends. The .init
directive is there to override defaults. This is envisioned to be useful
for rescue images where you want to spawn /rescue/init or installation
images where you may want to spawn sysinstall. It eliminates having to
hardcode the possibilities in the kernel.

In a sense it gives you more freedom in how you want to call your initial
process without the pitfalls when the root mount recursion ends early due
to a problem.
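
The recursion and the init override can be modelled in a few lines.
This is a simplified sketch, not the kernel logic: the dictionary
layout, the "next" and "init" keys and the function name are
hypothetical stand-ins for whatever /.mount.conf would express, and it
assumes that the init override of the file system where the recursion
ends is the one that applies:

```python
def resolve_root(filesystems, first, default_init="/sbin/init"):
    """Follow the chain of root mounts starting at `first`.

    `filesystems` maps a name to a dict with optional keys:
      "next": the file system to remount as root (hypothetical)
      "init": an init path overriding the default (hypothetical)
    Returns (final_root, init_path, previous_roots).
    """
    root = first
    previous = []  # earlier roots, i.e. what would end up under /.mount
    while True:
        conf = filesystems[root]
        nxt = conf.get("next")
        # The recursion ends when no further root is named, when the named
        # root cannot be mounted, or when following it would loop.
        if nxt is None or nxt not in filesystems or nxt == root or nxt in previous:
            return root, conf.get("init", default_init), previous
        previous.append(root)
        root = nxt
```

With {"da0": {"next": "image.iso", "init": "/sbin/recovery"},
"image.iso": {}} the result is ("image.iso", "/sbin/init", ["da0"]).
If "image.iso" is missing from the table, the recursion ends early on
"da0" and the result is ("da0", "/sbin/recovery", []).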

As a concrete example, consider having a single file system on a writable
medium (say /dev/da0), with software images stored on it as ISO images.
You can install some recovery procedure on /dev/da0 that gets run when
none of the ISO images can be mounted. The ISO images have /sbin/init
as init as usual, but you can select to run /sbin/recovery from /dev/da0.
This allows for a single init executable that performs the right functions
based on the program name for example...

-- 
Marcel Moolenaar
xcllnt at mac.com