new feature: private IPC for every jail
Robert Watson
rwatson at FreeBSD.org
Mon Apr 3 17:16:12 UTC 2006
On Mon, 3 Apr 2006, Marc G. Fournier wrote:
> On Mon, 3 Apr 2006, Robert Watson wrote:
>
>> (1) The fact that system v ipc primitives are loadable, and unloadable,
>> which requires some careful handling relating to registration order, etc.
>
> For this one, I'm lost at the issue ... if not loaded, jail processes just
> couldn't attach ... if loaded, and you try to unload, while there are shared
> memory segments in play, don't unload ... or is there something I'm missing
> here? What happens now, if I load ipc, start up postgresql and then try to
> unload ipc? I hardcode all the stuff I use in my kernel, so don't use the
> load/unload mechanism, so can't test this easily ...
The problem is the relationship between jails and loadable System V IPC, and
has to do with how you might implement the relationship between the two
subsystems. There are two general ways to approach adding virtualization to
the System V IPC name spaces:
(1) Add a general virtualization facility, which causes the current process
and its children to see a new name space.
(2) Key virtualization to the identity of the jail.
When dealing with the file system, jail relies on the existing chroot()
subsetting facility to introduce virtualization. This is a nice piece of
behavior, as it means file system subsetting is a facility available to be
used regardless of the use of jail, and avoids hard-coding jail
instrumentation throughout the file system code.
So the question is this: if you load System V IPC support after you start a
jail, how do we handle jails that have already started? Do we go out and
create new name spaces for jails already started (a problem for method (1),
because it implies System V IPC will have pretty intimate knowledge of jails,
and know how to walk lists, etc), or do we deny access to System V IPC for
jails not present when it was loaded? Likewise, although we tend to refer to
the different IPC mechanisms as a single category, System V IPC, there are
actually three name spaces (shared memory, semaphores, and message queues),
and the functionality for each can be loaded separately.
It's not that these questions can't be answered, but they do have to be
answered. My leaning, btw, in implementing this would be to:
- For each System V IPC mechanism, implement a mechanism to create a new name
space, to be used by the current process and any children (until they
replace it with a new one, similar to chroot).
- In jail(), similar to the way in which it uses chroot() to subset the file
system name space, cause the creation of new name spaces, if the IPC
services are present.
- We'll need a way to flag jails as not permitting any System V IPC of a
particular type, to be used when the IPC service isn't loaded at the time
jail() is to create a new jail, and the System V IPC services will need to
check those flags (or whatever).
- We'll need a way to name the new name spaces (unlike the file system, we
can't rely on an existing facility), and we'll need to enhance the System V
IPC monitoring and management tools. For example, ipcs, ipcrm, etc, will
need to know about this, the kernel interfaces for management will need to
know how to deal with name spaces, and they will have to make sure to use
the right checks and decide how to represent the fact that processes in
jails should not be able to see name spaces other than their name space.
Maybe this is a flag to name space creation, but something is needed here.
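As a rough illustration of the four points above, here is a minimal userspace
model in C. It is a sketch under stated assumptions, not kernel code: every
name in it (ipc_ns, proc_model, ipc_ns_new, jail_enter, ipc_ns_lookup, and so
on) is hypothetical and does not exist in FreeBSD. It only shows the shape of
the idea: a per-mechanism name space that children inherit, a jail creation
step that makes new name spaces where the service is loaded, and a "denied"
flag for services that were absent when the jail was created.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names are hypothetical; this models the idea, not FreeBSD internals. */
enum { IPC_T_SHM, IPC_T_SEM, IPC_T_MSG, IPC_T_COUNT };

struct ipc_ns {
    int id;                            /* a name for the name space, for tools */
};

struct proc_model {
    struct ipc_ns *ns[IPC_T_COUNT];    /* name space seen for each mechanism */
    int denied[IPC_T_COUNT];           /* service was absent at jail creation */
};

static int service_loaded[IPC_T_COUNT];     /* which IPC services are loaded */
static struct ipc_ns host_ns[IPC_T_COUNT];  /* id 0: the host name spaces */
static int next_ns_id = 1;

/* A host process starts out in the host name spaces. */
static void proc_init_host(struct proc_model *p)
{
    for (int t = 0; t < IPC_T_COUNT; t++) {
        p->ns[t] = &host_ns[t];
        p->denied[t] = 0;
    }
}

/* Give the calling process a fresh name space for one mechanism. */
static int ipc_ns_new(struct proc_model *p, int t)
{
    struct ipc_ns *ns;

    assert(t >= 0 && t < IPC_T_COUNT);
    if (!service_loaded[t])
        return -1;                     /* nothing to virtualize yet */
    ns = malloc(sizeof(*ns));
    if (ns == NULL)
        return -1;
    ns->id = next_ns_id++;
    p->ns[t] = ns;                     /* reference counting omitted here */
    return 0;
}

/* Children inherit whatever the parent currently sees, as with chroot. */
static void proc_fork(const struct proc_model *parent, struct proc_model *child)
{
    *child = *parent;
}

/* Jail creation: new name spaces where possible, a denied flag otherwise. */
static void jail_enter(struct proc_model *p)
{
    for (int t = 0; t < IPC_T_COUNT; t++) {
        if (ipc_ns_new(p, t) != 0) {
            p->ns[t] = NULL;
            p->denied[t] = 1;
        }
    }
}

/* The check an IPC service would make before acting for a process. */
static struct ipc_ns *ipc_ns_lookup(const struct proc_model *p, int t)
{
    if (p->denied[t] || !service_loaded[t])
        return NULL;
    return p->ns[t];
}
```

In this model, a jail created while the message queue service is not loaded
stays denied for message queues even if that service is loaded later, which is
one possible answer to the load-order question, not the only one.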
Note there are some other tricky dependency problems, such as the fact that
jails have to interact with code that may or may not be loaded, how to have
jail vs IPC notions of privilege interact, etc.
>> (2) The name space model for system v ipc is flat, so while it's desirable
>> to allow the administrator in the host environment to monitor and control
>> resource use in the jail (for example, delete allocated but unused
>> segments), doing that requires developing an administrative model for it.
>
> Again, you've lost me here ... how is that different than not using a jail?
> from the root server, one does an 'ipcs -a' and ipcrm as required ... the
> only thing I could think of 'being a nice thing' here is to maybe add a
> 'jail' value, similar to what is in proc, so that you know what segments
> belong to a specific jail ...
>
> I'm free to admit that I may be missing something you are seeing as obvious,
> mind you ;)
>
> For instance, are you suggesting that 'root' in the jail himself could issue
> ipcs -a and ipcrm?
I'm referring to how ipcs, ipcrm, etc, in the host environment interact with
the IPC resources in the jail environments. In particular, I'm making the
assumption that it is useful and desirable for the administrator running in
the host to be able to directly monitor allocation in the jails, and manage
that allocation, without running the management commands in the jail.
I'm not sure if you've ever programmed to the System V IPC API, but if you
have done so, you'll know that the name space for IPC objects is "odd". It's
non-hierarchical, and hence highly subject to collisions between applications.
This means that we can't use neat tricks, such as chroot() in the file system,
to implement virtualization. If you compare the behavior of MySQL in a Jail
with PostgreSQL, you'll see how this plays out immediately: MySQL uses UNIX
domain sockets by default, and this means it "just works" with Jail, as the
UNIX domain socket name space is, in fact, the file system name space. If
MySQL uses /tmp/mysql.sock in a jail, it's virtualized by virtue of the fact
that /jail/www.whatever.com/tmp doesn't, by definition, collide with
/jail/www.notanother.com/tmp.
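To make the contrast with the path-based socket case concrete, the following
small C sketch uses the standard shmget() and shmctl() calls to show that a
System V shared memory key is a bare number in a flat name space: any two
callers that pick the same key get the same segment. The helper name
same_segment is an assumption for illustration, and the key passed to it
(for example the made-up 54321) is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Returns 1 if two independent shmget() calls with the same key yield the
 * same segment, 0 if they do not, and -1 on error (including the case where
 * the key is already in use, so that we never remove someone else's segment).
 */
static int same_segment(key_t key)
{
    int first, second, result;

    /* IPC_PRIVATE always creates a new segment, which would defeat the demo. */
    assert(key != IPC_PRIVATE);

    /* "Application A" creates a segment under the key. */
    first = shmget(key, 4096, IPC_CREAT | IPC_EXCL | 0600);
    if (first == -1)
        return -1;

    /* "Application B" knows nothing about A, but uses the same key. */
    second = shmget(key, 4096, 0600);
    result = (second == -1) ? -1 : (first == second);

    /* Remove the segment we created. */
    shmctl(first, IPC_RMID, NULL);
    return result;
}
```

Nothing in the key says which application, directory, or jail it belongs to,
which is why there is no equivalent of the chroot() trick to lean on.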
Because the System V IPC name space is non-hierarchical, we have to deal with
the fact that names can and do collide. If each jail has its own name space,
for example, and each contains a PostgreSQL session with an ID of 54321 (made
up), then a process in the host environment can't simply issue the normal
System V IPC system calls in order to delete them, because those calls have no
way to express "which name space" the operation is in. In the jail, this is
OK, because applications will get whatever the jail-local name space is. But
outside the jail, these commands would see the name space for the host, but
none of the contents of the jails' name spaces. In essence, this means that we
need to add new interfaces to allow ipcs, ipcrm, etc, to run outside the jails
yet see and operate on objects in the jails.
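The ambiguity described above can be sketched with a small userspace model in
C. This is only an illustration under assumptions: ns_model, ns_add,
ns_remove, count_matches, and host_remove are hypothetical names, and the
"new interface" is reduced to one extra argument that names the name space.
The key 54321 is the made-up value from the example.

```c
#include <assert.h>
#include <stddef.h>

#define NS_MAX_KEYS 8

/* Hypothetical model of one IPC name space: an id plus the keys in use. */
struct ns_model {
    int    ns_id;
    long   keys[NS_MAX_KEYS];
    size_t nkeys;
};

/* Jail-local view: the caller only ever names a key. */
static int ns_add(struct ns_model *ns, long key)
{
    assert(ns != NULL);
    if (ns->nkeys == NS_MAX_KEYS)
        return -1;
    ns->keys[ns->nkeys++] = key;
    return 0;
}

static int ns_remove(struct ns_model *ns, long key)
{
    assert(ns != NULL);
    for (size_t i = 0; i < ns->nkeys; i++) {
        if (ns->keys[i] == key) {
            ns->nkeys--;
            ns->keys[i] = ns->keys[ns->nkeys];  /* fill the hole */
            return 0;
        }
    }
    return -1;
}

/* How many name spaces hold this key? More than one means "key" alone
 * cannot identify an object from the host's point of view. */
static size_t count_matches(const struct ns_model *all, size_t n, long key)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < all[i].nkeys; j++)
            if (all[i].keys[j] == key)
                hits++;
    return hits;
}

/* Host-side removal: the extra ns_id argument is the part the existing
 * System V IPC calls have no way to express. */
static int host_remove(struct ns_model *all, size_t n, int ns_id, long key)
{
    for (size_t i = 0; i < n; i++)
        if (all[i].ns_id == ns_id)
            return ns_remove(&all[i], key);
    return -1;  /* no such name space */
}
```

With two jails that each hold key 54321, count_matches() reports two hits,
and only a call that names the name space as well as the key can remove one
object without touching the other.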
Again, this can be done, but the details are non-trivial, since they raise
hard questions about generalization, interactions between dynamically loaded
components, access control, name spaces etc.
This is why no one has done it yet. Several people, including myself, have
sat down and done the first 30% hack -- enough to get things working a bit,
and to bump into all the tricky parts (see above).
Robert N M Watson