new feature: private IPC for every jail

Mon Apr 3 17:16:12 UTC 2006

On Mon, 3 Apr 2006, Marc G. Fournier wrote:

> On Mon, 3 Apr 2006, Robert Watson wrote:
>
>> (1) The fact that system v ipc primitives are loadable, and unloadable, 
>> which requires some careful handling relating to registration order, etc.
>
> For this one, I'm lost at the issue ... if not loaded, jail processes just 
> couldn't attach ... if loaded, and you try to unload, while there are shared 
> memory segments in play, don't unload ... or is there something i'm missing 
> here? What happens now, if I load ipc, start up postgresql and then try to 
> unload ipc?  I hardcode all the stuff I use in my kernel, so don't use the 
> load/unload mechanism, so can't test this easily ...

The problem is the relationship between jails and loadable System V IPC, and 
has to do with how you might implement the relationship between the two 
subsystems.  There are two general ways to approach adding virtualization to 
the System V IPC name spaces:

(1) Add a general virtualization facility, which causes the current process
     and its children to see a new name space.

(2) Key virtualization to the identity of the jail.

When dealing with the file system, jail relies on the existing chroot() 
subsetting facility to introduce virtualization.  This is a nice piece of 
behavior, as it means file system subsetting is a facility available to be 
used regardless of the use of jail, and avoids hard-coding jail 
instrumentation throughout the file system code.

So the question is this: if you load System V IPC support after you start a 
jail, how do we handle jails that have already started?  Do we go out and 
create new name spaces for jails already started (a problem for method (1), 
because it implies System V IPC will have pretty intimate knowledge of jails, 
and know how to walk lists, etc), or do we deny access to System V IPC for 
jails not present when it was loaded?  Likewise, although we tend to refer to 
the different IPC mechanisms as if they were a single category, System V IPC, 
there are actually three name spaces (shared memory, semaphores, and message 
queues), and the functionality for each can be loaded separately.

It's not that these questions can't be answered, but they do have to be 
answered.  My leaning, btw, in implementing this would be to:

- For each System V IPC mechanism, implement a mechanism to create a new name
   space, to be used by the current process and any children (until they
   replace it with a new one, similar to chroot).

- In jail(), similar to the way in which it uses chroot() to subset the file
   system name space, cause the creation of new name spaces, if the IPC
   services are present.

- We'll need a way to flag jails as not permitting any System V IPC of a
   particular type, to be used when the IPC service isn't loaded at the time
   jail() is to create a new jail, and the System V IPC services will need to
   check those flags (or whatever).

- We'll need a way to name the new name spaces (unlike the file system, we
   can't rely on an existing facility), and we'll need to enhance the System V
   IPC monitoring and management tools.  For example, ipcs, ipcrm, etc, will
   need to know about this, the kernel interfaces for management will need to
   know how to deal with name spaces, and they will have to make sure to use
   the right checks and decide how to represent the fact that processes in
   jails should not be able to see name spaces other than their name space.
   Maybe this is a flag to name space creation, but something is needed here.
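
To make the shape of that proposal concrete, here is a minimal user-space 
model in Python.  It is only a sketch of the idea, not kernel code: every name 
in it (IpcNamespace, Proc, jail, MECHANISMS) is made up for illustration, and 
a "denied" flag is modelled simply as a missing name space.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Hypothetical names throughout; this models the proposal, it is not a
# FreeBSD interface.
MECHANISMS = ("shm", "sem", "msg")  # the three System V IPC name spaces


@dataclass
class IpcNamespace:
    name: str  # a label, so that management tools can refer to the name space
    objects: Dict[int, str] = field(default_factory=dict)  # key -> owner


@dataclass
class Proc:
    # One name space per mechanism; None means IPC of that type is denied.
    ns: Dict[str, Optional[IpcNamespace]]

    def fork(self) -> "Proc":
        # Children share the parent's name spaces until one is replaced,
        # in the same way a child inherits its parent's chroot.
        return Proc(ns=dict(self.ns))


def jail(parent: Proc, jail_name: str, loaded: Set[str]) -> Proc:
    """Model of jail(): new name spaces for loaded services, deny the rest."""
    child = parent.fork()
    for mech in MECHANISMS:
        if mech in loaded:
            child.ns[mech] = IpcNamespace(f"{jail_name}/{mech}")
        else:
            # Service not loaded when the jail was created: flag as denied.
            child.ns[mech] = None
    return child


host = Proc(ns={m: IpcNamespace(f"host/{m}") for m in MECHANISMS})
www = jail(host, "www", loaded={"shm", "sem"})
```

In this model, a jail created while the message queue service is not loaded 
stays locked out of message queues even if the service is loaded later, which 
is one possible answer to the "jails already started" question above.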

Note there are some other tricky dependency problems, such as the fact that 
jails have to interact with code that may or may not be loaded, how to have 
jail vs IPC notions of privilege interact, etc.

>> (2) The name space model for system v ipc is flat, so while it's desirable 
>> to allow the administrator in the host environment to monitor and control 
>> resource use in the jail (for example, delete allocated but unused 
>> segments), doing that requires developing an administrative model for it.
>
> Again, you've lost me here ... how is that different than not using a jail? 
> from the root server, one does an 'ipcs -a' and ipcrm as required ... the 
> only thing I could think of 'being a nice thing' here is to maybe add a 
> 'jail' value, similar to what is in proc, so that you know what segments 
> belong to a specific jail ...
>
> I'm free to admit that I may be missing something you are seeing as obvious, 
> mind you ;)
>
> For instance, are you suggesting that 'root' in the jail himself could issue 
> ipcs -a and ipcrm?

I'm referring to how ipcs, ipcrm, etc, in the host environment interact with 
the IPC resources in the jail environments.  In particular, I'm making the 
assumption that it is useful and desirable for the administrator running in 
the host to be able to directly monitor allocation in the jails, and manage 
that allocation, without running the management commands in the jail.

I'm not sure if you've ever programmed to the System V IPC API, but if you 
have done so, you'll know that the name space for IPC objects is "odd".  It's 
non-hierarchical, and hence highly subject to collisions between applications. 
This means that we can't use neat tricks, such as chroot() in the file system, 
to implement virtualization.  If you compare the behavior of MySQL in a Jail 
with PostgreSQL, you'll see how this plays out immediately: MySQL uses UNIX 
domain sockets by default, and this means it "just works" with Jail, as the 
UNIX domain socket name space is, in fact, the file system name space.  If 
MySQL uses /tmp/mysql.sock in a jail, it's virtualized by virtue of the fact 
that /jail/www.whatever.com/tmp doesn't, by definition, collide with 
/jail/www.notanother.com/tmp.
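
The difference can be shown with a small Python model.  This is a sketch 
only: host_view and sysv_get are hypothetical helpers, and the dictionary 
stands in for a single, flat, system-wide key table.

```python
import posixpath


def host_view(jail_root: str, path_in_jail: str) -> str:
    """Host-side path of a file that a jailed process names by path_in_jail."""
    return posixpath.join(jail_root, path_in_jail.lstrip("/"))


# The same in-jail path maps to two distinct objects on the host, because
# each jail's root is prepended to the name.
sock_a = host_view("/jail/www.whatever.com", "/tmp/mysql.sock")
sock_b = host_view("/jail/www.notanother.com", "/tmp/mysql.sock")

# A System V IPC key is just an integer; there is no root to prepend.
flat_table = {}


def sysv_get(key: int, owner: str) -> str:
    # Models "create if absent, otherwise attach to the existing object".
    return flat_table.setdefault(key, owner)


first = sysv_get(54321, "jail A")   # creates the object
second = sysv_get(54321, "jail B")  # finds jail A's object: a collision
```

Here sock_a and sock_b differ, while first and second both name the object 
created by jail A.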

Because the System V IPC name space is non-hierarchical, we have to deal with 
the fact that names can and do collide. If each jail has its own name space, 
for example, and each contains a PostgreSQL session with an ID of 54321 (made 
up), then a process in the host environment can't simply issue the normal 
System V IPC system calls in order to delete them, because those calls have no 
way to express "which name space" the operation is in.  In the jail, this is 
OK, because applications will get whatever the jail-local name space is.  But 
outside the Jail, these commands would see the name space for the host, but 
none of the contents of the Jail's name spaces.  In essence, this means that we 
need to add new interfaces to allow ipcs, ipcrm, etc, to run outside the jails 
yet see and operate on objects in the jails.
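
As a rough illustration of what such interfaces would have to express, here 
is a toy Python model.  The class and its methods (NameSpaces, remove, 
remove_in, list_all) are hypothetical; in particular, remove_in and list_all 
stand for the kind of new, name-space-aware calls that do not exist today, 
and the rule "only the host may name another name space" is an assumption.

```python
class NameSpaces:
    """Toy model: one key table per name space, "host" plus one per jail."""

    def __init__(self):
        self.tables = {"host": {}}

    def add_jail(self, ns):
        self.tables[ns] = {}

    def create(self, caller_ns, key, owner):
        self.tables[caller_ns][key] = owner

    def remove(self, caller_ns, key):
        # Classic-style call: only a key, resolved in the caller's own
        # name space.  It cannot say "which name space".
        del self.tables[caller_ns][key]

    def list_all(self, caller_ns):
        # Hypothetical extended listing: the host sees every name space,
        # a jail sees only its own.
        if caller_ns == "host":
            visible = self.tables
        else:
            visible = {caller_ns: self.tables[caller_ns]}
        return sorted((ns, key) for ns, t in visible.items() for key in t)

    def remove_in(self, caller_ns, target_ns, key):
        # Hypothetical extended call: the name space is an explicit argument.
        if caller_ns != "host" and caller_ns != target_ns:
            raise PermissionError("jails may only manage their own name space")
        del self.tables[target_ns][key]


spaces = NameSpaces()
for jail_name in ("jail-a", "jail-b"):
    spaces.add_jail(jail_name)
    spaces.create(jail_name, 54321, "postgres")  # same key in both jails
```

With this setup, a plain remove("host", 54321) fails because the host's own 
table has no such key, whereas remove_in("host", "jail-a", 54321) deletes the 
object in one jail and leaves the other jail's object alone.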

Again, this can be done, but the details are non-trivial, since they raise 
hard questions about generalization, interactions between dynamically loaded 
components, access control, name spaces etc.

This is why no one has done it yet.  Several people, including myself, have 
sat down and done the first 30% hack -- enough to get things working a bit, 
and to bump into all the tricky parts (see above).

Robert N M Watson