Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support
Date: Tue, 29 Mar 2022 17:32:41 UTC
On 3/29/22 04:34, David Chisnall wrote:
> Hi,
>
> Does pledge actually require kernel support?  I'd have thought that it
> could be implemented on top of Capsicum as a purely userland
> abstraction (more easily with libc help, but even with an LD_PRELOADed
> library along the lines of libpreopen).  In Verona, we're able to use
> Capsicum to run unmodified libraries in a sandbox, for example,
> including handling raw system calls:
>
> https://github.com/microsoft/verona/tree/master/experiments/process_sandbox
>
> It would be good to understand why this needs more kernel attack surface.
>
> David

If it can work like that then it's pretty cool. It could be a lot more secure. But it's just not the way I went.

Re-implementing so much kernel functionality in userland seems like a lot of work, because I wanted my module to be able to sandbox (almost) everything that the OS can run, including whole process hierarchies that execute other programs and use process management, shared memory, etc. That's a lot of little details to get right...

So I went the same route as jails, other MAC modules and even Capsicum: access checks in the kernel itself. And most of these checks were already in place as MAC hooks.

pledge()/unveil() are usually used for fairly well-disciplined applications that either don't run other programs or run very specific programs that are also well-disciplined and don't expect too much (unless you just drop the pledges on execve()). Pledged applications usually reduce the kernel attack surface a lot, but you don't run arbitrary programs with pledge (and that wasn't one of its goals AFAIK).

But that's what I wanted my module to be able to do. I'd say it has become a bit of a weird hybrid between a "container" framework and an exploit mitigation framework at this point.
You can run a `make buildworld` with it, build/install/run random programs isolated in your project directories, sandbox shell/desktop sessions as a whole, etc. And then within those sandboxes, nested applications can do their own sandboxing on top of it, either with this module (and its pledge/unveil compat) or with Capsicum (and possibly other compat layers built on top of it). The "inner" programs can use more restrictive sandboxes that don't expose as much kernel functionality. But for the "outer" programs the whole thing slides more towards being "containers"/"jails", and the further it slides that way, the more complex it would have been to do purely in userland, I believe.
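
As a rough illustration of the nesting idea (a toy model only, not how the module or its kernel checks are implemented), a sandbox can be thought of as a set of allowed abilities, where a nested sandbox can only ever narrow what its parent holds. The `Sandbox` class and its methods are hypothetical names made up for this sketch, and the ability strings are borrowed from OpenBSD pledge promise names purely for flavor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sandbox:
    """Toy model: a sandbox is just a set of allowed abilities (hypothetical)."""
    allowed: frozenset

    def restrict(self, requested):
        # A nested sandbox gets the intersection of what it asks for and
        # what the enclosing sandbox already holds; it can never gain more.
        return Sandbox(self.allowed & frozenset(requested))

    def permits(self, ability):
        return ability in self.allowed


# "Outer" sandbox: broad, closer to a container (can run other programs).
outer = Sandbox(frozenset({"stdio", "rpath", "wpath", "proc", "exec"}))

# "Inner" sandbox: a nested program narrows itself further. Asking for
# "inet" has no effect because the outer sandbox never had it.
inner = outer.restrict({"stdio", "rpath", "inet"})
```

The point of the sketch is only the monotonic narrowing: whatever an inner program requests, the result is bounded by the outer sandbox.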