Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support
- In reply to: David Chisnall : "Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Thu, 31 Mar 2022 22:55:36 UTC
On 3/31/22 06:24, David Chisnall wrote: > On 29/03/2022 18:32, Mathieu wrote: >> On 3/29/22 04:34, David Chisnall wrote: >>> Hi, >>> >>> Does pledge actually require kernel support? I'd have thought that >>> it could be implemented on top of Capsicum as a purely userland >>> abstraction (more easily with libc help, but even with an >>> LD_PRELOADed library along the lines of libpreopen). In Verona, >>> we're able to use Capsicum to run unmodified libraries in a sandbox, >>> for example, including handling raw system calls: >>> >>> https://github.com/microsoft/verona/tree/master/experiments/process_sandbox >>> >>> >>> It would be good to understand why this needs more kernel attack >>> surface. >>> >>> David >> >> If it can work like that then it's pretty cool. It could be a lot >> more secure. But it's just not the way I went with. Re-implementing >> so much kernel functionality in userland seems like a lot of work. >> Because I wanted my module to be able to sandbox (almost) everything >> that the OS can run. Including whole process hierarchies that >> execute other programs and use process management and shared memory, >> etc. That's a lot of little details to get right... So I went with >> the same route that jails, other MAC modules and even Capsicum are >> implemented: with access checks in the kernel itself. And most of >> these checks were already in place with MAC hooks. > > My concern with adding it to the kernel is that anything that does > path-based checks is *incredibly* hard to get right and it will fail > open. To date, there are zero examples of path-based sandboxing > mechanisms deployed in the wild that have not had vulnerabilities > arising from the nature of the problem. The filesystem is, > inherently, concurrent. A process can mutate the shape of the > filesystem graph while you are doing path-based checks, mostly around > the handling of '..' in paths. Jails and Capsicum sidestep this in > different ways: > > Jails effectively punt the problem to the jail orchestration code. > They provide very strong restrictions on the paths, with a single root > and allowing all access within this. There are a few restrictions on > what you can do from outside of a jail to avoid allowing the jailed > process to exploit TOCTOU differences and escaping but fortunately > these align with the use of jails as isolated containers containing > (minimal) base system. > > Capsicum simply disallows '..' in paths. If you want to support it in > user code then you must do path resolution in userspace. You may still > have TOCTOU bugs, but they'll all fail closed: you will try to resolve > the result, discover that you don't have a file descriptor > corresponding to the path, and fail. Yeah it's by far the hardest part of this module for the reasons you describe. Directories can move while you're holding references to them or even while you're traversing them. I mean, this is hell. My implementation has flaws, but I believe that they have a limited impact, but one of them is a big deal and it means you have to be careful with what you do from outside the sandbox. I'll try to explain below for anyone interested. It's similar to the situation with jails. I still hope I can improve it though. > >> pledge()/unveil() are usually used for fairly well-disciplined >> applications that either don't run other programs or run very >> specific programs that are also well-disciplined and don't expect too >> much (unless you just drop the pledges on execve()). > > The execve hole is the reason that I have little interest in pledge as > an enforcement mechanism. If a process can just execve itself to > escape, then that's a trivial hole to exploit unless you're incredibly > careful to make sure that the process does not have the ability to > create or read files with executable privilege on the filesystem. I see this with some other sandboxing frameworks too, they allow to execute other programs with higher permissions. That's something I wanted to avoid completely. Look how difficult it is to write safe setugid binaries. All programs you'd allow to run would have to be safe too. pledge() can inherit its restrictions though. If you specify restrictions on the second argument, the executed process will never be able to drop those. Unveils will be inherited as well (but it will not be possible to use unveil() again, so this is not fully "nestable" sandboxing as I call it). But pledge() isn't always used like that because a lot of programs just won't work with inherited pledge restrictions if they weren't designed for it. That's what I wanted my module to support better. > > In contrast, something using Capsicum can create child processes but > they inherit the same limitations. It can inherit file descriptors > from the parent, so if it is using something like libpreopen then it > can inherit a large number of file descriptors for any of the files / > directories that it should be permitted to open. With my module, generally the sandbox is inherited. When you use it with curtain(1), that's how it is. The ability to drop the restrictions on-exec was mostly added for pledge() compatibility. And since my module is designed for arbitrary nesting, what actually happens when you request to drop restrictions on-exec with pledge() is that you return to the restrictions that your process inherited at first (which is no restrictions only if you started with no restrictions). > > Since rtld was extended to allow direct execution mode, you can launch > dynamically linked binaries in Capsicum mode. With the SIGCAP things > in https://reviews.freebsd.org/D33248, it becomes easy to write a > signal handler that intercepts blocked system calls and handles them > (I'm running with this applied and doing exactly that), so this can be > transparent to any dynamically linked binary. Yeah it's interesting. I see the "process descriptor" thing too that could make process management capability-based. Getting a general compat layer built on top of Capsicum could be the best of both worlds. > >> Pledged applications usually reduce the kernel attack surface a lot, >> but you don't run arbitrary programs with pledge (and that wasn't one >> of its goals AFAIK). But that's what I wanted my module to be able to >> do. I'd say it has become a bit of a weird hybrid between a >> "container" framework and an exploit mitigation framework at this >> point. You can run a `make buildworld` with it, build/install/run >> random programs isolated in your project directories, sandbox >> shell/desktop sessions as a whole, etc. And then within those >> sandboxes, nested applications can do their own sandboxing on top of >> it (with this module (and its pledge/unveil compat) or Capsicum (and >> possibly other compat layers built on top of it)). The "inner" >> programs can use more restrictive sandboxes that don't expose as much >> kernel functionality. But for the "outer" programs the whole thing >> slides more towards being "containers"/"jails" (and the more complex >> it would have been to do purely in userland I believe). > > So how do you avoid TOCTOU bugs in your path logic? I don't disagree > with the goals, I worry that you're doing something that is > intrinsically almost impossible to get right. > > David > > You're right, it's not done right. Not fully so. The security rests on the module never allowing sandboxed processes to alter the directory hierarchy in a way that could cause exploitable confusion. But if the alterations come from outside, then it can be unsafe. This is similar to jails but (as you said) it's probably less risky there because of the way jails tend to be used. And I could be wrong about all of this... this is going to need auditing. But IIUC, the risk mainly comes from moving directories from inside of the sandbox to outside of it. Say you sandbox firefox and it has write access to ~/.firefox and a separate $TMPDIR, moving any sub-directory from inside of those to somewhere outside is extremely unsafe. But moving directories between ~/.firefox and the $TMPDIR, this is fine. Moving directories from outside to inside, also fine (but they better stay in there). You could expect no one to do that, but say firefox has access to ~/Downloads though, moving directories out of it is a lot more tempting... So that's a big flaw. Moving directories from a writable sandboxed directory to a read-only sandboxed directory is also dangerous for the same reasons. But all of this has to be done from outside the sandbox. I'm not super happy with it. I wish there could be protections for that. But I'm not sure how. I'd have to keep thinking about it. But for now you have to be careful with what you do from outside of the sandbox to directories unveiled inside of it. I think it's tolerable. Even if just barely... In practice I can imagine it causing problems. The actual implementation is not all that complicated in principle and somewhat similar to jails in some ways. Instead of the process having one "jail root directory", it has as many as it has unveils. Each curtain unveil entry holds a vnode reference to its directory. And, when you unveil a path, ALL parent directories are also added to the curtain as unveil entries (with permissions to traverse them). When path lookup starts, if needed, it'll ascend the directory hierarchy to find the unveil entry that "covers" the start directory. After this, checks are strictly done against the curtain, there's no more searching for parent directories. The module just reacts to traversals of directories that it knows about. Whenever you land on a directory during path lookup (including after ".." lookups), the curtain is searched for the vnode. The dangerous part of it is when you descend into a directory with no corresponding curtain unveil entries. Then it switches into "uncharted" mode where it assumes that ALL files can inherit the parent unveil's permissions. But this is only until you hit another vnode that the curtain knows about. Then it'll leave "uncharted" mode and inherited permissions will not carry upwards during ".." lookups. Each unveil (and all of their parent directories) act like a sort of firewall for the hierarchy confusion down below (or at least that's the idea). And to cause confusion within this outer directory "skeleton", you need to have permissions to alter it in the first place! But within "uncharted" directories anything can happen (using the permissions of the unveil that covers it). If you can move a directory into the depths of another directory, you can trick curtain into using the permissions of the moved directory on the content of the target directory (curtain won't know where the directory really is when it handles ".." lookups). But here's the thing: to move that directory there, the unveils must give write access to both the source and target directories in the first place! So what was gained? The hierarchy confusion isn't so bad. That's a lie though. You can get permissions to modify file attributes on a directory for which you only had write permissions this way (those can be separate unveil permissions). That's something I'll try to fix. It's just a problem for writable directories that can be moved though, separate file attribute change permissions are still useful for /dev files for example (sandboxed root processes are correctly prevented from changing permissions on /dev/null even if they can write to it). All of this is kind of hard to reason about... So I hope I got this right (and there could be bugs too obviously).