Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support
Date: Thu, 31 Mar 2022 10:24:43 UTC
On 29/03/2022 18:32, Mathieu wrote:
> On 3/29/22 04:34, David Chisnall wrote:
>> Hi,
>>
>> Does pledge actually require kernel support? I'd have thought that it
>> could be implemented on top of Capsicum as a purely userland
>> abstraction (more easily with libc help, but even with an LD_PRELOADed
>> library along the lines of libpreopen). In Verona, we're able to use
>> Capsicum to run unmodified libraries in a sandbox, for example,
>> including handling raw system calls:
>>
>> https://github.com/microsoft/verona/tree/master/experiments/process_sandbox
>>
>>
>> It would be good to understand why this needs more kernel attack surface.
>>
>> David
>
> If it can work like that then it's pretty cool. It could be a lot more
> secure. But it's just not the way I went with. Re-implementing so much
> kernel functionality in userland seems like a lot of work. Because I
> wanted my module to be able to sandbox (almost) everything that the OS
> can run. Including whole process hierarchies that execute other
> programs and use process management and shared memory, etc. That's a
> lot of little details to get right... So I went with the same route
> that jails, other MAC modules and even Capsicum are implemented: with
> access checks in the kernel itself. And most of these checks were
> already in place with MAC hooks.

My concern with adding it to the kernel is that anything that does
path-based checks is *incredibly* hard to get right and it will fail
open. To date, there are zero examples of path-based sandboxing
mechanisms deployed in the wild that have not had vulnerabilities
arising from the nature of the problem.

The filesystem is, inherently, concurrent. A process can mutate the
shape of the filesystem graph while you are doing path-based checks,
mostly around the handling of '..' in paths. Jails and Capsicum
sidestep this in different ways:

Jails effectively punt the problem to the jail orchestration code.
They provide very strong restrictions on the paths, with a single root
and allowing all access within this. There are a few restrictions on
what you can do from outside of a jail to avoid allowing the jailed
process to exploit TOCTOU differences and escape, but fortunately these
align with the use of jails as isolated containers containing a
(minimal) base system.

Capsicum simply disallows '..' in paths. If you want to support it in
user code then you must do path resolution in userspace. You may still
have TOCTOU bugs, but they'll all fail closed: you will try to resolve
the result, discover that you don't have a file descriptor
corresponding to the path, and fail.

> pledge()/unveil() are usually used for fairly well-disciplined
> applications that either don't run other programs or run very specific
> programs that are also well-disciplined and don't expect too much
> (unless you just drop the pledges on execve()).

The execve hole is the reason that I have little interest in pledge as
an enforcement mechanism. If a process can just execve itself to
escape, then that's a trivial hole to exploit unless you're incredibly
careful to make sure that the process does not have the ability to
create or read files with executable privilege on the filesystem.

In contrast, something using Capsicum can create child processes but
they inherit the same limitations. A child can inherit file descriptors
from the parent, so if it is using something like libpreopen then it
can inherit a large number of file descriptors for any of the files /
directories that it should be permitted to open. Since rtld was
extended to allow direct execution mode, you can launch dynamically
linked binaries in Capsicum mode. With the SIGCAP changes in
https://reviews.freebsd.org/D33248, it becomes easy to write a signal
handler that intercepts blocked system calls and handles them (I'm
running with this applied and doing exactly that), so this can be
transparent to any dynamically linked binary.

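As an illustration of the userspace path resolution mentioned above for
Capsicum, here is a minimal sketch in portable POSIX C. The function
name open_beneath_sketch() and its error codes are hypothetical; it is
not part of Capsicum, libpreopen, or curtain, and it does not enter
capability mode itself. It only shows the idea: walk the path one
component at a time relative to a directory descriptor that is already
held, refuse '..' and symlinks, and fail if any step cannot be opened.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `rel` strictly beneath the directory `rootfd`.
 * Assumptions of this sketch: absolute paths, '..' components, symlinks
 * and O_CREAT are all refused with -1 (errno EACCES for the policy
 * refusals; symlinks fail with whatever openat() reports for O_NOFOLLOW).
 */
int open_beneath_sketch(int rootfd, const char *rel, int flags)
{
    assert(rel != NULL);
    if (rel[0] == '\0' || rel[0] == '/' || (flags & O_CREAT)) {
        errno = EACCES;
        return -1;
    }

    int dirfd = dup(rootfd);            /* private copy we may close */
    if (dirfd < 0)
        return -1;

    const char *p = rel;
    for (;;) {
        size_t len = strcspn(p, "/");   /* length of this component */
        const char *rest = p + len;
        while (*rest == '/')            /* skip separator(s) */
            rest++;
        int last = (*rest == '\0');

        char name[256];
        int fd = -1;
        if (len >= sizeof(name)) {
            errno = ENAMETOOLONG;
        } else {
            memcpy(name, p, len);
            name[len] = '\0';
            if (strcmp(name, "..") == 0) {
                errno = EACCES;         /* never walk upwards */
            } else if (strcmp(name, ".") == 0) {
                if (!last) {            /* "./x": stay in the same dir */
                    p = rest;
                    continue;
                }
                errno = EACCES;         /* path names no file */
            } else {
                /* Only ever descend from a descriptor we already hold. */
                fd = openat(dirfd, name, last
                    ? (flags | O_NOFOLLOW)
                    : (O_RDONLY | O_DIRECTORY | O_NOFOLLOW));
            }
        }

        int saved = errno;              /* close() may clobber errno */
        close(dirfd);
        errno = saved;
        if (fd < 0 || last)
            return fd;
        dirfd = fd;                     /* descend one level */
        p = rest;
    }
}
```

With a descriptor for a project directory in hand, a call such as
open_beneath_sketch(root, "src/main.c", O_RDONLY) would succeed if that
file exists, while "src/../../etc/passwd" is refused before any lookup
crosses the root. A concurrent rename can make a step fail, but every
lookup is still made relative to a descriptor the process already holds.
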
> Pledged applications usually reduce the kernel attack surface a lot, but
> you don't run arbitrary programs with pledge (and that wasn't one of its
> goals AFAIK). But that's what I wanted my module to be able to do. I'd
> say it has become a bit of a weird hybrid between a "container"
> framework and an exploit mitigation framework at this point. You can
> run a `make buildworld` with it, build/install/run random programs
> isolated in your project directories, sandbox shell/desktop sessions as
> a whole, etc. And then within those sandboxes, nested applications can
> do their own sandboxing on top of it (with this module (and its
> pledge/unveil compat) or Capsicum (and possibly other compat layers
> built on top of it)). The "inner" programs can use more restrictive
> sandboxes that don't expose as much kernel functionality. But for the
> "outer" programs the whole thing slides more towards being
> "containers"/"jails" (and the more complex it would have been to do
> purely in userland I believe).

So how do you avoid TOCTOU bugs in your path logic? I don't disagree
with the goals; I worry that you're doing something that is
intrinsically almost impossible to get right.

David