Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support
Date: Tue, 29 Mar 2022 08:34:08 UTC
Hi,

Does pledge actually require kernel support? I'd have thought that it could be implemented on top of Capsicum as a purely userland abstraction (more easily with libc help, but even with an LD_PRELOADed library along the lines of libpreopen).

In Verona, we're able to use Capsicum to run unmodified libraries in a sandbox, for example, including handling raw system calls:

https://github.com/microsoft/verona/tree/master/experiments/process_sandbox

It would be good to understand why this needs more kernel attack surface.

David

On 28/03/2022 10:37, Mathieu wrote:
> Hello list. For a while now I've been working on and off on a pledge()/unveil() implementation for FreeBSD. I also wanted it to be able to sandbox arbitrary programs that might not expect it with no (or very minor) modifications. So I just kept adding to it until it could do that well enough. I'm still working on it, and there are some known issues and some things I'm not sure are done correctly, but overall it's in a very functional state now. It can run unmodified most utilities and desktop apps (though dbus/dconf/etc are trouble), server daemons, buildworld and whole shell/desktop sessions sandboxed.
>
> https://github.com/Math2/freebsd-pledge
> https://github.com/Math2/freebsd-pledge/blob/main/CURTAIN-README.md
>
> It can be broken up into 4 parts:
> 1) A MAC module that implements most of the functionality.
> 2) The userland library, sandboxing utility, configs and tests.
> 3) Various kernel changes needed to support it (including new MAC handlers and extended syscall filtering).
> 4) Small changes/fixes to the base userland (things like adding reporting to ps and modifying some utilities to use $TMPDIR so that they can be properly sandboxed).
> So 1) and 2) could be in a port. And I tried to minimize 3) and 4) as much as possible.
>
> I noted some problems/limitations in the CURTAIN-ISSUES file.
> At this point I'm mostly wondering about the general design being acceptable for merging eventually. Because most of this could be part of a port, but not all of it. And the way that it deals with filesystem access restrictions in particular is kludgy. So any feedback/testing welcome.
>
> It still lacks documentation (in part because I'm not sure of what could still change) so I'm going to give an overview of it here and show some examples and that's going to be the documentation for now. And I'll describe the kernel changes that it needed. So that's going to be a bit of a long email.
>
> What it can do:
> ~~~~~~~~~~~~~~~
>
> It can restrict syscalls and various abilities (by categories that were based on OpenBSD's pledge promises), ioctls, sysctls, socket options/address families, priv(9) privileges, and filesystem access by path. It can be used at the same time as jails and Capsicum (their restrictions are also enforced on top of it).
>
> It can be used in a nested manner. A program that inherits sandbox restrictions can do its own internal sandboxing or sandbox programs that it runs (which can then do the same, etc). The permissions of new sandboxes are always a subset of the inherited sandbox.
>
> Certain kernel operations are protected by "barriers" which only allow a sandboxed process to operate on kernel objects that were created by itself or a descendant sandbox. There are barriers for inspecting/signaling/debugging processes, POSIX/SysV IPC objects, PTYs, etc. Barriers have their own hierarchy which can diverge from the process hierarchy.
>
> Restrictions can be specified in configuration files and can be associated with named "tags". Tags are assumed to match application names; they're prefixed with "_" when they don't (just the convention I've been using so far). Enabling a tag may cause other tags to be enabled depending on configurations.
> Permissions associated with different tags are merged in a purely additive manner. Configurations can be spread across multiple files and directories (/usr/local/etc/curtain.{conf,d} can be used for packages, ~/.curtain.{conf,d} for user customizations). It'll check the .d directories for files named after the enabled tags.
>
> Usage examples:
> ~~~~~~~~~~~~~~~
>
> curtain(1) is the wrapper utility to sandbox arbitrary programs. Default permissions are in /etc/defaults/curtain.conf and /etc/curtain.conf.
>
> Here are a bunch of examples. A bit random, but they demonstrate a lot of the functionality.
>
> $ curtain id
>
> Not very exciting, but it works. The default permissions don't give it access to the user DB so it only shows numeric IDs. It can be given access with the "_pwddb" tag:
>
> $ curtain -t _pwddb id
>
> It's possible to nest sandboxes, but it needs the "curtain" tag because the curtain config files are not unveiled by default (they could be though, maybe they should be...).
>
> Here, id cannot read the user DB because the outer sandbox doesn't allow it:
>
> $ curtain -t curtain curtain -t _pwddb id
>
> But this way it can:
>
> $ curtain -t curtain -t _pwddb curtain -t _pwddb id
>
> Starts a sandboxed shell session with access to ~/work in a clean environment:
>
> $ mkdir -p ~/work && curtain -p ~/work:rwx -S
>
> You'll probably miss your dotfiles though. If you browse around you'll see what paths get unveiled by default.
>
> If you try to list processes:
>
> $ curtain ps -ax
>
> You'll just see the ps process itself. It can be allowed to see processes outside of it like that:
>
> $ curtain -d ability-pass:ps ps -ax
>
> But it will not be allowed to signal, reprioritize or debug them (there are other "abilities" for that). The "-pass" means to allow the ability in a "passthrough" manner (beyond the sandbox's barrier).
> Visibility could also be blocked at an outer sandbox's barrier, like so:
>
> $ curtain -t curtain curtain -d ability-pass:ps ps -ax
>
> Give read-only access to the current directory and list files:
>
> $ curtain -p . ls
>
> If you have $CLICOLOR set, it may look less colorful than usual. curtain(1) is a bit paranoid and will filter out most control characters written to the TTY by default (and set $TERM to "dumb"). They can be let through with -R:
>
> $ curtain -R -p . ls
>
> And -T can be used to stop it from doing PTY wrapping altogether and give the program direct access to the TTY (which is less secure, but there are ioctl restrictions).
>
> Per-path permissions can be specified after a ":". More specific paths override the permissions of less specific paths.
>
> $ curtain -p .:rw -p ./secret: -p ./dev:rwx -p ./data:r ...
>
> Then those paths would have those permissions:
> ./:rw
> ./123:rw
> ./secret:
> ./dev:rwx
> ./dev/123:rwx
> ./data:r
> ./data/123:r
>
> As an example of how nested sandboxing is handled, if you were then to do this within this sandbox (don't forget to give it the "curtain" tag):
>
> $ curtain -p .:r -p ./dev:rx -p ./data:rw ...
>
> Then the permissions would end up being:
> ./:r
> ./123:r
> ./secret:
> ./dev:rx
> ./dev/123:rx
> ./data:r
> ./data/123:r
>
> root processes can be sandboxed too. Some privileges are allowed by default (which is similar to the set of privileges allowed by jails), but most are denied. As are accesses to most /dev and /etc files. For example, tcpdump will not be able to use bpf(4):
>
> # curtain tcpdump
>
> But there's a tag for that:
>
> # curtain -t _bpf tcpdump
>
> Something else that won't work:
>
> $ curtain node -e 'console.log(2+2)'
>
> It wants to do a PROT_EXEC mprotect(2) which is not allowed by default. By default, PROT_EXEC is only allowed when mmap(2)'ing files that are unveiled for execution.
>
> $ curtain -d ability:prot_exec node -e 'console.log(2+2)'
>
> Just what is allowed by default? Well, it's kind of arbitrary and messy and there are 10 levels of it.
>
> curtain(1) uses a 10-level "permissions tower" usable with options -0 to -9 (which enable tags "_level0" to "_level9"). These are mostly just meant to be used as a quick way to try giving programs more or less access from the command-line (ideally a profile should be made to give programs just what they need). The default level currently is 5 (which is fairly permissive compared to most pledge(3)'d applications). All levels are intended to be securely containable, but each level exposes a greater attack surface than the previous one. Level 9 is the "please just work" level. It allows using all ioctls, reading all sysctls and making almost all rare syscalls. Filesystem access is still very restricted though so you've still got to figure out what unveils the program needs.
>
> And there's another dimension to it which is the "unsafety level". Directives in the config files can be suffixed with one or more "!" to indicate that the permissions that they give are potentially unsafe, depending on circumstances, or could be surprising or undesired. The directive only applies when curtain(1) is invoked with as many or more "-!" options. This was more useful at the beginning when many features weren't properly sandboxed yet. Now it's not used as much. But I still find it useful. The way I'm using it is "!" is probably no big deal but you might want to check it if you're paranoid, "!!" has a real risk of allowing escapes in certain plausible scenarios, and "!!!" is very likely insecure unless special precautions are taken.
>
> I'm still not sure what the defaults should be or how they could be better organized. The "unsafety" is an odd thing to expose to the user and as much as possible I tried to make it unnecessary.
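
To make the "-!" rule concrete, here is a minimal Python sketch of the comparison it describes: a directive carrying k trailing "!" marks applies only when curtain(1) was given at least k "-!" options. The function names and the directive strings are hypothetical, and the sketch assumes the marks sit at the very end of the directive text; it is not how libcurtain parses its config files.

```python
def unsafety_of(directive: str) -> int:
    """Count the trailing '!' marks on a directive (hypothetical syntax)."""
    stripped = directive.rstrip("!")
    return len(directive) - len(stripped)


def directive_applies(directive: str, bang_options: int) -> bool:
    """A directive applies when the invocation carries as many or more
    '-!' options than the directive has '!' marks."""
    return unsafety_of(directive) <= bang_options


if __name__ == "__main__":
    # Made-up directive names, checked against 0, 1 and 2 '-!' options.
    for d in ("example-directive", "example-directive!", "example-directive!!"):
        print(d, [directive_applies(d, n) for n in range(3)])
```

Under this model an unmarked directive always applies, and a "!!" directive stays inactive until two "-!" options are given.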
>
> So anyway, a shorter way to make nodejs work is to use level 6 which allows PROT_EXEC on anonymous memory (and to execute binaries in $TMPDIR too):
>
> $ curtain -6 node -e 'console.log(2+2)'
>
> Now with X programs:
>
> $ curtain -X xlogo
> $ curtain -X xterm
>
> -X gives "untrusted" X11 access, -Y "trusted" access (like with ssh) and -W is for Wayland.
>
> There's an example config file with sample application profiles that can be enabled by uncommenting the include line in /etc/curtain.conf (and reading this file is a good way to see how the whole thing works). Profiles can be used with -a/-A. Both simply enable the tag named after the program. -A is a shortcut that also enables "unsafety level" 1 (most profiles don't actually need it, but some do, so I just use it all the time).
>
> $ curtain -XA xterm
> $ curtain -XA firefox
> $ curtain -XA chrome
> $ curtain -XA falkon
> $ curtain -XA qbittorrent
> $ curtain -XA hexchat
> $ curtain -XA gimp
> $ curtain -XA audacious
> # curtain -A tcpdump
>
> Programs started this way still have the default level 5 permissions in addition to their profile permissions.
>
> Option -k ("kill") enables "strict" mode where the default becomes level 1 and programs are sent SIGKILL when trying to do something forbidden (otherwise they just get EPERM errors). I made those two things go together because unexpected restrictions can make programs misbehave and this could lead to security issues. This reduces the attack surface but it also means you've got to figure out the permissions just right or your programs are going to get killed a lot. Also, trying to access non-unveiled files does not cause a SIGKILL to be sent yet, so missing unveils have the potential to cause insecure misbehavior too.
>
> See the config files here:
>
> https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.defaults
>
> https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.sample
>
> How well does it generally work?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Well, there are some problems.
>
> First of all, "untrusted" X11 access doesn't work all that great. Some programs are just unstable with it. Firefox used to crash a lot with X11 errors but for some reason it seems to have gotten a lot better recently. But there might be thick borders around menus, client-side decorated windows won't be movable, the system tray won't work, selection/clipboard will only work in one direction. And it'll be slower. The alternative is to give them "trusted" X11 access but that's very insecure. And even untrusted access isn't so secure either, untrusted programs are not isolated from one another IIUC. And who knows what the window manager, panels and others could be doing with the window properties of untrusted clients... And this exposes the huge complexity of the X11 server.
>
> Wayland's security is supposed to be much better, but it depends on how the compositors handle security on the extra protocols that they support and IIUC there's not a consensus on how it should be handled yet and most compositors still lack security restrictions (but apparently some people just compile out their support for insecure protocols).
>
> Programs that have built-in support for privilege-separation and self-sandboxing can solve this by not giving direct access to the display to the sandboxed parts. And that's something that this implementation means to support (which can be done on top of sandboxing the application as a whole). But it's not a general solution.
>
> Also, dbus/dconf/pulseaudio/etc are not dealt with very well yet. They're just ignored really. And (a bit surprisingly) many programs seem OK with that.
> fontconfig will complain a lot but if the font caches are already up to date it doesn't look like it matters (startup will be much slower otherwise). pulseaudio will just die when firefox tries to start it but then it'll fall back to using OSS directly (sndio works too). Thumbnail caches won't be accessible. The XDG shared recent documents list won't work. dconf will be completely non-functional and some programs won't be able to save their settings. Etc. And even when it works, "desktop integration" in general is going to be very degraded. A program trying to launch the desktop environment's handler program to open a file or URL probably won't work because it'll inherit an overly restrictive sandbox. I haven't really gotten into trying to deal with this better yet. I see that there are dbus proxy services for sandboxing on Linux. It would probably need something like that.
>
> There are some scripts to sandbox programs with separate XDG directories or separate $HOME in /usr/share/examples/curtain/. But I wish doing this wouldn't be necessary...
>
> For non-desktop programs, it generally just works (if you give them enough permissions). The main thing causing trouble is usually /tmp.
>
> About the userland parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
>
> libcurtain is a wrapper around the sandboxing syscall. It allows assigning permissions to "slots" which then get merged. Path permissions can override each other (most specific wins) within a slot, but across slots they are merged in a non-interfering way (a more specific permission never cancels out a less specific permission from a different slot). Permissions from different bracketed sections of config files are added to different slots, so they all get merged in this way.
>
> Config files are also handled by libcurtain.
> Applications can use libcurtain directly to sandbox themselves using tags, but the API for that is more complex than it should be and I'm probably going to make more changes to it.
>
> I added a freebsd_simple_sandbox() function directly to libc that tries to load libcurtain and applies a tag. The idea is to make it as easy as possible to add configurable, opportunistic sandboxing to applications without having to link them to libcurtain. It can be called multiple times at different stages of initialization of an application, or for different sub-processes, etc. The application just specifies a tag for each call and the details are in the config files. Conceivably, there could be different backends implementing the sandboxing.
>
> libcurtain also contains the pledge()/unveil() implementation. On OpenBSD, pledge/unveil are available directly in libc (with the declarations in unistd.h), but the portable versions of some OpenBSD programs have problems if pledge/unveil are available on non-OpenBSD platforms because they just don't expect that. After fixing them, maybe auto-loading wrappers could be added directly to libc too so that they just work without having to deal with libcurtain dependencies.
>
> About the kernel-side parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Most of the implementation is in a separate mac_curtain module, but it also needed some changes spread out in the kernel to support it. That's what would need to be merged.
>
> The biggest change is adding "sysfils". It initially just meant "syscall filters" but now it's more of a general category of things that the kernel can do. Syscalls can be associated with zero or more required sysfils and some explicit sysfil checks were added in various places in the kernel as needed. ucreds have a set of allowed sysfils. Sysfils are represented as simple bitmaps and checks are fast.
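
As a rough illustration of why a bitmap check is cheap, here is a small Python model of "zero or more required sysfils" tested against a ucred's allowed set, together with the "restricted mode" rule stated later in this message (a ucred missing any sysfil bit). The bit names and the set of four bits are invented for the example; the actual sysfil names and layout are not given in this email.

```python
# Hypothetical sysfil bits, for illustration only.
SYSFIL_VFS_READ = 1 << 0
SYSFIL_VFS_WRITE = 1 << 1
SYSFIL_NET = 1 << 2
SYSFIL_PROC = 1 << 3
SYSFIL_ALL = SYSFIL_VFS_READ | SYSFIL_VFS_WRITE | SYSFIL_NET | SYSFIL_PROC


def sysfil_check(allowed: int, required: int) -> bool:
    """True when every required bit is present in the allowed set."""
    return (required & ~allowed) == 0


def in_restricted_mode(allowed: int) -> bool:
    """Restricted mode: the allowed set is missing at least one sysfil bit."""
    return (allowed & SYSFIL_ALL) != SYSFIL_ALL
```

A syscall with no required sysfils passes trivially (`required == 0`), and the whole check is a single AND plus a comparison regardless of how many bits are required.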
> Capsicum was slightly modified to make use of a sysfil bit to simplify syscall entry checks.
>
> Sysfils are meant to be part of the internal kernel API, they're not exposed to the userland. The curtain module exposes intermediate "abilities" instead.
>
> Some checks that checked for "capability mode" now check for a more general "restricted mode" instead. A process is considered in restricted mode whenever its ucred is missing any sysfil bit.
>
> MAC handlers were added to let curtain hook into places that didn't have MAC checks. Some of those new handlers definitely seem out of place. The new vnode "walk" functions are more of a low-level mechanism than just a security policy. And many of the new handlers want to restrict access to certain functionality as a whole (e.g. ioctls, sockopts, procctls, etc) rather than compare labels. But it seemed like the best place to add them because MAC already did most of what was needed. So I've been treating the MAC framework like it stands for "Modular Access Checks" or something.
>
> The curtain permissions are stored in "curtain" objects. Process ucreds have their labels point to a curtain. Curtains have pointers to "barrier" objects, which contain the hierarchical linkage needed to restrict access to protected kernel objects. Those kernel objects have their labels point directly to barriers. Barriers can outlive their curtains. When a ucred loses its last reference from a process, it is "trimmed" and its label curtain pointer "decays" into a pointer to the curtain's barrier so that the curtain can be freed (because curtains can be a few KBs and they can hold vnode references). A lot of objects hold references to ucreds, so they could build up a lot without this.
>
> Processes can sandbox themselves with curtainctl(2). They have to specify the full set of permissions they want to retain. The requested permissions are then masked with the current curtain (if any).
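
For path permissions, one way to picture this masking is with the nested -p example from earlier in this message. The Python model below resolves a path against each rule set by picking the most specific covering rule, then intersects the two results. It reproduces the permissions listed in that example, but it is only a userland model under that assumption, with hypothetical names (`resolve`, `nested`, `fmt`), and says nothing about how the kernel side is actually implemented.

```python
from pathlib import PurePosixPath
from typing import Dict, FrozenSet

Rules = Dict[str, str]  # path -> permission letters, e.g. {"./dev": "rwx"}


def resolve(rules: Rules, path: str) -> FrozenSet[str]:
    """Permissions of the most specific rule covering path (no rule: none)."""
    target = PurePosixPath(path)
    best: FrozenSet[str] = frozenset()
    best_depth = -1
    for rule_path, perms in rules.items():
        rule = PurePosixPath(rule_path)
        if rule == target or rule in target.parents:
            depth = len(rule.parts)
            if depth > best_depth:
                best, best_depth = frozenset(perms), depth
    return best


def nested(outer: Rules, inner: Rules, path: str) -> FrozenSet[str]:
    """A nested sandbox keeps only what both rule sets grant for the path."""
    return resolve(inner, path) & resolve(outer, path)


def fmt(perms: FrozenSet[str]) -> str:
    """Render a permission set in the 'rwx' letter order used above."""
    return "".join(c for c in "rwx" if c in perms)


if __name__ == "__main__":
    outer = {".": "rw", "./secret": "", "./dev": "rwx", "./data": "r"}
    inner = {".": "r", "./dev": "rx", "./data": "rw"}
    for p in (".", "./123", "./secret", "./dev", "./dev/123", "./data", "./data/123"):
        print(f"{p}:{fmt(nested(outer, inner, p))}")
```

Note how `./data` ends up as `r` even though the inner sandbox asked for `rw`, and `./secret` stays empty even though the inner sandbox never mentioned it: each side is resolved on its own before intersecting.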
> This involves dealing with inheritance relationships between permissions (as the new curtain can have permissions more specific than the old and vice versa).
>
> Kernel-side handling of filesystem path unveiling was the hardest part to deal with (given the "statelessness" of the vnode API) and it kind of is all a big kludge. I tried to make it as nice as possible and wrapped the whole thing behind a MAC API (it used to be a lot worse than that).
>
> Each directory "unveil" acts like a sort of chroot barrier but with specific permissions. There's a per-thread "tracker" with a circular buffer that remembers the permissions for the previous N looked-up vnodes. N only needs to be 2 as far as I can tell (most syscalls only need 1, but linkat() for example needs 2). The tracker has weak vnode references and doesn't need to be cleaned up after syscalls. namei() calls the new MAC handlers to manage the tracker during path lookup. fget*() also adds a tracker entry. Then the access check MAC handlers can find permissions for the passed vnodes in the tracker. This only works because almost all of the kernel code that works on vnodes first gets a reference from namei()/fget*() and then doesn't call VOP_LOOKUP() directly itself. It's messy but one good thing with it is that it usually "fails-secure" if the tracker was mismanaged because it won't find the vnode in it and it defaults to deny.
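
The tracker idea can be sketched in a few lines of Python: a fixed-size ring of (vnode, permissions) entries filled at lookup time, and an access check that denies anything it cannot find. The class and method names are hypothetical, vnodes are stood in for by plain strings, and weak references and per-thread storage are left out; this is a toy model of the description above, not the kernel code.

```python
from typing import FrozenSet, List, Optional, Tuple

TRACKER_SLOTS = 2  # the message reports that N = 2 appears to be enough

Entry = Tuple[str, FrozenSet[str]]


class Tracker:
    """Toy model of the per-thread tracker: a small ring of lookup results."""

    def __init__(self, slots: int = TRACKER_SLOTS) -> None:
        self._ring: List[Optional[Entry]] = [None] * slots
        self._next = 0

    def remember(self, vnode: str, perms: str) -> None:
        """Record a looked-up vnode, overwriting the oldest entry."""
        self._ring[self._next] = (vnode, frozenset(perms))
        self._next = (self._next + 1) % len(self._ring)

    def check(self, vnode: str, wanted: str) -> bool:
        """Allow only if the vnode is tracked with every wanted permission."""
        n = len(self._ring)
        for back in range(1, n + 1):  # newest entry first
            entry = self._ring[(self._next - back) % n]
            if entry is not None and entry[0] == vnode:
                return frozenset(wanted) <= entry[1]
        return False  # not tracked: fail secure
```

In this model a linkat()-style operation would call `remember()` once for each of its two paths and then `check()` both; a vnode that was never recorded, or that has already been pushed out of the ring, is denied.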