Re: capsicum(4): .. and SIGTRAP causing syscall really is in siginfo_t.si_errno?

From: Steffen Nurpmeso <steffen_at_sdaoden.eu>
Date: Thu, 13 Apr 2023 19:26:25 UTC
Hello.

David Chisnall wrote in
 <E8774F9D-239E-45EF-AFCE-EDE48489B323@freebsd.org>:
 |I added the siginfo member that passes the system call number (si_syscal\
 |l).  The problem that it solves is the syscall system call. For normal \
 |system calls, you can extract the system call number from the register \
 |frame, since it will be in rax. Unfortunately, for the syscall system \
 |call, this value is clobbered and you have no way of usefully recovering \
 |it.

I am too amd64 bound for sure.  Well, to be honest, i had written
a test-strace test target which runs the entire machinery (hmm..),
and then generates the list of necessary system calls for the
client and the server.  It, however, includes all system calls,
including all the ugly pre-sandbox things.
I am in the lucky position not to run debuggers, as well as being
too stupid to handle them, anyway (step and stepi, and break, that
is all i know).  But good that someone did that.
(Having said that i wrote that test target after i already had
started the seccomp(2) implementation, and the SIGSYS thing is
a regular build target, that i use.)

 |You might want to take a look at the Verona Sandbox code for inspiration \
 |(it works correctly without si_syscall for all system calls except \
 |syscall):
 |
 |https://github.com/microsoft/verona-sandbox
 |
 |This was my project that required this functionality, since it needed \
 |to intercept system calls and convert them to RPCs. It provides a simple \
 |mechanism for loading a .so in  an unprivileged child process and handling \
 |all system calls that touch a global namespace (open, bind, getaddrinfo) \
 |via RPC into the parent, with some easy-to-use abstractions for filesystem \
 |and network access. It works on Linux with seccomp-bpf and on FreeBSD \
 |with Capsicum. The FreeBSD version was significantly easier to write \
 |for a variety of reasons (Linux doesn’t support strongly aligned \
 |allocation in mmap, Linux can’t kill a child process when the parent \
 |process exits, only the parent thread, seccomp-bpf policies are \
 |amazingly fragile and require an entire library dependency to get right).

This sounds like a very impressive project, especially compared to
my little and primitive thing.

BPF for seccomp(2) seems to be very different from what the new
eBPF is capable of doing; I watched a LWN-linked presentation on what
BPF can do "some years" ago, with live modification / tracing
/ inspection of the kernel etc.
(But *i* dreamed of "a syscall bitset in front" (like capsicum
seems to have), and then executable snippets to do the rest,
including checks against real in-use descriptors, as opposed to
only compile-time constants.  Or complicated runtime program
generation.  And then, running a program for every system call is
tough.)
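
The "syscall bitset in front" idea could be sketched roughly as below.
This is only an illustration of the concept, not how capsicum(4) or
seccomp(2) are implemented; the sb_* names and the SB_MAX_SYSCALL bound
are made up for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound on syscall numbers, not taken from any kernel. */
#define SB_MAX_SYSCALL 1024
#define SB_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define SB_WORDS ((SB_MAX_SYSCALL + SB_WORD_BITS - 1) / SB_WORD_BITS)

/* One bit per syscall number: set means "allowed". */
struct sb_syscall_set {
    unsigned long bits[SB_WORDS];
};

/* Start from "deny everything". */
void sb_clear(struct sb_syscall_set *set) {
    assert(set != NULL);
    memset(set->bits, 0, sizeof set->bits);
}

/* Allow syscall number nr; returns false if nr is out of range. */
bool sb_allow(struct sb_syscall_set *set, long nr) {
    assert(set != NULL);
    if (nr < 0 || nr >= SB_MAX_SYSCALL)
        return false;
    set->bits[nr / SB_WORD_BITS] |= 1UL << (nr % SB_WORD_BITS);
    return true;
}

/* Constant-time check; out-of-range numbers are always denied. */
bool sb_is_allowed(const struct sb_syscall_set *set, long nr) {
    assert(set != NULL);
    if (nr < 0 || nr >= SB_MAX_SYSCALL)
        return false;
    return ((set->bits[nr / SB_WORD_BITS] >> (nr % SB_WORD_BITS)) & 1UL) != 0;
}
```

A filter built this way would answer the common case with one bit test
and only fall through to a more expensive per-call check for the
syscalls whose arguments need inspection.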

I think capsicum is likely the smartest thing and so nicely
reflects the UNIX "everything is a file".  But really, my setup
for my simple client/server is tremendous(ly complicated).

I see from looking that the FreeBSD kernel now supports
realpathat(2), yet not for users ([main] as of 03-31).
And this would be so really important to have!

I mean, i can evaluate configuration in a/the "super-capable" base
process, and then simply fork off a new server which then inherits
the new configuration (after the old has been told to die), but
that is a real mess.
Also because, you know, i opened a directory FD for / (the way
i do it: open it, use realpath(3) on all paths, and then simply
openat(2) "rootfd, &path[1]" to not openat(2) an absolute path..), but
this is only for the sandboxed process.
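
A minimal sketch of the rootfd approach described above, assuming it
runs where realpath(3) still works (that is, before entering capability
mode).  The helper name open_under_root is hypothetical and not taken
from any existing code.

```c
#define _XOPEN_SOURCE 700 /* for realpath(3) and openat(2) */
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Resolve `path` with realpath(3), then open it relative to `rootfd`
 * (a descriptor for "/"), skipping the leading slash so that openat(2)
 * never sees an absolute path.  Returns a descriptor, or -1 on error.
 */
int open_under_root(int rootfd, const char *path, int flags) {
    char resolved[PATH_MAX];

    assert(path != NULL);
    if (realpath(path, resolved) == NULL)
        return -1;
    /* "/" itself would leave an empty relative path; use "." instead. */
    const char *rel = (resolved[1] != '\0') ? &resolved[1] : ".";
    return openat(rootfd, rel, flags);
}
```

The root descriptor would be obtained once, for example with
open("/", O_RDONLY | O_DIRECTORY), and then passed to every call.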

So if someone would mount some filesystem over / (i presumed that
is the reason why AT_FDCWD and plain open(2) and openat(2) with
absolute paths are forbidden), then this will affect the
"super-capable" process which reloads the configuration and from
which the new sandboxed server instance is spawned.  That does not
make sense.  I could open the / descriptor already in that
process, on the other hand; hmm.  But still ugly.

So my thinking would be that there *must* be a realpathat(2) so
that the capsicum(4)ized server can simply reload the
configuration itself, while allowing the user full flexibility.

(My current approach is rather identical to what OpenBSD's
unveil(2) thing ends up with, ... yet relative to the opened
/ file descriptor, of course.  Because .. what else could i do?
So users have to use the _very same_ file names, or the thing
fails.  realpath(3) cannot be used.  I need to implement some
purely string-only path canonicalization to make this a bit
better.  The user may use fewer files, but no new ones at all.)
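
A purely string-only canonicalization might look like the sketch
below.  It is an assumption about what such a helper could do, not
existing code: the name path_normalize is made up, it only accepts
absolute paths, and because it never touches the filesystem it treats
".." lexically, so it can differ from realpath(3) when symbolic links
are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Lexically normalize the absolute path `path` into `out` (capacity `cap`):
 * collapse runs of '/', drop "." components, and let ".." remove the
 * previous component (never going above "/").
 * Returns 0 on success, -1 if `path` is not absolute or `out` is too small.
 */
int path_normalize(const char *path, char *out, size_t cap) {
    size_t len = 0;

    assert(path != NULL && out != NULL);
    if (path[0] != '/' || cap < 2)
        return -1;

    while (*path != '\0') {
        while (*path == '/')            /* skip a run of separators */
            path++;
        size_t n = strcspn(path, "/");  /* length of the next component */
        if (n == 0)
            break;                      /* only trailing slashes were left */
        if (n == 1 && path[0] == '.') {
            /* ".": contributes nothing */
        } else if (n == 2 && path[0] == '.' && path[1] == '.') {
            /* "..": cut back to just before the last '/' written */
            while (len > 0 && out[--len] != '/')
                ;
        } else {
            if (len + 1 + n >= cap)     /* '/' + component + NUL must fit */
                return -1;
            out[len++] = '/';
            memcpy(out + len, path, n);
            len += n;
        }
        path += n;
    }
    if (len == 0)                       /* everything cancelled out */
        out[len++] = '/';
    out[len] = '\0';
    return 0;
}
```

With this, "/etc/../etc//passwd" and "/etc/passwd" would compare equal
after normalization, so users would no longer have to spell a file name
exactly the same way every time.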

 |I have a patch under review that adds a SIGCAP as an alternative to \
 |SIGTRAP, which avoids painful interaction with the debugger. I’d love \
 |to get that merged before 14 but haven’t had time to address the last \
 |round of review comments. I’ve been running with it locally for a year \
 |or so.

So good luck getting this going!

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)