New in-kernel privilege API: priv(9)
Robert Watson
rwatson at FreeBSD.org
Wed Sep 13 07:29:19 PDT 2006
Dear all,
Over the past few weeks, I've been working on a replacement for the suser(9)
API, used to check whether a thread or credential has the privilege to
override discretionary access control or perform system configuration
operations in the kernel. Currently, these checks use one of two kernel APIs:
suser(thread)
or:
suser_cred(cred, flags)
The former is the more common invocation, but the latter is also often used;
this is largely because jail(4) requires limits on superuser privilege, so
instances of privilege allowed in jail are explicitly marked via the flags
field. There are also circumstances in which only a credential is available,
perhaps cached from another context, and a very small number of instances (2)
where a second flag, forcing use of the ruid instead of the euid, is used. The
above API has served FreeBSD well for many years. However, it suffers from a
number of architectural and functionality inadequacies. The goal of my work
has been to address a particular functional lack: granularity. In particular,
there are a number of things that finer granularity in the API would allow us
to do:
- Make it easier to explore the finer-grained granting of privilege via
policy, such as assigning specific useful privileges -- the ability to bind
a port, configure a SLIP interface, adjust the time, be exempt from audit
requirements, be allowed to attach to a jail, override certain file
permissions, set quotas, configure IP addresses, etc, which are cleanly
separable (not to mention usefully assignable) privileges.
- Make it easier to explore the finer-grained denial of privilege. For
example, jail is in large part based on a marking of different privilege
checking points as being "allowed in jail" or "not allowed in jail". In
some ways this is advantageous: the implementer of each suser check gets to
decide whether it's in jail, and that information is available in the
context of the check. However, this has several important disadvantages.
Not least is that the implementation of jail is highly distributed rather
than centralized, making auditing the implementation difficult. Another
disadvantage is that configuration options that vary the behavior of jail
are also distributed throughout the kernel rather than centralized, as they
must vary whether the SUSER_ALLOWJAIL flag is being passed into suser. It
would be nice to be able to quickly and easily answer the question "what
privileges are granted in jail", and to easily vary the list, which is not
possible currently.
- Make it easier to identify, categorize, and audit the use of privilege
throughout the kernel by actually having a list of the privileges and what
they correspond to, as well as making it easier to identify all the places
a specific privilege is used. This facilitates auditing of kernel
privilege use, and easy comparison of the use of identical privileges in
different subsystems. For example, while doing this work, I identified
inconsistencies in the application of superuser privilege in different file
systems, privileges that were sometimes allowed in jail, but sometimes not,
etc. 200 anonymous suser checks are hard to analyze; 160 named privilege
checks are much easier to analyze.
- Make it easier to modify the audit mechanism to capture a log of exactly
what privileges are exercised during operation, a requirement for higher
assurance evaluation.
What does this all mean in practice? It means replacing suser(9) and
suser_cred(9) with calls that express the specific privilege being checked
for. I took the most straightforward possible implementation: I reviewed all
privilege checks in the kernel, identified all identical privileges and
categorized all privileges by subsystem. I then assigned unique numeric
constants to each unique privilege, and added a privilege identifier argument
to the two new functions, priv_check(9) and priv_check_cred(9). Here are a few
sample snippets from the privilege list in src/sys/sys/priv.h:
...
PRIV_ACCT, /* Manage process accounting. */
PRIV_MAXFILES, /* Exceed system open files limit. */
PRIV_MAXPROC, /* Exceed system processes limit. */
PRIV_KTRACE, /* Set/accept KTRFAC_ROOT on ktrace. */
PRIV_SETDUMPER, /* Configure dump device (XXX: needs work). */
PRIV_NFSD, /* Can become NFS daemon. */
PRIV_REBOOT, /* Can reboot system. */
PRIV_SWAPON, /* Can swapon(). */
PRIV_SWAPOFF, /* Can swapoff(). */
...
PRIV_PMC_MANAGE, /* Can administer PMC. */
PRIV_PMC_SYSTEM, /* Can allocate a system-wide PMC. */
PRIV_SCHED_DIFFCRED, /* Exempt scheduling other users. */
PRIV_SCHED_SETPRIORITY, /* Can set lower nice value for proc. */
PRIV_SCHED_RTPRIO, /* Can set real time scheduling. */
PRIV_SCHED_SETPOLICY, /* Can set scheduler policy. */
PRIV_SCHED_SET, /* Can set thread scheduler. */
PRIV_SCHED_SETPARAM, /* Can set thread scheduler params. */
...
PRIV_UFS_SETQUOTA, /* setquota(). */
PRIV_UFS_SETUSE, /* setuse(). */
PRIV_UFS_EXCEEDQUOTA, /* Exempt from quota restrictions. */
PRIV_VFS_READ, /* Override vnode DAC read perm. */
PRIV_VFS_WRITE, /* Override vnode DAC write perm. */
PRIV_VFS_ADMIN, /* Override vnode DAC admin perm. */
PRIV_VFS_EXEC, /* Override vnode DAC exec perm. */
PRIV_VFS_LOOKUP, /* Override vnode DAC lookup perm. */
PRIV_VFS_BLOCKRESERVE, /* Can use free block reserve. */
...
As you can see, they break down into both a set of system management
privileges, relating to configuring kernel services, and then a set of
specific privileges associated with (and sorted by) major kernel subsystems.
None of this implies a change in underlying policy -- just that a bit more
contextual information is passed into the privilege check. This has some
important specific functional benefits:
- It makes it possible to migrate the "allowed in jail" decision from the
calling context to the privilege management code. This will allow us to
gradually eliminate the passing of flags to the privilege check code under
almost all circumstances. In my patch, I have added a new function to
kern_jail.c, prison_priv_check(), which essentially contains a switch
statement listing the privileges allowed in jail, and denying the rest.
Handling of configurable privileges, raw socket access, etc, can now occur
in one place, opening the door to easier per-jail configuration of
privilege. After these changes, the implementation is much more
centralized in kern_jail.c.
- It makes it possible for the MAC Framework to restrict access to privilege,
a feature required for the SEBSD policy module, which implements the
FLASK/Type Enforcement policy environment as found in SELinux. Policy
modules can register interest in privilege checks, and then specifically
deny access to privileges as they see fit.
- It makes it possible for the MAC Framework to allow policies to grant
privilege. Policy modules can register interest in privilege checks, and
then specifically grant access to privileges as they see fit.
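The centralization of the jail decision described above can be modeled in user
space. The following is only a minimal sketch, not the code from the patch:
the model_* names, the credential structure, the numeric values, and the
choice of which privileges are allowed in jail are all assumptions made for
illustration. Only the PRIV_* names are taken from the list earlier in this
message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* A few names from the privilege list; the numeric values are made up. */
enum model_priv {
	PRIV_REBOOT = 1,
	PRIV_SWAPON,
	PRIV_VFS_READ,
	PRIV_VFS_WRITE
};

/* Hypothetical stand-in for a kernel credential. */
struct model_cred {
	uid_t	euid;	/* effective uid */
	bool	jailed;	/* true if the credential belongs to a jail */
};

/*
 * Model of the jail check: a single switch names every privilege that is
 * allowed in jail, and everything else is denied.  The set chosen here is
 * illustrative only.
 */
static int
model_prison_priv_check(const struct model_cred *cred, int priv)
{
	assert(cred != NULL);
	if (!cred->jailed)
		return (0);
	switch (priv) {
	case PRIV_VFS_READ:
	case PRIV_VFS_WRITE:
		return (0);
	default:
		return (EPERM);
	}
}

/* Model of the privilege check: jail policy first, then the superuser test. */
static int
model_priv_check_cred(const struct model_cred *cred, int priv)
{
	int error;

	error = model_prison_priv_check(cred, priv);
	if (error != 0)
		return (error);
	return (cred->euid == 0 ? 0 : EPERM);
}
```

In this model a call site only names the privilege it needs. Whether that
privilege is available in jail is answered by reading one switch statement,
rather than by searching for flags at every caller.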
In order to demonstrate MAC Framework integration with the privilege system, I
have implemented a sample policy module, mac_privs, which allows rule-based
granting of privileges to specific uids. Using a command line tool,
appropriately privileged processes can modify the rule list, granting named
privileges to unprivileged users. This is not a particularly mature example
of a privilege-granting policy, as ideally privilege is something that is
available but not always exercised -- i.e., similar to a setuid root binary
that switches the effective uid to root only when it specifically needs
privilege. However, it's quite useful in practice, and demonstrates how
configurable policies can interact with kernel privilege decisions.
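The idea of rule-based granting of named privileges to specific uids can be
sketched as below. This is a hedged user-space illustration and does not
reflect the actual mac_privs data structures or the MAC Framework entry
points: the rule table, its fixed size, the model_* function names, and the
numeric privilege values are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Names from the privilege list; the numeric values are made up. */
enum model_rule_priv {
	PRIV_ACCT = 1,
	PRIV_SETDUMPER,
	PRIV_REBOOT
};

/* Hypothetical rule: grant one named privilege to one uid. */
struct model_rule {
	uid_t	uid;
	int	priv;
};

#define	MODEL_MAX_RULES	16

static struct model_rule model_rules[MODEL_MAX_RULES];
static size_t model_nrules;

/* Append a rule to the table; returns false if the table is full. */
static bool
model_rule_add(uid_t uid, int priv)
{
	if (model_nrules >= MODEL_MAX_RULES)
		return (false);
	model_rules[model_nrules].uid = uid;
	model_rules[model_nrules].priv = priv;
	model_nrules++;
	return (true);
}

/* Policy hook model: true if some rule grants priv to uid. */
static bool
model_policy_grants(uid_t uid, int priv)
{
	size_t i;

	assert(model_nrules <= MODEL_MAX_RULES);
	for (i = 0; i < model_nrules; i++) {
		if (model_rules[i].uid == uid && model_rules[i].priv == priv)
			return (true);
	}
	return (false);
}

/* Privilege decision: root holds every privilege; others need a rule. */
static bool
model_has_priv(uid_t uid, int priv)
{
	return (uid == 0 || model_policy_grants(uid, priv));
}
```

Because each rule names a single privilege, granting PRIV_REBOOT to a uid in
this model gives that uid nothing else; the superuser path is unchanged.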
In the past, I've done similar work on two occasions: once in implementing
POSIX.1e privileges for FreeBSD as part of the TrustedBSD Project (not
merged), and once as part of the SEBSD implementation. This work is
functionally similar, but there are several important ways in which this
design differs from the POSIX.1e approach (also used in Linux):
- The identification of privileges is quite fine-grained. The Linux-extended
POSIX.1e privilege set contains high level privileges like "Network
privilege", which encapsulates a broad range of different network privilege
checks. I have identified over 50 different specific network privileges,
each separately named. It would be easy to map these into the POSIX.1e
privilege set, which is presumably what the SEBSD policy will need to do in
order to produce the narrower set expected by the SELinux code.
- The approach is intended to allow the granting as well as denying of
privilege. This is an important design choice, and has both some costs and
some benefits. One important benefit is that it has historically proven
difficult to take rights away from the root user without introducing
security vulnerabilities associated with applications written to use root
privilege expecting that all privileges be in place. Granting specific
privileges implies a fairly different application and policy construction
and may well be safer.
- Because of the fine-grained naming of privileges, it's possible to
encapsulate jail in a way that was not previously possible: the POSIX.1e
privilege set was simply too coarse to capture the requirements of jail.
- Privileges under this model are not treated as maskable values. In
practice, there are very few situations in which it is useful to check
multiple privileges at once, and permitting that encourages authors adding
new privilege checks to combine privileges in a way that makes it opaque
to the privilege mechanism as to which privilege was actually needed. This
also has the benefit of making it much easier/more efficient to add new
privileges as required, as it doesn't require expanding a bit string
representing the privileges. Most POSIX.1e implementations limit the total
number of privileges to 32 to 64 in order to have them fit in a bitmask
easily.
- By assigning new privileges for every privilege with significantly
different semantics, the question of "when to add a new privilege" is
answered: unless there is an obvious match, you add one. With the POSIX.1e
+ Linux set, it is necessary to try to figure out how to fit a new check
into one of many poorly matching privileges. The result was that almost
all privileges not clearly matched to one of the POSIX.1e set ended up in
the catch-all CAP_SYS_ADMIN.
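The difference between maskable privileges and plain identifiers can be shown
with a small user-space sketch. Everything here is an assumption made for
illustration: the MODEL_* names, the bit positions, the numeric values, and
the per-privilege use counter (which merely hints at how naming one privilege
per check makes logging straightforward). It is not code from the patch or
from any POSIX.1e implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bitmask style: one bit per privilege, so a 32-bit word caps the set at 32. */
#define	MODEL_CAP_NET_ADMIN	(UINT32_C(1) << 0)
#define	MODEL_CAP_SYS_ADMIN	(UINT32_C(1) << 1)

static bool
model_mask_check(uint32_t held, uint32_t wanted)
{
	/* Callers may OR bits together, hiding which one was really needed. */
	return ((held & wanted) == wanted);
}

/* Identifier style: a plain integer per privilege, one privilege per check. */
enum model_priv_id {
	MODEL_PRIV_REBOOT = 1,
	MODEL_PRIV_SWAPON,
	MODEL_PRIV_SWAPOFF,
	MODEL_PRIV_COUNT	/* one past the last identifier */
};

/* Count of successful uses, indexed by privilege identifier. */
static unsigned int model_use_count[MODEL_PRIV_COUNT];

static bool
model_id_check(bool is_superuser, int priv)
{
	assert(priv >= MODEL_PRIV_REBOOT && priv < MODEL_PRIV_COUNT);
	if (!is_superuser)
		return (false);
	/* Exactly one privilege is named, so its use can be recorded. */
	model_use_count[priv]++;
	return (true);
}
```

Adding a privilege in the identifier style means adding one more enumerator;
nothing needs to be widened, and each check still states exactly which
privilege it exercised.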
The status of this work is that a pretty functional prototype can be found in
Perforce:
//depot/projects/trustedbsd/priv/...
A snapshot patch from the branch, excluding mac_privs, can be found here:
http://www.watson.org/~robert/freebsd/20060913-trustedbsd-priv.diff
In that tree, you'll want particularly to look at:
sys/kern/kern_jail.c           Revised jail privilege behavior
sys/kern/kern_priv.c           Privilege check implementation
sys/security/mac/mac_priv.c    MAC extensions for privileges
sys/security/mac_privs/*       Sample MAC policy granting privileges
sys/sys/priv.h                 Privilege list, API
share/man/man9/priv.9          Draft man page
usr.sbin/mac_privs/*           Management tool for sample MAC policy
It is my intent, following review, discussion, cleanup, etc, to commit the
priv(9) work, sans mac_privs, to the 7.x tree in the next couple of weeks.
The mac_privs policy is a sample policy that will continue to be maintained as
part of the TrustedBSD Project, but not merged into the base tree at this
point. Some remaining TODO items are:
- Review various XXX comments I added as part of this work.
- Complete modification of System V IPC code to properly check privileges.
- Update mac_none.c sample policy to include privilege stubs.
- Possibly move securelevel support to kern_priv.c, since it largely relates
to privilege.
- Teach the audit subsystem to collect privilege information during a system
call, and add it to audit records using privilege tokens (already present in
Solaris).
- Complete man page updates, including finalizing priv.9 and trimming down
suser.9.
- Create further privilege-related regression tests.
- Finalize decision on using an enum or an int to identify privileges. Using
an enum requires more namespace pollution, and requires hard-coded values
anyway in order to avoid ABI issues. Possibly using #defines would be
simpler.
I'd like to gratefully acknowledge the sponsorship of nCircle Network
Security, Inc in performing this work.
Robert N M Watson
Computer Laboratory
University of Cambridge