maxfiles, file table, descriptors, etc...

Terry Lambert tlambert2 at mindspring.com
Tue Apr 22 13:11:47 PDT 2003


"Kevin A. Pieckiel" wrote:

[ ... FreeBSD lazy allocation of KVA space for zalloci() use ... ]

> What I fail to see is why this scheme is decidedly "better" than
> that of the old memory allocator.  I understand from the vm source
> that uma wants to avoid allocating pools of unused memory for the
> kernel--allocating memory on an as needed basis is a logical thing
> to do.  But losing the guarantee that the allocation routines will
> not fail and not adjusting the calling functions of those routines
> seems a bit dumb (since, as you state, the kernel panics).  I think
> this might be a trouble spot for me because of another question....

Eventually, the calling functions will be adjusted, I think.

The reason the new code (Jeff's code) is better is that it doesn't
task-commit a limited resource.  When you compile a kernel with a
specific MAXFILES (or set "kern.maxfiles" in the loader) in 4.x,
you eat an unrecoverable chunk of the KVA, which is a scarce
resource.

This is a problem if you are a general purpose system, since you
can't know in advance which resource is going to be in the highest
demand at any given point in time.  Even 5.x has a fault here, in
that, once allocated, the memory is type-stable; luckily, however,
most systems maintain homogeneous loads over time, so you aren't
going to see radical swings between being a scientific computation
platform vs. a web server vs. a shell machine, etc., without
reboots in between, as the machine finds itself repurposed.

The new code is also helpful for a platform dedicated to a specific
task.  In general, a platform with a specific role that is never
going to change gets manually tuned to that role.  This tuning
process is complex and time consuming, and requires both a lot of
knowledge of the OS and a lot of domain specific knowledge.  Even
then, people tend to make mistakes.  By allowing the limits to be
raised to the point that they are irrelevant, and then modifying
the code to allocate resources as necessary, you get a much simpler
tuning experience: from 30% of performance to 90% of performance
without a lot of work (the last 10% is still really hard, and
requires domain specific knowledge).


> What is the correct way to address this in the new allocator code?

There are several ways of doing this.

It's probably a good idea to make the kernel code in question
insensitive to NULL returns, as a general rule.  This helps the
code be more resilient to future changes, and it allows immediate
relief from high load situations: instead of hanging until a
request can be satisfied, the request is failed, and the pressure
on the system is reduced.  It's the same theory you get from seeing
a lot of cars jammed up in front of you, and turning right, instead
of heading into the jam with everyone else, and making things worse.
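
As a minimal sketch of what "insensitive to NULL returns" looks
like in kernel code (the struct name and the use of M_TEMP here
are just placeholders, not anything in the tree):

	/* Minimal sketch; "struct foo" and M_TEMP are illustrative. */
	struct foo	*fp;		/* some per-request state */

	fp = malloc(sizeof(*fp), M_TEMP, M_NOWAIT | M_ZERO);	/* sys/malloc.h */
	if (fp == NULL)
		return (ENOMEM);	/* fail the request; shed the load */

With M_NOWAIT the allocator returns NULL under memory pressure
instead of sleeping, and the error propagates back to the caller
instead of wedging the machine.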

It's also probably a good idea to use this as an indicator of
where code needs to be refactored.  Most of the problems in 5.x
are a result of legacy code that should be refactored, that's
being locked down, instead.  What happens in this case is that
locks get held across function call boundaries, and are not
released by coming back up over those same boundaries;
e.g. A() locks X, A() calls B(), B() calls C(), C() unlocks X.
Every time you see a "lock order reversal" or "LOR" posting to
the list, it's either because someone has been confused about
"locking code" vs. "locking data", or it's because there's a
layering abstraction violation that makes some lock acquisition
and release non-reflexive, like this.
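
A minimal sketch of that shape, with made-up names (only where
the lock and unlock live matters):

	/* Made-up names; only the shape is the point. */
	static struct mtx foo_lock;		/* sys/mutex.h */

	void
	A(void)
	{
		mtx_lock(&foo_lock);		/* acquired at the top layer */
		B();				/* never released on the way back up */
	}

	static void
	B(void)
	{
		C();
	}

	static void
	C(void)
	{
		mtx_unlock(&foo_lock);		/* released two layers down */
	}

Code structured this way can't be reasoned about one layer at a
time, which is the layering violation being described.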

Probably the easiest way of dealing with this problem is to
establish page mappings for all of physical memory, up front,
and all of KVA, and then modify the mappings and/or "give them
away", instead of trying to allocate new ones when you're in a
memory pressure situation.

One obvious fix for the zalloci() code would be to modify the
order in which page mappings are obtained, when new pages are
required by a given zone, and then add a second administrative
limit to the zone structure.  Initially set the administrative
limit equal to the hard limit on the zone, when the zone is
created, and then if you fail to obtain the page mapping, lower
the administrative limit to the current allocation count.

The effect of this would be to cause the zalloci() to fail in
a way that it's expected to fail: virtually, "because we have
hit our administratively agreed limit", rather than "because
we ran out of page mappings".
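
A rough sketch of that idea, with hypothetical field names (not
the actual vm_zone/uma layout) and a hypothetical zone_grow()
helper standing in for the page mapping code:

	/* Hypothetical field names; not the real zone structure. */
	struct zone {
		int	z_hardlimit;	/* set when the zone is created */
		int	z_adminlimit;	/* starts out equal to z_hardlimit */
		int	z_nitems;	/* items handed out so far */
		/* ... */
	};

	/* In the allocation path, when the zone needs to grow: */
	if (zone->z_nitems >= zone->z_adminlimit)
		return (NULL);				/* administrative limit hit */
	if (zone_grow(zone) != 0) {			/* couldn't get a page mapping */
		zone->z_adminlimit = zone->z_nitems;	/* clamp the zone here */
		return (NULL);
	}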


> I can come up with an option or two on my own... such as that to
> which I've already alluded: memory allocation routines that once
> guaranteed success can no longer be used in such a manner, thus the
> calling functions must be altered to take this into account.  But
> this is certainly not trivial!

Yes.  This is non-trivial, and it should be done anyway.  8-).
See above.


> >                         Basically, everywhere that calls zalloci()
> > is at risk of panic'ing under heavy load.
> 
> Am I not getting a point here?  I can't find any reference to
> zalloci() in the kernel source for 5.x (as of a 07 Apr 2003 cvs
> update on HEAD), and such circumstances don't apply to 4.x (which,
> of course, is where I DID find them after you mentioned them).

The calls have been changed; I should say "everywhere zalloci()
has been replaced with something which has a NULL-return semantic".


> > Correct.  The file descriptors are dynamically allocated; or rather,
> > they are allocated incrementally, as needed, and since this is not
> > at interrupt time, the standard system malloc() can be used.
> 
> A quick tangent....  when file descriptors are assigned and given to
> a running program, are they guaranteed to start from zero (or three
> if you don't close stdin, stdout, and stderr)?  Or is this a byproduct
> of implementation across the realm of Unixes?

The descriptor number is an index into the per process open
file table.  This table *always* starts at 0, but may start
with some slots filled in already (usually stdin/stdout/stderr,
but really, anything its parent process didn't have marked
"close on exec", and which doesn't force those semantics by
failing dup2(), is copied).

The place to look for this is:

	struct proc	*p;		/* sys/proc.h */
	struct filedesc *fdescp;	/* sys/filedesc.h */
	struct file	*fp0;		/* sys/file.h */

	fdescp = p->p_fd;
	fp0 = fdescp->fd_ofiles[ 0 /* this is my fd */ ];

The places you see these indices translated are falloc(),
fget(), and the other descriptor-manipulation functions, which
live in the kernel source file /usr/src/sys/kern/kern_descrip.c.
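
To see the "starts at zero" part from user space: POSIX requires
open() and dup() to hand back the lowest-numbered free descriptor,
so freeing slot 0 and opening something gives you 0 again.  A
minimal test:

	/* User space; the lowest free slot gets reused. */
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		int fd;

		close(0);			/* free descriptor 0 (stdin) */
		fd = open("/dev/null", O_RDONLY);
		return (fd == 0 ? 0 : 1);	/* fd comes back as 0 */
	}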

For an interesting case study, consider an already open file
on which you want to call "fstat" from user space.  Then look
in /usr/src/sys/kern/kern_descrip.c for the definition of the
function "fstat", which implements this system call (the struct
"fstat_args" is defined in a block comment above the function,
for convenience of the reader).

It's not commented in detail, but what happens is:

o	You take a trap for the system call via INT 0x80

o	The system call arguments are converted to a linear
	set, which is cast to a "struct fstat_args *" by the
	function entry (from a "void *").

o	A lock is held to prevent reentrancy

o	fget() translates the index (descriptor) into a
	"struct file *"; as a side effect, this obtains a
	reference, so that if someone else tries to close
	the file out from under you, you hold it open.

o	The fo_ ("file operation") stat is called, which copies
	the stat information into the stack region "ub", which is
	a "struct stat".

o	The data in "ub" is copied out into the user process
	address space, into the buffer whose address argument
	was supplied to the system call.

o	fdrop() is called to release the reference; if this was
	the last reference (unlikely, given the specific lock
	being held here), then the fp is released back to the
	system, and the file is truly closed.

o	The lock is released.

o	Any error which occurred is returned in %eax; the C library
	turns this into a -1 return to the caller, with errno set
	to the error.

So, although it's abstracted by fget/fdrop, it's really accessing an
allocated linear array of "struct file", for which the user space
file descriptor is an index into that array.
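
A condensed sketch of that sequence (this is only the shape of
the code, not the 5.x source itself; the argument lists are
approximate, so check kern_descrip.c for the real ones):

	/* Condensed sketch; argument lists approximate. */
	int
	fstat(struct thread *td, struct fstat_args *uap)
	{
		struct file *fp;
		struct stat ub;
		int error;

		mtx_lock(&Giant);			/* the reentrancy lock */
		if ((error = fget(td, uap->fd, &fp)) != 0)
			goto done;			/* bad descriptor */
		error = fo_stat(fp, &ub, td->td_ucred, td); /* file op fills in "ub" */
		fdrop(fp, td);				/* release the reference */
		if (error == 0)
			error = copyout(&ub, uap->sb, sizeof(ub));
	done:
		mtx_unlock(&Giant);
		return (error);			/* becomes -1/errno in user space */
	}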


[ ... per process open file table allocation inefficiencies ... ]

> Now this _IS_ interesting.  I would think circumstances requiring
> 100,000+ files or net connections, though not uncommon, are certainly
> NOT in the vast majority, but would still have a bone to pick with this
> implementation.  For example, a web server--from which most users
> expect (demand?) fast response time--that takes time to expand its
> file table during a connection or request would seem to have
> unreasonable response times.

Yes.  It's one of the things you rewrite when you are trying to
get uniform and high performance out of a system.

100,000 net connections is uncommon; until two years ago, no one
had really stressed FreeBSD above 32,768 connections, beyond which
a credentials bug would cause a kernel panic once enough sockets
had been closed.

Even at smaller numbers of open files, though, the allocation
causes "lurches" in server behaviour; you can see the dips as
inverse spikes from the allocations on "webbench", for example,
even for 10,000 and 20,000 connections.

> One would think there is a better way.

It's all about tradeoffs.  One way is to force the table size
large, to start with, using dup2().
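
The dup2() trick looks something like this from user space; it
assumes descriptor 0 is open, that the target number is within the
process's descriptor limit, and that the kernel doesn't shrink the
table again on a close (as far as I can tell, 4.x and 5.x don't):

	/* Grow the per-process open file table once, at startup. */
	#include <unistd.h>

	static void
	prealloc_fd_table(int nfds)
	{
		int fd;

		fd = dup2(0, nfds - 1);		/* forces the table out to nfds slots */
		if (fd != -1)
			close(fd);		/* slot freed; the table stays large */
	}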

> How much of an issue is this really?

You mean compared to having to disable PSE and PG_G, or some other
performance issues?  Not much of one.  But every little bit hurts.


> Excellent info, Terry.  Thanks for sharing it!

It's not all that great; I'm sure I'll be corrected on some things
with regard to 5.x, since it's a moving target: it's not really
possible to state anything authoritatively about it, because it
will be changed out from under you to address any easy complaints,
so by the time someone goes and looks at it, what you've said is
not true any more.  8-) 8-).

Basically, I answered because you asked.  I do that a lot, even in
private email; this got to the list because you Cc:'ed the list,
not because I would have put it there if you'd asked in private
email.  Lots of things never see the list; some people ask things
in private because of competitive advantage, or because I've stated
a non-disclosure requirement on a small set of topics.  8-).

-- Terry

