getaffinity/setaffinity and cpu sets.

Brooks Davis brooks at freebsd.org
Sat Feb 23 21:35:19 UTC 2008


On Sat, Feb 23, 2008 at 11:21:33AM -1000, Jeff Roberson wrote:
> 
> On Sat, 23 Feb 2008, Brooks Davis wrote:
> 
>> On Fri, Feb 22, 2008 at 01:52:54PM -1000, Jeff Roberson wrote:
>>> On Fri, 22 Feb 2008, Brooks Davis wrote:
>>> 
>>>> On Fri, Feb 22, 2008 at 12:34:13PM -1000, Jeff Roberson wrote:
>>>>> 
>>>>> On Thu, 21 Feb 2008, Robert Watson wrote:
>>>>> 
>>>>>> On Wed, 20 Feb 2008, Jeff Roberson wrote:
>> 
>>>>>> - It would be nice to be able to use CPU sets in jail as well,
>>>>>>  suggesting a hierarchical model with some sort of tagging so you know
>>>>>>  what CPU sets were created in a jail such that you know whether they
>>>>>>  can be changed in a jail.  While I recognize this makes things a lot
>>>>>>  more tricky, I think we should basically be planning more carefully
>>>>>>  with respect to virtualization when we add new interfaces, since it's
>>>>>>  a widely used feature, and the current set of "stragglers"
>>>>>>  unsupported in Jail is growing rather than shrinking.
>>>>> 
>>>>> I have implemented a hierarchical model.  Each thread has a pointer to
>>>>> the cpuset that it's in.  If it makes a local modification via
>>>>> setaffinity() it gets an anonymous cpuset that is a child of the set
>>>>> assigned to the process.  This anonymous set will also be inherited
>>>>> across fork/thread creation.
>>>>> 
>>>>> In this model presently there are nodes marked as root.  To query the
>>>>> 'system' cpus available we walk up from the current node until we find
>>>>> a root.  These are the 'system' set.  A thread may not break out of its
>>>>> system set.  A process may join the root set but it may not modify a
>>>>> root that is a parent.  Jails would create a new root.  A process
>>>>> outside of the jail can modify the set of processors in the jail but a
>>>>> process within the jail/root may not.
>>>>> 
>>>>> The next level down from the root is the assigned set.  The root may be
>>>>> an assigned set or this may be a subset of the root.  Processes may
>>>>> create sets which are parented back to their root and may include any
>>>>> processors within their root.  The mask of the assigned set is returned
>>>>> as 'available' processors.
>>>>> 
>>>>> This gives a 1 to 3 level hierarchy: the root, an assigned set, and an
>>>>> anonymous set.  Any of these but the root may be omitted.  There is no
>>>>> current way for userland to create subsets of assigned sets to permit
>>>>> further nesting.  I'm not sure I see value in it right now and it gives
>>>>> the possibility of unbounded tree depth.
>>>>> 
>>>>> Anonymous sets are immutable as they are shared and changes only apply
>>>>> to the thread/pid in the WHICH argument and not others which have
>>>>> inherited from it.  Anonymous sets have no id and may not be
>>>>> specifically manipulated via a setid.  You must refer to the
>>>>> process/thread.  From the administration point of view they don't
>>>>> exist.
>>>>> 
>>>>> When a set is modified we walk down the children recursively and apply
>>>>> the new mask.  This is done with a global set lock under which all
>>>>> modifications and tree operations are performed.  The td_cpuset pointer
>>>>> is protected under the thread_lock() and the set may be read without a
>>>>> lock.  This gives the possibility for certain kinds of races but I
>>>>> believe they are all safe.
>>>>> 
>>>>> Hopefully I explained that well enough for people to follow.  I realize
>>>>> it's a lot of text but it's fairly simple bookkeeping code.  This is
>>>>> all implemented and I'm debugging now.
>>>> 
>>>> One place I'd like to implement CPU affinity is in the Sun Grid Engine
>>>> execution daemon.  I think anonymous sets would not be sufficient there
>>>> because the model allows new tasks to be started on a particular node at
>>>> any time during a parallel job.  I'd have to do some more digging in the
>>>> code to be entirely certain.  I think the fewer limits we place on the
>>>> hierarchy, the better off we'll be unless there are compelling complexity
>>>> reasons to avoid them.
>>> 
>>> With the anonymous set you can bind any thread to any cpu that is visible
>>> to it.  How would this not work?
>> 
>> I'm still trying to wrap my head around the anonymous sets.  Is the idea
>> that once you are in an anonymous set, you can't expand it, or can you
>> expand out as far as the assigned set?  I'd like for parallel jobs to
>> be allocated a set of cpus that they can't change, but still be able
>> to make their own decisions about thread affinity if they desire (for
>> example OpenMPI has some support for this so processes stay put and in
>> theory benefit from positive cache effects).  If that's feasible in
>> this model, I'm happy with it.  I think we should keep in mind that these
>> SGE execution daemons might be sitting inside jails. ;-)
> 
> Ah, when I said the anonymous sets were immutable, that only means that 
> they are copy-on-write.  Because you can't know who shares a copy via fork 
> or thread creation you must make a new set each time you write.
> 
> I made the anonymous sets so that the parent would have a list of all 
> derivative children sets so that modifications to the parent would be 
> reflected in the child.  This also means that the scheduler only has to 
> look at one bitmap to determine the available cpus for a thread.
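
To check my reading of the copy-on-write behaviour, here is a rough
userland C sketch.  Every name in it (sketch_set, sketch_anon_create,
sketch_anon_write, sketch_modify) is made up for illustration and is not
taken from the actual patch.  It assumes a 64-bit mask, leaves out
locking and reference counting, and assumes that a change to the parent
is applied to each child by intersecting masks, which is only my guess at
what "reflected in the child" means.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mask type: one bit per CPU, at most 64 CPUs. */
typedef uint64_t sketch_mask_t;

/* Hypothetical set node; the parent keeps a list of its children. */
struct sketch_set {
	struct sketch_set *parent;
	struct sketch_set *child;	/* first child set */
	struct sketch_set *sibling;	/* next child of the same parent */
	sketch_mask_t mask;		/* the one bitmap the scheduler reads */
};

/* Create an anonymous set under 'parent', clipped to the parent's mask. */
static struct sketch_set *
sketch_anon_create(struct sketch_set *parent, sketch_mask_t mask)
{
	struct sketch_set *s;

	if ((mask & parent->mask) == 0)
		return (NULL);		/* nothing left to run on */
	s = calloc(1, sizeof(*s));
	if (s == NULL)
		return (NULL);
	s->parent = parent;
	s->mask = mask & parent->mask;
	s->sibling = parent->child;	/* link into the parent's list */
	parent->child = s;
	return (s);
}

/*
 * Copy-on-write: 'old' may be shared through fork or thread creation,
 * so it is never edited in place.  The writer gets a fresh sibling set.
 */
static struct sketch_set *
sketch_anon_write(struct sketch_set *old, sketch_mask_t mask)
{
	return (sketch_anon_create(old->parent, mask));
}

/* Change a set and push the new mask down to all derived sets. */
static void
sketch_modify(struct sketch_set *set, sketch_mask_t mask)
{
	struct sketch_set *c;

	set->mask = mask;
	for (c = set->child; c != NULL; c = c->sibling)
		sketch_modify(c, c->mask & mask);
}
```

In this sketch a thread that writes a new mask leaves everyone who
inherited the old anonymous set untouched, while a later change to the
assigned set still reaches both anonymous sets through the child list.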

I think the anonymous sets seem like a good idea.  One solution to my
problem might be to make it a privileged operation to change your
current set to something that is not a subset of your parent (or maybe
of your current set?).
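
As a rough illustration of the check I have in mind, here is a minimal C
sketch.  The names (sketch_mask_t, mask_is_subset, may_change_mask) and
the 64-bit mask are assumptions made for this example only; the
"privileged" flag stands in for whatever privilege check the kernel
would really use, and "limit" is either the parent's mask or the
current set's mask, depending on which variant we pick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask type: one bit per CPU, at most 64 CPUs. */
typedef uint64_t sketch_mask_t;

/* True if every CPU in 'mask' is also present in 'limit'. */
static bool
mask_is_subset(sketch_mask_t mask, sketch_mask_t limit)
{
	return ((mask & ~limit) == 0);
}

/*
 * Hypothetical policy: a request that stays inside the limiting set is
 * always allowed; a request that reaches outside it needs privilege.
 */
static bool
may_change_mask(sketch_mask_t requested, sketch_mask_t limit,
    bool privileged)
{
	if (requested == 0)
		return (false);		/* an empty set can run nowhere */
	return (mask_is_subset(requested, limit) || privileged);
}
```

With this, a parallel job could rearrange its threads freely inside the
CPUs it was given, but could not grow beyond them without privilege.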

-- Brooks