svn commit: r277199 - in head/sys: fs/devfs kern

Fri Jan 16 08:59:28 UTC 2015

Hi Konstantin,

On 01/16/15 09:03, Konstantin Belousov wrote:
> On Thu, Jan 15, 2015 at 01:14:39PM +0100, Hans Petter Selasky wrote:
>> On 01/15/15 12:51, Konstantin Belousov wrote:
>>> On Thu, Jan 15, 2015 at 11:49:09AM +0100, Hans Petter Selasky wrote:
>>>>
>>>> I see no leakage in that case either!
>>> Because these cases come through destroy_dev().
>>>
>>>>
>>>> Are there more cases which I don't see?
>>> You are breaking existig devfs KPI by your hack.  You introduce yet another
>>> reference on the device, which is not supposed to be there.
>>
>> Hi Konstantin,
>>
>> I need a non-sleeping way to say a character device is no longer
>> supposed to be used and be able to re-use the device name right away
>> creating a new device. I guess you got that.
> Yes, I got it.  The devfs design is not suitable for this, and your
> hack is the good witness of the fact.
>
> My opinion is that you should have tried to handle the issue at the driver
> level, instead of making this devfs issue.  I.e., if you already have
> cdev node with the correct name, driver should have replaced the private
> data to point to new device.

I think this way cannot be implemented in a clean way, because of 
locking order reversal. And if you try to avoid the LOR you end up with 
the race. Chess mate sort of ;-)  Let me explain:

Thread 1:
usb_sx_lock();
   cdev = Look in freelist for existing device();
else
   cdev = make_dev();
usb_sx_unlock();

Thread 2:
usb_sx_lock();
put cdev on freelist
usb_sx_unlock();

Thread 3:
usb_sx_lock();
cdev = remove first entry in freelist
usb_sx_unlock();

/*
  * XXX because USB needs to call destroy_dev() unlocked we
  * are racing with Thread 1 again
  */
destroy_dev(cdev);

>
> This would also close a window where /dev node is non-existent or operate
> erronously.

I'm not saying I plan so, but I think "cdevs" at some point need to 
understand mutexes and locks. That means like with other API's in the 
kernel we can associate a lock with the "cdev", and this lock is then 
used to ensure an atomic shutdown of the system in a non-blocking 
fashion. In my past experience multithreaded APIs should be high level 
implemented like this:

NON-BLOCKING methods:
lock(); **
xxx_start();
xxx_stop();
unlock();

BLOCKING methods:
setup(); // init
unsetup(); // drain

Any callbacks should always be called locked **

In devfs there was no non-blocking stop before I added the delist_dev() 
function.

>
> You do not understand my point.
>
> I object against imposing one additional global reference on all cdevs
> just to cope with the delist hack.  See the patch at the end of the message.

It's fine by me.
>
> WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
> delist_dev(), the node still exists in the /dev. It is only removed
> after populate loop is run sometime later. dev_sched() KPI is inheritly
> racy, drivers must handle the races for other reasons.

The populate loop is all running under the dev_lock() from what I can 
see and make_dev() is also keeping the same lock when inserting and 
removing new cdevs. The populate loop should always give a consistent 
view of the character devices available, and I don't see how "cdev" 
structures without the CDP_ACTIVE flag can appear with recently created 
ones, even if the name is the same.

>
>>
>>> Entry can be random, since after the populate loop is ran, your code in
>>> other thread could start and create duplicate entry. There is a window
>>> in the lookup where both directory vnode lock and mount point sx locks
>>> are dropped. So running the populate does not guarantee anything.
>>
>> If there is such a race, it is already there! My patch changes nothing
>> in that area:
>>
>> Thread1:
>> Calls destroy_dev() and clears CDP_ACTIVE, after dev_unlock() it goes
>> waiting for some refs for xxx milliseconds.
>> Thread2:
>> Tries to create create a new character device having the same name like
>> the one in thread1. Device name duplication check is missed because
>> CDP_ACTIVE is cleared. Still thread1 is waiting.
>> Thread3:
>> Tries to open character device created by thread2 while thread1 is still
>> waiting for some ref held by a userspace app to go away.
>>
>> This can happen already before my patches! What do you think?
> Possibly.
>

>> At what level do you mean duplicate names, I don't get this fully? At
>> the directory level (DE nodes)? Or inside the list of character devices
>> (LIST_XXX)?
> It does not matter, dup at either one directory level, or dup of full
> names in the global list are equivalent (bad) things.

Like I write above I don't see where the problem is. At the cdev level, 
we are protecting the cdev's LIST with dev_lock() and only one entry 
will exist having CDP_ACTIVE bit set per unique cdev name and path. Else 
we will hit a panic in make_dev() and friends.

In the directory entry level the populate loop will also ensure a 
consistent view, and hence the cdev's LIST is consistent, the view 
presented to userspace will also be consistent.

That system functions can still call into the dangling read/write/ioctl 
functions is another story, and that is why I tell, that in order to 
simplify this teardown, we possibly should associate a client selectable 
lock with each CDEV, for teardown purposes. Like done for several years 
in the callout and USB APIs and possibly many other places.

>
>>
>>> Let me summarize:
>>> - the extra reference on arbitrary cdev should be eliminated. The
>>>     delist_dev_locked() may add the ref and set some CDP_ flag to
>>>     indicate to destroy_dev() that it should do dev_rel().
>>
>> It is possible to do this. I thought about this before doing my patch,
>> but decided to try to avoid adding yet another cdev flag.
>>
>>> - the existence of the duplicated entries should be either eliminated
>>>     (I am not sure it is possible with your code), or we must ensure
>>>     that only one name with CDP_ACTIVE flag set and given name exists.
>>>     Then, all lookup code must be audited to take CDP_ACTIVE into account
>>>     when accessing names.  I see at least devfs_find() and devfs_mknod()
>>>     which must be changed.  I did not performed full audit.
>>
>> I will check this path out aswell.
>>
>
> It seems it is simpler for me to try to clean up after the commit.
> The patch was only lightly tested.  I post the patch for discussion,
> not for you to committing it.  I will expedite the change into HEAD
> after the consensus on it is made and adequate testing is performed.

I don't see any problems about your patch, except it adds a bit more 
code to the kernel.

--HPS