Re: Review D38047 ... and then there was one....

From: David E. Cross <david_at_crossfamilyweb.com>
Date: Fri, 11 Oct 2024 19:42:54 UTC
TLDR: This IS almost exclusively a single feature; and the requested 
piecemeal change is actually DANGEROUS because the current brokenness of 
the code is preventing dangerous code from being executed in the first 
place.

This is in most ways a single change, 'get nscd to support 
getgrouplist(3)'.  That's it.  The reason I want that is that I have an 
LDAP directory backing my users, but I don't want unauthenticated access 
to my LDAP hierarchy, I also don't want a user readable plaintext 
password just sitting on the machine for anyone to grab; so I use 
kerberos with machine credentials.  But what happens when a non-root 
local service needs to look up a user?  it cannot.  NSCD running in 
'perform lookup' mode addresses this concern beautifully.  But 
getgrouplist(3) was never done, and because it was never done the 
fallback code isn't intercepted and it just outright fails.

This is a entirely new (and previously overlooked) access method. Almost 
certainly the reason it was overlooked is because it is fundamentally 
different than every other entity database operation in the system.   
Because it is fundamentally different it shares almost no access 
patterns and code with them.

Because nscd so tightly cooperates with libc, there's a dependency there 
as well.  libc cannot make calls  nscd doesn't understand and nscd 
cannot return data that libc cannot process..

Yes, there were numerous other bugs encountered during the journey, but 
at the point they were encountered it was already deep into the 
implementation; could they be fixed independently? Maybe, but extracting 
them increases the risk to both those bugs (this code was developed and 
tested together), and to the larger and more complicated feature set).  
And the implication here is that even if I go through the time and 
complexity to pull those out the actual feature that I want and did the 
majority of the effort for is just going to sit on a  shelf because the 
complexity is too high.

All of the substantive fixes made were in the agent code, which I am 
going to guess that nobody (but me) is actually using.  Since the 
current code just does not even work at all (loop/cycle detection is 
broken); like seriously attempt to turn it on and see what happens. 
(perform-actual-lookups passwd yes , perform-actual-lookups group yes; 
set nsswitch.com to " group: files cache [notfound=return] compat" 
"passwd: files cache [notfound=return] compat" and then run "id 
nosuchuser" ).  In which case doing fixes like the loop detection code 
fix without the rest is MORE dangerous, since it then exposes the 
underlying broken code.


Why is getgrouplist(3) so different, and so problematic requiring a 
large effectively atomic change?

The way that nscd is structured is around the typical 'entity' database 
in classic unix.  Give me a name and I will give you the record, give me 
the ID and I will give you the record.  These operations are standard 
for ethers, password, group, hosts, services...

The way nscd is structured (and it makes sense), is that calls are 
mapped to a key of database, a lookup type, and the lookup value. This 
'key' also becomes the effective parameter for an RPC call (a loading 
cache).  The previously defined lookup types are 'id', 'name', and 'all' 
(all being used to step through an entire database).

getgouplist(3) is exactly none of these, additionally the return value 
is  NOT a group structure, or anything like it, it is an array of 
gid_ts.  So we need either a new lookup type, a new database type, or a 
4th parameter to "some" calls (and thus changing the ENTIRE processing 
of key lookups in nscd and libc). Regardless we still need new 
marshaling and demarshaling routines because this data is unlike 
anything else in the system.

An alternative would have been to make a new entity type ('grouplist'?) 
but that would also necessitate creating a new 'agent' on the nscd side, 
and duplicating a tremendous amount of the boilerplate code in many 
places to create an entire new database type that only supported a 
single operation.

There really isn't any middle of the road.  To support getgrouplist(3) 
you need ALL of the code.  Could the code be in and be entirely dead and 
uncalled?  sure, but then you're also not supporting getgrouplist(3), 
and then at some point all of that code goes live all at once again, and 
the only thing validated in the interim is that it compiles; none of the 
code is called by anything that insn't getgrouplist(3), and 
getgrouplist(3) requires all of it.


On 10/11/24 11:35, Cy Schubert wrote:
> In message<AE37187C-79D0-4B5E-87F0-6BB52822F03B@gmail.com>, Enji Cooper
> writes
> :
>>> On Oct 7, 2024, at 08:08, Cy Schubert<Cy.Schubert@cschubert.com> wrote:
>>> Yes. I was about to suggest this. Plus, any proposed commit log message
>>> must answer the questions why, what and how. With special attention 
>>> to why=
>> .
>>
>> I have the same feelings as Cy.
>>
>> FWIW, part of the reason why large/complex changes like this languish 
>> in my r
>> =
>> eview queues is in part due to reasons like this.
> Totally!
>
>> Unless I am a SME in the area who is driven to understand what the 
>> change ai=
>> ms to achieve, I will not take the time to review large/complex 
>> chances. I h=
>> ave a lot of other things in my life which take priority over large 
>> code rev=
>> iews.
> Agreed. In the case of nscd, I applied one fix to it years ago. Like 
> you, I
> do not feel qualified to review a large and complex jumbo patch (group of
> patches).
>
>> Please break the large change down into a smaller set of 
>> changes/reviews to m
>> =
>> ake it easier to review the overall change effectively.
> Yes. I have the same issue at $JOB. We use github enterprise. I refuse to
> review such patches and yet others, less experienced, summarily approve
> such patches without even looking at them, breaking our infrastructure.
>
> The choice is clear.
>
>> Cheers,
>> -Enji=