svn commit: r362798 - in projects/nfs-over-tls/sys/rpc: . rpcsec_tls

Wed Jul 1 00:43:59 UTC 2020

Benjamin Kaduk wrote:
>On Tue, Jun 30, 2020 at 04:20:45PM +0000, Rick Macklem wrote:
>> Benjamin Kaduk wrote:
>> >On Tue, Jun 30, 2020 at 7:49 AM Rick Macklem <rmacklem at freebsd.org> wrote:
>> >Author: rmacklem
>> >Date: Tue Jun 30 14:49:51 2020
>> >New Revision: 362798
>> >URL: https://svnweb.freebsd.org/changeset/base/362798
>> >
>> >Log:
>> >  Testing when a server does not respond to TLS handshake records exposed
>> >  a couple of problems, since the daemon would be in SSL_connect() for 6 minutes.
>> >
>> >  - When the upcall timed out and was retried, the RPCTLS_SYSC_CLSOCKET syscall
>> >    was broken and did not return an error upon a retry. It allocated a file
>> >    descriptor for a NULL socket.
>> >  - The socket structure in the kernel could be free'd while the daemon was
>> >    still using it in SSL_connect().
>> >  - Adjust the timeout and retry count so that upcalls are only attempted once
>> >    with a 10 minute timeout.
>> >
>> >
>> >10 minutes seems really long!  It sounds from the description like the upcall so that
>> >userspace can run SSL_connect() was taking 6 minutes, and you needed 10 minutes so
>> >as to be longer than the 6 minutes that is "out of your control"?
>> Well, I think a long timeout here is ok, since a timeout indicates a broken daemon.
>> (The upcalls to the local daemon should be reliable and cannot safely be redone.
>>  In a perfect world, the upcall mechanism would be "exactly once" instead of
>>  "at least once". I think an upcall might fail when the mbuf pool in the kernel
>>  is exhausted, but that should be rare.)
>>
>> >I feel like there should be some sockopts available to get the SSL_connect() timeout
>> >down, so that the upcall timeout doesn't need to be so long, either.
>> Yes, 6 minutes does seem like a long time. I only discovered this yesterday when
>> I simulated a server that did not respond to handshake records.
>>
>> I haven't yet dug into the openssl code to see if there is a way to adjust this
>> timeout.
>> I also do not know what a good timeout value for SSL_connect() might be,
>> even if the daemon can override the default.
>>
>> In practice, this should only happen when trying to do an NFS mount on
>> a broken server which responds to the "STARTTLS" Null RPC, but does not
>> do the handshake.
>> Having the mount attempt stuck for 6 minutes before failing is not that serious
>> a problem, imho.
>> (When systems boot after something like a power failure, delays getting NFS
>>  mounts done, due to the NFS server/network needing to be up, are fairly
>>  normal. The "-b" option to put the mount attempt in background has been
>>  around for a long time for this.)
>>
>> If you happen to know how to set a timeout for SSL_connect() in the openssl
>> library, I would be interested in hearing that.
>
>As it happens, I took a look before I wrote the initial note, and there
>don't seem to be any intrinsic TLS (not DTLS) handshake timeouts in
>libssl itself; I expect this is actually just the (kernel's!) TCP timeout.
>So you'd be getting the socket fd (e.g., SSL_get_fd(), if you don't have a
>reference already) and using setsockopt() to set the timeout(s).
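
A minimal sketch of the setsockopt() approach suggested above, assuming
SO_RCVTIMEO/SO_SNDTIMEO are the socket options meant. The helper names
set_io_timeout() and recv_times_out() are hypothetical, and a local socket
pair stands in for the TCP socket so the example needs no TLS server. In the
daemon, the descriptor would come from SSL_get_fd(). How SSL_connect() itself
reports a timed-out read has not been verified here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Set receive and send timeouts on a socket descriptor.
 * For a TLS connection, fd would be the value returned by SSL_get_fd()
 * (assumption: the daemon holds no other reference to the descriptor).
 * A value of 0 seconds means "no timeout".  Returns 0 on success, -1 on error.
 */
static int
set_io_timeout(int fd, int seconds)
{
	struct timeval tv;

	assert(seconds >= 0);
	tv.tv_sec = seconds;
	tv.tv_usec = 0;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == -1)
		return (-1);
	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) == -1)
		return (-1);
	return (0);
}

/*
 * Demonstration on a local socket pair whose peer never writes,
 * similar to a server that never answers the handshake.
 * Returns 1 if recv() gave up with EAGAIN/EWOULDBLOCK after the timeout,
 * 0 if it returned for any other reason, -1 on setup failure.
 */
static int
recv_times_out(int seconds)
{
	int sv[2], timed_out;
	char buf[1];
	ssize_t n;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	if (set_io_timeout(sv[0], seconds) == -1) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	n = recv(sv[0], buf, sizeof(buf), 0);
	/* Record the result before close() can disturb errno. */
	timed_out = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
	close(sv[0]);
	close(sv[1]);
	return (timed_out);
}
```

Calling recv_times_out(1) blocks for about one second and then reports the
timeout, instead of sitting in sbwait indefinitely.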
Interesting. The test case I simulated did not close the TCP socket used by
SSL_connect(). The server just replied to the STARTTLS Null RPC, but did not
call SSL_accept(), so the server side just isn't playing "handshake".
"netstat -a" showed the connection as ESTABLISHED.
During debugging, I also used the trick of putting:
    while (1)
        sleep(1);
right after the SSL_connect() call and, when watching it via "ps",
it would switch from "sbwait" to "nanoslp" after 6 minutes and
a syslog() call showed that SSL_connect() had returned -1.

So, if the TCP connection was "established", what caused the SSL_connect()
to return with an error (-1) after 6 minutes?

Now, there is a 6 minute idle timeout in the RPC code for TCP where it,
by default, closes the connection after 6 minutes without any activity.
(I have to check whether waiting for a reply to the upcall counts as
"no activity" and whether this also happens for AF_LOCAL sockets, which
is what the upcalls use.)

Now, if that happens, a SIGPIPE would be posted to the daemon, which
is SIG_IGN'd by the daemon. But maybe the SIGPIPE somehow causes
SSL_connect() to return -1 by making the syscall it is doing (read/recv on the
TCP socket sitting in sbwait) return EINTR, or something like that?
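
A small sketch of what ignoring SIGPIPE does on an AF_LOCAL socket, offered
only as background for the guess above. The function name
write_after_peer_close() is hypothetical. SIGPIPE is normally generated for
a process that writes to a socket whose peer has gone away; with SIG_IGN in
effect the write() simply fails with EPIPE. This does not show whether a
read blocked on a different socket is affected.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ignore SIGPIPE, close one end of a local socket pair and write to the
 * other end.  Returns the errno reported by write(), 0 if the write
 * unexpectedly succeeded, or -1 if the setup failed.
 */
static int
write_after_peer_close(void)
{
	int sv[2], err;

	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
		return (-1);
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	close(sv[1]);			/* the peer goes away */
	if (write(sv[0], "x", 1) == -1)
		err = errno;		/* EPIPE is expected here */
	else
		err = 0;
	close(sv[0]);
	return (err);
}
```

With the signal ignored, the process is not killed and sees the failure only
as an error return from write().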

I can change this 6 minute timeout to see if that affects it.

When you've got upcalls and library functions both talking to sockets it
can get interesting.

Thanks for the comments, rick

-Ben