panics in soabort with so_count != 0, one possible solution to one cause.
Gleb Smirnoff
glebius at FreeBSD.org
Wed Jan 9 17:12:56 UTC 2013
On Wed, Jan 09, 2013 at 04:22:15PM +0100, Steve Read wrote:
S> Context for this message:
S> http://www.freebsd.org/cgi/query-pr.cgi?pr=145825&cat=kern
S> kern/145825: [panic] panic: soabort: so_count
S>
S> AND
S>
S> http://www.freebsd.org/cgi/query-pr.cgi?pr=159621
S> kern/159621: [tcp] [panic] panic: soabort: so_count
S>
S> The two PRs are essentially reporting the same thing, and I have seen
S> evidence of people reporting this panic against kernels as old as 6.2.
S>
S> == Scenario ==
S> The basic scenario is:
S> 1. There is a local listening TCP socket. A userland thread is waiting
S> on a kqueue, and will eventually call accept() on this socket.
S> 2. A new TCP connection arrives that matches this TCP socket. Syncache
S> hangs on to the connection until the three-way handshake is complete
S> (i.e. the ACK arrives).
S> 3. At this point, syncache_socket() calls sonewconn() and passes
S> SS_ISCONNECTED. sonewconn() as a result hands the new socket off to the
S> accept queue and wakes up the userland thread (marks the listening
S> socket "readable", sends a kqueue notification, etc.).
S> 4. Something goes wrong during the rest of syncache_socket(), as a
S> result of which it calls soabort().
S>
S> == Consequence ==
S> On a single-CPU machine, the netisr thread that called syncache_socket()
S> blocks out the userland thread until it has finished, so so_count of the
S> new connected socket is still zero when syncache_socket() calls
S> soabort(). (It's not absolutely guaranteed, as there are calls to
S> locking functions along the way, but it usually happens.)
S>
S> On a multi-CPU machine of any sort, the userland thread resumes as
S> soon as it is woken up, and it is possible (but not guaranteed) for it
S> to grab the socket and increment its so_count before
S> syncache_socket() calls soabort().
S>
S> I have a core which shows the netisr thread hitting the panic in
S> soabort(), while the expected userland thread (on a different CPU) is
S> still in the kernel, churning through the post-pickup part of accept().
S>
S> == Proposed solution ==
S> My proposed solution to this issue is:
S> 1. Replace SS_ISCONNECTED with 0 in the call to sonewconn() to prevent
S> it from waking up the listening thread.
S> 2. At the "end" of syncache_socket(), call soisconnected(), passing the
S> new socket. This will issue the wakeup after syncache_socket() has
S> finished preparing itself, and in particular after the last possible
S> call to soabort().
S>
S> I'm concerned, of course, that this may cause some unobvious fallout
S> somewhere, but I can't see for the moment what it would be. Any advice
S> would be welcome.
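
The ordering argument above can be checked with a small user-space model.
This is only a sketch under stated assumptions, not kernel code: toy_socket,
toy_accept_if_visible(), toy_abort_is_safe() and toy_syncache_socket() are
hypothetical names invented for illustration, and the accept attempt is
deliberately placed at the worst-case point to stand in for the multi-CPU
interleaving.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct socket: only the fields the argument needs. */
struct toy_socket {
	int	so_count;		/* references taken by the accept side */
	bool	on_accept_queue;	/* visible to the listening thread */
};

enum toy_result { TOY_ACCEPTED, TOY_ABORTED_CLEANLY, TOY_WOULD_PANIC };

/* Models the userland thread: it takes the socket only if it can see it. */
static void
toy_accept_if_visible(struct toy_socket *so)
{
	if (so->on_accept_queue) {
		so->on_accept_queue = false;
		so->so_count++;
	}
}

/* Models the so_count check in soabort(); false is where the panic fires. */
static bool
toy_abort_is_safe(const struct toy_socket *so)
{
	return (so->so_count == 0);
}

/*
 * publish_early == true  models sonewconn(lso, SS_ISCONNECTED).
 * publish_early == false models sonewconn(lso, 0) plus a late soisconnected().
 * setup_fails models "something goes wrong" in step 4 of the scenario.
 */
static enum toy_result
toy_syncache_socket(bool publish_early, bool setup_fails)
{
	struct toy_socket so = { 0, false };

	if (publish_early)
		so.on_accept_queue = true;	/* wakeup issued too soon */

	/* Worst case: the other CPU runs accept() right here. */
	toy_accept_if_visible(&so);

	if (setup_fails)
		return (toy_abort_is_safe(&so) ?
		    TOY_ABORTED_CLEANLY : TOY_WOULD_PANIC);

	if (!publish_early)
		so.on_accept_queue = true;	/* deferred soisconnected() */

	toy_accept_if_visible(&so);
	assert(so.so_count == 1);
	return (TOY_ACCEPTED);
}
```

Under these assumptions, toy_syncache_socket(true, true) yields
TOY_WOULD_PANIC, while toy_syncache_socket(false, true) yields
TOY_ABORTED_CLEANLY. Both orderings return TOY_ACCEPTED when setup succeeds.
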
S>
S> == Patch that applies the proposed solution ==
S> A patch that would apply to kernel 8.3 (the basic scenario appears to
S> still be feasible with HEAD, and the code is very similar):
S>
S> ======
S> --- netinet/tcp_syncache.c.orig 2013-01-09 13:18:05.000000000 +0000
S> +++ netinet/tcp_syncache.c 2013-01-09 14:03:54.000000000 +0000
S> @@ -638,7 +638,7 @@
S>  	 * connection when the SYN arrived. If we can't create
S>  	 * the connection, abort it.
S>  	 */
S> -	so = sonewconn(lso, SS_ISCONNECTED);
S> +	so = sonewconn(lso, 0);
S>  	if (so == NULL) {
S>  		/*
S>  		 * Drop the connection; we will either send a RST or
S> @@ -831,6 +831,8 @@
S>
S>  	INP_WUNLOCK(inp);
S>
S> +	soisconnected(so);
S> +
S>  	TCPSTAT_INC(tcps_accepts);
S>  	return (so);
AFAIU, in HEAD this race was fixed by r243627. Can Vijay and Andre comment?
--
Totus tuus, Glebius.