Scenario to make recv(MSG_WAITALL) stuck
Kostik Belousov
kostikbel at gmail.com
Tue Jun 14 09:55:08 UTC 2011
On Mon, Jun 13, 2011 at 07:19:40PM +0300, Mikolaj Golub wrote:
> Hi,
>
> Below is a scenario that makes recv(2) with the MSG_WAITALL flag get stuck.
>
> (See http://people.freebsd.org/~trociny/test_MSG_WAITALL.4.c for the test code).
>
> Let the size of the receive buffer be SOBUF_SIZE (e.g. 10000 bytes).
>
> On sender side do 2 send() requests:
>
> 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10);
>
> 2) data of size equal to SOBUF_SIZE.
>
> After this on receiver side do 2 recv() requests with MSG_WAITALL flag set:
>
> 1) recv() data of SOBUF_SIZE / 10 size;
>
> 2) recv() data of SOBUF_SIZE size;
>
> The second recv() will last for a very long time. In tcpdump one can observe
> that the window is permanently stuck at 0 and pending data is only sent via
> TCP window probes (so one byte every few seconds).
>
> 18:09:14.784698 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [S], seq 1907676797, win 65535, options [mss 16344,nop,wscale 3,sackOK,TS val 22207 ecr 0], length 0
> 18:09:14.784729 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [S.], seq 2298857585, ack 1907676798, win 10000, options [mss 16344,nop,wscale 3,sackOK,TS val 2718467987 ecr 22207], length 0
> 18:09:14.784749 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 0
> 18:09:14.785168 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [P.], seq 1:1001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 1000
> 18:09:14.785264 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 1001:10001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 9000
> 18:09:14.785280 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718467987 ecr 22207], length 0
> 18:09:19.784440 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 22707 ecr 2718467987], length 1
> 18:09:19.784480 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718468487 ecr 22707], length 0
> 18:09:24.784439 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 23207 ecr 2718468487], length 1
> 18:09:24.784472 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10002, win 0, options [nop,nop,TS val 2718468987 ecr 23207], length 0
> 18:09:29.784437 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10002:10003, ack 1, win 8960, options [nop,nop,TS val 23707 ecr 2718468987], length 1
> 18:09:29.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10003, win 0, options [nop,nop,TS val 2718469487 ecr 23707], length 0
> 18:09:34.784444 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10003:10004, ack 1, win 8960, options [nop,nop,TS val 24207 ecr 2718469487], length 1
> 18:09:34.784486 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10004, win 0, options [nop,nop,TS val 2718469987 ecr 24207], length 0
> 18:09:39.784443 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10004:10005, ack 1, win 8960, options [nop,nop,TS val 24707 ecr 2718469987], length 1
> 18:09:39.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10005, win 0, options [nop,nop,TS val 2718470487 ecr 24707], length 0
> 18:09:44.784442 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10005:10006, ack 1, win 8960, options [nop,nop,TS val 25207 ecr 2718470487], length 1
> 18:09:44.784477 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10006, win 0, options [nop,nop,TS val 2718470987 ecr 25207], length 0
> ...
>
> I first noticed this issue with HAST and suspect other people observed it with
> HAST too.
>
> Below is an explanation of what is going on.
>
> We totally filled the receive buffer with one SOBUF_SIZE/10 size request and
> a partial SOBUF_SIZE request. When the first request was processed we got
> SOBUF_SIZE/10 free space. It was just enough to receive the rest of the bytes
> for the second request, and the receiving thread went into
> soreceive_generic->sbwait here:
>
>     /*
>      * If we have less data than requested, block awaiting more (subject
>      * to any timeout) if:
>      *   1. the current count is less than the low water mark, or
>      *   2. MSG_WAITALL is set, and it is possible to do the entire
>      *      receive operation at once if we block (resid <= hiwat).
>      *   3. MSG_DONTWAIT is not set
>      * If MSG_WAITALL is set but resid is larger than the receive buffer,
>      * we have to do the receive in sections, and thus risk returning a
>      * short count if a timeout or signal occurs after we start.
>      */
>     if (m == NULL || (((flags & MSG_DONTWAIT) == 0 &&
>         so->so_rcv.sb_cc < uio->uio_resid) &&
>         (so->so_rcv.sb_cc < so->so_rcv.sb_lowat ||
>         ((flags & MSG_WAITALL) && uio->uio_resid <= so->so_rcv.sb_hiwat)) &&
>         m->m_nextpkt == NULL && (pr->pr_flags & PR_ATOMIC) == 0)) {
>         ...
>         error = sbwait(&so->so_rcv);
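
The blocking test quoted above can be modeled in user space to see why the second recv() goes to sleep. This is only a sketch, not kernel code: struct sockbuf_model and would_block() are hypothetical names, and the model assumes a stream socket with data already queued (m != NULL, m->m_nextpkt == NULL, PR_ATOMIC not set), so those terms of the kernel condition are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>   /* MSG_WAITALL, MSG_DONTWAIT */

/* Hypothetical stand-in for the so_rcv fields used by the quoted test. */
struct sockbuf_model {
    size_t sb_cc;     /* bytes currently queued in the receive buffer */
    size_t sb_lowat;  /* low water mark */
    size_t sb_hiwat;  /* high water mark (receive buffer size) */
};

/*
 * Returns true when the quoted condition would send the thread to sbwait().
 * Assumes m != NULL, m->m_nextpkt == NULL and a non-atomic protocol.
 */
static bool
would_block(const struct sockbuf_model *sb, size_t resid, int flags)
{
    /* Never block with MSG_DONTWAIT or when enough data is already queued. */
    if ((flags & MSG_DONTWAIT) != 0 || sb->sb_cc >= resid)
        return false;

    /* Block below the low water mark, or for a MSG_WAITALL that fits. */
    return sb->sb_cc < sb->sb_lowat ||
        ((flags & MSG_WAITALL) != 0 && resid <= sb->sb_hiwat);
}
```

With the numbers from the scenario (sb_cc = 9000 after the first recv(), resid = 10000, sb_hiwat = 10000, and a low water mark of 1 taken as an assumption), would_block() returns true for MSG_WAITALL and false when the flag is absent.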
>
> The receive buffer is almost full but has enough space to satisfy the
> MSG_WAITALL request without draining data to the user buffer, so soreceive
> waits for data. But the window was closed when the buffer filled, and to
> avoid silly window syndrome it reopens only when the available space is at
> least sb_hiwat/4 or maxseg:
>
> tcp_output():
>
>     /*
>      * Calculate receive window. Don't shrink window,
>      * but avoid silly window syndrome.
>      */
>     if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
>         recwin < (long)tp->t_maxseg)
>         recwin = 0;
>
> so it is stuck and pending data is only sent via TCP window probes.
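
The arithmetic behind the closed window can be checked with a small model of the quoted tcp_output() test. This is a sketch under assumptions: advertised_window() is a hypothetical helper that reproduces only the quoted comparison, not the rest of the kernel's window calculation, and the MSS of 16344 from the trace is used as an approximation of t_maxseg.

```c
/*
 * Model of the quoted silly-window-avoidance check.
 *   space  - free space in the receive buffer (the initial recwin)
 *   hiwat  - so_rcv.sb_hiwat
 *   maxseg - tp->t_maxseg
 * Returns the window that would be advertised by this check alone.
 */
static long
advertised_window(long space, long hiwat, long maxseg)
{
    long recwin = space;

    /* Too small on both counts: advertise a zero window. */
    if (recwin < hiwat / 4 && recwin < maxseg)
        recwin = 0;
    return recwin;
}
```

In the scenario the buffer holds 9000 of 10000 bytes, so space is 1000. That is below both sb_hiwat / 4 = 2500 and the maxseg of about 16344, so advertised_window(1000, 10000, 16344) yields 0 and the window stays closed while the receiver sleeps in sbwait().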
>
> It looks like the fix could be to remove this condition to block if
> MSG_WAITALL is set and it is possible to do the entire receive operation at
> once, like in the patch:
>
> http://people.freebsd.org/~trociny/uipc_socket.c.soreceive_generic.MSG_DONTWAIT.patch
>
> This works for me but I am not sure this is a correct solution.
>
> Note, the issue is not reproduced with soreceive_stream.
>
I do not understand what then happens for the recvfrom(2) call.
Would it get some error, or 0 as return and no data, or something else?
Also, what is the MT_CONTROL chunk about?