From nobody Tue Jan 02 21:06:59 2024 X-Original-To: dev-commits-src-all@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4T4QQW5MRFz56KLf; Tue, 2 Jan 2024 21:06:59 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4T4QQW1tPHz4c5C; Tue, 2 Jan 2024 21:06:59 +0000 (UTC) (envelope-from git@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1704229619; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=yuCBRuTCax087JEZdazjwjgIVlpkRoY//zGKDmAEs/Y=; b=oQt40WJDuzfo8+jVEDL0DniL7nzMTVsB2qPQdoY12bsBTP7jQU4mz7RokxfnMWbMMLDQYM b9zlpc/At9l299aKxJo3yvUhmla8oBJphpvK8yFPoq8OiH7emVjSYN808/gSuHy5NnqdLW C+omzzKi5CFJHatuChJTanSTK6B5bdznxZU69cV8T+W8+nBbmjmYPRoFZ/TTdhncfcpKeW Mi+tmSwuKviW7r+da4J4jPeq9d7QaJ/ONLgzMFSaFiEK/E3nG13SnSPkF6U9aym8aEhTC1 gmeC9TMdfBad6puuzElAV2GbTiEHyADEDdA5xc7ryqXTsRTiBQUVLTo9c9XTyg== ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1704229619; a=rsa-sha256; cv=none; b=bMXjEGrMk2Tvf+kQ+WHv/4+u6RThiofg9U5ohJFGcXgoXF4fVBNA6s3LQz/xHnZqawQtSB 077gR8YPeTLn+2K6YUuiOCEXWuyywPjprLF5KC89H8Rqny5fHNkDicDTH7q/fl5OZdM4+S 6n2/K04W5BD2EhANQB4uWlNbZTAG65lTwGgOK3ScB3KqCzDAXCPgeAWqkUz3/+zqS/oydq eOoGmyLUBrkko696mg85UpZxukA4949OArQPaJejvpK5mfgOcZKNM5oB3GlkZ/3rNZhRuu d/h2PtPEc6haPp6hNnFIvYFA3gKlLoJ1UMkFp8G4Fs/V7Zwb7nRTMyoIz2N8iA== ARC-Authentication-Results: i=1; mx1.freebsd.org; none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1704229619; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=yuCBRuTCax087JEZdazjwjgIVlpkRoY//zGKDmAEs/Y=; b=QXMY4nc1PXwHZP8mtgDcEs/ar9qSWK1L7c2DARXoQ4/7eWA7arxQ4ra8DbA4F3VyoF9W/h Mab/xyc6WV2t9ynR5LlRyFw0L9U73QezYgx/lWJZypwaNmB4SSxfgUZUB/Uma/zf6XcOqG yDJa9So1be5WHuOOt9G0jnCkyqYiWGSZ1AkH3aUs+jDc78S/+nDAXKbZpqaFP+OjU0H8Wg N832gT1y9FVYnKNosBk6YyHI1OXQHCfZmZePOFXV/eiMm4aF/z/pA/6wEMSVJy1NuATBR7 BzfJPKTQ0/e7i720MbbLOSMu9/gMZxVWw1MhcJ7hyYIOC55CMa7MMD681O6jSg== Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 4T4QQW0xdMz1G9b; Tue, 2 Jan 2024 21:06:59 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org ([127.0.1.44]) by gitrepo.freebsd.org (8.17.1/8.17.1) with ESMTP id 402L6xpA056072; Tue, 2 Jan 2024 21:06:59 GMT (envelope-from git@gitrepo.freebsd.org) Received: (from git@localhost) by gitrepo.freebsd.org (8.17.1/8.17.1/Submit) id 402L6xj5056069; Tue, 2 Jan 2024 21:06:59 GMT (envelope-from git) Date: Tue, 2 Jan 2024 21:06:59 GMT Message-Id: <202401022106.402L6xj5056069@gitrepo.freebsd.org> To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org From: Gleb Smirnoff Subject: git: 7e19c0186f00 - main - netlink: improve nl_soreceive() List-Id: Commit messages for all branches of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-dev-commits-src-all@freebsd.org X-BeenThere: dev-commits-src-all@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Git-Committer: glebius X-Git-Repository: src X-Git-Refname: refs/heads/main X-Git-Reftype: branch X-Git-Commit: 7e19c0186f001902efdc265139a9233b7a37eef1 Auto-Submitted: auto-generated The branch main has been updated by glebius: URL: https://cgit.FreeBSD.org/src/commit/?id=7e19c0186f001902efdc265139a9233b7a37eef1 commit 7e19c0186f001902efdc265139a9233b7a37eef1 Author: Gleb Smirnoff AuthorDate: 2024-01-02 21:05:25 +0000 Commit: Gleb Smirnoff CommitDate: 2024-01-02 21:05:25 +0000 netlink: improve nl_soreceive() The previous commit conservatively mimiced operation of soreceive_generic(). The new code does two things: - parses Netlink message headers and always returns at least one full nlmsg - hides nl_buf boundaries from the userland, copying out several at once More details can be found in the large comment block added. Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D42785 --- sys/netlink/netlink_domain.c | 143 +++++++++++++++++++++++++++++++++---------- 1 file changed, 111 insertions(+), 32 deletions(-) diff --git a/sys/netlink/netlink_domain.c b/sys/netlink/netlink_domain.c index 3914d402fc04..94d8377b9e15 100644 --- a/sys/netlink/netlink_domain.c +++ b/sys/netlink/netlink_domain.c @@ -3,6 +3,7 @@ * * Copyright (c) 2021 Ng Peng Nam Sean * Copyright (c) 2022 Alexander V. Chernikov + * Copyright (c) 2023 Gleb Smirnoff * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions @@ -666,9 +667,10 @@ nl_soreceive(struct socket *so, struct sockaddr **psa, struct uio *uio, .nl_pid = 0 /* comes from the kernel */ }; struct sockbuf *sb = &so->so_rcv; - struct nl_buf *nb; + struct nl_buf *first, *last, *nb, *next; + struct nlmsghdr *hdr; int flags, error; - u_int overflow; + u_int len, overflow, partoff, partlen, msgrcv, datalen; bool nonblock, trunc, peek; MPASS(mp == NULL && uio != NULL); @@ -689,8 +691,16 @@ nl_soreceive(struct socket *so, struct sockaddr **psa, struct uio *uio, if (__predict_false(error)) return (error); + if (controlp != NULL) + *controlp = NULL; + + len = 0; + overflow = 0; + msgrcv = 0; + datalen = 0; + SOCK_RECVBUF_LOCK(so); - while ((nb = TAILQ_FIRST(&sb->nl_queue)) == NULL) { + while ((first = TAILQ_FIRST(&sb->nl_queue)) == NULL) { if (nonblock) { SOCK_RECVBUF_UNLOCK(so); SOCK_IO_RECV_UNLOCK(so); @@ -705,44 +715,113 @@ nl_soreceive(struct socket *so, struct sockaddr **psa, struct uio *uio, } /* - * XXXGL - * Here we emulate a PR_ATOMIC behavior of soreceive_generic() where - * we take only the first "record" in the socket buffer and send it - * to uio whole or truncated ignoring how many netlink messages are - * in the record and how much space is left in the uio. - * This needs to be fixed at next refactoring. First, we should perform - * truncation only if the very first message doesn't fit into uio. - * That will help an application with small buffer not to lose data. - * Second, we should continue working on the sb->nl_queue as long as - * there is more space in the uio. That will boost applications with - * large buffers. + * Netlink socket buffer consists of a queue of nl_bufs, but for the + * userland there should be no boundaries. However, there are Netlink + * messages, that shouldn't be split. Internal invariant is that a + * message never spans two nl_bufs. + * If a large userland buffer is provided, we would traverse the queue + * until either queue end is reached or the buffer is fulfilled. If + * an application provides a buffer that isn't able to fit a single + * message, we would truncate it and lose its tail. This is the only + * condition where we would lose data. If buffer is able to fit at + * least one message, we would return it and won't truncate the next. + * + * We use same code for normal and MSG_PEEK case. At first queue pass + * we scan nl_bufs and count lenght. In case we can read entire buffer + * at one write everything is trivial. In case we can not, we save + * pointer to the last (or partial) nl_buf and in the !peek case we + * split the queue into two pieces. We can safely drop the queue lock, + * as kernel would only append nl_bufs to the end of the queue, and + * we are the exclusive owner of queue beginning due to sleepable lock. + * At the second pass we copy data out and in !peek case free nl_bufs. + * + * XXX: current implementation of control data implies one chunk of + * data per one nl_buf. This doesn't map well with idea of no + * boundaries between nl_bufs. */ - if (__predict_true(!peek)) { - TAILQ_REMOVE(&sb->nl_queue, nb, tailq); - sb->sb_acc -= nb->datalen; - sb->sb_ccc -= nb->datalen; + TAILQ_FOREACH(nb, &sb->nl_queue, tailq) { + u_int offset; + + /* + * XXXGL: zero length buffer may be at the tail of a queue + * when a writer overflows socket buffer. When this is + * improved, use MPASS(nb->offset < nb->datalen). + */ + MPASS(nb->offset <= nb->datalen); + offset = nb->offset; + while (offset < nb->datalen) { + hdr = (struct nlmsghdr *)&nb->data[offset]; + if (uio->uio_resid < len + hdr->nlmsg_len) { + overflow = len + hdr->nlmsg_len - + uio->uio_resid; + partoff = nb->offset; + if (offset > partoff) { + partlen = offset - partoff; + if (!peek) { + nb->offset = offset; + datalen += partlen; + } + } else if (len == 0 && uio->uio_resid > 0) { + flags |= MSG_TRUNC; + partlen = uio->uio_resid; + if (!peek) { + /* XXX: may leave empty nb */ + nb->offset += hdr->nlmsg_len; + datalen += hdr->nlmsg_len; + } + } else + partlen = 0; + goto nospace; + } + len += hdr->nlmsg_len; + offset += hdr->nlmsg_len; + MPASS(offset <= nb->buflen); + msgrcv++; + } + MPASS(offset == nb->datalen); + datalen += nb->datalen; + } +nospace: + last = nb; + if (!peek) { + if (last == NULL) + TAILQ_INIT(&sb->nl_queue); + else { + /* XXXGL: create TAILQ_SPLIT */ + TAILQ_FIRST(&sb->nl_queue) = last; + last->tailq.tqe_prev = &TAILQ_FIRST(&sb->nl_queue); + } + sb->sb_acc -= datalen; + sb->sb_ccc -= datalen; } SOCK_RECVBUF_UNLOCK(so); - overflow = __predict_false(nb->datalen > uio->uio_resid) ? - nb->datalen - uio->uio_resid : 0; - error = uiomove(nb->data, (int)nb->datalen, uio); - if (__predict_false(overflow > 0)) { - flags |= MSG_TRUNC; - if (trunc) - uio->uio_resid -= overflow; - } + for (nb = first; nb != last; nb = next) { + next = TAILQ_NEXT(nb, tailq); - if (controlp != NULL) { - *controlp = nb->control; - nb->control = NULL; + /* XXXGL multiple controls??? */ + if (controlp != NULL && *controlp == NULL && + nb->control != NULL && !peek) { + *controlp = nb->control; + nb->control = NULL; + } + if (__predict_true(error == 0)) + error = uiomove(&nb->data[nb->offset], + (int)(nb->datalen - nb->offset), uio); + if (!peek) + nl_buf_free(nb); } + if (last != NULL && partlen > 0 && __predict_true(error == 0)) + error = uiomove(&nb->data[partoff], (int)partlen, uio); - if (__predict_true(!peek)) - nl_buf_free(nb); + if (trunc && overflow > 0) { + uio->uio_resid -= overflow; + MPASS(uio->uio_resid < 0); + } else + MPASS(uio->uio_resid >= 0); if (uio->uio_td) - uio->uio_td->td_ru.ru_msgrcv++; + uio->uio_td->td_ru.ru_msgrcv += msgrcv; if (flagsp != NULL) *flagsp |= flags;