From: Alan Somers
Date: Mon, 22 Apr 2024 10:46:01 -0600
Subject: Re: Stressing malloc(9)
Cc: Mark Johnston, FreeBSD Hackers
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On Sun, Apr 21, 2024 at 5:47 PM Alan Somers wrote:
>
> On Sun, Apr 21, 2024 at 10:09 AM Mark Johnston wrote:
> >
> > On Sat, Apr 20, 2024 at 11:23:41AM -0600, Alan Somers wrote:
> > > On Sat, Apr 20, 2024 at 9:07 AM Mark Johnston wrote:
> > > >
> > > > On Fri, Apr 19, 2024 at 04:23:51PM -0600, Alan Somers wrote:
> > > > > TLDR;
> > > > > How can I create a workload that causes malloc(9)'s performance to plummet?
> > > > >
> > > > > Background:
> > > > > I recently witnessed a performance problem on a production server.
> > > > > Overall throughput dropped by over 30x.  dtrace showed that 60% of the
> > > > > CPU time was dominated by lock_delay as called by three functions:
> > > > > printf (via ctl_worker_thread), g_eli_alloc_data, and
> > > > > g_eli_write_done.  One thing those three have in common is that they
> > > > > all use malloc(9).  Fixing the problem was as simple as telling CTL to
> > > > > stop printing so many warnings, by tuning
> > > > > kern.cam.ctl.time_io_secs=100000.
> > > > >
> > > > > But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
> > > > > in lock_delay via g_eli_alloc_data.  So I believe that malloc is
> > > > > limiting geli's performance.  I would like to try replacing it with
> > > > > uma(9).
> > > >
> > > > What is the size of the allocations that g_eli_alloc_data() is doing?
> > > > malloc() is a pretty thin layer over UMA for allocations <= 64KB.
> > > > Larger allocations are handled by a different path (malloc_large())
> > > > which goes directly to the kmem_* allocator functions.  Those functions
> > > > are very expensive: they're serialized by global locks and need to
> > > > update the pmap (and perform TLB shootdowns when memory is freed).
> > > > They're not meant to be used at a high rate.
> > >
> > > In my benchmarks so far, 512B.  In the real application the size is
> > > mostly between 4k and 16k, and it's always a multiple of 4k.  But it's
> > > sometimes large enough to use malloc_large, and it's those
> > > malloc_large calls that account for the majority of the time spent in
> > > g_eli_alloc_data.  lockstat shows that malloc_large, as called by
> > > g_eli_alloc_data, sometimes blocks for multiple ms.
> > >
> > > But oddly, if I change the parameters so that g_eli_alloc_data
> > > allocates 128kB, I still don't see malloc_large getting called.  And
> > > both dtrace and vmstat show that malloc is mostly operating on 512B
> > > allocations.  But dtrace does confirm that g_eli_alloc_data is being
> > > called with 128kB arguments.  Maybe something is getting inlined?
> >
> > malloc_large() is annotated __noinline, for what it's worth.
> >
> > > I don't understand how this is happening.  I could probably figure it
> > > out if I recompiled with some extra SDT probes, though.
> >
> > What is g_eli_alloc_sz on your system?
>
> 33kiB.  That's larger than I expected.  When I use a larger blocksize
> in my benchmark, then I do indeed see malloc_large activity, and 11%
> of the CPU is spent in g_eli_alloc_data.
>
> I guess I'll add some UMA zones for this purpose.  I'll try 256k and
> 512k zones, rounding up allocations as necessary.  Thanks for the tip.
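For concreteness, the zone setup I had in mind looked roughly like
this.  It's only a sketch: the names (eli_zone_*, eli_alloc, eli_free)
and the two size classes are made up for illustration, though M_ELI is
geli's real malloc type.  The idea is just to round each request up to
a preallocated zone so the hot path never reaches malloc_large():

    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    /* Hypothetical zones covering the common geli buffer sizes. */
    static uma_zone_t eli_zone_256k;
    static uma_zone_t eli_zone_512k;

    static void
    eli_zones_init(void)
    {
            eli_zone_256k = uma_zcreate("eli_256k", 256 * 1024,
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
            eli_zone_512k = uma_zcreate("eli_512k", 512 * 1024,
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
    }

    /*
     * Round the request up to the smallest zone that fits; anything
     * larger still falls back to malloc(9), i.e. malloc_large().
     */
    static void *
    eli_alloc(size_t sz)
    {
            if (sz <= 256 * 1024)
                    return (uma_zalloc(eli_zone_256k, M_NOWAIT));
            if (sz <= 512 * 1024)
                    return (uma_zalloc(eli_zone_512k, M_NOWAIT));
            return (malloc(sz, M_ELI, M_NOWAIT));
    }

    static void
    eli_free(void *p, size_t sz)
    {
            /* The caller must remember sz to pick the right zone. */
            if (sz <= 256 * 1024)
                    uma_zfree(eli_zone_256k, p);
            else if (sz <= 512 * 1024)
                    uma_zfree(eli_zone_512k, p);
            else
                    free(p, M_ELI);
    }
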
When I said "33kiB" I meant "33 pages", or 132 kB.  And the solution
turns out to be very easy.  Since I'm using ZFS on top of geli, with
the default recsize of 128kB, I'll just set
vfs.zfs.vdev.aggregation_limit to 128 kB.  That way geli will never
need to allocate more than 128kB contiguously.  ZFS doesn't even need
those big allocations to be contiguous; it's just aggregating smaller
operations to reduce disk IOPS.  But aggregating up to 1MB (the
default) is overkill; any rotating HDD should easily be able to max
out its sequential write IOPS with a 128kB operation size.

I'll add a read-only sysctl for g_eli_alloc_sz too.  Thanks, Mark.
-Alan
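P.S. That read-only sysctl will probably look something like the
following.  Again a sketch, not the final patch: the leaf name and the
description string are my guess, but _kern_geom_eli is the existing
geli sysctl node and CTLFLAG_RD makes it read-only:

    #include <sys/sysctl.h>

    /* Expose geli's precomputed allocation bound, read-only. */
    SYSCTL_UINT(_kern_geom_eli, OID_AUTO, alloc_sz, CTLFLAG_RD,
        &g_eli_alloc_sz, 0, "Maximum size of a geli data allocation");

That would let an admin check, e.g. with "sysctl kern.geom.eli.alloc_sz",
whether their workload's I/O sizes will hit the malloc_large() path.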