From: Mateusz Guzik
Date: Tue, 31 Oct 2023 17:06:08 +0100
Subject: Re: vnode_init took over 6 minutes
To: John F Carr
Cc: Freebsd fs
List-Id: Filesystems
List-Archive: https://lists.freebsd.org/archives/freebsd-fs
On 10/31/23, John F Carr wrote:
>
>
>> On Oct 31, 2023, at 11:42, Mateusz Guzik wrote:
>>
>> On 10/31/23, John F Carr wrote:
>>>
>>>
>>>> On Oct 31, 2023, at 10:16, Mateusz Guzik wrote:
>>>>
>>>> On 10/31/23, John F Carr wrote:
>>>>> During a poudriere build control-T showed an awk process was stuck
>>>>> for over 6 minutes trying to acquire a lock in vnode_init.
>>>>>
>>>>> load: 8.58 cmd: awk 51238 [running] 392.88r 0.00u 392.88s 100% 2604k
>>>>> __mtx_lock_sleep+0xf8 __mtx_lock_flags+0xa4 vnode_init+0xc3
>>>>> keg_alloc_slab+0x277 zone_import+0x143 cache_alloc+0x3ed
>>>>> cache_alloc_retry+0x23 getnewvnode_reserve+0x20 zfs_zget+0x1f
>>>>> zfs_dirent_lookup+0x16d zfs_dirlook+0x7f zfs_lookup+0x3e0
>>>>> zfs_freebsd_cachedlookup+0x74 vfs_cache_lookup+0xa7
>>>>> cache_fplookup_noentry+0x241 cache_fplookup+0x575 namei+0x1ea
>>>>> vn_open_cred+0x48d
>>>>>
>>>>> The stack trace was the same for several minutes. System CPU time
>>>>> was increasing.
>>>>>
>>>>> Address vnode_init+0xc3 corresponds to the mtx_lock call here.
>>>>>
>>>>>         vp->v_holdcnt = VHOLD_NO_SMR;
>>>>>         vp->v_type = VNON;
>>>>>         mtx_lock(&vnode_list_mtx);
>>>>>         TAILQ_INSERT_BEFORE(vnode_list_free_marker, vp, v_vnodelist);
>>>>>         mtx_unlock(&vnode_list_mtx);
>>>>>         return (0);
>>>>>
>>>>> Address __mtx_lock_sleep+0xf8 is the instruction after a call to
>>>>> lock_delay.
>>>>>
>>>>> ps says the command line was
>>>>>
>>>>>   /usr/bin/awk -f /usr/bin/awk
>>>>>   old-default/2023-10-31_08h21m03s/.poudriere.builders
>>>>>   old-default/2023-10-31_08h21m03s/.poudriere.buildname ...
>>>>>
>>>>> with the full list of input files exceeding the ~2KB command line
>>>>> length limit of ps. "/usr/bin/awk" is probably not the real second
>>>>> argument. It would cause an immediate syntax error.
>>>>>
>>>>> The hang resolved within a few more minutes and poudriere is
>>>>> continuing happily.
>>>>>
>>>>> I have never seen such behavior before. Code in vfs_subr.c tries not
>>>>> to hold the vnode_list_mtx lock for too long.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> FreeBSD 13.2-STABLE up through commit b180f0040f95, 24 core 48 thread
>>>>> Zen 2, zfs pool on SSD connected via 12 Gbps SAS with cache on NVME,
>>>>> 160 GB RAM.
>>>>>
>>>>
>>>> what does "sysctl vfs.vnode" say
>>>>
>>>>
>>>> --
>>>> Mateusz Guzik
>>>
>>> With the builds now complete,
>>>
>>> vfs.vnode.vnlru.uma_reclaim_calls: 0
>>> vfs.vnode.vnlru.kicks: 0
>>> vfs.vnode.vnlru.max_free_per_call: 10000
>>> vfs.vnode.vnlru.failed_runs: 0
>>> vfs.vnode.vnlru.direct_recycles_free: 152871194
>>> vfs.vnode.vnlru.recycles_free: 77190881
>>> vfs.vnode.vnlru.recycles: 0
>>> vfs.vnode.stats.alloc_sleeps: 0
>>> vfs.vnode.stats.free: 1269702
>>> vfs.vnode.stats.skipped_requeues: 9375158
>>> vfs.vnode.stats.created: 235594976
>>> vfs.vnode.stats.count: 1407640
>>> vfs.vnode.param.wantfree: 681059
>>> vfs.vnode.param.limit: 2724237
>>>
>>> Uptime is 11 days, mostly with low I/O rates.
>>>
>>>
>>
>> please add the uma stats: "sysctl vm.uma.VNODE"
>>
>> --
>> Mateusz Guzik
>
> vm.uma.VNODE.stats.xdomain: 0
> vm.uma.VNODE.stats.fails: 0
> vm.uma.VNODE.stats.frees: 245129082
> vm.uma.VNODE.stats.allocs: 246536724
> vm.uma.VNODE.stats.current: 1407642
> vm.uma.VNODE.domain.0.timin: 215
> vm.uma.VNODE.domain.0.limin: 4613
> vm.uma.VNODE.domain.0.wss: 10080
> vm.uma.VNODE.domain.0.bimin: 391986
> vm.uma.VNODE.domain.0.imin: 391986
> vm.uma.VNODE.domain.0.imax: 391986
> vm.uma.VNODE.domain.0.nitems: 391986
> vm.uma.VNODE.limit.bucket_max: 18446744073709551615
> vm.uma.VNODE.limit.sleeps: 0
> vm.uma.VNODE.limit.sleepers: 0
> vm.uma.VNODE.limit.max_items: 0
> vm.uma.VNODE.limit.items: 0
> vm.uma.VNODE.keg.domain.0.free_slabs: 0
> vm.uma.VNODE.keg.domain.0.free_items: 324187
> vm.uma.VNODE.keg.domain.0.pages: 266134
> vm.uma.VNODE.keg.efficiency: 95
> vm.uma.VNODE.keg.reserve: 0
> vm.uma.VNODE.keg.align: 7
> vm.uma.VNODE.keg.ipers: 8
> vm.uma.VNODE.keg.ppera: 1
> vm.uma.VNODE.keg.rsize: 488
> vm.uma.VNODE.keg.name: VNODE
> vm.uma.VNODE.bucket_size_max: 254
> vm.uma.VNODE.bucket_size: 88
> vm.uma.VNODE.flags: 0xd0000
> vm.uma.VNODE.size: 488
>

Ok, this is mostly what I was looking for:
> vm.uma.VNODE.bucket_size: 88

That, and vnlru being active. The codepath that appeared stalled may very
well have been making progress.

There is an API deficiency in UMA: it calls the constructor/destructor for
each object separately, which in this case means 88 back-to-back acquisitions
of vnode_list_mtx. Throw other CPUs into the mix trying to do the same thing
(or relinquish vnode memory), plus vnlru, and the cost of waiting on the lock
gets multiplied big time.

vnlru scans do happen more frequently than they need to and may hold the lock
longer than justified, but the key point is that the problem is artificially
exacerbated by a factor of 88 for each participating CPU.

I'm going to hack something up about it.

--
Mateusz Guzik
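
To make the multiplier concrete, here is a minimal userland sketch (plain C
with pthreads, not kernel code) contrasting the per-object pattern with a
hypothetical batched variant. All names in it (obj, fill_bucket_per_object,
fill_bucket_batched, list_mtx) are illustrative stand-ins rather than real
UMA or VFS interfaces; it only models the number of lock acquisitions, not
the actual allocator.

/*
 * Sketch: filling a "bucket" of 88 objects where a per-object constructor
 * takes a shared list lock each time, versus a hypothetical batched variant
 * that takes the lock once for the whole bucket.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define BUCKET_SIZE 88          /* matches vm.uma.VNODE.bucket_size above */

struct obj {
        struct obj *next;       /* stand-in for v_vnodelist linkage */
};

static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct obj *global_list; /* stand-in for the global vnode list */
static unsigned long lock_acquires;

/* Mirrors the current behavior: every constructed object takes the lock. */
static void
ctor_per_object(struct obj *o)
{
        pthread_mutex_lock(&list_mtx);
        lock_acquires++;
        o->next = global_list;
        global_list = o;
        pthread_mutex_unlock(&list_mtx);
}

static void
fill_bucket_per_object(struct obj *bucket, size_t n)
{
        for (size_t i = 0; i < n; i++)
                ctor_per_object(&bucket[i]);    /* n lock round-trips */
}

/* Hypothetical batched variant: one lock acquisition per bucket. */
static void
fill_bucket_batched(struct obj *bucket, size_t n)
{
        pthread_mutex_lock(&list_mtx);
        lock_acquires++;
        for (size_t i = 0; i < n; i++) {
                bucket[i].next = global_list;
                global_list = &bucket[i];
        }
        pthread_mutex_unlock(&list_mtx);
}

int
main(void)
{
        static struct obj bucket[BUCKET_SIZE];

        fill_bucket_per_object(bucket, BUCKET_SIZE);
        printf("per-object: %lu lock acquisitions\n", lock_acquires);

        lock_acquires = 0;
        global_list = NULL;
        fill_bucket_batched(bucket, BUCKET_SIZE);
        printf("batched:    %lu lock acquisitions\n", lock_acquires);
        return (0);
}

With BUCKET_SIZE set to the reported vm.uma.VNODE.bucket_size of 88, the
per-object variant takes the shared mutex 88 times per bucket while the
batched variant takes it once; that difference is what gets multiplied
across every CPU contending with vnlru on vnode_list_mtx.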