From: Andrea Venturoli <ml@netfence.it>
To: freebsd-questions@freebsd.org
Subject: 13.3 troubles under load
Date: Tue, 2 Apr 2024 09:20:42 +0200
Message-ID: <1ca17a7a-025d-4403-a7f3-2892408ad628@netfence.it>

Hello.

Now that 13.3 is out, and given the relatively short support overlap window, I started upgrading my 13.2 machines as soon as I had the chance. However, I'm experiencing some trouble under load, in cases where every version up to 13.2 has always worked without problems.

Scenario 1: Box A is ZFS/SSD based, but has a UFS HD (holding only specific data) which is exported via NFSv4. Box B mounts that NFSv4 share and backs it up to a UFS/USB disk via rsync. This has always worked fine until I upgraded box A to 13.3.
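To give a clearer picture, the setup is roughly the following (host names, paths and the network below are only placeholders, not my exact configuration):

  # box A, /etc/exports: the UFS HD, exported via NFSv4
  V4: / -sec=sys
  /data -maproot=root -network 10.1.2.0 -mask 255.255.255.0

  # box B: mount the share and copy it to the UFS/USB disk
  mount -t nfs -o nfsv4,ro boxa:/data /mnt/data
  rsync -a --delete /mnt/data/ /usbdisk/backup/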
Now, while rsync does its job, box A starts crawling: Nagios reports several failures (daemons which either die or can no longer answer in time) and logging in via SSH becomes almost impossible (already open sessions are nearly unusable). The system is on ZFS, so it should not be affected by the load on the UFS HD; besides, a single UFS HD should not be able to generate enough load to bring an 8-core system with 32 GiB of RAM to a halt. Is it possible that such modest network traffic (lagg with two em cards) brings this box almost to a standstill? Unfortunately, so far I don't have any useful logs.

Scenario 2: A box is running several services (including two clamd instances in two different jails). Once a week, it connects to a NAS via Bacula and copies ~1 TB of data to an external UFS HD. As in the previous example, after I upgraded to 13.3 this simple operation (which had worked for several years) has become problematic, as daemons are killed all through it:

> Apr 1 20:01:31 xxxxxxx kernel: pid 11753 (clamd), jid 3, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:02:18 xxxxxxx kernel: pid 11720 (clamd), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:03:16 xxxxxxx kernel: pid 3707 (squid), jid 3, uid 100, was killed: a thread waited too long to allocate a page
> Apr 1 20:03:54 xxxxxxx kernel: pid 7400 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr 1 20:04:25 xxxxxxx kernel: pid 1813 (snort), jid 0, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:05:59 xxxxxxx kernel: pid 7399 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr 1 20:05:59 xxxxxxx kernel: pid 1820 (snort), jid 0, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:06:48 xxxxxxx kernel: pid 44493 (perl), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:07:22 xxxxxxx kernel: pid 44512 (perl), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:09:23 xxxxxxx kernel: pid 7254 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr 1 20:10:17 xxxxxxx kernel: pid 14462 (mysqld), jid 11, uid 88, was killed: a thread waited too long to allocate a page
> Apr 1 20:10:17 xxxxxxx kernel: pid 83231 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:10:17 xxxxxxx kernel: pid 28868 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:10:17 xxxxxxx kernel: pid 92611 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr 1 20:12:20 xxxxxxx kernel: pid 77438 (clamd), jid 3, uid 26, was killed: a thread waited too long to allocate a page
> Apr 1 20:13:47 xxxxxxx kernel: pid 77473 (clamd), jid 5, uid 26, was killed: a thread waited too long to allocate a page

Again, system/swap is on an SSD ZFS RAID pool, so disk load on the UFS USB HD shouldn't hamper its throughput. This time the network is still a lagg, but with igb cards (so a similar driver).

Any hint on what to look for? Is there some known problem with lagg, if_em/if_igb, USB, UFS, or something else?

 bye & Thanks
	av.
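P.S. In case it helps, this is what I plan to collect during the next backup run (standard tools only; the vm.pfault_oom_* sysctls are just my guess at what's involved, prompted by the wording of the kill message):

  # watch disk, memory and per-thread CPU while Bacula/rsync is running
  gstat -p
  vmstat -w 5
  top -SH -o res

  # current values of the page-fault OOM tunables
  sysctl vm.pfault_oom_attempts vm.pfault_oom_wait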