Date: Fri, 21 Apr 2023 22:05:55 +0200
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
Subject: Re: find(1): I18N gone wild?
 [[:alpha:]] not a substitute to refer 26 English letters A-Z
From: Yuri
To: freebsd-current@freebsd.org
References: <86efedcf-e3ed-be0c-79ab-03f0d4a743af@aetern.org>
In-Reply-To: <86efedcf-e3ed-be0c-79ab-03f0d4a743af@aetern.org>

Yuri wrote:
> parv/FreeBSD wrote:
>> Wrote Dimitry Andric on Fri, 21 Apr 2023 10:38:05 UTC
>> (via
>> https://lists.freebsd.org/archives/freebsd-current/2023-April/003556.html )
>>>
>>> ... However, I have read that with unicode, you should *never*
>>> use [A-Z] or [0-9], but character classes instead. That seems to give
>>> both files on macOS and Linux with [[:alpha:]]:
>> ...
>>
>> Subject to the locale, problem with that is "[[:alpha:]]" will match
>> more than 26 English letters "A" through "Z" (besides also matching
>> lower case "a" through "z") even if none of 26 * 2 English alphabets
>> appear in a string.
>
> (replying to random recent message)
>
> And there is a bit of quite recent history for fnmatch() related to
> [a-z], same was done for regex with the same outcome -- attempt to make
> [a-z] (guess [A-Z] as well) range non-collating failed.
> I am not aware of the encountered failures, hopefully someone should
> remember:

I just tried a less intrusive change that seems to help with these
ranges (though the question of what failed previously remains):

diff --git a/lib/libc/gen/fnmatch.c b/lib/libc/gen/fnmatch.c
index 40670545993..3234c1aaaa4 100644
--- a/lib/libc/gen/fnmatch.c
+++ b/lib/libc/gen/fnmatch.c
@@ -295,10 +295,11 @@ rangematch(const char *pattern, wchar_t test, int flags, char **newp,
 			if (flags & FNM_CASEFOLD)
 				c2 = towlower(c2);
 
-			if (table->__collate_load_error ?
+			if (table->__collate_load_error ||
+			    iswascii(test) ?
 			    c <= test && test <= c2 :
-			       __wcollate_range_cmp(c, test) <= 0
-			    && __wcollate_range_cmp(test, c2) <= 0
+			    __wcollate_range_cmp(c, test) <= 0 &&
+			    __wcollate_range_cmp(test, c2) <= 0
 			   )
 				ok = 1;
 		} else if (c == test)

$ LC_ALL=en_US.UTF-8 LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 find . -name '[a-z]*'
./bar
$ LC_ALL=en_US.UTF-8 LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 find . -name '[A-Z]*'
./FOO

> --------
> commit 5a5807dd4ca34467ac5fb458bc19f12bf62075a5
> Author: Andrey A. Chernov
> Date:   Sun Jul 10 03:49:38 2016 +0000
>
>     Remove broken support for collation in [a-z] type ranges.
>     Only first 256 wide chars are considered currently, all other are just
>     dropped from the range. Proper implementation require reverse tables
>     database lookup, since objects are really big as max UTF-8 (1114112
>     code points), so just the same scanning as it was for 256 chars will
>     slow things down.
>
>     POSIX does not require collation for [a-z] type ranges and does not
>     prohibit it for non-POSIX locales. POSIX require collation for ranges
>     only for POSIX (or C) locale which is equal to ASCII and binary for
>     other chars, so we already have it.
>
>     No other *BSD implements collation for [a-z] type ranges.
>
>     Restore ABI compatibility with unused now __collate_range_cmp() which
>     is visible from outside (will be removed later).
> --------
> commit 1daad8f5ad767dfe7896b8d1959a329785c9a76b
> Author: Andrey A. Chernov
> Date:   Thu Jul 14 08:18:12 2016 +0000
>
>     Back out non-collating [a-z] ranges.
>     Instead of changing whole course to another POSIX-permitted way
>     for consistency and uniformity I decide to completely ignore missing
>     regex fucntionality and concentrace on fixing bugs in what we have now,
>     too many small obstacles instead, counting ports.
> --------
> commit 12eae8c8f346cb459a388259ca98faebdac47038
> Author: Andrey A. Chernov
> Date:   Thu Jul 14 09:07:25 2016 +0000
>
>     1) Eliminate possibility to call __*collate_range_cmp() with inclomplete
>     locale (which cause core dump) by removing whole 'table' argument
>     by which it passed.
>
>     2) Restore __collate_range_cmp() in __sccl().
>
>     3) Collating [a-z] range in regcomp() only for single bytes locales
>     (we can't do it now for other ones). In previous state only first 256
>     wchars are considered and all others are just silently dropped from the
>     range.
> --------
>
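
For anyone who wants to play with the idea outside of libc, here is a
minimal standalone sketch of the range check from the diff above. It is
not the libc code: range_cmp(), in_range() and have_collation are
made-up names, range_cmp() is only assumed to behave like the internal
__wcollate_range_cmp() (it compares one-character strings with
wcscoll()), and the "< 0x80" test stands in for iswascii(), which is
not part of ISO C.

```c
#include <assert.h>	/* only needed for the checks shown below */
#include <stdbool.h>
#include <wchar.h>

/*
 * Hypothetical stand-in for libc's internal __wcollate_range_cmp():
 * order two wide characters by the current locale's collation.
 */
static int
range_cmp(wchar_t a, wchar_t b)
{
	wchar_t s1[2] = { a, L'\0' };
	wchar_t s2[2] = { b, L'\0' };

	return (wcscoll(s1, s2));
}

/*
 * Hypothetical helper: does 'test' fall inside the range [lo-hi]?
 * 'have_collation' plays the role of !table->__collate_load_error.
 */
static bool
in_range(wchar_t lo, wchar_t test, wchar_t hi, bool have_collation)
{
	/* No collation table, or an ASCII test char: code point order. */
	if (!have_collation || (unsigned long)test < 0x80)
		return (lo <= test && test <= hi);

	/* Anything else: fall back to locale collation order. */
	return (range_cmp(lo, test) <= 0 && range_cmp(test, hi) <= 0);
}
```

With this, an ASCII test character never reaches wcscoll(), so the
following should hold in any locale:

    assert(in_range(L'a', L'b', L'z', true));
    assert(!in_range(L'a', L'B', L'z', true));

Non-ASCII characters still go through collation when a table is
loaded, so this only pins down the behaviour of [a-z] and [A-Z] for
ASCII names like the ./bar and ./FOO above.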