From: Jamie Landeg-Jones
Message-Id: <202304201008.33KA8TpX077655@donotpassgo.dyslexicfish.net>
Date: Thu, 20 Apr 2023 11:08:29 +0100
Organization: Dyslexic Fish
To: yuri@aetern.org, phk@phk.freebsd.dk, delphij@gmail.com
Cc: current@FreeBSD.org
Subject: Re: find(1): I18N gone wild ?

Xin LI wrote:

> This is expected behavior (in en_US.UTF-8 the ordering is AaBb, not ABab).
> You might want to set LC_COLLATE to C if C behavior is desirable.
>
> On Mon, Apr 17, 2023 at 2:06 PM Poul-Henning Kamp wrote:
>
> > This surprised me:
> >
> >         # mkdir /tmp/P
> >         # cd /tmp/P
> >         # touch FOO
> >         # touch bar
> >         # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> >         ./FOO
> >         # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> >         ./FOO
> >         ./bar
> >
> > Really ?!

TL;DR Fix find(1) so it works as you expected. It's "legal" to do so.

Not quite expected behaviour.
It used to be, but the behaviour is now officially undefined (as mentioned in
the section that Yuri quoted).

When locale collation first came in, there were numerous issues like this,
which led POSIX to change it to undefined. (My guess is that it had been one
way for too long for them to specifically redefine it, so "undefined" it
became.)

However, "undefined" also covers the original way of doing things, and because
so many things break unexpectedly, many applications now treat such ranges as
they did pre-locales.

There would therefore be nothing wrong with changing find(1) to give the
results you expected (and I hope that becomes the de facto standard).

For further justification, note that "awk" in base (in newer versions at
least) already gives the results you'd expect, as does "gawk" now.

In fact, a good summary of the situation, and of why the gawk maintainer
reverted the code to treat all character ranges in the traditional pre-locale
way, is here:

https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Let's follow suit!

Cheers, Jamie
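
Until find(1) itself changes, the workaround suggested in the quoted reply can be tried in isolation. The following is only a minimal sketch, assuming a POSIX sh and a mktemp(1) that accepts a bare -d. It uses LC_ALL rather than LC_COLLATE so that an LC_ALL already set in the environment cannot override it, and the variable names dir and c_result are illustrative, not from the thread.

```shell
#!/bin/sh
# Recreate the test case from the quoted message in a scratch directory.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
touch FOO bar

# Forcing the C locale for this one command makes [A-Z] a plain
# code-point range, whatever LANG is set to in the environment.
c_result=$(LC_ALL=C find . -name '[A-Z]*' -print)
printf '%s\n' "$c_result"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```

Only ./FOO is printed, which matches the C.UTF-8 result shown in the quoted transcript. Setting the locale per command keeps the rest of the session in its usual locale.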