Re: find(1): I18N gone wild ?
- In reply to: Xin LI : "Re: find(1): I18N gone wild ?"
Date: Thu, 20 Apr 2023 10:07:09 UTC
Xin LI <delphij@gmail.com> wrote:
> This is expected behavior (in en_US.UTF-8 the ordering is AaBb, not ABab).
> You might want to set LC_COLLATE to C if C behavior is desirable.
>
> On Mon, Apr 17, 2023 at 2:06 PM Poul-Henning Kamp <phk@phk.freebsd.dk>
> wrote:
>
> > This surprised me:
> >
> >     # mkdir /tmp/P
> >     # cd /tmp/P
> >     # touch FOO
> >     # touch bar
> >     # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> >     ./FOO
> >     # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> >     ./FOO
> >     ./bar
> >
> > Really ?!

TL;DR: Fix find(1) so it works as you expected. It's "legal" to do so.

Not quite expected behaviour. It used to be, but now the behaviour is
officially undefined (as mentioned in the section that Yuri quoted).

When locale collation first came in, there were numerous issues like
this, which led POSIX to change the behaviour of such ranges to
undefined. (My guess is that it had been one way for too long for them
to specifically redefine it, so "undefined" it became.)

However, "undefined" also covers the original way of doing things, and
because so many things break unexpectedly, many applications now treat
such ranges as they did pre-locales.

There would therefore be nothing wrong in changing find(1) to give the
results you expected. (In my opinion, I hope that becomes the de facto
standard.)

For further justification, note that "awk" in base (in newer versions
at least) already gives the results you'd expect, as "gawk" now does
too.

In fact, a good summary of the situation, and of why the gawk owner
reverted the code to treat all character ranges in the traditional
pre-locale way, is here:

https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Let's follow suit!

Cheers, Jamie
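
Until find(1) changes, the pre-locale result can be had in two ways. The following is only a minimal sketch, not a statement of how any particular find(1) implements matching. It assumes a scratch directory from mktemp(1), and it uses LC_ALL=C rather than the LC_COLLATE=C suggested in the quoted reply, because LC_ALL overrides LC_COLLATE when both are set. The variable names are illustrative only.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch FOO bar

# Option 1: force C collation for this one command, so that
# [A-Z] is a plain range over the ASCII upper-case letters.
out_c=$(env LC_ALL=C find . -name '[A-Z]*' -print)

# Option 2: use a character class instead of a range. The class
# names upper-case letters directly, so collation order between
# 'A' and 'Z' does not enter into it.
out_class=$(find . -name '[[:upper:]]*' -print)

echo "$out_c"
echo "$out_class"
```

Option 2 is the more portable habit in scripts, since it does not depend on the caller's environment; option 1 keeps existing patterns unchanged.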