Re: find(1): I18N gone wild ?

From: Yuri <yuri_at_aetern.org>
Date: Mon, 17 Apr 2023 21:33:04 UTC

Xin LI wrote:
> This is expected behavior (in en_US.UTF-8 the ordering is AaBb, not
> ABab).  You might want to set LC_COLLATE to C if C behavior is desirable.
> 
> On Mon, Apr 17, 2023 at 2:06 PM Poul-Henning Kamp <phk@phk.freebsd.dk>
> wrote:
> 
>     This surprised me:
> 
>             # mkdir /tmp/P
>             # cd /tmp/P
>             # touch FOO
>             # touch bar
>             # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
>             ./FOO
>             # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
>             ./FOO
>             ./bar
> 
>     Really ?!

A bit more detail:

find matches -name patterns with fnmatch(3), whose bracket expressions
follow the RE Bracket Expression rules (except that ! is used instead of
^ for negation, but that's unrelated):

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

...which has the following note:

7. In the POSIX locale, a range expression represents the set of
collating elements that fall between two elements in the collation
sequence, inclusive. In other locales, a range expression has
unspecified behavior: strictly conforming applications shall not rely on
whether the range expression is valid, or on the set of collating
elements matched.
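
A minimal sketch of what that means in practice, reusing the FOO/bar
setup from the original report. The temporary directory and the variable
names are only for illustration. LC_ALL is used instead of LC_COLLATE
because it overrides both LC_COLLATE and LANG if they happen to be set.
The second variant avoids the range expression altogether by using a
character class, which should not depend on collation order:

```shell
# Scratch directory with one upper-case and one lower-case name.
dir=$(mktemp -d)
touch "$dir/FOO" "$dir/bar"

# Variant 1: force the POSIX (C) locale, where [A-Z] is well defined.
out_c=$(cd "$dir" && env LC_ALL=C find . -name '[A-Z]*' -print)

# Variant 2: keep the locale but use a character class, not a range.
# (If en_US.UTF-8 is not installed, find falls back to the C locale.)
out_class=$(cd "$dir" && env LC_ALL=en_US.UTF-8 find . -name '[[:upper:]]*' -print)

echo "$out_c"
echo "$out_class"

rm -rf "$dir"
```

Both commands are expected to print only ./FOO.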

Indeed, it's unfortunate that collations in non-POSIX locales are not
that... linear, so range expressions can break, but I don't see an easy
way of "fixing" this.