Re: find(1): I18N gone wild ?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Fri, 21 Apr 2023 17:41:45 UTC
Dimitry Andric <dim_at_FreeBSD.org> wrote on
Date: Fri, 21 Apr 2023 10:38:05 UTC :

> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
> > From: Poul-Henning Kamp <phk@phk.freebsd.dk>
> > Date: Monday, 17 April 2023 23:06
> > To: current@freebsd.org
> > Subject: find(1): I18N gone wild ?
> > This surprised me:
> > 
> > # mkdir /tmp/P
> > # cd /tmp/P
> > # touch FOO
> > # touch bar
> > # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > ./bar
> > 
> > Really ?!
> ...
> > My Mac and a Linux server only give ./FOO in both cases. Just a 2 cents remark.
> 
> Same here. However, I have read that with unicode, you should *never*
> use [A-Z] or [0-9], but character classes instead. That seems to give
> both files on macOS and Linux with [[:alpha:]]:
> 
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> ./BAR
> ./foo
> 
> and only the lowercase file with [[:lower:]]:
> 
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> ./foo
> 
> But on FreeBSD, these don't work at all:
> 
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> <nothing>
> 
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> <nothing>
> 
> This is an interesting rabbit hole... :)

FreeBSD:

     -name pattern
             True if the last component of the pathname being examined matches
             pattern.  Special shell pattern matching characters (“[”, “]”,
             “*”, and “?”) may be used as part of pattern.  These characters
             may be matched explicitly by escaping them with a backslash
             (“\”).

I conclude that [[:alpha:]] and [[:lower:]] were not
considered to be among the "Special shell pattern matching
characters". "man glob" indicates that glob is a
shell-specific builtin.

The macOS man page says much the same. Different shells,
different pattern notations and capabilities? Well,
"man bash" reports:

QUOTE
      Pattern Matching

        . . .
              Within [ and ], character classes can be specified using the syntax [:class:], where class is one of the following classes defined in the POSIX standard:
              alnum alpha ascii blank cntrl digit graph lower print punct space upper word xdigit
              A character class matches any character belonging to that class.  The word character class matches letters, digits, and the character _.

              Within [ and ], an equivalence class can be specified using the syntax [=c=], which matches all characters with the same collation weight (as defined by the current locale) as the
              character c.

              Within [ and ], the syntax [.symbol.] matches the collating symbol symbol.

END QUOTE

"man zsh" does not document patterns but:

sh-3.2$ echo $SHELL
/bin/zsh
sh-3.2$ find . -name '[[:lower:]]*' -print
./bar

% ls -Tldt /bin/*sh
-r-xr-xr-x  1 root  wheel  1326688 Feb  9 01:39:53 2023 /bin/bash
-rwxr-xr-x  2 root  wheel  1153216 Feb  9 01:39:53 2023 /bin/csh
-rwxr-xr-x  1 root  wheel   307232 Feb  9 01:39:53 2023 /bin/dash
-r-xr-xr-x  1 root  wheel  2598864 Feb  9 01:39:53 2023 /bin/ksh
-rwxr-xr-x  1 root  wheel   134000 Feb  9 01:39:53 2023 /bin/sh
-rwxr-xr-x  2 root  wheel  1153216 Feb  9 01:39:53 2023 /bin/tcsh
-rwxr-xr-x  1 root  wheel  1377616 Feb  9 01:39:53 2023 /bin/zsh

But in each of them, even bash, $SHELL still reports:

% echo $SHELL
/bin/zsh


With "find" not being part of the kernel, its behavior
may vary across Linux distributions.
Picking one . . .

openSUSE Tumbleweed:

       -name pattern
              Base of file name (the path with the leading directories removed) matches shell pattern pattern.  Because the leading directories are removed, the file names considered for a match
              with -name will never include a slash, so `-name a/b' will never match anything (you probably need to use -path instead).  A warning is issued if you try to do this, unless the
              environment variable POSIXLY_CORRECT is set.  The metacharacters (`*', `?', and `[]') match a `.' at the start of the base name (this is a change in findutils-4.2.2; see section
              STANDARDS CONFORMANCE below).  To ignore a directory and the files under it, use -prune rather than checking every file in the tree; see an example in the description of that action.
              Braces are not recognised as being special, despite the fact that some shells including Bash imbue braces with a special meaning in shell patterns.  The filename matching is
              performed with the use of the fnmatch(3) library function.  Don't forget to enclose the pattern in quotes in order to protect it from expansion by the shell.

"man 3 fnmatch" says:

       The fnmatch() function checks whether the string argument matches the pattern argument, which is a shell wildcard pattern (see glob(7)).
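
Since that page says the matching goes through fnmatch(3), the library call can be exercised directly, with no shell globbing involved. Below is a minimal sketch, assuming a POSIX system where Python's ctypes.CDLL(None) exposes the C library's fnmatch symbol. The helper name libc_fnmatch is my own, purely for illustration. What it prints depends on the platform's libc and on the locale in the environment, so I am not listing expected output.

```python
import ctypes
import locale

# Load the symbols already linked into the running process; on POSIX systems
# this includes the C library and therefore fnmatch(3). (Assumption: this
# lookup works on the platform at hand.)
_libc = ctypes.CDLL(None)
_libc.fnmatch.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
_libc.fnmatch.restype = ctypes.c_int


def libc_fnmatch(pattern: str, name: str) -> bool:
    """Return True if the C library's fnmatch(3) reports a match (returns 0)."""
    return _libc.fnmatch(pattern.encode(), name.encode(), 0) == 0


if __name__ == "__main__":
    try:
        # Pick up LANG/LC_ALL from the environment, as find(1) would.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # Unknown locale name in the environment: stay in the "C" locale.
        pass
    for pattern in ("[A-Z]*", "[[:alpha:]]*", "[[:lower:]]*"):
        hits = [n for n in ("FOO", "bar") if libc_fnmatch(pattern, n)]
        print(pattern, hits)
```

Running it once with LANG=C.UTF-8 and once with LANG=en_US.UTF-8 on each system would show whether the differences follow the C library rather than the shell in use.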

"man 7 glob" (not shell specific) in turn has a section on
"Character classes and internationalization" that reports:

QUOTE
. . .
. . . Therefore, POSIX extended the bracket notation greatly,
       both for wildcard patterns and for regular expressions.  In the above we saw three types of items that can occur in a bracket expression: namely (i) the negation, (ii) explicit single
       characters, and (iii) ranges.  POSIX specifies ranges in an internationally more useful way and adds three more types:

       (iii) Ranges X-Y comprise all characters that fall between X and Y (inclusive) in the current collating sequence as defined by the LC_COLLATE category in the current locale.

       (iv) Named character classes, like

       [:alnum:]  [:alpha:]  [:blank:]  [:cntrl:]
       [:digit:]  [:graph:]  [:lower:]  [:print:]
       [:punct:]  [:space:]  [:upper:]  [:xdigit:]

       so that one can say "[[:lower:]]" instead of "[a-z]", and have things work in Denmark, too, where there are three letters past 'z' in the alphabet.  These character classes are defined by
       the LC_CTYPE category in the current locale.

       (v) Collating symbols, like "[.ch.]" or "[.a-acute.]", where the string between "[." and ".]" is a collating element defined for the current locale.  Note that this may be a multicharacter
       element.

       (vi) Equivalence class expressions, like "[=a=]", where the string between "[=" and "=]" is any collating element from its equivalence class, as defined for the current locale.  For
       example, "[[=a=]]" might be equivalent to "[aáàäâ]", that is, to "[a[.a-acute.][.a-grave.][.a-umlaut.][.a-circumflex.]]".
END QUOTE
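
Item (iii) of that quote would explain the original surprise: if a range follows the collating sequence rather than code point order, "[A-Z]" can take in lowercase letters. A small sketch of the idea follows. The interleaved order "aAbB...zZ" is a hypothetical collating sequence I made up for illustration, not the actual en_US.UTF-8 table, and the function names are mine too.

```python
import string

# Hypothetical collating sequence that interleaves case: a A b B ... z Z.
# This is an illustrative assumption, not the real en_US.UTF-8 collation.
INTERLEAVED = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)
# Code point order, as in the C/POSIX locale: A..Z followed by a..z.
CODEPOINT = string.ascii_uppercase + string.ascii_lowercase


def in_range(ch: str, first: str, last: str, order: str) -> bool:
    """True if ch falls between first and last (inclusive) in the given order."""
    return order.index(first) <= order.index(ch) <= order.index(last)


def range_members(first: str, last: str, order: str) -> str:
    """All characters of `order` that the range first-last would cover."""
    return "".join(c for c in order if in_range(c, first, last, order))


if __name__ == "__main__":
    for label, order in (("code point", CODEPOINT), ("interleaved", INTERLEAVED)):
        print(label, "[A-Z] covers:", range_members("A", "Z", order))
        print(label, "'b' is inside [A-Z]:", in_range("b", "A", "Z", order))
```

Under the interleaved order, everything except 'a' lands between 'A' and 'Z', so a name like "bar" would match '[A-Z]*'. Under code point order only the uppercase letters do.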

# file /usr/bin/sh
/usr/bin/sh: symbolic link to bash


Seems like: pick your shell (as shown by echo $SHELL) and
that picks the pattern match rules used. (May be controllable
in the specific shell.)

===
Mark Millard
marklmi at yahoo.com