Re: find(1): I18N gone wild ?
Date: Fri, 21 Apr 2023 17:41:45 UTC
Dimitry Andric <dim_at_FreeBSD.org> wrote on
Date: Fri, 21 Apr 2023 10:38:05 UTC :

> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
> > From: Poul-Henning Kamp <phk@phk.freebsd.dk>
> > Date: Monday, 17 April 2023 23:06
> > To: current@freebsd.org
> > Subject: find(1): I18N gone wild ?
> > This surprised me:
> >
> > # mkdir /tmp/P
> > # cd /tmp/P
> > # touch FOO
> > # touch bar
> > # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > ./bar
> >
> > Really ?!
> ...
> > My Mac and a Linux server only give ./FOO in both cases. Just a 2 cents remark.
>
> Same here. However, I have read that with unicode, you should *never*
> use [A-Z] or [0-9], but character classes instead. That seems to give
> both files on macOS and Linux with [[:alpha:]]:
>
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> ./BAR
> ./foo
>
> and only the lowercase file with [[:lower:]]:
>
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> ./foo
>
> But on FreeBSD, these don't work at all:
>
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> <nothing>
>
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> <nothing>
>
> This is an interesting rabbit hole... :)

FreeBSD:

-name pattern
    True if the last component of the pathname being examined matches
    pattern. Special shell pattern matching characters (“[”, “]”,
    “*”, and “?”) may be used as part of pattern. These characters
    may be matched explicitly by escaping them with a backslash (“\”).

I conclude that [[:alpha:]] and [[:lower:]] were not
considered "Special shell pattern"s.

"man glob" indicates it is a shell specific builtin.

macOS says similarly.

Different shells, different pattern notations and
capabilities? Well, "man bash" reports:

QUOTE
Pattern Matching
. . .
    Within [ and ], character classes can be specified using the
    syntax [:class:], where class is one of the following classes
    defined in the POSIX standard:

    alnum alpha ascii blank cntrl digit graph lower print punct space
    upper word xdigit

    A character class matches any character belonging to that class.
    The word character class matches letters, digits, and the
    character _.

    Within [ and ], an equivalence class can be specified using the
    syntax [=c=], which matches all characters with the same collation
    weight (as defined by the current locale) as the character c.

    Within [ and ], the syntax [.symbol.] matches the collating symbol
    symbol.
END QUOTE

"man zsh" does not document patterns but:

sh-3.2$ echo $SHELL
/bin/zsh
sh-3.2$ find . -name '[[:lower:]]*' -print
./bar

% ls -Tldt /bin/*sh
-r-xr-xr-x  1 root  wheel  1326688 Feb  9 01:39:53 2023 /bin/bash
-rwxr-xr-x  2 root  wheel  1153216 Feb  9 01:39:53 2023 /bin/csh
-rwxr-xr-x  1 root  wheel   307232 Feb  9 01:39:53 2023 /bin/dash
-r-xr-xr-x  1 root  wheel  2598864 Feb  9 01:39:53 2023 /bin/ksh
-rwxr-xr-x  1 root  wheel   134000 Feb  9 01:39:53 2023 /bin/sh
-rwxr-xr-x  2 root  wheel  1153216 Feb  9 01:39:53 2023 /bin/tcsh
-rwxr-xr-x  1 root  wheel  1377616 Feb  9 01:39:53 2023 /bin/zsh

But in each, even bash,

% echo $SHELL
/bin/zsh

With "find" not being part of the kernel, Linux may have a
number of variations across the operating systems. Picking
one . . .

openSUSE tumbleweed:

-name pattern
    Base of file name (the path with the leading directories
    removed) matches shell pattern pattern. Because the leading
    directories are removed, the file names considered for a match
    with -name will never include a slash, so `-name a/b' will never
    match anything (you probably need to use -path instead). A
    warning is issued if you try to do this, unless the environment
    variable POSIXLY_CORRECT is set. The metacharacters (`*', `?',
    and `[]') match a `.' at the start of the base name (this is a
    change in findutils-4.2.2; see section STANDARDS CONFORMANCE
    below).
    To ignore a directory and the files under it, use -prune rather
    than checking every file in the tree; see an example in the
    description of that action. Braces are not recognised as being
    special, despite the fact that some shells including Bash imbue
    braces with a special meaning in shell patterns. The filename
    matching is performed with the use of the fnmatch(3) library
    function. Don't forget to enclose the pattern in quotes in order
    to protect it from expansion by the shell.

"man 3 fnmatch" says:

    The fnmatch() function checks whether the string argument matches
    the pattern argument, which is a shell wildcard pattern (see
    glob(7)).

"man 7 glob" (not shell specific) in turn has a section
on "Character classes and internationalization" that
reports:

QUOTE
. . .
. . . Therefore, POSIX extended the bracket notation greatly, both
for wildcard patterns and for regular expressions. In the above we
saw three types of items that can occur in a bracket expression:
namely (i) the negation, (ii) explicit single characters, and (iii)
ranges. POSIX specifies ranges in an internationally more useful way
and adds three more types:

(iii) Ranges X-Y comprise all characters that fall between X and Y
(inclusive) in the current collating sequence as defined by the
LC_COLLATE category in the current locale.

(iv) Named character classes, like

[:alnum:]  [:alpha:]  [:blank:]  [:cntrl:]
[:digit:]  [:graph:]  [:lower:]  [:print:]
[:punct:]  [:space:]  [:upper:]  [:xdigit:]

so that one can say "[[:lower:]]" instead of "[a-z]", and have things
work in Denmark, too, where there are three letters past 'z' in the
alphabet. These character classes are defined by the LC_CTYPE
category in the current locale.

(v) Collating symbols, like "[.ch.]" or "[.a-acute.]", where the
string between "[." and ".]" is a collating element defined for the
current locale. Note that this may be a multicharacter element.

(vi) Equivalence class expressions, like "[=a=]", where the string
between "[=" and "=]" is any collating element from its equivalence
class, as defined for the current locale. For example, "[[=a=]]"
might be equivalent to "[aáàäâ]", that is, to
"[a[.a-acute.][.a-grave.][.a-umlaut.][.a-circumflex.]]".
END QUOTE

# file /usr/bin/sh
/usr/bin/sh: symbolic link to bash

Seems like: pick your shell (as shown by echo $SHELL) and
that picks the pattern match rules used. (May be controllable
in the specific shell.)

===
Mark Millard
marklmi at yahoo.com
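
One way to check the LC_COLLATE point in the glob(7) excerpt above is to
rerun the original experiment with the locale pinned for a single
command. This is only a minimal sketch and not something posted in the
thread: the scratch directory template findtest.XXXXXX and the variable
names are made up for illustration, and it assumes a find(1) whose
-name ranges follow the collating sequence as glob(7) describes.

```shell
# Scratch directory holding the same two file names used in the thread.
# (The findtest.XXXXXX template is a hypothetical name for this sketch.)
dir=$(mktemp -d "${TMPDIR:-/tmp}/findtest.XXXXXX")
touch "$dir/FOO" "$dir/bar"

# LC_ALL overrides LANG and every LC_* category for this one command.
# In the C locale the range A-Z covers only the ASCII uppercase letters,
# so only the FOO path should be listed.
c_result=$(LC_ALL=C find "$dir" -type f -name '[A-Z]*' -print)
echo "C locale:       $c_result"

# Same pattern under whatever locale the session already has; per the
# thread, the result here can differ between systems.
env_result=$(find "$dir" -type f -name '[A-Z]*' -print)
echo "ambient locale: $env_result"

rm -rf "$dir"
```

Setting LC_ALL on just the one command leaves the locale of the rest
of the session alone, so it is a low-risk way to compare the two
behaviours side by side.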