Re: find(1): I18N gone wild ?
Date: Fri, 21 Apr 2023 19:51:55 UTC
Yuri <yuri_at_aetern.org> wrote on
Date: Fri, 21 Apr 2023 18:18:21 UTC :

> Yuri wrote:
> > Mark Millard wrote:
> >> Dimitry Andric <dim_at_FreeBSD.org> wrote on
> >> Date: Fri, 21 Apr 2023 10:38:05 UTC :
> >>
> >>> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
> >>>> From: Poul-Henning Kamp <phk@phk.freebsd.dk>
> >>>> Date: Monday, 17 April 2023 23:06
> >>>> To: current@freebsd.org
> >>>> Subject: find(1): I18N gone wild ?
> >>>> This surprised me:
> >>>>
> >>>> # mkdir /tmp/P
> >>>> # cd /tmp/P
> >>>> # touch FOO
> >>>> # touch bar
> >>>> # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> >>>> ./FOO
> >>>> # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> >>>> ./FOO
> >>>> ./bar
> >>>>
> >>>> Really ?!
> >>> ...
> >>>> My Mac and a Linux server only give ./FOO in both cases. Just a 2 cents remark.
> >>>
> >>> Same here. However, I have read that with unicode, you should *never*
> >>> use [A-Z] or [0-9], but character classes instead. That seems to give
> >>> both files on macOS and Linux with [[:alpha:]]:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> >>> ./BAR
> >>> ./foo
> >>>
> >>> and only the lowercase file with [[:lower:]]:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> >>> ./foo
> >>>
> >>> But on FreeBSD, these don't work at all:
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> >>> <nothing>
> >>>
> >>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> >>> <nothing>
> >>>
> >>> This is an interesting rabbit hole... :)
> >>
> >> FreeBSD:
> >>
> >> -name pattern
> >> True if the last component of the pathname being examined matches
> >> pattern. Special shell pattern matching characters (“[”, “]”,
> >> “*”, and “?”) may be used as part of pattern. These characters
> >> may be matched explicitly by escaping them with a backslash
> >> (“\”).
> >>
> >> I conclude that [[:alpha:]] and [[:lower:]] were not
> >> considered "Special shell pattern"s. "man glob"
> >> indicates it is a shell specific builtin.
> >>
> >> macOS says similarly. Different shells, different
> >> pattern notations and capabilities? Well, "man bash"
> >> reports:
> > [snip]
> >> Seems like: pick your shell (as shown by echo $SHELL) and
> >> that picks the pattern match rules used. (May be controllable
> >> in the specific shell.)
> >
> > No, the pattern is not passed to shell and shell used should not matter
> > (pattern should be properly escaped). The rules are here:
> >
> > https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13
> >
> > ...which in turn refers to the following link for bracket expressions:
> >
> > https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
> >
> > Why we don't support all of that is different story.
>
> A bit more on this; first link applies both to find(1) and fnmatch(3),
> and find uses fnmatch() internally (which is good), but even the
> function that processes bracket expressions is called rangematch() and
> that's really all it does ignoring other bracket expression rules:
>
> https://cgit.freebsd.org/src/tree/lib/libc/gen/fnmatch.c#n234
>
> So to "fix" find we just need to implement the bracket expressions
> properly in fnmatch().

Too bad the -name documentation does not track this but
points to shell notation.

The following confirms that even IEEE Std 1003.1-2001,
which FreeBSD's find is documented to be based on, already
specified the notation that you reference.

FreeBSD's man page reports:

STANDARDS
     The find utility syntax is a superset of the syntax specified by the IEEE
     Std 1003.1-2001 (“POSIX.1”) standard.

     All the single character options except -H and -L as well as -amin,
     -anewer, -cmin, -cnewer, -delete, -empty, -fstype, -iname, -inum,
     -iregex, -ls, -maxdepth, -mindepth, -mmin, -not, -path, -print0, -regex,
     -sparse and all of the -B* birthtime related primaries are extensions to
     IEEE Std 1003.1-2001 (“POSIX.1”).
. . .

IEEE Std 1003.1-2001 find looks to be at:

https://pubs.opengroup.org/onlinepubs/009604499/utilities/find.html

-name pattern
    The primary shall evaluate as true if the basename of the filename
    being examined matches pattern using the pattern matching notation
    described in Pattern Matching Notation.

https://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html#tag_02_13

[
    The open bracket shall introduce a pattern bracket expression.

The description of basic regular expression bracket expressions in the
Base Definitions volume of IEEE Std 1003.1-2001, Section 9.3.5, RE Bracket
Expression shall also apply to the pattern bracket expression,

https://pubs.opengroup.org/onlinepubs/009604499/basedefs/xbd_chap09.html#tag_09_03_05

  • A character class expression shall represent the union of two sets:
      • The set of single-character collating elements whose characters
        belong to the character class, as defined in the LC_CTYPE category
        in the current locale.
      • An unspecified set of multi-character collating elements.

    All character classes specified in the current locale shall be
    recognized. A character class expression is expressed as a character
    class name enclosed within bracket-colon ( "[:" and ":]" ) delimiters.

    The following character class expressions shall be supported in all
    locales:

    [:alnum:]   [:cntrl:]   [:lower:]   [:space:]
    [:alpha:]   [:digit:]   [:print:]   [:upper:]
    [:blank:]   [:graph:]   [:punct:]   [:xdigit:]

    In addition, character class expressions of the form:

    [:name:]

    are recognized in those locales where the name keyword has been given
    a charclass definition in the LC_CTYPE category.

===
Mark Millard
marklmi at yahoo.com
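
P.S. To make the quoted character class rule concrete, here is a
minimal sketch of how a "[:name:]" element inside a bracket expression
might be tested against a character. It is not taken from FreeBSD's
fnmatch.c. The helper name class_match() and its calling convention are
hypothetical, and it assumes the caller has already consumed the "[:"
and decoded the candidate character to a wide character. It relies only
on the standard wctype() and iswctype() functions, which look the class
up in the current locale's LC_CTYPE category.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (illustration only, not FreeBSD code).
 *
 * 'p' points just past the "[:" of a class expression such as
 * "[:alpha:]".  On success, *matched says whether 'c' belongs to the
 * named class in the current locale, and the return value points just
 * past the closing ":]".  Returns NULL if the expression is
 * unterminated, empty, or the class name is too long.
 */
static const char *
class_match(const char *p, wint_t c, bool *matched)
{
    char name[32];
    const char *end;
    size_t len;
    wctype_t type;

    assert(p != NULL && matched != NULL);

    end = strstr(p, ":]");
    if (end == NULL)
        return NULL;
    len = (size_t)(end - p);
    if (len == 0 || len >= sizeof(name))
        return NULL;
    memcpy(name, p, len);
    name[len] = '\0';

    /* wctype() returns 0 for a name the current locale does not define. */
    type = wctype(name);
    *matched = (type != (wctype_t)0) && iswctype(c, type) != 0;
    return end + 2;
}
```

Called with p pointing at "alpha:]*" and c set to L'b', this reports a
match and returns a pointer to the remaining "*". A real fix would
presumably call something like it from the bracket-matching code
whenever it meets "[:", and would also need to handle equivalence
classes and collating symbols, which this sketch ignores.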