Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z
- In reply to: Yuri : "Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z"
Date: Fri, 21 Apr 2023 20:05:55 UTC
Yuri wrote:
> parv/FreeBSD wrote:
>> Wrote Dimitry Andric on Fri, 21 Apr 2023 10:38:05 UTC
>> (via
>> https://lists.freebsd.org/archives/freebsd-current/2023-April/003556.html )
>>>
>>> ... However, I have read that with unicode, you should *never*
>>> use [A-Z] or [0-9], but character classes instead. That seems to give
>>> both files on macOS and Linux with [[:alpha:]]:
>> ...
>>
>> Subject to the locale, problem with that is "[[:alpha:]]" will match
>> more than 26 English letters "A" through "Z" (besides also matching
>> lower case "a" through "z") even if none of 26 * 2 English alphabets
>> appear in a string.
>
> (replying to random recent message)
>
> And there is a bit of quite recent history for fnmatch() related to
> [a-z], same was done for regex with the same outcome -- attempt to make
> [a-z] (guess [A-Z] as well) range non-collating failed. I am not aware
> of the encountered failures, hopefully someone should remember:

I just tried less intrusive change that seems to help with these ranges
(but there's still a question what failed previously):

diff --git a/lib/libc/gen/fnmatch.c b/lib/libc/gen/fnmatch.c
index 40670545993..3234c1aaaa4 100644
--- a/lib/libc/gen/fnmatch.c
+++ b/lib/libc/gen/fnmatch.c
@@ -295,10 +295,11 @@ rangematch(const char *pattern, wchar_t test, int flags, char **newp,
                                if (flags & FNM_CASEFOLD)
                                        c2 = towlower(c2);
 
-                               if (table->__collate_load_error ?
+                               if (table->__collate_load_error ||
+                                   iswascii(test) ?
                                    c <= test && test <= c2 :
-                                   __wcollate_range_cmp(c, test) <= 0
-                                   && __wcollate_range_cmp(test, c2) <= 0
+                                   __wcollate_range_cmp(c, test) <= 0 &&
+                                   __wcollate_range_cmp(test, c2) <= 0
                                   )
                                        ok = 1;
                        } else if (c == test)

$ LC_ALL=en_US.UTF-8 LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 find . -name '[a-z]*'
./bar
$ LC_ALL=en_US.UTF-8 LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 find . -name '[A-Z]*'
./FOO

> --------
> commit 5a5807dd4ca34467ac5fb458bc19f12bf62075a5
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date: Sun Jul 10 03:49:38 2016 +0000
>
>     Remove broken support for collation in [a-z] type ranges.
>     Only first 256 wide chars are considered currently, all other are just
>     dropped from the range. Proper implementation require reverse tables
>     database lookup, since objects are really big as max UTF-8 (1114112
>     code points), so just the same scanning as it was for 256 chars will
>     slow things down.
>
>     POSIX does not require collation for [a-z] type ranges and does not
>     prohibit it for non-POSIX locales. POSIX require collation for ranges
>     only for POSIX (or C) locale which is equal to ASCII and binary for
>     other chars, so we already have it.
>
>     No other *BSD implements collation for [a-z] type ranges.
>
>     Restore ABI compatibility with unused now __collate_range_cmp() which
>     is visible from outside (will be removed later).
> --------
> commit 1daad8f5ad767dfe7896b8d1959a329785c9a76b
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date: Thu Jul 14 08:18:12 2016 +0000
>
>     Back out non-collating [a-z] ranges.
>     Instead of changing whole course to another POSIX-permitted way
>     for consistency and uniformity I decide to completely ignore missing
>     regex fucntionality and concentrace on fixing bugs in what we have now,
>     too many small obstacles instead, counting ports.
> --------
> commit 12eae8c8f346cb459a388259ca98faebdac47038
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date: Thu Jul 14 09:07:25 2016 +0000
>
>     1) Eliminate possibility to call __*collate_range_cmp() with inclomplete
>     locale (which cause core dump) by removing whole 'table' argument
>     by which it passed.
>
>     2) Restore __collate_range_cmp() in __sccl().
>
>     3) Collating [a-z] range in regcomp() only for single bytes locales
>     (we can't do it now for other ones). In previous state only first 256
>     wchars are considered and all others are just silently dropped from the
>     range.
> --------
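
For anyone reading the fnmatch.c change without the libc sources at hand, here is a minimal standalone sketch of the decision it makes. It is not the libc code: in_range(), coll_cmp() and the collate_loaded flag are hypothetical names, wcscoll() on one-character strings only stands in for the internal __wcollate_range_cmp(), and a plain "code point below 0x80" check stands in for iswascii(). The idea is that an ASCII test character is always compared by code point, so [a-z] and [A-Z] cover only the 26 English letters of one case, while anything else is still compared by collation order.

```c
#include <assert.h>
#include <wchar.h>

/*
 * Hypothetical stand-in for libc's internal __wcollate_range_cmp():
 * compare two wide characters by the current locale's collation order.
 */
static int
coll_cmp(wchar_t a, wchar_t b)
{
	wchar_t s1[2] = { a, L'\0' };
	wchar_t s2[2] = { b, L'\0' };

	return (wcscoll(s1, s2));
}

/*
 * Return 1 if 'test' falls inside the bracket range [lo-hi], else 0.
 * Assumption: collate_loaded is the inverse of __collate_load_error.
 * ASCII test characters, or any character when no collation table is
 * loaded, are compared by code point; the rest by collation order.
 */
int
in_range(wchar_t lo, wchar_t test, wchar_t hi, int collate_loaded)
{
	if (!collate_loaded || (unsigned long)test < 0x80)
		return (lo <= test && test <= hi);
	return (coll_cmp(lo, test) <= 0 && coll_cmp(test, hi) <= 0);
}

/* Usage example: ASCII letters behave the same in every locale. */
void
in_range_selfcheck(void)
{
	assert(in_range(L'a', L'm', L'z', 1) == 1);	/* 'm' is in [a-z] */
	assert(in_range(L'a', L'B', L'z', 1) == 0);	/* 'B' is not in [a-z] */
	assert(in_range(L'A', L'b', L'Z', 1) == 0);	/* 'b' is not in [A-Z] */
}
```

Only the ASCII path is exercised in the self-check, since the result for non-ASCII characters depends on the locale's collation data.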