Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z
Date: Fri, 21 Apr 2023 19:36:05 UTC
parv/FreeBSD wrote:
> Wrote Dimitry Andric on Fri, 21 Apr 2023 10:38:05 UTC
> (via
> https://lists.freebsd.org/archives/freebsd-current/2023-April/003556.html )
>>
>> ... However, I have read that with unicode, you should *never*
>> use [A-Z] or [0-9], but character classes instead. That seems to give
>> both files on macOS and Linux with [[:alpha:]]:
> ...
>
> Subject to the locale, problem with that is "[[:alpha:]]" will match
> more than 26 English letters "A" through "Z" (besides also matching
> lower case "a" through "z") even if none of 26 * 2 English alphabets
> appear in a string.

(replying to random recent message)

And there is a bit of quite recent history for fnmatch() related to [a-z],
same was done for regex with the same outcome -- attempt to make [a-z]
(guess [A-Z] as well) range non-collating failed. I am not aware of the
encountered failures, hopefully someone should remember:

--------
commit 5a5807dd4ca34467ac5fb458bc19f12bf62075a5
Author: Andrey A. Chernov <ache@FreeBSD.org>
Date:   Sun Jul 10 03:49:38 2016 +0000

    Remove broken support for collation in [a-z] type ranges.
    Only first 256 wide chars are considered currently, all other are just
    dropped from the range. Proper implementation require reverse tables
    database lookup, since objects are really big as max UTF-8 (1114112
    code points), so just the same scanning as it was for 256 chars will
    slow things down.

    POSIX does not require collation for [a-z] type ranges and does not
    prohibit it for non-POSIX locales. POSIX require collation for ranges
    only for POSIX (or C) locale which is equal to ASCII and binary for
    other chars, so we already have it.

    No other *BSD implements collation for [a-z] type ranges.

    Restore ABI compatibility with unused now __collate_range_cmp() which
    is visible from outside (will be removed later).

--------
commit 1daad8f5ad767dfe7896b8d1959a329785c9a76b
Author: Andrey A. Chernov <ache@FreeBSD.org>
Date:   Thu Jul 14 08:18:12 2016 +0000

    Back out non-collating [a-z] ranges.
    Instead of changing whole course to another POSIX-permitted way
    for consistency and uniformity I decide to completely ignore missing
    regex fucntionality and concentrace on fixing bugs in what we have now,
    too many small obstacles instead, counting ports.

--------
commit 12eae8c8f346cb459a388259ca98faebdac47038
Author: Andrey A. Chernov <ache@FreeBSD.org>
Date:   Thu Jul 14 09:07:25 2016 +0000

    1) Eliminate possibility to call __*collate_range_cmp() with inclomplete
    locale (which cause core dump) by removing whole 'table' argument
    by which it passed.

    2) Restore __collate_range_cmp() in __sccl().

    3) Collating [a-z] range in regcomp() only for single bytes locales
    (we can't do it now for other ones). In previous state only first 256
    wchars are considered and all others are just silently dropped from the
    range.

--------
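
To make the [[:alpha:]] versus "26 English letters" distinction from the quoted text concrete, here is a minimal sketch in Python. It is only an analogy and does not exercise fnmatch(3) or regcomp(3): str.isalpha() is assumed here as a stand-in for what [[:alpha:]] may match in a Unicode-aware locale, and the helper names (is_english_letter, is_alpha_class, count_matching) are made up for illustration. Python's re module compares bracket ranges by code point, so [A-Za-z] there plays the role of a non-collating range.

```python
import re

# Explicit ASCII range. Python's re compares code points, so without
# re.IGNORECASE this matches exactly the 26 * 2 English letters,
# independent of any locale setting.
ENGLISH_LETTER = re.compile(r"[A-Za-z]")


def is_english_letter(ch: str) -> bool:
    """True only for a single character in A-Z or a-z."""
    return ENGLISH_LETTER.fullmatch(ch) is not None


def is_alpha_class(ch: str) -> bool:
    """Assumed stand-in for [[:alpha:]] in a Unicode-aware locale."""
    return ch.isalpha()


def count_matching(pred, limit: int = 0x250) -> int:
    """Count code points below `limit` accepted by the predicate."""
    return sum(1 for cp in range(limit) if pred(chr(cp)))


if __name__ == "__main__":
    # Columns: character, explicit-range result, alpha-class result.
    for ch in ("A", "z", "\u00e9", "\u0416", "5"):
        print(repr(ch), is_english_letter(ch), is_alpha_class(ch))
    print("explicit range:", count_matching(is_english_letter))
    print("alpha class:   ", count_matching(is_alpha_class))
```

The explicit range accepts 52 characters no matter what else is in the input, while the alphabetic class also accepts letters such as "é" and "Ж". Whether a given libc treats [A-Z] this way depends on the collation behaviour described in the commits above, so this should not be read as a statement about any particular fnmatch() or regcomp() implementation.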