libc/regex: r302824 added invalid check breaking collating ranges
Yuri Pankov
yuripv at icloud.com
Tue Jan 23 00:53:34 UTC 2018
(CCing Kyle as he's working on regex at the moment and not because he
broke something)
Hi,
r302824 added an invalid check which breaks collating ranges:
-	if (table->__collate_load_error) {
-		(void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
+	if (table->__collate_load_error || MB_CUR_MAX > 1) {
+		(void)REQUIRE(start <= finish, REG_ERANGE);
The "MB_CUR_MAX > 1" condition is wrong: in a multibyte locale we should
compare the range endpoints according to the current locale's collation,
not simply compare their wchar_t values.
Example -- see Table 1 in http://www.unicode.org/reports/tr10/:
Let's try Swedish collation:
$ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[ö-z]'
grep: invalid character range
$ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[z-ö]'
OK, the above looks correct, since 'ö' > 'z' in Swedish collation, but
we just got lucky here: comparing the wchar_t values happens to give the
same result.
Now the German one:
$ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[ö-z]'
grep: invalid character range
$ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[z-ö]'
Same result, but according to the table, 'ö' < 'z' in German collation, so
'[ö-z]' should have been accepted and '[z-ö]' rejected!
I think the fix here would be to drop the "if
(table->__collate_load_error || MB_CUR_MAX > 1)" block entirely. We no
longer use "table", so there's no point in getting it and checking for a
load error; wcscoll(), which is eventually called from p_range_cmp(), does
the table handling itself. And we can't use the direct comparison for
anything other than the 'C' locale (I'm not sure it's applicable even
there).
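To make the difference between the two comparisons concrete, here is a
minimal sketch. The helper names range_ok_by_value() and
range_ok_by_collation() are made up for illustration and are not the actual
regcomp.c code; the second one only assumes that a wcscoll()-based
comparison of one-character strings is roughly what the range check should
rely on.

```c
#include <wchar.h>

/*
 * Hypothetical helper: accept a range if start <= finish by raw wchar_t
 * value.  This mirrors the direct comparison in the check quoted above.
 */
static int
range_ok_by_value(wchar_t start, wchar_t finish)
{
	return (start <= finish);
}

/*
 * Hypothetical helper: accept a range if start collates before or equal
 * to finish in the current LC_COLLATE locale.  Each endpoint is turned
 * into a one-character wide string so that wcscoll() can compare them.
 */
static int
range_ok_by_collation(wchar_t start, wchar_t finish)
{
	wchar_t s[2] = { start, L'\0' };
	wchar_t f[2] = { finish, L'\0' };

	return (wcscoll(s, f) <= 0);
}
```

With 'ö' (U+00F6) as start and 'z' as finish, range_ok_by_value() always
returns 0, whatever the locale. range_ok_by_collation() depends on the
locale selected with setlocale(LC_COLLATE, ...), and should return non-zero
wherever the collation table orders 'ö' before 'z', as the German ordering
in the table above does.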