libc/regex: r302824 added invalid check breaking collating ranges

Tue Jan 23 00:53:34 UTC 2018

(CCing Kyle as he's working on regex at the moment and not because he 
broke something)

Hi,

r302824 added an invalid check which breaks collating ranges:

-if (table->__collate_load_error) {
-    (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
+if (table->__collate_load_error || MB_CUR_MAX > 1) {
+    (void)REQUIRE(start <= finish, REG_ERANGE);

The "MB_CUR_MAX > 1" condition is wrong: we should be comparing the 
range endpoints according to the current locale's collation, not simply 
comparing their wchar_t values.
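
To make the difference concrete, here is a minimal sketch of the two
ways a range's endpoints could be validated. The helper names are
hypothetical and this is not the libc code; it only assumes that
wcscoll() orders wide strings by the current LC_COLLATE setting.

```c
#include <wchar.h>

/*
 * Hypothetical helper: order two wide characters by the current
 * LC_COLLATE rules by wrapping each in a one-character string.
 */
int
collate_cmp(wchar_t a, wchar_t b)
{
    wchar_t s1[2] = { a, L'\0' };
    wchar_t s2[2] = { b, L'\0' };

    return (wcscoll(s1, s2));
}

/* Range check by raw wchar_t value, as the "MB_CUR_MAX > 1" branch does. */
int
range_ok_by_value(wchar_t start, wchar_t finish)
{
    return (start <= finish);
}

/* Range check by collation order (the behaviour argued for here). */
int
range_ok_by_collation(wchar_t start, wchar_t finish)
{
    return (collate_cmp(start, finish) <= 0);
}
```

The two checks only agree when the locale's collation order happens to
match the numeric order of the wchar_t values, as it does in the "C"
locale.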

Example -- see Table 1 in http://www.unicode.org/reports/tr10/:

Let's try Swedish collation:
$ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[ö-z]'
grep: invalid character range
$ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[z-ö]'

OK, the above seems to be correct: 'ö' > 'z' in Swedish collation. But 
we just got lucky here, as the wchar_t comparison happens to give us the 
same result.

Now the German one:
$ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[ö-z]'
grep: invalid character range
$ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[z-ö]'

Same result, but according to the table, 'ö' < 'z' in German collation, 
so '[ö-z]' should have been accepted as a valid range here!
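
For anyone who wants to check what a given locale's collation data says
independently of grep, a small probe along these lines could be used.
It is only a sketch: the function name and the PROBE_NO_LOCALE sentinel
are made up for illustration, and it assumes that wchar_t holds Unicode
code points (so 0x00F6 is 'ö').

```c
#include <locale.h>
#include <wchar.h>

#define PROBE_NO_LOCALE 2   /* hypothetical "locale not available" result */

/*
 * Switch LC_COLLATE to locale_name and report how a and b are ordered:
 * -1 if a sorts before b, 0 if they collate equal, 1 if a sorts after b,
 * or PROBE_NO_LOCALE if setlocale() rejects the locale name.
 */
int
probe_collation(const char *locale_name, wchar_t a, wchar_t b)
{
    wchar_t s1[2] = { a, L'\0' };
    wchar_t s2[2] = { b, L'\0' };
    int r;

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return (PROBE_NO_LOCALE);

    r = wcscoll(s1, s2);
    return ((r > 0) - (r < 0));   /* normalize to -1, 0 or 1 */
}
```

Calling probe_collation("de_DE.UTF-8", (wchar_t)0x00F6, L'z') would then
show the order the German collation data actually gives; the result
depends on the locale data installed on the system.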

I think the fix here would be to drop the "if 
(table->__collate_load_error || MB_CUR_MAX > 1)" block entirely. We no 
longer use "table", so there's no point in getting it and checking for a 
load error; wcscoll(), which is eventually called in p_range_cmp(), does 
the table handling itself. And we can't use the direct comparison for 
anything other than the 'C' locale (not sure if it's applicable even 
there).