libc/regex: r302824 added invalid check breaking collating ranges

Wed Jan 24 04:18:15 UTC 2018

On Tue, Jan 23, 2018 at 01:22:04PM -0600, Kyle Evans wrote:
> On Tue, Jan 23, 2018 at 1:10 PM, Yuri Pankov <yuripv at icloud.com> wrote:
>> On Tue, Jan 23, 2018 at 08:10:32AM -0600, Kyle Evans wrote:
>>>
>>> On Mon, Jan 22, 2018 at 11:36 PM, Yuri Pankov <yuripv at icloud.com> wrote:
>>>>
>>>> On Tue, Jan 23, 2018 at 03:53:19AM +0300, Yuri Pankov wrote:
>>>>>
>>>>>
>>>>> (CCing Kyle as he's working on regex at the moment and not because he
>>>>> broke something)
>>>>>
>>>>> Hi,
>>>>>
>>>>> r302284 added an invalid check which breaks collating ranges:
>>>>>
>>>>> -if (table->__collate_load_error) {
>>>>> -    (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>>>>> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
>>>>> +    (void)REQUIRE(start <= finish, REG_ERANGE);
>>>>>
>>>>> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison
>>>>> according to current locale's collation and not simply comparing the
>>>>> wchar_t values.
>>>>
>>>>
>>>>
>>>> After re-reading the specification I now see that what looked like a bug
>>>> is
>>>> actually an implementation choice, though the one that needs to be
>>>> documentated.  I'll update the man page if anyone is willing to review
>>>> (and
>>>> commit) the changes.
>>>
>>>
>>> Can you point to the section of specification that indicates this is
>>> OK behavior? It doesn't seem desirable, but I see that GNU systems
>>> will operate in the same manner that we do now.
>>
>>
>> Here --
>> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:
>> ------------------------------------------------------------------------
>> In the POSIX locale, a range expression represents the set of collating
>> elements that fall between two elements in the collation sequence,
>> inclusive. In other locales, a range expression has unspecified behavior:
>> strictly conforming applications shall not rely on whether the range
>> expression is valid, or on the set of collating elements matched.
>> ------------------------------------------------------------------------
>>
> 
> Thanks- our current behavior seems reasonable in that context.
> 
>> I've tried to "fix" what I was seeing as well, and yes, everything outside
>> of ASCII is ugly, e.g. Cyrillic 'а-я' would match much more than you could
>> expect if you are doing lookups based on collation order (capital chars and
>> a lot of other symbols).
>>
>> So what we have currently looks the least evil to me:
>>
>> - non-collating ASCII lookups for any locale -- looking at the log for
>>    regcomp.c there was an attempt to "fix" this, but it was reverted as
>>    a lot of existing code relies on this;
>> - non-collating multi-byte locale lookups -- they will work for almost
>>    all   cases, and where they don't, well POSIX says it's undefined :D
>> - collating single-byte locale lookups for outside of ASCII range --
>>    they make sense as collation order there doesn't seem to mix
>>    small/caps/other characters together.
>>
>> What I think we need to do is document this as implementation choice in the
>> code and regex(3) "IMPLEMENTATION NOTES" so that another poor soul doesn't
>> come trying to fix it as I did :-)
> 
> I agree with your assessment- such a patch would be welcomed,
> especially before I go and revise a bunch of this for clarification in
> a future libregex world.

Actually, it's broken even more than I thought:

$ echo 'TEST' | LC_ALL=en_US.ISO8859-1 grep '[a-z]'
TEST

That's a result of using collation lookup for singlebyte locales.  Now I 
just think that using collations for range expressions in *any* locale 
is just plain wrong.

Another side effect of all this "sometimes non-collating" nonsense is 
inability to deal with multibyte characters whose corresponding wide 
character is in 128-255 range -- try adding 'µ' (\302\265, U+00B5) to 
the pattern and observe a nice ~1GB core from grep after endless loop in 
regcomp().  This is due to NC (I guess meaning "non-collating") being 
defined as (CHAR_MAX - CHAR_MIN + 1) which is 256.

To sum the above, how about we drop the "non-collating" notion, and just 
use binary wide character comparison everywhere?