svn commit: r301461 - in head/lib/libc: gen locale regex
Pedro Giffuni
pfg at FreeBSD.org
Mon Jun 6 13:43:27 UTC 2016
On 06/05/16 14:49, Andrey Chernov wrote:
> On 05.06.2016 22:12, Pedro F. Giffuni wrote:
>> --- head/lib/libc/regex/regcomp.c Sun Jun 5 18:16:33 2016 (r301460)
>> +++ head/lib/libc/regex/regcomp.c Sun Jun 5 19:12:52 2016 (r301461)
>> @@ -821,10 +821,10 @@ p_b_term(struct parse *p, cset *cs)
>> (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>> CHaddrange(p, cs, start, finish);
>> } else {
>> - (void)REQUIRE(__collate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>> + (void)REQUIRE(__wcollate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>> for (i = 0; i <= UCHAR_MAX; i++) {
>> - if ( __collate_range_cmp(table, start, i) <= 0
>> - && __collate_range_cmp(table, i, finish) <= 0
>> + if ( __wcollate_range_cmp(table, start, i) <= 0
>> + && __wcollate_range_cmp(table, i, finish) <= 0
>> )
>> CHadd(p, cs, i);
>> }
>>
>
> As I already mentioned in the PR, regcomp has been broken since someone
> added wchar_t support to it. regcomp ranges now work only for the first 256
> wchars of the current locale; notice the loop's upper limit:
> for (i = 0; i <= UCHAR_MAX; i++) {
> In general, ranges in regcomp are now either broken or memory
> eating. We have a bitmask only for the first 256 wchars; all others are
> added to the range literally. Imagine what happens if someone specifies the
> full Unicode range in a regexp.
>
> The proper fix would be to add a bitmask for the whole Unicode range, and
> even in that case a regcomp that attempts to use collation in ranges will be
> _very_slow_, since it needs to check all Unicode chars in its
> for (i = 0; i <= Max_Unicode_wchar; i++) {
> loop.
>
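
For a sense of scale, here is a rough sketch of what a one-bit-per-code-point
bitmask would cost. MAX_UNICODE and bitmap_bytes() are names made up for this
example, not anything in regcomp.c, and the numbers assume 8-bit bytes.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Highest Unicode code point (name is hypothetical, for this sketch only). */
#define MAX_UNICODE 0x10FFFFUL

/*
 * Bytes needed for a bitmap holding one bit for every value in
 * 0..max_value, rounded up to a whole byte.
 */
static size_t
bitmap_bytes(unsigned long max_value)
{
	/* Guard against overflow in the rounding arithmetic below. */
	assert(max_value <= ULONG_MAX - CHAR_BIT);
	return ((size_t)((max_value + 1 + CHAR_BIT - 1) / CHAR_BIT));
}
```

With CHAR_BIT == 8, bitmap_bytes(UCHAR_MAX) is 32 bytes for the current
256-entry mask, while bitmap_bytes(MAX_UNICODE) is 139264 bytes (136 KiB) for
every set that carries such a mask.
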
> Better to stop pretending that we are able to do collation support in
> ranges, since POSIX cares only about its own locale here:
> "In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified
> behavior: strictly conforming applications shall not rely on whether the
> range expression is valid, or on the set of collating elements matched."
>
> Until a bitmask for the whole Unicode range is implemented (if ever), we
> had better stop pretending to honor collation order, since we just can't do
> it with wchars now, and do what NetBSD/OpenBSD do (using wchar_t) instead.
> That does not prevent memory eating on big ranges (a bitmask is needed, see
> above), but it at least fixes the problem that only the first 256 wchars are
> considered.
>
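
If I read the suggestion correctly, the idea is to compare wchar_t values
directly instead of going through the collation table. A minimal sketch of
that, using a hypothetical struct wrange and helpers that are neither the cset
code in regcomp.c nor anything taken from NetBSD/OpenBSD:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical range type for this sketch; not the cset from regcomp.c. */
struct wrange {
	wchar_t min;
	wchar_t max;
};

/*
 * Record the endpoints of [start-finish].  Returns false when the range is
 * reversed, which is where a real parser would report REG_ERANGE.
 */
static bool
wrange_init(struct wrange *r, wchar_t start, wchar_t finish)
{
	assert(r != NULL);
	if (start > finish)
		return (false);
	r->min = start;
	r->max = finish;
	return (true);
}

/* Membership is decided by code point order only; no collation lookup. */
static bool
wrange_contains(const struct wrange *r, wchar_t ch)
{
	assert(r != NULL);
	return (r->min <= ch && ch <= r->max);
}
```

Storing only the two endpoints makes membership two comparisons no matter how
wide the range is, so nothing has to loop over every code point in between.
The ordering is by code point only, so this makes no attempt to honor locale
collation.
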
Sadly, regex is one part of the system that could use a maintainer :(
I have been forced to look at it more than I'd like, but I don't
really use the collation support at all.
Pedro.