[Bug 281710] RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 27 Sep 2024 09:17:27 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281710

--- Comment #13 from Eric <erichanskrs@gmail.com> ---
(in reply to Kyle Evans comment #10)
(in reply to Olivier Certner comment #12)

Based on the commit comments in
https://cgit.freebsd.org/src/commit/?id=8f7ed58a15556bf567ff876e1999e4fe4d684e1d
I see, however, that I may have underestimated the possible actual impact on
string processing in a pervasive UTF-8 world.

I don't have a test setup available at the moment to try the examples below on
-CURRENT or -STABLE-13 or 14.

-- Examples
[1] # cat names
cedric
étienne
égards
françois
[2] # cat names | grep '[é]'
étienne
égards
[3] # cat names | grep '[éç]'
étienne
égards
françois
[4] # cat names | grep '[éi]'      # <-- error
cedric
étienne
françois
[5] # cat names | grep -i '[éi]'   # <-- case-insensitive "avoids" singleton 
cedric
étienne
égards
françois
[6] # cat names | grep -E '[é]|[i]' # <-- splitting into two bracket expressions avoids erroneous code
cedric
étienne
égards
françois
[7] #
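
For anyone who does have a test setup at hand, the cases above could be
scripted roughly as follows. This is only a sketch: the locale name C.UTF-8,
the scratch file, and the helper name count are my assumptions and not part
of the bug report. The only expectation it encodes is that all four names
contain 'é' or 'i', so '[éi]' should match the same four lines as '[é]|[i]'.

```shell
#!/bin/sh
# Sketch of a regression check for the bracket-expression examples above.
# Assumptions: this script is saved as UTF-8 and a locale named C.UTF-8
# is installed; substitute e.g. en_US.UTF-8 if it is not.
LC_ALL=C.UTF-8
export LC_ALL

# Recreate the sample file from example [1] in a scratch location.
names=$(mktemp) || exit 1
printf '%s\n' cedric étienne égards françois > "$names"

# count PATTERN [GREP-OPTIONS...]: number of sample lines that match.
# (count is a hypothetical helper name used only in this sketch.)
count() {
    pattern=$1
    shift
    grep -c "$@" -e "$pattern" "$names"
}

mixed=$(count '[éi]')         # example [4]: every name has 'é' or 'i'
icase=$(count '[éi]' -i)      # example [5]: same set, case-insensitive
split=$(count '[é]|[i]' -E)   # example [6]: two bracket expressions

rm -f "$names"

echo "mixed=$mixed icase=$icase split=$split"
if [ "$mixed" -eq "$split" ]; then
    echo "OK: [éi] agrees with [é]|[i]"
else
    echo "MISMATCH: [éi] matched $mixed lines, [é]|[i] matched $split"
fi
```

On a system showing the behaviour in example [4], I would expect this to
print the MISMATCH line, since égards is dropped from the '[éi]' result.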

I think such cases will likely have been overlooked, misjudged as correctly
processed, or not investigated further.

Fast & correct (UTF-8) string processing is difficult, and this made me take
another look at singleton's char processing.
Viewed from a distance (and assuming that only one test operation, the first,
is evaluated in each chain of "shortcut" ||-operands), the distance to the
prize (i.e. line 1626) in
https://github.com/freebsd/freebsd-src/blob/main/lib/libc/regex/regcomp.c#L1626
as compared to
https://github.com/freebsd/freebsd-src/blob/releng/14.1/lib/libc/regex/regcomp.c#L1600
has gone up considerably:
singleton-error:      2 tests
singleton-modified:   6 tests

Are the added complexity and extra processing steps of an added singleton
function for a bracket expression still justified?
Case-insensitive bracket expressions don't profit, as can be painfully observed
in the examples above; they only incur a small amount of additional time.
I wonder if comparative testing with singleton processing versus without it
yields justifiable gains (yes, that is a subjective adjective).
