[Bug 281710] RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 25 Sep 2024 20:44:12 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281710

--- Comment #9 from commit-hook@FreeBSD.org ---
A commit in branch stable/13 references this bug:

URL:
https://cgit.FreeBSD.org/src/commit/?id=d96ce6d000703f3f57d9214b741e16cc7741d77e

commit d96ce6d000703f3f57d9214b741e16cc7741d77e
Author:     Bill Sommerfeld <sommerfeld@hamachi.org>
AuthorDate: 2023-12-21 03:46:14 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-09-25 20:42:28 +0000

    regex: mixed sets are misidentified as singletons

    Fix "singleton" function used by regcomp() to turn character set matches
    into exact character matches if a character set has exactly one
    element.

    The underlying cset representation is complex; most critically it
    records "small" characters (codepoint less than either 128
    or 256 depending on locale) in a bit vector, and "wide" characters in
    a secondary array.

    Unfortunately the "singleton" function used to identify singleton sets
    treated a cset as a singleton if either the "small" or the "wide" sets
    had exactly one element (it would then ignore the other set).
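The difference between the two checks can be sketched with a toy model. This is not the code from regcomp.c: the `toy_cset` struct, the `toy_*` helpers and the fixed `TOY_NC` value of 128 are all hypothetical simplifications (the real cset also tracks ranges, character classes, case folding and inversion, which are ignored here).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Assumed size of the "small" range; the commit message says 128 or 256. */
#define TOY_NC 128

/* Hypothetical, simplified character set: not the real libc cset. */
struct toy_cset {
    unsigned char small[TOY_NC / CHAR_BIT]; /* one bit per small code point */
    const wint_t *wides;                    /* code points >= TOY_NC */
    size_t nwides;                          /* number of entries in wides */
};

/* Record a small code point in the bit vector. */
static void toy_add_small(struct toy_cset *cs, unsigned int ch)
{
    assert(ch < TOY_NC);
    cs->small[ch / CHAR_BIT] |= (unsigned char)(1u << (ch % CHAR_BIT));
}

/* Count the small members; *last receives the highest one found, if any. */
static size_t toy_count_small(const struct toy_cset *cs, wint_t *last)
{
    size_t n = 0;
    for (unsigned int ch = 0; ch < TOY_NC; ch++) {
        if (cs->small[ch / CHAR_BIT] & (1u << (ch % CHAR_BIT))) {
            n++;
            *last = (wint_t)ch;
        }
    }
    return n;
}

/* Buggy check: either half having exactly one member is enough. */
static wint_t toy_singleton_buggy(const struct toy_cset *cs)
{
    wint_t s = WEOF;
    if (toy_count_small(cs, &s) == 1)
        return s;                 /* the wide members are ignored */
    if (cs->nwides == 1)
        return cs->wides[0];      /* the small members are ignored */
    return WEOF;
}

/* Fixed check: the two halves together must hold exactly one member. */
static wint_t toy_singleton_fixed(const struct toy_cset *cs)
{
    wint_t s = WEOF;
    size_t n = toy_count_small(cs, &s);
    if (n + cs->nwides != 1)
        return WEOF;
    return n == 1 ? s : cs->wides[0];
}
```

For a set holding the small members 'a' and 'b' plus the single wide member U+00E0, `toy_singleton_buggy` returns U+00E0, while `toy_singleton_fixed` returns `WEOF`, meaning the set is not a singleton.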

    The easiest way to demonstrate this bug:

            $ export LANG=C.UTF-8
            $ echo 'a' | grep '[abà]'

    It should match (and print "a") but instead it doesn't match because the
    set is misinterpreted as a singleton containing only the accented
    character.
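The same check can be made directly against the POSIX regex API rather than through grep. The sketch below is an illustration only: `bracket_matches` and `check_mixed_set` are hypothetical helper names, the accented character is written as its UTF-8 bytes, and the result depends on whether the `C.UTF-8` locale is available and on the libc in use.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if text matches pattern, 0 if it does
 * not, and -1 if the pattern fails to compile.
 */
static int bracket_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/*
 * Mirrors the shell demonstration: match "a" against a set holding
 * 'a', 'b' and U+00E0 (UTF-8 bytes 0xC3 0xA0) in a UTF-8 locale.
 * Returns -1 if the C.UTF-8 locale cannot be selected.
 */
static int check_mixed_set(void)
{
    if (setlocale(LC_ALL, "C.UTF-8") == NULL)
        return -1;
    return bracket_matches("[ab\xc3\xa0]", "a");
}
```

Going by the commit message, `check_mixed_set()` should report a match on a libc that carries this fix and no match on an affected one.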

    PR:             281710
    Reviewed by:    kevans, yuripv
    Obtained from:  illumos

    (cherry picked from commit 8f7ed58a15556bf567ff876e1999e4fe4d684e1d)

 lib/libc/regex/regcomp.c          | 25 ++++++++++++++++++-----
 lib/libc/tests/regex/multibyte.sh | 43 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 6 deletions(-)
