[Bug 281710] RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 25 Sep 2024 13:30:34 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281710

            Bug ID: 281710
           Summary: RegEXP bug in bracket expression [^...] - sed(1),
                    grep(1), re_format(7)
           Product: Base System
           Version: 14.1-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: standards
          Assignee: standards@FreeBSD.org
          Reporter: erichanskrs@gmail.com

There appears to be a bug in the regular expressions used by FreeBSD's sed(1)
and grep(1) and described in re_format(7): a non-matching bracket expression
[^...] that contains an accented character fails to match. Both modern REs and
basic REs are affected.


-- Short examples
Command lines 202, 203 and 207 show unexpected behaviour: each prints nothing,
although a match is expected.
[200] # echo '9a' | /usr/bin/sed        -En 's/([^a])(a)/-\1-\2-/p'
-9-a-
[201] # echo '9a' | /usr/bin/sed        -n 's/\([^a]\)\(a\)/-\1-\2-/p'
-9-a-
[202] # echo '9â' | /usr/bin/sed        -n 's/\([^â]\)\(â\)/-\1-\2-/p' # <--
[203] # echo '9â' | /usr/bin/sed        -En 's/([^â])(â)/-\1-\2-/p'    # <--
[204] # echo '9â' | /usr/local/bin/gsed -En 's/([^â])(â)/-\1-\2-/p'
-9-â-
[205] # echo 'ââ' | /usr/bin/sed        -En 's/([â])(â)/-\1-\2-/p'
-â-â-
[206] # echo 'ââ' | /usr/local/bin/gsed -En 's/([â])(â)/-\1-\2-/p'
-â-â-
[207] # echo '9â' | /usr/bin/grep       -E '[^â]â'                      # <--
[208] #

The same results occur with characters such as 'ç' and 'é'.
The forum thread linked below reports this for Unicode characters.
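
To repeat the check for several characters at once, a loop along the following
lines may help. This is only a sketch: it assumes a POSIX sh, a sed that
accepts -E, and a UTF-8 locale; try_char is a hypothetical helper name, and the
ASCII 'a' is included as a control that is expected to match.

```shell
#!/bin/sh
# try_char (hypothetical helper): run the substitution from command line 203
# with the given character and report whether sed printed anything.
try_char() {
    c=$1
    # Input is '9' followed by the character; [^c] should match the '9'.
    out=$(printf '9%s\n' "$c" | sed -En "s/([^$c])($c)/-\\1-\\2-/p")
    if [ -n "$out" ]; then
        echo "$c: match ($out)"
    else
        echo "$c: no match"
    fi
}

# 'a' is the ASCII control case; the others are the characters from the report.
for c in a 'â' 'ç' 'é'; do
    try_char "$c"
done
```

If the behaviour described above is present, the non-ASCII lines would be
expected to read "no match" while the 'a' line reads "match".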


-- Reference
FreeBSD forum link:
https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/

re_format(7):
"
DESCRIPTION
   [...]
       A bracket expression is a list of characters enclosed in `[]'.  It
       normally matches any single character from the list (but see below).
       If the list begins with `^', it matches any single character (but see
       below) not from the rest of the list.
"
As FreeBSD aims to conform to POSIX, the POSIX specification is relevant as well:
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_03_05
"
3. A non-matching list expression begins with a <circumflex> ('^'), and the
matching behavior shall be the logical inverse of the corresponding matching
list expression (the same bracket expression but without the leading
<circumflex>). For example, since the RE "[abc]" only matches 'a', 'b', or 'c',
it follows that "[^abc]" is an RE that matches any character except 'a', 'b',
or 'c'. It is unspecified whether a non-matching list expression matches a
multi-character collating element that is not matched by any of the
expressions. The <circumflex> shall have this special meaning only when it
occurs first in the list, immediately following the <left-square-bracket>.
"


-- Context of my OS and programs:
[100] # uname -a
FreeBSD q210 14.1-RELEASE-p5 FreeBSD 14.1-RELEASE-p5 GENERIC amd64
[101] # pkg which /usr/local/bin/ggrep
/usr/local/bin/ggrep was installed by package gnugrep-3.11
[102] # pkg which /usr/local/bin/gsed
/usr/local/bin/gsed was installed by package gsed-4.9
[103] # locale
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_TIME="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=
