From nobody Fri Dec 22 05:21:32 2023
Date: Fri, 22 Dec 2023 05:21:32 GMT
Message-Id: <202312220521.3BM5LWlF054579@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
From: Yuri Pankov
Subject: git: 8f7ed58a1555 - main - regex: mixed sets are misidentified as singletons
List-Id: Commit messages for all branches of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all
Sender: owner-dev-commits-src-all@freebsd.org
X-BeenThere: dev-commits-src-all@freebsd.org
MIME-Version: 1.0
Content-Type:
 text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: yuripv
X-Git-Repository: src
X-Git-Refname: refs/heads/main
X-Git-Reftype: branch
X-Git-Commit: 8f7ed58a15556bf567ff876e1999e4fe4d684e1d
Auto-Submitted: auto-generated

The branch main has been updated by yuripv:

URL: https://cgit.FreeBSD.org/src/commit/?id=8f7ed58a15556bf567ff876e1999e4fe4d684e1d

commit 8f7ed58a15556bf567ff876e1999e4fe4d684e1d
Author:     Bill Sommerfeld
AuthorDate: 2023-12-21 03:46:14 +0000
Commit:     Yuri Pankov
CommitDate: 2023-12-22 05:19:59 +0000

    regex: mixed sets are misidentified as singletons

    Fix the "singleton" function used by regcomp() to turn character set
    matches into exact character matches if a character set has exactly
    one element.

    The underlying cset representation is complex; most critically it
    records "small" characters (codepoint less than either 128 or 256
    depending on locale) in a bit vector, and "wide" characters in a
    secondary array.

    Unfortunately the "singleton" function used to identify singleton sets
    treated a cset as a singleton if either the "small" or the "wide" set
    had exactly one element (it would then ignore the other set).

    The easiest way to demonstrate this bug:

            $ export LANG=C.UTF-8
            $ echo 'a' | grep '[abà]'

    It should match (and print "a") but instead it doesn't match because the
    single accented character in the set is misinterpreted as a singleton.

    Reviewed by:    kevans, yuripv
    Obtained from:  illumos
    Differential Revision:  https://reviews.freebsd.org/D43149
---
 lib/libc/regex/regcomp.c          | 25 ++++++++++++++++++-----
 lib/libc/tests/regex/multibyte.sh | 43 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/lib/libc/regex/regcomp.c b/lib/libc/regex/regcomp.c
index ba803130a050..89b96b00fefb 100644
--- a/lib/libc/regex/regcomp.c
+++ b/lib/libc/regex/regcomp.c
@@ -1586,17 +1586,32 @@ singleton(cset *cs)
 {
 	wint_t i, s, n;
 
+	/* Exclude the complicated cases we don't want to deal with */
+	if (cs->nranges != 0 || cs->ntypes != 0 || cs->icase != 0)
+		return (OUT);
+
+	if (cs->nwides > 1)
+		return (OUT);
+
+	/* Count the number of characters present in the bitmap */
 	for (i = n = 0; i < NC; i++)
 		if (CHIN(cs, i)) {
 			n++;
 			s = i;
 		}
-	if (n == 1)
-		return (s);
-	if (cs->nwides == 1 && cs->nranges == 0 && cs->ntypes == 0 &&
-	    cs->icase == 0)
+
+	if (n > 1)
+		return (OUT);
+
+	if (n == 1) {
+		if (cs->nwides == 0)
+			return (s);
+		else
+			return (OUT);
+	}
+	if (cs->nwides == 1)
 		return (cs->wides[0]);
-	/* Don't bother handling the other cases. */
+
 	return (OUT);
 }
 
diff --git a/lib/libc/tests/regex/multibyte.sh b/lib/libc/tests/regex/multibyte.sh
index a736352bf0a2..18323f500a2b 100755
--- a/lib/libc/tests/regex/multibyte.sh
+++ b/lib/libc/tests/regex/multibyte.sh
@@ -1,4 +1,3 @@
-
 atf_test_case bmpat
 bmpat_head()
 {
@@ -45,8 +44,50 @@ icase_body()
 	echo $c | atf_check -o "inline:$c\n" sed -ne "/$a/Ip"
 }
 
+atf_test_case mbset cleanup
+mbset_head()
+{
+	atf_set "descr" "Check multibyte sets matching"
+}
+mbset_body()
+{
+	export LC_CTYPE="C.UTF-8"
+
+	# This involved an erroneously implemented optimization which reduces
+	# single-element sets to an exact match with a single codepoint.
+	# Match sets record small-codepoint characters in a bitmap and
+	# large-codepoint characters in an array; the optimization would falsely
+	# trigger if either the bitmap or the array was a singleton, ignoring
+	# the members of the other side of the set.
+	#
+	# To exercise this, we construct sets which have one member of one side
+	# and one or more of the other, and verify that all members can be
+	# found.
+	printf "a" > mbset; atf_check -o not-empty sed -ne '/[aà]/p' mbset
+	printf "à" > mbset; atf_check -o not-empty sed -ne '/[aà]/p' mbset
+	printf "a" > mbset; atf_check -o not-empty sed -ne '/[aàá]/p' mbset
+	printf "à" > mbset; atf_check -o not-empty sed -ne '/[aàá]/p' mbset
+	printf "á" > mbset; atf_check -o not-empty sed -ne '/[aàá]/p' mbset
+	printf "à" > mbset; atf_check -o not-empty sed -ne '/[abà]/p' mbset
+	printf "a" > mbset; atf_check -o not-empty sed -ne '/[abà]/p' mbset
+	printf "b" > mbset; atf_check -o not-empty sed -ne '/[abà]/p' mbset
+	printf "a" > mbset; atf_check -o not-empty sed -Ene '/[aà]/p' mbset
+	printf "à" > mbset; atf_check -o not-empty sed -Ene '/[aà]/p' mbset
+	printf "a" > mbset; atf_check -o not-empty sed -Ene '/[aàá]/p' mbset
+	printf "à" > mbset; atf_check -o not-empty sed -Ene '/[aàá]/p' mbset
+	printf "á" > mbset; atf_check -o not-empty sed -Ene '/[aàá]/p' mbset
+	printf "à" > mbset; atf_check -o not-empty sed -Ene '/[abà]/p' mbset
+	printf "a" > mbset; atf_check -o not-empty sed -Ene '/[abà]/p' mbset
+	printf "b" > mbset; atf_check -o not-empty sed -Ene '/[abà]/p' mbset
+}
+mbset_cleanup()
+{
+	rm -f mbset
+}
+
 atf_init_test_cases()
 {
 	atf_add_test_case bmpat
 	atf_add_test_case icase
+	atf_add_test_case mbset
 }