POSIX regex VS. multi-byte characters
Wolfgang Zenker
wolfgang at lyxys.ka.sub.org
Fri Sep 2 06:37:57 UTC 2011
Hi Gabor,
* Gabor Kovesdan <gabor at freebsd.org> [110902 04:08]:
> While working on bringing new regex code into FreeBSD, I ran into an
> issue. POSIX says here:
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09
> "Matching shall be based on the bit pattern used for encoding the
> character, not on the graphic representation of the character. This
> means that if a character set contains two or more encodings for a
> graphic symbol, or if the strings searched contain text encoded in more
> than one codeset, no attempt is made to search for any other
> representation of the encoded symbol. If that is required, the user can
> specify equivalence classes containing all variations of the desired
> graphic symbol."
> According to my interpretation of this text, if someone specifies a
> single byte as a pattern, and that byte can be a prefix of a multi-byte
> character, it shall match, since matching is based on the bit pattern,
> not on semantic meaning. Besides, in a consistent environment that uses
> a single encoding, and supposing a user with common sense who would not
> enter meaningless input, only whole characters should occur in the
> pattern. However, GNU grep has a test in its regression test suite that
> contradicts this and takes the opposite approach, i.e. it shall not
> match a fragment of a character. Looking at the standard, I think GNU
> grep is incorrect and my interpretation is the correct one.
I think you are misinterpreting the standard here. As I read it, the
phrase "bit pattern used for encoding the character" means the complete
byte sequence that encodes the character, not just a single byte of it.
The paragraph quoted above is about characters that have several
different encodings, e.g. characters that exist as a single code point
but can also be encoded as a base character followed by combining
diacritical marks.
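
A small sketch of the distinction, in Python rather than with POSIX
regcomp(), so it only illustrates the byte-versus-character question and
says nothing about how FreeBSD's or GNU's regex code behaves. It assumes
UTF-8 as the encoding, and the variable names are made up for the example.

```python
import re
import unicodedata

# The same graphic symbol in two representations (assumed encoding: UTF-8).
precomposed = "\u00e9"                                  # e-acute as one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # "e" + U+0301 combining acute

print(precomposed.encode("utf-8"))  # b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81'

# Character-level matching: the pattern is the complete encoded character.
# It finds the identical encoding but not the other representation.
char_hits_precomposed = re.search(precomposed, "caf" + precomposed) is not None
char_hits_decomposed = re.search(precomposed, "caf" + decomposed) is not None

# Byte-level matching: the lone lead byte 0xC3 is only a fragment of the
# two-byte sequence, yet a byte-oriented search still reports a match.
fragment_hits = re.search(b"\xc3", ("caf" + precomposed).encode("utf-8")) is not None

print(char_hits_precomposed, char_hits_decomposed, fragment_hits)
```

The first two results correspond to the case the quoted paragraph
describes: no attempt is made to find the other representation of the
symbol. The third is the fragment match that is in dispute here.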
Wolfgang