POSIX regex VS. multi-byte characters
Gabor Kovesdan
gabor at FreeBSD.org
Fri Sep 2 02:26:10 UTC 2011
Hi Folks,
While working on bringing new regex code into FreeBSD, I ran into an
issue. POSIX says here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09
"Matching shall be based on the bit pattern used for encoding the
character, not on the graphic representation of the character. This
means that if a character set contains two or more encodings for a
graphic symbol, or if the strings searched contain text encoded in more
than one codeset, no attempt is made to search for any other
representation of the encoded symbol. If that is required, the user can
specify equivalence classes containing all variations of the desired
graphic symbol."
As I read this text, if someone specifies as a pattern a single byte
that can be the prefix of a multi-byte character, it shall match, since
matching is based on the bit pattern, not on semantic meaning. Besides,
in a consistent environment that uses a single encoding, and assuming a
sensible user who does not enter meaningless input, only whole
characters should occur in the pattern anyway. However, GNU grep has a
test in its regression test suite that contradicts this and takes the
opposite approach, i.e. a pattern shall not match a fragment of a
character. Looking at the standard, I think GNU grep is incorrect and my
interpretation is the correct one.
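To make the two readings concrete, here is a minimal Python sketch. It
is only an illustration and assumes UTF-8 input; the function names
byte_match, char_aligned_match and is_boundary are made up for this
example, and it is not how GNU grep or the new regex code is
implemented. The first function matches on raw bytes (my reading), the
second rejects any match that starts or ends inside a character (the
behaviour the GNU grep test expects).

```python
import re


def is_boundary(text: bytes, i: int) -> bool:
    # Assumes UTF-8: offset i is a character boundary if it is the end
    # of the buffer or the byte there is not a continuation byte
    # (continuation bytes have the form 10xxxxxx).
    return i == len(text) or (text[i] & 0xC0) != 0x80


def byte_match(pattern: bytes, text: bytes) -> bool:
    # Byte-oriented reading: any occurrence of the byte sequence counts.
    return re.search(re.escape(pattern), text) is not None


def char_aligned_match(pattern: bytes, text: bytes) -> bool:
    # Character-oriented reading: a match counts only if both ends fall
    # on character boundaries. finditer yields non-overlapping matches,
    # which is enough for this sketch.
    for m in re.finditer(re.escape(pattern), text):
        if is_boundary(text, m.start()) and is_boundary(text, m.end()):
            return True
    return False


text = "\u00e9".encode("utf-8")   # e-acute, two bytes: 0xC3 0xA9
fragment = b"\xc3"                # lead byte only, not a whole character

print(byte_match(fragment, text))          # True
print(char_aligned_match(fragment, text))  # False
print(char_aligned_match(text, text))      # True
```

With a whole character as the pattern both readings agree; they differ
only when the pattern is a fragment of a character.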
Could you please comment on this?
Thanks,
Gabor Kovesdan