POSIX regex VS. multi-byte characters

Fri Sep 2 06:37:57 UTC 2011

Hi Gabor,

* Gabor Kovesdan <gabor at freebsd.org> [110902 04:08]:
> While working on bringing new regex code into FreeBSD, I ran into an 
> issue. POSIX says here: 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

> "Matching shall be based on the bit pattern used for encoding the 
> character, not on the graphic representation of the character. This 
> means that if a character set contains two or more encodings for a 
> graphic symbol, or if the strings searched contain text encoded in more 
> than one codeset, no attempt is made to search for any other 
> representation of the encoded symbol. If that is required, the user can 
> specify equivalence classes containing all variations of the desired 
> graphic symbol."

> According to my interpretation of this text, if someone specifies as a 
> pattern a single byte that can be a prefix of a multi-byte character, 
> it shall match, since matching is based on the bit pattern, not on 
> semantic meaning. Besides, in a consistent environment that uses a single 
> encoding, and supposing a user with common sense who would not 
> enter meaningless input, only whole characters should occur in the 
> pattern. However, GNU grep has a test in its regression test suite that 
> contradicts this and takes the opposite approach, i.e. a pattern shall not 
> match a fragment of a character. Looking at the standard, I think GNU 
> grep is incorrect and my interpretation is the correct one.
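
To make the two readings concrete, here is a minimal sketch. It assumes
UTF-8 as the encoding and uses Python's re module purely as a stand-in:
re is not a POSIX regex implementation, and the sample word is made up
for illustration. It contrasts searching the raw bytes (where a lone lead
byte matches inside a character) with searching whole characters.

```python
import re

text = "caf\u00e9"               # hypothetical sample; ends in U+00E9 (e with acute)
data = text.encode("utf-8")      # b'caf\xc3\xa9': the last character takes two bytes
lead_byte = b"\xc3"              # first byte of that two-byte sequence

# Byte-oriented reading: the lone lead byte is found inside the character.
byte_hit = re.search(re.escape(lead_byte), data)

# The fragment is not a character by itself: it cannot be decoded as UTF-8.
try:
    lead_byte.decode("utf-8")
    fragment_is_character = True
except UnicodeDecodeError:
    fragment_is_character = False

# Character-oriented reading: patterns are whole characters, so a match
# always covers the complete byte sequence of each matched character.
char_hit = re.search("\u00e9", text)
whole_hit = re.search(re.escape("\u00e9".encode("utf-8")), data)

print(byte_hit.span(), fragment_is_character)
print(char_hit.span(), whole_hit.span())
```

Under the byte-oriented reading the fragment matches at byte offset 3;
under the character-oriented reading it is not a valid pattern at all.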

I think you are misinterpreting the standard here. As I read it, the
phrase "bit pattern used for encoding the character" means the complete
byte sequence that encodes the character, not just a single byte of it.
The paragraph quoted above is about characters that have several different
encodings, e.g. characters that exist as a single code point but can also
be encoded as a base character followed by combining diacritical marks.
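
A small sketch of that case, assuming Unicode text and again using
Python's re module only as an illustration (it is not a POSIX regex
implementation). The sample string is hypothetical, and the NFC
normalization step is my own stand-in for "searching for the other
representation"; POSIX itself points to equivalence classes instead.

```python
import re
import unicodedata

precomposed = "\u00e9"       # e with acute as a single code point
decomposed = "e\u0301"       # base letter 'e' + COMBINING ACUTE ACCENT

# Same graphic symbol, two different encodings.
print(precomposed.encode("utf-8"))   # two bytes
print(decomposed.encode("utf-8"))    # three bytes

text = "caf" + decomposed    # hypothetical input using the decomposed form

# Matching follows the encoding, so the precomposed pattern does not find
# the decomposed form of the same symbol.
direct = re.search(precomposed, text)

# Only after both sides are brought to one representation does it match.
normalized = re.search(precomposed, unicodedata.normalize("NFC", text))

print(direct, normalized.span())
```

No attempt is made to find the other representation unless the caller
asks for it explicitly, which is what the quoted paragraph describes.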

Wolfgang