Re: Confusion with grep & locale?
- In reply to: deleted: "deleted (X-No-Archive)"
Date: Fri, 20 Aug 2021 15:09:03 UTC
On Fri, Aug 20, 2021 at 8:19 AM Helge Oldach <freebsd@oldach.net> wrote:

> Stefan Esser wrote on Fri, 20 Aug 2021 14:47:11 +0200 (CEST):
> > On 20.08.21 at 11:03, Helge Oldach wrote:
> > But POSIX makes no guarantees for locales other than POSIX or C.
>
> OK, thanks for the explanation. That clarifies a lot for me. Although
> it's not really POLA. :-)
>
> Thanks a lot also to Stefan Ehmann for the pointer to gawk oddities.
>
> > > # export LANG=en_US.ISO8859-1
> > > # (echo bla; echo Bla) | grep '[A-Z]'
> > > bla
> > > Bla
> >
> > This one is unexpected, the upper case should be a range of its own
> > and should not include any lower case letters.
> >
> > > # export LANG=en_US.UTF-8
> > > # (echo bla; echo Bla) | grep '[A-Z]'
> > > Bla
> >
> > Here I had expected the result you got with en_US.ISO8859-1 ...
> >
> > Definitely a bug in the definition of the collating sequences.
> >
> > And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
> > to be within [a-z], while de_DE.UTF-8 does not (but should).
> >
> > Seems that the correct collating sequences for ISO8859-1 and UTF-8 are
> > each assigned to the other one.
>
> PR 257972 raised.

I've looked at that, and I don't think it's a bug, since POSIX says the
behavior is undefined.

> > > There is nothing special in the environment, specifically no LC_xxx nor
> > > MM_CHARSET in either case.
> >
> > LANG defines LC_COLLATE, unless overridden.
>
> Indeed. I just explicitly mentioned *no* LC_xxx to clarify that it's not
> overridden. :-)
>
> > BTW, character classes work for your examples and more:
>
> Certainly they do. But they are harder to type... :-)

I think that a range like A-Za-z is undefined outside the C/POSIX locale,
but a character class like [:alpha:] is well defined. Most shell scripts
use the 'C' locale for this very reason.

Warner
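
For reference, here is a minimal sketch of the two workarounds mentioned in this thread, assuming a POSIX sh and a POSIX grep. The function names has_upper_c and has_upper_class are made up for illustration and are not part of any tool. The first forces the C locale for a single grep call, and the second uses a character class instead of a range.

```shell
#!/bin/sh

# Hypothetical helper: keep only lines containing an ASCII uppercase letter.
# LC_ALL=C takes precedence over LANG and all LC_* variables for this one
# command, so the range [A-Z] is evaluated in the C locale.
has_upper_c() {
    LC_ALL=C grep '[A-Z]'
}

# Hypothetical helper: same idea with a POSIX character class, which is
# defined by the locale's character classification (LC_CTYPE), not by its
# collation order.
has_upper_class() {
    grep '[[:upper:]]'
}

# Same input as in the thread.
printf 'bla\nBla\n' | has_upper_c
printf 'bla\nBla\n' | has_upper_class
```

Both calls should print only the line Bla. The LC_ALL=C form restricts matching to ASCII letters, while the character-class form follows whatever the current locale considers upper case.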