Re: Confusion with grep & locale?
- In reply to: deleted: "deleted (X-No-Archive)"
Date: Fri, 20 Aug 2021 12:47:11 UTC
On 20.08.21 at 11:03, Helge Oldach wrote:
> Hi all,
>
> I'm confused about the FreeBSD behaviour with respect to locale's
> and grep - specifically, it seems case sensitivity is not handled
> consistently when grepping character ranges. It looks to me like 11 and
> 13 are not behaving consistently however I'm unclear why.
>
> # uname -a
> FreeBSD 11STABLE 11.4-STABLE FreeBSD 11.4-STABLE #1059 r368289M: Thu Dec 3 01:48:30 UTC 2020 root@XXX amd64
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> bla
> Bla

This is not unexpected, since the default collating sequence for many
UTF-8 locales places each lower case letter immediately before its
upper case version, i.e.: "aAbBcC...". With such an ordering, the range
[A-Z] spans every letter except "a", so "bla" matches as well.

https://developer.mimer.com/services/sql-unicode-collation-charts/

Here is a collation chart for English:

https://download.mimer.com/pub/developer/charts/english.htm

But POSIX makes no guarantees for locales other than POSIX or C.

> # uname -a
> FreeBSD 13STABLE 13.0-STABLE FreeBSD 13.0-STABLE #49 stable/13-n246779-64085efb677-dirty: Mon Aug 16 08:42:53 CEST 2021 root@XXX amd64
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> bla
> Bla

This one is unexpected: the upper case letters should form a range of
their own, and that range should not include any lower case letters.

> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla

Correct.

> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla

Here I would have expected the result you got with en_US.ISO8859-1 ...
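For scripts that need [A-Z] to mean exactly the 26 ASCII upper case letters regardless of the user's locale, one possible workaround is to pin the locale for the grep invocation only. This is a minimal sketch, not something tested on the systems quoted above; the function name upper_c is a hypothetical helper introduced here for illustration. It relies only on the POSIX guarantee for the C locale mentioned above.

```sh
#!/bin/sh
# Sketch: evaluate the range expression [A-Z] in the C (POSIX) locale,
# independent of whatever LANG is set to in the calling environment.
# upper_c is a hypothetical helper name, not an existing tool.
upper_c() {
    # LC_ALL overrides LANG and every LC_* variable, but only for this
    # single grep invocation; the caller's environment is left untouched.
    LC_ALL=C grep '[A-Z]'
}

# Same input as in the examples above.
(echo bla; echo Bla) | upper_c
```

Note that LC_ALL=C also switches LC_CTYPE to C, so non-ASCII characters are then treated as plain bytes. Setting only LC_COLLATE=C would be the narrower change, but whether a given grep honours it for ranges may depend on the implementation.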
> For comparison, a Linux RHEL box delivers the expected results:
>
> # uname -a
> Linux rhel.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla

It seems that this version uses a POSIX-style collating sequence for
UTF-8. It would be interesting to test with ranges that contain accented
characters or German umlauts.

> There is nothing special in the environment, specifically no LC_xxx nor
> MM_CHARSET in either case.

LANG defines LC_COLLATE, unless overridden.

> Any guidance is appreciated... Thanks!

Definitely a bug in the definition of the collating sequences.

And I have just verified that de_DE.ISO8859-1 wrongly considers "ö" to
be within [a-z], while de_DE.UTF-8 does not (but should).

It seems that the correct collating sequences for ISO8859-1 and UTF-8
are each assigned to the other one.

Some platforms have switched to the POSIX-style collating sequence so
that the traditional [A-Z] keeps working as a stand-in for [[:upper:]],
since a lot of shell scripts have been written with that assumption for
decades.

BTW, character classes work for your examples and more:

# (echo bla; echo Bla) | LANG=en_US.ISO8859-1 grep '[[:upper:]]'
Bla
# (echo bla; echo Bla) | LANG=en_US.UTF-8 grep '[[:upper:]]'
Bla
# (echo "o"; echo "ö") | LANG=de_DE.ISO8859-1 grep '[[:lower:]]'
o
# (echo "o"; echo "ö") | LANG=de_DE.UTF-8 grep '[[:lower:]]'
o
ö

Regards, STefan
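
PS: The remark "LANG defines LC_COLLATE, unless overridden" can be illustrated with a small sketch of the lookup order POSIX describes: LC_ALL wins if set and non-empty, then LC_COLLATE, then LANG. The function effective_collate is a hypothetical helper written for this illustration, not an existing utility. It only mirrors the precedence rule and does not check whether the named locale is actually installed.

```sh
#!/bin/sh
# Sketch of the POSIX precedence for the collation locale.
# effective_collate is a hypothetical helper, not an existing tool.
#   $1 = value of LC_ALL, $2 = value of LC_COLLATE, $3 = value of LANG
# Any of the three may be the empty string (meaning unset or empty).
effective_collate() {
    if [ -n "$1" ]; then
        echo "$1"                       # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        echo "$2"                       # LC_COLLATE overrides LANG
    elif [ -n "$3" ]; then
        echo "$3"                       # LANG is the fallback
    else
        echo "(implementation default)" # POSIX leaves this case to the system
    fi
}

# What the current environment would select:
effective_collate "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"

# LANG alone, as in the examples above:
effective_collate "" "" "en_US.UTF-8"

# LC_COLLATE overriding LANG:
effective_collate "" "C" "en_US.UTF-8"
```

With only LANG set, as in the original report, the collating sequence of the LANG locale is the one that decides what a range like [A-Z] contains.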