Re: Grep with non-ascii
- Reply: Tomoaki AOKI : "Re: Grep with non-ascii"
- In reply to: Tomoaki AOKI : "Re: Grep with non-ascii"
Date: Fri, 03 Feb 2023 16:31:55 UTC
On Sat, 4 Feb 2023 01:06:05 +0900,
Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:

> On Fri, 3 Feb 2023 15:18:53 +0100
> Eivind Nicolay Evensen <eivinde@terraplane.org> wrote:
>
> > On Fri, 3 Feb 2023 19:12:32 +0700,
> > Eugene Grosbein <eugen@grosbein.net> wrote:
> >
> > > 03.02.2023 17:06, Eivind Nicolay Evensen wrote:
> > > > Hello.
> > > >
> > > > I just noticed this today:
> > > >
> > > > elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø
> > > > grep: trailing backslash (\)
> > > > elg!ene[~]> echo $LC_CTYPE $LANG
> > > > nb_NO.ISO8859-1 nb_NO.ISO8859-1
> > > >
> > > > While I have the result I envisioned with gnugrep:
> > > >
> > > > elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø
> > > > bø
> > > > øl
> > > >
> > > > Also, on OpenIndiana, Linux and NetBSD, grep gives the proper
> > > > result.
> > > >
> > > > Is lib/libc/regex the right place to look into this if I
> > > > find the time, or does anybody know this enough to know the
> > > > problem?
> > >
> > > Try single quotes instead of double quotes.
> > > And please specify system version and shell name, and shell
> > > version if it's not in base system.
> >
> > This is
> > elg!ene[~]> uname -a
> > FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD
> > 13.2-PRERELEASE #1: Tue Jan 31 11:23:29 CET 2023
> > ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> > amd64
> >
> > Using the tcsh that comes with it. But I don't think the quotes
> > matter much because of this:
> >
> > elg!ene[~]> grep ø
> > grep: trailing backslash (\)
> >
> > The output was more just to have something to look for, like
> > with ggrep but anyway:
> >
> > elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø
> > grep: trailing backslash (\)
> >
> > And obviously:
> >
> > elg!ene[~]> printf 'bø\nhei\nøl\n'
> > bø
> > hei
> > øl
> >
> > And it seems to be the same for any 8859-1 character not part
> > of ascii:
> >
> > elg!ene[~]> grep ä
> > grep: trailing backslash (\)
> > elg!ene[~]> grep ß
> > grep: trailing backslash (\)
> > elg!ene[~]> grep ç
> > grep: trailing backslash (\)
> >
> > --
> > Eivind Nicolay Evensen
>
> I recalled a very, very old problem with Japanese characters.
> Do the characters you mentioned include 0x5c in the nb_NO.ISO8859-1
> charset?
>
> In the dirty, ugly DOS era, Shift-JIS (CP932) was the mainstream in
> Japan. In this charset, some 2-byte kanji characters have 0x5c as
> their second byte.
>
> This caused imported, non-Japanese-aware software to mishandle
> Japanese text, and the workaround was to add extra 0x5c bytes after
> the problematic characters. :-(
>
> For example, 表示 in a Shift-JIS bytestream was 0x95 0x5c 0x8e 0xa6,
> and as 0x5c was usually considered a backslash, the escape character,
> it was modified to 0x95 0x8e 0xa6 in non-Japanese software.
> As this mis-conversion often happened recursively, the required
> number of extra 0x5c bytes varied, varied and varied!!!!! Crazily.
>
> If this is a case like the above, the only solution is to move to
> a character set containing ALL characters all over the world.
>
> AFAIK, there are only two candidates, TRON code [1] and Unicode
> (UCS, ISO/IEC 10646) [2]. And as TRON code is very rarely used, the
> actual candidate would be Unicode only.
> Note that Unicode is usually encoded as one of UTF-8, UTF-16 or
> UTF-32 for data transfer (sometimes raw UCS-2?).
>
>
> [1] https://en.wikipedia.org/wiki/TRON_(encoding)
> [2] https://en.wikipedia.org/wiki/Unicode
>
> P.S.
> In UTF-8, the character ø is encoded as 0xC3 0xB8. So it should be
> OK.

In 8859-1, "ø" is:

elg!ene[~]> printf ø |hexdump -C
00000000  f8                                                |ø|
00000001

so this does not seem to be the problem here. And all those
characters I tried are one-byte (all 8859-1 are):

elg!ene[~]> printf "äßç" |hexdump -C
00000000  e4 df e7                                          |äßç|
00000003

So I do not believe this is the same problem. I did, however,
find it interesting that multi-byte character sets may have been
in use longer than I imagined.

--
Eivind Nicolay Evensen
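
For anyone who wants to re-check the byte values discussed in this thread without a hexdump at hand, here is a minimal Python sketch. It is not part of the original exchange. The helper names hex_bytes and has_backslash_byte are made up for illustration, and the two-kanji Shift-JIS example is assumed to be 表示, the string whose Shift-JIS encoding is 0x95 0x5c 0x8e 0xa6. The sketch only inspects encodings; it does not reproduce or explain the grep error itself.

```python
# Inspect how the characters from this thread are encoded.
# Helper names are hypothetical; only standard Python codecs are used.

def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


def has_backslash_byte(text, encoding):
    """True if the encoded form contains the byte 0x5c (ASCII backslash)."""
    return 0x5C in text.encode(encoding)


if __name__ == "__main__":
    # The single-byte ISO 8859-1 characters tried with grep.
    for ch in "øäßç":
        print(ch, "iso-8859-1:", hex_bytes(ch, "iso-8859-1"),
              "contains 0x5c:", has_backslash_byte(ch, "iso-8859-1"))

    # The same ø in UTF-8 takes two bytes.
    print("ø utf-8:", hex_bytes("ø", "utf-8"))

    # Assumed Shift-JIS example: the first kanji ends in 0x5c.
    print("表示 shift_jis:", hex_bytes("表示", "shift_jis"),
          "contains 0x5c:", has_backslash_byte("表示", "shift_jis"))
```

None of the ISO 8859-1 characters produce a 0x5c byte, which matches the hexdump output quoted above, while the assumed Shift-JIS string does.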