sort is broken

Sun Nov 3 01:23:26 UTC 2019

On 2019-11-02 23:29, Dr. Nikolaus Klepp wrote:
> Anno domini 2019 Sat, 02 Nov 15:11:37 -0700
>   Ronald F. Guilmette scripsit:
>> In message <eec0b13b-b5d6-7e51-6241-8e1898150315 at queldor.net>, you wrote:
>>
>>>
>>> On 11/2/19 5:14 PM, Ronald F. Guilmette wrote:
>>>> Not a question, just an expression of grief and deep dismay.
>>>>
>>>> It is a sad day when even very fundamental tools, used in billions
>>>> of scripts, such as /usr/bin/sort turn up broken.
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241679
>>>>
>>>> Regards,
>>>> rfg
>>>>
>>>
>>> root at q4:/ # sort a
>>> zürich.email
>>> root at q4:/ # sort < a
>>> zürich.email
>>> root at q4:/ # uname -a
>>> FreeBSD q4.queldor.net 12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC
>>> amd64
>>> root at q4:/ # cat a
>>> zürich.email
>>> root at q4:/ #
>>>
>>> Seems to be fine on my 12.0
>>
>> Well, I guess it's just me then...
>>
>> % uname -a
>> FreeBSD segfault.tristatelogic.com 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC  amd64
>> % sort --version
>> 2.3-FreeBSD
>>
>>
>> What version of sort do you have?
> 
> I remember that this sort of thing has been around since at least 11.0. The problem occurs when you have UTF-8 encoding set as the default but the input data is ISO 8859-1. Some characters of ISO 8859-1 (äöü...) are not valid in UTF-8.

This is exactly the problem. In fact, by definition (see RFC 3629), *no*
byte with a value outside the range 0x00 to 0x7f is valid on its own as
a character in UTF-8. That affects every character in the upper half of
8859-1, 96 printable ones in all (ü is 0xfc).
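
For anyone who wants to check this without involving sort at all, here
is a minimal sketch in Python. It assumes the test file holds
"zürich.email" encoded in 8859-1; the helper name is_valid_utf8 is made
up for illustration and is not part of any tool discussed here.

```python
# Assumed content of the test file: "z", the single byte 0xfc (ü in
# ISO 8859-1), "rich.email", and a trailing newline.
data = bytes([0x7A, 0xFC]) + b"rich.email\n"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xfc byte is not a legal UTF-8 sequence, so this reports False.
print(is_valid_utf8(data))

# The same bytes are perfectly good ISO 8859-1.
print(repr(data.decode("iso-8859-1")))
```

A strict decoder rejecting the input here is the same kind of complaint
as "Illegal byte sequence", though I am not claiming this is how sort
does its check internally.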

$ uname -a
FreeBSD pluto.hedeland.org 12.0-RELEASE FreeBSD 12.0-RELEASE GENERIC  amd64
$ env LANG=C sort < /tmp/test
zürich.email
$ env LANG=en_US.UTF-8 sort < /tmp/test
sort: Illegal byte sequence

And the "success" case:

$ env LANG=en_US.UTF-8 sort /tmp/test
zÃ¼rich.email

Not sure if it survives the e-mail encoding, but the output here has
actually been *converted* to the correct UTF-8 representation - if my
terminal were set up for UTF-8, I would actually see "ü" there.
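
In case the two-character form gets mangled in transit, this small
Python sketch shows where it comes from. It assumes a terminal that
interprets every byte as ISO 8859-1; the variable names are only for
illustration.

```python
# "zürich.email", written with an escape so the source stays pure ASCII.
text = "z\u00fcrich.email"

# What a UTF-8 aware program writes to the terminal.
utf8_bytes = text.encode("utf-8")

# What an ISO 8859-1 terminal makes of those bytes: each byte becomes
# one character, so the two-byte sequence for ü shows up as two glyphs.
shown = utf8_bytes.decode("iso-8859-1")

print(shown)
print(len(text), len(shown))
```

The displayed string is one character longer than the original, since
the single ü has turned into two characters.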

$ od -t x1 /tmp/test
0000000    7a  fc  72  69  63  68  2e  65  6d  61  69  6c  0a
0000015
$ env LANG=en_US.UTF-8 sort /tmp/test | od -t x1
0000000    7a  c3  bc  72  69  63  68  2e  65  6d  61  69  6c  0a
0000016
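
The second dump is byte for byte what a plain 8859-1 to UTF-8
conversion of the first would give. A minimal Python sketch to confirm
that, with the input bytes copied from the first od output (the
hex_dump helper is hypothetical, and this says nothing about what sort
actually does inside):

```python
# Input bytes as shown by the first od run on /tmp/test.
original = bytes.fromhex("7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a")

# Interpret the bytes as ISO 8859-1, then re-encode them as UTF-8.
converted = original.decode("iso-8859-1").encode("utf-8")


def hex_dump(raw: bytes) -> str:
    """Format bytes as space-separated hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in raw)


print(hex_dump(converted))
# Lengths in octal, comparable to the final offsets printed by od.
print(oct(len(original)), oct(len(converted)))
```

The single byte fc becomes the pair c3 bc and everything else is
unchanged, which accounts for the one-byte growth from offset 0000015
to 0000016.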

I wouldn't consider the "Illegal byte sequence" case a bug. If anything,
the "success" case is the bug - why is the content converted, and why
does reading the file by name behave differently from reading it on
stdin?

--Per Hedeland