sort is broken
Per Hedeland
per at hedeland.org
Sun Nov 3 01:23:26 UTC 2019
On 2019-11-02 23:29, Dr. Nikolaus Klepp wrote:
> Anno domini 2019 Sat, 02 Nov 15:11:37 -0700
> Ronald F. Guilmette scripsit:
>> In message <eec0b13b-b5d6-7e51-6241-8e1898150315 at queldor.net>, you wrote:
>>
>>>
>>>
>>>
>>> On 11/2/19 5:14 PM, Ronald F. Guilmette wrote:
>>>> Not a question, just an expression of grief and deep dismay.
>>>>
>>>> It is a sad day when even very fundamental tools, used in billions
>>>> of scripts, such as /usr/bin/sort turn up broken.
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241679
>>>>
>>>> Regards,
>>>> rfg
>>>>
>>>
>>> root at q4:/ # sort a
>>> zürich.email
>>> root at q4:/ # sort < a
>>> zürich.email
>>> root at q4:/ # uname -a
>>> FreeBSD q4.queldor.net 12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC
>>> amd64
>>> root at q4:/ # cat a
>>> zürich.email
>>> root at q4:/ #
>>>
>>> Seems to be fine on my 12.0
>>
>> Well, I guess it's just me then...
>>
>> % uname -a
>> FreeBSD segfault.tristatelogic.com 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC amd64
>> % sort --version
>> 2.3-FreeBSD
>>
>>
>> What version of sort do you have?
>
> I remember that this sort of thing has been around since at least 11.0. The problem occurs when you have UTF-8 encoding set as default, but the input data is ISO 8859-1. Some characters of ISO 8859-1 (äöü...) are not valid in UTF-8.
This is exactly the problem - in fact, by definition (see RFC 3629)
*no* byte with a value outside the range 0x00 to 0x7f is valid on its
own in UTF-8 - this is the case for all 96 printable characters in the
upper half of 8859-1, 0xa0 to 0xff (ü is 0xfc).
$ uname -a
FreeBSD pluto.hedeland.org 12.0-RELEASE FreeBSD 12.0-RELEASE GENERIC amd64
$ env LANG=C sort < /tmp/test
zürich.email
$ env LANG=en_US.UTF-8 sort < /tmp/test
sort: Illegal byte sequence
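For illustration, the same rejection can be reproduced outside of sort.
This is only a minimal Python sketch (Python is my choice here, it says
nothing about how sort is implemented), and it assumes the file holds
the single 8859-1 byte 0xfc for "ü":

```python
# The 8859-1 bytes for "zürich.email\n"; 0xfc is "ü" in 8859-1.
data = b"z\xfcrich.email\n"

# A strict UTF-8 decoder rejects the lone 0xfc byte.
bad_offset = None
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    bad_offset = err.start  # offset of the first offending byte

# Decoding as 8859-1 cannot fail: every byte value maps to a character.
text = data.decode("iso-8859-1")

print(utf8_ok, bad_offset, repr(text))
```

The decoder gives up at the second byte, which is where the "ü" sits.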
And the "success" case:
$ env LANG=en_US.UTF-8 sort /tmp/test
zürich.email
Not sure if it survives the e-mail encoding, but the output here has
actually been *converted* to the correct UTF-8 representation - if my
terminal was set up for UTF-8, I would actually see "ü" there.
$ od -t x1 /tmp/test
0000000 7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a
0000015
$ env LANG=en_US.UTF-8 sort /tmp/test | od -t x1
0000000 7a c3 bc 72 69 63 68 2e 65 6d 61 69 6c 0a
0000016
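The output bytes are what you would get from a plain 8859-1 to UTF-8
conversion. As a rough check (again a Python sketch of my own, not a
claim about what sort does internally), take the input bytes from the
first od dump, interpret them as 8859-1, and re-encode as UTF-8:

```python
# Input bytes exactly as shown by "od -t x1 /tmp/test".
latin1_bytes = bytes.fromhex("7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a")

# Interpret them as 8859-1 text, then re-encode that text as UTF-8.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# Print in the same space-separated hex style as od.
utf8_hex = " ".join(f"{b:02x}" for b in utf8_bytes)
print(utf8_hex)
print(len(latin1_bytes), len(utf8_bytes))
```

The single byte fc becomes the two-byte sequence c3 bc, so the data
grows by one byte, which agrees with the od offsets above (octal 15
versus octal 16).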
I wouldn't consider the "Illegal byte sequence" case a bug - if
anything, the "success" case is the bug. Why is the content converted,
and why does reading a named file behave differently from reading stdin?
--Per Hedeland