Re: FreeBSD awk behavior change proposal

From: Stefan Esser <se_at_freebsd.org>
Date: Fri, 09 Jul 2021 14:36:40 UTC
On 09.07.21 at 15:21, Rodney W. Grimes wrote:
>> Greetings,
>>
>> I've posted https://reviews.freebsd.org/D31114 which eliminates the last
>> delta we have from upstream one-true-awk. This delta has basically been
>> rejected by upstream as being a really bad idea. Let me give some
>> background.
>>
>> In 2005, FreeBSD changed one-true-awk to honor the locale's collating order.
>> https://svnweb.freebsd.org/base/head/usr.bin/awk/b.c.diff?annotate=146322&pathrev=201988
>> This was billed as a temporary patch. It was also compatible with
>> the then-current behavior of gawk. That temporary patch has lasted 16
>> years now.
>>
>> However, IEEE Std 1003.1-2008 changed the behavior of ranges in regular
>> expressions outside of the "C" and "POSIX" locales to be undefined.
>>
>> Starting in 2011, gawk 4.0 stopped using the locale for the range
>> regular expressions and used the traditional behavior only. The
>> maintainer had grown weary of answering why '[A-Z]' would sometimes
>> match lower-case letters. The details are explained here:
>> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
>>
>> To restore compatibility with other implementations of awk, revert this
>> patch. FreeBSD is the odd system out. It also has the nice side effect
>> of eliminating the last of our differences with upstream one-true-awk.
>>
>> I'd like to commit the change at least to -current. Ideally, I'd like to MFC
>> the change. I believe better compatibility with gawk and other awk
>> implementations justifies this change in behavior because the current
>> behavior is outside the mainstream enough to be considered a bug.
>>
>> I'd like to solicit input before I do this, however.
> 
> My only concern is whether anything in the ports system gets
> tickled by this change. I know it's a PITA, but maybe have an exp-run
> done? I reviewed and accepted the differential, and by examination
> I do not see how this could cause an issue now, so meh, give it a long
> bake in -current and things should be OK.

While possible in theory, I do not see how the ports system could
be affected in practice.

Ports are built in a C/POSIX locale on the official builders. Using
a different locale and collating sequence on a user's system could
therefore break a port build, but no port should ever require one.
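
To illustrate why the collating sequence matters for a range like
'[A-Z]', here is a minimal Python sketch. It does not use awk or any
real locale data: the dictionary-style ordering aAbB...zZ and the two
helper functions are assumptions made up for the illustration.

```python
import string

# Hypothetical dictionary-style collating sequence: aAbBcC...zZ.
# This only imitates the kind of ordering a non-C locale may use;
# it is not taken from any real locale definition.
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range_codepoint(ch, lo, hi):
    """Traditional (C locale) behavior: compare raw code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_collated(ch, lo, hi, order=COLLATION):
    """Collation-based behavior: compare positions in the collating sequence."""
    pos = order.find(ch)
    return pos != -1 and order.index(lo) <= pos <= order.index(hi)


if __name__ == "__main__":
    # Show, per character, whether it falls into the range A-Z
    # under code point order and under the hypothetical collation.
    for ch in "aAbzZ":
        print(ch, in_range_codepoint(ch, "A", "Z"), in_range_collated(ch, "A", "Z"))
```

With this ordering, the lower-case letters b through z fall inside the
range A-Z, while under code point order only upper-case letters do.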

I have checked the port Makefiles for occurrences of LANG or LC_*
outside specific command invocations (e.g. to set the locale for
a sort command). These are the results:

- ${USE_LOCALE} is used in bsd.port.mk, but the only case where
  a locale other than C or en_US.UTF-8 is specified is shells/fd
  which has USE_LOCALE=ja (i.e. does not specify an encoding).

- ${ELIXIR_LOCALE} is used to set LANG and LC_ALL for USES=elixir.
  But ELIXIR_LOCALE is only ever set to en_US.UTF-8, AFAICT.

- print/libpaper explicitly requests LANG=C LC_ALL=C for AWK.

- The only port that requests a locale that is not en_US.UTF-8,
  en_US.ISO8859-1, or C is textproc/te-hunspell, which uses
  LANG=te_IN.utf8 LC_ALL=te_IN.utf8 to execute wordlist2hunspell.
  That single shell script does not invoke AWK, and internally it
  sets LC_ALL=C for sort and uniq so that they do not depend on an
  externally set locale.

All other cases where LC_* or LANG are used in port Makefiles are
in e.g. EXTRACT_CMD, TEST_ENV, or patch files. Those enforce a C or
C.UTF-8 locale (or en_US.*) and are thus not affected by the proposed
change to AWK (besides, they often only set the locale for a tar
file extraction).

If an exp-run is planned for other reasons, the modified AWK could
be included in it as a low-risk addition.

But I do not see any possible effect on the ports system, after
performing a grep for LANG and LC_* on the Makefiles and patch
files.
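
For anyone who wants to repeat a check of this kind, the following is
a rough Python sketch of such a scan. It is not the command I actually
ran; the regular expressions, the function names and the
category/port/Makefile layout are assumptions for illustration, and
every hit still needs manual review (a value given as a make variable
such as ${ELIXIR_LOCALE} is reported, too).

```python
import re
from pathlib import Path

# Matches assignments such as LANG=C or LC_ALL=te_IN.utf8.
# The \b keeps names like ELIXIR_LANG from matching.
LOCALE_VAR = re.compile(r"\b(LANG|LC_[A-Z]+)=([^\s;]+)")

# Locales treated as harmless here: C, POSIX, C.UTF-8 and en_US.*
SAFE = re.compile(r"^(C|POSIX|C\.UTF-8|en_US\..*)$")


def find_locale_settings(text):
    """Return (variable, value) pairs for every LANG/LC_* assignment in text."""
    return [
        (m.group(1), m.group(2).strip("\"'"))
        for m in LOCALE_VAR.finditer(text)
    ]


def unusual_settings(text):
    """Keep only the assignments whose value is not in the SAFE set."""
    return [(var, val) for var, val in find_locale_settings(text)
            if not SAFE.match(val)]


def scan_tree(root):
    """Yield (path, hits) for each category/port/Makefile with unusual settings."""
    for makefile in sorted(Path(root).glob("*/*/Makefile")):
        hits = unusual_settings(makefile.read_text(errors="replace"))
        if hits:
            yield makefile, hits
```

Calling scan_tree() on the root of a ports checkout and printing each
result gives a short list of Makefiles to look at by hand.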

Regards, STefan