Re: What's the locale for system files (e.g. /etc/fstab)?

From: Warner Losh <imp_at_bsdimp.com>
Date: Fri, 25 Mar 2022 04:08:47 UTC
On Thu, Mar 24, 2022 at 2:51 PM Phil Shafer <phil@juniper.net> wrote:

> On 24 Mar 2022, at 15:12, Warner Losh wrote:
> > On Thu, Mar 24, 2022, 10:30 AM Simon J. Gerraty
> > <sjg@juniper.net> wrote:
> >> AFAIK virtually everything about locale support tells you about the
> >> locale for the current process - which does not necessarily inform you
> >> of the locale that was in effect when a system file was last edited.
>
> Exactly.  The value of $LANG is transient, leaving no clue about the
> encoding of the data.
>
> >> There's probably something to be said for enforcing something like
> >> C.UTF-8 for system files.
>
> I'd like to have UTF-8 as a given, or at least something definitive like
> the symlink idea.  Something that tells df, mount, etc how to treat the
> value, so that it knows if it's locale-based ("%hs" for libxo) or utf-8
> ("%s" for libxo).
>

Right now we use %s for these things in all the other utilities (or have
traditionally done so; I've not checked recently). We don't set up the
locale stuff in these programs at all, so to match historic practice,
I think libxo should use %s.


> > That is the primary reason for system files always being C.UTF-8...
> > There is no way to tag it as anything else... and some of these files
> > are often parsed from a context that can't set the locale, like the
> > boot loader or the kernel... also, these files have a format that was
> > defined back in the 7bit ascii time frame. They also don't make use of
> > the text in a way that isn't literal...
>
> Exactly.  There's just no way to know in the current setup.  And
> declaring it UTF-8 will break anyone currently using locale-based
> values.  Using the symlink has the value of allowing a simple fix ("sudo
> ln -s $LANG /etc/locale").
>

Except it's not a simple fix. Sure, you can find this value, but nothing
will necessarily use it. Since there's little value and little need, I
think it would be more hassle than it's worth absent a much more
extensive audit. For system-wide things like config files, we assume
C.UTF-8 or the lesser ASCII-7 (or maybe ASCII-8).
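
To make that assumption concrete, here is a rough sketch in Python (not
anything from the tree) of a check that sorts a config file's bytes into
7-bit ASCII, valid UTF-8, or neither. The function name and the sample
strings are made up for illustration. Note that passing the check only
shows the bytes are consistent with UTF-8; it cannot prove which locale
the file was actually edited in.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a byte string.

    Hypothetical helper: it only tests whether the bytes decode cleanly,
    which is the most a tool can do without an external encoding tag.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


if __name__ == "__main__":
    # Sample fstab-like lines; the mount points are invented.
    samples = [
        b"/dev/ada0p2 / ufs rw 1 1\n",
        "/dev/ada0p3 /mnt/caf\u00e9 ufs rw 2 2\n".encode("utf-8"),
        "/dev/ada0p3 /mnt/caf\u00e9 ufs rw 2 2\n".encode("latin-1"),
    ]
    for line in samples:
        print(classify_encoding(line), line)
```

A traditional all-ASCII fstab lands in the first bucket, so treating it
as UTF-8 changes nothing for it.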

> > Having said that, I'm unsure how you'd mount /<kanji-for-neko> from
> > fstab, or if that is well defined. The kernel just presents a string
> > of bytes not containing /...
>
> Currently it's not well defined, just a string of bytes, which has
> worked fine so far, but it's a problem for adding libxo support to df
> and mount, since the strings being used don't have a known encoding.
> And JSON, XML, or HTML are UTF-8, so we need to know how to treat them.
> The patch under review changes mount to use "%hs" which means that
> strings will be locale-based, but that means they will be interpreted
> using the current process's $LANG, which may not be how the file was
> encoded.
>

Right. They are de facto C.UTF-8, at least at the top level these days.
That's why I think we should use %s unless someone does extensive testing
and auditing of these programs to see if they still work (along with test
suites to make sure they keep working). We should not be in the business
of promising that we can set the locale in any meaningful way and have it
work for system-level things. In addition, we'd need to add a test suite
for the boot loader so that the presence of non-C.UTF-8 encoded strings in
/etc/fstab doesn't cause it to misbehave. My big worry is that this will
open up a big can of worms for people that have a system-wide default set
to something other than C.UTF-8, and these changes will cause subtle
behavior changes that we have to play whack-a-mole with in the forums and
PR database. They may not be obviously related to this change at first,
and they may be hard to track down and fix.
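
As a small illustration of how subtle this gets, the Python sketch below
decodes the same bytes under two different assumed encodings. The helper
name and the mount point are hypothetical, and Latin-1 merely stands in
for whatever non-UTF-8 locale a site might use. Nothing fails outright;
the output is just quietly wrong.

```python
def interpret(raw: bytes, assumed_encoding: str) -> str:
    """Decode raw bytes the way a tool would if it trusted assumed_encoding.

    Hypothetical helper: undecodable bytes become U+FFFD rather than
    raising, which mimics a tool that carries on after a bad conversion.
    """
    return raw.decode(assumed_encoding, errors="replace")


if __name__ == "__main__":
    # A mount point written while the editor's locale was UTF-8.
    utf8_bytes = "/mnt/caf\u00e9".encode("utf-8")
    # The same name written while the editor's locale was Latin-1.
    latin1_bytes = "/mnt/caf\u00e9".encode("latin-1")

    print(interpret(utf8_bytes, "utf-8"))     # matches what was typed
    print(interpret(utf8_bytes, "latin-1"))   # mojibake, but no error
    print(interpret(latin1_bytes, "utf-8"))   # replacement character
```

Only the first call round-trips, and no amount of inspecting the bytes
tells the program which of the three cases it is in.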

I will be the first to admit that I feel a bit burned by the locale stuff
in global settings, since I had to rip out a bunch of locale things in our
awk because they caused weird compatibility problems with awk scripts
written on other systems.

Warner


> Thanks,
>   Phil
>
>