ISO-8859-1 file name in UTF-8 file system

From: George Mitchell <george+freebsd_at_m5p.com>
Date: Thu, 29 Feb 2024 01:30:19 UTC

(I tried sending this to freebsd-python, but I can't post there
because I haven't subscribed, and I'm hoping someone here will have
a suggestion.  Thanks for your indulgence.)

In Python 3.9 on FreeBSD 13.2-RELEASE, sys.getfilesystemencoding()
reports 'utf-8'.  However, a couple of ancient files on one of my
disks have names that were evidently ISO-8859-1 encoded at the time
they were originally created.  When I os.walk() through a directory
with one of these files, the str name Python returns for the file has,
for example, a '\udcc3' in it.  The file name on disk has the byte
0xc3 at that position (ISO-8859-1 for Ã), and I guess \udcc3 is a
surrogate standing in for the 0xc3, which is not valid by itself in
conformant UTF-8 (though I don't understand "surrogates" in UTF-8, so
don't take that last statement as gospel).
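
For illustration, the '\udcc3' can be reproduced without touching the
disk.  This is only a sketch: it assumes Python decodes file names with
the "surrogateescape" error handler (PEP 383), and the sample bytes are
made up rather than taken from the actual files.

```python
# Hypothetical raw file name: "caf", then a lone 0xc3 byte, then ".txt".
raw = b"caf\xc3.txt"

# 0xc3 is a UTF-8 lead byte and "." is not a continuation byte,
# so a strict UTF-8 decode of these bytes fails.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, each undecodable byte 0xNN is mapped to the
# lone surrogate code point U+DCNN, so 0xc3 shows up as '\udcc3'.
name = raw.decode("utf-8", errors="surrogateescape")
```

So the surrogate is Python's placeholder for a byte it could not decode,
not something stored in the file name itself.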

Be that as it may, what can I do at this point to transmogrify that
Python str with the \udcc3 back into the literal bytes found in the
file name on the disk, so that I can then encode them into proper
UTF-8 from ISO-8859-1?                                    -- George
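
One possible approach, sketched under the assumption that the names
were decoded as UTF-8 with the "surrogateescape" error handler (which
is what os.fsencode() and os.fsdecode() use on POSIX systems): encoding
the str with the same handler gives back the original bytes, which can
then be decoded as ISO-8859-1.  The helper names and the sample file
name below are hypothetical.

```python
def recover_raw_name(name: str) -> bytes:
    """Return the bytes the file name has on disk.

    Encoding with surrogateescape turns each U+DCNN placeholder back
    into the single byte 0xNN.  On a UTF-8 POSIX system os.fsencode(name)
    should give the same result.
    """
    return name.encode("utf-8", errors="surrogateescape")


def has_escaped_bytes(name: str) -> bool:
    """True if the name contains surrogateescape placeholders."""
    return any("\udc80" <= ch <= "\udcff" for ch in name)


def reinterpret_as_latin1(name: str) -> str:
    """Re-decode the on-disk bytes as ISO-8859-1.

    Names without placeholders are returned unchanged, since they were
    already valid UTF-8 and re-decoding them would garble them.
    """
    if not has_escaped_bytes(name):
        return name
    return recover_raw_name(name).decode("iso-8859-1")


# Made-up example of a name as os.walk() might return it.
broken = "caf\udcc3.txt"
raw = recover_raw_name(broken)         # the literal bytes on disk
fixed = reinterpret_as_latin1(broken)  # an ordinary str, no surrogates
utf8_bytes = fixed.encode("utf-8")     # proper UTF-8 for the new name
```

Note that this treats the whole name as ISO-8859-1, so a name mixing
valid UTF-8 sequences with stray ISO-8859-1 bytes would need more care.
To rename a file, passing the raw bytes as the old path to os.rename()
(bytes for both arguments) avoids a second round trip through str.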