Re: ISO-8859-1 file name in UTF-8 file system
- Reply: George Mitchell : "Re: ISO-8859-1 file name in UTF-8 file system"
- In reply to: George Mitchell : "ISO-8859-1 file name in UTF-8 file system"
Date: Thu, 29 Feb 2024 04:22:08 UTC
On Wed, Feb 28, 2024 at 5:31 PM George Mitchell <george+freebsd@m5p.com> wrote:
> (I tried sending this to freebsd-python, but I can't post there
> because I haven't subscribed, and I'm hoping someone here will have
> a suggestion. Thanks for your indulgence.)
>
> In Python 3.9 on FreeBSD 13.2-RELEASE, sys.getfilesystemencoding()
> reports 'utf-8'. However, a couple of ancient files on one of my
> disks have names that were evidently ISO-8859-1 encoded at the time
> they were originally created. When I os.walk() through a directory
> with one of these files, the UTF-8 string name of the file has, for
> example, a '\udcc3' in it. Literally, the file name on disk had
> hex c3 at that position (ISO-8859-1 for Ã), and I guess \udcc3 is a
> surrogate for the 0xc3, which is incomprehensible in conformant
> UTF-8 (though I don't understand "surrogates" in UTF-8 and you can't
> take that last statement as gospel).
>
> Be that as it may, what can I do at this point to transmogrify that
> Python str with the \udcc3 back into the literal bytes found in the
> file name on the disk, so that I can then encode them into proper
> UTF-8 from ISO-8859-1?
> -- George

I ran into this problem ages ago on another system. Here is what I did
(note that some modern Python checkers hate the lambda form, I wrote
this a long time ago):

    if sys.version_info[0] >= 3:
        # Python3 encodes "impossible" strings using UTF-8 and
        # surrogate escapes. For instance, a file named <\300><\300>eek
        # (where \300 is octal 300, 0xc0 hex) turns into '\udcc0\udcc0eek'.
        # This is how we can losslessly re-encode this as a byte string:
        path_to_bytes = lambda path: path.encode('utf8', 'surrogateescape')

        # If we wish to print one of these byte strings, we have a
        # problem, because they're not valid UTF-8. This method
        # treats the encoded bytes as pass-through, which is
        # probably the best we can do.
        bpath_to_str = lambda path: path.decode('unicode_escape')
    else:
        # Python2 just uses byte strings, so OS paths are already
        # byte strings and we return them unmodified.
        path_to_bytes = lambda path: path
        bpath_to_str = lambda path: path

Chris
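
For the specific goal in the original question (getting from the surrogate-escaped str back to the on-disk bytes, then to a proper str via ISO-8859-1), the round trip might look roughly like the sketch below. The helper names `recover_bytes`, `has_surrogate_escapes`, and `fix_name` are made up for illustration, and the sketch assumes the odd names really are ISO-8859-1 and that only names containing surrogate escapes should be converted.

```python
def recover_bytes(name: str) -> bytes:
    """Return the on-disk bytes for a name that may contain surrogate escapes."""
    # 'surrogateescape' turns each U+DC80..U+DCFF code point back into
    # the single undecodable byte (0x80..0xFF) it stands for.
    return name.encode("utf-8", "surrogateescape")


def has_surrogate_escapes(name: str) -> bool:
    """True if the name cannot be encoded as strict UTF-8."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def fix_name(name: str) -> str:
    """Reinterpret a surrogate-escaped name as ISO-8859-1 (assumed encoding).

    Names that are already valid are returned unchanged, so that genuine
    UTF-8 names are not turned into mojibake.
    """
    if not has_surrogate_escapes(name):
        return name
    return recover_bytes(name).decode("iso-8859-1")


# Simulate what os.walk() hands back for a file whose raw name is b'\xc3eek'.
raw = b"\xc3eek"
mangled = raw.decode("utf-8", "surrogateescape")  # contains '\udcc3'
fixed = fix_name(mangled)                         # 'Ãeek' as a normal str
proper_utf8 = fixed.encode("utf-8")               # bytes for a UTF-8 file name
```

On a system where sys.getfilesystemencoding() reports 'utf-8', os.fsencode(name) should yield the same bytes as `recover_bytes`. Decoding with 'iso-8859-1' instead of 'unicode_escape' also leaves any backslashes in the name untouched. Actually renaming the file (for example with os.rename on the bytes paths) is left out here.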