"Unprintable" 8-bit characters

Wed Nov 9 05:03:45 UTC 2011

"Conrad J. Sabatier" <conrads at cox.net> wrote:
>
> <grin>
>
> Yes, and this is one area where the labels are more than a little
> misleading as well.  My natural inclination is think of UTF-8 as being a
> single-byte representation for each character in the set, whereas
> UTF-16, as the name implies, would be the "wide", 2-byte version.

"Not exactly."

> Nonetheless, as I posted earlier in this thread, according to the info
> in gucharmap, the representations of the umlauted "u" are just the
> opposite of this:

"not exactly." Again.

> UTF-8: 0xC3 0xBC
> UTF-16: 0x00FC
>  
> Go figure, huh?  :-)

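In UTF-16, everything _is_ built from 16-bit entities.  Notice that 0x00FC
has -four- nybbles after the '0x.'  Every character in the range U+0000
through U+FFFF (which covers everything under discussion here) is exactly
one 16-bit unit, so within that range every character boundary is on a
multiple of 16 bits.  The rarer characters beyond that range take a pair of
16-bit units (a 'surrogate pair').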
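
For anyone who wants to check the two representations quoted above, here
is a minimal Python sketch.  Python is just my choice for illustration;
"utf-16-be" is chosen (an assumption on my part) so that no byte-order mark
is prepended and the bytes come out in the same order gucharmap shows them.

```python
# U+00FC is LATIN SMALL LETTER U WITH DIAERESIS (the umlauted "u").
u_umlaut = "\u00fc"

# Encode the same character both ways.
utf8_bytes = u_umlaut.encode("utf-8")
utf16_bytes = u_umlaut.encode("utf-16-be")  # big-endian, no byte-order mark

print(utf8_bytes.hex(" "))   # c3 bc
print(utf16_bytes.hex(" "))  # 00 fc
```

Both encodings need two bytes for this particular character; they just
arrange the bits differently.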

In UTF-8, the 'base' charset -- 7-bit ASCII, i.e. the 'C0' controls plus
the ordinary printable characters -- is represented by a single byte.
'Extended' characters (including the 'C1' controls and everything in
Latin-1 above 0x7F) are represented by two to four bytes; the accented
'Western' characters take two.
Thus, 'characters' have a *variable*length* representation -- one to four
bytes.  A character, however many bytes it takes, can begin on -any- byte
boundary within a data stream, depending on 'what came before it'.  UTF-8
multi-byte representations are designed such that one can jump to any
_byte_ offset within the file, and determine -- by looking *only* at the
value of that byte -- whether it is (a) a single-byte character, (b) the
first byte of a multi-byte sequence, or (c) a continuation byte of a
multi-byte sequence.
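
A rough sketch of that "look at one byte and know where you are" property,
in Python.  The function name classify_utf8_byte is made up for this
example, and the sketch assumes well-formed UTF-8 (it does not reject the
handful of byte values that can never appear in valid input).

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of a UTF-8 stream from its high bits alone."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    return "lead"              # 11xxxxxx: first byte of a multi-byte sequence


# "a", umlauted "u", "b" -> bytes 61 c3 bc 62
data = "a\u00fcb".encode("utf-8")
kinds = [classify_utf8_byte(b) for b in data]
print(kinds)  # ['single', 'lead', 'continuation', 'single']
```

Landing on a "continuation" byte just means you back up (or skip forward)
until you hit something that is not one.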

With UTF-16 you can position directly to any -character-, by jumping to
a _byte_ offset that is twice the index of the character you want (as long
as the text contains only characters that fit in a single 16-bit unit).
Given a byte offset, you then always know the 'equivalent' _character_
offset.
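
As a sketch of that direct addressing, in Python: the helper name
utf16_char_at is invented for this example, and it assumes big-endian
UTF-16 with no byte-order mark and no characters needing more than one
16-bit unit.

```python
def utf16_char_at(data: bytes, index: int) -> str:
    """Return character number `index` (0-based) from UTF-16-BE data.

    Assumes every character occupies exactly one 16-bit unit.
    """
    offset = 2 * index  # byte offset is simply twice the character index
    return data[offset:offset + 2].decode("utf-16-be")


data = "gr\u00fcn".encode("utf-16-be")  # 4 characters, 8 bytes
print(len(data))                                # 8
print(utf16_char_at(data, 2) == "\u00fc")       # True
```

No scanning is involved; the slice goes straight to the wanted bytes.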

With UTF-8, you have to read the character stream, counting 'characters' 
as you go, to get to the desired point.  You can seek to an arbitrary
_byte_ offset, but you do not know how many 'characters' into the file 
that offset is.
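
By way of contrast, a sketch of what finding the character offset costs in
UTF-8.  Again Python, again a made-up helper name
(char_index_of_byte_offset), and again assuming well-formed input.  The
only way to answer is to walk every byte before the offset.

```python
def char_index_of_byte_offset(data: bytes, byte_offset: int) -> int:
    """Count the characters that begin before `byte_offset` in UTF-8 data.

    Every byte that is not a continuation byte (10xxxxxx) starts a character.
    """
    return sum(1 for b in data[:byte_offset] if (b & 0xC0) != 0x80)


data = "gr\u00fcn".encode("utf-8")  # 4 characters, 5 bytes: 67 72 c3 bc 6e
print(len(data))                           # 5
print(char_index_of_byte_offset(data, 4))  # 3  (byte 4 is "n", character 3)
```

The work grows with the offset, which is exactly the price paid for the
more compact representation.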

UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and 
simplicity of addressing/representation (UTF-16).

> This seems rather unfortunate to me.  You would think that, by now,
> some "standard" character set might have emerged that would allow one
> to use, at the very least, the "Western" characters (as opposed to
> the "Eastern" or "Oriental" or "Asian", if you will) with a reasonable
> expectation that others will see what was intended.

Heh. 

How many 'character' codes are you willing to devote to national 'currency 
symbols', just for starters?  Probable minimum of two per currency -- one
for the minimum coinage unit (cent, pence, pfennig, etc.) and one for
the denomination unit (dollar, pound, mark, kroner, etc.)

Now, one (obviously) has to have the basic 'Roman' alphabet. 

Then there are all the diacritical markings (accent, accent grave, dot,
umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels.  And cedilla,
tilde, etc., for select consonants.  Plus language-specific symbols like
ess-zett, 'thorn', etc.

How about phonetic symbols, like 'schwa' ?

And Greek for all sorts of scientific use?

What about Cyrillic characters, for many Eastern European languages?

Now, consider punctuation marks:
   the 'typewriter' basics, 
   How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen' are needed?
   How many of 'accent, accent grave, apostrophe, opening/closing single-quote'
       are needed?
   opening/closing double-quotes,  and/or a 'position neutral' double-quote?

"Other symbols", like --
   digits,
   common fractions,
   'Trademark','Registered trademark','copyright' 
   'paragraph','section', 
   superscripts  -- exponents, footnotes, etc.
   subscripts -- chemical formulae, etc.
   "Simple line-drawing graphics"

Diphthongs??  Ligatures??

Start counting things up. 

An 8-bit 'address space' gets used up _really_ quick.

<wry grin>