libxo and i18n
Phil Shafer
phil at juniper.net
Wed Feb 11 18:50:46 UTC 2015
[background: libxo is a new library in FreeBSD that provides
the ability for a single source code path to emit XML, JSON,
HTML and traditional text. Full docs are at:
http://juniper.github.io/libxo/libxo-manual.html
]
In libxo, I'm having issues dealing with i18n, which stem
mostly from my lack of depth on the subject. Specifically,
when someone makes a call like:
xo_emit("[{:numbers/%-4..4s/%s}]\n", "123456");
they are asking for numbers to be truncated at four columns, rather
than the printf-style four bytes. The output should be:
[1234]
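To make the column-versus-byte distinction concrete, here is a minimal
sketch. It is not libxo's code; prefix_bytes_for_columns is a hypothetical
helper, and it assumes well-formed UTF-8 and exactly one column per code
point, which is precisely the assumption the rest of this message calls
into question.

```c
#include <stddef.h>

/*
 * Return the number of bytes in the longest prefix of the UTF-8 string s
 * that contains at most max_cols code points.
 * ASSUMPTIONS (illustration only): input is well-formed UTF-8 and every
 * code point occupies exactly one column.
 */
static size_t prefix_bytes_for_columns(const char *s, size_t max_cols)
{
    size_t bytes = 0, cols = 0;

    while (s[bytes] != '\0') {
        /* Any byte that is not a continuation byte (10xxxxxx) starts a code point. */
        if (((unsigned char)s[bytes] & 0xc0) != 0x80) {
            if (cols == max_cols)
                break;          /* the next code point would not fit */
            cols++;
        }
        bytes++;
    }
    return bytes;
}
```

For "123456" and four columns this yields four bytes, the same as printf.
For a string of two-byte characters the byte count and the column count
diverge, so a byte-based precision would cut in the wrong place.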
My issue arises when ligatures are used, where multiple Unicode
code points occupy the same column. An example would be the "Sri"
in Sinhalese:
http://en.wikipedia.org/wiki/Sinhala_alphabet#Consonant_conjuncts
When I look at src/mklocale/UTF-8.src, I see:
/*
* U+0D80 - U+0DFF : Sinhala
*/
GRAPH 0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
GRAPH 0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
GRAPH 0x0dd8 - 0x0ddf 0x0df2 - 0x0df4
PUNCT 0x0df4
PRINT 0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
PRINT 0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
PRINT 0x0dd8 - 0x0ddf 0x0df2 - 0x0df4
SWIDTH1 0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
SWIDTH1 0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
SWIDTH1 0x0dd8 - 0x0ddf 0x0df2 - 0x0df4
Consider the UTF-8 sequence for the glyph in the Sinhalese table above,
at the ninth row from the bottom, fifth character in.
UTF-8: [e0b6bb][e0b78a][e2808d][e0b69d]
Unicode: u+0dbb u+0dca u+200d u+0d9d
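The byte-to-code-point mapping above can be checked with a small decoder.
This is only a sketch: utf8_decode is a hypothetical helper name, it
handles sequences of one to three bytes (enough for the four code points
here), and it does not reject overlong encodings or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence of 1 to 3 bytes starting at p.
 * Stores the code point in *out and returns the number of bytes consumed,
 * or 0 if the input is truncated, malformed, or a 4-byte sequence
 * (not handled in this sketch).
 */
static size_t utf8_decode(const unsigned char *p, size_t len, uint32_t *out)
{
    if (len >= 1 && p[0] < 0x80) {
        *out = p[0];
        return 1;
    }
    if (len >= 2 && (p[0] & 0xe0) == 0xc0 && (p[1] & 0xc0) == 0x80) {
        *out = ((uint32_t)(p[0] & 0x1f) << 6) | (uint32_t)(p[1] & 0x3f);
        return 2;
    }
    if (len >= 3 && (p[0] & 0xf0) == 0xe0 &&
        (p[1] & 0xc0) == 0x80 && (p[2] & 0xc0) == 0x80) {
        *out = ((uint32_t)(p[0] & 0x0f) << 12) |
               ((uint32_t)(p[1] & 0x3f) << 6) |
               (uint32_t)(p[2] & 0x3f);
        return 3;
    }
    return 0;
}
```

Feeding it the bytes e0 b6 bb gives U+0DBB, and e2 80 8d gives U+200D,
matching the two lines above.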
wcwidth reports the third character (ZWJ) as -1, and all the others as
width 1:
(gdb) p (int) wcwidth(0xdbb)
$1 = 1
(gdb) p (int) wcwidth(0xdca)
$2 = 1
(gdb) p (int) wcwidth(0x200d)
$3 = -1
(gdb) p (int) wcwidth(0xd9d)
$4 = 1
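To show why these per-character widths are not enough, here is a sketch
of the obvious approach: add up the widths and ignore negative results.
Both function names are hypothetical. reported_width is a stand-in that
only mirrors the four gdb results above; it is not the real wcwidth(),
whose answers depend on the locale tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for wcwidth(): reproduces only the values seen
 * in the gdb session above (ZWJ gives -1, the other three give 1).
 */
static int reported_width(uint32_t cp)
{
    return cp == 0x200d ? -1 : 1;
}

/*
 * Naive column count: sum the per-code-point widths, treating negative
 * (non-printable) results as zero columns. The width lookup is passed in
 * so the same loop could be tried with a different table.
 */
static int naive_columns(const uint32_t *cps, size_t n, int (*width)(uint32_t))
{
    int cols = 0;

    assert(width != NULL);
    for (size_t i = 0; i < n; i++) {
        int w = width(cps[i]);
        if (w > 0)
            cols += w;
    }
    return cols;
}
```

For u+0dbb u+0dca u+200d u+0d9d this counts three columns, even though
the sequence is meant to display as a single conjunct glyph. A
per-code-point sum has no way to know that the code points combine.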
So my question is (at long last): How does one know when multiple
Unicode characters will result in a single column of output?
Thanks,
Phil