Validating docbook articles...
Chuck Swiger
chuck at pkix.net
Sun Feb 22 19:26:02 UTC 2004
Thierry Thomas wrote:
>On Mon 16 Feb 04 at 22:29:46 +0100, Chuck Swiger <chuck at pkix.net>
> wrote:
>
>
>>...tidy-devel doesn't understand the -preserve option. Something like the
>>following, as www/tidy-devel/files/patch-console-tidy.c:
>>
>>
>
>Some days ago, we were speaking of this option (with Alex Dupre). It
>seems useful for documents encoded with charsets unsupported by Tidy.
>
>
Hi, Thierry--
Thanks for your response and interest in the change I suggested. I
would be happy to spend more time on this issue and "do the right thing"
rather than just turn that option into a null operation. However, as
you've noticed:
>There exist two possibilities:
>
>- we encode all documents in supported charsets (e.g. UTF8), and this
>option is not necessary (we can apply your patch to keep a compatibility
>with old scripts);
>
>- we have documents written in such encodings, and tidy-devel should be
>patched to actually preserve entities, or we have to keep the original
>Tidy.
>
Your latter comment suggests that the -preserve functionality in tidy is
no longer available in tidy-devel, which matches my own experience: I
looked through the tidy-devel code for a comparable flag to set and did
not find anything. Maybe we should ask the author, <dsr at w3.org>, or
<html-tidy at w3.org>...?
I just checked, and the difference that -preserve makes in the old
version of tidy (vers 4th August 2000) is fairly common; it tends to be
things like angle brackets in email addresses. For example, the input
source of:
<P CLASS="ADDRESS"><CODE CLASS="EMAIL"><
<A HREF="mailto:chuck at pkix.net">chuck at pkix.net</A>></CODE></P>
...becomes either of (results compared via diff):
-<p class="ADDRESS"><code class="EMAIL"><<a href=
-"mailto:chuck at pkix.net">chuck at pkix.net</a>></code></p>
+<p class="ADDRESS"><code class="EMAIL"><<a href=
+"mailto:chuck at pkix.net">chuck at pkix.net</a>></code></p>
However, which escape is used for the > character is purely a detail of
encoding, and I am willing to use tidy-devel without having the
-preserve capability.
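As a rough illustration of why the choice of escape is only a detail of encoding, here is a small Python sketch. It uses the standard library's html.unescape, which has nothing to do with tidy itself, and the sample spellings are made up for the example rather than taken from tidy's output. It decodes several ways of writing the same character and collects the results:

```python
import html

# Several ways of writing the same ">" character in HTML text.
# These spellings are illustrative, not copied from tidy's output.
SPELLINGS = ["&gt;", "&#62;", "&#x3E;", ">"]


def decoded_forms(spellings):
    """Return the set of strings that the given spellings decode to."""
    return {html.unescape(s) for s in spellings}


if __name__ == "__main__":
    # A single-element set means every spelling names the same character.
    print(decoded_forms(SPELLINGS))
```

If all spellings decode to the same single character, a browser or validator should treat the two tidy outputs as the same document.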
Then again, now that I think about it, using &copy; rather than
&#A9; (I think?) is more portable-- whether 0xA9 actually is the
copyright symbol in the particular character set being used could be a
problem. Isn't there one of UTF8 or ISO-8859-1 in which 0xA9 is not the
copyright symbol? [ I ran into this issue using the W3C HTML
validator as well. ]
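The question about 0xA9 can be checked with a short Python sketch. It assumes only the standard library, and the helper names are hypothetical. A hex character reference needs the "x" (as in &#xA9;), and character references name the Unicode code point U+00A9 regardless of the document's charset, whereas a raw 0xA9 byte depends on how the file is decoded:

```python
import html

COPYRIGHT = "\u00a9"  # U+00A9 COPYRIGHT SIGN


def entity_forms_agree():
    """Check that named, decimal and hex references all decode to U+00A9."""
    return all(html.unescape(s) == COPYRIGHT
               for s in ("&copy;", "&#169;", "&#xA9;"))


def byte_a9_meaning():
    """Decode the single raw byte 0xA9 as ISO-8859-1 and as UTF-8."""
    raw = b"\xa9"
    latin1 = raw.decode("iso-8859-1")
    try:
        utf8 = raw.decode("utf-8")
    except UnicodeDecodeError:
        utf8 = None  # a lone 0xA9 byte is not valid UTF-8
    return latin1, utf8


if __name__ == "__main__":
    print(entity_forms_agree())
    print(byte_a9_meaning())
    print(COPYRIGHT.encode("utf-8"))  # the UTF-8 byte sequence for the symbol
```

So the raw byte is the copyright symbol in ISO-8859-1 but is not a valid character by itself in UTF-8, where the symbol takes two bytes. That is an argument for emitting the entity rather than the raw byte.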
A broader issue is whether tidy should generate a charset declaration
(particularly when used with -xml/-asxml), and what it should pick if
neither the user nor the source document specifies one. I think it
would be useful for tidy to do so by default...
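To make the suggestion concrete, here is a minimal Python sketch of one possible policy. It is not tidy's behaviour; the function name, its parameters and the UTF-8 fallback are all assumptions made for illustration. It keeps an existing XML declaration, otherwise uses the charset the user asked for, and otherwise falls back to a default:

```python
import re

# Matches an XML declaration at the very start of the document.
XML_DECL = re.compile(r'^<\?xml\b[^>]*\?>')


def ensure_xml_declaration(text, user_charset=None, default="utf-8"):
    """Return text with an XML declaration that names a charset.

    Hypothetical policy: leave an existing declaration untouched,
    otherwise prefer the user's choice, otherwise use the default.
    """
    if XML_DECL.match(text):
        return text
    charset = user_charset or default
    return '<?xml version="1.0" encoding="%s"?>\n%s' % (charset, text)


if __name__ == "__main__":
    print(ensure_xml_declaration("<html/>"))
    print(ensure_xml_declaration("<html/>", user_charset="iso-8859-1"))
```

The open question in such a scheme is the default itself. UTF-8 is used here only because XML parsers assume it when no encoding is declared.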
--
-Chuck