Tidy and HTML tab spacing
Warren Block
wblock at wonkity.com
Mon Jan 23 19:39:27 UTC 2012
On Mon, 23 Jan 2012, Gabor Kovesdan wrote:
> On 2012.01.22. 1:30, Warren Block wrote:
>> On Sun, 22 Jan 2012, Gabor Kovesdan wrote:
>>
>>> On 2012.01.18. 23:49, Warren Block wrote:
>>>> 5. Don't tidy HTML files at all (suggested as an option by Benedict
>>>> Reuschling). The unprocessed HTML is ugly, but few people are going
>>>> to look at it directly. Files that haven't been through tidy are a
>>>> little larger, about 4% in the case of the Porter's Handbook.
>>> I also think tidy should be removed. As hrs wrote, new standards should be
>>> evaluated and probably they are much better. (I think they are.) If there
>>> are some nits, then we should process it with a custom script or
>>> something, instead of this crapware.
>>
>> Tidy does a lot; it would be a lot of work to recreate.
> Tidy is also the reason that our webpages are not valid HTML.
A new version of Tidy is supposed to be out soonish. Whether it will
solve the problems, I don't know.
What about lxml? Available in ports (devel/py-lxml), reputed to be good
at parsing problem HTML and creating good XHTML. A quick test showed
that it seems to do okay with <pre> elements.
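
A minimal sketch of that kind of quick <pre> check, not the test actually run. The sample markup is made up for illustration, and it assumes lxml is installed. It parses deliberately sloppy HTML (uppercase tags, an unclosed <P>) and checks that the text inside <pre> comes through unchanged after serializing as XML:

```python
from lxml import etree

# Hypothetical sample input: uppercase tags, an unclosed <P>, and a <PRE>
# block whose internal spacing and line breaks need to survive.
sample = (b"<HTML><BODY><P>intro"
          b"<PRE>col1    col2\n  indented\n</PRE></BODY></HTML>")

root = etree.HTML(sample)       # lenient HTML parser, returns the <html> element
pre = root.find('.//pre')       # tag names are lowercased by the parser

# Serialize the whole tree as XML, the same way the attached script does
# for each child of <html>.
xhtml = etree.tostring(root, pretty_print=True, method="xml")

print(repr(pre.text))
print(xhtml.decode('ascii'))
```

pretty_print only adds indentation between elements that have no text of their own, so the contents of <pre> should be left alone.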
A quick script to generate a test is attached. The W3C validator says
this version of the Porter's Handbook has eight errors, versus the six
errors and five warnings of the Tidy version. (The ugly special case in
the script that rewrites compact="COMPACT" drops the lxml version to
five errors.)
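
The compact="COMPACT" substitution in the attached script works on the serialized text. As an untested alternative, the same fix could be made on the parsed tree before serializing. This is a sketch only: the helper name lowercase_compact and the sample markup are hypothetical, and whether it changes the validator results has not been checked.

```python
from lxml import etree

def lowercase_compact(root):
    """Hypothetical helper: set compact="compact" wherever compact="COMPACT"."""
    # XPath selects only elements that carry the uppercase attribute value.
    for el in root.xpath('//*[@compact="COMPACT"]'):
        el.set('compact', 'compact')
    return root

# Made-up sample standing in for the generated book.html.
root = etree.HTML(b'<html><body><dl compact="COMPACT">'
                  b'<dt>a</dt><dd>b</dd></dl></body></html>')
lowercase_compact(root)
out = etree.tostring(root, method="xml")
print(out.decode('ascii'))
```

Doing this on the tree avoids touching any text that merely happens to contain the string compact="COMPACT", at the cost of a few more lines.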
-------------- next part --------------
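#!/usr/bin/env python
# Re-serialize book.html as XHTML with lxml instead of Tidy.
from lxml import etree

# Read raw bytes so lxml can detect the encoding; drop carriage returns.
with open('book.html', 'rb') as f:
    inhtml = f.read()
tree = etree.HTML(inhtml.replace(b'\r', b''))

# Serialize each child of <html> (head, body) as pretty-printed XML.
# tostring() returns bytes here, so join and replace with bytes as well.
outxhtml = b'\n'.join(
    etree.tostring(stree, pretty_print=True, method="xml")
    for stree in tree)

# Ugly special case: XHTML expects the lowercase attribute value.
outxhtml = outxhtml.replace(b'compact="COMPACT"', b'compact="compact"')

# Write the result wrapped in an XHTML 1.0 Transitional doctype and root.
with open('lxml.html', 'wb') as f:
    f.write(b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n')
    f.write(b'<html xmlns="http://www.w3.org/1999/xhtml">\n')
    f.write(outxhtml)
    f.write(b'</html>\n')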