Tidy and HTML tab spacing

Mon Jan 23 19:39:27 UTC 2012

On Mon, 23 Jan 2012, Gabor Kovesdan wrote:

> On 2012.01.22. 1:30, Warren Block wrote:
>> On Sun, 22 Jan 2012, Gabor Kovesdan wrote:
>> 
>>> On 2012.01.18. 23:49, Warren Block wrote:
>>>> 5. Don't tidy HTML files at all (suggested as an option by Benedict
>>>>    Reuschling).  The unprocessed HTML is ugly, but few people are going
>>>>    to look at it directly.  Files that haven't been through tidy are a
>>>>    little larger, about 4% in the case of the Porter's Handbook. 
>>> I also think tidy should be removed. As hrs wrote, new standards should be 
>>> evaluated and probably they are much better. (I think they are.) If there 
>>> are some nits, then we should process it with a custom script or 
>>> something, instead of this crapware.
>> 
>> Tidy does a lot; it would be a lot of work to recreate. 
> Tidy is also the reason that our webpages are not valid HTML.

A new version of Tidy is supposed to be out soonish.  Whether it will 
solve the problems, I don't know.

What about lxml?  It is available in ports (devel/py-lxml) and is reputed 
to be good at parsing problem HTML and creating good XHTML.  A quick test 
showed that it seems to do okay with <pre> elements.
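
For anyone who wants to repeat that kind of check, here is a minimal sketch of a <pre> test.  The sample markup and variable names are made up for illustration, not taken from the Handbook, and it assumes Python 3 with lxml installed.  It only shows that a tab inside <pre> comes through parsing and XML serialization unchanged.

```python
from lxml import etree

# Hypothetical sample: a <pre> block whose second line starts with a tab.
sample = '<html><body><pre>if (x)\n\treturn;\n</pre></body></html>'

tree = etree.HTML(sample)
pre = tree.find('.//pre')

# Serialize the same way the attached script does, but as a str.
out = etree.tostring(pre, pretty_print=True, method="xml", encoding="unicode")

# repr() makes the tab and newlines visible.
print(repr(pre.text))
print(repr(out))
```

If the tab shows up as \t in both lines of output, the whitespace was not reflowed.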

A quick script to generate a test file is attached.  The W3C validator 
says the lxml version of the Porter's Handbook has eight errors, versus 
the six errors and five warnings of the Tidy version.  (The ugly special 
case in line 12, which lowercases compact="COMPACT", drops the lxml 
version to five errors.)
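
The special case can be looked at in isolation.  This is only a sketch: the <dl> sample is invented, and it assumes the input carries an uppercase attribute value.  The HTML parser lowercases attribute names but leaves values alone, and the XHTML 1.0 DTD only accepts the lowercase value for compact, so the string replacement patches that up afterward.

```python
from lxml import etree

# Hypothetical input with an uppercase attribute name and value.
sample = ('<html><body><dl COMPACT="COMPACT">'
          '<dt>term</dt><dd>definition</dd></dl></body></html>')

dl = etree.HTML(sample).find('.//dl')
raw = etree.tostring(dl, method="xml", encoding="unicode")

# Same replacement as line 12 of the attached script.
fixed = raw.replace('compact="COMPACT"', 'compact="compact"')

print(raw)
print(fixed)
```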
-------------- next part --------------
#!/usr/bin/env python

from lxml import etree

# Read bytes so lxml can pick up the encoding declared in the document.
with open('book.html', 'rb') as infile:
    inhtml = infile.read()
tree = etree.HTML(inhtml.replace(b'\r', b''))
outxhtml = '\n'.join(etree.tostring(stree, pretty_print=True, method="xml",
                                    encoding='unicode') for stree in tree)

outxhtml = outxhtml.replace('compact="COMPACT"', 'compact="compact"')

# Wrap the serialized children of <html> in an XHTML doctype and root element.
with open('lxml.html', 'w', encoding='utf-8') as f:
    f.write('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n')
    f.write('<html xmlns="http://www.w3.org/1999/xhtml">\n')
    f.write(outxhtml)
    f.write('</html>\n')