RFC: doc/www cleanup

Fri Aug 3 11:13:05 UTC 2012

Hi Doc Fellows,

the XML migration now in progress is also a big cleanup that
will probably simplify documentation authoring. While working on it
I've come across several old constructs and several things that
made me think about further directions. I'd like to discuss these
changes with you before proceeding with them:

1, Removing Emacs PSGML comments: PSGML is an Emacs mode for SGML
editing. Its behavior can be controlled either by SGML comments in
each file or by a separate configuration file (described in
fdp-primer). Our documentation is littered with PSGML comments like this:

<!--
      Local Variables:
      mode: sgml
      sgml-indent-data: t
      sgml-omittag: nil
      sgml-always-quote-attributes: t
      End:
-->

XML requires tags to be closed and attributes to be always quoted, so
these settings lose most of their utility, and the comments just
confuse people who don't know what they mean. Indentation and any other
specific option can be configured in the .emacs file. I propose dropping
these comments.
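
A minimal sketch of how the removal could be automated, assuming the
comments always contain "Local Variables:" and "End:" as in the example
above. The function name is hypothetical and this is not part of any
existing doc tooling:

```python
import re

# Matches one SGML/XML comment whose body is an Emacs "Local Variables:"
# block. The (?!-->) guard keeps the match inside a single comment, so
# ordinary comments elsewhere in the file are left alone.
PSGML_COMMENT = re.compile(
    r"<!--(?:(?!-->).)*?Local Variables:(?:(?!-->).)*?End:\s*-->[ \t]*\n?",
    re.DOTALL,
)


def strip_psgml_comments(text):
    """Return text with Emacs local-variables comments removed."""
    return PSGML_COMMENT.sub("", text)
```

It would be worth reviewing the resulting diff by hand, since a comment
formatted differently from the example would be missed.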

2, Relaxing character entity usage: To be able to read non-ASCII
characters on ASCII-only systems, we have been using character entities
like &aacute;. But in CJK languages, Greek and Russian every character
is non-ASCII, so in practice entities could not be used there and never
were. They are only used in the ISO-8859 encodings (except Greek, which
is also from this family). Displaying these Latin-based characters is no
longer much of a problem nowadays. Furthermore, if you edit text in a
given language, we can assume that you understand the language, so you
know what you should see and how to configure your system if you don't
see the desired result. As a result, these entities no longer offer any
real advantage, but they heavily "pollute" the text and make it much
harder to edit and read. One exception is a character that isn't
present in the language of the document, e.g. in a non-English
developer name in the English documentation. So I propose that every
translation convert entities back to normal characters and keep only
those for characters that aren't present in the given language. The
abundance of character entities has been a difficulty for new
documentation people, especially for those who don't have much IT
background. This change would make the texts more natural.
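
To illustrate the rule "convert what the language's encoding can
represent, keep the rest", here is a rough sketch. It assumes that the
entity names in our files match the HTML 4 names in Python's
html.entities table, which may not hold for every ISO entity set we use;
names not found there are simply left untouched. The function name is
hypothetical:

```python
import re
from html.entities import name2codepoint

# Entities that must stay escaped for the markup to remain well-formed.
XML_SPECIAL = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def relax_entities(text, encoding):
    """Replace named entities with characters representable in encoding."""

    def replace(match):
        name = match.group(1)
        if name in XML_SPECIAL or name not in name2codepoint:
            return match.group(0)  # keep markup escapes and unknown names
        char = chr(name2codepoint[name])
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            return match.group(0)  # not present in this encoding: keep
        return char

    return ENTITY.sub(replace, text)
```

For example, relax_entities("G&aacute;bor", "iso-8859-2") would yield the
plain character, while &alpha; would stay an entity in an ISO-8859-1
document.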

3, Preferring XML/XSLT over scripts: Some parts of the web, like the A-Z
index and sitemap pages, have their own format that is processed with
shell scripts. It would be more consistent to use an XML data file with
an XSLT stylesheet for this purpose. It would give us more flexibility
for further changes and would reduce the number of different methods we
use to generate things.

4, Stricter XHTML: I don't propose going directly to XHTML 1.0 Strict,
but <hr/>'s, <table>'s, etc. are marked up very inconsistently. I would
like to make them more consistent and prefer CSS styling where
applicable. There are also empty paragraphs used as line breaks, which
should also be eliminated. This would give us a more consistent look and
more structure-oriented webpage files.
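
As a starting point for finding the empty paragraphs, something like the
following could produce a list to review. It is only a sketch: the
function name is hypothetical, and it assumes "empty" means a <p>
element containing nothing but whitespace or &nbsp;:

```python
import re

# An "empty" paragraph: only whitespace or &nbsp; between the tags,
# or a self-closed <p/>.
EMPTY_PARAGRAPH = re.compile(
    r"<p(?:\s[^>]*)?>(?:\s|&nbsp;)*</p>|<p\s*/>",
    re.IGNORECASE,
)


def find_empty_paragraphs(text):
    """Return (line_number, matched_text) for each empty paragraph."""
    return [
        (text.count("\n", 0, match.start()) + 1, match.group(0))
        for match in EMPTY_PARAGRAPH.finditer(text)
    ]
```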

And after the migration, I plan:

5, Identifying obsolete webpages: There are moved pages, both in the
English pages and in the translations, that only serve for redirection.
These pages were moved a very long time ago, so any interested party has
had time to update her bookmarks. I would like to finally remove these.
On the other hand, there are leftovers in translations, i.e. pages that
were removed from the English web but not from the translations. I would
like to generate a list of them and send patches to translation projects
to clean these up.
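
The list of leftovers could be generated roughly as follows. This is a
sketch under assumptions: it supposes that each translation mirrors the
English directory layout and that pages share the same file suffix, and
both function names are hypothetical:

```python
from pathlib import Path


def relative_pages(root, suffix=".xml"):
    """Collect page paths under root, relative to root."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix()
        for path in root.rglob("*" + suffix)
        if path.is_file()
    }


def leftover_pages(english_root, translation_root, suffix=".xml"):
    """Pages present in the translation but missing from the English tree."""
    translated = relative_pages(translation_root, suffix)
    english = relative_pages(english_root, suffix)
    return sorted(translated - english)
```

Pages that exist only in a translation on purpose would show up too, so
the output is a list of candidates rather than a list of files to delete.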

Thanks in advance for your comments,
Gabor