what's the easiest way to de-html-ize files?
Ian Smith
smithi at nimnet.asn.au
Tue May 15 05:34:48 UTC 2007
On Sat, 12 May 2007 14:34:52 -0700 Gary Kline <kline at tao.thought.org> wrote:
> On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > > This is for those of us who appreciate ASCII or straight
> > > ISO_8859-15 rather than marked-up files. I have slapped together
> > > a crude C program that does scotch (or *cleanse*) text of
> > > <B></B> and so on. Still... is there some standalone converter
> > > that gets rid of markup more elegantly? Something where I
> > > can say
> > >
> > > % cmd file_1.html ... file_N.html and output file_1.text ...
> > > file_N.text?
> >
> > Perhaps:
> >
> > lynx -dump file1.html ... > file.text
> >
> > ...?
>
> Hm, maybe I need Bill Campbell's -force_html switch.
>
> Yes, seems that way. Using just -dump got most of them, but
> using -force_html caught them all. Need to script something to
> reformat, but the worst of it's done!
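
For the one-output-file-per-input case asked about at the top of the thread, a small wrapper around the lynx invocation above might look like the following. This is only a sketch: it assumes lynx is installed, and the function name dehtml is made up for illustration.

```shell
#!/bin/sh
# dehtml (hypothetical helper name): write one .text file per HTML file.
# Assumes lynx is installed and on the PATH.
dehtml() {
    for src in "$@"; do
        # file_1.html -> file_1.text (names without .html just gain .text)
        out="${src%.html}.text"
        # -dump prints the rendered page as plain text on stdout;
        # -force_html makes lynx treat the file as HTML whatever its name.
        lynx -dump -force_html "$src" > "$out" || echo "failed: $src" >&2
    done
}

# Usage: dehtml file_1.html ... file_N.html
```

By default lynx appends a numbered list of the page's links to the dump; adding -nolist to the command should suppress that if only the body text is wanted.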
Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
This does a pretty decent job of producing text from HTML files, and is
quicker than firing up lynx (or links) if you're already viewing a page.
Cheers, Ian