Convert from HTML to XML With HTML Tidy
An anonymous reader writes "HTML Tidy is a powerful tool that helps convert old HTML pages to newer standards such as XML. This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."
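For readers who want to script the conversion, here is a minimal Python sketch that drives the tidy command-line tool. It assumes a tidy executable is on the PATH; the helper names (tidy_command, convert) and the file names are made up for illustration and do not come from the linked tip. The -asxhtml, -q and -o options are standard Tidy switches.

```python
import shutil
import subprocess


def tidy_command(src, dest, tidy="tidy"):
    """Build the argument list for converting src (HTML) to dest (XHTML).

    -q        suppress non-essential output
    -asxhtml  ask Tidy to emit XHTML
    -o FILE   write the result to FILE instead of stdout
    """
    return [tidy, "-q", "-asxhtml", "-o", dest, src]


def convert(src, dest):
    """Run Tidy on src and return its exit status (0 = clean, 1 = warnings)."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(tidy_command(src, dest),
                            capture_output=True, text=True)
    # Tidy exits with 2 when it hit errors it could not repair.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.returncode
```

Calling convert("old.html", "new.xhtml") (hypothetical file names) would leave the XHTML version in new.xhtml. An exit status of 1 only means Tidy printed warnings while repairing the markup.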
I'll check it out (Score:3)
Re:I'll check it out (Score:2, Informative)
Re:I'll check it out (Score:5, Interesting)
Re:I'll check it out (Score:2)
Re:I'll check it out (Score:2)
First off, every reason to use HTML 4.0 is a reason to use XHTML, unless that reason happens to be "it's not XHTML!".
Secondly, using XHTML gives you all the niceties of XML. This is great when you decide to update your site so it works in, say, cell-phone browsers rather than just a PC browser. This alone is a great reason to use XHTML. As more and more data sources become XML-aware, being able to easily connect them becomes important. XHTML allows you to do this
Re:I'll check it out (Score:2)
The big virtue of XHTML is the big virtue of all XML document types: it's open. You can do anything with an XML document. I suppose that's also true of say TeX or RTF. Except these formats are very mes
What about converting RSS to HTML ? (Score:2)
Re:What about converting RSS to HTML ? (Score:2)
libxml2? (Score:5, Informative)
A few days ago I had to convert HTML pages into XHTML, stripping out a few extra elements and attributes. I used xsltproc, from libxslt [xmlsoft.org], which uses the parser from libxml2 [xmlsoft.org], and this has the option of parsing strict HTML into an XML DOM.
HTML Tidy can be useful when your HTML is not so strict, but for most quick conversions I've found libxml2 & co. to be quite light and easy.
This isn't new... (Score:3, Informative)
HTML Tidy has been out for years.
Check out the Tidy Homepage [w3.org] or the project on SourceForge [sourceforge.net].
how many years has it produced XHTML ? (Score:1)
Re:how many years has it produced XHTML ? (Score:2, Informative)
Again, it has converted to XHTML for a few years. I only posted the original message because I was quite surprised to see it on Slashdot. It's not uncommon to see a story that is a month or two old on the homepage, but several years old is crazy.
hehe stupid /. (Score:1)
I think people must submit to
Re:how many years has it produced XHTML ? (Score:3, Informative)
The date for the referenced article is 18 Sep 2003, less than two weeks ago.
Larry
Re:how many years has it produced XHTML ? (Score:1)
The date for the referenced article is 18 Sep 2003, less than two weeks ago.
Yeah, but the fact remains that HTML Tidy has been around for years. Essentially this article is a tutorial on how to use tidy. It's almost like submitting a story about a man page.
Re:how many years has it produced XHTML ? (Score:3, Informative)
Re:This isn't new... (Score:2, Informative)
news for nerds? (Score:2, Insightful)
but yeah this is a great tip.. especially if you are writing web-scrapers to extract data from web pages and/or convert them to RSS. Just use Tidy to tidy it, and then your favorite XML parser can slurp it right up and you can use XPath to pull out what you need.
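A rough sketch of that scraping step in Python, using only the standard library. The TIDIED string is a made-up stand-in for whatever Tidy would emit, and extract_links is a hypothetical helper. ElementTree supports only a subset of XPath, which is enough for this kind of lookup.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so XPath queries must name it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample standing in for a page already cleaned up by Tidy.
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Headlines</title></head>
<body>
<ul>
<li><a href="/story/1">First story</a></li>
<li><a href="/story/2">Second story</a></li>
</ul>
</body>
</html>"""


def extract_links(xhtml_text):
    """Return (href, link text) pairs for every <a> directly inside an <li>."""
    root = ET.fromstring(xhtml_text)
    return [(a.get("href"), "".join(a.itertext()).strip())
            for a in root.findall(".//x:li/x:a", XHTML_NS)]


links = extract_links(TIDIED)
```

From here the pairs could be written out as RSS items. The point is that once the page is well-formed, an ordinary XML parser is all you need.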
Look out though, there are some cases that Tidy chokes on. One that I keep running into is shit like this:
<table>
<form>
<tr><td>...</td></tr>
</table>
</form>
Basically mixing
*sigh* Who modded the parent down? Why? (Score:1)
Re:news for nerds? (Score:4, Insightful)
Re:news for nerds? (Score:3, Informative)
Well, from the W3C page on HTML 4.01 [w3.org]
Re:news for nerds? (Score:1)
Correct:
<table>
<form>
</form>
</table>
Incorrect:
<table>
<form>
</table>
</form>
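If you want to catch this kind of overlap before feeding a page to Tidy, a small stack-based check is enough. This is only a sketch built on Python's standard html.parser. NestingChecker and check_nesting are invented names, and the check is stricter than HTML itself, since it treats every omitted close tag as an error.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Record every closing tag that does not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements, innermost last
        self.errors = []   # (closing tag seen, tag that was expected or None)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        expected = self.stack[-1] if self.stack else None
        self.errors.append((tag, expected))
        # Recover by closing everything up to and including the named tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def check_nesting(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors
```

Run on the two snippets above, the "Correct" one yields no errors, while the "Incorrect" one reports that table was closed while form was still open.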
Re:news for nerds? (Score:2)
There's actually a reason why people write code like this. In many browser implementations the <form> tag emits the equivalent of a line break after the tag. Heavily styled pages with forms therefore get unexpected spaces, which pisses off a lot of designers. Putting the <form> as in the above prevents this, even if it's incorrect. Browsers don't seem to have a problem.
Nowadays the correct way to do this is with CSS. <form style="margin:0px"> should do the trick, or just toss a form {margin:0px} in the document style sheet.
Why use XHTML? (Score:4, Informative)
Re:Why use XHTML? (Score:2)
Re:Why not use HTML4 then? (Score:4, Informative)
Re:Why use XHTML? (Score:1)
I use XHTML on my site for two reasons.
The first is that I'm a nerd and I want to use a cutting-edge standard. I imagine that that is a big motivation for a lot of XHTML users.
The second is that I'm a big LaTeX fan and its system of separating appearance from content really appeals to me. XHTML does the same for the web. One can concentrate on just putting information in there, and then can keep visual appearance in a separate place and easily replace it site-wide if necessary. A consequence of this is
Re:Why use XHTML? (Score:1)
(At the end it even links to http://www.mozillaquestquest.com/ [mozillaquestquest.com], which, in their own words:
After much soul searching we have decided to shut down MozillaQuestQuest. In our opinion we cannot compete with MozillaQuest's content for humour value.
Keywords: Humour value.
Re:Why use XHTML? (Score:2)
But let's skip the Latin and look at Hickson's actual arguments. They're a little convoluted, so maybe I'm not reading him correctly. As far as I can tell, he's mostly saying that correct XHTML is hard to prepare. Well, yeah, that's the whole point of using a tight-assed XML document type like XHTML instead of a tolerant, laid-back SGML document type like HTML: you're embra
HTML-Kit (Score:2)
He's right, HTML-Kit is the choice... (Score:2)
MOD PARENT UP!! He's right, HTML-Kit is the choice of even those who use Dreamweaver MX, because Dreamweaver does not respect the formatting of your HTML. HTML coders use HTML Tidy and HTML-Kit to clean up Dreamweaver HTML output, and you-know-who's HTML output, of course, which is so disrespectful it would stomp on your toes if it could.
Re:He's right, HTML-Kit is the choice... (Score:1)
Just pressing CTRL-ALT-F.
Re:HTML-Kit (Score:2)
HTML Builder XP is created by one of the two developers of the now-defunct 1st Page 2000 [evrsoft.com] by Evrsoft. Evrsoft is now just the one remaining developer who has essential
Re:HTML-Kit (Score:2)
BBTidy BBEdit plugin (Mac OS) (Score:4, Informative)
If you are running MacOS with BBEdit, you can use the BBTidy plugin [geocities.com] to get HTML Tidy integration in BBEdit.
JP
HTML Tidy and WDG (Score:1)
HTML to XHTML can only be made manually (Score:3, Interesting)
But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.
When you have a horror like
My company
to display a title, how do you want an automatic tool like Tidy to convert it to
My company
?
It just can't. It will see a table with no caption, no column headers and three elements: two images and a piece of text that, as far as the markup says, is not a title at all.
Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will have no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course, easy navigation with keyboard shortcuts in Mozilla is out of the question.
And the code will be even larger because of the indentation, closing tags and styles created by Tidy.
All benefits of XHTML/CSS are totally lost.
Look at a horror like
http://www.skyrock.com/
Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (e.g. the Sony Ericsson P800).
You don't see anything but the names of three files that are supposed to be images. That is all you can see on the web site: no links and no text.
Convert this to XHTML using Tidy.
The site still shows nothing but three useless filenames. It just takes twice as long to load because the code is larger.
Correct XHTML sites have to be designed the right way from the ground up. There's no magic that converts a horror into something clean. And even manually, the best way to do so is almost always to restart from scratch.
HTML to XHTML can only be made manually (extrans) (Score:3, Interesting)
Let's try again, sorry for the noise. I believed "plain old text" would escape HTML tags.
---
Yes, HTML Tidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and creates CSS classes instead of attributes like "background".
But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.
When you have a horror like
<table><tr><td width="100%" align="center"><img src="transparentpix.gif" width="20"><font size="9"><b>My company</b></font><img src="transparentpix.gif" width="20"></td></tr></table>
to display a title, how do you want an automatic tool like Tidy to convert it to
<h1>My company</h1>
?
It just can't. It will see a table with no caption, no column headers and three elements: two images and a piece of text that, as far as the markup says, is not a title at all.
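To make that concrete, here is a small Python sketch (standard library only; TagLister and describe are invented names) that lists what a parser actually finds in the snippet above. Nothing in the result hints that the text is a heading.

```python
from html.parser import HTMLParser

# The table-based "title" from the post above, split across lines for width.
SNIPPET = ('<table><tr><td width="100%" align="center">'
           '<img src="transparentpix.gif" width="20">'
           '<font size="9"><b>My company</b></font>'
           '<img src="transparentpix.gif" width="20"></td></tr></table>')

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TagLister(HTMLParser):
    """Collect start tags and non-blank text in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def describe(markup):
    lister = TagLister()
    lister.feed(markup)
    lister.close()
    return lister.tags, lister.text


tags, text = describe(SNIPPET)
has_heading = any(tag in HEADINGS for tag in tags)
```

The parser reports table, tr, td, img, font, b and img, plus the string "My company", and has_heading is False. Any tool working from that structure alone has no basis for emitting an h1.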
Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will have no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course, easy navigation with keyboard shortcuts in Mozilla is out of the question.
And the code will be even larger because of the indentation, closing tags and styles created by Tidy.
All benefits of XHTML/CSS are totally lost.
Look at a horror like
http://www.skyrock.com/
Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (e.g. the Sony Ericsson P800).
You don't see anything but the names of three files that are supposed to be images. That is all you can see on the web site: no links and no text.
Convert this to XHTML using Tidy.
The site still shows nothing but three useless filenames. It just takes twice as long to load because the code is larger.
Correct XHTML sites have to be designed the right way from the ground up. There's no magic that converts a horror into something clean. And even manually, the best way to do so is almost always to restart from scratch.