Convert from HTML to XML With HTML Tidy

An anonymous reader writes "HTML Tidy is a powerful tool for converting old HTML pages to newer standards such as XML. This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."
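For readers who want to script the conversion, here is a minimal sketch of driving the Tidy command line from Python. It assumes a `tidy` binary is on the PATH; the function names and file names are made up for illustration, and the summary itself does not say which options the tip uses.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build the argument list for converting src (HTML) into dst (XHTML)."""
    # -q: suppress nonessential output, -asxhtml: emit well-formed XHTML,
    # -numeric: write numeric rather than named entities, -o: output file.
    return ["tidy", "-q", "-asxhtml", "-numeric", "-o", dst, src]


def convert(src, dst):
    """Run Tidy if it is installed; return its exit status, or None if missing."""
    if shutil.which("tidy") is None:
        return None
    # Tidy exits with 0 on success, 1 if it issued warnings, 2 on errors.
    return subprocess.run(tidy_command(src, dst)).returncode
```

A call such as `convert("index.html", "index.xhtml")` leaves the original file alone and writes the converted copy next to it, so the two can be compared before anything is replaced.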
This discussion has been archived. No new comments can be posted.

  • by Kethinov ( 636034 ) on Monday September 29, 2003 @11:52AM (#7085287) Homepage Journal
    I've always been interested in X(HT)ML, but I've never wanted to sit down and convert every single page by hand. This tool might be just what I need.
    • Re:I'll check it out (Score:2, Informative)

      by in10se ( 472253 )
      It's extremely useful for converting the "HTML" generated by Microsoft Office products into nice, clean, well formatted XHTML.
    • Re:I'll check it out (Score:5, Interesting)

      by fm6 ( 162816 ) on Monday September 29, 2003 @03:20PM (#7087465) Homepage Journal
      I've sneered at XHTML in the past, but I was speaking out of ignorance. I was assuming it was just a silly attempt to preserve HTML in an XML world. Actually, it's a very convenient bridge between HTML and XML. It's only incidentally about web content, since browsers will always need to support legacy HTML, and thus will never adopt all of XHTML's structure and restrictions. But once you have your content in XHTML format, you can transform it into any XML application you choose, using XSLT scripts. Which opens up a whole world of possibilities for people with all their content in messy old word processor formats, since word processors now tend to come with HTML export filters.
      • Enlighten me. Should I convert a web page from old HTML into XHTML, exactly how is the code reusable in areas other than web page design? I've always been under the impression that XHTML would one day just replace legacy HTML, but you seem to think otherwise.
        • XHTML is great for a bunch of reasons.

          First off, every reason to use HTML 4.0 is a reason to use XHTML, unless that reason happens to be "it's not XHTML!".

          Secondly, using XHTML allows you all the niceties of XML. This is great when you decide to update your site so it works in say cell-phone browsers, rather than just a PC browser. This alone is a great reason to use XHTML. As more and more data sources become xml aware, being able to easily connect them becomes important. XHTML allows you to do this
        • Well first of all, legacy HTML will never go away -- not as long as millions of people are hacking out web pages by hand, or using antiquated HTML editors. XHTML will never completely replace legacy HTML, and if I still thought that was XHTML's central purpose, I would still consider XHTML a waste of effort.

          The big virtue of XHTML is the big virtue of all XML document types: it's open. You can do anything with an XML document. I suppose that's also true of say TeX or RTF. Except these formats are very mes

  • There seem to be some nice cut-down pages available as XML/RSS. Any good conversions or text-based readers?

  • libxml2? (Score:5, Informative)

    by dakkar ( 128056 ) <dakkar@nOspAM.thenautilus.net> on Monday September 29, 2003 @11:57AM (#7085345) Homepage

    A few days ago I had to convert HTML pages into XHTML, stripping out a few extra elements and attributes. I used xsltproc, from libxslt [xmlsoft.org], which uses the parser from libxml2 [xmlsoft.org], and this has the option of parsing strict HTML into an XML DOM.

    HTML Tidy can be useful when you have not-so-strict HTML, but for most quick conversions I've found libxml2 & co. to be quite light and easy.
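The idea in this post (parse strict HTML into an XML tree, dropping a few elements and attributes along the way) can be sketched as follows. This is not libxml2 or xsltproc: it swaps in Python's standard-library `html.parser` and `xml.etree.ElementTree` so it runs without extra packages. The `DROP_TAGS` and `DROP_ATTRS` sets are hypothetical choices, and the sketch assumes properly nested input.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag
DROP_TAGS = {"font", "center"}    # hypothetical: unwrap these, keep their text
DROP_ATTRS = {"align", "bgcolor"}  # hypothetical: remove these attributes


class HtmlToXml(HTMLParser):
    """Feed strict HTML in, build an ElementTree out."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            return
        # Valueless attributes (e.g. "checked") get their own name as value.
        kept = {k: (v if v is not None else k)
                for k, v in attrs if k not in DROP_ATTRS}
        self.builder.start(tag, kept)
        if tag in VOID:
            self.builder.end(tag)  # close void elements immediately

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS and tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def html_to_xml(html):
    """Return the document serialized as well-formed XML text."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.builder.close(), encoding="unicode")
```

Unlike Tidy, nothing here repairs broken markup, which matches the trade-off described above: light and easy for strict input, the wrong tool for tag soup.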

  • This isn't new... (Score:3, Informative)

    by in10se ( 472253 ) on Monday September 29, 2003 @12:01PM (#7085395) Homepage

    HTML Tidy has been out for years.



    Check out the Tidy Homepage [w3.org] or the project on SourceForge [sourceforge.net].

  • news for nerds? (Score:2, Insightful)

    by Anonymous Coward
    more like tips for newbies

    but yeah this is a great tip.. especially if you are writing web-scrapers to extract data from web pages and/or convert them to RSS. Just use Tidy to tidy it, and then your favorite XML parser can slurp it right up and you can use XPath to pull out what you need.
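A minimal sketch of the second half of that workflow, using Python's standard-library ElementTree and its limited XPath support. The `TIDIED` string is a hand-written stand-in for what a tidied page might look like, not real Tidy output, and the `headlines` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a tidied page (an assumption, not real Tidy output).
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>News</title></head>
  <body>
    <h2><a href="/story/1">First story</a></h2>
    <h2><a href="/story/2">Second story</a></h2>
  </body>
</html>"""

# XHTML elements live in this namespace, so XPath steps need a prefix for it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def headlines(xhtml):
    """Return (title, link) pairs for every link directly inside an h2."""
    root = ET.fromstring(xhtml)
    return [(a.text, a.get("href")) for a in root.findall(".//x:h2/x:a", NS)]
```

One thing to watch for: a plain XML parser does not know HTML's named entities such as `&nbsp;`, so it helps if the tidied output uses numeric entities.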

    Look out though, there are some cases that Tidy chokes on. One that I keep running into is shit like this:

    <table>
    <form>
    <tr><td>...</td></tr>
    </table >
    </form>

    Basically mixing
    • Yeah, the first line is a little flamebait-ey, but other than that it's informative.
    • Re:news for nerds? (Score:4, Insightful)

      by aWalrus ( 239802 ) <sergio&overcaffeinated,net> on Monday September 29, 2003 @02:14PM (#7086808) Homepage Journal
      That's because that's invalid markup. When it gets tidied, where would you put the form? inside or outside the table?
      • Re:news for nerds? (Score:3, Informative)

        by mhesseltine ( 541806 )

        That's because that's invalid markup. When it gets tidied, where would you put the form? inside or outside the table?

        Well, from the W3C page on HTML 4.01 [w3.org]

        The FORM element acts as a container for controls. It specifies:

        * The layout of the form (given by the contents of the element).
        * The program that will handle the completed and submitted form (the action attribute). The receiving program must be able to parse name/value pairs in order to make use of them.
        * The method by which user data will be sent to th

        • Yes, it makes sense to nest the table inside the form. The problem was that the form started inside the table, and ended outside. It was improperly nested.

          Correct:
          <table>
          <form>
          </form>
          </table>

          Incorrect:
          <table>
          <form>
          </table>
          </form>
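The difference between the two snippets can be checked mechanically with a stack of open tags. The following is a rough sketch using Python's standard-library `html.parser`; the class and function names are made up, and it only looks at nesting order, not at which elements are allowed inside which.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class NestingChecker(HTMLParser):
    """Record every end tag that does not close the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []  # (closing tag seen, tag that was actually open)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            innermost = self.open_tags[-1] if self.open_tags else None
            self.errors.append((tag, innermost))


def nesting_errors(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.errors
```

On the "Incorrect" snippet this reports that `</table>` arrived while `form` was still open. It also flags omitted optional end tags such as `</p>`, so it is only meaningful for markup written in the XHTML style.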
    • ...

      There's actually a reason why people write code like this. The <form> tag in many browser implementations emits the equivalent of a line break after the tag. Heavily styled pages with forms therefore get unexpected spaces, which pisses off a lot of designers. Putting the <form> as in the above prevents this, even if it's incorrect. Browsers don't seem to have a problem.

      Nowadays the correct way to do this is with CSS. An inline style on the <form> tag should do the trick, or just toss a form {margin:0px} in the document style sheet.

  • Why use XHTML? (Score:4, Informative)

    by Alethes ( 533985 ) on Monday September 29, 2003 @01:05PM (#7086089)
    Ian Hickson makes a good case here [hixie.ch] that using XHTML may not be the right direction to go -- at least at this point.
    • Well, this is veering off-topic, but the reason the MIME type isn't used for the most part is that the user agents that are on the market don't know how to handle the application/xhtml+xml type. I don't see this as any real reason not to use XHTML; you've just got to be careful to make it well-formed. Ian's argument stands for crappy HTML too, and more than a few people I've run into don't want to use HTML for anything as the HTML they've run into doesn't make much sense. Some tags are open, some times you close on
    • I use XHTML on my site for two reasons.

      The first is that I'm a nerd and I want to use a cutting-edge standard. I imagine that that is a big motivation for a lot of XHTML users.

      The second is that I'm a big LaTeX fan and its system of separating appearance from content really appeals to me. XHTML does the same for the web. One can concentrate on just putting information in there, and then can keep visual appearance in a separate place and easily replace it site-wide if necessary. A consequence of this is

    • This is FUD.

      (At the end it even links to http://www.mozillaquestquest.com/ [mozillaquestquest.com], which, in their own words:
      After much soul searching we have decided to shut down MozillaQuestQuest. In our opinion we cannot compete with MozillaQuest's content for humour value.

      Keywords: Humour value.
    • This is sort of ad hominem [nizkor.org], but it's hard to accept criticism of markup languages from a guy whose web pages are hand-formatted text!

      But let's skip the Latin and look at Hickson's actual arguments. They're a little convoluted, so maybe I'm not reading him correctly. As far as I can tell, he's mostly saying that correct XHTML is hard to prepare. Well, yeah, that's the whole point of using a tight-assed XML document type like XHTML instead of a tolerant, laid-back SGML document type like HTML: you're embra

  • If you are running Windows, there is a nice HTML editor called HTML-Kit [chami.com] that integrates HTML Tidy right in. It's not WYSIWYG, but it color-codes your HTML and can format it a number of ways.

    • MOD PARENT UP!! He's right, HTML-Kit is the choice of even those who use Dreamweaver MX, because Dreamweaver does not respect the formatting of your HTML. HTML coders use HTML Tidy and HTML-Kit to clean up Dreamweaver HTML output, and you-know-who's HTML output, of course, which is so disrespectful it would stomp on your toes if it could.
    • Many web development applications have HTML Tidy built in. One I use is HTML Builder XP [code-builders.com]. Don't let the name fool you. It's more than an HTML editor. It comes with functions for creating CSS, ASP, and PHP (4.x integrated!) and customizable DHTML scripts. It has tabbed preview windows to check your rendered code in as many browsers as you have.

      HTML Builder XP is created by one of the two developers of the now-defunct 1st Page 2000 [evrsoft.com] by Evrsoft. Evrsoft is now just the one remaining developer who has essential

  • by jpkunst ( 612360 ) on Monday September 29, 2003 @02:42PM (#7087058)

    If you are running MacOS with BBEdit, you can use the BBTidy plugin [geocities.com] to get HTML Tidy integration in BBEdit.

    JP

  • I recently used HTML Tidy to convert my website from html to xhtml. It works well. After Tidy, I used WDG HTML Validator [htmlhelp.com] to verify that the code was correct. (It validates XHTML as well.) If you install your own version of the validator you can more easily check your entire website. This is important if you have a lot of pages.
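For a site with a lot of pages, a first pass can be automated before reaching for a real validator. The sketch below only checks XML well-formedness with Python's standard library; it does not validate against the XHTML DTD the way the validator mentioned above does. The function name and the `*.html` pattern are assumptions.

```python
import pathlib
import xml.etree.ElementTree as ET


def malformed_pages(site_root):
    """Return {path: parser message} for pages that are not well-formed XML."""
    problems = {}
    for path in sorted(pathlib.Path(site_root).rglob("*.html")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            problems[str(path)] = str(err)
    return problems
```

An empty result means every page at least parses as XML, which is a necessary condition for valid XHTML but not a sufficient one.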
  • by chrysalis ( 50680 ) on Monday September 29, 2003 @06:28PM (#7089341) Homepage
    Yes, HTML Tidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and creates CSS classes instead of attributes like "background".

    But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.

    When you have a horror like :

    My company

    to display a title, how do you want an automatic tool like Tidy to convert it to :

    My company

    ?

    It just can't. It will see a table with no caption, no column headers and three elements : two images and a text that is not supposed to be a title at all.

    Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will have no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course easy navigation with the keyboard shortcuts using Mozilla is out of the question.

    And the code will even be larger because of the indentation, closing and styles created by Tidy.

    All benefits of XHTML/CSS are totally lost.

    Look at a horror like :

    http://www.skyrock.com/

    Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (ex: Sony/Ericsson P800).

    You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any link nor any text.

    Convert this to XHTML using Tidy.

    The site still doesn't look like anything but three useless filenames. It just takes twice as long to load because the code is larger.

    Correct XHTML sites have to be designed the right way from the ground up. There's no magic to convert a horror to something clean. And even manually, the best way to do so is almost always to restart from scratch.

  • by chrysalis ( 50680 ) on Monday September 29, 2003 @06:32PM (#7089375) Homepage
    Argl, I forgot to enable "Extrans" before submitting the previous post :(

    Let's try again, sorry for the noise, I believed
    "plain old text" would escape HTML tags.

    ---

    Yes, HTML Tidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and creates CSS classes instead of attributes like "background".

    But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.

    When you have a horror like :

    <table><tr><td width="100%" align="center"><img src="transparentpix.gif" width="20"><font size="9"><b>My company</b></font><img src="transparentpix.gif" width="20"></td></tr></table>

    to display a title, how do you want an automatic tool like Tidy to convert it to :

    <h1>My company</h1>

    ?

    It just can't. It will see a table with no caption, no column headers and three elements : two images and a text that is not supposed to be a title at all.

    Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will have no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course easy navigation with the keyboard shortcuts using Mozilla is out of the question.

    And the code will even be larger because of the indentation, closing and styles created by Tidy.

    All benefits of XHTML/CSS are totally lost.

    Look at a horror like :

    http://www.skyrock.com/

    Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (ex: Sony/Ericsson P800).

    You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any link nor any text.

    Convert this to XHTML using Tidy.

    The site still doesn't look like anything but three useless filenames. It just takes twice as long to load because the code is larger.

    Correct XHTML sites have to be designed the right way from the ground up. There's no magic to convert a horror to something clean. And even manually, the best way to do so is almost always to restart from scratch.
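To make the "well-formed is not the same as meaningful" point concrete, here is a small sketch. `CLEANED` is a hand-written well-formed version of the table from this post, an assumption about roughly what a cleanup pass would leave behind rather than actual Tidy output, and the `headings` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written well-formed version of the table above; an assumption,
# not actual Tidy output.
CLEANED = (
    '<table><tr><td width="100%" align="center">'
    '<img src="transparentpix.gif" width="20" />'
    '<font size="9"><b>My company</b></font>'
    '<img src="transparentpix.gif" width="20" />'
    '</td></tr></table>'
)

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def headings(xhtml):
    """Return the text of every h1..h6 element in a well-formed fragment."""
    root = ET.fromstring(xhtml)
    return ["".join(el.itertext()) for el in root.iter() if el.tag in HEADING_TAGS]
```

The cleaned table parses without complaint, yet `headings(CLEANED)` comes back empty, while the hand-written `<h1>My company</h1>` is found immediately. The title is still there as text, but nothing in the markup says it is a title.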
