Independent Data and Formatting with Microformats
Posted by
ScuttleMonkey
on Tue Jul 11, 2006 06:21 PM
from the web-fun dept.
IdaAshley writes to tell us IBM DeveloperWorks is running an article about how to best utilize microformats to embed data within standard XHTML code. From the article: "Microformats are a pragmatic approach to solving the issue of structured data on the Web. Is it as architecturally pure as XML-encoded data separated from its formatting through a mechanism such as XSLT style sheets? No. But I think this approach is a realistic middle step that will help build a more intelligent Web that is easier to use and provides better search and data integration."

Geez, man... (Score:3, Insightful)
Re:Geez, man... (Score:4, Insightful)
(http://silmaril.ie/cgi-bin/blog)
There is already a buzzword: tag abuse. It's the last resort of the untalented.
This particular version is known as semantic imputation (giving things meanings they don't inherently have). It's neither new, special, exciting, nor useful, but at least we now know how little the people at IBM and Leverage Software know about markup and XML.
I guess I'd better add a warning to the XML FAQ [silmaril.ie] about it...
META headers (Score:1, Interesting)
(http://www.robotwisdom.com/)
Re:META headers (Score:4, Informative)
Get off your hobby-horse, Jorn. At some point, please realise that you are clueless about markup. Only then will you be able to learn a bit about what you are so high-and-mighty about.
Firstly, <meta> is an element type, not a header. It doesn't do your credibility much good when you don't even know what it is.
Secondly, <meta> is an astonishingly limited element type. It's scoped to the page, not to particular parts of it, and it has a plain-text content model because it uses attributes instead of child elements.
Thirdly, I anticipate you saying that you could fix this by changing the <meta> element type. Sure you could. You could fix it by changing it to a set of element types that describe content more accurately and changing it so that it could appear in other parts of the document. And you know what you'd have then? The structured HTML that you despise so much. That's right, microformats embody the very thing you are criticising.
Finally, given that HTML hasn't changed recently to allow microformats, everything that is possible today with microformats was possible five years ago with microformats. It's a design strategy, not a new technology.
Again, please learn a bit about something before you turn your nose up at it. You might be smart in other respects, but when it comes to markup, you are dumb. Please accept this so you can change it.
Re:META headers (Score:5, Informative)
None of it. META tags and microformats serve two entirely separate purposes, and neither is in any way a replacement for the other.
Firefox (Score:1)
(http://en.wikipedia.org/wiki/User:Sir_Lewk)
Tagging in Text (Score:1, Informative)
(http://slashdot.org/ | Last Journal: Wednesday January 29 2003, @02:50AM)
I do like the idea of being able to move XML around without having to parse to view the basic file in a formatted fashion. So, you're mixing HTML with a tag. Again, SO WHAT? But what about the encapsulated text, what's the point? If you're going to use a viewer eventually (because you have the encapsulated text), use a viewer. This would only help in reading the actual data, but not in bug fixing, because the XML is that much more unreadable.
On the other hand, this is kind of like the PDF format, with text as text. The PDF client renders it as a font bitmap but it's rendered from TEXT in the PDF, therefore you can do things like cut/paste/etc. This takes it a step further by adding a data structure around it which allows you to import rows of things. Pretty sweet, I might use this somewhere. I can see it being useful in mobile stuff, so you don't have to muck with a client parser.
Re:Tagging in Text (Score:5, Informative)
(http://itaintart.blogspot.com/index.html)
The article mentions the wiki [microformats.org], but doesn't link to it, except at the very bottom of the resources section.
LISP (Score:5, Insightful)
I'm sure the LISP community would love to hear about this brand-new idea of embedding special, or domain-specific if you will, languages and data. How extraordinarily novel.
You'll be running a limited LISP implementation on every browser in no time!
Re:LISP (Score:5, Funny)
Standardization is the problem (Score:5, Insightful)
So now why is this "vevent" class special, and who decided it would be "vevent" and not "scheduledevent" or "calendarevent" or "microsoftcalendarhassomethingforyoutodotoday"? Clearly as a human I can look at "dtstart" and think about it and realize that this means the starting date, but how does a computer know this? If the "semantic web" is going to take off, then we need semantics, and pronto.
Hopefully any standardization doesn't turn into a nightmare though. I used to develop in the healthcare insurance claims field, and the old NSF format for transmitting an insurance claim electronically was a horrible death-by-committee piece of work. It was as if nobody could come to a consensus and the committee decided to just throw everything in. You might look at your insurance card and think "gee I have an insurance ID number" but no, in the NSF, there were about 10 different blanks for insurance IDs, depending. Is it a Medicare number? Then it goes in the Medicare blank. God forbid the computer would have just one blank and assume that if you're billing Medicare then the number in the blank is probably a Medicare ID. Medicare was easy, there's just one. Medicaid in most states has a billion subcontractors, all with names that have nothing to do with "medicaid" so you simply had to maintain a magic list of insurance plans that changed every other year or so that used the Medicaid ID field. Or the separate fields for Blue Cross and Blue Shield. What about the states where you have BCBS as a single entity?
Anyway, I'm digressing (and ranting about a chunk of my life I'd much rather forget). What's important in standardizing in semantics is identifying everywhere where things are identical and reusing semantics whenever possible. Decisions have to be made up front as to what is the relationship between "name" and "last name" (people have a name, which has a last name, yet companies have names that typically don't have a last name. What about a cat named "John K. Wibblesworth": how is that different from one named "Tama"?) Yet, take dtstart which is used here for a calendar event. Should we have "dtclassstart" for the first day of school?
Re:Standardization is the problem (Score:5, Insightful)
No. I do remember how a lot of clueless PHB-types ran around telling everybody that though. XML solves the parsing problem, not the semantics problem. It's languages built on top of XML that handle semantics.
XML was never meant to solve the problem you are talking about. Parsing markup into a tree is a totally different concept to figuring out what the stuff in the tree means. The only people who ever thought XML had something to do with what you say were totally clueless about XML.
It's special because it appears in the hCalendar specification [microformats.org]. The people who wrote the specification decided it would be "vevent". They intend to submit it to a standards body.
Re:Standardization is the problem (Score:5, Informative)
(http://www.tschopp.net/)
The idea is to leverage standards that are already out there, and in this case it would be the iCalendar standard.
Re:Standardization is the problem (Score:4, Insightful)
(http://sc.tri-bit.com/ | Last Journal: Sunday July 08, @02:36AM)
Yeah. It works when you use the same DTD, which was the promise. It's not XML's fault that you and your supplier can't get your ducks in a row. The purpose of XML is to provide a medium that two ends can use to standardize a communications format of their own design, while giving a regular form to said formats so that arbitrary formats could be supported by arbitrary tools. It fulfills this ideal quite well, as anyone even vaguely familiar with web standards knows. It is not meant to magically merge two inconsistent standards.
Then <lname> tripped over <lastname> which was crawling on the floor after being decked by <name last="Henry"/> who was rather pissed off after an argument with <name><last>Henry</last></name>
Yeah. And that's XML's fault how? Get a DTD and stick to it.
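To make the point concrete: all four spellings parse without complaint, and the parser has no idea they mean the same thing. A consumer has to be told about each one. A minimal Python sketch using the standard-library `xml.etree.ElementTree`; the function name `last_name` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def last_name(fragment: str) -> Optional[str]:
    """Return the last name from any of the four variants, or None."""
    root = ET.fromstring(fragment)
    # Variants 1 and 2: the element itself holds the last name.
    if root.tag in ("lname", "lastname"):
        return root.text
    if root.tag == "name":
        # Variant 3: last name stored as an attribute.
        if "last" in root.attrib:
            return root.attrib["last"]
        # Variant 4: last name stored in a child element.
        child = root.find("last")
        if child is not None:
            return child.text
    return None


variants = [
    "<lname>Henry</lname>",
    "<lastname>Henry</lastname>",
    '<name last="Henry"/>',
    "<name><last>Henry</last></name>",
]
names = [last_name(v) for v in variants]
```

Every branch in `last_name` is knowledge that came from outside the parser, which is what an agreed DTD or schema makes unnecessary.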
and the whole thing went down in a pile of flames
Yeah, essentially every office suite, database, most graphics editors, many layout programs, and quite a few games support XML. Jabber / Google Chat run on XML. The web is built on an SGML dialect, which is largely being converted into an XML dialect; XML is itself an SGML dialect. Web 2.0 (god I hate that name) is an outcropping of XML's parsability. XML is so useful that Microsoft was able to use it to ward Massachusetts' lawsuits off. The United Nations now releases their transcripts solely in XML. XML is now the second most pervasive data storage format on earth, after CSV/TSV, and it's gaining fast. (Don't bother saying SQL - it's an API, not a storage format.)
Exactly what is your definition of "going down in flames" ?
and the whole thing went down in a pile of flames and is now relegated to being a 2MB configuration parsing library to embrace and extend "option=value".
Uh, TinyXML has a footprint of 40k, champ. Also, that's not what "embrace and extend" means.
So now why is this "vevent" class special, and who decided it would be "vevent" and not "scheduledevent" or "calendarevent" or "microsoftcalendarhassomethingforyoutodotoday"?
What a surprise, the guy who couldn't standardize on a DTD now fails to understand other format standardizations. Read the article, champ. It's not Slashdot's job to read for you, and this one's honestly pretty simple. Indeed, the specific purpose of microformats is to address your whining, but you don't see the point. Cough.
Clearly as a human I can look at "dtstart" and think about it and realize that this means the starting date, but how does a computer know this?
Er, by supporting a specific microformat. Are you putting in effort to be dense? It's the same way they support iCal, or MS Word files, or in fact any format at all, ever.
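For what it's worth, "supporting a specific microformat" can be sketched in a few lines. This is a rough Python illustration and not a conforming hCalendar parser: the class name `HCalendarParser` and the property list are assumptions made for the example, it expects XHTML-style markup where every element is closed, and it ignores text nested deeper than one level inside a property element.

```python
from html.parser import HTMLParser

# Class names the sketch treats as event properties (a small subset).
HCAL_PROPS = {"dtstart", "dtend", "summary", "location"}


class HCalendarParser(HTMLParser):
    """Collect one dict per element carrying class="vevent"."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._event = None  # dict being filled while inside a vevent
        self._stack = []    # per open element: property name or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._event is None:
            if "vevent" in classes:
                self._event = {}
                self._stack = [None]
            return
        prop = next((c for c in classes if c in HCAL_PROPS), None)
        if prop and attrs.get("title"):
            # Machine-readable value carried in the title attribute.
            self._event[prop] = attrs["title"]
            prop = None  # do not let the visible text overwrite it
        self._stack.append(prop)

    def handle_data(self, data):
        if self._event is not None and self._stack and self._stack[-1]:
            prop = self._stack[-1]
            self._event[prop] = self._event.get(prop, "") + data

    def handle_endtag(self, tag):
        if self._event is None:
            return
        self._stack.pop()
        if not self._stack:  # the vevent element itself just closed
            self.events.append(self._event)
            self._event = None


sample = """
<div class="vevent">
  <abbr class="dtstart" title="20060501">May 1</abbr> to
  <abbr class="dtend" title="20060502">May 2</abbr>:
  <span class="summary">My Conference opening</span> at
  <span class="location">Hollywood, CA</span>
</div>
"""
parser = HCalendarParser()
parser.feed(sample)
parser.close()
events = parser.events
```

The computer "knows" what dtstart means only because the list of names was written into the program from the specification, the same as for any other file format.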
If the "semantic web" is going to take off, then we need semantics, and pronto.
This has nothing to do with the semantic web. You want to drop another? Ontological Web Language sounds important too. Use that one more often: fewer people will see through you.
God forbid the computer would have just one blank and assume that if you're billing Medicare then the number in the blank is probably a Medicare ID.
Yes, I'm sure the people billing Medicare who aren't using Medicare IDs will be greatly amused that your application just fails for them. Why is it that I don't believe you had much to do with the design of the system?
What's important in standardizing in semantics is identifying everywhere where things are identical and reusing semantics whenever possible.
"Semantics" aren't reusable. They're not arbitrarily applied. Please stop using words you fail to understand. Not every markup of data is semantic, even if the markup means something. Semantics are the work of understanding context, not identifying relations.
I don't get it... (Score:5, Insightful)
<span style="display: none">
<vevent:event>
<vevent:dtstart>20060501</vevent:dtstart>
<vevent:dtend>20060502</vevent:dtend>
<vevent:summary>My Conference opening</vevent:summary>
<vevent:location>Hollywood, CA</vevent:location>
</vevent:event>
</span>
Wouldn't this be the 'right'[tm] way to do it?
Re:I don't get it... (Score:5, Informative)
History, failures, doomed to repeat (Score:5, Insightful)
There's no namespacing. There's not even an ATTEMPT at namespacing. This will fast become an unmanageable hodge-podge of insanity, with common words used willy-nilly in class attributes.
The class attribute is defined as CDATA. That's it. You can use pretty much ANY character in it. There's a lot of characters that can't be used in a CSS selector, though, such as ":". See where I'm going with this? <div class="mf:vevent"> for a start. Better yet, <div class="hidden mf:vevent"> such that you can hide (or format) the block of data separately.
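On the consuming side, that kind of prefix is cheap to handle. A minimal Python sketch, assuming the hypothetical `mf:` prefix from the examples above; `microformat_classes` is an illustrative name and no microformats specification defines such a prefix. (A colon in a class name would need a backslash escape in a CSS selector, as in `.mf\:vevent`.)

```python
from typing import List


def microformat_classes(class_attr: str, prefix: str = "mf:") -> List[str]:
    """Return the class tokens that carry the prefix, with the prefix removed."""
    # The class attribute is a whitespace-separated list of tokens.
    tokens = class_attr.split()
    return [tok[len(prefix):] for tok in tokens if tok.startswith(prefix)]


found = microformat_classes("hidden mf:vevent")
plain = microformat_classes("hidden vevent")
```

With the prefix, a styling class such as `hidden` is never mistaken for data; without it, `plain` comes back empty and the consumer has to guess which common words are meant as markup.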
Now, as if that wasn't bad enough, and, trust me, it IS bad enough, there's also the misuse of the "title" attribute and the "abbr" element. A machine formatted date is not the expanded version of a human formatted date, which is not an abbreviation. A renderer trying to make sense of <abbr class="dtstart" title="10034134134T00">17th Smarch</abbr> will think "AHA! This here is an abbreviation, I will provide unto the user some means to see what that '17th Smarch' abbreviation stands for!" Usability disasters follow.
So, in summary, this is the worst idea I've seen in HTML space since some bright spark said, "let's suggest that people use the 'text/html' content type for their XHTML markup!"
HoTMetaL (Score:3, Insightful)
(http://slashdot.org/~Doc%20Ruby/journal | Last Journal: Thursday March 31 2005, @01:48PM)
Pingerati from Technorati (Score:2)
(http://www.simpy.com/ | Last Journal: Tuesday April 15 2003, @12:58PM)
hResume and Emurse.com (Score:2)
(http://www.emurse.com/)
We're looking to implement hResume on Emurse.com [emurse.com] web resumes here in the next couple of days.
I'm really excited about being able to push the standard some. We've been wondering what the negative effects of it could be though, in terms of screen scrapers (alex.emurse.com, for instance). Anyone have any thoughts?
We've built hResume support to be configurable by the user, if it proves to be an issue. Just wondering how we should initially offer it.
I Was Going To Say... (Score:4, Interesting)
I was going to say "I Don't Get It" but somebody beat me to it.
I think the title of TFA "Separate data and formatting with microformats" is a bit ironic since it's about wedging your data into a web page in such a fashion that somebody might be able to pull it back out.
If you want to make your data available there are all sorts of standard and more efficient ways of doing it than embedding it in the presentation layer. If somebody is going to all the trouble to create a parseable human-readable page, why wouldn't they go to about the same amount of trouble and make a far more efficient and standard RSS feed? What about the buzzword of the last few years, SOAP? Hell, what about XML?
From TFA:
I agree. This reminds me of the lame number tricks where you have somebody pick a number, add something, multiply it by something, blah blah blah, you take the result, divide it by 7 and then you give them their original number because you had it all set up ahead of time. If they screw up in their calculations, the trick doesn't work. In this thing, if you screw up embedding the text within the HTML (plenty of ways to do that), the trick doesn't work - and doesn't accomplish much even if it does.
JSON (Javascript over the wire) (Score:2, Informative)
(http://www.wanfear.com/~mbrito)
We have this, only IE does not support it. (Score:2)
(http://anti-slash.org/)
It appears you were thinking about the data URI scheme [wikipedia.org]. Unfortunately, and very much like modern CSS standards, the only browser to not support it is the one with the greatest market share.
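For anyone who hasn't run into it, a data URI just carries the payload inline after a media type. A small Python sketch of the base64 form; the helper name `to_data_uri` is made up for this example.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str = "text/plain") -> str:
    """Encode a payload as a base64 data URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


uri = to_data_uri(b"Hello, world!")
```

The result can be dropped into an href or src attribute in browsers that support the scheme.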
I don't see how this is better than XML/XSLT. (Score:2)
This "Microformatting" concept is predicated on the idea that data is (or should be) human-readable in its default state, but with mechanisms that make it easier to translate it into something machine-readable. This seems backwards to me.
Humans only need to be able to comprehend the data structure at two points: input and output. In between, computers may perform a thousand different transfers and transformations on the data, and at those points, the ability to see the data in plain English (or plain Anyotherlanguage) is just excess baggage.
He mentions Webmonkey and Technorati as computer services which essentially work by screen-scraping content intended for humans and hacking it into something for computers. This is not to be encouraged.
The XML output of the author's sample transformation seems like a more logical default storage format for the data. It's easy and flexible to transform this data back into any format desired, and certainly easier than transforming from "Microformatted" XHTML to intermediate XML to target format.
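As a rough illustration of how short that last hop can be, here is a Python sketch that turns a plain XML event into an iCalendar VEVENT block. The element names (`event`, `dtstart`, `dtend`, `summary`, `location`) are assumptions made for the example and not the article's actual output, and the result is deliberately simplified: no UID or DTSTAMP, no line folding.

```python
import xml.etree.ElementTree as ET


def escape_text(value: str) -> str:
    """Escape characters that are special in iCalendar TEXT values."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def xml_event_to_ical(xml_text: str) -> str:
    """Convert one <event> element into a simplified VEVENT block."""
    event = ET.fromstring(xml_text)
    lines = ["BEGIN:VEVENT"]
    # Date-only values, written as YYYYMMDD in the source XML.
    for tag in ("dtstart", "dtend"):
        value = event.findtext(tag)
        if value:
            lines.append(f"{tag.upper()};VALUE=DATE:{value.strip()}")
    # Free-text values need escaping.
    for tag in ("summary", "location"):
        value = event.findtext(tag)
        if value:
            lines.append(f"{tag.upper()}:{escape_text(value.strip())}")
    lines.append("END:VEVENT")
    return "\r\n".join(lines)


source = """
<event>
  <dtstart>20060501</dtstart>
  <dtend>20060502</dtend>
  <summary>My Conference opening</summary>
  <location>Hollywood, CA</location>
</event>
"""
ical = xml_event_to_ical(source)
```

Starting from microformatted XHTML, the same conversion needs an extraction step in front of it before any of this can run.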