W3C Gets Excessive DTD Traffic 334
eldavojohn writes "It's a common string you see at the start of an HTML document: a URI declaring the document type. Too often, though, software fetches it automatically, causing undue traffic to the W3C's site. There's a somewhat humorous post today from W3.org that reads like a cry for sanity, asking developers to stop building systems that automatically query this information. From their post: 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"
Re:Wow (Score:5, Insightful)
Re:Who made the DTD a URL? (Score:3, Insightful)
and use that instead.
Re:Leave it to Slashdot... (Score:2, Insightful)
Re:Wow (Score:5, Insightful)
Why on earth are you blaming webmasters? They are just about the only people who cannot be responsible for this. People who write HTML parsers, HTTP libraries, screen-scrapers, etc, they are the ones causing the problem. Badly-coded client software is to blame, not anything you put on a website.
Gumdrops (Score:5, Insightful)
The only thing I'm unclear on is whether your average browser is contributing to this problem when parsing properly written documents.
Re:caching (Score:3, Insightful)
You're solving that problem at the wrong layer. HTTP already includes caching mechanisms, the W3C already use them, and part of the problem is that buggy software is ignoring them.
Please read the article. This is already supposed to happen. Buggy software fails to do this, which is the problem being talked about.
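For what it's worth, the revalidation step HTTP already provides is tiny: remember the validators (ETag / Last-Modified) from the last response and send them back on the next request. A minimal Java sketch of that logic (the class and method names here are illustrative, not from any particular library):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the client-side revalidation step that buggy software skips:
// remember the validators from the previous response and send them back,
// so the server can answer 304 Not Modified instead of resending the DTD.
public class CacheHeaders {
    public static Map<String, String> revalidationHeaders(String etag, String lastModified) {
        Map<String, String> headers = new HashMap<>();
        if (etag != null) {
            headers.put("If-None-Match", etag);             // echoes a previous ETag header
        }
        if (lastModified != null) {
            headers.put("If-Modified-Since", lastModified); // echoes Last-Modified
        }
        return headers;
    }
}
```

A 304 response then means "your cached copy is still good" and no DTD bytes cross the wire at all.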
Surprise (Score:4, Insightful)
I've got to say, this doesn't surprise me at all. In the time I've spent at my job, I've been repeatedly floored by the amazing conduct of other companies' IT departments. We've only encountered two people I can think of who have been hostile; everyone else has been quite nice. You'd think people would have things set up well, but they don't.
We've seen many custom XML parsers and encoders, all slightly wrong. We've seen people transmitting very sensitive data without using any kind of security until we refused to continue working without SSL being added to the equation. We've seen people who were secure change their certificates to self-signed, and we seem to consistently know when people's certificates expire before they do.
But even without these things, I can't tell you how many people send us bad data and flat out ignore the response. We get all sorts of bad data sent to us all the time. When that happens, we reply with a failure message describing what's wrong. Yet we keep receiving data that is wrong, in the same way, from the same people. I'm not talking about sending us something they aren't supposed to (X when we say only Y); I'm talking about invalid-XML wrong... such that it can't be parsed.
We have, a few times while I've been there, had people make a change in their software (or something) and bombard us with invalid data until we either block their IP or manage to get into voice contact with their IT department. Sometimes they don't even seem to notice the lockout.
Some places can be amazing. Some software can be poorly designed (or something can cause a strange side effect, see here [thedailywtf.com]). I really like one of the suggestions in the comments on the article... start replying really slow, and often with invalid data. They won't do it. I wouldn't. But I like the idea.
Re:Who made the DTD a URL? (Score:5, Insightful)
The URL can be seen as a backup ("in case you don't know the DTD for W3C HTML 4.01, you can create a local copy from this URL" - in the future, when people have forgotten HTML 4.01, that could be useful), or it works the same way XML namespaces are used - you don't have to send an HTTP request to http://www.w3.org/1999/xhtml [w3.org] to know that a document using that namespace is an XHTML document. It's just another form of uniform resource identifier (URI), like a URN or a GUID.
What the W3C is having a problem with is applications that decide to fetch the DTD on every single request. That's just crazy. Why do you even need to validate, unless you're a validator? Just try to parse it - it probably won't validate anyway, and you'll either have to handle it in some kind of quirks mode or just break. If you can parse it correctly, does it matter whether it validates? If you can't parse it, does it matter whether it validates? And if you actually do want to validate, why make the user wait a few seconds while you fetch the DTD on every page request? The only reasonable cause I can think of is link crawlers that find the URL - but don't crawlers usually avoid revisiting pages they just visited?
Re:Wow (Score:5, Insightful)
The expectation is that software would ship with its own copies of "well-known" DTDs with associated catalog entries; the URL is only there as a fallback. The problem is ignorant and/or lazy software developers not implementing catalogs and simply downloading from the URI each time.
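As a reference point, the usual mechanism for shipping "well-known" DTDs is an OASIS XML catalog that maps the public/system identifiers to bundled files. A minimal sketch (the local paths here are illustrative):

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Map the HTML 4.01 Strict identifiers to a copy shipped with the app -->
  <public publicId="-//W3C//DTD HTML 4.01//EN"
          uri="file:///usr/share/xml/html4/strict.dtd"/>
  <system systemId="http://www.w3.org/TR/html4/strict.dtd"
          uri="file:///usr/share/xml/html4/strict.dtd"/>
</catalog>
```

A catalog-aware parser consults this file first and only falls back to the network when nothing matches.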
Make it slower, not faster (Score:2, Insightful)
Re:Wow (Score:1, Insightful)
The problem is with the docs (Score:5, Insightful)
Simple caching on the client side could already improve the situation a whole lot... BUT:
When people implement something for HTML-ish or SVG-ish or XML-ish purposes, they google for it: "Howto XML blah foo". The result is basic screw-it-with-a-hammer tutorials that don't point out important design decisions but instead Just Work - which is exactly what the author wanted when they started writing the software.
It's a little bit like people still using ifconfig on Linux even though it's been deprecated and superseded by iproute2. But since most tutorials and howtos on the net are just dumbed-down copypasta for quick and dirty hacks - and since nobody fucking enforces the standards - nobody does it the Right Way.
So if I start writing some SAX parser, some HTML-rendering lib, some silly scraper, whatnot... the first example implementations only deal with basic stuff and show me how to get basic functionality working. And I'm not really interested in that part of the program anyway, because I need it to put something fancier on top. Once I'm through with the initial testing of that particular subsystem, I won't really care about anything else. It works, it doesn't seem to hit performance too badly, and it's done according to some random guy's completely irrelevant blog - hey, this guy knows what he's doing. I don't care!
This story hitting
Re:I'm just conforming! (Score:2, Insightful)
What's the point of having a DTD if it won't change? Oh yeah, there is none. Conceptually, the DTD is there to define the data: unless you know what is in the DTD, you cannot use it to validate, which is its purpose. And conversely, if you assume the data is defined a certain way, you don't need a DTD.
Generally the DTD is for the person parsing the XML. If you're writing the XML, you don't need a DTD, because you already know the schema. If it's only for the XML writers, all you'd need to do is place your schema with the rest of the specs for your application.
Now I wasn't suggesting that in practice you should go to the server every time and fetch the DTD. But clearly you take things too seriously.
Try getting a sense of humor.
Re:Delay (Score:5, Insightful)
Re:Wow (Score:5, Insightful)
Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.
Re:Wow (Score:5, Insightful)
Wow, it just struck me... welcome to Microsoft's world.
Their security was so bad for so many years because they worked on the assumption that:
1) Programmers know what they're doing
2) Programmers aren't assholes
Of course, the success of malware vendors (and Real Networks) proved those two assumptions wrong many years ago, and probably 90% of the development work on Vista was adding safeties to protect against idiot programmers and asshole programmers.
And now the W3C is getting their lesson on a golden platter.
In short, here's the lesson learned:
1) Some proportion of programmers don't know what they're doing and never will
2) Some proportion of programmers are assholes
That's the problem with a URI for an ID (Score:4, Insightful)
If they'd at least made the identifier NOT a URI - something like domain.example.com::[path/]versionstring, or something else that wasn't a URL - so it was clearly an identifier even if it was ultimately convertible to a URL, they would have avoided this kind of problem.
Re:Do what.... (Score:4, Insightful)
Re:They already do. (Score:4, Insightful)
The problem is that several major XML libraries don't just default to no DTD/schema cache - they don't even implement a cache or local catalog. Implementing such a thing is left to the developers using the library.
For example, the XML libraries that come with Sun's Java rely on java.net.URL for downloading resources. I just checked my 1.6 Java install, and by default it has no cache. In looking up how the java.net cache works, I discovered it wasn't even added until Java 1.5. So prior to Java 1.5, most Java libraries wouldn't cache responses at all because the included library didn't support caching. 'Course, even in Java 1.6 there's no default implementation, so each Java application has to implement its own cache[1].
The included Java libraries also offer no internal DTD/schema catalog. You can create one (implement org.xml.sax.EntityResolver[2]) but by default they're off to the Internet to download any DTD they run across.
It's really not hard to see how these libraries could result in millions of hits a day - most people using them probably don't even realize that they're hitting the W3C's servers since it happens transparently. And fixing it is unfortunately not just setting configuration files and saving the DTDs locally: it's implementing a bunch of classes.
[1] And for added fun, the stub that is provided appears to be insufficient to support conditional requests - either the cache says "I have it!" and the cached response is used, or the server has to send a new copy. There's no way to offer up an "If-Modified-Since:" request via the cache class.
[2] Noting that this can't be set for all parsers, it's set on a per-parser object basis. So if you use a third-party library that parses XML after creating its own parser object, you can't make it use your local DTD catalog.
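To illustrate footnote [2], here's roughly what a minimal EntityResolver looks like. This sketch simply short-circuits anything hosted under w3.org with an empty entity; a real catalog implementation would return the bundled DTD file instead:

```java
import java.io.ByteArrayInputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Minimal local "catalog": intercept requests for w3.org-hosted DTDs so the
// parser never goes out over the network. A real implementation would return
// an InputSource pointing at a bundled copy of the DTD rather than an empty one.
public class LocalDtdResolver implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.startsWith("http://www.w3.org/")) {
            return new InputSource(new ByteArrayInputStream(new byte[0]));
        }
        return null; // anything else: fall back to the parser's default behavior
    }
}
```

The per-parser wiring the footnote complains about is then `xmlReader.setEntityResolver(new LocalDtdResolver());` - which, as noted, you can't do when a third-party library creates its own parser objects.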
Re:Wow (Score:5, Insightful)
Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.
Good and jolly bacon bits, please mod parent up. I realize that their comment might come off as harsh, but crap, come on. If one is building an application, would one really want to have to connect to a website to get instructions on how to read a filetype? Especially when all it would take is a single wget, and including those instructions with the application, to avoid all of this.
Furthermore, it would seem that the process of reading a file would be far faster if the processing instructions were on the local file system rather than on a remote host. If one were really worried about changes to the instructions, one could code a routine to update the DTD whenever the application is updated; if the app isn't one that *would* be updated, one could always have it run a diff against the W3C's DTD every few months - once it's been standardized, it's not like the DTD is going to change on a daily basis. While not a complete cure, it'd still be far more considerate of the W3C's bandwidth than hitting it every request, or even every time a program is started.
Honestly, I wouldn't blame them if they 302'd the file to a page that, upon CAPTCHA'd request, made the file temporarily available for download, so that vendors could fix their broken software. They're obviously far more considerate and forgiving people than I - and, I suspect, many of you fellow Slashdotters - tend to be.
*puts on flame-resistant suit*
Re:That's what you get for making stupid rules. (Score:5, Insightful)
Should SGML renderers cache it? Yes. Should W3C bitch that some SGML renderers are downloading their DTD? No. They should have thought about that before they made HTML a subset of SGML. I don't feel sorry for them.
Re:Who made the DTD a URL? (Score:4, Insightful)
An address is effectively a unique ID.
And the advantage of an address is that it's a logical place to put the DTD if you don't happen to have your own copy. It's a unique ID and a map to where to get it if you don't already have it.
What were they thinking?
They were thinking people wouldn't needlessly continually redownload the same page over and over and over again.
The root DNS servers operate under the same assumption. Do you think they were crazy too? After all, you can force your DNS queries to go through the root servers every time if you really want to. You're not supposed to, and doing so needlessly puts more load on them, but you could.
Re:Wow (Score:5, Insightful)
It's more like this: your app should *never* query the DTD. If the DTD changes, your app's code probably needs to change and your app should *never* try to parse using a DTD that hasn't been tested by a human being, or at least through your regression tests. Any changes to DTDs should be handled by updating the app itself.
The only exception to this is an app that also happens to be a development tool.
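In that spirit, an app that never wants DTD fetching can turn it off when the parser is created. A sketch using the standard JAXP factory - note the `load-external-dtd` feature is the Xerces name, which the stock JDK parser also honors; treat its availability on other parsers as an assumption:

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Build a SAX parser that never downloads external DTDs: validation is off
// and the (Xerces-named) load-external-dtd feature is disabled.
public class NoDtdParser {
    public static SAXParser create() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd",
            false);
        return factory.newSAXParser();
    }
}
```

With this in place, the doctype line is treated as inert metadata and the W3C's servers never hear from you.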
I'd write the crap code. (Score:2, Insightful)
Simply put, I cannot hope to correctly parse the mess in the same way as IE 7 or even Firefox 3. Why burn myself out trying, only to miss lots of stuff? To be really correct I'd probably need to execute everything from ActionScript to VBScript. Sorry, but NO FUCKING WAY.
The only way I'm going to avoid loading the DTD crap as a URL is with a URL blacklist.
Re:That's what you get for making stupid rules. (Score:4, Insightful)
Re:Speaking of caches... (Score:5, Insightful)
Re:Wow (Score:3, Insightful)
It's fallen into common usage. What else would you suggest? "Web Designer", "Network Architect" and all the other 'bits' of webmastery are already taken. Perhaps "Web Systems Administrator".
Re:Wow (Score:5, Insightful)
then there's little point in having one at all, is there.
You're quite right though: copy the DTD, develop against it, publish without the DTD being present in your released app. Simple. If only the W3C hadn't specified it as required to be present. If only every sample didn't show it in place.
Re:Wow (Score:5, Insightful)
If you ask me, the W3 asked for this. They didn't consider the consequences, and now that they're under siege, they want to blame everyone else.