W3C Gets Excessive DTD Traffic
eldavojohn writes "It's a common string you see at the start of an HTML document, a URI declaring the type of document, but software often fetches it, causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that seems to be a cry for sanity, asking developers to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"
Re:Leave it to Slashdot... (Score:5, Informative)
Umm, no. (Score:5, Informative)
If you want to complain, it should be about the fact that Slashdot is serving a strict.dtd when it doesn't validate against it.
They already do. (Score:5, Informative)
Re:That's what you get for making stupid rules. (Score:5, Informative)
It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.
No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.
I'm going to say this as clearly as possible. (Score:3, Informative)
There, you can now stop posting your hilarious "jokes".
Re:Leave it to Slashdot... (Score:3, Informative)
From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.
Browsers are also pretty good about caching stuff.
Re:Delay (Score:4, Informative)
Re:Leave it to Slashdot... (Score:3, Informative)
FTA:
I don't claim to fully grasp what software is causing the problem, but it does seem to affect more than just XML.
Re:Wow (Score:2, Informative)
Re:Wow (Score:5, Informative)
"Webmasters" refers to people who run websites, not the W3C. And this particular feature is an artefact of SGML, which was around for over a decade before the W3C ever existed.
You mean like how RFC 2616 describes the caching mechanism that is being ignored by the problem clients? Or are you referring to the established-for-decades SGML system catalogue that they mention in the HTML 4 specification [w3.org] multiple times [w3.org]?
If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.
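For anyone wondering what "doing what they were supposed to" might look like on the client side, here is a rough sketch of fetch-once caching. It is not a real RFC 2616 cache: a proper one would honour the Cache-Control and Expires headers the server sends, whereas this uses a fixed lifetime. The DtdCache class, the max_age default, and the injected fetch function are all made up for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class DtdCache:
    """Tiny in-memory cache: download each URL at most once per max_age.

    Assumption: a fixed max_age stands in for the freshness rules of a real
    HTTP cache, which would read Cache-Control / Expires from the response.
    """

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        max_age: float = 24 * 60 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # hypothetical function that does the download
        self._max_age = max_age  # seconds an entry is treated as fresh
        self._clock = clock      # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]      # still fresh: no request leaves the machine
        body = self._fetch(url)  # missing or stale: fetch and remember it
        self._entries[url] = (now, body)
        return body
```

Wrapped around whatever function a tool already uses to download a DTD, repeated lookups within the lifetime cost one request instead of one per document.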
Re:Delay (Score:3, Informative)
Better: Had a build system once that looked for a host and had to wait for a TCP timeout before the build could continue. That happened several hundred times per build cycle.
The Java libraries do this down in their innards unless you're very careful to avoid it.
Re:Submitted this to /.? (Score:5, Informative)
Re:Irony (Score:2, Informative)
http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic#c1821 [w3.org]
Re:The problem is with the docs (Score:5, Informative)
ip addr add 192.168.1.2/24 dev eth0
ip link set eth0 down
etc. etc.
Re:Submitted this to /.? (Score:4, Informative)
Re:I always thought it was stupid (Score:3, Informative)
I think I was using the Java version of Apache Xerces at the time for the Docbook processing. More recently I've used lxml in Python (based on libxml2), which has an option (no_network) to suppress DTD loading from the web, but you have to request that explicitly.
I've never seen a parser that caches DTDs by default. I'm not sure about parsers that do not download by default.
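For what it's worth, here is a minimal sketch of keeping DTD lookups local with Python's standard xml.sax module, in the spirit of an SGML catalogue. The LOCAL_DTDS table, the one-line stand-in DTD, and the LocalOnlyResolver, TagCollector, and parse_offline names are assumptions for illustration. Recent Python versions leave external entity handling off by default, so the sketch switches it on purely so that the resolver gets consulted.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical local catalogue: public identifier -> DTD bytes held locally.
# The one-line DTD is a stand-in, not the real XHTML DTD.
LOCAL_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": b"<!ELEMENT html ANY>",
}


class LocalOnlyResolver(EntityResolver):
    """Answer external entity requests from LOCAL_DTDS, never from the network."""

    def __init__(self):
        self.served = []   # public identifiers answered from the catalogue
        self.skipped = []  # system identifiers we declined to download

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        if publicId in LOCAL_DTDS:
            self.served.append(publicId)
            source.setByteStream(io.BytesIO(LOCAL_DTDS[publicId]))
        else:
            self.skipped.append(systemId)
            # Hand back an empty stream instead of letting the parser fetch it.
            source.setByteStream(io.BytesIO(b""))
        return source


class TagCollector(ContentHandler):
    """Record element names so we can see that the document was parsed."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def parse_offline(document: bytes):
    resolver = LocalOnlyResolver()
    collector = TagCollector()
    parser = xml.sax.make_parser()
    # Enable external entity handling so the resolver is asked for the DTD.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(document))
    return collector.tags, resolver
```

Calling parse_offline on a document with the XHTML 1.0 Strict DOCTYPE should leave that public identifier in resolver.served, with the DTD bytes coming from the local table. A real setup would point the table at full copies of the DTDs on disk.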
Re:I'd write the crap code. (Score:4, Informative)