
W3C Gets Excessive DTD Traffic
eldavojohn writes "It's a common string at the start of an HTML document: a URI declaring the document type. But that URI is often fetched automatically, causing undue traffic to the W3C's site. There's a somewhat humorous post today from W3.org that reads as a cry for sanity, asking developers to stop building systems that automatically query this resource. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"
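For reference, the declaration in question looks like this; the second quoted string is the system identifier, a URL pointing at a DTD hosted on W3C's servers:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">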
Who made the DTD a URL? (Score:2, Interesting)
Delay (Score:5, Interesting)
MIT needs a CDN! (Score:3, Interesting)
That's what you get for making stupid rules. (Score:1, Interesting)
The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.
had this problem with Hibernate's website... (Score:3, Interesting)
the doctype was being fetched during an XSL transform in our build process; when the Hibernate site flaked out, the builds would fail intermittently.
the solution was to add an xmlcatalog using a local resource.
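For anyone hitting the same thing, a sketch of the fix in an Ant build; the task attributes, publicId, and local path here are illustrative, not our actual setup:

<xslt in="mapping.xml" out="report.html" style="transform.xsl">
  <xmlcatalog>
    <!-- illustrative entry: resolve the DTD from a local file instead of the network -->
    <dtd publicId="-//Hibernate/Hibernate Mapping DTD 3.0//EN"
         location="lib/hibernate-mapping-3.0.dtd"/>
  </xmlcatalog>
</xslt>

With the catalog in place the parser never touches the network, so a flaky remote site can't break the build.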
bet this happens a lot more than most people realize; we'd been doing this for years before we noticed a problem.
Re:Wow (Score:4, Interesting)
I don't claim to know why you have a problem with webmasters (I am not one), but if you're a programmer and perceive them to have less technical ability than yourself, well... your ilk seem to be the "clowns" this time.
Re:I always thought it was stupid (Score:3, Interesting)
I wrote my thesis in Docbook and installed the processing toolchain on a laptop. Sometimes the processing would fail and sometimes it worked. After a while I noticed it worked when I was sitting behind my desk and failed when I was sitting on my bed. After some digging, I found out that the catalog configuration was wrong and the XML parser was downloading the DTDs from the web. This was before WiFi, so sitting on the bed meant the laptop did not have internet access.
The core of the problem is that most XML parsers will automatically and transparently fetch the DTD from the URL, and they do not cache it. So if you have no DTDs installed locally, or if your XML parser cannot find them (catalog configuration is easy to mess up), parsing will still work just fine, because the parser silently falls back to fetching over the network; and if processing the XML takes a significant amount of time, you probably won't notice the small delay from downloading the DTD.
There are several possible solutions for this:
- Don't fetch external DTDs at all unless the application explicitly asks for validation.
- Cache fetched DTDs (honoring HTTP cache headers) instead of downloading the same file on every parse.
- Ship copies of the common W3C DTDs with the parser and resolve the well-known URIs locally, e.g. via a catalog.
All of these are things that should be addressed in the XML parsers.
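To make the caching idea concrete, here is a minimal sketch of a resolver that downloads each DTD once and serves the cached file afterwards; the class name and cache directory are made up:

import java.io.*;
import java.net.URL;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: fetch each external entity once, then serve
// the cached copy on every later request.
public class CachingResolver implements EntityResolver {
    private final File cacheDir = new File("dtd-cache"); // assumed location

    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException {
        File cached = new File(cacheDir, Integer.toHexString(systemId.hashCode()));
        if (!cached.exists()) {
            cacheDir.mkdirs();
            // first request: download the entity and keep it
            try (InputStream in = new URL(systemId).openStream();
                 OutputStream out = new FileOutputStream(cached)) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
            }
        }
        InputSource src = new InputSource(new FileInputStream(cached));
        src.setSystemId(systemId);
        return src;
    }
}

An application can register it via XMLReader.setEntityResolver(), but ideally the parser would behave like this out of the box.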
Oy Vey... (Score:3, Interesting)
The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE use of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD. Here's an example of what correct markup should look like (the host name is illustrative):
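<!-- example.com stands in for your own server -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.example.com/dtd/xhtml1-strict.dtd">

Host the DTD yourself and the validation traffic never leaves your own servers.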
Re:Umm, no. (Score:5, Interesting)
Don't be so sure, even if your own code ignores it. Unless you're dealing with it at the raw character level, with most XML libraries and frameworks it can be quite tricky to prevent DTDs from being resolved behind your back.
I wrote some Java code a while back to parse some XML files that were downloaded from NCBI. Typical for NCBI data, this involved wading through terabytes of crap, and anything based on DOM wasn't going to work, so I used the lower-level event-based SAX library in JAXP. The files did have DTD declarations in them pointing to NCBI, which I wanted to ignore, since this was a one-time data mining operation. I just examined some sample files, figured out pseudo-XPath expressions for what I wanted to pull out, set up a simple state machine to stumble through the SAX events, and, not caring about the DTD, cleared the namespace-aware and validating flags on the SAXParserFactory. So I ended up with this:
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

File xmlgz = new File("ncbi_diarrhea.xml.gz");
DefaultHandler myHandler = new MyNCBIStateMachineHandler();
GZIPInputStream gzis = new GZIPInputStream(new FileInputStream(xmlgz));
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(false); // validation off, so the DTD shouldn't matter... right?
spf.setNamespaceAware(false);
SAXParser sp = spf.newSAXParser();
InputSource input = new InputSource(gzis);
sp.parse(input, myHandler);
This ran fine, until it mysteriously froze up 18 hours into the run. It turned out to be caused by our switch to a different ISP, during which time the building lost its outside network access. The thread picked up the next file and immediately got blocked in the SAX library, trying to resolve the NCBI DTD.
This is how I fixed it:
spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
Now I'm sure someone is going to come on here calling me a noob for not knowing to use an XMLReaderFactory (or whatever XML API class isn't obsolete this week) and to set a custom EntityResolver that can provide my local copy of the NCBI DTD when presented with its URI (sketched below), but why should I even have to bother with that? XML pretends to be simple, but it's seriously messed up.
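For what it's worth, since SAXParser.parse(InputSource, DefaultHandler) also wires the handler in as the EntityResolver, the override is only a few lines. A sketch; the local DTD path is made up:

import java.io.FileInputStream;
import java.io.IOException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Sketch: DefaultHandler already implements EntityResolver, so overriding
// resolveEntity lets the handler serve a local copy of the NCBI DTD
// instead of fetching it over the network. The path is hypothetical.
DefaultHandler myHandler = new MyNCBIStateMachineHandler() {
    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException, SAXException {
        return new InputSource(new FileInputStream("local/ncbi.dtd"));
    }
};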
Re:Oy Vey... (Score:4, Interesting)
Re:Wow (Score:3, Interesting)
Failing to anticipate how your users will likely behave is a design bug. If a tiny fraction of your users make a particular error, it's probably their fault. If a significant proportion of your users make a particular error, it's your fault.