
W3C Gets Excessive DTD Traffic 334

Posted by ScuttleMonkey
from the stop-the-intertubes-i-wanna-get-off dept.
eldavojohn writes "It's a common string you see at the start of an HTML document: a URI declaring the document type. But software often fetches that URI automatically, causing undue traffic to the W3C's site. There's a somewhat humorous post today from W3.org that reads as a cry for sanity, asking developers to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"
Comments Filter:
  • by Anonymous Coward on Friday February 08, 2008 @09:29PM (#22356656)
    Oh, that was you? I thought that making every webauthor refer to a W3C URL in every web page was going to get someone in trouble someday. Today seems to be someday.
  • Delay (Score:5, Interesting)

    by erikina (1112587) <eri.kina@gmail.com> on Friday February 08, 2008 @09:31PM (#22356674) Homepage
    Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.
  • MIT needs a CDN! (Score:3, Interesting)

    by rekoil (168689) on Friday February 08, 2008 @09:35PM (#22356694)
    I'm surprised none of the CDNs out there have volunteered to host this file - the problem is they'd have to host the entire w3.org site, else move the rest of it to another hostname.
  • by v(*_*)vvvv (233078) on Friday February 08, 2008 @09:35PM (#22356696)
    They insist that every document begin with a declaration that includes a link to their site. Now they are complaining about traffic.

    The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.
  • by rgrbrny (857597) on Friday February 08, 2008 @09:39PM (#22356718)

    the doctype was being used during an xsl transform during our build process; when the Hibernate site flaked out, the builds would fail intermittently.

    solution was to add a xmlcatalog using a local resource.

    bet this happens a lot more than most people realize; we'd been doing this for years before we noticed a problem.
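
    (For anyone curious, the catalog fix looks roughly like this - an OASIS XML catalog that maps the public/system identifiers to a local copy of the DTD. The file paths here are made up; point them wherever you saved the DTD:)

    ```xml
    <?xml version="1.0"?>
    <!-- Map the W3C identifiers to a local copy so the parser never goes to the network -->
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <public publicId="-//W3C//DTD HTML 4.01//EN"
              uri="file:///usr/local/share/dtd/strict.dtd"/>
      <system systemId="http://www.w3.org/TR/html4/strict.dtd"
              uri="file:///usr/local/share/dtd/strict.dtd"/>
    </catalog>
    ```

    Most XSLT/validation toolchains (xsltproc, Apache resolver, etc.) can be pointed at a catalog like this via an environment variable or classpath resource.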

  • Re:Wow (Score:4, Interesting)

    by x_MeRLiN_x (935994) on Friday February 08, 2008 @09:43PM (#22356754) Homepage
    The summary strongly implies and the article states that this unwanted traffic is coming from software that parses markup. Placing the DTD into a web page or other medium where markup is used is the intended and desirable usage.

    I don't claim to know why you have a problem with webmasters (I am not one), but if you're a programmer and perceive them to have less technical ability than yourself, well.. your ilk seem to be the "clowns" this time.
  • by MtHuurne (602934) on Friday February 08, 2008 @10:52PM (#22357172) Homepage

    I wrote my thesis in Docbook and installed the processing toolchain on a laptop. Sometimes the processing would fail and sometimes it worked. After a while I noticed it worked when I was sitting behind my desk and failed when I was sitting on my bed. After some digging, I found out that the catalog configuration was wrong and the XML parser was downloading the DTDs from the web. This was before WiFi, so sitting on the bed meant the laptop did not have internet access.

    The core of the problem is that most XML parsers will automatically and transparently fetch the DTD from the URL and do not cache it. So if you have no DTDs installed locally, or if your XML parser cannot find them (catalog configuration is easy to mess up), the parser silently downloads them and parsing works just fine; if processing the XML takes a significant amount of time, you probably won't notice the small delay from downloading the DTD.

    There are several possible solutions for this:

    • Do not automatically fetch DTDs from the web: make it an explicit option that the user has to set.
    • Be vocal when fetching a DTD from the web, for example issue a warning.
    • Cache fetched DTDs locally.

    All of these are things that should be addressed in the XML parsers.
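
    To sketch the first suggestion in Java (the class name and XML snippet are mine, but the two SAX feature URIs are standard): with external entity resolution switched off, a document whose DOCTYPE points at an unreachable URL still parses without any network access.

    ```java
    import java.io.StringReader;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    public class NoDtdFetch {
        /** Parse XML, ignoring any external DTD the document declares. */
        public static String textContent(String xml) throws Exception {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(false);
            // The external DTD subset referenced by a DOCTYPE is an external
            // parameter entity; disabling these features stops the fetch.
            spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
            spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            SAXParser sp = spf.newSAXParser();
            final StringBuilder text = new StringBuilder();
            sp.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            return text.toString();
        }

        public static void main(String[] args) throws Exception {
            // The SYSTEM URL below is never fetched because the features are off.
            String xml = "<!DOCTYPE root SYSTEM \"http://example.invalid/some.dtd\">"
                       + "<root>ok</root>";
            System.out.println(textContent(xml));
        }
    }
    ```

    Whether this is on by default is exactly the problem: it should be.
    
    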

  • Oy Vey... (Score:3, Interesting)

    by zanaxagoras (1116047) on Friday February 08, 2008 @11:41PM (#22357428)
    PocketPick is 100% correct.

    Here's an example of what correct markup should look like:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                            "http://schemas.slashdot.org/strict.dtd">
    The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.
  • Re:Umm, no. (Score:5, Interesting)

    by MillionthMonkey (240664) on Saturday February 09, 2008 @12:37AM (#22357736)
    At least what little code I've written to process HTML/XML has always entirely ignored the DTD.

    Don't be so sure- even if your own code ignores it. Unless you're dealing with it on a raw character level, with most XML libraries and frameworks it can be quite tricky to prevent DTDs from being resolved behind your back.

    I wrote some Java code a while back to parse some XML files that were downloaded from NCBI. Typical for NCBI data, this involved wading through terabytes of crap, and anything based on DOM wasn't going to work- so I used the lower level event-based SAX library in JAXP. The files did have DTD declarations in them pointing to NCBI, which I wanted to ignore, since this was a one-time data mining operation. I just examined some sample files, figured out pseudo-XPath expressions for what I wanted to pull out, set up a simple state machine to stumble through the SAX events, and not caring about the DTD, cleared the namespace-aware and validating flags on the SAXParserFactory. So I ended up with this:

    File xmlgz = new File("ncbi_diarrhea.xml.gz");
    DefaultHandler myHandler = new MyNCBIStateMachineHandler();
    GZIPInputStream gzos = new GZIPInputStream(new FileInputStream(xmlgz));
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setValidating(false);
    spf.setNamespaceAware(false);
    SAXParser sp = spf.newSAXParser();
    InputSource input = new InputSource(gzos);
    sp.parse(input, myHandler);

    This ran fine, until it mysteriously froze up 18 hours into the run. It turned out to be caused by our switch to a different ISP, during which time the building lost its outside network access. The thread picked up the next file and immediately got blocked in the SAX library, trying to resolve the NCBI DTD.

    This is how I fixed it:

    spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
    spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);


    Now I'm sure someone is going to come on here calling me a noob for not knowing to use an XMLReaderFactory (or whatever XML API class isn't obsolete this week) and setting a custom EntityResolver that can provide my local copy of the NCBI DTD when presented with its URI, but why should I even have to bother with that? XML pretends to be simple but it's seriously messed up.
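
    (For the record, the EntityResolver dance looks something like this - class names are mine, not any real API beyond DefaultHandler. Since DefaultHandler already implements EntityResolver, overriding resolveEntity to hand back an empty stream short-circuits every external DTD lookup:)

    ```java
    import java.io.StringReader;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    public class ResolverDemo {
        /** Handler whose resolveEntity feeds the parser an empty stream
         *  instead of letting it go out to the network for the DTD. */
        static class NoFetchHandler extends DefaultHandler {
            final StringBuilder text = new StringBuilder();
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));  // empty "DTD"
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        }

        public static String parse(String xml) throws Exception {
            SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
            NoFetchHandler h = new NoFetchHandler();
            sp.parse(new InputSource(new StringReader(xml)), h);
            return h.text.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(parse(
                "<!DOCTYPE root SYSTEM \"http://example.invalid/never.dtd\"><root>ok</root>"));
        }
    }
    ```

    Which is still more ceremony than "don't phone home" ought to require.
    
    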
  • Re:Oy Vey... (Score:4, Interesting)

    by Zarel (900479) on Saturday February 09, 2008 @12:45AM (#22357772)

    The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.
    Well, no, it's not. It's true that the standard does not require usage of the URL to W3C's copy of the DTD, but it's definitely recommended, since every client presumably has a cached copy of the W3C's DTD for something as common as HTML 4.01, and if you were to link to your own, some parsers might be confused and unsure about whether or not you're using Official W3C HTML (tm). (Yes, yes, I know; they should know by '-//W3C//etc' but this article is about stupid parsers, isn't it?)
  • Re:Wow (Score:3, Interesting)

    by mollymoo (202721) * on Saturday February 09, 2008 @09:41AM (#22359450) Journal

    If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

    Failing to be aware of how your users will likely behave is a design bug. If a tiny fraction of your users make a particular error it's probably their fault. If a significant proportion of your users make a particular error, it's your fault.
