The Internet

W3C Gets Excessive DTD Traffic 334

Posted by ScuttleMonkey
from the stop-the-intertubes-i-wanna-get-off dept.
eldavojohn writes "It's a common string at the start of an HTML document: a URI declaring the document type. But software often dereferences that URI, causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that reads as a cry for sanity, asking developers to stop building systems that automatically fetch this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"

  • Wow (Score:2, Funny)

    by geekoid (135745)
    "Webmasters" strike again. Clowns.

    • Re:Wow (Score:5, Insightful)

      by Breakfast Pants (323698) on Friday February 08, 2008 @09:34PM (#22356690) Journal
      Not only that, this document gets cached all over the place by ISPs, etc., and they *still* get that many hits.
    • Re:Wow (Score:4, Interesting)

      by x_MeRLiN_x (935994) on Friday February 08, 2008 @09:43PM (#22356754) Homepage
      The summary strongly implies and the article states that this unwanted traffic is coming from software that parses markup. Placing the DTD into a web page or other medium where markup is used is the intended and desirable usage.

      I don't claim to know why you have a problem with webmasters (I am not one), but if you're a programmer and perceive them to have less technical ability than yourself, well.. your ilk seem to be the "clowns" this time.
      • Re:Wow (Score:5, Funny)

        by Curtman (556920) on Saturday February 09, 2008 @07:47AM (#22359058)

        I don't claim to know why you have a problem with webmasters (I am not one)

        Probably for the same reason that many other people hate them. They announce themselves to people as being a "webmaster". It's a really stupid title. They don't perform wizardry. If I can't at least be a "codemaster", and maybe our plumber gets to be called a "pipemaster", then we'll continue to mock anyone who uses the word. Oooh, "plungemaster". I think he'd go for that.
    • Re:Wow (Score:5, Insightful)

      by Bogtha (906264) on Friday February 08, 2008 @09:45PM (#22356772)

      Why on earth are you blaming webmasters? They are just about the only people who cannot be responsible for this. People who write HTML parsers, HTTP libraries, screen-scrapers, etc, they are the ones causing the problem. Badly-coded client software is to blame, not anything you put on a website.

      • Gumdrops (Score:5, Insightful)

        by milsoRgen (1016505) on Friday February 08, 2008 @09:58PM (#22356856) Homepage

        They are just about the only people who cannot be responsible for this.
        Exactly, for as long as I've been involved with HTML's various forms over the years it was always considered proper technique (from W3C documentation) to include the doctype (or more recently xmlns). Certainly sounds like a parser issue to me.

        The only thing I'm unclear on is whether your average browser is contributing to this problem when parsing properly written documents.
      • I agree with your main point, but blaming authors of screen scrapers is ridiculous. Screen scraping is reading the final output of program (or in this case web page) in image format and converting that into usable data with methods such as OCR (optical character recognition).
        • by Bogtha (906264)

          Screen scraping is reading the final output of program (or in this case web page) in image format and converting that into usable data with methods such as OCR (optical character recognition).

          Actually, the term is widely used as a synonym for spidering a site. It's rare I see it used in the way you describe. Sorry for the confusion.

  • by OdieWan (757584) on Friday February 08, 2008 @09:29PM (#22356648)
    I have a solution to the problem; I wrote it down at http://www.w3.org/TR/html4/strict.dtd [w3.org] !
  • Do what.... (Score:5, Funny)

    by Creepy Crawler (680178) on Friday February 08, 2008 @09:29PM (#22356652)
    Do what any other respectable web provider would do..

    Put links to Goatse in the definitions!
  • by Anonymous Coward
    Oh, that was you? I thought that making every webauthor refer to a W3C URL in every web page was going to get someone in trouble someday. Today seems to be someday.
    • Re: (Score:3, Insightful)

      Or you could do what I do, and simply download the DTD, install it on your system,
      and use that instead.
    • by ozamosi (615254) on Friday February 08, 2008 @10:10PM (#22356920) Homepage
      It does contain a URL. It also contains a URN (for instance "-//W3C//DTD HTML 4.01//EN"). The point of a URN is that it doesn't have a universal location - you're supposed to find it wherever you can, probably in a local cache somewhere.

      The URL can be seen as a backup ("in case you don't know the DTD for W3C HTML 4.01, you can create a local copy from this URL" - in the future, when people have forgotten HTML 4.01, that can be useful), or the same way XML namespaces are used - you don't have to send an HTTP request to http://www.w3.org/1999/xhtml [w3.org] to know that a document that uses that namespace is an xhtml document - it's just another form of a unique resource identifier (URI), just like a URN or a guid.

      What the W3C is having a problem with is applications that decide to fetch the DTD on every single request. That's just crazy. Why do you even need to validate it, unless you're a validator? Just try to parse it - it probably won't validate anyway, and you'll have to either do it in some kind of quirks mode or just break. If you can parse it correctly, does it matter if it validates? If you can't parse it, does it matter if it validates? And if you actually do want to validate it, why make the user wait a few seconds while you fetch the DTD on every page request? The only reasonable way this could happen that I can think of is link crawlers who find the URL - but don't link crawlers usually avoid revisiting pages they just visited?
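The namespace point above can be shown in a few lines. This is only a minimal sketch, assuming the stock JAXP DOM parser that ships with the JDK; the class name `NamespaceCheck` and the helper `isXhtml` are made up for illustration. The namespace URI is compared as a plain string and nothing is requested from w3.org.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: treats the XHTML namespace URI purely as an identifier.
final class NamespaceCheck {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Returns true if the root element is in the XHTML namespace. */
    static boolean isXhtml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so getNamespaceURI() is populated
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            // String comparison only: the URI is never dereferenced.
            return XHTML_NS.equals(doc.getDocumentElement().getNamespaceURI());
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

The test inputs here deliberately carry no DOCTYPE, so the parser has no external DTD to go looking for.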
      • by MenTaLguY (5483)
        Minor quibbles: "-//W3C//DTD HTML 4.01//EN" is not a URN but a PI (public identifier), and there is a reason to have validating parsers: the DTD can contain essential information for correctly interpreting the document (e.g. entity declarations, as is obviously the case in HTML).

        Other than that you're spot on.
      • by Bogtha (906264)

        Why do you even need to validate it, unless you're a validator? Just try to parse it

        The external DTD subset isn't just for error checking. It defines the character entities and the content model for element types. If you don't have access to the DTD (or hard-coded HTML-specific behaviour) you can't parse it fully.
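To make the entity point concrete, here is a small sketch, assuming the JDK's built-in JAXP parser. The class `EntityDemo` and the method `rootText` are invented for the example, and the DTD is inlined as an internal subset purely so the demo needs no files or network. Without a declaration for the entity, the parser has no way to know what `&eacute;` means and rejects the document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows that entity references depend on DTD declarations.
final class EntityDemo {
    /** Returns the root element's text, or null if the parser rejects the input. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler ignores warnings/errors and rethrows fatal errors,
            // which keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement()
                          .getTextContent();
        } catch (SAXParseException e) {
            return null; // e.g. a reference to an entity nobody declared
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `rootText("<p>caf&eacute;</p>")` yields `null`, while the same element preceded by `<!DOCTYPE p [<!ENTITY eacute "&#233;">]>` parses and the reference expands to the accented character.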

      • by Prototerm (762512) on Saturday February 09, 2008 @09:28AM (#22359398)
        Nothing, it's a non-profit.

        (ducks and runs)
    • by Mantaar (1139339) on Friday February 08, 2008 @10:33PM (#22357092) Homepage
      The problem does not lie in the mechanism itself - it's in the documentation - or the lack of understandable (or at least often-used) docs directly at the source.

      Simple caching on client side could already improve the situation a whole lot... BUT:

      When people implement something for html-ish or svg-ish or xml-ish purposes, they go google for it: "Howto XML blah foo" - result, they're getting basic screw-it-with-a-hammer tutorials that don't point out important design decisions, but instead Just Work - which is what the author wanted to achieve when they started writing the software.

      It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iproute2. But since most tutorials and howtos on the net are just dumbed-down copypasta for quick and dirty hacks - and since nobody fucking enforces the standards - nobody does it the Right Way.

      So if I start writing some sax-parser, some html-rendering lib, some silly scraper, whatnot... and the first example implementations only deal with basic stuff and show me how to do it so basic functionality can be implemented... and I'm not really interested in that part of the program anyways, because I need it for putting something more fancy on top... once after I'm through with the initial testing of this particular subsystem, I won't really care about anything else. It works, it doesn't seem to hit performance too badly, it's according to some random guy's completely irrelevant blog - hey, this guy knows what he's doing. I don't care!

      This story hitting /.'s front page might actually help improve the situation. But.. it's like this with stupid programmers - they never die out, they'll always create problems. Let's get used to it.
  • It's a good thing we don't contribute to the problem - Oh, wait...

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                            "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

    <title>Slashdot: News for nerds, stuff that matters</title>
    • by snl2587 (1177409) on Friday February 08, 2008 @09:36PM (#22356700)
      Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.
      • Re: (Score:2, Insightful)

        by Vectronic (1221470)
        And if he really wanted to be funny, he would have quoted it from the webpage that the Story/Blog was posted on at W3C
      • Re: (Score:3, Informative)

        by corsec67 (627446)
        Actually, do any browsers get the DTD?
        From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.

        Browsers are also pretty good about caching stuff.
        • Re: (Score:3, Informative)

          by milsoRgen (1016505)

          From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.

          FTA:

          The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG)

          I don't claim to fully grasp what software is causing the problem but it does seem to affect more than just XML.
        • by MenTaLguY (5483)
          Browsers should ship with the DTD. That's the whole point of the public identifier (e.g. "-//W3C//DTD HTML 4.01//EN"), so that a local copy can be obtained using the PI as an index into a local catalog. The URL is only there as a fallback.
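A rough sketch of "PI as an index into a local catalog", assuming the standard `org.xml.sax.EntityResolver` hook. The class `LocalDtdCatalog`, the helper `parseText`, and the in-memory map are assumptions made for the example; real software would more likely map public identifiers to DTD files bundled with the application.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical catalog: public identifier -> locally held DTD text.
final class LocalDtdCatalog implements EntityResolver {
    private final Map<String, String> dtdByPublicId;

    LocalDtdCatalog(Map<String, String> dtdByPublicId) {
        this.dtdByPublicId = dtdByPublicId;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = (publicId == null) ? null : dtdByPublicId.get(publicId);
        if (dtd == null) {
            return null; // not in the catalog: the parser falls back to the system identifier (URL)
        }
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId); // kept only for error messages and relative references
        return source;
    }

    /** Parses xml using the given catalog and returns the root element's text. */
    static String parseText(String xml, Map<String, String> catalog) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(new LocalDtdCatalog(catalog));
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement()
                          .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a catalog entry for the document's public identifier, the URL in the DOCTYPE is never contacted; it stays a fallback for identifiers the catalog does not know.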
          • by corsec67 (627446)
            Except that I was talking about software ASIDE from browsers, like a XML validator, crawler, etc...
            Stuff that deals with generic XML and is being used for xhtml.
    • Umm, no. (Score:5, Informative)

      by pavon (30274) on Friday February 08, 2008 @09:43PM (#22356758)
      That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never read it again. So no, slashdot is not causing a problem. The problem is all the other HTML processing software besides browsers that does not cache its DTD files, not the pages that contain the declaration.
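"Load it at most once" amounts to very little code. The sketch below is an assumption-laden illustration, not how any particular browser does it: `DtdCache` is a made-up class, and the `fetcher` function stands in for whatever actually retrieves the DTD (HTTP, disk, bundled resource).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical in-memory cache: each DTD URL is fetched at most once per process.
final class DtdCache {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> fetcher;
    private final AtomicInteger fetches = new AtomicInteger();

    DtdCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the DTD text for url, calling the fetcher only on the first request. */
    String get(String url) {
        return cache.computeIfAbsent(url, u -> {
            fetches.incrementAndGet(); // counts real fetches, for illustration
            return fetcher.apply(u);
        });
    }

    /** Number of times the underlying fetcher was actually invoked. */
    int fetchCount() {
        return fetches.get();
    }
}
```

A persistent on-disk cache would go further, since these DTDs have not changed in years, but even this much removes the "same one over and over" pattern within a single run.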

      If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.
      • by Skapare (16644)

        It's the whole design of HTML/XML, that needs to have DTD files in the first place to do the processing, that is all wrong. I warned about this well over 12 years ago. At least what little code I've written to process HTML/XML has always entirely ignored the DTD.

        • Re:Umm, no. (Score:5, Interesting)

          by MillionthMonkey (240664) on Saturday February 09, 2008 @12:37AM (#22357736)
          At least what little code I've written to process HTML/XML has always entirely ignored the DTD.

          Don't be so sure- even if your own code ignores it. Unless you're dealing with it on a raw character level, with most XML libraries and frameworks it can be quite tricky to prevent DTDs from being resolved behind your back.

          I wrote some Java code a while back to parse some XML files that were downloaded from NCBI. Typical for NCBI data, this involved wading through terabytes of crap, and anything based on DOM wasn't going to work- so I used the lower level event-based SAX library in JAXP. The files did have DTD declarations in them pointing to NCBI, which I wanted to ignore, since this was a one-time data mining operation. I just examined some sample files, figured out pseudo-XPath expressions for what I wanted to pull out, set up a simple state machine to stumble through the SAX events, and not caring about the DTD, cleared the namespace-aware and validating flags on the SAXParserFactory. So I ended up with this:

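          import java.io.File;
          import java.io.FileInputStream;
          import java.util.zip.GZIPInputStream;
          import javax.xml.parsers.SAXParser;
          import javax.xml.parsers.SAXParserFactory;
          import org.xml.sax.InputSource;
          import org.xml.sax.helpers.DefaultHandler;

          File xmlgz = new File("ncbi_diarrhea.xml.gz");
          DefaultHandler myHandler = new MyNCBIStateMachineHandler();
          GZIPInputStream gzos = new GZIPInputStream(new FileInputStream(xmlgz));
          SAXParserFactory spf = SAXParserFactory.newInstance();
          spf.setValidating(false);   // validation off; the parser may still fetch the external DTD
          spf.setNamespaceAware(false);
          SAXParser sp = spf.newSAXParser();
          InputSource input = new InputSource(gzos);
          sp.parse(input, myHandler);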

          This ran fine, until it mysteriously froze up 18 hours into the run. It turned out to be caused by our switch to a different ISP, during which time the building lost its outside network access. The thread picked up the next file and immediately got blocked in the SAX library, trying to resolve the NCBI DTD.

          This is how I fixed it:

          spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
          spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);


          Now I'm sure someone is going to come on here calling me a noob for not knowing to use an XMLReaderFactory (or whatever XML API class isn't obsolete this week) and setting a custom EntityResolver that can provide my local copy of the NCBI DTD when presented with its URI, but why should I even have to bother with that? XML pretends to be simple but it's seriously messed up.
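For comparison, the EntityResolver route mentioned above can be fairly short. This is a sketch under assumptions, not the poster's code: `OfflineSax` and `text` are invented names, and answering every external request with an empty DTD is a blunt choice that only suits documents which do not rely on entities declared in the real DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: SAX parse that never leaves the machine for a DTD.
final class OfflineSax {
    /** Parses xml with SAX and returns the concatenated character data. */
    static String text(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Answer every external DTD/entity request with an empty document
                // instead of letting the parser open a network connection.
                return new InputSource(new StringReader(""));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setValidating(false);
            // SAXParser.parse installs the DefaultHandler as the entity resolver too.
            spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

If the document does need the DTD's entities, the resolver would have to return a local copy of the DTD rather than an empty stream.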
    • by Bogtha (906264)

      No, Slashdot is not contributing to the problem, that is correct code. Just because a URI is listed, it doesn't mean that software should request it each and every time it sees it. Most code that sees that URI should already have a copy of the DTD in the local catalogue. It's only generic SGML software that cannot be expected to have a copy of the DTD.

    • Oy Vey... (Score:3, Interesting)

      by zanaxagoras (1116047)
      PocketPick is 100% correct.

      Here's an example of what correct markup should look like:

      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
      "http://schemas.slashdot.org/strict.dtd">

      The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. A

      • Re:Oy Vey... (Score:4, Interesting)

        by Zarel (900479) on Saturday February 09, 2008 @12:45AM (#22357772)

        The documented standard uses a URL that links to the W3C's copy of the DTD only as an EXAMPLE. The standard DOES NOT REQUIRE usage of the URL to W3C's copy of the DTD. Responsible developers use a URL that links to their OWN COPY of the DTD. ANYTHING else is just leeching from W3C. PERIOD.
        Well, no, it's not. It's true that the standard does not require usage of the URL to W3C's copy of the DTD, but it's definitely recommended, since every client presumably has a cached copy of the W3C's DTD for something as common as HTML 4.01, and if you were to link to your own, some parsers might be confused and unsure about whether or not you're using Official W3C HTML (tm). (Yes, yes, I know; they should know by '-//W3C//etc' but this article is about stupid parsers, isn't it?)
  • Delay (Score:5, Interesting)

    by erikina (1112587) <eri.kina@gmail.com> on Friday February 08, 2008 @09:31PM (#22356674) Homepage
    Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.
    • Re:Delay (Score:4, Informative)

      by bunratty (545641) on Friday February 08, 2008 @10:09PM (#22356906)
      RTFA. They returned the 503 Service Unavailable error to many abusers, and they just kept on with abusive requests. Many abusers aren't checking the response to the request at all.
      • Re:Delay (Score:5, Insightful)

        by bwb (6483) on Friday February 08, 2008 @10:47PM (#22357154)
        Sure, they're ignoring the response status, but I'll betcha most of them are doing synchronous requests. If I were solving this problem for W3C, I'd be delaying the abusers by 5 or 6 *minutes*. Maybe respond to the first request from a given IP/user agent with no or little delay, but each subsequent request within a certain timeframe incurs triple the previous delay, or the throughput gets progressively throttled-down until you're drooling it out at 150bps. That would render the really abusive applications immediately unusable, and with any luck, the hordes of angry customers would get the vendors to fix their broken software.
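The escalating delay described above could be sketched roughly like this. Everything here is hypothetical: the class `BackoffThrottle`, the client key (IP plus user agent, say), and the base, window and cap values are assumptions for illustration, not anything W3C is known to run.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-client throttle: the delay triples on each repeat request inside the window.
final class BackoffThrottle {
    private static final class State {
        final long lastSeenMs;
        final long lastDelayMs;
        State(long lastSeenMs, long lastDelayMs) {
            this.lastSeenMs = lastSeenMs;
            this.lastDelayMs = lastDelayMs;
        }
    }

    private final long baseDelayMs;
    private final long windowMs;
    private final long maxDelayMs;
    private final Map<String, State> states = new HashMap<>();

    BackoffThrottle(long baseDelayMs, long windowMs, long maxDelayMs) {
        this.baseDelayMs = baseDelayMs;
        this.windowMs = windowMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Returns how long (ms) to stall this client's request arriving at nowMs. */
    synchronized long delayFor(String client, long nowMs) {
        State s = states.get(client);
        if (s == null || nowMs - s.lastSeenMs > windowMs) {
            // First request, or the client stayed quiet long enough: no delay.
            states.put(client, new State(nowMs, 0));
            return 0;
        }
        // Repeat request inside the window: start at the base delay, then triple, up to the cap.
        long delay = (s.lastDelayMs == 0)
                ? baseDelayMs
                : Math.min(maxDelayMs, s.lastDelayMs * 3);
        states.put(client, new State(nowMs, delay));
        return delay;
    }
}
```

With a 1 second base and a 5 minute cap, a client hammering the server goes 0, 1 s, 3 s, 9 s and so on, reaching the cap after a handful of requests, while a well-behaved client that asks once never notices.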
    • Re:Delay (Score:5, Funny)

      by dotancohen (1015143) on Friday February 08, 2008 @10:36PM (#22357104) Homepage
      You must be a Microsoft engineer.
    • Re: (Score:3, Informative)

      by RhysU (132590)
      Good: Delivered a piece of code once that tested just fine for us, but blew up at the customer's site. We never realized that the new J2EE-like features were hitting a live URL during DTD parsing.

      Better: Had a build system once that looked for a host and had to TCP timeout before the build could continue. Had to happen several hundred times a build cycle.

      The Java libraries do this down in their innards unless you're very careful to avoid it.

  • MIT needs a CDN! (Score:3, Interesting)

    by rekoil (168689) on Friday February 08, 2008 @09:35PM (#22356694)
    I'm surprised none of the CDNs out there have volunteered to host this file - the problem is they'd have to host the entire w3.org site, or else move the rest of it to another hostname.
  • by mcrbids (148650) on Friday February 08, 2008 @09:36PM (#22356702) Journal
    The answer to this problem is quite easy.

    Continue to host the data referenced on a single T-1 line. That will cut your expenses to the bone since you'll never exceed 1.54 Mbps and that should be quite cheap. And, any dumfuxorz who fubarred their parser to not cache these basically static values will probably figure it out... very quickly.

    You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.

    Problem solved!
  • by rgrbrny (857597) on Friday February 08, 2008 @09:39PM (#22356718)

    the doctype was being used during an xsl transform during our build process; when the hibernate site flaked out, the builds would fail intermittently.

    solution was to add a xmlcatalog using a local resource.

    bet this happens a lot more than most people realize; we'd been doing this for years before we noticed a problem.

  • A plea to the web community to stop pinging the W3C DTDs isn't going to solve anything. What will work is blocking any unnecessary DTD traffic aggressively, and if that doesn't do the job, blocking it even more aggressively. Intelligently designed software / ISPs / routers will cache, filter and block these requests for the sake of their own efficiency, bandwidth, and proper function. Buggy, bloated and inefficient applications won't. Nothing's ever going to convince the 'web community' to stop pinging the
  • I can't think of a problem that is simpler to solve. Just stop serving these documents. The offending programs will be fixed very quickly.
  • Irony (Score:5, Funny)

    by davburns (49244) on Friday February 08, 2008 @09:44PM (#22356764) Homepage Journal

    So, w3c complains about their bandwidth, and the response is: The Slashdot Effect. Doesn't that make the old bandwidth problem seem less of a problem?

    I'm just loving the irony in that.

  • by mwasham (1208930) on Friday February 08, 2008 @09:49PM (#22356790) Homepage
    And it is only 4 articles down.. Host with Yahoo! Yahoo Offers All-You-Can-Eat Storage and Bandwidth http://hardware.slashdot.org/article.pl?sid=08/02/08/1811236 [slashdot.org]
  • by dotancohen (1015143) on Friday February 08, 2008 @09:52PM (#22356810) Homepage
    Great, they cry "we get too much traffic", so we go ahead and slap them on the front page of slashdot. Sick, sick fucking joke.
  • I always thought it was stupid that XML documents include reference to a DTD hosted on a remote server that you do not maintain. This is wrong on so many levels, I don't even know where to begin:

    1. The validation will not work if the remote server is down, or network is down, or your connection to the internet is down, or if the file is not accessible for any other reason.

    2. You are at the mercy of some third-party to ensure that the file is correct and that it doesn't change.

    3. You are susceptible to man-i
    • by MenTaLguY (5483)
      What people were supposed to do is include a copy of the DTDs with their software. That's what the PI string is there for, as an index into a local catalog of DTD resources. The URL was supposed to be only a fallback measure.
      • Re: (Score:3, Interesting)

        by MtHuurne (602934)

        I wrote my thesis in Docbook and installed the processing toolchain on a laptop. Sometimes the processing would fail and sometimes it worked. After a while I noticed it worked when I was sitting behind my desk and failed when I was sitting on my bed. After some digging, I found out that the catalog configuration was wrong and the XML parser was downloading the DTDs from the web. This was before WiFi, so sitting on the bed meant the laptop did not have internet access.

        The core of the problem is that most

  • by glwtta (532858) on Friday February 08, 2008 @09:55PM (#22356834) Homepage
    Browsers cache the DTDs.

    There, you can now stop posting your hilarious "jokes".
  • Surprise (Score:4, Insightful)

    by MBCook (132727) <foobarsoft@foobarsoft.com> on Friday February 08, 2008 @10:08PM (#22356904) Homepage

    I've got to say, this doesn't surprise me at all. In the time I've spent at my job, I've been repeatedly floored by the amazing conduct of other companies' IT departments. We've only encountered two people I can think of who have been hostile. Everyone else has been quite nice. You'd think people would have things set up well, but they don't.

    We've seen many custom XML parsers and encoders, all slightly wrong. We've seen people transmitting very sensitive data without using any kind of security until we refused to continue working without SSL being added to the equation. We've seen people who were secure change their certificates to self-signed, and we seem to consistently know when people's certificates expire before they do.

    But even without these things, I can't tell you how many people send us bad data and flat out ignore the response. We get all sorts of bad data sent to us all the time. When that happens, we reply with a failure message describing what's wrong. Yet we get bits of stuff all the time that is wrong, in the same way, from the same people. I'm not talking about sending us something that they aren't supposed to (X when we say only Y), I'm saying invalid XML type wrong... such that it can't be parsed.

    We have, a few times while I've been there, had people make a change in their software (or something) and bombard us with invalid data until we either block their IP or manage to get into voice contact with their IT department. Sometimes they don't even seem to notice the lockout.

    Some places can be amazing. Some software can be poorly designed (or something can cause a strange side effect, see here [thedailywtf.com]). I really like one of the suggestions in the comments on the article... start replying really slowly, and often with invalid data. They won't do it. I wouldn't. But I like the idea.

  • which is never ever learned...

    A freely accessible [wikipedia.org] network [wikipedia.org] resource [wikipedia.org] is begging to be driven, smoking and shattered, into the ground by the ill-mannered, ill-trained, or ill-intentioned hordes.

    Personally, I blame the introduction of AOL in 1994 to the Usenet for this downward spiral. We were doing just fine before all you "me too"s started pouring in.

    Get off my lawn, you clueless kids!

  • ... come back to haunt you.

    Perhaps they will stop putting HTTP-URLs in standardized tags now... Also, enjoy life as a web content provider who spends many hours per week blocking Referers (nice typo in the original RFC!) and dealing with broken clients, something that the W3C never spent much time pondering about.

  • If the problem is that it gets served out too many times, then make the server slow as molasses. If it takes 1-2 minutes to get the DTD from the server, or more, then it is quickly discovered by the performance teams.
  • That doctype is simply <!DOCTYPE HTML> [w3.org]!
  • Recording UA? (Score:2, Redundant)

    by dotancohen (1015143)
    What are the user agents making the requests? Do these programs identify themselves with a UA string or something?
  • I bet Slashdot.org could possibly find some bloggers that would be more than happy to receive that traffic!
  • I'm sorry, the typo's mine. I made it when I was working late one night and spilt spaghetti down my shirt. I had no idea that it would propagate so far and ruin the web. Oops. Anyway I've fixed it, but it's not in the stable CVS branch yet so I'm afraid you'll just have to put up with it for a while longer.

    (For those without a sense of humour, yes this is a joke)
  • Perhaps ISPs should install caching DTD servers.

    People would have another reason to complain about their ISP's quirks.

  • I think they screwed up, and brought this on themselves. I already thought that it was annoying having so verbose an identifier... this just makes it more hateful.

    If they'd at least made the identifier NOT a URI, something like domain.example.com::[path/]versionstring, or something else that wasn't a URI, so it was clearly an identifier even if it was ultimately convertible to a URI, they would have avoided this kind of problem.
  • Creating a standard that would allow people to host DTDs all over the web and fetch them automatically was major design stupidity, not just because people need to host that stuff, but because it misses the point of standardization in the first place.
  • by 93 Escort Wagon (326346) on Friday February 08, 2008 @11:46PM (#22357456)
    ... you can't blame Microsoft for this problem! After all, IE ignores pretty much all web standards and best practices, and does its own thing!
