W3C Gets Excessive DTD Traffic

Posted by ScuttleMonkey on Fri Feb 08, 2008 09:22 PM
from the stop-the-intertubes-i-wanna-get-off dept.
eldavojohn writes "It's a common string you see at the start of an HTML document: a URI declaring the type of document. But software often fetches that URI, causing undue traffic to W3C's site. There's a somewhat humorous post today from W3.org that reads as a cry for sanity, asking developers to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by OdieWan (757584) on Friday February 08 2008, @09:29PM (#22356648)
    I have a solution to the problem; I wrote it down at http://www.w3.org/TR/html4/strict.dtd [w3.org] !
  • Do what.... (Score:5, Funny)

    by Creepy Crawler (680178) on Friday February 08 2008, @09:29PM (#22356652)
    Do what any other respectable web provider would do..

    Put links to Goatse in the definitions!
  • Delay (Score:5, Interesting)

    by erikina (1112587) <eri.kina@gmail.com> on Friday February 08 2008, @09:31PM (#22356674) Homepage
    Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.
    • Re:Delay (Score:5, Funny)

      by dotancohen (1015143) on Friday February 08 2008, @10:36PM (#22357104) Homepage
      You must be a Microsoft engineer.
      • Re:Delay (Score:5, Insightful)

        by bwb (6483) on Friday February 08 2008, @10:47PM (#22357154)
        Sure, they're ignoring the response status, but I'll betcha most of them are doing synchronous requests. If I were solving this problem for W3C, I'd be delaying the abusers by 5 or 6 *minutes*. Maybe respond to the first request from a given IP/user agent with no or little delay, but each subsequent request within a certain timeframe incurs triple the previous delay, or the throughput gets progressively throttled-down until you're drooling it out at 150bps. That would render the really abusive applications immediately unusable, and with any luck, the hordes of angry customers would get the vendors to fix their broken software.
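The escalating delay described above can be sketched roughly as follows. This is only an illustration of the idea, not anything W3C is known to run: the class name `ProgressiveThrottle`, the 60-second window, the 1-second starting delay and the 360-second cap are all assumptions picked for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressiveThrottle:
    """Per-client delay that triples on each repeat request inside a window."""

    window: float = 60.0       # seconds in which a repeat request counts (assumed)
    first_delay: float = 1.0   # delay for the first repeat request (assumed)
    factor: float = 3.0        # "triple the previous delay"
    cap: float = 360.0         # upper bound, about 6 minutes (assumed)
    _state: dict = field(default_factory=dict)  # client -> (last_seen, last_delay)

    def delay_for(self, client: str, now: float) -> float:
        """Return how many seconds to stall this client's response."""
        last = self._state.get(client)
        if last is None or now - last[0] > self.window:
            delay = 0.0                      # first request, or client went quiet
        elif last[1] == 0.0:
            delay = self.first_delay         # first repeat inside the window
        else:
            delay = min(last[1] * self.factor, self.cap)
        self._state[client] = (now, delay)
        return delay
```

A server would key `client` on something like the IP address plus user agent and sleep for the returned number of seconds before answering. Passing `now` in explicitly keeps the logic independent of the wall clock.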
  • by mcrbids (148650) on Friday February 08 2008, @09:36PM (#22356702) Journal
    The answer to this problem is quite easy.

    Continue to host the data referenced on a single T-1 line. That will cut your expenses to the bone since you'll never exceed 1.54 Mbps and that should be quite cheap. And, any dumfuxorz who fubarred their parser to not cache these basically static values will probably figure it out... very quickly.

    You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.

    Problem solved!
  • So, w3c complains about their bandwidth, and the response is: The Slashdot Effect. Doesn't that make the old bandwidth problem seem less of a problem?

    I'm just loving the irony in that.

  • by mwasham (1208930) on Friday February 08 2008, @09:49PM (#22356790) Homepage
    And it is only 4 articles down.. Host with Yahoo! Yahoo Offers All-You-Can-Eat Storage and Bandwidth http://hardware.slashdot.org/article.pl?sid=08/02/08/1811236 [slashdot.org]
  • by dotancohen (1015143) on Friday February 08 2008, @09:52PM (#22356810) Homepage
    Great, they cry "we get too much traffic", so we go ahead and slap them on the front page of slashdot. Sick, sick fucking joke.
    • by ger (3028) on Friday February 08 2008, @10:47PM (#22357150) Homepage
      To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us [del.icio.us], etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
    • Re:Wow (Score:5, Insightful)

      by Breakfast Pants (323698) on Friday February 08 2008, @09:34PM (#22356690) Journal
      Not only that, this document gets cached all over the place by ISPs, etc., and they *still* get that many hits.
    • Re:Wow (Score:5, Insightful)

      by Bogtha (906264) on Friday February 08 2008, @09:45PM (#22356772)

      Why on earth are you blaming webmasters? They are just about the only people who cannot be responsible for this. People who write HTML parsers, HTTP libraries, screen-scrapers, etc, they are the ones causing the problem. Badly-coded client software is to blame, not anything you put on a website.

      • Gumdrops (Score:5, Insightful)

        by milsoRgen (1016505) on Friday February 08 2008, @09:58PM (#22356856) Homepage

        They are just about the only people who cannot be responsible for this.
        Exactly. For as long as I've been involved with HTML's various forms over the years, it has always been considered proper technique (per W3C documentation) to include the doctype (or more recently xmlns). Certainly sounds like a parser issue to me.

        The only thing I'm unclear on is whether your average browser is contributing to this problem when parsing properly written documents.
        • Re:Wow (Score:5, Insightful)

          by MenTaLguY (5483) on Friday February 08 2008, @10:19PM (#22356986) Homepage
          That's the whole purpose of the public identifier (e.g. "-//W3C//DTD HTML 4.01//EN") in the doctype, and the SGML and XML Catalog specifications!

          The expectation is that software would ship with its own copies of "well-known" DTDs with associated catalog entries; the URL is only there as a fallback. The problem is ignorant and/or lazy software developers not implementing catalogs and simply downloading from the URI each time.
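A minimal sketch of that lookup order, assuming a tiny in-memory catalog: the public identifier and URL are the ones quoted in this thread, while the `CATALOG` dict, the local path `dtds/html4-strict.dtd` and the function `resolve_dtd` are hypothetical names invented for the example. Real SGML/XML catalog files have their own syntax, which this does not attempt to parse.

```python
from typing import Dict, Tuple

# Hypothetical catalog: public identifier -> DTD copy shipped with the software.
CATALOG: Dict[str, str] = {
    "-//W3C//DTD HTML 4.01//EN": "dtds/html4-strict.dtd",
}


def resolve_dtd(public_id: str, system_url: str,
                catalog: Dict[str, str] = CATALOG) -> Tuple[str, str]:
    """Prefer a local copy named by the public identifier.

    Returns ("local", path) when the catalog knows the identifier,
    otherwise ("remote", system_url) as the fallback.
    """
    local_path = catalog.get(public_id)
    if local_path is not None:
        return ("local", local_path)
    return ("remote", system_url)
```

With this ordering, a well-known doctype never touches the network, and only an unknown identifier falls through to the URL.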
        • Re:Wow (Score:5, Informative)

          by Bogtha (906264) on Friday February 08 2008, @10:28PM (#22357044)

          They literally wrote the standard.

          "Webmasters" refers to people who run websites, not the W3C. And this particular feature is an artefact of SGML, which was around for over a decade before the W3C ever existed.

          If they didn't want the traffic they should have specified the matter in their RFCs.

          You mean like how RFC 2616 describes the caching mechanism that is being ignored by the problem clients? Or are you referring to the established-for-decades SGML system catalogue that they mention in the HTML 4 specification [w3.org] multiple times [w3.org]?

          You can tell them apart by their attention to the consequences of their actions.

          If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

          • Re:Wow (Score:5, Insightful)

            by Blakey Rat (99501) on Friday February 08 2008, @10:59PM (#22357206)
            If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

            Wow, it just struck me... welcome to Microsoft's world.

            Their security was so bad for so many years because they worked on the assumption that:
            1) Programmers know what they're doing
            2) Programmers aren't assholes

            Of course, the success of malware vendors (and Real Networks) has proved those two assumptions wrong many years ago, and probably 90% of the development work on Vista was adding in safeties to protect against idiot programmers, and asshole programmers.

            And now the W3C is getting their lesson on a golden platter.

            In short, here's the lesson learned:
            1) Some proportion of programmers don't know what they're doing and never will
            2) Some proportion of programmers are assholes
        • Re:Wow (Score:5, Insightful)

          by Anonymous Coward on Friday February 08 2008, @10:54PM (#22357176)
          They literally wrote the standard

          Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website every single page, your shitty http engine shouldn't be allowed out on the internet.
    • by snl2587 (1177409) on Friday February 08 2008, @09:36PM (#22356700)
      Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.
    • Umm, no. (Score:5, Informative)

      by pavon (30274) on Friday February 08 2008, @09:43PM (#22356758)
      That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never read it again. So no, slashdot is not causing a problem. The problem is all the other HTML processing software besides browsers that does not cache its DTD files, not the files that contain the declaration.

      If you want to complain, it should be the fact that slashdot is serving a strict.dtd when it doesn't validate against it.
    • They already do. (Score:5, Informative)

      by pavon (30274) on Friday February 08 2008, @09:51PM (#22356804)
      The spec already recommends this and all the major browsers do it. The software causing the problem is the generic XML/SGML processing packages, which were designed to deal with documents using any arbitrary DTD, not just the main HTML/XHTML ones from W3C. They are the ones downloading each DTD every single time and not caching it, contrary to the standard. Sometimes caching is a configuration option which defaults to off and administrators never turn it on.
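For a generic processor that cannot ship every DTD in advance, even a naive cache keyed by URL removes the repeat downloads. The sketch below is an assumption-laden illustration: `CachingDtdLoader` is a made-up name, the cache lives only in memory, and it ignores HTTP expiry headers entirely, which a real client should honor.

```python
from typing import Callable, Dict


class CachingDtdLoader:
    """Fetch each DTD URL at most once per process and reuse the bytes."""

    def __init__(self, fetch: Callable[[str], bytes]):
        self._fetch = fetch                  # function that does the network I/O
        self._cache: Dict[str, bytes] = {}   # url -> DTD contents
        self.network_requests = 0            # counts real fetches, for inspection

    def load(self, url: str) -> bytes:
        if url not in self._cache:
            self.network_requests += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

In real use `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`. Injecting it keeps the class testable without network access and makes it easy to swap in a disk-backed cache later.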
    • by Bogtha (906264) on Friday February 08 2008, @09:55PM (#22356832)

      They insist that every document begin with a declaration that includes a link to their site.

      It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.

      The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.

      No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.

    • by ozamosi (615254) on Friday February 08 2008, @10:10PM (#22356920) Homepage
      It does contain a URL. It also contains a URN (for instance "-//W3C//DTD HTML 4.01//EN"). The point of a URN is that it doesn't have a universal location - you're supposed to find it wherever you can, probably in a local cache somewhere.

      The URL can be seen as a backup ("in case you don't know the DTD for W3C HTML 4.01, you can create a local copy from this URL" - in the future, when people have forgotten HTML 4.01, that can be useful), or in the same way XML namespaces are used - you don't have to send an HTTP request to http://www.w3.org/1999/xhtml [w3.org] to know that a document that uses that namespace is an XHTML document - it's just another form of unique resource identifier (URI), just like a URN or a GUID.

      What the W3C is having a problem with is applications that decide to fetch the DTD on every single request. That's just crazy. Why do you even need to validate it, unless you're a validator? Just try to parse it - it probably won't validate anyway, and you'll have to either parse it in some kind of quirks mode or just break. If you can parse it correctly, does it matter if it validates? If you can't parse it, does it matter if it validates? And if you actually do want to validate it, why make the user wait a few seconds while you fetch the DTD on every page request? The only reasonable way this could happen that I can think of is link crawlers that find the URL - but don't link crawlers usually avoid revisiting pages they just visited?
    • by Mantaar (1139339) on Friday February 08 2008, @10:33PM (#22357092) Homepage
      The problem does not lie in the mechanism itself - it's in the documentation - or the lack of understandable (or at least often-used) docs directly at the source.

      Simple caching on client side could already improve the situation a whole lot... BUT:

      When people implement something for html-ish or svg-ish or xml-ish purposes, they go google for it: "Howto XML blah foo" - result, they're getting basic screw-it-with-a-hammer tutorials that don't point out important design decisions, but instead Just Work - which is what the author wanted to achieve when they started writing the software.

      It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iproute2. But since most tutorials and howtos on the net are just dumbed-down copypasta for quick and dirty hacks - and since nobody fucking enforces the standards - nobody does it the Right Way.

      So if I start writing some SAX parser, some HTML-rendering lib, some silly scraper, whatnot... and the first example implementations only deal with basic stuff and show me how to get basic functionality working... and I'm not really interested in that part of the program anyway, because I need it for putting something fancier on top... then once I'm through with the initial testing of this particular subsystem, I won't really care about anything else. It works, it doesn't seem to hit performance too badly, it's according to some random guy's completely irrelevant blog - hey, this guy knows what he's doing. I don't care!

      This story hitting /.'s front page might actually help improve the situation. But.. it's like this with stupid programmers - they never die out, they'll always create problems. Let's get used to it.