Microsoft Perl

Microsoft Bots Effectively DDoSing Perl CPAN Testers 332

Posted by timothy
from the stuck-in-a-rut dept.
at_slashdot writes "The Perl CPAN Testers have been having trouble accessing their sites, databases and mirrors. According to a posting on the CPAN Testers' blog, the CPAN Testers' server has been aggressively scanned by '20-30 bots every few seconds' in what they call 'a dedicated denial of service attack'; these bots 'completely ignore the rules specified in robots.txt.'" From the Heise story linked above: "The bots were identified by their IP addresses, including 65.55.207.x, 65.55.107.x and 65.55.106.x, as coming from Microsoft."

  • by Anonymous Coward on Monday January 18, 2010 @09:11AM (#30807010)

    I had a registration page - static content basically. The only thing that was dynamic was that it was referred to by many pages on the site with a variable in the querystring. Bing decided that it needed to check this one page *thousands* of times per day.

    They ignored robots.txt.
    I sent a note to an address on the Bing site that requested feedback from people having issues with the Bing bots - nothing.

    The only thing they finally 'listened' to was placing "" in the header.

    This kind of sucked because it took the registration page out of the search engines' index, however it was much better than being DDOS'd. Plus, the page is easy to find on the site so not *that* big a deal.

    Bing has been open for months now and if you search around there are tons of stories just like this. Maybe now that a site with some visibility has been 'attacked', the engineers will take a look at wtf is wrong.

  • Flooding... (Score:5, Informative)

    by Bert64 (520050) <bert@@@slashdot...firenzee...com> on Monday January 18, 2010 @09:15AM (#30807030) Homepage

    I have noticed the microsoft crawlers (msnbot) being fairly inefficient on many of my sites...
    In contrast to googlebot and spiders from other search engines, msnbot is far more aggressive, ignores robots.txt and will frequently re-request the same files, even if those files haven't changed... Looking at my monthly stats (awstats), which group traffic from bots, msnbot will frequently have consumed 10 times more bandwidth than googlebot, but is responsible for far less incoming traffic based on referrer headers (typically 1-2% of the traffic generated by google on my sites).

    Other small search engines don't bring much traffic either, but their bots don't hammer my site as hard as msnbot does.
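The bandwidth comparison above is the kind of figure awstats derives from the access log. As a rough sketch (not how awstats itself works), the following tallies bytes sent per crawler from lines in Apache's combined log format. The sample lines, the `BOT_MARKERS` tuple and the function name are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the tail of a combined-log-format line:
#   "request" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical substrings used to group user agents into crawlers.
BOT_MARKERS = ("msnbot", "Googlebot")


def bytes_per_bot(lines):
    """Sum the response sizes for each crawler named in BOT_MARKERS."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_TAIL.search(line)
        if match is None:
            continue  # not a combined-format line; skip it
        agent = match.group("agent").lower()
        sent = match.group("bytes")
        size = 0 if sent == "-" else int(sent)
        for marker in BOT_MARKERS:
            if marker.lower() in agent:
                totals[marker] += size
    return dict(totals)


# Made-up sample data, only to show the shape of the input.
SAMPLE = [
    '65.55.207.1 - - [18/Jan/2010:09:15:00 +0000] "GET /a HTTP/1.1" 200 5000 "-" "msnbot/2.0b"',
    '65.55.207.1 - - [18/Jan/2010:09:15:01 +0000] "GET /a HTTP/1.1" 200 5000 "-" "msnbot/2.0b"',
    '66.249.65.1 - - [18/Jan/2010:09:15:02 +0000] "GET /a HTTP/1.1" 200 1000 "-" "Googlebot/2.1"',
]

if __name__ == "__main__":
    print(bytes_per_bot(SAMPLE))
```

Comparing these totals with referrer counts from the same log gives the "bandwidth consumed versus visitors delivered" ratio described in the comment.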

  • by Rogerborg (306625) on Monday January 18, 2010 @09:30AM (#30807122) Homepage

    You're probably new here, but if you'd RTFA, you'd see that:

    It seems their bots completely ignore the rules specified in the robots.txt, despite me setting it up as per their own guidelines on their site

    Come to think of it though, isn't this what happens to most people who try to interoperate with Microsoft?

    Amusingly, if I Google for "bing robots.txt" [google.co.uk] I get a link to a bing page titled "Bing - Robots.txt Disallow vs No Follow - Neither Working!" which has already been elided from history by Microsoft [bing.com]. Classy.

  • by jlp2097 (223651) on Monday January 18, 2010 @09:31AM (#30807138) Homepage Journal

    Not necessary. A Bing Program Manager has already commented on the CPAN Testers blog entry [perl.org] upon which the article is based:

    Hi,
    I am a Program Manager on the Bing team at Microsoft, thanks for bringing this issue to our attention. I have sent an email to barbie@cpan.org as we need additional information to be able to track down the problem. If you have not received the email please contact us through the Bing webmaster center at bwmc@microsoft.com.

    As said below, never ascribe to malice that which can be adequately explained by stupidity. (Insert lame joke about MSFT being full of stupidity here).

  • by Anonymous Coward on Monday January 18, 2010 @09:32AM (#30807140)

    Excuse my ignorance, but isn't robots.txt compliance easily enforceable on the server? I remember something about hiding links to trap pages in order to identify robots and then holding identified robots responsible for robots.txt infractions by blocking their IP address.
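The trap-page idea can be sketched in a few lines. This assumes a path (`/trap/` here, a hypothetical name) that robots.txt disallows and that is linked only invisibly, so anything requesting it is a robot that ignored the rules. The log lines and function name are also made up for illustration.

```python
import re

# Hypothetical trap location: disallowed in robots.txt, linked only invisibly.
TRAP_PATH = "/trap/"

# Pulls the client IP and the requested path out of a common-log-format line.
REQUEST = re.compile(r'^(?P<ip>\S+) .*?"(?:GET|HEAD|POST) (?P<path>\S+)')


def offenders(log_lines, trap_path=TRAP_PATH):
    """Return the set of client IPs that requested anything under trap_path."""
    caught = set()
    for line in log_lines:
        match = REQUEST.match(line)
        if match and match.group("path").startswith(trap_path):
            caught.add(match.group("ip"))
    return caught


# Made-up sample data.
SAMPLE = [
    '65.55.207.1 - - [18/Jan/2010:09:32:00 +0000] "GET /trap/page1 HTTP/1.1" 200 12',
    '192.0.2.7 - - [18/Jan/2010:09:32:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '65.55.106.9 - - [18/Jan/2010:09:32:02 +0000] "GET /trap/ HTTP/1.1" 200 12',
]

if __name__ == "__main__":
    print(sorted(offenders(SAMPLE)))
```

The resulting set could then feed whatever blocking mechanism the server already uses; that step is left out because it depends on the setup.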

  • Re:Typical M$ (Score:1, Informative)

    by Anonymous Coward on Monday January 18, 2010 @09:33AM (#30807150)

    That's not a troll. That's common knowledge.

    A more appropriate mod would be +5 Redundant.

  • by John Hasler (414242) on Monday January 18, 2010 @09:34AM (#30807152) Homepage

    > ...why not just block them?

    They have.

  • Re:Are you sure? (Score:2, Informative)

    by Anonymous Coward on Monday January 18, 2010 @09:40AM (#30807198)

    You only see an IP in an apache log after a successful TCP handshake. This is hard (not impossible, but really, really hard) to do with a forged IP.
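To illustrate the handshake point: on a typical TCP stack, `accept()` hands the server only connections whose three-way handshake has completed, and the peer address it reports is the one the SYN-ACK was sent to. The loopback sketch below shows that the address the server sees is the client's real socket address. It is a local demonstration only; the function name is hypothetical.

```python
import socket


def peer_seen_by_server():
    """Open a loopback TCP connection and return (peer as seen by the
    server, local address of the client)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    # create_connection returns once the handshake has completed.
    client = socket.create_connection(("127.0.0.1", port))
    conn, peer = server.accept()
    local = client.getsockname()

    conn.close()
    client.close()
    server.close()
    return peer, local


if __name__ == "__main__":
    print(peer_seen_by_server())
```

A sender that forges its source address never receives the SYN-ACK, so it cannot complete this exchange without being on the network path.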

  • Re:Are you sure? (Score:5, Informative)

    by TheRaven64 (641858) on Monday January 18, 2010 @09:45AM (#30807246) Journal

    Are we sure this traffic comes from Microsoft? Could it not consist of forged network packets?

    It's a TCP connection, so they need to have completed the three-way handshake for it to work. That means that they must either have received the SYN-ACK packet or be SYN flooding. If they are SYN flooding, then that would show up in the firewall logs. If they've received the SYN-ACK packet, then they are either at that IP, or they are on a router between you and that IP and can intercept and block the packets from that IP.

    You don't need a reply if you are running a DDOS.

    You do if it's via TCP. If they're just ping flooding, then that's one thing, but they're issuing HTTP requests. This involves establishing a TCP connection (send SYN, receive SYN-ACK with random number, reply ACK with that number) and involves sending TCP window replies for each group of TCP packets that you receive.

    On the other hand, why would anyone, including Microsoft, want to bring down CPAN?

    Who says that they want to? It's more likely that their web crawler has been written to the same standard as the rest of their code.

  • by Ardaen (1099611) on Monday January 18, 2010 @09:56AM (#30807350)

    Probably not, if you look at other incidents: http://cmeerw.org/blog/594.html [cmeerw.org] it appears they just like to push the limits.

  • Re:Robots.txt (Score:3, Informative)

    by Ogi_UnixNut (916982) on Monday January 18, 2010 @10:00AM (#30807382) Homepage

    It's the first. Whatever you specify in the robots.txt as no-follow etc... means not to spider the pages, so no scanning of them at all.

    You use it for when you only want part of your site to appear in search results, such as just the front page (for example). The rest of the site should not be touched by the bot at all.
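For reference, this is roughly what a well-behaved crawler does before fetching a URL. The sketch uses Python's standard `urllib.robotparser`; the robots.txt content, the `example.org` URLs and the helper name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed("msnbot", "http://example.org/"))
    print(allowed("msnbot", "http://example.org/private/report.html"))
```

Nothing on the server enforces this check; it is a request that the crawler chooses to honor, which is why a bot that skips it has to be blocked by other means.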

  • Re:Robots.txt (Score:3, Informative)

    by afidel (530433) on Monday January 18, 2010 @10:23AM (#30807540)

    It's basically a rough pattern filter that the bot is supposed to follow on parts of the site not to crawl. One reason it's used is that you can have dynamically generated pages that create an infinite loop that's impossible for the bot to detect.

  • DDoS? Really? (Score:3, Informative)

    by Siberwulf (921893) on Monday January 18, 2010 @10:33AM (#30807666)

    I'm pretty sure the first "D" in DDoS stands for "Distributed."

    If it was really a DDoS, you wouldn't be able to filter the IP out with a simple regex (like the /^65\.55\.(106|107|207)/ from TFA).

    To boot, TFA didn't even say DDoS. Maybe that's too much to expect the editors to oh... I don't know...say... RTFA or Fact-Check it?

    I should drop my bar a bit, I suppose.
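A quick check of the regex quoted above, as a sketch. The trailing `\.` is an added assumption that anchors the match at the end of the third octet; it is not part of the pattern as quoted. The function name is hypothetical.

```python
import re

# Pattern from the comment above, plus a trailing "\." (an added assumption)
# so the match ends exactly at the third octet.
BLOCKED = re.compile(r"^65\.55\.(106|107|207)\.")


def is_blocked(ip):
    """Return True if the dotted-quad string falls in one of the three ranges."""
    return BLOCKED.match(ip) is not None


if __name__ == "__main__":
    for ip in ("65.55.207.10", "65.55.108.10", "165.55.106.1"):
        print(ip, is_blocked(ip))
```

That a single prefix pattern covers all the offending clients is the point being made: the traffic comes from three adjacent-looking ranges, not from widely distributed sources.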
  • by Goaway (82658) on Monday January 18, 2010 @10:36AM (#30807696) Homepage

    I'm sure you heard that, but it's not actually true in any way.

  • No problem (Score:5, Informative)

    by rgviza (1303161) on Monday January 18, 2010 @10:51AM (#30807868)

    ipchains -A input -j REJECT -p all -s 65.55.207.0/24 -i eth0 -l
    ipchains -A input -j REJECT -p all -s 65.55.107.0/24 -i eth0 -l
    ipchains -A input -j REJECT -p all -s 65.55.106.0/24 -i eth0 -l

    problem solved
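The same three /24 ranges can be checked in application code with Python's standard `ipaddress` module. This is a sketch of the membership test only, not a replacement for a packet filter; the function name is hypothetical.

```python
import ipaddress

# The three ranges from the rules above.
BLOCKED_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("65.55.207.0/24", "65.55.107.0/24", "65.55.106.0/24")
]


def in_blocked_range(ip):
    """Return True if `ip` belongs to any of the blocked /24 networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)


if __name__ == "__main__":
    for ip in ("65.55.106.255", "65.55.105.1"):
        print(ip, in_blocked_range(ip))
```

Unlike a string prefix match, this compares addresses numerically, so it also works for ranges that do not fall on an octet boundary.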

  • by b1t r0t (216468) on Monday January 18, 2010 @11:28AM (#30808278)

    What exactly do you mean by "elided from history"? I brought them both up, turned off the CSS (Google's version is broken), and tab-flipped between them. Not only is the page still there, it has all the same posts as the Google cache version, with small differences such as tags switching around, number of posts by users, and another stupid Blackpool adlink. Maybe you found some messages missing and then Google later re-cached it, but the thread itself is certainly not missing.

  • Re:Robots.txt (Score:3, Informative)

    by John Hasler (414242) on Monday January 18, 2010 @11:40AM (#30808422) Homepage

    Is it an 'agreement' to not scan the site at all...

    It is a request not to scan part or all of a site. robots.txt [wikipedia.org]

    And if so, I can't see anything wrong with what Microsoft's bots did.

    Every site does not have dozens of powerful servers and terabytes of bandwidth, nor is every site an ad-supported one that wants to maximize traffic. Common courtesy requires that a bot operator minimize his impact on any given site and honor requests not to index. Of course "courtesy" and "honor" are concepts that baffle Microsoft managers.

  • Re:No problem (Score:5, Informative)

    by j_sp_r (656354) on Monday January 18, 2010 @11:47AM (#30808494) Homepage

    Linux IP Firewalling Chains, normally called ipchains, is free software to control the packet filter/firewall capabilities in the 2.2 series of Linux kernels. It superseded ipfwadm, but was replaced by iptables in the 2.4 series.

    You're a few kernels behind.

  • by Nadaka (224565) on Monday January 18, 2010 @12:07PM (#30808682)

    Nothing you listed under the "War on Drugs" has anything to do with the war on drugs.

    The war on drugs has made America a police state where the government can seize any of your property and auction it for profit before your trial. Even if you are found innocent, or the charges are thrown out for insufficient grounds, you will not be compensated for your lost money or profit. It has made an America where more people are imprisoned than in any other nation on earth. It has made a nation where the cheapest and most effective drug for treating glaucoma and mitigating the pain and nausea associated with cancer treatments is a crime. It's made a nation where at least half its citizens are criminals.

  • Simple solution: (Score:1, Informative)

    by Anonymous Coward on Monday January 18, 2010 @03:21PM (#30811186)

    Add to your .htaccess file:

    deny from 65.55.207.
    deny from 65.55.106.
    deny from 65.55.107.

  • Re:Check the blog... (Score:3, Informative)

    by schon (31600) on Monday January 18, 2010 @05:58PM (#30813194)

    They admitted they were powerless to solve their own problems without help from their victims.

    Heh. It's another "damned if you do; damned if you don't" scenario.

    Um, no. Not unless you're a rabid MS apologist.

    Usually, people criticise Microsoft for developing software without bothering to consult or test with actual customers.

    True.

    Now we have a manager of a MS dev group that actually does communicate (though not exactly with "customers"), and acts on what they say, so he's criticised for needing help from his "victims".

    Umm, exactly how did he act on what they said? According to the quote, they explicitly didn't act, which is the problem people are complaining about.