Websites Complaining About Screen-Scraping

wilko11 writes "There have been two cases recently where websites have requested the removal of modules from CPAN. These modules could be used to access the websites (EuroTV and Streetmap) from a Perl program. The question being asked on the mailing lists (threads about EuroTV and about Streetmap) is 'can companies dictate what software you can use to access web content from their server?'"
  • In short, no. (Score:5, Insightful)

    by numbski ( 515011 ) <[numbski] [at] [hksilver.net]> on Friday February 07, 2003 @04:20PM (#5252942) Homepage Journal
    If you don't want your content being redisplayed on another site, place appropriate copyright and seek protections therein.

    Don't stifle the technology. Treat the cause, not the symptom.
    • by numbski ( 515011 ) <[numbski] [at] [hksilver.net]> on Friday February 07, 2003 @04:24PM (#5252995) Homepage Journal
      So far as apps are concerned, again no.

      There's no law stating that we have to look at ads. Although I see the problem of paying the bills, a flaw in a business model is not the problem of the application coder (namely: me, you, and most people reading this site).
      • by sydlexic ( 563791 ) on Friday February 07, 2003 @04:44PM (#5253192)
        Didn't you read the terms of service agreement you were handed at birth (US citizens only) that states any bypassing of ads during receipt of content is theft?

        I'm just waiting for Ashcroft's goons to knock on my door, find the TiVo, and haul my ass off to jail.
      • Derivative work (Score:5, Informative)

        by yerricde ( 125198 ) on Friday February 07, 2003 @04:59PM (#5253325) Homepage Journal

        There's no law stating that we have to look at ads.

        What about 17 USC 106 [cornell.edu], which states that barring fair use, etc., the copyright owner has the right to prevent others from creating derivative works of a web page?

        • Re:Derivative work (Score:5, Informative)

          by Natalie's Hot Grits ( 241348 ) on Friday February 07, 2003 @05:36PM (#5253598) Homepage
          Yes, barring fair use, which explicitly allows you to do this unless you re-distribute the work. Which you aren't.

          Short answer is that you can modify any work under fair use for your OWN PERSONAL USE and not for someone else. If your web browser cuts out ads, then that is legal, and no US Code currently in existence disallows these modifications.

          Aside from this point, there are still the legal ramifications: there is no US law which states it is illegal to build, distribute, or use tools that can modify copyrighted works (unless the work is encrypted and covered under the DMCA).

          If an ISP started doing this at its firewall, and then re-distributing the web site to your computer after you request it, then this might be illegal. They might be able to argue that one party is getting the work, modifying it, and redistributing it, which is certainly not covered under the Fair Use Doctrine.

          OTOH, if the ISP has a fair use reason to do this (such as reformatting the text to work on a text only terminal), then this may also be legal.

          What it all boils down to is that the spirit of copyright law is to restrict COPYING and REDISTRIBUTING, not how a person uses those works. This held true until 1998, when the DMCA was enacted, and even now is still true for all copyrighted works that are not covered under the DMCA's encryption clauses. To this day, I have yet to find a website that is encrypted for purposes of the DMCA's protection. Until this changes, they won't have any legal legs to stand on.
        • Re:Derivative work (Score:4, Interesting)

          by Sabalon ( 1684 ) on Friday February 07, 2003 @05:46PM (#5253680)
          If I buy a copy of The Hobbit, rip out every 5th page and then read it, have I created a derivative work and broken a law?

          If I don't distribute it, can't I do whatever I want with the content?

          If I were to then repost this on the web, yes... I could see where that would be a problem, but not with what I do for myself.
        • Re:Derivative work (Score:5, Insightful)

          by bwt ( 68845 ) on Friday February 07, 2003 @07:01PM (#5254177)
          The author does not create the "web page"; that is the job of the user agent. The author offers up raw HTML source code and YOU render it. Your argument proves too much -- it proves that all rendering of HTML in a browser is copyright infringement because it creates a derivative work of his source code. Indeed, it DOES create a derivative work, just one that is **authorized**.

          The author creates various files such as HTML text files, pictures, PDFs, etc. By using HTML, he has authorized the user agent to render them consistent with the HTML standard and his HTML code. Thus, he has explicitly authorized certain limited types of derivative works to be made from his source code by using HTML. The HTML standard does not require images to be rendered, and since it was the author's choice to use HTML, no violation of copyright law occurs when HTML is rendered in a manner consistent with the HTML spec.

          Had he wanted to mandate the exact representation, he could have used an image format or a PDF. It's his choice, but he must live with it and all that follows from it.

          Of course, there is nothing wrong with not rendering the HTML at all and just looking at it as source code. Nor is there any cause of action under copyright law if you extract unprotectable facts and ideas from either the source code or the rendered version.
    • Re:In short, no. (Score:4, Interesting)

      by helix400 ( 558178 ) on Friday February 07, 2003 @04:31PM (#5253080) Journal
      place appropriate copyright and seek protections therein.

      I agree. I believe that if any particular company (EuroTV) places data online to be accessed by anyone... then that data is free and up for grabs unless the company legally tells you it isn't.

      As someone who writes many screen-scraping applications, I've run into these legality issues many times. We frequently need to grab data off copyrighted sites. In most cases, we just contact the company to get their OK before proceeding. Sometimes the company was overjoyed that we were providing a new interface for their data. In this EuroTV case, I believe EuroTV's lawyers are simply trying to outmuscle the ignorant, without consulting their PR department.

      • IANAL

        But one thing I DO know is that data isn't copyrightable. If you write a website with data in it, you own a copyright on the formatting of that data. Any person is legally allowed to copy that data as long as it isn't covered under trade secret laws.

        In the case of channel guide information, they are not trade secrets because they are being offered to the public via http.

        The only lawful protection they can claim is a technical one. If they modify their website to try to "protect" that data (for example, they could turn all the text into an image, and then you would have to update your program to recognise text in an image), they could do that, but then they would have to modify their website each time you updated your program. It's a losing battle on both ends, depending on how dedicated to the project each end is.

        It is possible they could make you sign an NDA and then give you a login to their website. If they want the protection they are asking for, that is the only way to go about it and still have legal protection of their data (under current US law).

        If they are trying to change the laws, well... then I will just say that you should not be able to change laws to fix your failing business model. It is not the government's responsibility, nor the citizens', to be forced to prop up a business that has a failing business model.

    • Re:In short, no. (Score:5, Insightful)

      by ESCquire ( 550277 ) on Friday February 07, 2003 @04:41PM (#5253175)

      That's right.

      The pure act of retrieving any information from a web page, by whatever means possible, does not violate any rights of the content owner. That's what he put the content up for!

      But redisplaying the content on another site is another thing. That's an act of utilisation or distribution that can be protected by the content owner's copyright.

      So the right thing to do is not to demand removal of CPAN modules that can retrieve information, as this retrieval is legal. If the retrieved content is utilised or redistributed, the content owner may be able to demand removal and even damages because of copyright violation.

      • by acroyear ( 5882 ) <jws-slashdot@javaclientcookbook.net> on Friday February 07, 2003 @06:06PM (#5253815) Homepage Journal
        Trouble is, there's still an open debate on whether or not databases that contain facts (as opposed to other types of corporate data) are copyrightable. Take, for example, the Weather Channel. Its "current temperature" and other statistical information is a fact, not copy, and thus perhaps can be scraped away and redisplayed wherever. However, its predictions of the day's or week's weather are written/created material, not "fact" in the same sense, and thus should be illegal to copy... but then again, there are those who claim that you should still pay for the service of getting facts, such as cddb.org. Of course, their situation is that they want you to pay, yet you're the one who entered the data in the first place back when it was "free", so it's slightly different grounds. But you see, things can get confusing when getting into this issue of what's public and yet still copyrightable.
    • Re:In short, no. (Score:3, Informative)

      by $$$$$exyGal ( 638164 )
      At the bottom of every eurotv.com page, it says '(c) 1995, 2002' (with the copyright symbol). That should be enough to require that people request permission before scraping the site. On the other hand, people should be able to safely write and distribute software that scrapes the site; they just can't use it without EuroTV's permission.

      --nude [slashdot.org]

      • Re:In short, no. (Score:5, Insightful)

        by CaseyB ( 1105 ) on Friday February 07, 2003 @04:51PM (#5253254)
        people should be able to safely write and distribute software that scrapes the site, they just can't use it without EuroTV's permission.

        Sure they can. They can't redistribute that information at will, but they have every right to make a regular http request to the EuroTV server and then use the response for whatever personal use they see fit.

        • Re:In short, no. (Score:4, Insightful)

          by interiot ( 50685 ) on Friday February 07, 2003 @05:24PM (#5253517) Homepage
          Which is the same moral position most Slashdotters have regarding DRM and shrink-wrapped software: a company can demand anything they want up front for some content/disks/files, but once the content and money have changed hands and you take the stuff home, you should be able to do whatever the hell you want with it as long as you don't give it to other people.

          NDAs/noncompetes/whatever are fine for business-to-business contracts but have no place in the consumer market. Granted, education is always better than forcibly changing the laws, but consumers are sheep and so must be protected. So let no consumer be restricted!

    • Re:In short, no. (Score:5, Informative)

      by swinginSwingler ( 161566 ) <marc_swingler AT hotmail DOT com> on Friday February 07, 2003 @04:50PM (#5253245)
      Well, I don't know about EuroTV, but Simon Batistoni's post (he's the author of WWW::Map::UK::Streetmap) links directly to Streetmap's terms of use. Also, the front page of Streetmap has a disclaimer link right at the top which claims copyright:

      The Street Map site is compiled and made available by BTex Limited.

      Information displayed through the Streetmap site is extracted from a copyright work owned by BTex's Respective Suppliers. See our About page for more details.

      This compilation is a copyright work. © BTex Ltd 1997,1998,1999,2000,2001,2002.

      A single print of the results of a map search is permitted for your own personal use. Otherwise the reproduction, copying, downloading, storage, recording, broadcasting, retransmission and distribution of any part of the Streetmap site is not permitted. Please see our business services page if you would like to use this data or to print the maps for your business.

      Considerable efforts are made to make information contained in the Streetmap site as accurate as possible but no warranty or fitness is implied.

      The Streetmap site is provided on an "as is" basis. BTex Ltd shall have neither liability nor responsibility to any person or organisation with respect to any loss or damage arising from Information or the use of Information.

      Requests made on Streetmap web site can be logged, however this information is only used for statistical use. Any data that may be given to external parties will not be directly attributable to your use, eg we may say we have 1000 subscribers in London, but we will not say which subscribers live in London !

      So if the data scraped from Streetmap ended up on someone else's webpage, that would be a clear violation of copyright. What would be said if Microsoft violated the GPL for Mozilla source and placed it into IE?

    • If you don't want your content being redisplayed on another site, place appropriate copyright and seek protections therein.

      And use the robots.txt convention. Your opponents will face some trouble when they have to explain why they ignore it.

      So the answer is "yes", at least from a purely technical point of view.

      Now the inevitable question: Is this a good thing?
    • In short, maybe... (Score:3, Insightful)

      by EvilAlien ( 133134 )
      ... but only if the application vendor complies with their requests to remove the feature, code or application. In other words, if CPAN is not willing to tell them to go pound sand, perhaps our beef is as much with CPAN as it is with the media sources.
  • "'can companies dictate what software you can use to access web content from their server?'"

    The short answer is 'no, but they will'

    • They can dictate whatever they want, but if they block automated agents - screen readers for the blind, specifically - they'll be opening themselves up to a lawsuit.

      All of the companies requiring you to retype something you see on the screen are in violation of the Americans with Disabilities Act. Unfortunately it appears that a lawsuit against a website has yet to win [umich.edu], but it will happen soon enough. The article I link to states that an airline got away with an inaccessible website because they offered a phone-based order system. At some point, lawyers are going to realize that there's good money to be made by suing websites that are inaccessible to the disabled. For example, where's the toll-free number on the PayPal site? The eBay site? The Yahoo site?

  • Sure they can! (Score:5, Interesting)

    by stile ( 54877 ) on Friday February 07, 2003 @04:21PM (#5252946)
    If we piss them off enough by chopping off their advertisements and snipping out their content, they'll just write their sites in Flash, or as one big image file, or some other proprietary format. That'll pretty well dictate what software you use to view their site.
    • Comment removed (Score:4, Interesting)

      by account_deleted ( 4530225 ) on Friday February 07, 2003 @04:24PM (#5252987)
      Comment removed based on user account deletion
      • Re:Sure they can! (Score:4, Insightful)

        by TheJesusCandle ( 558547 ) on Friday February 07, 2003 @04:29PM (#5253055) Homepage
        That's what I tell my clients who try to "encrypt" things in this silly manner. I've written packages that defeat those silly "enter the word contained in the image" tests, and I've written packages that defeat silly anti-automation scripts.

        It's really not hard.


        Sure, there's always the 2% that can get around any barrier you put up. Stopping the 98% is usually good enough to justify the extra effort of developing these measures.

        You shouldn't complain too much about what your customers want; they're paying you for your time, right? Give 'em what they want.
        • Re:Sure they can! (Score:3, Insightful)

          by umeboshi ( 196301 )
          -- Sure, there's always the 2% that can get around any barrier you put up. Stopping the 98% is usually good enough to justify the extra effort of developing these measures.

          They're trying to stop the 2% from sharing their knowledge with the other 98%.
      • "I've written packages that defeat those silly "enter the word contained in the image" tests..."

        Ahem. Bullshit.


        • Maybe bullshit, maybe not. A good OCR library will get you 90% of the way there already.

          They can't distort the characters TOO much in the image, or else humans wouldn't be able to recognize them either. And the background patterns used to cause interference with OCR systems could be pretty easy to strip out too; a grid of straight black lines on a white background is fairly trivial to recognize algorithmically, and then removing the lines becomes a simple matter of figuring out where a black pixel is just part of a line, and where it's part of a character. (A toy Perl sketch of the line-stripping step follows this comment.)

          Whether it's worth all that effort just to be able to automate the submission of a form is debatable.
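
          As a toy illustration of that line-stripping idea (not a full OCR attack), here is a hedged Perl sketch using the GD module; the file names and the 90% "mostly dark" threshold are arbitrary assumptions:

          #!/usr/bin/perl
          # Wipe out full-width horizontal and full-height vertical dark lines
          # from a black-on-white image, leaving character strokes (which never
          # span an entire row or column) alone. GD must be installed; the file
          # names are placeholders.
          use strict;
          use warnings;
          use GD;

          my $img = GD::Image->newFromPng('captcha.png')
              or die "can't read captcha.png";
          my ($w, $h) = $img->getBounds;
          my $white = $img->colorResolve(255, 255, 255);

          # A pixel counts as "dark" if all three channels are low.
          my $dark = sub {
              my ($r, $g, $b) = $img->rgb($img->getPixel(@_));
              return $r < 80 && $g < 80 && $b < 80;
          };

          for my $y (0 .. $h - 1) {
              my $count = grep { $dark->($_, $y) } 0 .. $w - 1;
              next unless $count > 0.9 * $w;                   # looks like a grid line
              $img->setPixel($_, $y, $white) for 0 .. $w - 1;  # paint the row white
          }
          for my $x (0 .. $w - 1) {
              my $count = grep { $dark->($x, $_) } 0 .. $h - 1;
              next unless $count > 0.9 * $h;
              $img->setPixel($x, $_, $white) for 0 .. $h - 1;
          }

          open my $out, '>', 'cleaned.png' or die "can't write cleaned.png: $!";
          binmode $out;
          print {$out} $img->png;
          close $out;
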
          • Captchas (Score:5, Interesting)

            by Valdrax ( 32670 ) on Friday February 07, 2003 @05:32PM (#5253570)
            Actually, this is quickly being considered a new Turing test for the computer vision field. It is actually very easy to make pictures that humans can read and that machines currently can't. Look up more info on it here [captcha.net].
        • Comment removed (Score:4, Interesting)

          by account_deleted ( 4530225 ) on Friday February 07, 2003 @05:37PM (#5253616)
          Comment removed based on user account deletion
        • Actually there is a much simpler way to defeat "please enter the word in the image" websites, and one that actually raises a real issue: those image tricks discriminate horribly against the blind, the old, and those with eye problems in general, as well as, in some cases, dyslexics.
      • Turing test? (Score:5, Insightful)

        by siskbc ( 598067 ) on Friday February 07, 2003 @04:50PM (#5253250) Homepage
        So far, I was under the impression no one had won the Turing contest yet. You are beating their trivial problems, but they're finally waking up and shifting the "online human test" to things that people haven't figured out how to code. I'd link to the article if I could remember where I saw it...

        Hell, the simplest would be an easy reading comprehension or logic test with a short-answer blank - the computer would never get it, and all humans would.

        My guess is that soon, people who REALLY want you out will keep you out.

        • Re:Turing test? (Score:3, Insightful)

          by nuggz ( 69912 )
          First off, you assume people will be able to comprehend. I doubt that; people are dumb. Don't believe me? Listen to a daytime talk show.

          Second, a computer will mark your answer, so it must be able to comprehend the answer you put in. You likely have to give a precise and exact answer, which means it's a simple question, and a computer might be able to answer it.
      • Re:Sure they can! (Score:5, Interesting)

        by CaseyB ( 1105 ) on Friday February 07, 2003 @05:03PM (#5253355)
        If human eyes can read it, someone can write software to parse it.

        Uh huh.

        Good [captcha.net] luck [captcha.net], buddy [captcha.net].

      • NF Chance (Score:3, Insightful)

        If human eyes can read it, someone can write software to parse it.

        That's what I tell my clients who try to "encrypt" things in this silly manner. I've written packages that defeat those silly "enter the word contained in the image" tests, I've written packages that defeat silly anti-automation scripts.

        It's really not hard.

        Can something that recognizes text in an image be written? Sure. It's just a form of OCR. Can you write one that's able to look at any generic webpage, a mix of text and images, and do what is being asked of a human? I don't believe you can, and it seems a pretty high expectation of any software given the current state of AI. A targeted program for one website I might believe, but such tests for a human are certainly valid protection against web-crawling 'bots.

        Which is not to say I agree in any way that screen-scraping software is a violation of a website owner's rights. It's not.

    • Re:Sure they can! (Score:5, Insightful)

      by interiot ( 50685 ) on Friday February 07, 2003 @04:25PM (#5253003) Homepage
      No they won't. The main goal of HTML wasn't so everything would be open and "stealable"; the goal was to have content that could be viewed on a variety of platforms. You can't get that with Flash or huge images, and in fact, for some of the more interesting devices (e.g. cell phones, PDAs), it's explicitly required that the machine be able to understand the content to some extent so that it can transform it into something that better suits the particular device.
      • Re:Sure they can! (Score:5, Interesting)

        by SoCalChris ( 573049 ) on Friday February 07, 2003 @04:30PM (#5253070) Journal
        You have good points, but try explaining that to a very non-technical executive who is afraid that everyone is out to steal their content. I've seen many companies that will do their entire website in Flash just so the content can't be "stolen".

        Personally, I refuse to install the Flash plugin, so if I come to one of these pages looking to do business, oh well. I'll just go somewhere else. The higher up people in companies that make all Flash sites don't seem to realize that Flash is annoying to a lot of people.
    • Re:Sure they can! (Score:5, Insightful)

      by superdan2k ( 135614 ) on Friday February 07, 2003 @04:28PM (#5253049) Homepage Journal
      Yeah, and then they'll lose traffic and die because no one will bother wasting the time on their site.

      What a lot of companies fail to realize is that the Social Contract (philosophy, not law) applies as much to the relationship between client and customer as it does between Joe and Jim Average. Play by the rules and be part of society, or doom yourself...that's basically it. No man is an island. No company is an island...well, maybe Microsoft, but that's it.
    • If they want to take an extreme measure such as that, fine. They are entitled to limit their viewership as much as they like. To take steps to get a project to eliminate code that offends them is going beyond the realm of reasonable request.

      If they wish to restrict which applications can access their content, it is up to THEM to take the measures necessary to restrict the access. It is not the responsibility of the developer to comply with their request.
    • Re:Sure they can! (Score:4, Insightful)

      by mr_z_beeblebrox ( 591077 ) on Friday February 07, 2003 @04:42PM (#5253182) Journal
      That'll pretty well dictate what software you use to view their site.

      As the admin for a large distributor, I am often called to the desk of various sales people to install Flash. I inform them that Flash is not supported in our environment. The result? Well, companies use websites because it costs a LOT less to process web orders than to process phoned-in orders (but the cost of order placement is only slightly different). Some of these companies depend on us as their largest customer. I have to date seen three websites rewritten to accommodate that policy. If we all leverage (buzzword ;-) ourselves as customers we can defeat the evil monolith. That is my contribution to the internet.
  • Short answer: No (Score:5, Insightful)

    by gilroy ( 155262 ) on Friday February 07, 2003 @04:21PM (#5252954) Homepage Journal
    Blockquoth the poster:

    The question being asked on the mailinglists (threads about EuroTV and about Streetmap) is 'can companies dictate what software you can use to access web content from their server?'"

    No. It's not what the Web is about. If they don't want people accessing it, they probably shouldn't put it out there in a public place. It's like saying, "I'm going to hang up a banner from the windows of my house, but I don't want Raiders fans to read it." It's just silly -- the Web is about information transfer, period.
    • I can already picture them writing a JS that detects your browser, or whatever, and re-routes you to their 'other' site, which is nothing more than a summons for you to appear in court for attempting to view the site.

  • by pizza_milkshake ( 580452 ) on Friday February 07, 2003 @04:23PM (#5252969)
    It's shared. If you don't want people to use the information you've made available on the web, require them to log in or jump through some other hoops.
  • Sure they can- (Score:2, Insightful)

    After all, doesn't a EULA tell you what you can and can not do with said program, and we all follow it to the T every time?

    If it's an issue of revenue, it's stealing.

    Remember way back when on Ask Slashdot that some guy posted he used another website's image, linked to that website, on his page? And how that site's webmaster got upset about the bandwidth usage and replaced the image with something slightly more objectionable (I believe it was pornographic)... and as to whether or not that was legal?

    Same situation here, folks. Undesired connections consuming bandwidth without proper licensing. Sounds fightable to me, but IANAL.
    • Re:Sure they can- (Score:3, Insightful)

      by zenyu ( 248067 )
      If it's an issue of revenue, it's stealing.
      No no no. If Pizza Hut tries to break into a ritzy neighborhood by giving free slices to everyone walking by their stand in that neighborhood, and I go back to the hood and tell everyone they are handing out free pizza on 72nd street, it's not stealing. Pizza Hut is giving away the pizza, and if their "revenue model" starts to fail, it's not my fault.

      Same situation here, folks. Undesired connections consuming bandwidth without proper licensing. Sounds fightable to me, but IANAL.

      You don't need a license to access content on public networks; that's the point. If they want that kind of deal they can block all connections from browsers that don't have a secret key only given to those who have signed a contract with them. More like if Pizza Hut only gave the free pizza to men with ties and women with high heels who had a receipt for pizza on them.

      Sadly you may be right about it having some merit. Copyright does not apply to information since it is, hopefully, not a "creative work." But European countries have laws specific to databases that give them government-protected monopolies on some other basis. Any Europeans have the details?
  • by agentZ ( 210674 ) on Friday February 07, 2003 @04:24PM (#5252983)
    I wish I knew who said it, but if you keep people from doing really stupid things, you also keep them from doing really creative things. Restricting the methods used to access a web site will only stifle innovation.
  • Silly (Score:3, Interesting)

    by LongJohnStewartMill ( 645597 ) on Friday February 07, 2003 @04:24PM (#5252991)
    Haven't places like Yahoo! and Hotmail had problems with automated scripts? What did they do? They put in some sort of feature that the computer/program couldn't understand. I think that's probably a better workaround than getting rid of somebody's hard work.
  • I think the best they can do, if they really want to, is read the HTTP request headers and try to send browser-specific content, then send specialized commands that only that browser can read, such as ActiveX controls on Windows, or... is there anything similar in other clients?

    Ethical reasons aside, I don't think there is a technical way to do it while maintaining any similarity to 'good' HTML practices.

    frob.

  • Using Perl modules to access websites is like using TiVo: you can cut out all the crap. I think it is inevitable that things like this will continue to happen as long as you can do that (e.g. GnuCash uses Finance::Quote to retrieve stock quotes). Now, if the website explicitly blocks it or disallows it as a policy, then it's a different ball game....
    • Re:TiVo analogy? (Score:5, Insightful)

      by JUSTONEMORELATTE ( 584508 ) on Friday February 07, 2003 @04:45PM (#5253208) Homepage
      Using Perl modules to access websites is like using TiVo: you can cut out all the crap
      s/crap/stuff that paid for what you're viewing/
      I'm not claiming that skipping commercials is theft, nor that web-clipping without ads is theft.
      Just remember one thing -- you aren't the consumer. You are the product. The sale is between the "content provider" and the advertiser. You're what's being sold.
      If you sidestep this sale (TiVo, banner-ad-filter, whatever) then the "Content Provider" makes less money, and has to either change careers or find other ways to increase their sales. (product placement, ads that appear to be news items, etc)
      Getting all jacked up about your "rights as a consumer" is just inane.
  • There are 2 things that the people are worried about: 1) bandwidth, and 2) people not looking at the pretty ads on their pages.

    If 2 isn't an issue, I would say the site should consider putting up a low-bandwidth version of the pages that get scraped, like Slashdot and Freshmeat offer.

    Otherwise, I don't really see how anyone could prevent someone from doing a screen scrape. I too wondered about this when I found some kind of TV guide that did screen scrapes in Perl.
  • david vs goliath (Score:2, Insightful)

    by aa0606 ( 250018 )
    Obviously, the big sites are only going to go after those that they feel they can get to stop scraping with legal threats.

    Isn't the Google News site basically a big scraper app? I doubt any of the sites whose articles are displayed there are complaining about it.
  • by adjuster ( 61096 ) on Friday February 07, 2003 @04:28PM (#5253045) Homepage Journal

    No problem if they don't like my user agent... I'll just use Oprah.

    wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows NT 5.1) Oprah 7.0 [en]"

    Relying on the User-Agent header for anything is silly. That probably means that some idiots are going to resort to some kind of algorithmic analysis of traffic patterns to determine screen-scraper use. Wonderful. (A Perl take on the same spoof is sketched just below.)

    Sigh.
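
    For illustration, a minimal Perl sketch of the same User-Agent spoofing, assuming libwww-perl (LWP::UserAgent) is available; the URL and agent string are placeholders:

    #!/usr/bin/perl
    # Fetch a page while presenting a browser-like User-Agent string,
    # equivalent to the wget invocation above.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(
        agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    );

    my $response = $ua->get('http://www.example.com/listings.html');
    die 'Request failed: ' . $response->status_line unless $response->is_success;
    print $response->content;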

  • by Just Some Guy ( 3352 ) <kirk+slashdot@strauser.com> on Friday February 07, 2003 @04:28PM (#5253047) Homepage Journal
    ...but their sphere of influence should be strictly limited to their users, and not the author of software that those users may use to retrieve content from the site.

    Put another way: particularly on a subscription site, the site owners may specify whatever stupid terms and conditions that their subscribers are willing to submit to. That does not mean, though, that the client software is obligated to know whether or not the software itself meets the TOS (nor can I be made to believe that this is possible).

  • TerraServer (Score:3, Interesting)

    by Corrupt System ( 636550 ) on Friday February 07, 2003 @04:29PM (#5253051)
    I can understand how site owners could have a problem with a commercial software product like ExpertGPS [expertgps.com] wasting their bandwidth while skipping ads. ExpertGPS costs $59.95, but downloads maps from Microsoft's TerraServer [msn.com] without going through its web interface and viewing its advertising. Microsoft hasn't blocked access from these programs yet, but what if they do? All the paying users of ExpertGPS would be out of this functionality.
  • by tacocat ( 527354 ) <tallison1&twmi,rr,com> on Friday February 07, 2003 @04:29PM (#5253059)

    I am constantly greeted with messages to the tone of:

    You must have Windows Internet Explorer 4 or higher installed on your system to view this website

    How is this any different from what they are attempting to do here?

    I hate to disappoint, but I don't think that this is a new precedent. What is a new precedent is the notion that they can request the removal of, or make unavailable, software that is otherwise freely available.

    The precedent here is not the software usage to access a website, but the notion that this can be extended to:

    Dear Mozilla.org,

    It has come to our attention that people are using your software to access our website. We don't like this and are sending our legal team over to discuss the removal of your software application from the internet.

    Similarly, we are contacting Netscape, AOL, Opera, Konqueror, et al and removing them as well.

    Have a nice day!

  • by Eese ( 647951 ) on Friday February 07, 2003 @04:30PM (#5253067)
    ... don't put merchandise in the windows.

    Just like you can listen to unencrypted radio broadcasts through the airwaves as much as you want, or stand next to a group of people talking and listen in, you can view web pages that are served openly over the Internet.

    If you are going to be presenting something for people to observe, they can observe it however they like. Legislate all you want, but this is a fundamental component of logical (as opposed to legal) privacy.
  • Why not? (Score:5, Insightful)

    by JazzyJ ( 1995 ) on Friday February 07, 2003 @04:30PM (#5253069) Homepage Journal
    There are a multitude of methods for providing different content based on the environment variables the client browser sends. While I think it's silly to demand that modules be removed from CPAN, it's entirely up to the people running the server to determine who they want to serve content to... and who they don't.

    If they can't figure out how to do it server-side (or with client-side scripting) then that's their problem. (A rough server-side sketch follows below.)

    That's the bitch about open standards....EVERYONE can use them.... :)
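
    As a rough sketch of that server-side approach, here is what a Perl CGI wrapper might look like, assuming a plain CGI environment; the browser patterns and file name are invented:

    #!/usr/bin/perl
    # Serve the full page only to User-Agent strings we choose to recognise;
    # everyone else gets a terse refusal. Patterns and file name are made up.
    use strict;
    use warnings;

    my $agent = $ENV{HTTP_USER_AGENT} || '';

    if ($agent =~ /Mozilla|Opera|Konqueror/i) {
        print "Content-type: text/html\n\n";
        print slurp('full_listing.html');    # the page, ads and all
    } else {
        print "Status: 403 Forbidden\nContent-type: text/plain\n\n";
        print "Automated clients are not welcome here.\n";
    }

    sub slurp {
        my ($name) = @_;
        open my $fh, '<', $name or die "can't open $name: $!";
        local $/;                             # slurp mode
        return <$fh>;
    }

    Of course, as the wget example elsewhere in this thread shows, the User-Agent header is trivially forged, so a check like this only keeps out polite clients.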

    • Re:Why not? (Score:3, Interesting)

      by Tokerat ( 150341 )

      Same goes for the deep-link fanatics. Create a 0px-wide frame (basically invisible) that encompasses the entire browser window content area and then load pages in there, on the server side checking the HTTP_REFERER and on the client side using JavaScript to ensure the documents are loaded inside the proper frame (which could have a static name or one that is dynamically allocated to each session, even). Make it run over SSL so no one can "steal" those URLs "in transit".

      Is it really just easier to sue everyone than to pay a grungy guy in a t-shirt like me to set up your server to do this?

      Ahh, I get it, it's the return you make on the "investment" in your lawyer.
  • Learn from Google (Score:4, Insightful)

    by shiflett ( 151538 ) on Friday February 07, 2003 @04:33PM (#5253106) Homepage
    They should do as many of us do and learn a lesson from Google.

    It is a violation of Google's terms of use for you to "screen scrape" search results. You can implement their API using a free key and achieve similar results, however.

    Not only are these companies approaching the "problem" from the wrong angle in terms of common sense, they are also taking the most difficult approach. It is practically impossible to seek to outlaw software that fetches Web content, because Web browsers and wget (for example) are the same thing, HTTP clients. The HTTP protocol is an open standard that anyone can implement. If you don't want a valid HTTP client accessing your server, don't make your server an HTTP server.

    Stated another way, don't try to take an open standard and restrict everyone else's use of it to suit your own needs. You don't see me (an avid soccer player) trying to get the NBA to change the rules of their game to require use of the feet for ball control. If I want to play basketball, I have to play by the rules, else I am not really playing basketball.
  • by bwt ( 68845 ) on Friday February 07, 2003 @04:34PM (#5253108)
    This is just another example of gross technical incompetence by executives and lawyers.

    A company that attaches an HTTP server to the network receives an HTTP GET request complete with some information in its headers. They have a reasonable case to request that that information be accurate. They have the unilateral technical ability to firewall IPs or whole subnets. Otherwise, once they receive a GET request, when the machine that they have configured responds by sending a file, they have granted explicit permission to process that file consistent with the info in the GET request.

    The owner of the server is completely in control at a technical level. If they don't like what you are doing, they can firewall you. Absent a contractual agreement not to, you have the permission to send ***REQUESTS*** for anything you would like to request. They can say no. If you lie in your request, then they have a case to say your use is unauthorized, but short of that, there should be no need to have the judicial system rewrite the technology.
    • Not completely (Score:3, Insightful)

      by mccrew ( 62494 )
      Follow that logic, and then by having a telephone a diner has granted explicit permission to the telemarketer to interrupt his meal.

      Or more related to the point, here are some real-world scenarios:

      1. Spammer tries to relay through a machine by looking for well-known CGI. For example, I frequently see requests for /cgi-bin/formail.pl, with the Referer: header set to the name of my domain.

      2. Spammer tries to relay through either an HTTP server or HTTP proxy which supports the "CONNECT" method.

      Has the owner of the machine explicitly granted spammer permission to (mis-)use his machine, just because a well-known script is present, or because CONNECT is enabled on the wrong side of the internet connection?

      I would respectfully disagree.

    • Here's an analogy of sorts:

      You leave your house unlocked. Someone walks in the front door and steals your TV.

      According to our laws, just because you left your house unlocked (giving the outside world access) does not give the person who stole your TV a legal right to do so. They still committed a crime.

      Now, where this analogy might fall flat on its face is the idea that when you make a GET request, and the party on the other end responds by sending you a stream of data, have they just performed the equivalent of giving you the TV after you walked into their house? They can't very well say that you stole it if they willingly gave it to you.

  • by EnglishTim ( 9662 ) on Friday February 07, 2003 @04:38PM (#5253143)
    I find it sad that so many people seem to think it is just fine to mine their site for data. Sure, there's not all that much that they can do about it, except remove the data or make it harder for regular users of the site to use it.

    For example, The EuroTV site seems to work on the concept that they provide the information for free for users of their site, but you can pay them to get it on your site. They're using their site as an advert for their services, while at the same time offering a useful service to the community. By making freely available a system to allow anybody to use their data in their own websites without paying them for it, you're completely ridding them of their reason for having the site up at all.

    Yes, you can argue that they shouldn't put the information out there if they don't want people to use it, but then you're giving them a good reason not to put the information out there at all, which makes all of us poorer.

    As for whether they can dictate that CPAN remove the modules, certainly it's fair enough of them to request that the module be removed, but it is a shame they leapt to threats of lawsuits quite so quickly.
  • by KjetilK ( 186133 ) <kjetil AT kjernsmo DOT net> on Friday February 07, 2003 @04:42PM (#5253181) Homepage Journal
    The web was never intended to be a browser-only environment. From the start, it was intended to be a medium that would be useful for a wide variety of user agents, crawling for info and presenting compiled and digested information to the user.

    This was never fully realized, I believe mostly because of overpaid "web designers".

    But the Semantic Web [semanticweb.org] would require many funny user agents for all kinds of things.

    Clearly, if this kind of thinking is allowed to persist in corporate headquarters, it will kill the Semantic Web before it gets started.

    I wonder what Tim Berners-Lee thinks about this...

  • Content is important (Score:5, Interesting)

    by binaryDigit ( 557647 ) on Friday February 07, 2003 @04:43PM (#5253190)
    One of the biggest sites that I've not seen anyone mention is eBay. The following is in their EULA:

    Our Web site contains robot exclusion headers and you agree that you will not use any robot, spider, other automatic device, or manual process to monitor or copy our Web pages or the content contained herein without our prior expressed written permission.

    You agree that you will not use any device, software or routine to bypass our robot exclusion headers, or to interfere or attempt to interfere with the proper working of the eBay site or any activities conducted on our site.

    You agree that you will not take any action that imposes an unreasonable or disproportionately large load on our infrastructure.

    Much of the information on our site is updated on a real time basis and is proprietary or is licensed to eBay by our users or third parties. You agree that you will not copy, reproduce, alter, modify, create derivative works, or publicly display any content (except for Your Information) from our Web site without the prior expressed written permission of eBay or the appropriate third party.


    Now, why they do this is obvious: they have an absolute goldmine of information and they want to be able to take advantage of it when they're good and ready. I assume other sites could adopt this type of EULA, which wouldn't make the software itself illegal, but would make using it so (or at least until someone challenges it).
    • by anaradad ( 199058 ) <chris.shafferNO@SPAMgmail.com> on Friday February 07, 2003 @04:49PM (#5253233)
      The eBay EULA only applies if you actually register for their service. If you have never signed up for eBay, you have never signed off on their EULA.
      • From eBay again:

        Welcome to the User Agreement for eBay Inc. The following describes the terms on which eBay offers you access to our services.

        This agreement describes the terms and conditions applicable to your use of our services available under the domain and sub-domains of www.ebay.com (including half.ebay.com, ebaystores.com) and the general principles for the Web sites of our subsidiaries and international affiliates. If you do not agree to be bound by the terms and conditions of this agreement, please do not use or access our services.


        Notice that it doesn't say anything about registering; it says "using their services", which could be interpreted as also browsing, since that is a "feature" offered by their website. Registering simply brings into effect other parts of the EULA that are applicable to those actions. If nothing else, the contents of the site are still copyrighted, so even if you didn't agree to their EULA, you still couldn't do anything with the content.
  • by Lumpish Scholar ( 17107 ) on Friday February 07, 2003 @04:44PM (#5253193) Homepage Journal
    Everyone's assuming the appropriate rules here are from copyright law, which allows you to protect the expression of an idea but not the idea itself. That's probably right. It's not the way some big organizations want to play.

    In the United States, most major sports leagues (NFL, NBA, NHL, MLB, etc.) believe that they own the rights to real time scores, and can permit or restrict any desired use. I ran into this at a previous job: we could "broadcast" football, basketball, and hockey scores at the end of every "period," and baseball scores at the end of every half inning, but we couldn't send updated broadcasts for every new score. That information needed (so said the leagues) to be licensed, and most of it had been exclusively licensed for the medium (Internet) we were interested in.

    Do they have a legal leg to stand on? No. (IANAL.) Are they leaning on a great, big, huge stick with nails driven through it? Apparently.
  • Back in the day... (Score:5, Insightful)

    by TheTick ( 27208 ) on Friday February 07, 2003 @04:44PM (#5253196) Homepage Journal

    Remember when the web -- no, remember when the net was about sharing information? I miss that time. If somebody wrote a cool front end to your service, it was COOL and more power to them. If it made your service (site, whatever) more accessible, that meant more people were looking at your stuff, and that was COOL.

    Now we have entities that threaten legal action for accessing the stuff they've made publically available. There may actually be a case when the software scrapes and repackages the content (or, more importantly, redistributes it), but I hope the stuff about decoding the URL for easy use is bogus. I have my doubts that a court will see it my way, but still I hope for reason. Nevertheless, the whole idea makes me sad and nostalgic.

    Another thought: is my mozilla vulnerable to this sort of action because it blocks ads -- essentially repackaging the server output for display to me? Now I'm really depressed.

  • by hmccabe ( 465882 ) on Friday February 07, 2003 @04:45PM (#5253205)
    I think this is something we're going to start seeing a lot of in coming years. Right now, the Internet in general is going through growing pains, and the pressure is starting to show in these "free services" type sites (e.g. MapQuest).

    I don't know about these sites in particular, but many of the big sites around today were built with the failed dot-com business model of delivering free content and selling advertising that ran on the page (or popped up behind it). This, of course, is dependent on people viewing the site in a browser. If people get the information without using a browser, and therefore never see the ads, the advertisers won't want to spend any money on the site.

    Another problem is, most companies don't want to take the risks associated with innovation, so instead they seek legal action to maintain the good thing they have going. While this is a quick fix, and in the company's best interests, we need companies to present a new business model to the public and see how it gets adopted. I would pay an annual subscription fee for things like Mapquest.com, tvguide.com and maybe even /. I believe others would as well.

    Porn sites, eBay auctions, games such as EverQuest and services such as Apple's .Mac are online services that subscribers happily pay for because, more than anything, they are quality products (well, some of the porn is). If a company's revenue is coming from its users, it will be a lot less concerned about how the information is being distributed.

    This isn't such a radical change, as they could add a premium subscription service and slowly transition the focus of their business towards it. Wouldn't it be cool if I could write my own mapping application (or download a pre-made one from the site), have it connect to xml.mapquest.com, give my username and password, and retrieve the data I requested?
  • by troydsmith ( 560294 ) on Friday February 07, 2003 @04:56PM (#5253284)
    About 2 years ago eBay did exactly this. Their case went to court and they won.

    Here is some more info [computerworld.com]

  • by Maul ( 83993 ) on Friday February 07, 2003 @05:23PM (#5253505) Journal
    If you put something on the web, you have to assume that people are going to access that information in any way that they possibly can.

    I suppose the big complaint is that people might not be viewing the "ads" on pages if they use certain HTTP clients.

    I have a suggestion for the sites that are complaining. If you don't like it, don't put stuff on the web. Write your own custom client-server solution if you don't want people accessing it with certain browsers or other software.

    If you are depending on ad banners for your revenues, you and advertisers are taking a "risk" that people might not see the ads, or that they might not buy advertised products. Tough luck if you lose out on your bet. Hopefully you have a solid way of making money related to whatever service you are providing to make up for it.

    Whining about lost ad revenue and such is the same as whining about losing money in Las Vegas. You should have assessed the risks before playing the game.
  • Thread at Perlmonks (Score:3, Informative)

    by Neil Watson ( 60859 ) on Friday February 07, 2003 @05:27PM (#5253543) Homepage
    Go here [perlmonks.org] for a discussion from last summer over at PerlMonks.
  • by nobodyman ( 90587 ) on Friday February 07, 2003 @05:30PM (#5253553) Homepage
    Even if you removed the screen-scraping modules you wouldn't even come close to solving the "problem" these website operators are having. Both Microsoft and (I think) Sun have XML APIs that allow you to issue HTTP requests and easily access what the server sends back. Even if you didn't have a high-level "screen scraper", you could always go through the sockets API. Hell, if I want to find out the type of server a website is using I just open a telnet connection to port 80 and type
    GET <document_name> HTTP/1.0
    ...hit the return key twice and boom. Being that easy, I'm sure there are tons of developers that screen-scrape without even using a module. (A bare-sockets version is sketched after this comment.)

    If a website operator is having their copyrighted content lifted by another site and presented as its own, then that operator can sue using traditional copyright law. If they are having their website slammed because some clueless developer is scraping too often, they can block the IP. But trying to restrict access to the api is heavy-handed and futile.
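
    For illustration, a rough Perl sketch of that same request made straight through the sockets API, with no scraping module at all; host and path are placeholders:

    #!/usr/bin/perl
    # Issue the GET by hand over a TCP socket and dump whatever comes back,
    # headers (including the Server: line) and body alike.
    use strict;
    use warnings;
    use IO::Socket::INET;

    my $sock = IO::Socket::INET->new(
        PeerAddr => 'www.example.com',
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "connect failed: $!";

    print $sock "GET /index.html HTTP/1.0\r\n";
    print $sock "Host: www.example.com\r\n\r\n";

    print while <$sock>;
    close $sock;
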
  • by Rob Parkhill ( 1444 ) on Friday February 07, 2003 @05:40PM (#5253633) Homepage
    EuroTV has a robots.txt file that asks robots to leave the various /scripts directories alone. If this Perl module is just ignoring that robots.txt file, then that is just rude, although I don't see how it is illegal. (A robots.txt-respecting client is sketched after this comment.)

    Streetmap doesn't even have a robots.txt file, so I don't see why they are whining about it.

    Although I can see why these websites could get upset. The TV-listing screen scrapers are especially bad about hammering a site relentlessly for a sustained period of time to obtain all of the programming information for a certain broadcast area. The scraper has to hit the site repeatedly to obtain all of the information, since it isn't all displayed on a single page. If any one of these scrapers gets to be really popular, it could kill the site.

    Of course, the solution to that is to make all of the listing available as one big chunk to avoid repeated requests. But then the site goes out of business in a few weeks due to lack of advertising revenue.

    I, for one, wish I could buy a subscription to zap2it.com that would give me fast, easy access to the channel listings in, say, XMLTV format. Is $25/year a reasonable fee, considering that I would only hit the site once a day at the most, and grab a single file?
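
    A sketch of the polite alternative, using LWP::RobotUA from libwww-perl, which fetches robots.txt, honours it, and rate-limits itself; the agent name, contact address, and URL are placeholders:

    #!/usr/bin/perl
    # A scraper that checks robots.txt before each request and waits
    # between hits to the same host. Names below are illustrative only.
    use strict;
    use warnings;
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new('example-listings-scraper/0.1', 'me@example.com');
    $ua->delay(1);    # at least one minute between requests to a host

    my $response = $ua->get('http://www.example.com/scripts/listings.cgi');

    if ($response->is_success) {
        print $response->content;
    } else {
        # Paths disallowed by robots.txt come back as 403 Forbidden.
        warn 'Blocked or failed: ' . $response->status_line . "\n";
    }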

  • by g4dget ( 579145 ) on Friday February 07, 2003 @05:45PM (#5253674)
    First, people come up with stupid business models ("we'll put up copyrighted map data for free and make money from advertising"). Then, when it predictably turns out that people access that data programmatically, they whine.

    Let's not screw up our legal system with provisions to protect bogus business models. If streetmap.co.uk cannot figure out how to make money putting up information openly on the Internet, then either they should make room for someone who can, or maybe there just isn't a market there.

  • by frovingslosh ( 582462 ) on Friday February 07, 2003 @06:00PM (#5253780)
    Some /. readers seem to be missing this, but this is not a debate on whether it's right to take someone's content and post it elsewhere. (To me it's clearly not without their permission, but that's not the issue here at all, so let's not even pretend that it is by debating it.) The issue is: is it legal / proper / legitimate to write software that is capable of looking at the output of a website by any means, including examining the HTML returned or capturing the computer screen itself and analyzing that? Of course it is. Such software in no way pirates a website owner's content; it just gives me additional tools for keeping current with the content of those pages. There are plenty of legitimate uses (the Streetmap reference was perfectly on target for this, just to give one). That someone might abuse such a tool and pirate content is hardly the issue; if it were, every C compiler would also be at fault. People need to stand up against cranks like BTex's Kate Sutton who think they can bully everyone else in the world. Simon Batistoni should never have even tried to be reasonable with her, and he should make his tool available again and sue her and her company for the slander she has committed against him in the main perl5 bug queue.

    Even if he had provided a tool to make a copy of a map, which he did not, there is nothing at all wrong with making and supplying others with that tool. It's how the tool is used that is the issue, and a tool that has legitimate useful uses can never be allowed to be the target of such a complaint or suit.

  • by billstewart ( 78916 ) on Friday February 07, 2003 @06:24PM (#5253934) Journal
    All sorts of people who don't understand the web or the Internet keep trying to get rules made or bring lawsuits or abuse the DMCA in novel ways because they don't like how their data is being used. In most cases, this is way out of line (as opposed to mildly out of line) because they can simply set their web server not to respond to requests they don't like.

    A classic instance is the "deep linking" [wired.com] cases, where somebody doesn't want to let you see their deep pages except by coming through their front page. Rather than taking this to court, as several content providers have done, and beating up on users one at a time, it's much simpler to check the HTTP_REFERER to find out what page the request came from, and send an appropriate response page to any request that doesn't come from one of their other pages. (Whether that's a 404 or a redirect to the front page or a login screen or whatever depends on the circumstances.) (A minimal version of that Referer check is sketched after this comment.)

    Screen scrapers are an interesting case for a couple of reasons. One of them is that blind people often use them to feed text-to-speech browsers, so banning them is Extremely Politically Incorrect, as well as rude and stupid. Another is that anybody with a Print-Screen program on their PC can screen-scrape -- you're only affecting whether they get ugly bitmaps or friendlier HTML objects. So you not only have to ban custom-tailored CPAN modules, you have to get Microsoft and Linus to break the screen-grabbers in their operating systems.

    The related question "ok, so how *do* I detect and block http requests I don't like?" is left as an exercise to the blocker (and to the people who build workarounds to the blocks, and the people who also block those workarounds, etc...) The classic answers are things like cookies (widely supported "need the cookie to see the page" features seem to be available), ugly URLs that are either time-decaying or dependent on the requester's IP address, etc., or just checking the browser to see which lies it's telling about what kind of browser it is. There's also the robots.txt [robotstxt.org] convention for politely requesting robots to stay away, and Spider traps [devin.com] to hand entertaining things to impolite robots or overly curious humans.
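
    A minimal sketch of that Referer check as a Perl CGI wrapper; the domain and page content are invented, and the Referer header is of course supplied by the client and easily forged, which is exactly the arms race described above:

    #!/usr/bin/perl
    # Redirect deep-link requests that did not arrive via one of our own
    # pages back to the front page. Domain and content are placeholders.
    use strict;
    use warnings;

    my $referer = $ENV{HTTP_REFERER} || '';

    if ($referer =~ m{^http://(www\.)?example\.com/}i) {
        print "Content-type: text/html\n\n";
        print "<html><body>The deep content goes here.</body></html>\n";
    } else {
        # No Referer, or one from somewhere else: bounce to the front page.
        print "Status: 302 Found\n";
        print "Location: http://www.example.com/\n\n";
    }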

  • by ZoneGray ( 168419 ) on Friday February 07, 2003 @06:44PM (#5254058) Homepage
    Well, if screen-scraping is illegal (and in some forms, it certainly is), then somebody should sue the people who sell programs that harvest e-mail addresses from web sites.
  • by drDugan ( 219551 ) on Friday February 07, 2003 @08:46PM (#5254917) Homepage
    It all comes down to money and the models people have used to force advertisements onto people while they are entertained or educated.

    The cold, hard truth is that the digital future obviates the traditional content control mechanisms used to force consumers to watch ads for content. The exact same lines are playing out on the web, on TV, in music, movies, magazines -- everywhere information can be digitized and presented in ways not tied to physical media.

    The (now old) business models that the digital methods circumvent will eventually be redefined. Short term, laws will support them, because the industries have enough money and clout to cause the laws to happen. Long term, though, people will no longer stand for the absurd, one-sided contract with society that is our current IP system.

    This is a vague comment, quickly written -- but I see here the exact same theme played out over and over in recent years. Free communication (amortized) + 'digitizable' items of value => lack of control by the provider over profits. This is yet another example.

  • Stupid, but true (Score:3, Interesting)

    by Angst Badger ( 8636 ) on Saturday February 08, 2003 @03:57AM (#5257232)
    Under the current state of US law, unauthorized access to a computer system is a federal crime. (I can't speak to EU laws, but I suspect parallels exist.) If Company X says, "You must use Internet Explorer 5.5 to access this site," then you must use IE 5.5. Of course, it would be just plain stupid to do so, but it's their computer system, and they get to decide who is authorized.

    To judge from most of the comments here, the fact that it is incredibly stupid to impose such restrictions has obscured what is actually a legally unambiguous situation. Just because it's dumb doesn't mean it's not legal.

    That an http server is nominally "public" doesn't mean diddly here. Any number of http servers provide for member- or employee-only access. The brick and mortar parallel would be those signs that say things like, "No shirt, no shoes, no service."

    It is surprising that so few people have touched on the reason why companies might object to the distribution of Perl modules designed to harvest data from their sites: bandwidth costs and site performance. It doesn't take too many cron jobs banging on a site every minute -- and being ignored by their users most of the time -- to degrade site performance for "live" users and run up steep bandwidth bills.

    Now, there is certainly no legal basis for Company X to demand that CPAN remove the modules, though it is hardly out of line to ask nicely. But there are firm legal grounds to prohibit anyone from actually using those modules.

    Legal action is probably the wrong way to handle this, though. Having written fairly complicated web scrapers before, I know how easy it would be to make a site virtually impossible to harvest. Rather than make a big stink about the Perl programmers who contribute to CPAN, Company X would be well-advised to hire a good Perl programmer to thwart automated harvesters.
