Tuesday, 2 September 2014

Legality of Web Scraping vs Normal Use


I know the topic of web scraping has been discussed before (example), and I understand it's a bit of a grey area depending on a lot of factors (e.g. website's terms of use).

What I'd like to ask is: how is web scraping any different from (a) how we access the webpage via a web browser, and (b) how web crawlers (e.g. Google) download and index webpages?

Without knowing the legal background, I can't help but think that they're all just HTTP requests. If web scraping is illegal, then so should crawling and indexing (for instance be illegal).

Of course if your program is hitting the server so hard that it causes a denial of service, it's a different story altogether... my point is simply accessing and using data that is already open to the public.



I know this is a dead thread, but it would be nice to place some legal implications here due to its ranking in my Google Search. I cannot help but figure I am not the only one who searches like I do.

Legally, in the US, there are a few factors that seem to be important.

    Are you doing anything that is akin to hacking or gaining unauthorized access via the Computer Fraud and Abuse Act. Exploiting vulnerabilities and passing SQL in the URL to open a database no matter how bad the idiot programming like that was is illegal with a 15 year sentence (see the cases where an individual exploited security vulnerabilities in Verizon). Also, add a time out even if you round robin or use proxies. DDoS attacks are attacks. 1000 requests per second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

    Copyright Law: As mentioned, pure replication of data is illegal. Even 4% replication has been deemed a breach. With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties.

    Trespass and Chattels: The following from wikipedia says it all.

    U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

    Paywalls and Product: When going behind paywalls and breaching contract by clicking an agreement not to do something and then doing it, you add fuel to the protection of negligence v. willingness [an issue for damages and penalties not guilt] in civil and any criminal trials. (sorry originally wanted to say ignorance but it really isn't a defense)

    International: EU law and other law is way more lax. Corporations with big budgets dominate our legal landscape. They control the system in a very real way with their $$$.

Basically, get public information and information that is available without going behind a pay wall. Think like a user of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire site (it isn't really stealing if it is a government site that offers public data especially for download but is if you download all or even more than a couple of the listings on ebay). Read the terms and conditions to know who actually owns the content.

Here are a few examples. Trulia owns its information but you could use it to go to an agents website and collect a legal amount of information. The legal amount is determinable. However, a public MLS listing lookup site with no agreement or terms and offering data to the public is fair game. The MLS numbers lists, however, are normally not fair game.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a million corporate researchers at your disposal.

AS for company policy, it is usually used internally to shield from liability and serves as a warning but is not entirely enforceable. The legal parts letting you know about copyrights and such are and usually are supposed to be known by everyone. Complete ignorance is not a legal protection. It does provide a ground set of rules. Be nice, or get banned is that message as far as I know.

My personal strategy is to start with public data and embellish it within legal means.


Source: http://stackoverflow.com/questions/14735791/legality-of-web-scraping-vs-normal-use

No comments:

Post a Comment