Saturday 28 December 2013

Cloud-based Business ideas

Are you interested in starting a cloud-based business? If so, below are the top ten cloud-based business ideas you can start from home.

Wikipedia defines cloud computing as “internet-based computing whereby shared resources, software, and information are provided to computers and other devices on demand.”

Given the explanation above, popular platforms like Slideshare, Skype, Gmail, YouTube, Vimeo, Flickr, Amazon AWS and CloudFront, Dropbox and WordPress can reasonably be included in a list of cloud applications: they all hold your data (presentation slides, emails, videos, blog posts and so on) so that you don’t have to store it yourself.

Cloud computing provides a much more reliable alternative to keeping files on your own computer; a mode of storage that has been rendered insecure due to the emergence of various types of viruses and other threats to information security.

On the surface, cloud computing has many advantages over traditional methods of data storage. For example, if you store your data in a cloud-based retrieval system, you can get it back from any location that has internet access, and you wouldn’t need to carry around a physical storage device or use the same computer to save and retrieve your information. In fact, in the absence of security concerns, you could even allow other people to access the data, thereby turning a personal project into a collaborative effort.

Like any other innovation that offers huge solutions for individuals and businesses, cloud computing has created huge opportunities for entrepreneurs who have a knack for computers and ICT. If you have a solid background in ICT and in-depth knowledge of cloud computing, then starting a cloud-based business might just be a life-changing business move for you. Without wasting time, below are 10 cloud-based business ideas that you can exploit for long-term income:

Top 10 Cloud-based Business ideas

1. Cloud computing consulting

Many individuals and businesses are becoming aware of the benefits of cloud computing and its advantages over traditional storage methods. But most people feel completely at sea when it comes to understanding how to move their systems and files onto the cloud storage platform. You can make a lot of money helping such individuals and businesses migrate to the cloud.

2. Tutoring

For security and other reasons, many individuals and businesses would fret at the idea of hiring a freelance contractor to help them with their migration to the cloud. Rather, such individuals would prefer to learn how it works, so that they can handle the migration themselves.

Similarly, many businesses would prefer hiring you to train their in-house staff on the application of cloud computing. So, you can make a lot of money from just teaching people how to apply cloud computing to their businesses.

3. File hosting

If you have the required background and expertise, then you can make a lot of money by setting up your own platform for helping people hold their files in the cloud. That is, you can set up a cloud storage solution like Dropbox, Google Docs, Amazon AWS or Evernote, and charge people for storing their files.

4. Cloud platform engineering

With a solid background in software or systems engineering, you can make money working as a cloud platform engineer. This position goes beyond helping individuals and businesses migrate to the cloud; it also involves handling all of the technicalities and intricacies involved. After the initial setup, you would be called on at intervals for maintenance and routine checks. And of course, you will get paid each time.

5. Cloud computing technologist

This involves working with companies that provide cloud-computing solutions. As a cloud-computing technologist, you will work with the company’s engineers to set up the company’s platform and packages. You will also help to set up a user-friendly interface for their customers.

6. Cloud OS developer

A cloud OS developer analyzes, designs, programs, debugs, and modifies software enhancements and/or new products used in local, networked, or internet-related computer programs, primarily for end users.

As a cloud OS developer, you will also be required to test applications and interact with users to define system requirements and necessary modifications. You will earn a lot of money working for companies as an independent contractor. And there is no limit to the number of companies you can work with.

7. Cloud automation engineering

Working as an automation engineer, you will be responsible for deep automation of cloud services that will enable the company’s software development team to rapidly prototype, build, and deploy product offerings to their customers. You will need a deep understanding of cloud architectures and arrangements.

8. Cloud software engineering

This simply involves developing software that will ease the use of the cloud platform.

9. Web hosting

Yes, the popular web hosting business is an application of cloud computing, since you will help individuals and businesses hold their web files and keep them secure. So you can set up your own web hosting company and make money.

10. Blogging (on cloud)

Because many people are yet to fully understand how the cloud works, you can make a lot of money in the long term by starting a blog that discusses everything about cloud computing.

Source:http://www.mytopbusinessideas.com/cloud-based/

Friday 27 December 2013

Screen scraping: How to profit from your rival's data

Screen scraping might sound like something you do to the car windows on a frosty morning, but on the internet it means copying all the data on a target website.

"Every corporation does it, and if they tell you they're not they're lying," says Francis Irving, head of Scraper Wiki, which makes tools that help many different organisations grab and organise data.

To copy a document on a computer, you highlight the text using a mouse or keyboard command such as Control A, Control C. Copying a website is a bit trickier because of the way the information is formatted and stored.

Typically, copying that information is a computationally intensive task that means visiting a website repeatedly to get every last character and digit.

If the information on that site changes rapidly, then scrapers will need to visit more often to ensure nothing is missed.

And that heavy toll on computational resources is one of the reasons why many websites actively try to stop screen scraping. Servers can be slowed down and bandwidth soaked up by scrapers scouring every webpage for data.

"Up to 40% of the data traffic visiting our clients sites is made up of scrapers," says Mathias Elvang, head of security firm Sentor, which makes tools to thwart the data-grabbing programs.

"They can be spending a lot of money for infrastructure to serve the scrapers."

(Image: Betting aggregators often target the odds offered on particular sports events)

And that's the problem. Instead of serving customers, a firm's web resources are helping computer programs that have no intention of spending any money.

Data loss

What's worse is that those scrapers are likely to be working for your rivals, says Mike Gaffney, former head of IT security at Ladbrokes, who spent a lot of his time at the bookmakers combating scrapers.

"Ladbrokes was blocking about one million IP addresses on a daily basis," he says, describing the scale of the scraping effort directed against the site.

Many of those scrapers were being run by unscrupulous rivals abroad that did not want to pay to get access to the data feed Ladbrokes provides of its latest odds, he says.

Instead, they got it for free via a scraper and then combined it with similar data scraped from other sites to give visitors a rounded picture of all the odds offered by lots of different bookmakers.

"It's important that your pricing information is kept as close to the chest as possible away from the competitor but is freely available to the punter," says Mr Gaffney.

The key, he said, was blocking the scraping traffic but letting the legitimate gamblers through.

The sites most often targeted by scrapers are those that offer time-sensitive data. Gambling firms offering odds on sports events are popular targets as are airlines and other travel firms.

The problem, says Shay Rapaport, co-founder of anti-scraping firm Fireblade, is determining whether a visitor is a human looking for a cheap flight or an automated program, or bot, intent on sucking all the data away.

"It's growing because it's easy to scrape and there are so many tools out there on the web," he says.

The best scraping programs mimic human behaviour and spread the work out among lots of different computers. That makes it hard to separate PC from person, he adds.

In many countries scraping is not illegal, adds Mr Rapaport, so scrupulous and unscrupulous businesses alike indulge in it.

(Image: Scraping has helped make parliamentary debates and voting records more accessible)

"A lot of big companies scrape content," he says. "Sometimes it's published on the web and re-packaged and sometimes it's just for internal use for business leads."

Talking heads

Francis Irving, head of ScraperWiki, says that not all of that grabbing of data is bad. There are legitimate uses to which it can be put.

For instance, says Mr Irving, good scraping tools can help to index and make sense of huge corpuses of data that would otherwise be hard to search and use.

Scrapers have been used to grab data from Hansard, which publishes the voting records of the UK's MPs and transcribes what they say in the Houses of Parliament.

"It's pretty uniform data because they have a style standard but it was done by humans so there's the odd mistake in it here and there," he says.

Scraping helped to organise all that information and get it online so voters can keep an eye on their elected representatives.

In addition, he says, it can be used to get around bureaucratic and organisational barriers that would otherwise stymie a data-gathering project.

And, he says, it's worth remembering that the rise of the web has been driven by two big scrapers - Google and Facebook.

In the early days the search engine scraped the web to catalogue all the information being put online and made it accessible. More recently, Facebook has used scraping to help people fill out their social network.

"Google and Facebook effectively grew up scraping," he says, adding that if there were significant restrictions on what data can be scraped then the web would look very different today.

Source:http://www.bbc.co.uk/news/technology-23988890

Hiring A Pro Air Duct Cleaning Service

When hiring a company that provides air duct cleaning services, you should use common sense. Do some background research on the companies you are considering. With the internet you can readily find out about any company you are looking at and discover whether it has a history of business complaints. You should ask any firm you are thinking about hiring questions about your air conditioning system, and make sure they are knowledgeable about their work.

Are they licensed? Many states require companies that clean air ducts to be licensed; if they should be and are not, then this is a definite red flag. It is also extremely important to get an estimate in writing and to inform the company that any significant change in what they charge needs to be authorized by you before they continue working.

As with all aspects of household repair and maintenance, cleaning out dirty ducts is important. Allowing ductwork to become excessively dusty can have a damaging effect on your health and could shorten the life of your air conditioning system. Whenever you consider hiring any company to work on your home, make certain you are informed about them. Do a little research, ask them questions, and obtain estimates in writing. Any reputable company should be happy to talk with you about the work they will be performing as well as give you a written estimate.

Hiring a company that provides air duct cleaning services is just like hiring any other contractor: as long as they are a reputable business, they should provide you with quality service. So if you find a lot of dust around your air conditioning vents, don't ignore the problem or put it off until it gets out of hand. Hire a company that offers air duct cleaning services to help protect the health of your family and the performance of your air conditioner.

Source:http://www.tampabaycleaning.com/176-hiring-a-pro-air-duct-cleaning-service-4

Basic Rules to Use for Your Data Entry Business

Setting up a data entry business from home sounds like a daunting prospect, but with a few basic requirements in place and the knowledge of what to look out for, it is much easier than it sounds.

So What is Required?

Essentially, all a person needs to get started with a data entry business is a computer with a regular Internet connection, MS Word, Excel and/or Access, and the ability to type reasonably quickly and, naturally, accurately. Adobe Reader, to view or work on PDF files, may also be necessary.

Then, of course, they will have to find work. This is where it gets a little more difficult, because many of the myriad data entry opportunities advertised on the Internet will ultimately turn out to be elaborate scams set up to deceive people into handing over their money.

This should not, however, discourage an individual from trying. There are also many genuine, well-paid jobs out there, and it is simply a matter of sorting the wheat from the chaff, so to speak. Knowing what to look out for, and how to check out potential providers of work, will protect a job-seeker from falling victim to scam artists.

Finding Data Entry Work

By following a set of basic rules, it is possible to avoid scams and get started without major pitfalls and costly mistakes. There are basically just three simple tips for checking out a potential person or company offering work.

Rule Number One - Avoiding Programs

The first rule is never to get involved with people, companies or so-called programmes that require the person looking for work to pay up front. Real employers pay for work; they don't ask people to pay them!

Let's face it, nobody would expect to pay to get a job interview on their High Street or on an industrial estate. The same applies to Internet based work. If it is genuine, no advance payment will be required.

Rule Number Two - Checking the Company

Even if there doesn't appear to be an obvious problem with a potential employer, the best advice is to check them out thoroughly before submitting any work. Some companies have been known to accept the work and then fail to pay for it.

Although this is comparatively rare, it does happen, and a quick enquiry with the Better Business Bureau (BBB) or the Small Business Administration (SBA) will reveal whether a company can be trusted to pay on time.

Posting a query on a public forum can also be an excellent resource when trying to determine the authenticity of a company. If there is a problem, someone will know and respond to the query with a warning.

Rule Number Three - Checking the Work

An additional way of checking is to take a good look at how the work to be done is presented. A good, genuine employer will detail how they want the finished work to look, including details on file formats, formatting of text, the deadline for submission and rates of pay.

File formats usually include DOC or RTF, Excel or occasionally Access files, PDF, HTML or SGML. Often the work is provided in the actual format it should be returned in.

The applicable rates of pay should equally be outlined clearly; usually the rates are per quantity submitted, rather than consisting of fantastic promises of easy money. Data entry, like any other work, is not easy money; earnings have to be worked for. Anyone promising otherwise should be regarded as dubious at best and double-checked before falling into a trap.

Source:http://ezinearticles.com/?Basic-Rules-to-Use-for-Your-Data-Entry-Business&id=6558026

Thursday 26 December 2013

Benefits Of Article Writing Services

Even if you are a good writer, you may want to consider employing article writing services for your online business. High-quality article development takes precious time and talent, and using article writing services allows you and your staff to focus on other critical aspects of your business. When you let other people take care of branding, search engine optimization, and user-friendly content creation, you can dedicate more time to building your products, helping your customers, and everything else that sets your company apart from the competition. Here are some of the benefits of high-quality article writing services.

Readable, Intriguing Articles

No matter what product you promote or service you provide, your website needs to cater to your clients. You need articles that not only pitch a sale, but that readers will engage with long enough to develop an interest in your business. Great article writing services specialize in delivering grammatically sound, well-structured pieces which efficiently and entertainingly communicate your business's unique selling point. Hire great writers, and you'll have far more time to actually deliver on that point.

Branding, Authority, and a Loyal Readership

At a time when thousands of websites seem to offer the same products, services, and material, branding is crucial to your long-term success. When people visit your website, they want to see new, fresh material that delivers a message they have not already read tens or hundreds of times. They want something distinctive. Great writers will establish the uniqueness of your business by adapting to and further developing your website's style. They will also research your niche in order to write articles with a tone that speaks directly to your readers' deepest desires. By employing a great content service, you can establish yourself as an authority in your field and build a loyal group of readers who will ultimately become buyers.

Commitment and Professionalism

One of the biggest problems with general SEO and web design firms is that they take on too many projects and sub-projects at once, creating an overall product that is decent but unremarkable. They exemplify the "jack of all trades, master of none" cliché, and their customers suffer because of it. On the other hand, a committed article writing service puts its whole team's focus into writing and editing superior posts for your website. Hiring a writing team in addition to a web design service may cost more in the short term, but the impeccable articles you get will pay for themselves many times over with higher traffic and conversion rates.

SEO Content that Converts

Many web development companies look at SEO and user-friendly content as totally separate entities. The problem with this mentality is that purely "search engine-friendly" articles are usually rife with uninteresting filler and unreadable, keyword-stuffed sentences. To avoid littering your stylish website with this kind of fluff, you need content that is both highly readable and optimized for traffic. Good article writing firms expertly weave keywords and LSI terms into your messages to create articles which will achieve high search engine rankings and convert the readers who click on them.

Source:http://bpel.xml.org/blog/benefits-of-article-writing-services

Data journalism’s ‘secret weapon’, data newswires, and the newest data-scraping tools for journalists.

When investigative reporter and journalism instructor Chad Skelton needed help writing a curriculum for a data journalism course, he turned to NICAR-L, the email listserv of the National Institute for Computer-Assisted Reporting, for advice. Skelton says that virtually every data journalist in North America is plugged in to the NICAR listserv, making it data journalism’s “secret weapon.”

In 5 tips for a data journalism workflow, the Online Journalism Blog advises newsrooms to find and tap into “data newswires” in the same way newsrooms have used traditional newswires like AP and Reuters.

The newest data-scraping tool for non-coding journalists, Import.io, launched in public beta this week. Import.io allows data scraping from any website, and can create a single searchable database using information from several sources.

South Africa hosted a two-day hackathon this week, the first Editors Lab hackathon held in Southern Africa. The event was organized by the Global Editors Network (GEN), the African Media Initiative (AMI) and Google.

And finally, Owen Thomas writes on readwrite.com that the media world has a lot to learn from technologists like Jeff Bezos and Keith Rabois.

Source:http://strata.oreilly.com/2013/09/data-journalisms-secret-weapon-data-newswires-and-the-newest-data-scraping-tools-for-journalists.html

Tuesday 17 December 2013

Building a Website Scraper using Chrome and Node.js

A couple of months back, I did a proof of concept to build a scraper entirely in JavaScript, using webkit (Chrome) as a parser and front-end.

Having investigated seemingly expensive SaaS scraping software, I wanted to tease out what the challenges are, and open the door to some interesting projects. I have some background in data warehousing, and a little exposure to natural language processing, but in order to do any of those things I needed a source of data.

The dataset I built is 58,000 Flippa auctions, which have fairly well-structured pages with fielded data. I augmented the data by doing a crude form of entity extraction to see what business models or partners are most commonly mentioned in website auctions.

Architecture

I did the downloading with wget, which worked great for this. One of my concerns with the SaaS solution I demoed is that if you made a mistake in parsing one field, you might have to pay to re-download some subset of the data.

One of my goals was to use a single programming language. In my solution, each downloaded file is opened in a Chrome tab, parsed, and then closed. I used Chrome because it is fast, but this should be easily portable to Firefox, as the activity within Chrome is a Greasemonkey script. Opening the Chrome tabs is done through Windows Scripting Host (WSH). The Chrome extension connects to a Node.js server to retrieve the actual parsing code and save data back to a Postgres database. Having JavaScript on both client and server was fantastic for handling the back and forth communication. Despite the use of a single programming language, the three scripts (WSH, Node.js, and Greasemonkey) have very different APIs and programming models, so it’s not as simple as I would like. Being accustomed to Apache, I was a little disappointed that I had to track down a script just to keep Node.js running.

Incidentally, WSH is using Internet Explorer (IE) to run its JavaScript; this worked well, unlike the typical web programming experience with IE. My first version of the script was a cygwin bash script, which involved too much resource utilization (i.e. threads) for cygwin to handle. Once I switched to WSH I had no further problems of that sort, which is not surprising considering its long-standing use in corporate environments.

Challenges

By this point, the reader may have noticed that my host environment is Windows, chosen primarily to get the best value from Steam. The virtualization environment is created on VirtualBox using Vagrant and Chef, which make creating virtual machines fairly easy. Unfortunately, it is also easy to destroy them. I kept the data on the main machine, backed up in git, to prevent wasting days of downloading. This turned out to be annoying because it required dealing with two operating systems (Ubuntu and Windows), which have different configuration settings for networking.

As the data volume increased, I found many new defects with this approach. Most were environmental issues, such as timeouts and settings for the maximum number of TCP connections (presumably these are low by default in Windows to slow the spread of bots).

Garbage collection also presented an issue, since the Chrome processes consume resources at an essentially fixed rate (their memory disappears when the process ends). The garbage collection in Node.js causes a sawtooth memory pattern. During this process many Chrome tabs open. The orchestration script must watch for this in order to slow down and allow Node.js to catch up. This script should also pause if the CPU overheats; unfortunately I have not been able to read CPU temperature. Although this capability is supposedly supported by Windows APIs, it is not supported either by Intel’s drivers or my chip.

Successes

A while back I read about Netflix’s Chaos Monkey and tried to apply its principle of assuming failure to my system. Ideally a parsing script should not stop in the middle of a several-day run, so it is necessary to handle errors gracefully. Although the scripts have fail-retry logic, it unfortunately differs in each. Node.js restarts if it crashes because it is running in tandem with Forever. The orchestration script doesn’t seem to crash, but supports resumption at any point, and watches the host machine to see if it should slow down. The third script, the Chrome extension, watches for failures from RPC calls, and does exponential backoff to retry.
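The fail-retry logic described here is language-agnostic. As a rough illustration only (not code from this project, and the function name is invented), an exponential backoff wrapper in Python could look like this:

    import random
    import time

    def call_with_backoff(rpc_call, max_attempts=5, base_delay=1.0):
        """Retry a flaky call, roughly doubling the wait after each failure."""
        for attempt in range(max_attempts):
            try:
                return rpc_call()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up after the final attempt
                # Exponential backoff plus a little jitter to avoid retry storms.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))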

Using the browser as a front-end gives you a free debugger and script interface, as well as a tool to generate xpath expressions.

Possibilities

The current script runs five to ten thousand entries before requiring attention. I intend to experiment with PhantomJS in order to improve performance, enable sharding, and support in-memory connections.

Source:http://www.garysieling.com/blog/building-a-website-scraper-using-chrome-and-node-js

Monday 16 December 2013

The “Ultimate Guide to Web Scraping” is Now Available

I wrote an article on web scraping last winter that has since been viewed almost 100,000 times. Clearly there are people who want to learn about this stuff, so I decided I’d write a book.

A few months later, I’m happy to announce: The Ultimate Guide to Web Scraping.

No prior knowledge of web scraping is necessary to follow along — the book is designed to walk you from beginner to expert, honing your skills and helping you become a master craftsman in the art of web scraping.

The book talks about the reasons why web scraping is a valid way to harvest information — despite common complaints. It also examines various ways that information is sent from a website to your computer, and how you can intercept and parse it. We’ll also look at common traps and anti-scraping tactics and how you might be able to thwart them.

There are code samples in both Ruby and Python — I had to learn Ruby just so I could write the code samples! If anyone’s willing to translate the sample code into PHP or Javascript, I’ll give you a free copy of the book. Get in touch.



Check out the table of contents:

    Introduction to Web Scraping

    Web Scraping as a Legitimate Data Collection Tool

    Understand Web Technologies: A Brief Introduction to HTTP and the DOM

    Finding The Data: Discovering Your “API”

    Extracting the Data: Finding Structure in an HTML Document

    Sample Code to Get You Started

    Avoiding Common Scraping Traps

    Being a Good Web Scraping Citizen

As a special deal for my blog subscribers, get 20% off with the code BLOGSUB. That coupon code is only good for a limited time, so order your copy today!

Source: http://blog.hartleybrody.com/web-scraping-guide/

Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene setting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML(“”,”table”,N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and start to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:

=ImportHtml(“http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population”,”table”,1)

And as if by magic, a data table appears:
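If you would rather pull the same table outside a spreadsheet, a roughly equivalent Python sketch uses pandas.read_html (this assumes pandas and an HTML parser such as lxml are installed, and the table index is a guess that may need adjusting if the page has changed):

    import pandas as pd

    url = ("http://en.wikipedia.org/wiki/"
           "List_of_largest_United_Kingdom_settlements_by_population")

    # read_html returns a list of DataFrames, one per HTML table on the page
    tables = pd.read_html(url)
    population = tables[1]  # pick the table of interest, much like ImportHtml's index
    print(population.head())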

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well as publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the Wikipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original row of the feed (alternatively, tweak the CSV URL so it starts with row 2); we might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…

…DEPRECATED

Geocoding in Yahoo Pipes is done more reliably through the following trick – replace the Location Builder block with a Loop block, into which you should insert a Location Builder block.

(Image: the Yahoo Pipes Loop block)

The location builder will look to a specified element for the content we wish to geocode:

(Image: the Yahoo Pipes Location Builder block)

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocoder often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identifier before you try to geocode them. A regular expression block can do this:

(Image: the regular expression block that appends “, UK” to the title)

This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with its original value – $1 – followed by “, UK” – $1, UK.

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.
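For anyone replicating this step outside Yahoo Pipes, the same rewrite is a couple of lines with Python’s re module (purely illustrative, not part of the original pipe):

    import re

    title = "Cambridge"
    # Capture the whole title as group 1, then append ", UK" to it
    geocodable = re.sub(r"^(.*)$", r"\1, UK", title)
    print(geocodable)  # Cambridge, UK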

Lurvely…

And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a Wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a Google map.

FAQ: Data journalism, scraping and Help Me Investigate

We call it data driven journalism (DDJ) nowadays, and we used to call it computer assisted reporting. Did only the name change or has there been a more profound change in this set of skills and methods?

I talk a bit about this in the data journalism chapter of the Online Journalism Handbook. I think it’s a qualitative and quantitative change: CAR was primarily about using spreadsheets and databases on datasets, locally on your computer.

DDJ is about the shift to datasets and tools that are available via the network: it includes automation of process (e.g. scraping, querying APIs), and it includes the expansion of ‘data’ beyond spreadsheets to include a vastly expanded range of digitised information: connections, images, audio, video, text. It also includes the shift from using CAR for newsgathering to using DDJ techniques for the actual communication of the story: from interactive databases to live visualisation, data-driven tools and apps, and so on.

With the ubiquity of new tools, such as tools for data visualisation, should every journalist be a data journalist nowadays? Can every journalist adopt these skills?

Every journalist should look at ways they can incorporate DDJ techniques in their work – that might be automating part of their newsgathering (for example setting up, combining, filtering and publishing RSS feeds) but more widely it should be a recognition of what information in their field is digitized.

That used to be a minor aspect of reporting – occasional reports or statistical releases. But now it’s a regular and central part of everything from sport and fashion to politics and crime. You could possibly justify ignoring that in the past but to do so now will look increasingly ignorant and lazy. The expectation is higher that journalists justify their roles with something more than filling space.

Talking about visualisations, is visual representation of information getting more important than the actual textual narration online?

I don’t think so. I think it’s more important than it has been, simply because there’s an increased recognition that some people prefer to communicate and consume visually. But it’s not an either/or: visualisation has strengths and weaknesses, as does text, and each will work better for different stories.

You recently published the book “Scraping for Journalists”. Scraping is a new term in the Balkans, not widespread among journalists. So could you tell us what scraping is in DDJ and a bit more about your book – it seems to be a manual, what exactly can we learn from it?

Scraping is the process of gathering and storing data from online sources. That might be a single table on a webpage, or it might be information from hundreds of webpages, the results of a database search, or dozens of spreadsheets, or thousands of PDF reports.

It’s a very important way for journalists to be able to ask questions quickly without having to spend days or weeks manually printing off and poring through documents. When governments make it harder to ask questions, scraping pulls down some of those barriers.

The book teaches you how to write a very basic scraper in five minutes using Google Docs, and then takes you through more and more powerful scrapers using a range of free tools to solve a number of typical problems that journalists face. I really wrote it because I realised that journalists were trying to learn programming in the same way that they learned journalism – when actually what’s really important is not learning programming languages, but problem-solving techniques. I wanted a book that didn’t finish with the last page.

Do you cover DDJ in your university classes? What do you focus on – concrete skills and tools or changes DDJ is bringing to the media landscape or….?

Yes – I teach data journalism both on the MA in Online Journalism at Birmingham City University that I lead, and the MA courses at City University London, where I’m a visiting professor. I focus on some core techniques – such as spreadsheet tips and visualisation principles – along with those problem-solving techniques I mentioned earlier: where to look for solutions and the importance of engaging with online communities. I also talk about the context: how different parts of the media are using these data skills, both editorially and commercially.

Any particular tools in the DDJ toolbox you would recommend to beginners? And how should a self-taught journalist start learning DDJ skills?

I always advise students and trainees to start with the stories they’re reporting, rather than particular skills. A sports reporter will need different skills to an education correspondent, or a business reporter. Start with a simple question you have and see if you can find the data to answer it – or look for a simple clean dataset in your field and see what simple stories you can find in it: who’s top and bottom? Where does the money go? Then put the data aside and pick up the telephone.

Some skills are more likely to come in useful than others: using advanced search operators to find spreadsheets and reports, for example. Pivot tables and advanced filters in spreadsheets. Knowing about making FOI requests might be useful to some; scraping for others.

There are some good books for the big view (some here), but mostly they should be in the habit of searching online for answers to individual problems – and having conversations in online communities like NICAR, the Wobbing (European FOI) mailing list, the Scraperwiki mailing list, and so on.

How would you sell the need to invest resources in teaching journalists DDJ skills to a disinterested newsroom management?

Don’t fall for the myth that data journalism is ‘resource intensive’. It can save your staff time and money. It can lead to stories that are stickier, gather more user data, and are more appealing to advertisers. It can lead to new commercial opportunities and new revenue streams. It can differentiate your content from the commodity news that is rapidly depreciating in value.

You are a founder of the website HelpmeInvestigate. Tell us about the website.

HelpMeInvestigate.com was set up in 2009 to explore “crowdsourcing” investigative journalism – in other words collaborating with members of the public to do public interest investigations. The project is completely voluntary and has no paid staff. Investigations range from simple questions to longform investigations – the most recent was an investigation into the allocation of Olympic torchbearer places which led to coverage in The Guardian, Independent, Daily Mail, BBC radio, local newspapers across the UK, and even a German newspaper. The fruits of the investigation were published as a longform ebook – 8,000 Holes: How the 2012 Olympic Torch Relay Lost Its Way – in the final week of the Olympic torch relay, which was incredible to be able to do. The book is free by the way, but users can also pay a donation to the Brittle Bone Society.

Source:http://onlinejournalismblog.com/2012/12/19/faq-data-journalism-scraping-and-help-me-investigate/

Saturday 14 December 2013

Data Scraping Guide for SEO & Analytics

Web scraping or web data scraping is a technique used to extract data from web documents like HTML and XML files. Data scraping can help you a lot in competitive analysis like determining the titles, keywords, content categories and ad copies used by your competitors.

You can quickly get an idea of which keywords are driving traffic to your competitors’ websites, which content categories are attracting links and user engagement, and what kind of resources it will take to rank your site.

You can then replicate all the good strategies used by your competitors.  The idea is to do what your competitors are doing and do it even better to outrank them. Through competitive analysis you can get a head start in any SEO project.

Scraping Organic Search Results

Scrape organic search results to quickly find out your SEO competitors for a particular search term. Determine the title tags and keywords they are using. The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For example, if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you see the following results:

You can copy and paste the websites’ URLs and title tags easily into your spreadsheet from the text boxes.

    Pro Tip by Tahir Fayyaz:

    Just wanted to add a tip for people using the SERPs Redux bookmarklet.

    If you have data separated over multiple pages that you want to scrape, you can use AutoPager for Firefox or Chrome to load any number of pages on a single page and then scrape it all using the bookmarklet.

Another cool way of scraping SERPs is through the Keyword Difficulty tool from SEOmoz. Through this tool you can scrape organic search results along with all the cool metrics offered by SEOmoz like PA, DA, linking root domains, etc.

You can download this report into excel by clicking on ‘Export to CSV’.

Scraping on page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like: Title tag, Meta description tag, Meta keywords tag, Meta robots tag, H1 tag, H2 tag, HTTP Header, Backlinks, Facebook likes etc.

Scraping data through Google Docs

Google Docs provides a function known as importXML through which you can import data from web documents directly into a Google Docs spreadsheet. However, to use this function you must be familiar with XPath expressions.

    Syntax: =importXML(URL,X-path-query)

    url=> URL of the web page from which you want to import the data.

    x-path-query => A query language used to extract data from web pages.

You need to understand the following things about XPath in order to use the importXML function (a short worked example follows the list below):

1. XPath terminology – what nodes are and the kinds of nodes, such as element nodes, attribute nodes, etc.

2. Relationships between nodes – how different nodes are related to each other, such as parent node, child node and siblings.

3. Selecting nodes – a node is selected by following a path known as the path expression.

4. Predicates – they are used to find a specific node, or a node that contains a specific value. They are always embedded in square brackets.
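Here is a tiny, self-contained Python/lxml sketch that puts those four ideas together (the HTML snippet is invented purely for illustration):

    from lxml import etree

    html = """
    <html><body>
      <ol id="tools">
        <li><a href="/ga">Google Analytics</a></li>
        <li><a href="/gah">Google Analytics Help Center</a></li>
      </ol>
    </body></html>
    """

    tree = etree.HTML(html)
    # Path expression: from the element node whose id attribute equals "tools"
    # (a predicate in square brackets), select the text of every child li's link.
    names = tree.xpath("//ol[@id='tools']/li/a/text()")
    print(names)  # ['Google Analytics', 'Google Analytics Help Center']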

If you follow the XPath tutorial then it should not take you more than an hour to understand how XPath expressions work. Understanding path expressions is easy, but building them is not. That is why I use a Firebug extension named ‘X-Pather’ to quickly generate path expressions while browsing HTML and XML documents. Since X-Pather is a Firebug extension, you first need to install Firebug in order to use it.

How to scrape data using importXML()

Step-1: Install Firebug – through this add-on you can edit and monitor CSS, HTML, and JavaScript while you browse.

Step-2: Install X-pather – Through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For example, if you want to scrape anchor text, then select one anchor text.

Step-4: Right-click on the selected text and then select ‘Show in XPather’ from the drop-down menu.

Then you will see the XPather browser, from which you can copy the XPath.

Here I have selected the text ‘Google Analytics’, which is why the XPather browser is showing ‘Google Analytics’ in the content section. This is my XPath:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[1]/a

Pretty scary, huh? It can be even scarier if you try to build it manually. I want to scrape the names of all the analytics tools from this page: killer SEO tools. For this I need to modify the aforesaid path expression into a formula.

This is possible only if I can determine the static and variable nodes between two or more path expressions. So I determined the path expression of another element, ‘Google Analytics Help center’ (second in the list), through X-Pather:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[2]/a

Now we can see that the node which has changed between the original and new path expressions is the final ‘li’ element: li[1] to li[2]. So I can come up with the following final path expression:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a

Now all I have to do is copy and paste this final path expression as an argument to the importXML function in a Google Docs spreadsheet. The function will then extract the names of all the Google Analytics tools from my killer SEO tools page.

This is how you can scrape data using importXML.
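For completeness, the same generalised path expression can be evaluated in Python with requests and lxml. This is only a sketch: the page URL below is a placeholder (the original post links to the page rather than spelling out its address), and the XPath will only work while the page keeps the structure shown above.

    import requests
    from lxml import html

    xpath = ("/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']"
             "/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']"
             "/div[@id='mask-1']/div[@id='primary-content']/div/div"
             "/div[@id='post-58']/div/ol[2]//li/a")

    page = requests.get("http://example.com/killer-seo-tools")  # placeholder URL
    tree = html.fromstring(page.content)
    tool_names = [a.text_content() for a in tree.xpath(xpath)]
    print(tool_names)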

    Pro Tip by Niels Bosma: “Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.”

    To use the XPathOnUrl function you first need to install Niels Bosma’s Excel plugin. It is not a built-in function in Excel.

Note: You can also use a free tool named Scrapy for data scraping. It is an open source web scraping framework and is used to extract structured data from web pages and APIs. You need to know Python (a programming language) in order to use Scrapy.
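For readers who do know some Python, a minimal Scrapy spider looks roughly like the sketch below. The start URL and CSS selector are placeholders of my own, not something prescribed by Scrapy or by this guide.

    import scrapy

    class ToolsSpider(scrapy.Spider):
        name = "tools"
        start_urls = ["http://example.com/killer-seo-tools"]  # placeholder URL

        def parse(self, response):
            # Yield one record per link found in the page's ordered lists
            for link in response.css("ol li a"):
                yield {
                    "name": link.css("::text").get(),
                    "url": response.urljoin(link.attrib.get("href", "")),
                }

Saved as tools_spider.py, it can be run with ‘scrapy runspider tools_spider.py -o tools.json’ to write the scraped records to a JSON file.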

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What makes these tools amazing is that you can scrape the data of an entire website and download it into Excel. So if you want to know the keywords used in the title tags of all the web pages on your competitor’s website, then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.

Scraping organic and paid keywords of an entire website

The tool that I use for scraping keywords is SEMrush. Through this awesome tool I can determine which organic and paid keywords are driving traffic to my competitor’s website and then download the whole list into Excel for keyword research. You can get more details about this tool in this post: Scaling Keyword Research & Competitive Analysis to new heights.

Scraping keywords from a webpage

Through this Excel macro spreadsheet from SEOgadget you can fetch keywords from the text of a URL or list of URLs. However, you need an Alchemy API key to use this macro.

You can get the Alchemy API key from here.

Scraping keywords data from Google Adwords API

If you have access to the Google AdWords API then you can install this plugin from the SEOgadget website. This plugin creates a series of functions designed to fetch keyword data from the Google AdWords API, such as:

getAdWordAvg() – returns average search volume from the AdWords API.

getAdWordStats() – returns local search volume and the previous 12 months, separated by commas.

getAdWordIdeas() – returns keyword suggestions based on the API suggest service.

Check out this video to see how this plug-in works.

Scraping Google Adwords Ad copies of any website

I use the tool SEMrush to scrape and download the Google AdWords ad copies of my competitors into Excel and then mine keywords or just get ad copy ideas. Go to SEMrush, type the competitor’s website URL and then click on the ‘Adwords Ad texts’ link in the left-hand menu. Once you see the report, you can download it into Excel.

Scraping back links of an entire website

The tool that you can use to scrape and download the backlinks of an entire website is Open Site Explorer.

Scraping Outbound links from web pages

Garrett French of Citation Labs has shared an excellent tool, OBL Scraper + Contact Finder, which can scrape outbound links and contact details from a URL or URL list. This tool can help you a lot in link building. Check out this video to learn more about this awesome tool:

Scraper – Google chrome extension

This Chrome extension can scrape data from web pages and export it to Google Docs. The tool is simple to use: select the web page element/node you want to scrape, then right-click on the selected element and select ‘Scrape similar’.

Any element/node that’s similar to what you have selected will be scraped by the tool, and you can later export the results to Google Docs. One big advantage of this tool is that it reduces our dependence on building XPath expressions and makes scraping easier.

See how easy it is to scrape the names and URLs of all the analytics tools without using XPath expressions.

Note: You may need to edit the XPath if the results are not what you were expecting.

This post is very much a work in progress. If you know more cool ways to scrape data then please share in the comments below.

Source:http://www.seotakeaways.com/data-scraping-guide-for-seo/

Friday 13 December 2013

How Linkedin Profile Scraping Can Help Your Business

As the world changes and technology evolves, there has been an increased demand for automation tools. These tools help to reduce the amount of labor personnel that businesses need to hire. What used to take a long time is now done in a couple of minutes when automated tools are used. Linkedin Profile Scraping services allow businesses to get a huge list of contacts in just a matter of minutes. These services cut the burden on employees and increase the speed at which the job is done. They allow any business to quickly and effectively collect information without having to hire someone to sit there and do it manually.

Every business needs a long list of contacts if it wants to be successful. It is impossible for a business to succeed without contacts in the industry. The problem is that this contact information is often hard to come by. When someone does happen to have this information, they are not willing to part with it at a reasonable price. Linkedin Profile Scraping companies make this process a lot easier. These companies specialize in scraping up-to-date data at affordable prices. They have specially designed software that can segment and search through all the databases of people on Linkedin. They can offer such a great discount because their software can do the job with minimal supervision. Once the software is created, there is not a lot of overhead for these companies. They then pass the savings on to their customers.

When you get your project report, the information will be available in an easy-to-sort text or CSV file. This simplified information will only contain the fields that you asked for: typically a person’s name, phone number and email address. You can ask for whatever information is relevant to your needs and the company can scrape it for you. Linkedin Profile Scraping companies offer a simple solution to a problem that has plagued businesses for years. They offer a simple alternative to the usual ways of obtaining contacts, and you can easily amass a large number of contacts when using one of these companies.

Source: http://thewebscraping.com/linkedin-profile-scraping/

How to Scrape Websites for Data without Programming Skills

Searching for data to back up your story? Just Google it, verify the accuracy of the source, and you’re done, right? Not quite. Accessing information to support our reporting is easier than ever, but very little information comes in a structured form that lends itself to easy analysis.

You may be fortunate enough to receive a spreadsheet from your local public health agency. But more often, you’re faced with lists or tables that aren’t so easily manipulated. It’s common for data to be presented in HTML tables — for instance, that’s how California’s Franchise Tax Board reports the top 250 taxpayers with state income tax delinquencies.

It’s not enough to copy those numbers into a story; what differentiates reporters from consumers is our ability to analyze data and spot trends. To make data easier to access, reorganize and sort, those figures must be pulled into a spreadsheet or database. The mechanism to do this is called Web scraping, and it’s been a part of computer science and information systems work for years.

It often takes a lot of time and effort to produce programs that extract the information, so this is a specialty. But what if there were a tool that didn’t require programming?

Enter OutWit Hub, a downloadable Firefox extension that allows you to point and click your way through different options to extract information from Web pages.

How to use OutWit Hub

When you fire it up, there will be a few simple options along the left sidebar. For instance, you can extract all the links on a given Web page (or set of pages), or all the images.

If you want to get more complex, head to the Automators>Scrapers section. You’ll see the source for the Web page. The tagged attributes in the source provide markers for certain types of elements that you may want to pull out.

Look through this code for the pattern common to the information you want to get out of the website. A certain piece of text or type of characters will usually be apparent. Once you find the pattern, put the appropriate info in the “Marker before” and “Marker after” columns. Then hit “Execute” and go to town.

An example: If you want to take out all the items in a bulleted list, use <li> as your before marker and </li> as your after marker. Or follow the same format with <td> and </td> to get items out of an HTML table. You can use multiple scrapers in OutWit Hub to pull out multiple columns of content.
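The “marker before / marker after” idea maps directly onto a regular expression. As a hedged Python sketch of the same extraction (my own illustration, not code generated by OutWit Hub):

    import re

    html = "<ul><li>First item</li><li>Second item</li></ul>"  # toy markup

    # Everything between the "marker before" <li> and the "marker after" </li>
    items = re.findall(r"<li>(.*?)</li>", html, flags=re.DOTALL)
    print(items)  # ['First item', 'Second item']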

There’s some solid help documentation to extend your ability to use OutWit Hub, with a variety of different tutorials.

If you want to extract more complicated information, you can. For instance, you can also pull out information from a series of similarly-formatted pages. The best way to do this is with the Format column in the scraper section to add a “regular expression,” a programmatic way to designate patterns. OutWit Hub has a tutorial on this, too.

OutWit Hub isn’t the only non-programming scraping option. If you want to get information out of Wikipedia and into a Google spreadsheet, for instance, you can.

But even when pushed to the max, OutWit Hub has its limitations. The simple truth is that using a programming language allows for more flexibility than any application that relies on pointing and clicking.

When you hit OutWit’s scraping limitations and you’re interested in taking that next step, I recommend Dan Nguyen’s four-post tutorial on Web scraping, which also serves as an introduction to Ruby. Or use programmer Will Larson’s tutorial, which teaches you about the ethics of scraping (Do you have the right to take that data? Are you putting undue stress on your source’s website?) while introducing the use of the Beautiful Soup library in Python.
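As a taste of that next step, here is a minimal Beautiful Soup sketch of my own (not taken from either tutorial) that pulls the rows of an HTML table into plain Python lists, ready to be written out as CSV; the URL is a placeholder:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get("http://example.com/delinquent-taxpayers")  # placeholder URL
    soup = BeautifulSoup(page.text, "html.parser")

    rows = []
    for tr in soup.find_all("tr"):
        # Collect the text of every cell in the row
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)

    print(rows[:5])  # first few rows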

Source:http://www.poynter.org/how-tos/digital-strategies/e-media-tidbits/102589/how-to-scrape-websites-for-data-without-programming-skills/