Friday, 22 August 2014

Scraping dynamic data


I am scraping profiles on ask.fm for a research question. The problem is that only the most recent questions are visible, and I have to click "View more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of triggering this 4 times before scraping? I want the most recent 60 posts on the site. Python is preferable.

You could probably use Selenium to browse to the website and click the button/link a few times. You can get it here:

    https://pypi.python.org/pypi/selenium
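For example, the click-then-scrape loop with Selenium might look roughly like this (a minimal sketch only; the profile URL, the fixed two-second wait and the use of Firefox are assumptions, not something tested against ask.fm):

# Minimal sketch: open a profile, click "View more" four times, then grab the HTML.
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://ask.fm/some_profile')    # hypothetical profile URL

for _ in range(4):
    button = driver.find_element_by_css_selector('input.submit-button-more')
    button.click()                          # triggers the onclick handler shown above
    time.sleep(2)                           # crude wait for the next 15 posts to load

html = driver.page_source                   # hand this to your parser of choice
driver.quit()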

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Wednesday, 20 August 2014

Web Scraping data from different sites


I am looking for a few ideas on how I can solve a design problem I'm going to face when building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem; matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If I scrape this from two different sites, I will encounter situations where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to treat these two sets of data as the same (they are the same person, but with the name recorded slightly differently on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then, when I scrape site A or B, I can over time build up a list of failures and store it per scraper, so that each scraper knows (if I find id=100, then I know that the firstname will be William, etc.). I can't help but feel this is a rubbish idea!
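A lighter-weight variant of that reference-list idea is to normalize the records before comparing them, for example by mapping common nicknames onto a canonical first name. Here is a minimal Python sketch of that approach; the nickname table and the dict fields are purely illustrative, not something from the question:

# Illustrative sketch: treat two scraped records as the same person by comparing
# normalized names instead of site-specific ids. The nickname table is an assumption.
NICKNAMES = {'bill': 'william', 'bob': 'robert', 'liz': 'elizabeth'}

def match_key(record):
    first = record['firstname'].strip().lower()
    first = NICKNAMES.get(first, first)        # map known nicknames to a canonical form
    return (first, record['surname'].strip().lower())

site_a = {'id': 100,  'firstname': 'William', 'surname': 'Doe'}
site_b = {'id': 1974, 'firstname': 'Bill',    'surname': 'Doe'}

print(match_key(site_a) == match_key(site_b))  # True -> treat as the same person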

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Tuesday, 19 August 2014

Scrape Data Point Using Python


I am looking to scrape a data point, using Python, from the URL http://www.cavirtex.com/orderbook .

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>0.0775/0.1146</b></td>
 <td><b>860.00000</b></td>
 <td><b>66.65 CAD</b></td>
</tr>

The relevant value is the 860.00000. I am looking to build this into a script that can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite a newbie, so if you could explain your thought process on why you've done certain things, it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far, which correctly returns the title of the page; I'm having trouble grabbing the table data, though.

import urllib2, sys
from bs4 import BeautifulSoup

site= "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title
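As a rough next step from that snippet (assuming the rows really are present in the static HTML, exactly as in the sample row above), the table cells can be pulled out of the same soup object like this:

# Continues from the soup object above: print the text of each table row's cells.
for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)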



Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    # strip the surrounding <b> markup from the cell's inner HTML
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)              # only the numeric cells parse as floats
        if lowest_bid > bid:
            lowest_bid = bid
    except ValueError:
        pass                           # skip the date, price-range and CAD cells

browser.quit()
print lowest_bid

To install Selenium for Python on your Windows PC, run this from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If performance is critical, you can implement the data scraping via a plain URL connection instead of Selenium, and your program will run much faster. However, the code will probably end up a lot "messier", due to the tedious HTML parsing that you'll be obliged to apply...
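For reference, here is a rough sketch of that plain-URL route, reusing the urllib2/BeautifulSoup approach from the question. It assumes the "orderbook_buy" div and its table are present in the raw HTML (i.e. not injected by JavaScript), which is not guaranteed:

# Rough sketch of the plain-URL alternative; assumes the static HTML contains
# the same div/table targeted by the XPath above.
import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request('http://www.cavirtex.com/orderbook',
                      headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(req))

lowest_bid = float('inf')
buy_div = soup.find('div', id='orderbook_buy')
for cell in (buy_div.find_all('td') if buy_div else []):
    try:
        lowest_bid = min(lowest_bid, float(cell.get_text(strip=True)))
    except ValueError:
        pass                           # skip the date, price-range and CAD cells

print(lowest_bid)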

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib,ssl

def SendMail(username,password,contents):
    server = Connect(username)
    try:
        server.login(username,password)
        server.sendmail(username,username,contents)
    except smtplib.SMTPException,error:
        print(error)
    Disconnect(server)

def Connect(username):
    serverName = username[username.index("@")+1:username.index(".", username.index("@"))]  # e.g. "gmail" from "user@gmail.com"
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException,error:
            print(error)
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException,ssl.SSLError),error:
            print(error)
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException,error:
        print(error)

serverDict = {
    "gmail"  :"smtp.gmail.com",
    "hotmail":"smtp.live.com",
    "yahoo"  :"smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com","your_password",str(lowest_bid))

The above code should work if your email provider is Gmail, Hotmail, or Yahoo.

Please note that, depending on your firewall configuration, it may ask for your permission the first time you run it...



Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python

Saturday, 16 August 2014

Data From Web Scraping Using Node.JS Request Is Different From Data Shown In The Browser

Right now I am doing some simple web scraping, for example getting the current train arrival/departure information for one railway station. Here is an example link, http://www.thetrainline.com/Live/arrivals/chester; from this link you can see the trains currently arriving at Chester station.

I am using the node.js request module to do some simple web scraping,

// Assumed requires for this snippet: request, cheerio, and colors (for the ".green" string helper below).
var request = require('request');
var cheerio = require('cheerio');
require('colors');

app.get('/railway/arrival', function (req, res, next) {
    console.log("/railway/arrival/  "+req.query["city"]);
    var city = req.query["city"];
    if (typeof city == "undefined" || city == undefined) { console.log("city is undefined"); city = "liverpool-james-street"; }
    getRailwayArrival(city,
       function(err,data){
           res.send(data);
        }
       );
});

function getRailwayArrival(station,callback){
   request({
    uri: "http://www.thetrainline.com/Live/arrivals/"+station,
   }, function(error, response, body) {
      var $ = cheerio.load(body);

      var a = new Array();
      $(".results-contents li a").each(function() {
        var link = $(this);
        //var href = link.attr("href");
        var due = $(this).find('.due').text().replace(/(\r\n|\n|\r|\t)/gm,"");   
        var destination = $(this).find('.destination').text().replace(/(\r\n|\n|\r|\t)/gm,"");
        var on_time = $(this).find('.on-time-yes .on-time').text().replace(/(\r\n|\n|\r|\t)/gm,"");
        if (on_time === "")  var on_time_no = $(this).find('.on-time-no').text().replace(/(\r\n|\n|\r|\t)/gm,"");
        var platform = $(this).find('.platform').text().replace(/(\r\n|\n|\r|\t)/gm,"");

        var obj = new Object();
        obj.due = due;obj.destination = destination; obj.on_time = on_time; obj.platform = platform;
        a.push(obj);
console.log("arrival  ".green+due+"  "+destination+"  "+on_time+"  "+platform+"  "+on_time_no);      
    });
    console.log("get station data  "+a.length +"   "+ $(".updated-time").text());
    callback(null,a);

  });
}

The code works and gives me a list of data, but the data are different from what I see in the browser, even though they come from the same URL. I don't know why that is. Is it because their server can distinguish between requests sent from a server and from a browser, and sends me the wrong data if the request comes from a server? How can I overcome this problem?

Thanks in advance.


They have probably stored a session per click event. That means that when you visit the page for the first time, a session is created and then validated for the next action you perform. Say you select a value from a drop-down list; for that click a new session value is generated, which loads the data for the selected value. When you then click to show the list, the previous session value is validated and you get the accurate data.

Now, if you don't capture that session value programmatically and pass it as a parameter with the request, you will get the default data or nothing at all. So the challenge for you is to capture that value. Use Firebug to help.

Another issue here could be that the content is generated by JavaScript run in the browser. jsdom is a module that will execute such scripts and provide the resulting content, but it is not as lightweight as cheerio.

Cheerio does not execute these scripts, and as a result that content may not be visible (as you're experiencing). This is an article I read a while back that led me to the same discovery; open the article and search for "jsdom is more powerful" for a quick answer:

Source: http://stackoverflow.com/questions/15785360/data-from-web-scraping-using-node-js-request-is-different-from-data-shown-in-the?rq=1

Tuesday, 5 August 2014

How Your Online Information is Stolen - The Art of Web Scraping and Data Harvesting

Web scraping, also known as web/internet harvesting, involves the use of a computer program which is able to extract data from another program's display output. The main difference between standard parsing and web scraping is that the output being scraped is meant for display to human viewers rather than as input to another program.

Therefore, it isn't generally documented or structured for convenient parsing. Web scraping will generally require that binary data be ignored - this usually means multimedia data or images - along with any formatting that would obscure the desired goal: the text data. This means that, in a sense, optical character recognition software is a form of visual web scraper.

Usually a transfer of data occurring between two programs would utilize data structures designed to be processed automatically by computers, saving people from having to do this tedious job themselves. This usually involves formats and protocols with rigid structures that are therefore easy to parse, well documented, compact, and designed to minimize duplication and ambiguity. In fact, they are so "computer-based" that they are generally not even readable by humans.

If human readability is desired, then the only automated way to accomplish this kind of a data transfer is by way of web scraping. At first, this was practiced in order to read the text data from the display screen of a computer. It was usually accomplished by reading the memory of the terminal via its auxiliary port, or through a connection between one computer's output port and another computer's input port.

Web scraping has therefore become a common way to parse the HTML text of web pages. The web scraping program is designed to process the text data that is of interest to the human reader, while identifying and removing any unwanted data, images, and formatting from the web design.

Though web scraping is often done for ethical reasons, it is frequently performed in order to swipe the data of "value" from another person or organization's website in order to apply it to someone else's - or to sabotage the original text altogether. Many efforts are now being put into place by webmasters in order to prevent this form of theft and vandalism.

Source: http://ezinearticles.com/?How-Your-Online-Information-is-Stolen---The-Art-of-Web-Scraping-and-Data-Harvesting&id=923976

Thursday, 31 July 2014

Article Writing Services and Article Ghostwriting Benefits

Article writing services and article ghostwriting can offer you more benefits than you might expect. With Google Penguin 2 now prowling around the web, it is essential that your web content is written with a high degree of knowledge of the Google algorithm updates.

Panda and Penguin were bad enough, but Penguin 2 is hunting down web pages that are 'over-optimized' in Google's opinion. Google is also targeting content that is not original - not necessarily just duplicate content, but also scraped content.

This is not something to ignore, as many ignored the first release of Penguin. It should be taken seriously, so here are some tips on how to avoid having your web pages and blog posts lose their current ranking. Whether you use article writing services or do it yourself, here are some of the benefits that article ghostwriting services can offer.

Keyword Density

There is no such thing anymore. Many will insist that you must have at least 2% KD on your web page for Google to list it. Nonsense! Google will decide for itself what the theme of your page is - if you have written badly, then the algorithms will possibly get the wrong message and you are toast! That is what LSI is about: Latent Semantic Indexing means indexing your page for the semantic meaning of the vocabulary used on it.

This means that you can have a #1 ranked page for a search term (keyword) that does not even appear in the content of that page! Google uses semantic relevance to search terms used by Google clients - those using it as a search engine to find information. Many internet marketers have forgotten that Google is first and foremost a search engine and not an advertising platform!

Article Writing Services and Semantic Relevance

Those offering professional article writing services are fully aware of Google's needs, and can design the content of your individual web pages around these needs. Article ghostwriting involves professional article writing in your name. They will still ensure that your content has a keyword density of around 0.7% - 1.0%, or even more when warranted.

That is because you cannot always use synonyms with the exact semantic relevance you need. There are sometimes situations, or topics, where it is difficult to use synonyms with the same semantic meaning as the main theme of the page. To do this, you need a good knowledge of the language of your site.

In these cases, professional article writing services can adjust the keyword usage (KD) to ensure that the algorithms (crawlers/spiders) have the character strings (data) that correspond with their programmed data - the data that assesses the relevance of your web page to the search term employed. SEO content writing is both a science and an art.

Article Ghostwriting Benefits

That said, the benefits of article ghostwriting should be apparent. You avoid the claws (or webs) of Penguin2 by avoiding over-optimization of your web pages. You also get professional SEO content writing that is 100% original - so no duplicate content issues. This is another issue that Google algorithms are now punishing severely. Your content must be 100% original.

Scraped Content Software

You have no doubt come across software that scrapes the web and generates articles from existing content. Terms such as 'article scraper,' 'instant article,' 'article wizard,' and so on involve the copying of copyrighted material to generate a so-called 'unique' article for you.

Not only is the legality of this dubious, but Google algorithms are now searching such software-generated articles out, and delisting them. They won't tell you that of course, and major article directories are also taking action against such scraped articles.

If you are using this type of software, or have websites containing it, then your Google ranking might disappear overnight. It has already happened, and is likely happening all the time now that the Penguins are on the hunt. Original content is, and always will be, king!

Original SEO Content is King

Article writing services can offer you 100% original content without you worrying about future Google algorithm updates. Article ghostwriting ensures that only you have the article provided to you and that it is published in your name as author. No need to cheat and no need to worry about your online business failing with the next Google update - or even when Penguin2 catches on.

Source: http://ezinearticles.com/?Article-Writing-Services-and-Article-Ghostwriting-Benefits&id=8004387

Thursday, 10 July 2014

Online Data Entry - How Online Data Entry is Useful in Business?

In my last article, "Online Data Entry Projects - Grab An Online Audience by Data Entry", I mentioned some newer ideas that are currently being outsourced by various companies around the world, in countries such as the United States, the United Kingdom, the United Arab Emirates, Canada and others. In this article, I focus on some basic online data entry techniques that most businesses require. Here we go:

Online Compilation from Websites: A company requires a huge amount of information to run its business smoothly. You need details of raw material suppliers, machine suppliers, maintenance service vendors, product distributors and many more. If company executives have this information compiled, they can act promptly and complete tasks quickly. Websites are a great source for finding particular details. By outsourcing the compilation of website data to a reputed data entry company, you can get highly accurate information that helps you make sound decisions.

Online Business Card Entry: Business cards are very helpful, not only for getting a better idea of someone's business but also for getting their contact information easily. Sometimes you misplace a business card just when you need it urgently. If you have entered the business card information into your PC, you can easily search for it, quickly contact the person you need and resolve your queries promptly.

Online Catalog Data Entry: A catalog is the most powerful tool for selling your products. If you don't have an informative and attractive catalog, you cannot convert your viewers into customers. It is also possible that you are missing out on potential online customers by not uploading your catalog online. Online catalog data entry can be the solution for this. Put a good amount of information into your catalog and attract more visitors. You can not only get good business from this but also promote your brand.

Online Survey Form Entry: A survey is a very important tool for checking the mindset of customers. The data captured in survey forms is very important when upgrading a product, moving into a new field, changing strategy, or branding and marketing products. The information is only useful if it is precise and quickly available. Online survey form entry can help you organize surveyed information so that you can clearly focus your efforts in the right direction.

These are various online data typing projects that can help your business create a more efficient environment and increase productivity and profitability. You can meet ambitious goals with an efficient environment and increased productivity.

Source: http://ezinearticles.com/?Online-Data-Entry---How-Online-Data-Entry-is-Useful-in-Business?&id=4505450