How to Use Web Content Extractor (WCE) as an Email Scraper?
Web Content Extractor is a great web scraping tool developed by the Newprosoft team. It has an easy-to-use project wizard for creating a scraping configuration and scraping data from websites.
One day I came across Visual Email Extractor, which is also a Newprosoft product and is similar to Web Content Extractor, but its primary use is to scrape email addresses by crawling the websites you feed to it. I noticed that with a small modification to the Web Content Extractor project configuration, you can use it the same way as Visual Email Extractor to extract email addresses.
In this post I will show you the configuration that makes Web Content Extractor extract email addresses. I still recommend Visual Email Extractor, as it has a lot more features than extracting emails with WCE.
Here is the configuration that makes WCE extract emails.
Step 1: Open Web Content Extractor, create a new project, and click Next.
Step 2: Under Crawling Rules -> Advanced Rules tab, apply the following settings.
Crawling Level 1 Settings
In the 'Follow links if link text equals' text box, enter the following values:
*contact*; *feedback*; *support*; *about*
In the 'Follow links if URL contains' text box, enter the following values:
contact; feedback; support; about
In the 'Do not follow links if URL contains' text box, enter the following values:
google.; yahoo.; bing; msn.; altavista.; myspace.com; youtube.com; googleusercontent.com; =http; .jpg; .gif; .png; .bmp; .exe; .zip; .pdf;
Set 'Maximum Crawling Depth' to 2
Set 'Crawling Order' to Depth First Crawling
Tick the following check box:
-> Follow all internal links
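To make the depth and ordering settings more concrete, here is a minimal Python sketch of a depth-first crawl limited to a maximum depth of 2. It is not how WCE is implemented internally. The `SITE` dictionary, the page names, and the choice to count the start page as depth 0 are all assumptions made for illustration.

```python
def crawl_depth_first(start_url, links, max_depth=2):
    """Return pages in depth-first visiting order, stopping at max_depth."""
    visited = []
    seen = set()

    def visit(url, depth):
        # Skip pages already visited or deeper than the allowed depth.
        if url in seen or depth > max_depth:
            return
        seen.add(url)
        visited.append(url)
        # Depth first: fully explore each child before moving to the next one.
        for child in links.get(url, []):
            visit(child, depth + 1)

    visit(start_url, 0)  # assumption: the start page counts as depth 0
    return visited


# Hypothetical site structure standing in for real pages and their links.
SITE = {
    "home": ["about", "products"],
    "about": ["contact"],
    "products": ["item-1"],
    "contact": ["map"],
}

order = crawl_depth_first("home", SITE, max_depth=2)
print(order)
```

In this sketch the "map" page is never visited, because it sits three links away from the start page.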
Crawling Level 2 Settings
Set the 'Follow links if link text equals' text box to the value below:
*contact*; *feedback*; *support*; *about*
Set the 'Follow links if URL contains' text box to the value below:
contact; feedback; support; about
Set the 'Do not follow links if URL contains' text box to the value below:
=http
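The follow and do-not-follow rules above can be pictured as a simple link filter. The Python sketch below uses the Level 1 values from this post. How WCE actually combines the rules is not documented here, so three things are assumptions: the exclusion list wins over everything else, the two follow rules are combined with OR, and matching is case-insensitive. The function name `should_follow` and the example URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# Values taken from the Level 1 settings in this post.
FOLLOW_TEXT_PATTERNS = ["*contact*", "*feedback*", "*support*", "*about*"]
FOLLOW_URL_KEYWORDS = ["contact", "feedback", "support", "about"]
SKIP_URL_KEYWORDS = [
    "google.", "yahoo.", "bing", "msn.", "altavista.", "myspace.com",
    "youtube.com", "googleusercontent.com", "=http",
    ".jpg", ".gif", ".png", ".bmp", ".exe", ".zip", ".pdf",
]


def should_follow(link_text, url):
    """Decide whether a link should be crawled (assumed rule order)."""
    text = link_text.lower()
    url = url.lower()
    # Assumption: the do-not-follow list is checked first and always wins.
    if any(keyword in url for keyword in SKIP_URL_KEYWORDS):
        return False
    # Wildcard match on the link text, e.g. "*contact*" matches "contact us".
    if any(fnmatchcase(text, pattern) for pattern in FOLLOW_TEXT_PATTERNS):
        return True
    # Otherwise follow only if the URL contains one of the keywords.
    return any(keyword in url for keyword in FOLLOW_URL_KEYWORDS)


print(should_follow("Contact Us", "http://example.com/contact.html"))
print(should_follow("About", "http://example.com/about.pdf"))
```

The "=http" entry rejects links that carry another URL as a parameter, such as share or redirect links.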
Step 3: After applying the settings above, click Next. In the Extraction Pattern window, click Define. In Web Page Address (URL), enter the URL of any page that contains an email address, then click the + sign to the right of Data Fields to define the scraping pattern.
Now, inside HTML Structure, select the HTML check box or the Body check box, which means the whole page content will be parsed for each page.
The last setting extracts emails from the page using a regular-expression-based email extraction function. Open the Predefined Script window, select 'Extract_Email_Addresses', and click OK. If the page you used contains an email address, you will see the harvested email in 'Script Result'.
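For readers curious what regular-expression-based email extraction looks like, here is a small Python sketch. The actual pattern used by the 'Extract_Email_Addresses' predefined script is not shown in WCE, so the regular expression below is a simplified assumption, and `extract_email_addresses` and `sample_html` are hypothetical names used only for this example.

```python
import re

# Simplified email pattern (an assumption, not the pattern WCE uses).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_email_addresses(page_text):
    """Return unique email-like strings in order of first appearance."""
    found = []
    for match in EMAIL_PATTERN.findall(page_text):
        if match not in found:
            found.append(match)
    return found


# Hypothetical page body, standing in for the whole HTML or Body content.
sample_html = """
<body>
  <p>Sales: <a href="mailto:sales@example.com">sales@example.com</a></p>
  <p>Support: support@example.org</p>
</body>
"""

print(extract_email_addresses(sample_html))
```

Because the whole page content is scanned, the same address often appears more than once (for example in a mailto link and in its visible text), which is why the sketch removes duplicates.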
I hope this helps you use Web Content Extractor as an email scraper. Share your views in the comments.
Source: http://webdata-scraping.com/use-web-content-extractor-as-email-scraper/