I'm trying to scrape data from our local government's website. What I want is
the addresses of the child adoption offices. Here in Brazil, all adoptions go
through the government. I have the URL of one office, and there are 2 or 3
thousand more, but if I can manage to get one, the others will be
easy. I made many attempts; below I show three.
The problem could be related to JavaScript (Ajax, maybe) that refreshes the page.
Note: I am not a PHP developer.
First attempt
echo '<html><head></head><body>';
echo '<h1>Scraper PHP GET 1</h1>';
echo ini_get("allow_url_fopen");
echo ini_get("allow_url_fopen");
// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';
//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';
$html = file_get_contents($url);
var_dump($html);
echo '</body></html>';
// Output
// 11
// Warning: file_get_contents(http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in /home/rsl/www/sc01_get.php on line 14
// bool(false)
Second attempt
echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 3</h1>';
// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';
//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';
$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");
$html = @curl_exec($curl);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
} else {
    echo '<br>begin HTML[';
    echo $html;
    echo '<br>]end html ';
}
echo '</body></html>';
// Output
// 1
Third attempt
function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 5</h1>';
// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';
//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';
$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");
$html = @curl($curl);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
} else {
    echo '<br>begin HTML[';
    echo $html;
    echo '<br>]end html ';
}
echo '</body></html>';
// Output
// cURL error number:0
// cURL error:
If the pages really are Ajax-based, meaning the information you need to scrape is loaded or rendered through JavaScript execution, you will need another approach: automating a real browser. You can go the Selenium route, which has bindings for a number of languages, or use CasperJS, with JavaScript as the programming language.
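Before reaching for a browser, it may be worth checking whether the data is already present in the raw HTML. What follows is only a minimal PHP sketch under stated assumptions: the helper names build_office_url, fetch_page, and has_marker are hypothetical (they do not come from the question), the marker is whatever label you expect to see near the address, and the cURL extension is assumed to be installed. It sends a plain GET instead of a forced POST and returns the HTTP status, so a 404 like the one in the first attempt shows up instead of being suppressed.

```php
<?php
// Hypothetical helpers for illustration; names and marker text are assumptions.

// Build the query URL for one office ("vara") ID.
function build_office_url(int $vara): string
{
    $base = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php';
    $query = http_build_query(['transacao' => 'CONSULTA', 'vara' => $vara], '', '&');
    return $base . '?' . $query;
}

// Fetch a page with a plain GET; returns [HTTP status code, body or null on failure].
function fetch_page(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$status, $body === false ? null : $body];
}

// True if the raw HTML already contains the text we expect to scrape.
function has_marker(?string $html, string $marker): bool
{
    return $html !== null && stripos($html, $marker) !== false;
}
```

For example, call [$status, $html] = fetch_page(build_office_url(2673)); if $status is 200 but has_marker($html, ...) is false for text you can see in the browser, the content is probably injected by JavaScript and the browser-automation route above applies.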
Source: http://stackoverflow.com/questions/24611046/scraping-data-in-dynamic-sites