Hey all,
I am trying to extract plain text from within HTML tags.
The script starts out with a spider:
PHP Code:
<?php
// This class grabs the content from the sites
class spider
{
    // cURL handle for the current request (declared explicitly rather than created on the fly)
    private $curl;

    // Apply the shared options: cookie storage, timeouts, redirects
    function setup()
    {
        $cookieJar = 'cookies.txt';
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_TIMEOUT, 30);
        curl_setopt($this->curl, CURLOPT_CONNECTTIMEOUT, 25);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        // Return the response body as a string instead of printing it
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    // Fetch $url; returns the page HTML, or false if the request fails
    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}

$spider = new spider();
// Note: despite its name, $link holds the fetched HTML
$link = $spider->get("http://www12.statcan.ca/census-recensement/2006/dp-pd/prof/92-597/P3.cfm?Lang=E&CTCODE=4707&CATYPE=CMA");
?>
This works fine. The problem I am having is identifying the correct functions to use for actually parsing the information.
I was trying to use 'substr()' in a loop, but I don't know how to identify the start point ('<td class="alignBottomRight col_1"', followed by a row id that changes from row to row, which I will need to take into account) and the end point ('</td>').
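For reference, the substr()-and-loop idea could be sketched roughly like this. The function name extractCells() and the sample markup are made up for illustration, and the sketch assumes the row id appears as an extra attribute after the class, so it simply skips ahead to the '>' that closes the opening tag:

```php
<?php
// Hypothetical helper: returns the text of every cell whose opening tag
// begins with $startMarker, whatever attributes follow it.
function extractCells(string $html, string $startMarker, string $endMarker = '</td>'): array
{
    $cells = [];
    $offset = 0;
    while (($start = strpos($html, $startMarker, $offset)) !== false) {
        // Skip the rest of the opening tag (e.g. the changing row id).
        $tagEnd = strpos($html, '>', $start);
        if ($tagEnd === false) {
            break;
        }
        // The cell content runs up to the next closing tag.
        $end = strpos($html, $endMarker, $tagEnd);
        if ($end === false) {
            break;
        }
        $raw = substr($html, $tagEnd + 1, $end - $tagEnd - 1);
        // Drop any nested tags and decode entities such as &amp;.
        $cells[] = trim(html_entity_decode(strip_tags($raw)));
        // Continue searching after this cell.
        $offset = $end + strlen($endMarker);
    }
    return $cells;
}

// Made-up sample markup; the real page's attributes may differ.
$sample = '<tr><td class="alignBottomRight col_1" id="r1">1,234</td></tr>'
        . '<tr><td class="alignBottomRight col_1" id="r2"> 5,678 </td></tr>';
print_r(extractCells($sample, '<td class="alignBottomRight col_1"'));
```

With the spider above, the first argument would be $link (after checking that it is not false). String searching like this is brittle if the markup changes, so treat it as a starting point only.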
Do you guys have any suggestions for how I can set this up to retrieve the data, or know of any good pre-made scripts that might be of use here?
Thanks!
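One direction that might fit here, sketched under assumptions: PHP's bundled DOM extension (DOMDocument plus DOMXPath) can parse the fetched HTML and select cells by class, so the changing row id does not matter. The function name extractCellsByClass() and the sample table are hypothetical, and the XPath assumes the class attribute is exactly 'alignBottomRight col_1' on the real page:

```php
<?php
// Hypothetical helper: returns the trimmed text of every <td> whose
// class attribute equals $classes exactly.
function extractCellsByClass(string $html, string $classes): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $values = [];
    foreach ($xpath->query('//td[@class="' . $classes . '"]') as $cell) {
        // textContent is the cell's plain text with nested tags removed.
        $values[] = trim($cell->textContent);
    }
    return $values;
}

// Made-up sample markup standing in for the fetched page.
$sample = '<table>'
        . '<tr><td class="alignBottomRight col_1" id="r1">1,234</td></tr>'
        . '<tr><td class="alignBottomRight col_1" id="r2"> 5,678 </td></tr>'
        . '<tr><td class="other">skip me</td></tr>'
        . '</table>';
print_r(extractCellsByClass($sample, 'alignBottomRight col_1'));
```

With the spider above, the call would be extractCellsByClass($link, 'alignBottomRight col_1'), after checking that $link is not false. If the class list varies between cells, an XPath contains() test on @class is a looser alternative.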