I'm using cURL to retrieve links from designated pages. My start address is http://www.newsmuncher.com, and when I gather all the links from the page, they all point to the base URL http://www.benwebdeveloper.com and return a 301 status code. I have a .htaccess set up to redirect all requests from http://www.benwebdeveloper.com to http://www.newsmuncher.com, so I shouldn't be getting links to benwebdeveloper when crawling the site. Any ideas why cURL does this?
It's almost as if it knows the two sites are identical, but because any request to benwebdeveloper is redirected to newsmuncher, it's backtracking along the redirect, even though there aren't actually any links to benwebdeveloper on the page!
PHP Code:
function get_file($location)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $location);
    curl_setopt($ch, CURLOPT_USERAGENT, $this->user_agent);
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE);
    $data = curl_exec($ch);
    echo curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    $status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $content_type = explode(';', curl_getinfo($ch, CURLINFO_CONTENT_TYPE));
    $content_type = $content_type[0];
    curl_close($ch);
    return array('status_code' => $status_code, 'content_type' => $content_type, 'data' => $data);
}
function extract_links($html)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
        $this->add_queue($url);
    }
    echo '<p>Links Found: '. $hrefs->length . '</p>';
}
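Side note on extract_links: it queues each href exactly as it appears in the HTML, so a same-page anchor like "#content" gets requested as if it were a URL (which would explain the status 0 for that entry in the output), and links ending in "#comments" are queued separately from the page they belong to. A rough sketch of cleaning up each href before queueing it is below. The helper name normalize_link and its $base parameter are my own invention for illustration and are not part of the crawler above; it also does not handle "../" segments.

```php
<?php
// Hypothetical helper (not part of the original crawler): turn a raw href
// into an absolute, fragment-free URL, or return null if it should be skipped.
function normalize_link(string $href, string $base): ?string
{
    $href = trim($href);

    // Skip empty values, same-page anchors such as "#content", and non-HTTP schemes.
    if ($href === '' || $href[0] === '#' || preg_match('/^(mailto|javascript|tel):/i', $href)) {
        return null;
    }

    // Drop the fragment: it is never sent to the server anyway.
    $href = explode('#', $href, 2)[0];

    // Already an absolute http(s) URL.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // the base URL itself is unusable
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link: "//example.com/page".
    if (strncmp($href, '//', 2) === 0) {
        return $parts['scheme'] . ':' . $href;
    }

    // Root-relative link: "/about/".
    if ($href[0] === '/') {
        return $root . $href;
    }

    // Path-relative link: resolve against the directory of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

Inside extract_links this would mean passing in the URL of the page being parsed (which the function does not currently receive) and only calling add_queue when the result is not null.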
Output From Script
PHP Code:
Crawler Initiated
Links Found: 26
bool(false) http://www.benwebdeveloper.com/ 301
bool(false) #content 0
bool(false) http://www.benwebdeveloper.com/ 301
bool(false) http://www.benwebdeveloper.com/about/ 301
bool(false) http://www.benwebdeveloper.com/2010/10/hello-world/ 301
bool(false) http://www.benwebdeveloper.com/2010/10/hello-world/ 301
bool(false) http://www.benwebdeveloper.com/author/admin/ 301
bool(false) http://www.benwebdeveloper.com/category/uncategorized/ 301
bool(false) http://www.benwebdeveloper.com/2010/10/hello-world/#comments 301
bool(false) http://themeforest.net/item/circlosquero-premium-wordpress-theme/163014?ref=benwebdeveloper 302
bool(false) http://themeforest.net/item/alyeska-premium-wordpress-theme/164366?ref=benwebdeveloper 302
bool(false) http://themeforest.net/item/dandelion-powerful-elegant-wordpress-theme/136628?ref=benwebdeveloper 302
bool(false) http://themeforest.net/item/king-size-fullscreen-background-wordpress-theme/166299?ref=benwebdeveloper 302
bool(false) http://themeforest.net/item/lotus-for-business-software-corporate-portfolio/164748?ref=benwebdeveloper 302
bool(false) http://themeforest.net/item/striking-premium-corporate-portfolio-wp-theme/128763?ref=benwebdeveloper 302
bool(false) http://www.benwebdeveloper.com/2010/10/hello-world/ 301
bool(false) http://wordpress.org/ 200
bool(false) http://www.benwebdeveloper.com/2010/10/hello-world/#comment-1 301
bool(false) http://www.benwebdeveloper.com/2010/10/ 301
bool(false) http://www.benwebdeveloper.com/category/uncategorized/ 301
bool(false) http://www.newsmuncher.com/wp-login.php 200
bool(false) http://www.benwebdeveloper.com/feed/ 301
bool(false) http://www.benwebdeveloper.com/comments/feed/ 301
bool(false) http://wordpress.org/ 200
bool(false) http://www.benwebdeveloper.com/ 301
bool(false) http://wordpress.org/ 200
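For what it's worth, going by the code shown, the queued URLs are read straight from the href attributes of whatever HTML get_file returned, and because CURLOPT_FOLLOWLOCATION is FALSE, cURL reports each 301 instead of following it. To see where those 301s actually point, something like the following could read the Location header of each response. This is only a sketch: parse_location_header and get_redirect_target are made-up names for illustration, not part of the crawler.

```php
<?php
// Hypothetical helper: pull the Location value out of a raw header block.
function parse_location_header(string $rawHeaders): ?string
{
    // Header names are case-insensitive; the URL itself contains no whitespace.
    if (preg_match('/^Location:[ \t]*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: request only the headers of $location and return
// the redirect target, or null if there is none or the request failed.
function get_redirect_target(string $location): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $location);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only, no body
    curl_setopt($ch, CURLOPT_HEADER, true);          // include headers in the return value
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // do not follow the redirect
    $headers = curl_exec($ch);
    curl_close($ch);

    return $headers === false ? null : parse_location_header($headers);
}
```

Calling get_redirect_target on one of the 301 URLs above would show whether the redirect target is the matching newsmuncher address, which is what the .htaccess rule should produce.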
Last edited by evans123; 03-17-2011 at 03:14 PM..