regular expressions - everything after
Old 07-10-2006, 04:11 AM regular expressions - everything after
Junior Talker

Posts: 4
Name: anders
Trades: 0
Hi
I am trying to write my own spider script and am in the process of extracting URLs from a page. I have managed to output the URLs with "<a href" attached at the beginning, i.e. <a href="index.htm". I then tried to remove the "<a href" by putting the matched string into another regular expression preceded by $', which effectively outputs 'everything after' a pattern, in my case everything after <a href. No errors are highlighted in my code, but the latter part doesn't work. Can anyone put me straight?

Code:-

<?php

//get page contents
$page = file_get_contents("http://www.jagprops.co.uk/");

//find urls in $page, matches are put in $matches
preg_match_all ("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"'>]+)[\"'>]/", $page, $matches);

//only first 2 urls are echoed at the moment
echo $matches[0][0];
echo $matches[0][1];

$string = $matches[0][0];
//get everything after href - doesn't work
preg_match("/$'href\s*=\s*/",$string,$url);
echo $url[0][1];


?>
Solaar is offline
Old 07-10-2006, 02:37 PM Re: regular expressions - everything after
Webmaster Talker

Posts: 626
Trades: 0
Try changing your regex to the following:

<?php

//get page contents
$page = file_get_contents("http://www.jagprops.co.uk/");

//find urls in $page, matches are put in $matches
preg_match_all ('/href="[^"]*"/', $page, $matches);

?>


This will extract ALL href="blah.htm" attributes in the page. Then, depending on what you want to do, you can manipulate the matches from there. If you need help with that, let me know.

Last edited by jim.thornton; 07-10-2006 at 02:38 PM..
jim.thornton is offline
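
On the original question: $' is Perl's "everything after the match" variable, and PHP's preg functions do not appear to offer an equivalent. Inside a double-quoted PHP pattern it is just a literal $ (the end-of-subject anchor) followed by a quote, so the pattern can never match. A minimal sketch of an alternative, assuming the goal is simply the URL on its own: the first regex in the thread already has a capture group, so the URL is available in $matches[1] without a second regex. The sample markup below is made up for illustration.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$page = '<p><a href="index.htm">Home</a> <a class="x" href=\'about.htm\'>About</a></p>';

// The parentheses capture only the URL, so no second regex is needed.
preg_match_all('/<\s*a\s+[^>]*href\s*=\s*["\']?([^"\'>\s]+)/i', $page, $matches);

// $matches[0] holds the full matches, $matches[1] holds the captured URLs.
foreach ($matches[1] as $url) {
    echo $url, "\n";
}
```

Here $matches[0][0] would still be the whole <a href="index.htm piece, while $matches[1][0] is the bare index.htm.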
Old 07-11-2006, 08:06 AM Re: regular expressions - everything after
Thanks for the code and the offer. I have actually come up with the code below, which works OK, although I get some dodgy results at some sites like www.skint.co.uk compared to clean results at sites like www.jagprops.co.uk. Thanks again, and any more advice is thoroughly welcome.

<?php
//get page contents
$page = file_get_contents("http://www.skint.co.uk");

//find urls in $page, matches are put in $matches
preg_match_all ("/href=[\"']?([^\"' >]+)/", $page, $matches);

$count = count($matches[0]);

//use < rather than <= so the loop does not run one past the last match
for ( $counter = 0; $counter < $count; $counter++){

    $link = $matches[0][$counter];

    //strip the leading href= and the opening quote, if there is one
    $link = preg_replace( "/href=[\"']?/","", $link);

    echo "$link<br/>";
}
?>
Solaar is offline
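
The thread does not say what causes the dodgy results on some sites, but regexes tend to struggle with untidy markup. A rough sketch of a different technique (not the one used above), assuming PHP's DOM extension is installed: let an HTML parser find the <a> elements and read their href attributes. The function name extract_links and the sample markup are hypothetical.

```php
<?php
// Hypothetical sample markup; a real spider would use file_get_contents() here.
$page = '<html><body><a href="/a.htm">A</a><A HREF=\'b.htm\'>B</A><a name="top">no link</a></body></html>';

// Return the href value of every <a> element that has one.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = trim($anchor->getAttribute('href'));
        }
    }
    return $links;
}

foreach (extract_links($page) as $link) {
    echo $link, "<br/>\n";
}
```

The parser deals with quoting and attribute order, so anchors without an href (such as the name="top" one) are simply skipped.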
Old 07-11-2006, 08:29 AM Re: regular expressions - everything after
ibbo's Avatar
Super Spam Talker

Posts: 880
Location: Leeds UK
Trades: 0
When you're crawling a site you could well benefit from applying a depth tag.

I.e.

All links on domain.com are stored, but do you then follow those links and grab sub-page links? It's time-consuming, and for each extra level of depth you crawl you can expect a penalty of 30 seconds or more in crawl time.

Once you have a list of links, it's a good idea to open those links and rescan them for more links (remember to remove duplicates).

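foreach(array_unique($matches[0]) as $url){
    // file_get_contents() returns the page as one string, ready for preg_match_all()
    $page = file_get_contents($domain.$url);
    // rescan new page for more links
}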

Also check out this thread http://www.webmaster-talk.com/php-fo...highlight=ibbo.

There is a crawler script on there that you can glean some insight from, as it grabs URLs from a site and matches keywords on them. You can scrap the keyword-match part and simply grab the URLs instead.

Ibbo
__________________

Linux user #349545 :
(GNU/Linux)iD8DBQBAzWjX+MZAIjBWXGURAmflAKCntuBbuKCWenpm XoA7LNydllVQOwCf

Last edited by ibbo; 07-11-2006 at 08:41 AM..
ibbo is offline
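
The depth idea described in the post above can be sketched roughly as follows. This is a minimal illustration, not the crawler script from the linked thread: the function name crawl, the $maxDepth parameter and the $fetch callable are all hypothetical, and the "site" is an in-memory array so the sketch runs without network access. Relative URLs are not resolved here, which a real spider would need to do.

```php
<?php
// Breadth-first crawl limited to $maxDepth levels below the start URL.
// $fetch is a hypothetical callable (URL => HTML string, or false on failure).
function crawl(string $start, int $maxDepth, callable $fetch): array
{
    $seen  = [$start => 0];          // URL => depth at which it was first found
    $queue = [[$start, 0]];

    while ($queue) {
        [$url, $depth] = array_shift($queue);
        if ($depth >= $maxDepth) {
            continue;                // deep enough: keep the URL but do not fetch it
        }
        $html = $fetch($url);
        if ($html === false) {
            continue;
        }
        preg_match_all('/href=["\']?([^"\' >]+)/i', $html, $m);
        foreach ($m[1] as $link) {
            if (!isset($seen[$link])) {   // skip duplicates
                $seen[$link] = $depth + 1;
                $queue[] = [$link, $depth + 1];
            }
        }
    }
    return $seen;
}

// Tiny in-memory "site" standing in for real pages.
$site = [
    '/'      => '<a href="/a.htm">a</a> <a href="/b.htm">b</a>',
    '/a.htm' => '<a href="/">home</a> <a href="/c.htm">c</a>',
    '/b.htm' => '<a href="/a.htm">a</a>',
    '/c.htm' => '<a href="/d.htm">d</a>',
];
$fetch = function (string $url) use ($site) {
    return $site[$url] ?? false;
};

print_r(crawl('/', 2, $fetch));
```

With a depth of 2 the sketch records /c.htm but never opens it, so /d.htm is not discovered; raising the depth finds more pages at the cost of more fetches.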
Old 07-11-2006, 09:00 AM Re: regular expressions - everything after
Thanks very much, that's extremely useful. Cool script. If you were to make a full-blown search engine, would you stick to a depth of four or lessen the depth?
Solaar is offline
Old 07-11-2006, 09:05 AM Re: regular expressions - everything after
Well, the depth is the key: the deeper you go (in the case of my script and its purpose), the bigger the % of successful results.

As with any website you're intending to spider/crawl, you need to ensure you're getting to the very extremities of its pages. These could be up to N directories deep.

My script does not match for "a href" but searches for /dir/dir/, which could be a problem if you're looking at matching incoming and outgoing links (you will need to use your regex for that, and I would maybe store them in a separate domain array).

However, the deeper the better, since you will get the most results.

Ibbo
ibbo is offline
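
The suggestion of keeping outgoing links in a separate array could look something like the sketch below. It assumes that "incoming and outgoing" here means links that stay on the crawled site versus links that leave it; the function name split_links and the example hosts are hypothetical.

```php
<?php
// Split links into those on the crawled host and those pointing elsewhere.
// Links without a host part (relative links) are treated as internal.
function split_links(array $links, string $siteHost): array
{
    $result = ['internal' => [], 'external' => []];
    foreach ($links as $link) {
        $host = parse_url($link, PHP_URL_HOST);
        if ($host === null || $host === false || strcasecmp($host, $siteHost) === 0) {
            $result['internal'][] = $link;
        } else {
            $result['external'][] = $link;
        }
    }
    return $result;
}

$links = ['index.htm', '/about/', 'http://www.example.com/contact.htm', 'http://www.example.org/'];
print_r(split_links($links, 'www.example.com'));
```

Only the internal list would then be fed back into the crawl, while the external list is stored for later.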