Most correct way to extract all <a href> links from a page
12-13-2005, 08:32 AM
mtishetsky (Mike, Mataro, Spain)
Hello,

I ran into a problem while extracting links from HTML. I usually use the following function call to do this:
Code:
$count = preg_match_all("/<a[^>]+href\s*=\s*([\"']?)([^\\s\"'>]+)\\1/is", $data, $matches, PREG_SET_ORDER);
But today I came across a page that contains a link like this:
Code:
<a href="/sitemap/Women's-Interests.html">
The single quote in "Women's" is the problem: the HTML spec allows it inside a double-quoted attribute value, but the character class [^\s"'>] excludes both quote characters, so the regexp stops at the apostrophe and the link does not match. Does anyone have a working method to extract every form of <a href> link from a page? I mean values enclosed in single or double quotes, including double-quoted values that contain single quotes.
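To make the failure concrete, here is a minimal sketch that runs the pattern from this post against the sample link, next to one possible variant that handles each quoting style in its own branch. The variant and the variable names are assumptions for illustration, not a complete HTML parser.

```php
<?php
// Sample input taken from the post above.
$data = '<a href="/sitemap/Women\'s-Interests.html">';

// Pattern from the post: the class [^\s"'>] rejects both quote characters.
$original = "/<a[^>]+href\s*=\s*([\"']?)([^\\s\"'>]+)\\1/is";

// Hypothetical variant: one branch per quoting style, so a double-quoted
// value may contain ' and a single-quoted value may contain ".
$variant = '/<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/is';

$originalCount = preg_match_all($original, $data, $m1, PREG_SET_ORDER);
$variantCount  = preg_match_all($variant, $data, $m2, PREG_SET_ORDER);

// At most one of groups 1..3 is non-empty per match, so joining them
// yields the URL whichever branch matched.
$urls = [];
foreach ($m2 as $set) {
    $urls[] = implode('', array_slice($set, 1));
}

var_dump($originalCount);
print_r($urls);
```

Splitting the value into three alternatives avoids the need for a backreference: each branch only has to exclude its own closing delimiter.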

12-13-2005, 08:51 AM
mtishetsky (Mike, Mataro, Spain)
I changed the regexp to the following:
Code:
/<a[^>]+href\s*=\s*([\"']?)([^\\s\\1>]+)\\1/is

It seems to work now. At least the desired URL matches.
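One caveat worth checking: inside a character class PCRE does not treat \1 as a backreference. It reads it as an octal escape for the character with code 1, so the class effectively excludes only whitespace and ">". The sketch below, using two hypothetical inputs, shows that the sample URL matches but that a tag with no whitespace between attributes is captured too greedily.

```php
<?php
// Pattern from the post above. Inside [...] PCRE reads \1 as the octal
// escape for the character with code 1, not as a backreference.
$pattern = "/<a[^>]+href\s*=\s*([\"']?)([^\\s\\1>]+)\\1/is";

// Hypothetical inputs: the URL from the first post, and a tag with no
// whitespace between two attributes.
$inputs = [
    '<a href="/sitemap/Women\'s-Interests.html">',
    '<a href="a.html"title="t">',
];

$captured = [];
foreach ($inputs as $html) {
    preg_match($pattern, $html, $m);
    $captured[] = $m[2]; // group 2 holds the extracted value
}

print_r($captured);
```

The second input yields a value that runs past the closing quote of href, because the greedy class only backtracks as far as the last matching quote before ">".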

12-13-2005, 08:57 AM
ibbo (Leeds UK)
This captures them:

Code:
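// Replace '<your file>' with the path to the HTML file.
$file = '<your file>';
// Read the whole file; sizeof() counts array elements and is not a file size.
$document = file_get_contents($file);

$count = preg_match_all("'<\s*a\s.*?href\s*=\s*       # find <a href=
                         ([\"\'])?                    # find single or double quote
                         (?(1) (.*?)\\1 | ([^\s\>]+)) # if quote found, match up to next matching
                                                      # quote, otherwise match up to next space
                        'isx", $document, $links);

// preg_match_all() always fills $links, so test the match count instead.
if ($count) {
  print_r($links);
}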
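With this pattern the URL ends up in a different capture group depending on whether the value was quoted. A minimal sketch of flattening the result into one list, assuming a hypothetical sample document and PREG_SET_ORDER for easier iteration:

```php
<?php
// Hypothetical sample document covering the three quoting styles.
$document = '<a href="/a\'b.html">x</a> <a href=\'/c.html\'>y</a> <a href=/d.html>z</a>';

preg_match_all("'<\s*a\s.*?href\s*=\s*
                 ([\"\'])?
                 (?(1) (.*?)\\1 | ([^\s\>]+))
                'isx", $document, $links, PREG_SET_ORDER);

$urls = [];
foreach ($links as $set) {
    // Quoted values land in group 2; unquoted values land in group 3.
    $urls[] = ($set[2] ?? '') !== '' ? $set[2] : ($set[3] ?? '');
}

print_r($urls);
```

The null-coalescing checks are there because PHP omits trailing groups that did not take part in a match.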
Last edited by ibbo; 12-13-2005 at 09:00 AM.
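
For comparison with the regexp approaches in this thread, a different technique is to let an HTML parser handle the quoting rules. The following is a minimal sketch using PHP's DOM extension, assuming it is enabled; the sample markup and variable names are hypothetical.

```php
<?php
// Hypothetical sample markup; in practice $html would be the fetched page.
$html = '<p><a href="/sitemap/Women\'s-Interests.html">one</a> '
      . '<a href=\'/b.html\'>two</a> <a name="top">no href</a></p>';

// Collect libxml parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$urls = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    // Skip anchors that have no href attribute at all.
    if ($anchor->hasAttribute('href')) {
        $urls[] = $anchor->getAttribute('href');
    }
}

print_r($urls);
```

The parser decides where each attribute value ends, so quotes inside a URL need no special handling in the calling code.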