Tycoon Talk

PHP Forum


Content Scraper help needed
07-11-2009, 01:12 AM
sameer785
Super Talker

Posts: 115
Location: Kathmandu,Nepal
Trades: 0
I would really appreciate it if you guys could help me out with this problem.

Basically, I am trying to build a content scraper using file_get_contents and regular expressions. Everything works fine, except that the page I am scraping generates its tables in two different ways.

The first kind is where everything (e.g. name, location, link) is present. I can scrape this table's contents successfully.

Here's the code for this type of table:

HTML Code:
<div class="A" id="B" style="display:none"> 
                        	<div class="C"> X </div>
                            <div class="D"><table border="0" cellspacing="1" cellpadding="0" width="100%">
	<tbody>
		<tr>
			<td width="24%" height="20" valign="top">Y</td>
			<td width="76%" valign="top">Z</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">X</td>
			<td valign="top">Y</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">X</td>
			<td valign="top">Y</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">Z</td>
			<td valign="top">X</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">Y</td>
			<td valign="top">Z</td>
		</tr>
<tr>
			 
			<td height="20" valign="top">X</td>
			<td valign="top">Y</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">Z</td>
			<td valign="top"><a href="mailto: X">X</a></td>
		</tr>
		<tr>
			<td height="20" valign="top">Y</td>
			<td valign="top"><a href="Z" target="_blank">Z</a></td>
		</tr>	</tbody>
</table>
 </div>
However, the page also generates a few tables where everything (name, location, etc.) is present except the row that holds the link. When that happens my scraper breaks: instead of stopping, it ends up grabbing a link from an altogether different table and then carries on through the rest of the page from there.

Here’s the code for this “other” type of table.

HTML Code:
<div class="A" id="B" style="display:none"> 
                        	<div class="C"> X </div>
                            <div class="D"><table border="0" cellspacing="1" cellpadding="0" width="100%">
	<tbody>
		<tr>
			<td width="24%" height="20" valign="top">Y</td>
			<td width="76%" valign="top">Z</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">X</td>
			<td valign="top">Y</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">X</td>
			<td valign="top">Y</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">Z</td>
			<td valign="top">X</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">Y</td>
			<td valign="top">Z</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">X</td>
			<td valign="top">Y</td>
		</tr>
		<tr>
			 
			<td height="20" valign="top">X</td>
			<td valign="top"><a href="mailto:Y">Y</a></td>
		</tr>
	</tbody>
</table>
 </div>
And here’s my scraper.

PHP Code:
<?php
$url = 'hotels5.html';
$raw = file_get_contents($url);

// Strip tabs, line breaks and double spaces so the pattern can ignore layout.
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B");
$html = str_replace($newlines, "", html_entity_decode($raw));

preg_match_all('/<div class="A" id=".*?" style="display:none">.*?<div class="B">(.*?)<\/div>.*?<div class="C"><table border=".*?0.*?" cellspacing=".*?1.*?" cellpadding=".*?0.*?" width=".*?100%.*?">.*?<tbody>.*?<tr>.*?<td width=".*?24%.*?" height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td width=".*?76%.*?" valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?">(.*?)<\/td>.*?<\/tr>.*?<tr>.*?<td height=".*?20.*?" valign=".*?top.*?">.*?<\/td>.*?<td valign=".*?top.*?"><a href=".*?">(.*?)<\/a><\/td>.*?<\/tr>/s'
    
    , $html            // subject: the cleaned-up page source
    , $posts           // receives one array of captures per match
    , PREG_SET_ORDER
);

foreach ($posts as $post) {....
Obviously, what I want is to scrape the contents of both kinds of table. If the hyperlink in the very last cell is not present, I would like the scraper to move on to the next matching block instead of searching further for a hyperlink.
I was wondering if you guys know a way around this problem.

I am new to regular expressions, and somewhat new to PHP itself.

Thanks!
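
The behaviour described above is what lazy quantifiers do under the /s flag: `.*?` keeps expanding past the end of one table until the required `<a href=...>` row turns up somewhere later in the page. One way to contain it, sketched here under assumptions, is to cut the page into one chunk per table first and then look for the optional link inside that chunk only. The function name `extractTables`, the sample markup and the rule "a web link is any href that does not start with mailto:" are made up for illustration, and the pattern assumes tables are not nested.

```php
<?php
// Hypothetical helper: one entry per <table>, with the text of every cell
// and a 'link' that stays null when the table has no non-mailto link.
function extractTables(string $html): array
{
    $results = [];

    // Step 1: isolate each table so a match can never leak into the next one.
    preg_match_all('/<table\b.*?<\/table>/s', $html, $tables);

    foreach ($tables[0] as $table) {
        // Step 2: collect the cells of this table only.
        preg_match_all('/<td[^>]*>(.*?)<\/td>/s', $table, $cells);
        $values = array_map(function ($cell) {
            return trim(strip_tags($cell));
        }, $cells[1]);

        // Step 3: the web link is optional; skip mailto: addresses.
        $link = null;
        if (preg_match('/<a\s+href="(?!mailto:)([^"]*)"/', $table, $m)) {
            $link = $m[1];
        }

        $results[] = ['cells' => $values, 'link' => $link];
    }

    return $results;
}

// Made-up sample: the first table has a web link, the second only an e-mail.
$sample = '<table><tr><td>Name</td><td>Hotel A</td></tr>'
        . '<tr><td>Web</td><td><a href="http://a.example" target="_blank">a.example</a></td></tr></table>'
        . '<table><tr><td>Name</td><td>Hotel B</td></tr>'
        . '<tr><td>Email</td><td><a href="mailto:b@b.example">b@b.example</a></td></tr></table>';

$tables = extractTables($sample);
```

Because each table is matched on its own, a missing link simply leaves `link` as null for that table and the loop moves on to the next block.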
07-11-2009, 06:17 AM Re: Content Scraper help needed
davidj
Extreme Talker

Posts: 181
Name: David Jackson
Trades: 0
I don't have an answer to your problem without writing a prototype.

I suggest you use cURL to do the scrape, as it's much better suited.

http://www.merchantos.com/makebeta/p.../#curl_content
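
For reference, a minimal sketch of fetching a page with PHP's cURL extension might look like the following. The function name `fetchPage`, the timeout and the user-agent string are placeholders, not something from the linked article. Note that this only changes how the page is fetched; the matching problem in the original post is a separate issue.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the response body,
// or null if the transfer failed.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'example-scraper/0.1', // placeholder value
    ]);

    $body = curl_exec($ch);

    // curl_exec() returns false on failure when RETURNTRANSFER is set.
    return $body === false ? null : $body;
}

// Usage (the URL is a placeholder):
// $raw = fetchPage('http://www.example.com/hotels5.html');
```

Compared with file_get_contents, this gives control over timeouts, redirects and request headers, which tends to matter once the page lives on a remote server rather than in a local file.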
07-11-2009, 08:36 AM Re: Content Scraper help needed
tripy
Do not try this at home!

Posts: 3,621
Name: Thierry
Location: I'm the uber Spaminator !
Trades: 0
And forget about regexps, please...
Use the PHP DOM extension to parse the page and retrieve the elements.

http://www.php.net/manual/en/book.dom.php
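
As a rough sketch of what that could look like for the tables in the first post, assuming each table sits inside a `<div class="D">` as in the sample markup: the function name `scrapeTables`, the sample data and the rule "a web link is any href that does not start with mailto:" are illustrative assumptions.

```php
<?php
// Hypothetical helper: one record per table found inside <div class="D">.
// 'rows' holds [label, value] pairs; 'link' stays null when no web link exists.
function scrapeTables(string $html): array
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $records = [];

    foreach ($xpath->query('//div[@class="D"]//table') as $table) {
        $record = ['rows' => [], 'link' => null];

        foreach ($xpath->query('.//tr', $table) as $row) {
            $cells = $xpath->query('./td', $row);
            if ($cells->length < 2) {
                continue; // not a label/value row
            }
            $record['rows'][] = [
                trim($cells->item(0)->textContent),
                trim($cells->item(1)->textContent),
            ];

            // The web link is optional; ignore mailto: addresses.
            $anchor = $xpath
                ->query('.//a[not(starts-with(@href, "mailto:"))]', $cells->item(1))
                ->item(0);
            if ($anchor !== null) {
                $record['link'] = $anchor->getAttribute('href');
            }
        }

        $records[] = $record;
    }

    return $records;
}

// Made-up sample: the second table has no web link row.
$sample = '<div class="D"><table><tbody>'
        . '<tr><td>Name</td><td>Hotel A</td></tr>'
        . '<tr><td>Web</td><td><a href="http://a.example" target="_blank">a.example</a></td></tr>'
        . '</tbody></table></div>'
        . '<div class="D"><table><tbody>'
        . '<tr><td>Name</td><td>Hotel B</td></tr>'
        . '<tr><td>Email</td><td><a href="mailto:b@b.example">b@b.example</a></td></tr>'
        . '</tbody></table></div>';

$records = scrapeTables($sample);
```

Since every query is scoped to one table node, a table without the link row just yields a record whose `link` is null, and nothing from the following table gets pulled in.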