Extracting plain text from within HTML tags

Truly, 10-02-2010, 02:51 PM:
Hey all,

I am trying to extract plain text from within HTML tags.

The script starts out with a spider:
PHP Code:
<?php
// This class grabs the content from the sites
class spider
{
    private $curl; // cURL handle, created in get()

    // Apply the shared cURL options to the current handle
    function setup()
    {
        $cookieJar = 'cookies.txt';
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_TIMEOUT, 30);
        curl_setopt($this->curl, CURLOPT_CONNECTTIMEOUT, 25);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    // Fetch $url and return the response body as a string (false on failure)
    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}

$spider = new spider();
// Despite its name, $link holds the HTML source of the page, not a URL
$link = $spider->get("http://www12.statcan.ca/census-recensement/2006/dp-pd/prof/92-597/P3.cfm?Lang=E&CTCODE=4707&CATYPE=CMA");
?>
This works fine. The problem I am having is identifying the correct functions to use for actually parsing the information.

I was trying to use 'substr()' and a loop, but I don't know how to identify the start point ('<td class="alignBottomRight col_1"' plus a row id that increases from row to row, which I will need to take into account) and the end point ('</td>').

Do you guys have any suggestions as far as how I can set this up to retrieve the data, or of any good pre-made scripts that might be of use here?

Thanks!

chrishirst (Chris Hirst, Blackpool, UK), 10-02-2010, 03:14 PM:
How about going about it in reverse?

http://php.net/manual/en/function.strip-tags.php
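
A minimal sketch of what strip_tags() does, assuming hypothetical fragments shaped like the cells described in the first post (the markup, id and values are invented for illustration). It also shows why the function alone is not enough on a whole row or page: the text of neighbouring cells runs together once the tags are gone.

```php
<?php
// Hypothetical single cell, loosely modelled on the markup quoted in the first post.
$cell = '<td class="alignBottomRight col_1" id="r1">1,234</td>';

// strip_tags() drops the markup and keeps only the text between the tags.
$text = trim(strip_tags($cell));
echo $text, "\n";

// Hypothetical whole row: the label and the value are no longer separated.
$row = '<tr><th>Population</th><td class="alignBottomRight col_1">1,234</td></tr>';
$rowText = strip_tags($row);
echo $rowText, "\n";
```

So strip_tags() is best applied last, to a fragment that has already been isolated by some other means.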
__________________
Chris.
A foolish consistency is the hobgoblin of little minds
Thought for today:- Is SEO the only industry where all the cowboys are Indians?

Truly, 10-02-2010, 06:26 PM:
Hey Chris,

The problem is that the first half of the script gives me the source code for the entire page. I am then trying to pull data out of one specific column on that page. Keeping the tags lets me identify the correct column, but I am just not sure how to go about doing this.

Thanks for your help!

chrishirst, 10-02-2010, 06:31 PM:
Can the document be parsed as an XML tree?

Truly, 10-03-2010, 02:14 PM:
Chris, how would I do that?

I am pulling the data from a government website and this page doesn't use XML. They do have an option to export to CSV, but unfortunately they use a unique identifier on the link that doesn't seem to have any relevance other than as an ID.

chrishirst, 10-03-2010, 02:57 PM:
If the page uses an XHTML Strict DTD you should be able to cURL the source into a variable and read the node content using the ID attribute.

(in theory)

Truly, 10-03-2010, 03:55 PM:
Wow, that sounds intense. Do you know of a good tutorial or example that explains this?

Shouldn't there be a simpler way to do it using substr() and some other function?

chrishirst, 10-03-2010, 04:46 PM:
http://www.php.net/manual/en/domdocu...lementbyid.php

Truly, 10-03-2010, 05:09 PM:
Nice.

So I gave that a try:

$doc = new DomDocument;

// We need to validate our document before referring to the id
$doc->validateOnParse = true;
$doc->Load($link);

echo "The element whose id is books is: " . $doc->getElementById('col_1')->tagName . "\n";

The problem is that they don't have a unique ID for them.

I tried using getElementsByTagName() but it didn't return anything.

Chris you rock!
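
Two things in the attempt above are worth checking. $link holds the page source as a string, while DOMDocument::load() expects a file path or URL, so DOMDocument::loadHTML() is the call that parses a string. Also, in the markup quoted in the first post, col_1 is a class, not an id, so getElementById('col_1') has nothing to find. A minimal sketch that selects cells by class with DOMXPath, assuming a hypothetical stand-in for the fetched page (labels, ids and values are invented):

```php
<?php
// Hypothetical markup standing in for the fetched page source ($link in the thread).
$html = <<<HTML
<html><body><table>
<tr><th>Row one</th><td class="alignBottomRight col_1" id="r1c1">1,234</td><td class="alignBottomRight col_2" id="r1c2">9,999</td></tr>
<tr><th>Row two</th><td class="alignBottomRight col_1" id="r2c1">5,678</td><td class="alignBottomRight col_2" id="r2c2">8,888</td></tr>
</table></body></html>
HTML;

$doc = new DOMDocument();
// Real pages are rarely valid; collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loadHTML() parses a string; load() expects a file path or URL
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match every <td> whose class list contains the token "col_1".
$cells = $xpath->query('//td[contains(concat(" ", normalize-space(@class), " "), " col_1 ")]');

$column = [];
foreach ($cells as $cell) {
    $column[] = trim($cell->textContent);
}

print_r($column);
```

With the real page, $html would be replaced by the string returned from $spider->get(); whether the live markup matches this class name exactly is an assumption based on the first post.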

chrishirst, 10-03-2010, 05:38 PM:
Quote:
The problem is that they don't have a unique ID for them
That's quite unsporting of them.

What does the document source look like and is there something unique about the section you want?

Regular expressions are "greedy" by default, so they may match too much. It will be a case of using strpos() to get the start position and the length, then substr() to extract the data.
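
The strpos()/substr() approach could be sketched roughly as follows. The start and end markers come from the first post; the sample source, ids and values are hypothetical, and the loop assumes no attribute value contains a ">" character.

```php
<?php
// Hypothetical page source; the ids and values are made up for illustration.
$html = '<tr><td class="alignBottomRight col_1" id="r1">1,234</td>'
      . '<td class="alignBottomRight col_2" id="r1b">9,999</td></tr>'
      . '<tr><td class="alignBottomRight col_1" id="r2">5,678</td></tr>';

$startMarker = '<td class="alignBottomRight col_1"';
$endMarker   = '</td>';

$values = [];
$offset = 0;
// Walk through the source, one matching cell per iteration.
while (($start = strpos($html, $startMarker, $offset)) !== false) {
    // Skip the rest of the opening tag (the changing id) up to its closing ">".
    $open = strpos($html, '>', $start);
    if ($open === false) {
        break; // malformed source: opening tag never closes
    }
    $open += 1;
    $end = strpos($html, $endMarker, $open);
    if ($end === false) {
        break; // malformed source: no closing tag
    }
    $values[] = trim(substr($html, $open, $end - $open));
    // Continue searching after this cell.
    $offset = $end + strlen($endMarker);
}

print_r($values);
```

If a cell contains nested markup, the extracted text will still include it, so passing each value through strip_tags() afterwards may be needed.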

Truly, 10-05-2010, 10:59 AM:
Haha, that is rather unsporting of them, isn't it?

I am trying to pull information from Statistics Canada on different census tracts. Here is an example of a page for one of the census tracts: http://www12.statcan.ca/census-recen...707&CATYPE=CMA

I want to take all the information from the column marked '0109.01 (CT)' and add it to an array so that I can output it to a CSV file once I have done this for all the necessary census tracts (about 100).

If all else fails I can spend a Saturday copying, pasting and then formatting everything, but that kind of sucks, plus I would like to figure this out in case I ever need to do something like this again in the future.
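
Once the values for each tract are in an array, writing the CSV is the easy part. A minimal sketch with fputcsv(), assuming a hypothetical result set (the tract id 0109.02, all values, and the output path are invented for illustration):

```php
<?php
// Hypothetical results: one array of column values per census tract.
$tracts = [
    '0109.01' => ['1,234', '5,678'],
    '0109.02' => ['2,345', '6,789'],
];

$path = sys_get_temp_dir() . '/tracts.csv'; // assumed output location
$out = fopen($path, 'w');
foreach ($tracts as $tractId => $values) {
    // One row per tract: the tract id followed by its values.
    // Separator, enclosure and escape are passed explicitly to keep behaviour predictable.
    fputcsv($out, array_merge([$tractId], $values), ',', '"', '\\');
}
fclose($out);
```

fputcsv() quotes any field that contains the separator, so values such as 1,234 stay in a single column when the file is opened in a spreadsheet.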

elf2002 (Alex), 10-05-2010, 11:50 AM:
strip_tags()?

chrishirst, 10-05-2010, 12:06 PM:
How about looking at

http://wonshik.com/snippet/Convert-H...to-a-PHP-Array and http://www.phpbuilder.com/board/show...php?t=10313404

Truly, 10-05-2010, 05:00 PM:
Chris, the links from the second site to the script don't seem to work, but the first script looks great!

I won't have a chance to try it until Friday, but I will let you know how it works out when I do.

Thanks for all your help.

ScrapingWeb.com, 10-05-2010, 08:41 PM:
strip_tags() works for a small chunk of HTML. For a large document such as an entire web page, preg_match() or preg_match_all() with regular expressions is better at dealing with complicated situations.
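
As a rough illustration of the preg_match_all() route, here is a sketch that uses a lazy quantifier so each match stops at the first closing tag instead of running on greedily. The class name comes from the first post; the sample source, ids and values are hypothetical.

```php
<?php
// Hypothetical page source; ids and values are invented for the example.
$html = '<td class="alignBottomRight col_1" id="r1">1,234</td>'
      . '<td class="alignBottomRight col_2" id="r1b">9,999</td>'
      . '<td class="alignBottomRight col_1" id="r2">5,678</td>';

// [^>]* skips the remaining attributes (the changing id);
// (.*?) is a lazy group, so each match stops at the first </td>;
// the s flag lets "." span line breaks inside a cell.
$pattern = '~<td class="alignBottomRight col_1"[^>]*>(.*?)</td>~s';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured cell contents, one entry per matching cell.
$values = array_map('trim', $matches[1]);

print_r($values);
```

A pattern like this depends on the exact attribute order and spacing in the source, so it tends to break if the page markup changes.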

Truly, 10-07-2010, 05:27 PM:
Sorry Chris, but I could use some hand holding on this because I have never really worked with classes before and the instructions are confusing me more.

Should 'var $source = NULL;' actually be equal to the URL of the page? And then just change the anchor variables up top there too?

Sorry, I feel dumb asking this question.