Old 06-26-2006, 08:46 AM Site search script
ibbo's Avatar
Super Spam Talker

Posts: 880
Location: Leeds UK
Trades: 0
Code:
<?php

// author: Mark Ibbotson
// email : 
error_reporting(E_COMPILE_ERROR|E_ERROR|E_CORE_ERROR);

class Search{

	var $_domain = "";
	var $_keyword = "";
	var $_links = Array();
	var $_matched = Array();
	var $_usefull=Array();

	function Search(){
	}

	function openSite(){
		// opens DOMAIN and scans the home page for potential site links
		// looking for short links I.E "/\/[a-z0-9]*/i"	/support and not www.DOMAIN.com/support

		if(!empty($this->_domain)){

			$page = file($this->_domain);
			if($page === false) return; // home page could not be fetched

			foreach($page as $val){

				if(preg_match("/\/[a-z0-9]*/i", $val, $m)){
					// add matched short links to array for future consideration
					array_push($this->_links, $m[0]);
				}
				// separate test: as an elseif this could never run, because the
				// one level pattern above matches whenever this one does
				if(preg_match("/\/[a-z0-9]*\/[a-z0-9]*/i", $val, $mn)){
					array_push($this->_links, $mn[0]);
				}
			}
		}
	}

	function findPageOnSearch(){
		// provide a list of links of matched pages and search terms
		
		foreach($this->_links as $key=>$val){
		
			// remove duplicates from top level link array
			if(!in_array($val,$this->_usefull)){
				// usefull, unique list of short links
				array_push($this->_usefull, $val);
			}
		}

        // lets make sure we dont parse any pages that dont require parsing. 
        // css directory, image directory etc
                
		foreach($this->_usefull as $key=>$val){
			
			if($val != "/css" && $val != "/js" && $val != "/images"){
				// if none of the above pages grab the current page
				
				$page = file_get_contents($this->_domain . $val);
				if($page === false) continue; // skip pages that could not be fetched

				// can we match our keyword in that page
				if($this->_keyword != "" && strpos($page, $this->_keyword) !== false){
					// if so lets display that match and a link to the page
					// (escaped so user input cannot inject markup)
					$url = htmlspecialchars($this->_domain . $val);
					echo "MATCHED : " . htmlspecialchars($this->_keyword) . "<hr />";
					echo "<a href=\"" . $url . "\">" . $url . "</a> <br />";
					echo "<hr />";
				}
			}
		}
	}

}

// ensure POST is set before kick off
if(!empty($_POST)){

	if(isset($_POST['domain']) && isset($_POST['keyword'])){
		// new search obj.
		$search = new Search();

		// search vars
		$search->_domain = $_POST['domain'];
		$search->_keyword = $_POST['keyword'];

		// dog work methods
		$search->openSite();
		$search->findPageOnSearch();
	}
}
?>
<html>
<head>
<title>Site search</title>
</head>
<body>
 <form method="post" action="search.php">
  <table border="0">
   <tr><td>Domain</td><td><input type="text" name="domain" value="http://domain.com" /></td></tr>
   <tr><td>Keyword</td><td><input type="text" name="keyword" value="" /></td></tr>
   <tr><td></td><td><input type="submit" value="Go" /></td></tr>
  </table>
 </form>
</body>
</html>
This is in the prototype stage and it's CASE SENSITIVE (add strtolower where appropriate if you use it).
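As a rough sketch of the case-insensitive variant mentioned above: pageContainsKeyword() is a made-up helper name, not part of the script, and it uses PHP's stripos() instead of lowercasing both strings.

```php
<?php
// Hypothetical helper (not in the original script): returns true when
// $keyword occurs in $page. With $ignoreCase set, the comparison ignores
// case (stripos); otherwise it is case sensitive (strpos).
function pageContainsKeyword($page, $keyword, $ignoreCase = false)
{
    if ($keyword === '') {
        return false; // an empty keyword matches nothing
    }
    $pos = $ignoreCase ? stripos($page, $keyword) : strpos($page, $keyword);
    return $pos !== false;
}

var_dump(pageContainsKeyword('Ubuntu Support Pages', 'support'));       // bool(false)
var_dump(pageContainsKeyword('Ubuntu Support Pages', 'support', true)); // bool(true)
```

Comparing against false with !== matters here, because a match at position 0 would otherwise be treated as no match.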

It works like this:

Enter the domain name (http:// included).

The script then goes away and pregs the home page for /[a-z0-9]*, looking for short links like href="/support", which it then adds to $_links.
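One possible way to tighten the link scan is to anchor the pattern on the href attribute, so that things like closing tags or http:// do not get picked up. This is only a sketch: findShortLinks() and its pattern are assumptions for illustration, not what the script above does.

```php
<?php
// Hypothetical helper: collect site-relative links such as /support or
// /desktop/features from href attributes in a chunk of HTML.
function findShortLinks($html)
{
    // href="/name" optionally followed by further "/name" parts
    $pattern = '~href=["\'](/[a-z0-9]+(?:/[a-z0-9]+)*)~i';
    if (!preg_match_all($pattern, $html, $matches)) {
        return array(); // no matches (or a regex error)
    }
    return $matches[1]; // contents of the first capture group
}

$html = '<a href="/support">Support</a> <a href="http://example.com/x">X</a>'
      . '<a href="/desktop/features">Features</a>';
print_r(findShortLinks($html));
```

Unlike preg_match, preg_match_all collects every match in the input rather than only the first one.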

$_links is then polled and all duplicates are removed leaving us with $_usefull.
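The duplicate removal, together with the skip list for /css, /js and /images that the script applies later, could be sketched as below. usefulLinks() is a hypothetical name, and array_unique() is swapped in for the in_array loop used in the script.

```php
<?php
// Hypothetical helper: drop duplicate links and links that point at
// asset directories which are not worth searching.
function usefulLinks($links, $skip = array('/css', '/js', '/images'))
{
    // array_unique keeps the first occurrence; array_values reindexes
    $unique = array_values(array_unique($links));
    $kept = array();
    foreach ($unique as $link) {
        if (!in_array($link, $skip, true)) {
            $kept[] = $link;
        }
    }
    return $kept;
}

print_r(usefulLinks(array('/support', '/css', '/support', '/desktop')));
```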

Poll $_usefull and open each page it refers to. A simple strstr on that page matches your keyword, and hey presto, it lists links to the pages containing your keyword.

Of course it's far from being a Google-worthy search engine (and it seriously lacks features beyond matching keywords).

Anyway, it's there for you all to chop to bits, improve upon and use at your leisure.

Any mods or ideas to improve it would also be welcome.

Hope it helps some of you, it certainly is useful to me.

Ibbo
__________________


Linux user #349545 :
(GNU/Linux)iD8DBQBAzWjX+MZAIjBWXGURAmflAKCntuBbuKCWenpm XoA7LNydllVQOwCf

Last edited by ibbo; 06-27-2006 at 08:13 AM..
Old 06-27-2006, 06:41 AM Re: Site search script
ibbo
PHP Code:
<?php

// author: Mark Ibbotson
// email :

error_reporting(E_COMPILE_ERROR|E_ERROR|E_CORE_ERROR);

class Search{

    var $_domain = "";
    var $_keyword = "";
    var $_links = Array();
    var $_matched = Array();
    var $_usefull = Array();
    var $_case = false;

    function Search(){}

    /*
     * Purpose: Grabs a root webpage
     * Params : null
    */
    function openSite(){
        // opens DOMAIN and scans the home page for potential site links
        // looking for short links I.E "/\/[a-z0-9]*/i"    /support and not www.DOMAIN.com/support
        if(!empty($this->_domain)){
            $this->findLinks(file($this->_domain));
        }
    }

    /*
     * Purpose: Poll a page for possible links
     * Params : $page page of interest (array of lines, as returned by file())
    */
    function findLinks($page){
        // file() returns false when the page could not be fetched
        if(!is_array($page)) return;

        // we are looking for dir's to add to our list for future consideration
        foreach($page as $val){
            // 1 dir deep
            if(preg_match("/\/[a-z0-9]*/i", $val, $m)){
                array_push($this->_links, $m[0]);
            }
            // 2 deep
            if(preg_match("/\/[a-z0-9]*\/[a-z0-9]*/i", $val, $mn)){
                array_push($this->_links, $mn[0]);
            }
            // 3 deep
            if(preg_match("/\/[a-z0-9]*\/[a-z0-9]*\/[a-z0-9]*/i", $val, $mn)){
                array_push($this->_links, $mn[0]);
            }
            // 4 deep
            if(preg_match("/\/[a-z0-9]*\/[a-z0-9]*\/[a-z0-9]*\/[a-z0-9]*/i", $val, $mn)){
                array_push($this->_links, $mn[0]);
            }
        }
    }

    /*
     * Purpose: Parse link array, remove duplicates
     *          Investigate sub pages for further links
     * Params : NULL
    */
    function parseLinks(){
        // provide a list of links of matched pages and search terms
        foreach($this->_links as $key=>$val){
            // remove duplicates from top level link array
            if(!in_array($val,$this->_usefull)){
                // useful, unique list of short links
                array_push($this->_usefull, $val);
                // lets see if sub pages hold any links for us to explore
            }
        }
    }

    /*
     * Purpose: Poll useful links and open each page therein.
     *          match keywords on that page, if a match is found echo a link
     * Params : NULL
    */
    function showResults(){
        echo "Please be patient this could take several minutes<br />";
        echo "Results for search " . htmlspecialchars($this->_keyword);
        if($this->_case){ echo " using case insensitive match";}
        echo "<br />";

        // grab sub page links and restart,
        // killer on time as each sub page needs to be polled for links within it
        // then those links need to be scanned for keyword match
        foreach($this->_usefull as $key=>$val){
            $this->findLinks(file($this->_domain . $val));
            $this->parseLinks();
        }

        foreach($this->_usefull as $key=>$val){
            // lets make sure we dont parse any pages that dont require parsing.
            // css directory, image directory etc
            if($val != "/css" && $val != "/js" && $val != "/images" && $val != "/W3C"){
                // if none of the above pages grab the current page
                $page = file_get_contents($this->_domain . $val);
                if($page === false) continue; // skip pages that could not be fetched

                // convert to lowercase on case insensitive match
                if($this->_case){
                    $page = strtolower($page);
                    $this->_keyword = strtolower($this->_keyword);
                }
                // can we match our keyword in that page
                if($this->_keyword != "" && strpos($page, $this->_keyword) !== false){
                    // if so lets display a link to the page (escaped for output)
                    $url = htmlspecialchars($this->_domain . $val);
                    echo "<a href=\"" . $url . "\">" . $url . "</a> <br />";
                }
            }
        }
    }
}

// ensure POST is set before kick off
if(!empty($_POST)){

    if(isset($_POST['domain']) && isset($_POST['keyword'])){
        // new search obj.
        $search = new Search();

        // search vars
        $search->_domain = $_POST['domain'];
        $search->_keyword = $_POST['keyword'];
        $search->_case = isset($_POST['case']);

        // dog work methods
        $search->openSite();
        $search->parseLinks();
        $search->showResults();
    }
}
?>

<html>
<head>
<title>Site search</title>
</head>
<body>
 <form method="post" action="search.php">
  <table border="0">
   <tr><td>Domain</td><td><input type="text" name="domain" value="http://www.ubuntu.com" /></td></tr>
   <tr><td>Keyword</td>
       <td><input type="text" name="keyword" value="" />
           <input type="checkbox" name="case" /> case insensitive
       </td>
   </tr>
   <tr><td></td><td><input type="submit" value="Go" /></td></tr>
  </table>
 </form>
</body>
</html>
A modified version that polls sub pages for links and follows them too.

Penalty: it takes approx. 3 minutes to search, "BUT if it's there I'm 91% sure it will find it".

Ibbo
Last edited by ibbo; 06-27-2006 at 08:12 AM..
Old 06-27-2006, 12:18 PM Re: Site search script
mgraphic's Avatar
Truth Seeker

Posts: 2,918
Name: Keith Marshall
Location: Connecticut
Trades: 0
Quote:
"BUT if it's there I'm 91% sure it will find it"
The other 9% is sure that you forgot to put it there in the first place!
__________________

<mgraphic /> - I don't have a solution but I admire the problem.
Old 06-28-2006, 04:49 AM Re: Site search script
ibbo
PHP Code:
function findLinks($page){
        // we are looking for dir's to add to our list for future consideration
        foreach($page as $val){
            // 1 dir deep
            if(preg_match("/\/[a-z0-9]*/i", $val, $m)){
                array_push($this->_links, $m[0]);
            }
            // 2 deep
            if(preg_match("/\/[a-z0-9]*\/[a-z0-9]*/i", $val, $mn)){
                array_push($this->_links, $mn[0]);
            }
            // 3 deep
            if(preg_match("/\/[a-z0-9]*\/[a-z0-9]*\/[a-z0-9]*/i", $val, $mn)){
                array_push($this->_links, $mn[0]);
            }
            // 4 deep
            if(preg_match("/\/[a-z0-9]*\/[a-z0-9]*\/[a-z0-9]*\/[a-z0-9]*/i", $val, $mn)){
                array_push($this->_links, $mn[0]);
            }
        }
    }
Modify this method to take another param for depth. As it only searches 4 dirs deep at the moment, I did not feel confident saying 100%. However, if you pass a depth parameter to this method and loop a preg_match instead of the fixed 4 in there, you will soon start to approach your 100% success rate.
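A rough sketch of what that could look like, written as a standalone function rather than a class method so it can be tried on its own. findLinksToDepth() and its $depth default are assumptions for illustration; the pattern pieces are the same ones used in findLinks().

```php
<?php
// Hypothetical standalone version of findLinks() with a depth parameter.
// $lines is an array of lines (as returned by file()); $depth is the
// maximum number of directory levels to match.
function findLinksToDepth($lines, $depth = 4)
{
    $links = array();
    foreach ($lines as $line) {
        $segments = '';
        for ($level = 1; $level <= $depth; $level++) {
            // each pass adds one more "/name" part to the pattern
            $segments .= '\/[a-z0-9]*';
            if (preg_match('/' . $segments . '/i', $line, $m)) {
                $links[] = $m[0];
            } else {
                break; // a deeper pattern cannot match if this one did not
            }
        }
    }
    return $links;
}

print_r(findLinksToDepth(array('<a href="/a/b/c">x</a>'), 2));
```

The early break saves some regex work on lines with shallow links, since any match for a deeper pattern also contains a match for the shallower one.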

However, the penalty will be an increase in crawl time (for that's all it is in essence). In its present state it takes just short of 3 mins to crawl 4 dirs deep. Increase this to six and you're probably going to be sat waiting for 6+ mins.
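One way to keep the wait under control is to give the crawl a time budget and stop once it is used up. This is only a sketch under assumptions: crawlWithBudget() is a hypothetical helper, and the callback stands in for whatever the script does with each link (fetching and matching).

```php
<?php
// Hypothetical helper: call $visit for each link until either the list is
// exhausted or $maxSeconds have elapsed. Returns the number of links visited.
function crawlWithBudget($links, $visit, $maxSeconds)
{
    $start = microtime(true);
    $visited = 0;
    foreach ($links as $link) {
        if (microtime(true) - $start >= $maxSeconds) {
            break; // out of time, stop crawling
        }
        call_user_func($visit, $link);
        $visited++;
    }
    return $visited;
}

// Usage: record each link instead of fetching it, with a 60 second budget
$seen = array();
$count = crawlWithBudget(
    array('/support', '/desktop', '/server'),
    function ($link) use (&$seen) { $seen[] = $link; },
    60
);
echo $count . " links visited\n";
```

The budget is only checked between links, so a single slow page can still overrun it.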

It's a case of how low do you go.

Give it a try on any site; you will find it finds your keywords and lists the pages it finds them on.

Ibbo
Last edited by ibbo; 06-28-2006 at 04:52 AM..