|
Cache creator & Crawler to detect dead links on website.
06-07-2007, 04:09 PM
|
Cache creator & Crawler to detect dead links on website.
|
Posts: 8
|
Hi,
I have about 25,000 external links in the resource database of my community website. How easy is it to create a crawler that automatically detects the dead links and reports them back to the admin for manual removal?
And how can I create and display a cache for all these external links, in case the external website is down, just like Google has?
If you can refer me to some free/cheap software or literature available, it would be great.
Thanks in advance!
Ins
|
|
|
|
06-07-2007, 04:19 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 3,189
|
I've done something similar in the past with Snoopy, although it never involved that many links, or caching. PHP might not be optimal for that sort of thing anyway. Do you have a language of preference for this project?
|
|
|
|
06-07-2007, 04:28 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 8
|
Thanks Republikin,
The database is already in PHP/MySQL, and can't change it at this time. All I need is for this software to add a column called cache and store caches for all these links and then to regularly crawl the database to (ping each link?) and generate a report of dead links.
I have heard the script for this is very easy to generate, but I have no idea what and how. So, I am looking for third-party software.
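For anyone wondering what such a script roughly looks like, here is a minimal sketch of the dead-link check in Python. It is only an illustration, not a finished tool: the function names, the User-Agent string and the 10-second timeout are made up for the example, and it assumes that a HEAD request is enough to tell whether a page is alive (some servers reject HEAD, so a real checker may need to fall back to GET).

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10, opener=urllib.request.urlopen):
    """Return (is_alive, detail) for one URL, using an HTTP HEAD request.

    `opener` is injectable so the function can be tried without a network.
    """
    request = urllib.request.Request(
        url,
        method="HEAD",
        headers={"User-Agent": "dead-link-checker-sketch"},  # hypothetical name
    )
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout and similar problems.
        return False, f"unreachable: {err}"
    return 200 <= status < 400, f"HTTP {status}"


def find_dead_links(urls, checker=check_link):
    """Check every URL and return a report of (url, detail) for the dead ones."""
    report = []
    for url in urls:
        alive, detail = checker(url)
        if not alive:
            report.append((url, detail))
    return report
```

In practice the list of URLs would come from the links table and the report would be mailed to the admin; with 25,000 links you would also want to spread the checks out rather than run them all in one go.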
Let me know if you know something on this one.
Thanks.
Ins
|
|
|
|
06-07-2007, 05:15 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 5,662
Name: John Alexander
|
Sounds like what you really need is a programmer. When you buy a 3rd party component, how will you integrate it with what you have now?
|
|
|
|
06-07-2007, 05:59 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 8
|
Yes, having a programmer do this for my website would be better. That is why I ask how and how easy it would be, so I can make a better judgement on what to pay, what particular script guy to approach, and how long it will take. Can you give any ideas?
In addition, if I can check a commercially available script, it may not be as bad, because I don't think this needs to be integrated into my website (I may be wrong). Just get the cache for all links once and put it in my database field, and just receive a daily report of dead links. I believe an outside script can do this.
Ins
|
|
|
|
06-07-2007, 06:22 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 5,662
Name: John Alexander
|
I don't know. If it were ASP/ASPX I could do it for a couple hundred bucks, I think. But I don't have any knowledge of PHP. .NET code can easily talk to a MySQL database and do what you need in terms of web interaction, but it would need Windows to run on.
Republikin hangs out more in these coding forums than in places like Google and SEO, and so far as I can tell he's very competent at what he sets his mind to. I've only bumped into him so much, but it sounds like he's done this sort of thing before. If you think a coder is the right road to go down, he should be your first candidate.
|
|
|
|
06-07-2007, 06:28 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 3,189
|
Thanks for the recommendation, Learning Newbie. The price and time frame would depend a bit on what level of integration you need. If you have an admin backend that it would need to be integrated into, this could take some more time to do right.
PM me the exact details: exactly what you need this script to do and not do. If I can't do it, I can certainly help you find a competent and dependable programmer.
|
|
|
|
06-07-2007, 07:36 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 5,662
Name: John Alexander
|
Quote:
Originally Posted by Republikin
The price and time frame would depend a bit on what level of integration you need. If you have an admin backend that it would need to be integrated into, this could take some more time to do right.
|
This already displays more knowledge about Linux hosting and administration than I've built up over the course of my career. That's not because I'm stupid; I just work in a different environment. But for comparison's sake, you know.
|
|
|
|
06-10-2007, 08:37 AM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 27
Name: Mike Robinson
Location: London, England
|
Why not just keep a status alongside each link in the database, along with the date and time it was set? If a page is not reachable, just set the flag to unavailable and don't display the link. Keep checking, and if the page is still unavailable after, perhaps, 3 weeks, delete it.
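As a rough sketch of that rule in Python (the function name, the action names and the exact three-week cut-off are assumptions made for the example), the decision after each check could look like this:

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(weeks=3)  # "perhaps after 3 weeks"


def next_action(is_reachable, unavailable_since, now):
    """Decide what to do with one link after a check.

    Returns (action, unavailable_since), where action is 'show', 'hide'
    or 'delete', and unavailable_since is the timestamp to store back
    alongside the link (None while the link is fine).
    """
    if is_reachable:
        return "show", None  # link is fine, clear any earlier failure
    if unavailable_since is None:
        return "hide", now  # first failure: hide it and start the clock
    if now - unavailable_since >= GRACE_PERIOD:
        return "delete", unavailable_since  # still dead after the grace period
    return "hide", unavailable_since  # still dead, keep waiting


# Example: a link that first failed five days ago stays hidden for now.
now = datetime(2007, 6, 10, 8, 37)
action, since = next_action(False, now - timedelta(days=5), now)
```

Keeping the timestamp of the first failure, rather than of the latest check, is what lets a link come back if the outage turns out to be temporary.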
Why bother keeping a cache of pages? If the page is no longer available then just don't show it. If you want to cache 25K pages then I suspect you might need a lot more storage.
25k links is quite a lot of links for a community web site! Remember that if you have to read all those pages each day/week/month then your web usage (I forget the proper word) costs are likely to mount.
Mike
|
|
|
|
06-11-2007, 01:11 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 5,662
Name: John Alexander
|
I agree 25 K is a lot for a community site, a portal, even a normal directory.
But I don't think it would take a huge amount of space, at least if done well. If you figure 20 KB per page on average, 25,000 pages amounts to about 500 MB, or half a gig, although storing long character data like that in a relational, structured database will balloon the figure.
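Working the estimate through, assuming 25,000 links and 20 KB of raw HTML per page, with no images, no compression and no database overhead (decimal units, 1 MB = 1,000 KB):

```python
PAGES = 25_000      # number of external links in the database
AVG_PAGE_KB = 20    # assumed average size of one cached page, in KB

total_kb = PAGES * AVG_PAGE_KB
total_mb = total_kb / 1000

print(f"{total_kb:,} KB is about {total_mb:,.0f} MB")
# prints "500,000 KB is about 500 MB"
```

So the raw text comes to roughly half a gigabyte before any storage overhead.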
|
|
|
|
06-11-2007, 09:11 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 27
Name: Mike Robinson
Location: London, England
|
But why store pages that no longer exist? I can understand only wanting to show users good links but I can't see the benefit of showing the user the contents of a page that no longer exists. They should decide whether they want to build a community site that has good links or whether they want to cache the Internet.
Interfacing to the existing system may be easier if they leave the current links table alone (assuming there is just one table full of links) and set up a parallel table containing each link, its current status, the time it was last checked and the time it was last OK. You'd want a batch to run periodically to keep the two tables in sync, deleting bad links from the original table, putting them back if they come back online later and deleting them permanently from both tables if they don't come back online in a given time frame.
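To make the batch idea concrete, here is a minimal sketch in Python. It uses the built-in sqlite3 module purely as a stand-in so that it runs anywhere; the real site is on MySQL, and the table names, column names and three-week limit are all assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timedelta

GIVE_UP_AFTER = timedelta(weeks=3)  # assumed "given time frame"


def create_tables(conn):
    conn.executescript("""
        CREATE TABLE links (url TEXT PRIMARY KEY);   -- stand-in for the existing table
        CREATE TABLE link_status (                   -- the parallel table
            url          TEXT PRIMARY KEY,
            status       TEXT NOT NULL,              -- 'ok' or 'bad'
            last_checked TEXT NOT NULL,              -- ISO 8601 timestamps
            last_ok      TEXT NOT NULL
        );
    """)


def sync(conn, now):
    """Periodic batch that keeps `links` and `link_status` in step."""
    stamp = now.isoformat(timespec="seconds")
    cutoff = (now - GIVE_UP_AFTER).isoformat(timespec="seconds")
    with conn:  # run all four steps in one transaction
        # 1. Start tracking links the site has added since the last run.
        conn.execute(
            "INSERT OR IGNORE INTO link_status (url, status, last_checked, last_ok) "
            "SELECT url, 'ok', ?, ? FROM links", (stamp, stamp))
        # 2. Take links that are currently bad out of the original table.
        conn.execute(
            "DELETE FROM links WHERE url IN "
            "(SELECT url FROM link_status WHERE status = 'bad')")
        # 3. Forget links that have not been OK within the time frame.
        conn.execute(
            "DELETE FROM link_status WHERE status = 'bad' AND last_ok <= ?",
            (cutoff,))
        # 4. Put back links that have come online again.
        conn.execute(
            "INSERT OR IGNORE INTO links (url) "
            "SELECT url FROM link_status WHERE status = 'ok'")
```

A separate checker job would update `status`, `last_checked` and `last_ok` in the parallel table; the batch above only moves rows around based on what the checker recorded.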
For a laugh I once had a go at a program to check people's home pages, and it produced OK reports showing bad links, spelling mistakes, etc. If you're patient it should still work: http://www.checkmypages.com . It does limit the number of warnings and the number of pages it will check, though. Company sites tend not to allow web robots to go reading through their pages, so it's better for user home pages.
Mike
Last edited by mike_bike_kite; 06-12-2007 at 08:05 AM..
Reason: corrected link
|
|
|
|
06-12-2007, 12:21 AM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 3,023
Name: Forrest Croce
Location: Seattle, WA
|
Quote:
Originally Posted by mike_bike_kite
Interfacing to the existing system may be easier if they leave the current links table alone (assuming there is just one table full of links) and set up a parallel table containing each link, its current status, the time it was last checked and the time it was last OK. You'd want a batch to run periodically to keep the two tables in sync, deleting bad links from the original table, putting them back if they come back online later and deleting them permanently from both tables if they don't come back online in a given time frame.
|
You could do that using triggers in SQL Server and Oracle; I would assume, or at least hope, that MySQL would let you do the same. That lets you attach a baby stored procedure to table updates, so as soon as it happens, the parallel table is synced.
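A small sketch of the trigger approach, again using Python's built-in sqlite3 module only as a stand-in so the example is runnable. MySQL's trigger syntax differs in the details, and the table, column and trigger names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE links (url TEXT PRIMARY KEY);       -- stand-in for the existing table
    CREATE TABLE link_status (                       -- the parallel table
        url    TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'unchecked'
    );

    -- Whenever a row is added to links, start tracking it in link_status.
    CREATE TRIGGER links_after_insert AFTER INSERT ON links
    FOR EACH ROW
    BEGIN
        INSERT OR IGNORE INTO link_status (url) VALUES (NEW.url);
    END;
""")

# The application only touches the original table...
conn.execute("INSERT INTO links (url) VALUES ('http://example.com/')")

# ...and the parallel table has been filled in by the trigger.
rows = conn.execute("SELECT url, status FROM link_status").fetchall()
print(rows)
```

Only the insert is handled by a trigger here. Removing and restoring bad links would still be left to the periodic batch, since a delete trigger that also cleared the parallel table would throw away the history needed to put a link back.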
|
|
|
|
06-12-2007, 05:07 PM
|
Re: Cache creator & Crawler to detect dead links on website.
|
Posts: 27
Name: Mike Robinson
Location: London, England
|
Quote:
|
You could do that using triggers in SQL Server and Oracle; I would assume, or at least hope, that MySQL would let you do the same.
|
Yes - MySQL version 5 onwards has triggers and stored procs.
Must admit I hate using triggers though (on any database system) - I always feel they hide the logic of what's going on. My personal preference is to nearly always go with stored procs as it's easy to see what's happening. I also find it easier to maintain the performance of systems when the data grows.
Systems with complicated cascading triggers are usually a nightmare to debug. I remember 20 years ago being responsible for the testing of an intelligent financial database where everything happened as if by magic using triggers. You'd add a row here and have no idea what other tables had changed unless you read through all the trigger code - it just put me off for life!
Mike
|
|
|
|