Tycoon Talk
How does Alexa choose which URL's to crawl? Looks dangerous.
Old 08-14-2007, 11:31 AM How does Alexa choose which URL's to crawl? Looks dangerous.
Skilled Talker

Posts: 77
Trades: 0
I was looking through the logs of a website of mine, and I noticed that the ia_archiver bot (IP: 64.208.172.177) had crawled some pages that I've never linked to from my main webpages or anywhere else for that matter. Also, I do not mention these specific pages in my robots.txt.

So then, I am wondering how Alexa comes about crawling pages that have never been linked to.

I have the Alexa Toolbar installed, as part of the Alexa/PageRank/Compete toolbar for Firefox. Does Alexa crawl every web page that I visit, even if the page is not referenced anywhere else online?

If this is the case - I would suspect that having the Alexa Toolbar would be dangerous for website owners who sell information products with static download pages. As of right now, when I do a search on Alexa for my website, the top result is the page that I've never linked to. . .

Unless I am completely missing something, this means that a lot of internet marketers might be unintentionally sharing their download page with the world through Alexa if they have the toolbar installed & they sell a product from their website using a static download page.


Not to scare anyone, as maybe this is common knowledge to some. . . I just thought it was a little disturbing to see a search engine robot crawling a webpage that is not public. Of course, I know that Alexa is spying on me every time I browse a web page; I just didn't know that they indexed pages based solely on me viewing them.

Does anyone have any insight into this?
whooligan is offline
Old 08-14-2007, 11:45 AM Re: How does Alexa choose which URL's to crawl? Looks dangerous.
ADAM Web Design's Avatar
Canadastaninianite

Posts: 5,938
Name: Adam for web page design, not program
Location: Toronto, Ontario, Canada
Trades: 0
Yeah...add something into your robots.txt that tells bots not to index or follow private pages. ia_archiver, to the best of my knowledge, will obey robots.txt directives.
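
As a rough illustration of that suggestion, here is a minimal sketch that checks a robots.txt rule with Python's standard urllib.robotparser module. The /private/ directory and the example.com URLs are assumptions made up for the example, not paths from this thread, and the check only shows what a well-behaved bot should do; it does not prove what ia_archiver actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep ia_archiver out of /private/,
# leave every other bot unrestricted.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Googlebot"):
        url = "http://example.com/private/test.html"
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Keep in mind that robots.txt is itself publicly readable, so any path listed there is visible to anyone who looks.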

This is why I never bother with the Alexa toolbar (that, and it's useless).
Old 08-14-2007, 12:05 PM Re: How does Alexa choose which URL's to crawl? Looks dangerous.
Skilled Talker

Posts: 77
Trades: 0
Thanks for the reply.

For me, adding my pages to the robots.txt file would work out fine; my bigger concern was for people who have pages that "really" should be hidden. While it's a little disturbing for me, if I don't want people to access a page, they really need to know what they're doing to get what I am hiding from them. The pages that were indexed on my site were of no consequence: some testing pages for scripts written on Windows, to be sure they worked on Linux.

These people (those with static download pages) shouldn't put their hidden pages in the robots.txt file as this is one of the first places to look when looking to steal a product from a digital download website.

I wanted a Google toolbar for Firefox, and it came with Alexa attached. I agree that Alexa rank is useless (except for maybe making uninformed people feel better about themselves), and the toolbar PageRank is unreliable as well.

So then, aside from me seeing this with my own eyes, does this happen often?
whooligan is offline
Old 08-17-2007, 04:17 PM Re: How does Alexa choose which URL's to crawl? Looks dangerous.
Learning Newbie's Avatar
Defies a Status

Posts: 5,662
Name: John Alexander
Trades: 0
Quote:
Originally Posted by whooligan View Post
These people (those with static download pages) shouldn't put their hidden pages in the robots.txt file as this is one of the first places to look when looking to steal a product from a digital download website.
Well there's your problem! Static html has ZERO security. If you want to use words like steal, it's like if I went to work and left my front door wide open - a person could see that as an invitation and have plausible deniability. You have ZERO security if you don't have passwords and user accounts at a minimum.

Quote:
Originally Posted by whooligan View Post
So then, aside from me seeing this with my own eyes, does this happen often?
Well why do you think Alexa makes the toolbar? Not for charity I hope! It's to allow them to collect data. That's where their rank comes from - every time you visit a page the toolbar sends the URL and your IP address to the mothership so they can rank them for traffic or viewership.
Old 08-17-2007, 06:00 PM Re: How does Alexa choose which URL's to crawl? Looks dangerous.
Skilled Talker

Posts: 77
Trades: 0
This was what I was looking for:

Quote:
Originally Posted by Learning Newbie View Post
... every time you visit a page the toolbar sends the URL and your IP address to the mothership so they can rank them for traffic or viewership.
Being rather ignorant as to the ways of alexa, I was only wondering why a page that I never linked to was getting hits from their crawler.

Quote:
Originally Posted by Learning Newbie View Post
Well why do you think Alexa makes the toolbar? Not for charity I hope!
I have no vested interest in knowing the specifics of the Alexa toolbar, what they collect, or how they use it; it came attached to a plugin that I use for Firefox. Because I don't know much about what Alexa does, other than in a general way, I needed to know why their robot was hitting a specific page on my website. It would be rather foolish to assume that they distribute their toolbar with no potential gain on their part.

Quote:
Originally Posted by Learning Newbie View Post
Well there's your problem! Static html has ZERO security. If you want to use words like steal, it's like if I went to work and left my front door wide open - a person could see that as an invitation and have plausible deniability. You have ZERO security if you don't have passwords and user accounts at a minimum.
I am not worried about me, it's not something that affects *my* websites. I haven't used a static download page for my own products since 2003. Depending on the format, my downloads are either stored in mysql, compiled, zipped & then written to the server & deleted after a few minutes or they're stored below the public_html directory & copied to a unique download location that is obfuscated with a query string then deleted a few minutes later.
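
For readers unfamiliar with this pattern, here is a minimal sketch of the second variant: copy a file from a private location into a randomly named public folder, then purge it after a few minutes. It is not the poster's actual code. The function names, the random folder (used here in place of the obfuscated query string), and the five-minute lifetime are all assumptions for illustration.

```python
import os
import secrets
import shutil
import time


def publish_temporary_copy(source_path, public_dir):
    """Copy a protected file into a randomly named folder under public_dir."""
    token = secrets.token_urlsafe(16)  # unguessable folder name
    target_dir = os.path.join(public_dir, token)
    os.makedirs(target_dir)
    target_path = os.path.join(target_dir, os.path.basename(source_path))
    shutil.copy(source_path, target_path)
    return target_path


def purge_expired(public_dir, max_age_seconds, now=None):
    """Delete token folders older than max_age_seconds; return the count removed."""
    now = time.time() if now is None else now
    removed = 0
    for name in os.listdir(public_dir):
        folder = os.path.join(public_dir, name)
        if os.path.isdir(folder) and now - os.path.getmtime(folder) > max_age_seconds:
            shutil.rmtree(folder)
            removed += 1
    return removed
```

In this sketch, purge_expired(public_dir, 300) would be run on a schedule (for example from cron) so that each copy lives for roughly five minutes. During that window the link still works for anyone who learns it, including a crawler.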

My concern was for the ignorant webmasters who don't know that having the alexa toolbar installed might make it easy for someone to steal their products.

I used the word steal because I don't see how there's any other way to put it. Whether I or you or anyone else is foolish enough to leave a door wide open or unlocked doesn't negate the fact that making a conscious decision to obtain something that is not rightfully theirs to take is "stealing". If there's another word that more aptly describes the situation, I apologize for my ignorance.

In thinking about this further, there's even more concern for webmasters with static download pages than I initially thought. It's not only the webmaster's or marketer's own Alexa toolbar that can give up the download page; if anybody who purchases a product has the Alexa toolbar installed, that will send the data to Alexa for indexing as well.
whooligan is offline
Old 08-17-2007, 10:14 PM Re: How does Alexa choose which URL's to crawl? Looks dangerous.
ForrestCroce's Avatar
Half Man, Half Amazing

Posts: 3,023
Name: Forrest Croce
Location: Seattle, WA
Trades: 0
Quote:
Originally Posted by whooligan View Post
In thinking about this further, there's even more concern for webmasters with static download pages than I initially thought. It's not only the webmaster's or marketer's own Alexa toolbar that can give up the download page; if anybody who purchases a product has the Alexa toolbar installed, that will send the data to Alexa for indexing as well.
If you believe the writing all over the wall, the "next big thing" in search is personalization. If you're interested in content production, when you do a search, especially one that's ambiguous, a company that has stored up a history of your actions, even a narrow slice of them, is going to be able to get you what you want with less trouble than a different company that doesn't know a thing about you. This is why all the big web search engines have their own toolbar...

I'm not sure the info John gave you is 100% accurate, but ultimately that's probably what happened. Alexa makes its ranking by taking the data its toolbars send, assuming those users are a sample of some percentage of the net population, and extrapolating from there. It seems like for that to work they would need the toolbar to send at least your domain name to Alexa's servers ... maybe the page URL, since they run a search engine, too.
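
To make the extrapolation idea concrete, here is a tiny sketch of the arithmetic. It is only an illustration of scaling up from a sample, not Alexa's actual formula; the function name and the sample share are assumptions.

```python
def estimate_total_visits(toolbar_visits, toolbar_share):
    """Scale visits seen from toolbar users up to the whole population.

    toolbar_share is the assumed fraction of all visitors who run the
    toolbar (for example 0.25 for one in four).
    """
    if not 0 < toolbar_share <= 1:
        raise ValueError("toolbar_share must be in the range (0, 1]")
    return toolbar_visits / toolbar_share
```

With these assumptions, 120 toolbar visits at a 25% share would be read as about 480 visits overall. The smaller the share, the more a single toolbar user's browsing moves the estimate.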

I worked for a particular digital content house that used expiring urls, somewhat like you describe. That wasn't to prevent theft or infringement so much as to minimize it and not waste too much bandwidth on the problem.
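
One common way to build expiring URLs is to sign the path and an expiry time with a secret key. The sketch below shows that general technique, not the system used at the company mentioned above; the key, the parameter names expires and sig, and the paths are all hypothetical.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; keep it off the web root


def sign_download(path, expires_at, key=SECRET_KEY):
    """Return a hex signature binding a path to its expiry timestamp."""
    message = f"{path}:{expires_at}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def make_expiring_url(path, ttl_seconds, now=None, key=SECRET_KEY):
    """Build a URL that stops validating ttl_seconds after 'now'."""
    now = int(time.time()) if now is None else int(now)
    expires_at = now + ttl_seconds
    signature = sign_download(path, expires_at, key)
    return f"{path}?expires={expires_at}&sig={signature}"


def is_valid(path, expires_at, signature, now=None, key=SECRET_KEY):
    """Check that the link has not expired and the signature matches."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    expected = sign_download(path, expires_at, key)
    return hmac.compare_digest(expected, signature)
```

As the post says, this limits the damage rather than preventing it: a link that leaks to a crawler still works until it expires.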

A much better approach is to intercept every request for protected content: when someone, somewhere, asks for http://example.com/somefile.jpg, run logic that makes sure they're authenticated and authorized. I'm not well enough versed in Apache to tell you exactly what code to write. But suppose you only ever link to files that don't exist, and use either a custom 404 page with logic or mod_rewrite to send the request to a PHP page, with encrypted query string parameters and/or cookies maintaining enough session state to know who the person is (assuming they've logged in). Then you've got the problem solved.
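
The check itself is simple once every request passes through code. Below is a minimal sketch of the authenticate-then-authorize decision, written in Python rather than PHP. The session store, the purchase records, and the function name are hypothetical stand-ins for whatever a real site keeps in its database.

```python
# Hypothetical data: session token -> user, and user -> purchased paths.
SESSIONS = {"abc123": "alice"}
PURCHASES = {"alice": {"/files/ebook.pdf"}}


def authorize(session_token, requested_path, sessions=SESSIONS, purchases=PURCHASES):
    """Return the HTTP status code a protected-download handler should send."""
    user = sessions.get(session_token)
    if user is None:
        return 401  # not logged in: authentication failed
    if requested_path not in purchases.get(user, set()):
        return 403  # logged in, but never bought this file
    return 200  # authenticated and authorized: stream the file


if __name__ == "__main__":
    print(authorize("abc123", "/files/ebook.pdf"))
    print(authorize("abc123", "/files/other.pdf"))
    print(authorize(None, "/files/ebook.pdf"))
```

A crawler that learns the URL from a toolbar carries no valid session, so under this scheme it would get a 401 instead of the file.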

And if you can manage all that, building a real security system, you'll be able to use robots.txt for its intended purpose.

On that note: I hope you have noindex meta tags on the pages if you won't put the URLs into your robots.txt file. If one of your customers links to the page on a blog, suddenly all the big search engines will know about it.
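
If you want to verify that a page really carries the tag, something like the following sketch would do it. It looks for a robots meta tag containing noindex using Python's standard html.parser; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html_text):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" in parser.directives
```

Like robots.txt, the tag is only a request that well-behaved crawlers honor; it does nothing to stop a person who has the URL.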
Old 08-18-2007, 01:02 AM Re: How does Alexa choose which URL's to crawl? Looks dangerous.
vangogh's Avatar
Post Impressionist

Posts: 10,689
Name: Steven Bradley
Location: Boulder, Colorado
Trades: 0
Keep in mind that because of the toolbar Alexa can find whatever its users can find. I have the toolbar installed and believe me I can and have found a lot of pages that were supposed to be hidden. Just because you don't link to something doesn't mean it can't be found.

Alexa may also crawl your directories instead of following your links. I'm not sure if they do, but I see no reason why they can't.
Old 08-18-2007, 02:32 AM Re: How does Alexa choose which URL's to crawl? Looks dangerous.
ForrestCroce's Avatar
Half Man, Half Amazing

Posts: 3,023
Name: Forrest Croce
Location: Seattle, WA
Trades: 0
Quote:
Originally Posted by vangogh View Post
Keep in mind that because of the toolbar Alexa can find whatever its users can find. ... Alexa may also crawl your directories instead of following your links. I'm not sure if they do, but I see no reason why they can't.
You can allow or disallow directory listings across a whole site, or you can add an index.html file that will be served up when someone asks for just the directory name.
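
On Apache, turning listings off site-wide is commonly done with "Options -Indexes". For the index-file approach, here is a small sketch that finds directories with no index file, which are the ones that would show a listing if listings are enabled. The list of index file names is an assumption; match it to your server's configuration.

```python
import os

# Assumed index file names; adjust to match the server's settings.
INDEX_NAMES = ("index.html", "index.htm", "index.php")


def directories_without_index(root):
    """Return paths (relative to root) of directories lacking an index file."""
    missing = []
    for current_dir, _subdirs, files in os.walk(root):
        if not any(name in files for name in INDEX_NAMES):
            missing.append(os.path.relpath(current_dir, root))
    return sorted(missing)
```

Running directories_without_index on a local copy of the web root gives a list of folders to fix. Note that hiding the listing only hides the file names; a file whose URL is already known can still be fetched.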