Tycoon Talk

The Google Forum


How G treats Last-Modified header
05-24-2005, 04:54 AM
mtishetsky
I was thinking about this subject today and had some interesting ideas. All of my websites are dynamic; I do not develop anything in plain HTML. So I have a question about how spiders behave with regard to the Last-Modified header.

Normally a web page is a static file. It has a modification time, so it is possible to save bandwidth by sending 304 Not Modified instead of the full page if the requested page has not been modified since the date and time the user agent supplied in the If-Modified-Since header. In this case the spider will know that the page was not modified and that there is no need to re-index it.
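To make the mechanism concrete, here is a minimal Python sketch of the decision a server makes for a conditional GET. The function name `conditional_get` and its arguments are hypothetical and not taken from any particular server; it assumes the page's modification time is already known as a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(if_modified_since, last_modified, body):
    """Return (status, headers, body) for a GET on one resource.

    if_modified_since: raw If-Modified-Since header value, or None.
    last_modified: timezone-aware datetime of the last change.
    body: the full page as bytes.
    """
    # HTTP dates have one-second resolution and are expressed in GMT.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    since = None
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable date: treat the request as unconditional
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)

    if since is not None and last_modified <= since:
        # Not modified: headers only, no message body.
        return 304, headers, b""
    return 200, headers, body
```

With a modification time of 24 May 2005 04:54:00 GMT, a request carrying that same date in If-Modified-Since would get 304 with an empty body, while a request with an earlier date, an unparsable date, or no header at all would get the full page with 200.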

A request to a PHP-generated page always returns 200 OK, which means that the server does not know anything about the page's modification time (unless the script developer handles the If-Modified-Since request header explicitly). In this case the G spider will receive the page contents anyway, but it may or may not consider them for re-indexing, and it will be right in both cases.

I assume that sending a Last-Modified header with a date and time equal to the current date and time minus, let's say, 10 seconds can help a page be spidered more frequently and will allow the robot to keep the freshest copy in its database. Is this assumption correct? In fact the page does change very often, because the visit and download counters are updated almost every second.
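For reference, a header value of "now minus 10 seconds" could be produced as in the Python sketch below. The helper name `recent_last_modified` and its parameters are hypothetical, and the sketch only shows how to format the date; whether it changes how often a spider returns is exactly the open question.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def recent_last_modified(now=None, age_seconds=10):
    """Format a Last-Modified value that lies `age_seconds` before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Work in UTC and drop microseconds, since HTTP dates use whole seconds.
    stamp = now.astimezone(timezone.utc).replace(microsecond=0)
    stamp -= timedelta(seconds=age_seconds)
    return format_datetime(stamp, usegmt=True)
```

Called with a fixed `now` of 24 May 2005 04:54:10 UTC, this would yield `Tue, 24 May 2005 04:54:00 GMT`, which a dynamic script could then send as its Last-Modified header.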
 
 
05-24-2005, 04:37 PM
chrishirst
It will make little difference to search engines. They seem to visit and grab the page regardless of whether they get a 304 response.

It is something I intend to test at some point.
__________________
Chris. ->> Links are advertising NOT optimising!! <<-
A foolish consistency is the hobgoblin of little minds
Thought for today:- Is SEO the only industry where all the cowboys are Indians?
 
05-25-2005, 01:40 AM
mtishetsky
RFC 2616 says:

10.3.5 304 Not Modified

If the client has performed a conditional GET request and access is
allowed, but the document has not been modified, the server SHOULD
respond with this status code. The 304 response MUST NOT contain a
message-body, and thus is always terminated by the first empty line
after the header fields.

How can a spider grab a page if there is no HTML in the response?
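To illustrate what the quoted section means on the wire, here is a small Python sketch that assembles a bare 304 response. The helper `build_not_modified` is hypothetical and deliberately simplified; a real server would also send Date and other headers.

```python
def build_not_modified(last_modified):
    """Build a minimal raw 304 response as bytes (simplified sketch)."""
    lines = [
        "HTTP/1.1 304 Not Modified",
        f"Last-Modified: {last_modified}",
        "",  # the empty line that ends the header fields
        "",  # nothing follows it: a 304 carries no message body
    ]
    return "\r\n".join(lines).encode("ascii")


raw = build_not_modified("Tue, 24 May 2005 04:54:00 GMT")
head, _, body = raw.partition(b"\r\n\r\n")
print(head.decode("ascii"))
print(len(body))  # number of body bytes after the header block
```

Since a 304 is only sent in answer to a conditional GET, a client that never sends If-Modified-Since should always receive the full page with 200 instead.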
 
05-25-2005, 03:19 AM
chrishirst
I said that because, after looking through some logs on a static site, I have not seen a single instance of a crawler getting a 304 response for a page where browsers do.

Crawlers don't have a cache to update, so they probably don't make a conditional, age-based request. Crawlers don't "read" pages in the same way browsers do; they only grab the page source and store it in the DB. Here's a bit of educated guesswork based on my understanding of inverted index storage (which is probably what is used) and of how to index multiple terabytes of data in the shortest possible time.

The crawler will generate a checksum from the retrieved data and store it with the data. The indexer will pull the checksum and compare it to the previously stored one. If they match, the new data is discarded. This way processing priority is given to new data.
If-Modified headers are under the control of the server operator and can be manipulated; a self-generated checksum cannot.
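The checksum idea guessed at above can be sketched in a few lines of Python. This only illustrates the guess and is not known crawler behavior; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def checksum(page_bytes):
    """Return a hex digest of the fetched page source."""
    return hashlib.sha256(page_bytes).hexdigest()


def needs_reindex(stored_checksum, fetched_bytes):
    """Compare a fresh checksum with the stored one.

    Returns (changed, new_checksum); changed is False when the page
    is byte-for-byte identical to the previously stored copy.
    """
    new_checksum = checksum(fetched_bytes)
    return new_checksum != stored_checksum, new_checksum


old = checksum(b"<html>Downloads: 100</html>")
print(needs_reindex(old, b"<html>Downloads: 100</html>")[0])
print(needs_reindex(old, b"<html>Downloads: 101</html>")[0])
```

Note that under this scheme a page whose visit or download counter changes on every request would look new on every fetch, whatever its Last-Modified header says.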

Don't take this as gospel, BTW. It is simply experience and observation, and as a programmer it makes sense to me.
 
05-25-2005, 03:37 AM
mtishetsky
I have four websites that are spidered frequently, so I will try to implement logic for analyzing request headers, and in a couple of days I will be able to tell you for a fact whether spiders send If-Modified-Since or not.
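One way such header-analysis logic might look is sketched below in Python. The function `record_request`, the `CRAWLER_PATTERN` user-agent substrings, and the shape of the `headers` mapping are all assumptions made for illustration; a real site would feed in whatever its server or script exposes.

```python
import re
from collections import Counter

# Assumed user-agent substrings for a few crawlers; adjust as needed.
CRAWLER_PATTERN = re.compile(r"googlebot|slurp|msnbot", re.IGNORECASE)


def record_request(stats, headers):
    """Tally whether a crawler request carried If-Modified-Since.

    stats: a Counter keyed by (crawler, "conditional" | "unconditional").
    headers: mapping of request header names to values.
    """
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    match = CRAWLER_PATTERN.search(normalized.get("user-agent", ""))
    if match is None:
        return  # not one of the crawlers being tracked
    kind = "conditional" if "if-modified-since" in normalized else "unconditional"
    stats[(match.group(0).lower(), kind)] += 1


stats = Counter()
record_request(stats, {"User-Agent": "Googlebot/2.1",
                       "If-Modified-Since": "Tue, 24 May 2005 04:54:00 GMT"})
record_request(stats, {"User-Agent": "Googlebot/2.1"})
print(dict(stats))
```

After a few days of traffic, the counts per crawler would show directly whether any of them send conditional requests.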