Quote:
A URL is provided.
The script accesses the URL through one of the browser engines like webkit, trident, gecko, etc.
A screenshot is taken, resized to a thumbnail, and added to the database in association with the URL.
|
I can give rough directions, as the implementation itself is way more complicated than just assembling ready-made programs.
1) The backend. As Virtuosimedia guessed, I use a Python binding of Firefox's rendering engine, Gecko.
This allows me to interpret a web page from a Python script.
2) The rendering. This is done through a virtual display, which is randomly selected at call time from a pool of 30 of them.
3) The screenshotting. This is handled in 2 steps. First, a screenshot is taken using the glibc library (it's a Linux system library that handles graphical operations).
Once the screenshot is taken, I use the PIL Python imaging module to enhance its quality and sharpness, crop it (if requested), and resize it to the final size.
This screenshot is then sent to an in-memory cache, shared by the Python process and the web site.
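To give an idea of step 2, here is a minimal sketch of picking one virtual display at random and handing it to the rendering process through the DISPLAY environment variable. The function names and the numbering of the displays (:1 through :30) are assumptions for illustration, not the actual setup.

```python
import os
import random

POOL_SIZE = 30  # 30 virtual displays, as described above


def pick_display(rng=random):
    """Return an X display name such as ':17', chosen at random from the pool."""
    # Assumption: the virtual displays are numbered :1 through :30.
    return ":%d" % rng.randint(1, POOL_SIZE)


def render_env(display):
    """Copy the current environment and point DISPLAY at the chosen display."""
    env = dict(os.environ)
    env["DISPLAY"] = display
    return env
```

The returned dictionary can be passed as the `env` argument when the rendering process is started, so each call draws on whichever display was selected.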
A typical request would be:
° When a request hits the server, it's parsed, and the paid options are activated according to the user's subscription level.
° Then, a reverse DNS call is done to verify that the domain is valid.
° A cache check is done on 2 levels: first on the live cache and, if the picture is not found there, on the DB cache.
The reason for the 2 levels of cache is that the live cache has a decay system in place. After a certain time with no access, an object is removed from the cache to let other pictures fill it.
If we find a picture in the DB cache that is not present in the live cache, it's re-inserted into the live cache.
To determine whether a cached version of a screenshot exists, I compute a checksum over several properties of that screenshot and use it as the key that identifies it.
This is needed to allow different paying users to have different caching delays.
° If a picture is found in the cache, it's served right away (and a cache hit is recorded for the user). If not, the Python script is called, and when its semaphore is cleared, the PHP script fetches the screenshot back from the cache.
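The cache part of that flow can be sketched roughly as follows. This is only an illustration under assumptions: the properties fed into the checksum, the SHA-1 choice, the `LiveCache` class and the plain dict standing in for the DB cache are all hypothetical, since the real list of properties and the real storage are not described here.

```python
import hashlib
import time


def cache_key(url, width, height, crop, max_age):
    """Checksum several screenshot properties into one cache key."""
    # Assumption: these five properties are illustrative only.
    raw = "|".join(str(p) for p in (url, width, height, crop, max_age))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


class LiveCache:
    """In-memory cache whose entries decay after idle_ttl seconds without access."""

    def __init__(self, idle_ttl, clock=time.monotonic):
        self.idle_ttl = idle_ttl
        self.clock = clock
        self.items = {}  # key -> (data, last_access)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        data, last_access = entry
        now = self.clock()
        if now - last_access > self.idle_ttl:
            del self.items[key]  # decayed: make room for other pictures
            return None
        self.items[key] = (data, now)  # an access refreshes the entry
        return data

    def put(self, key, data):
        self.items[key] = (data, self.clock())


def lookup(key, live, db):
    """Check the live cache first, then the DB cache, re-inserting on a DB hit."""
    data = live.get(key)
    if data is not None:
        return data
    data = db.get(key)  # db is a plain dict standing in for the database
    if data is not None:
        live.put(key, data)
    return data
```

Because the caching delay is part of the checksum in this sketch, two users asking for the same URL with different delays get different keys, which is one way to keep their cached copies apart.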
There is a lot more going on in the details. Some Flash or Java content can crash the rendering engine, and in that case the Python process might get stuck running without end.
That's why there is a scheduler that checks that no screenshot process takes more than 1 minute to run. If one does, it's killed, period.
I don't want to block system resources on 1 request that is bound to fail.
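The same idea can be shown with a simpler stand-in: instead of a separate scheduler watching all processes, this sketch puts a deadline on a single child process using Python's `subprocess` module. The function name and the way the command is passed are assumptions for illustration.

```python
import subprocess
import sys

MAX_RUNTIME = 60  # seconds; the 1-minute limit described above


def run_with_deadline(cmd, timeout=MAX_RUNTIME):
    """Run cmd; return its exit code, or None if it was killed for overrunning."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # no grace period: the request is bound to fail anyway
        proc.wait()  # reap the child so it does not linger as a zombie
        return None


if __name__ == "__main__":
    # A child that sleeps longer than the deadline stands in for a stuck renderer.
    stuck = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_deadline(stuck, timeout=1))
```

A real scheduler would track many processes at once, but the rule per process is the same: past the deadline, kill it and free the resources.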
I first started this as an "I've heard you cannot do this in PHP, but I'd like to anyway. Can you do it?" contest.
At first, it had a success rate of around 80%, which was already a lot compared to what my customer had at the time (around 50% with a Windows service that gave crappy, blocky pictures).
Today, I've raised the success rate to approx. 98%...
http://www.webalis.com/2008/04/what-...y-application/
Quote:
As the screenshots are fetched live, I was thinking that I could end up with a lot of "bad" screenshots as the requests pile up, but it seems that the number stays relatively low.
My last test ran for 2502 requests (1002 not cached, and 1500 that were cached).
I had a 0.07% error rate on the cached myspace page and 1.8% on the live Google web page, which makes a global error rate of 0.76% for 2500 requests.
Considering that most of the requests should be for cached screenshots, I'm pretty satisfied with those results. The errors were all screenshots that had a 0-byte size and were discarded. As those are automatically rejected by the cache system, it looks fine to me.
I was not expecting so much out of the box!
|
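As a quick check of the figures in the quote above, the global error rate is just the average of the two rates weighted by their request counts. The helper name below is made up for the example.

```python
def combined_error_rate(groups):
    """Weighted average error rate (in %) over (request_count, error_percent) pairs."""
    total = sum(count for count, _ in groups)
    errors = sum(count * pct / 100.0 for count, pct in groups)
    return 100.0 * errors / total


# Figures from the quoted test: 1500 cached requests, 1002 live requests.
rate = combined_error_rate([(1500, 0.07), (1002, 1.8)])
print(round(rate, 2))  # prints 0.76
```

That works out to roughly 19 failed screenshots over the whole run, almost all of them on the live requests.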
In the end, it's around 2 years of development, but not continuous.
I spend some time on it, then nothing for 2 months. Then again some weeks...
I'm pretty happy with its results now, but I still haven't really spread the word about it.
Developing this is fun. Promoting it, on the other hand, is not very appealing to me...