We all know the drill. Post the same thing on two pages of your site, and only one of them will show up in the SERPs. Post something from a different site on yours, and your copy won't (or shouldn't) turn up at all. I've never tested this myself, but I assume it's true. That means Google is able to filter duplicate content, but it also means they're doing something more than just hashing the HTML, or the same content in different templates would look like different content.
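To make that last point concrete, here is a minimal sketch of why a plain hash of the markup can't be the whole story. The article text and both templates are made up for illustration, and this says nothing about what Google actually does; it only shows that identical content in two different wrappers produces two different hashes.

```python
import hashlib

# Hypothetical example: the same article wrapped in two made-up templates.
ARTICLE = "Duplicate content is filtered out of the search results."

page_a = f"<html><body><div id='main'>{ARTICLE}</div></body></html>"
page_b = f"<html><body><table><tr><td>{ARTICLE}</td></tr></table></body></html>"


def exact_fingerprint(html: str) -> str:
    """Hash the raw markup; a single differing byte changes the digest."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Same content, different template, so an exact hash sees two distinct pages.
print(exact_fingerprint(page_a) == exact_fingerprint(page_b))  # False
```

So an exact hash is only good for catching byte-for-byte copies, which is precisely the binary comparison that misses the interesting cases.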
A good friend of mine wants to create a personal database system. It should capture all types of documents, from voice recordings to email. Especially because of email, he's asking me for ideas on how to filter duplicate content out of his own system, but in a fuzzy way, rather than with exact binary comparisons that would give false negatives on near-duplicates. I haven't been able to come up with one, but I do know that Google is pretty good at "organizing the world's information", so borrowing their ideas seems like a good start.
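Just to pin down what "fuzzy" would mean here, this is a toy sketch of the behaviour he's after: split each text into overlapping word triples (often called shingles) and score two texts by how many triples they share (Jaccard similarity). The function names and the choice of three-word shingles are my own assumptions, I don't know whether this resembles anything Google does, and comparing every pair of documents like this clearly won't scale to a whole archive.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word tuples in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Similarity in [0, 1]: shared shingles divided by all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


original = "the quick brown fox jumps over the lazy dog"
edited = "the quick brown fox jumped over the lazy dog"
unrelated = "voice recordings and email are stored in one database"

print(jaccard(original, edited) > jaccard(original, unrelated))  # True
```

A one-word edit still leaves a clearly non-zero score, where an exact comparison would just say "different". Picking the cut-off for "duplicate" and avoiding all-pairs comparison are the parts I don't have an answer for.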
I don't suppose anyone has any ideas, or thoughts on where I should send him to look?