First, how often is this caught? If I use hidden links (I don't mean stuff you can turn on and off for convenience, but hidden stuff that's never turned on), I can expect to be caught, and probably sooner rather than later. Is that basically true for sites that use madlib spam? This is what I'm talking about:
Quote:
|
De la mortgage in visual frames the show dashboard option viagra middle assertion of select the freight Harrison drive. On test smiley cat server fry dork season's greetings text clock.
|
Hopefully this is quickly spotted, and the web site is banned.
Are there any thoughts on how this might be caught? Human review? Markov chain style? Google certainly has the data to pull that off.
If they're able to detect madlib spam, I'm guessing it's not through Markov. I'm getting more than 70 thousand results for "colorless green ideas" - and the probability of seeing 'green' immediately after 'colorless' in ordinary text is nil. Or could it be that enough web sites have published this particular phrase that its probability is far enough from zero to not raise flags?
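To make the "colorless green" question concrete, here is a minimal sketch of a first-order Markov (bigram) estimate in Python. The toy corpus, the function names, and the counts are all made up for illustration; I have no idea what any search engine actually runs. It only shows that the estimate for 'green' after 'colorless' is zero until the phrase itself starts showing up in the training text, at which point it stops looking unusual.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count word pairs and how often each word starts a pair."""
    contexts = Counter()  # times a word appears as the first half of a pair
    bigrams = Counter()   # times each (previous, next) pair appears
    for sentence in sentences:
        words = sentence.lower().split()
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams

def bigram_prob(prev, word, contexts, bigrams):
    """Unsmoothed estimate of P(word | prev); 0.0 if prev was never seen."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

# Hypothetical toy corpus standing in for "ordinary" web text.
ordinary_text = [
    "green ideas are popular",
    "colorless gas is dangerous",
    "the gas is colorless",
]

contexts, bigrams = train_bigrams(ordinary_text)
p_before = bigram_prob("colorless", "green", contexts, bigrams)

# Now assume a few sites have published the famous phrase.
published = ["colorless green ideas sleep furiously"] * 3
contexts, bigrams = train_bigrams(ordinary_text + published)
p_after = bigram_prob("colorless", "green", contexts, bigrams)

print(p_before, p_after)
```

If this toy behaviour carries over at all, a phrase that has been quoted often enough would no longer trip a pure probability check, whatever it looked like the first time it was written.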
Are there other algorithms I don't know of that might trap this? Don't say bigrams and trigrams - they don't work for madlib spam. There are an infinite number of possibilities for randomly generated text, with both grammatical, sensible text and spamtacular nonsense being born every day. It's entirely possible (and overwhelmingly likely) to find good or bad text that has never occurred before, which makes probability-based models, well, not work.
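Here is a rough Python sketch of why I think never-seen-before text is a problem for these models. It measures what fraction of a text's word pairs are missing from a reference set. The reference sentences, the sample texts, and the function names are hypothetical, and the corpus is absurdly small compared to what a search engine has, so treat the numbers as an illustration of the argument, not as evidence.

```python
def bigrams_of(text):
    """Return the list of adjacent word pairs in a text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def unseen_bigram_rate(text, known_bigrams):
    """Fraction of word pairs in text that never occur in the reference set."""
    pairs = bigrams_of(text)
    if not pairs:
        return 0.0
    unseen = sum(1 for pair in pairs if pair not in known_bigrams)
    return unseen / len(pairs)

# Hypothetical reference text standing in for "everything seen so far".
reference = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
known = set()
for sentence in reference:
    known.update(bigrams_of(sentence))

novel_legit = "my dog chased the cat"      # sensible, but mostly new pairs
madlib_spam = "freight cat server fry dork"  # nonsense, all new pairs

legit_rate = unseen_bigram_rate(novel_legit, known)
spam_rate = unseen_bigram_rate(madlib_spam, known)
print(legit_rate, spam_rate)
```

In this toy setup the perfectly sensible sentence comes out at 0.75 unseen and the spam at 1.0, so a threshold on "never seen before" alone would flag both. A much bigger reference set would narrow the overlap, but I don't see how it removes it.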
I'm not planning to go out and proliferate spam, but I'd like to have a better understanding of how this all works.