I've been doing a lot of work on spam filtering lately. It pretty much all comes down to statistical and linguistic analysis. A lot of spam is written by machines; you see a lot of random words, or semi-coherent noun-verb sentences. There's no variation in tense (run, running, ran) and there are no derived forms of the kind stemming would group together (run, runner).
Bayesian inference works fairly well on poetry, but once you look at the linguistics, the software has a really hard time. Poetry tends to show a lot of artistic license. There can be a lot of repetition, repetition, repetition ... to bring home a point, or for alliteration. Rhythm and cadence are more important than grammar. The software just can't cope.
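To make the repetition problem concrete, here is a toy sketch of the kind of linguistic check that can get fooled. It is not what any particular filter actually does: the function names (`repetition_features`, `looks_machine_written`) and the thresholds are made up for illustration, and a real filter would combine many more signals.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def repetition_features(text):
    """Crude repetition signals (hypothetical feature set, for illustration)."""
    tokens = tokenize(text)
    if not tokens:
        return {"type_token_ratio": 0.0, "max_word_share": 0.0, "immediate_repeats": 0}
    counts = Counter(tokens)
    # Count places where a word is immediately followed by itself.
    immediate = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return {
        # Distinct words divided by total words; low values mean heavy reuse.
        "type_token_ratio": len(counts) / len(tokens),
        # Share of the text taken up by the single most common word.
        "max_word_share": counts.most_common(1)[0][1] / len(tokens),
        "immediate_repeats": immediate,
    }

def looks_machine_written(text, min_ttr=0.5, max_repeats=1):
    """Flag text with too little variety or too many back-to-back repeats.

    The thresholds are arbitrary assumptions, not tuned values.
    """
    features = repetition_features(text)
    return (features["type_token_ratio"] < min_ttr
            or features["immediate_repeats"] > max_repeats)

if __name__ == "__main__":
    poetic = "repetition, repetition, repetition to bring home a point"
    plain = "I went to the store and bought some milk"
    print(looks_machine_written(poetic))
    print(looks_machine_written(plain))
```

With these made-up thresholds, the deliberately repetitive line gets flagged and the plain sentence does not, which is exactly the false positive that artistic license tends to cause.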
If you didn't know any better, and only had linguistic clues to go on ... would you think this was spam?
Quote:
When, in disgrace with fortune and men's eyes,
I all alone beweep my outcast state,
And trouble deaf heaven with my bootless cries,
And look upon myself, and curse my fate,
Wishing me like to one more rich in hope,
Featured like him, like him with friends possessed,
Desiring this man's art and that man's scope,
With what I most enjoy contented least;
Yet in these thoughts myself almost despising,
Haply I think on thee—and then my state,
Like to the lark at break of day arising
From sullen earth, sings hymns at heaven's gate;
For thy sweet love rememb'red such wealth brings
That then I scorn to change my state with kings.
When in disgrace with fortune and men's eyes
Sonnet 29
William Shakespeare