|
I have built a script that categorizes content based on a training set, using Bayes' theorem and n-grams. I'm having trouble comparing the n-grams to categorize the content, because symbols in the database get replaced by their entities (UTF-8), e.g. £ and &pound;.
I'm wondering whether anyone has built a similar script using Bayes' theorem and has any ideas on the best way to get around this. Should I consider removing all non-alphanumeric characters and just forget about them?
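To make the question concrete, here is a minimal sketch of the alternative to stripping symbols: decode the entities before building the n-grams, so both forms of a symbol produce the same tokens. It is written in Python purely for illustration. The function names `normalize` and `char_ngrams` are hypothetical, and the use of character n-grams and NFC normalization are assumptions, not a description of my actual script.

```python
import html
import unicodedata


def normalize(text):
    """Decode HTML entities and normalize Unicode (hypothetical helper)."""
    # html.unescape converts named and numeric entities such as
    # "&pound;" or "&#163;" into the character they stand for.
    decoded = html.unescape(text)
    # Assumption: NFC is used so that composed and decomposed forms
    # of the same character also compare equal.
    return unicodedata.normalize("NFC", decoded)


def char_ngrams(text, n=3):
    """Return the character n-grams of the normalized text (hypothetical helper)."""
    cleaned = normalize(text)
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


if __name__ == "__main__":
    # Both spellings of the pound sign should yield the same n-grams.
    print(char_ngrams("&pound;10") == char_ngrams("£10"))
```

Applied to both the training set and the content being classified, this would keep symbols like £ available as features instead of discarding them.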
Any help would be appreciated.
|