Bayesian Filters bug 1

Documented: 2004

Note (4/2006): Anyone who has looked at the SPAM they're receiving these days can see that Bayesian filters are broken in many ways. And they've only made spam uglier than before.

First of all, we aren't here to learn what Bayesian Filters are or how they mark spam. You can research that on your own.

Bayesian Filters aren't broken in the sense that they fail to do what they are specified to do; they are broken in the sense that they don't accomplish their goal of properly sorting spam. Soon (if not already), Bayesian Filters are going to cause lots of trouble for innocent people by marking them as spammers.

Here's why.

Imagine the following scenario. Joe Bar is a regular net user, and let's say his email address is at the domain 'notarealdomain'.

Then one day a spammer comes along and includes "notarealdomain" in one of their spam messages. Obviously this can happen without Joe's consent, and even without the spammer intending any malice towards Joe: for example, the spammer may have randomly chosen Joe's domain as a fake domain to send email from.

It's important to realize that there's nothing Joe can do about this. We're not even talking about open relays or anything like that; the spam can just be a poor forgery that never even comes near Joe's domain.

So now what happens?

Consider, first of all, that many of the mail providers (Hotmail, Yahoo, ...) are installing Bayesian filters. Let's say Joe has a few friends on Hotmail he's emailed a couple of times. So the notarealdomain 'word' is actually starting to move up the non-spam Bayesian scale.
But then this spammer sends out millions of messages that have notarealdomain in them, because of a forgery or whatever.
Suddenly many Hotmail, Yahoo, and other users are marking that message as spam, and the score for notarealdomain goes way up on the spam Bayesian scale.
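
To make the shift concrete, here is a rough sketch of how a per-token score might move. It is not the formula any particular provider uses; it is a simplified frequency ratio in the spirit of Bayesian filtering, and the function name and all the message counts are made up for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam estimate (illustrative, not a real filter's formula).

    spam_hits / ham_hits: messages containing the token, by class.
    n_spam / n_ham: total messages seen in each class.
    """
    spam_rate = spam_hits / max(n_spam, 1)
    ham_rate = ham_hits / max(n_ham, 1)
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical numbers: Joe's friends have received 5 legitimate messages
# containing "notarealdomain", out of 1000 good and 1000 spam messages seen.
before = token_spam_probability(spam_hits=0, ham_hits=5, n_spam=1000, n_ham=1000)

# A forged spam run adds 400 spam messages that contain the token.
after = token_spam_probability(spam_hits=400, ham_hits=5, n_spam=1400, n_ham=1000)

print(f"before the spam run: {before:.3f}")
print(f"after the spam run:  {after:.3f}")
```

With these invented counts, the handful of legitimate messages is swamped by the forged run, and the token flips from looking safe to looking almost certainly spammy.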

Don't think it can happen? It's happened to me twice, so that now much of the email from two of my domains is often marked as spam. Admittedly, I have many domains to choose from, and I write anti-spam software, which opens me up to such malicious attacks, but even so, this can be a silent killer of completely innocent users' ability to communicate.

One possible 'solution' that would help somewhat would be to use only the body and subject, or to ignore unverifiable headers. The former would lose much of the power of Bayesian filtering (as Paul Graham also assumes). The latter would take a lot of work and require an unusual amount of traffic for what's merely a spam filter.
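
As a rough illustration of the first option, the sketch below tokenizes a message two ways: once over the whole raw text (headers included) and once over just the subject and body. It uses Python's standard `email` module; the tokenizer pattern, the function names, and the sample message (including the address at `notarealdomain.example`) are assumptions made up for this example, not how any real filter is implemented.

```python
import re
from email import message_from_string

# Assumed tokenizer: runs of letters, digits, and a few punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokens(text):
    """Return the set of lowercase tokens found in text."""
    return {t.lower() for t in TOKEN_RE.findall(text)}


def all_tokens(raw_message):
    """Tokenize everything, forgeable headers included."""
    return tokens(raw_message)


def subject_and_body_tokens(raw_message):
    """Tokenize only the Subject header and the (non-multipart) body."""
    msg = message_from_string(raw_message)
    body = "" if msg.is_multipart() else msg.get_payload()
    return tokens(msg.get("Subject", "")) | tokens(body)


# Hypothetical forged spam: the From header claims Joe's domain.
RAW = (
    "From: joe@notarealdomain.example\n"
    "Subject: Cheap pills\n"
    "\n"
    "Buy now\n"
)

print("notarealdomain" in all_tokens(RAW))               # header token is seen
print("notarealdomain" in subject_and_body_tokens(RAW))  # header token is ignored
```

This only helps when the forged domain appears in the headers; if the spammer puts it in the body or subject, the token is still counted.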

Maybe someday we'll realize that it's time to start moving towards verifiable email. Even though it wouldn't work until everyone is doing it, if we had the infrastructure in place and a critical mass moved towards it, then we might be able to get there down the road. (I must confess I like Yahoo's brilliant DomainKeys idea, which does 90% of the job with very little work.)