Spamassassin has failed me; what do I do?
November 10, 2006 1:04 AM

What's the state of the art in installable spam filters for Unix mail hosts? Spamassassin has failed me.

A year ago you guys convinced me not to use a challenge-response system for spam filtering. In the meantime my spamassassin setup has failed to the point that 80% of the mail that makes it through my filter is still spam. It's intolerable. What can I do?

I'm running postfix and dovecot on a Debian Linux box. I've got spamassassin set up with bayesian filtering, razor, pyzor, and am running sa-update regularly. 85% of my incoming mail is filtered as spam immediately, but 80% of the remainder is still spam. I'm a software engineer and capable of doing all sorts of hacks, but I'm just looking for something simple that I can just install and be done with it.
posted by Nelson to Computers & Internet (17 answers total) 5 users marked this as a favorite
On the front page of digg right now is:

Enhance Your Mail Server With ASSP (Anti-Spam SMTP Proxy)

Something to look at!
posted by mattdini at 1:12 AM on November 10, 2006

Greylisting works pretty well for me
posted by aubilenon at 1:34 AM on November 10, 2006

Try DNS blocklists through your mail server. I use : : :
posted by aye at 1:35 AM on November 10, 2006

You might want to try a free trial of MailChannels' Traffic Control product. It does traffic shaping for SMTP, which should reduce the amount of spam you see, as well as reducing the load on your content filter so you can tune it to be more aggressive.

(Disclosure, I sit on the board of MailChannels and came up with some of the technology involved. Obviously I think it's awesome but you should really find out for yourself)
posted by mock at 1:40 AM on November 10, 2006

And you're always using sa-learn to learn from spam that passes through your setup? You can also learn from non-spam (option --ham).
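A rough sketch of what that training step could look like when scripted, in Python. The only things taken from this thread are the sa-learn command and its --spam and --ham options; the function names, the injectable runner, and the Maildir paths in the usage note are made up for illustration.

```python
import subprocess


def sa_learn_command(kind, path):
    """Build the sa-learn command line for one folder of mail."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    # --spam trains on missed spam, --ham trains on legitimate mail.
    return ["sa-learn", "--" + kind, path]


def train(kind, path, runner=subprocess.run):
    """Run sa-learn on a folder.

    runner is injectable (an assumption of this sketch) so the function
    can be exercised without SpamAssassin installed.
    """
    return runner(sa_learn_command(kind, path), check=True)
```

Usage might be train("spam", "/home/you/Maildir/.Junk/cur") followed by train("ham", "/home/you/Maildir/cur"), run from cron after you have sorted the day's mail. Both paths are placeholders.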
posted by donut at 1:45 AM on November 10, 2006

"... I'm just looking for something simple that I can just install and be done with it."

Sorry. The spammers beat SMTP long ago. Running a mail server is work, not fun, and has been for some time.

The more tests you add, the more work your mail server is doing, and the more ways things can break. But if you're pissed enough about spam, here are some things you could be doing, beyond what you've said you're doing. And it's important to review your filter implementations, informed by your spam statistics, so that you are throwing out spam in the most efficient way.

1) Do reverse DNS lookups, and add failures to your bayesian filtering. Failure on a reverse lookup doesn't drop a connection by itself, but it "pre-weights" the bayesian score for spam.

2) You can implement sender verification schemes such as SPF, and DomainKeys.

3) Teergrube. High volume spammers and botnet operators will eventually mark you as a teergrube, and skip you. Frankly, I find teergrube is easier to set up and control on Exim than Postfix, but then again, I don't spend a lot of time with Postfix.

4) I see some real benefit in ASSP for medium to large mail systems, but for operators of small single-server systems, the effort and expense may not pay off compared with maintaining a greylist system in SpamAssassin.
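To make item 1 in the list above a bit more concrete, here is a minimal Python sketch of "a failed reverse lookup adds weight instead of dropping the connection". It is not how SpamAssassin or Postfix implement it; the function names and the penalty value of 1.0 are assumptions chosen for illustration.

```python
import socket


def has_reverse_dns(ip, reverse=socket.gethostbyaddr):
    """Return True if the connecting IP has a PTR record.

    reverse is injectable (an assumption of this sketch) so the check
    can be tested without touching real DNS.
    """
    try:
        reverse(ip)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
    return True


def rdns_score(ip, penalty=1.0, reverse=socket.gethostbyaddr):
    """Extra spam score for a client with no reverse DNS.

    The penalty is hypothetical; a failure only adds to the score and
    never rejects the message by itself.
    """
    return 0.0 if has_reverse_dns(ip, reverse) else penalty
```

The returned value would be added to whatever score the content filter computes, so a missing PTR record tips borderline mail toward spam without blocking anything on its own.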
posted by paulsc at 2:16 AM on November 10, 2006

Response by poster: Thanks for the advice so far. I am running a single-user mail system so the effort is a significant nuisance. The mail host is in no way overloaded so I don't mind doing more processing. Of the suggestions here the one that seems the most hopeful + new for me so far is greylisting. I found this article on Debian, greylisting, and postfix that looks like a place to start.

Please keep the suggestions coming!
posted by Nelson at 2:57 AM on November 10, 2006

You can also learn from non-spam (option --ham).

No, you *MUST* also learn non-spam email for the Bayesian filter to work.

Quick test. "sa-learn --dump magic." In particular, these lines.

0.000 0 106378 0 non-token data: nspam
0.000 0 10531 0 non-token data: nham
0.000 0 498061 0 non-token data: ntokens

That's "number of spam messages that provided tokens" and "number of ham messages that provided tokens." Both numbers *must* be above 200, or the Bayesian filter shuts down.
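If you want to automate that check, something like the following Python sketch could parse the dump and tell you whether both counts clear the bar. It assumes the line layout shown above (count in the third column) and treats 200 as the minimum; the function names are invented for this example.

```python
import re

MIN_MESSAGES = 200  # threshold quoted above, treated here as a minimum

# Matches lines like: "0.000 0 106378 0 non-token data: nspam"
MAGIC_LINE = re.compile(
    r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s+(\w+)"
)


def parse_magic(dump_text):
    """Map each 'non-token data' name (nspam, nham, ...) to its count."""
    counts = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def bayes_is_trained(counts, minimum=MIN_MESSAGES):
    """True when both the spam and ham message counts reach the minimum."""
    return (counts.get("nspam", 0) >= minimum
            and counts.get("nham", 0) >= minimum)
```

Feed it the text printed by "sa-learn --dump magic". If bayes_is_trained returns False, the usual culprit is a low nham, meaning the filter has been shown plenty of spam but too little real mail.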

Another check is in the headers -- there should be a BAYES_XX (where XX is a number) in the X-Spam-Status: header. If not, Bayesian didn't run.
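A small Python sketch of that header check, using only the standard library. The function name is made up, and the sample header in the usage note is illustrative rather than copied from a real message.

```python
import re
from email import message_from_string


def bayes_test_hit(raw_message):
    """Return the BAYES_XX test named in X-Spam-Status, or None.

    None means either the header is missing or the Bayesian test did
    not run on this message.
    """
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # The header may be folded across lines; search handles that fine.
    match = re.search(r"\bBAYES_\d+\b", status)
    return match.group(0) if match else None
```

For a message whose header reads "X-Spam-Status: No, score=1.2 required=5.0 tests=BAYES_50,HTML_MESSAGE", the function returns "BAYES_50". Run it over a folder of recent mail; if it keeps returning None, Bayes is not being applied.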

Bayesian makes all the difference on my setup between useful and useless.

Another issue -- if you get lots and lots of spam, and not that much non spam, you'll find that the default token expire count is too low -- you end up expiring out your ham tokens almost as fast as you save them. The answer here is in ~/.spamassassin/user_prefs, change or add this:

bayes_expiry_max_db_size 500000

The number is in tokens, so I'm telling SA to expire old tokens only when there are more than 500,000 of them. Note how I'm showing 498K tokens, but only ~117K messages with tokens -- 200 ham and spam tokens isn't enough, you need 200 ham *messages*, each providing at least one token, and 200 spam, ditto, for the filter to kick in.

Finally, I use rbldnsd to block several countries. The rule I have is if I get 100,000 spam, and zero real email, I no longer accept email from that country.
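The country rule is simple enough to express in a few lines of Python. This is only a sketch of the decision itself, not of rbldnsd: the stats mapping (country code to a pair of spam and real-mail counts) is a hypothetical data structure you would have to build from your own logs.

```python
def countries_to_block(stats, spam_threshold=100_000):
    """Return country codes with lots of spam and no real mail at all.

    stats maps a country code to (spam_count, ham_count). The mapping
    and its source are assumptions of this sketch.
    """
    return sorted(
        code
        for code, (spam, ham) in stats.items()
        if spam >= spam_threshold and ham == 0
    )
```

A country with even one legitimate message stays off the list, which is what keeps the rule from cutting off real correspondents.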
posted by eriko at 5:29 AM on November 10, 2006

What eriko said. Every time I've, as a contract sysadmin, taken a look at an installation of a 'broken' spamassassin, it's broken because they're not training the bayesian filter.
posted by SpecialK at 5:39 AM on November 10, 2006

What about using Google Gmail for domains? Just run all your mail through them, and give all your users Gmail accounts. Let them handle the problem.
posted by fcain at 6:38 AM on November 10, 2006

I've been extremely happy with CRM114, which is essentially a Bayesian analyzer on steroids. After about a week of training, it's scarily accurate; I'd estimate only about 1 spam in 1000 slips through, and I can't remember the last false-positive I had.

Silly name, yes, but give it a shot.
posted by jacobian at 7:12 AM on November 10, 2006

Of the suggestions here the one that seems the most hopeful + new for me so far is greylisting.

One technique I read about recently was to use greylisting, but to exempt any messages from the greylisting process that could be verified via SPF. Major mail providers like gmail and Yahoo use SPF and thus would avoid the delay in delivery that sometimes occurs with greylisting. You might also want to look at qpsmtpd, an SMTP daemon that can add an additional level of message filtering configurable via plugins written in Perl.
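Here is a minimal Python sketch of that combination, to show the logic rather than any real product. It is not how Postgrey or qpsmtpd work internally: the class name, the 300 second delay, and the spf_pass flag (which something else would have to compute) are all assumptions for illustration.

```python
import time


class Greylist:
    """Defer the first delivery attempt from each unknown triplet."""

    def __init__(self, delay=300, clock=time.time):
        self.delay = delay      # seconds a new triplet must wait (assumed)
        self.clock = clock      # injectable for testing
        self.first_seen = {}    # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient, spf_pass=False):
        """Return 'accept' or 'defer' for one delivery attempt."""
        if spf_pass:
            # SPF-verified senders skip greylisting entirely.
            return "accept"
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay:
            return "accept"
        # A real mail server would answer with a temporary 4xx error here.
        return "defer"
```

A legitimate server retries after the temporary failure and gets through once the delay has passed, while most bulk senders never come back. The SPF shortcut means mail from providers that publish SPF records is never delayed.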
posted by finn at 7:49 AM on November 10, 2006

paulsc says: You can implement sender verification schemes such as SPF, and DomainKeys.

It's my understanding that these technologies will not necessarily reduce inbound SPAM but prevent other hosts from munging the mail header to spoof your domain (provided the receiver actively checks for an SPF record). So, if more and more hosts implemented (and checked) for the SPF record, it would be less likely that legitimate mail from your mail server would be labeled as SPAM and anyone else that is spoofing your domain in their sent mail would be dropped on the receiving host.
posted by purephase at 7:51 AM on November 10, 2006

Or, you could ignore the haters and use my simple procmail C/R script which works quite well.
posted by nicwolff at 9:54 AM on November 10, 2006

"It's my understanding that these technologies will not necessarily reduce inbound SPAM..."
posted by purephase at 10:51 AM EST on November 10

purephase, I think your understanding above was originally right, in terms of the ambitions for SPF, but many people are now using SPF lookups as a test to combat botnets. Botnets may be sending messages with artfully spoofed headers, but if a botnet IP doesn't match the SPF records for the domain they say they are from, or there is no SPF record for that domain, the connection may be dropped into teergrube, or the message flagged for additional filter steps. Gmail, AOL, and Earthlink are pretty good now about shutting down spammers from their internal networks, mostly by internal throttles and list filtering, and while you may still get a lot of nuisance mail from addresses in those domains in aggregate, it's small scale compared to the botnets spoofing them. And that is what is making the SPF idea helpful, as you cover in your second sentence. ...:-)
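For anyone curious what "the IP doesn't match the SPF record" means mechanically, here is a deliberately limited Python sketch. It only understands the ip4, ip6, and all mechanisms and does no DNS lookups, so it is nowhere near a full SPF evaluator; the function name and the sample record in the usage note are made up.

```python
import ipaddress

# Standard SPF qualifiers and the result each one produces.
RESULTS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def spf_ip_check(record, client_ip):
    """Evaluate ip4/ip6/all terms of an SPF record against one IP.

    Returns 'none' if the text is not an SPF record. Mechanisms that
    need DNS (a, mx, include, ...) are skipped in this sketch.
    """
    ip = ipaddress.ip_address(client_ip)
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    for term in terms[1:]:
        qualifier = "+"
        if term[0] in RESULTS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6"):
            network = ipaddress.ip_network(value, strict=False)
            if ip.version == network.version and ip in network:
                return RESULTS[qualifier]
        elif name == "all":
            return RESULTS[qualifier]
    return "neutral"
```

With the record "v=spf1 ip4:192.0.2.0/24 -all", a connection from 192.0.2.10 evaluates to "pass" and one from 198.51.100.7 to "fail". A receiving server could treat "fail" or "none" as a reason to tarpit the connection or to send the message through extra filtering, as described above.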

And make no mistake, it's botnets that have caused the explosion in spam noted in the last 4 to 6 months.
posted by paulsc at 10:10 AM on November 10, 2006

Are you keeping your spamassassin up to date by running sa-update? I run it weekly. You must be running version 3 or better.

Are you using any of the third-party rules that others have written? Those helped me immensely. Get them from here. I'm running:

-rw-r--r-- 1 root root 3839 Jun 1 2005
-rw-r--r-- 1 root root 24298 Oct 5 2005
-rw-r--r-- 1 root root 187643 Dec 26 2005
-rw-r--r-- 1 root root 384645 Oct 30 2005
-rw-r--r-- 1 root root 28066 Jun 3 22:00
-rw-r--r-- 1 root root 39625 Jun 3 22:00
-rw-r--r-- 1 root root 66 Feb 14 2005
-rw-r--r-- 1 root root 158513 Oct 1 2005
-rw-r--r-- 1 root root 18190 Dec 12 2005
-rw-r--r-- 1 root root 97820 May 27 20:00
-rw-r--r-- 1 root root 59515 Oct 18 13:00
-rw-r--r-- 1 root root 15481 May 15 20:00
-rw-r--r-- 1 root root 57580 Feb 14 2005
-rw-r--r-- 1 root root 14284 Feb 14 2005
-rw-r--r-- 1 root root 22546 Feb 14 2005
-rw-r--r-- 1 root root 23422 Feb 14 2005
-rw-r--r-- 1 root root 4883 Feb 14 2005
-rw-r--r-- 1 root root 56238 Jun 1 2005
-rw-r--r-- 1 root root 3880 Feb 14 2005

sare_stocks in particular is pretty good at killing those damn stock market spams, which still get through greylisting for me. All those rules get updated weekly with rules_du_jour.

You can't just expect to install something and forget it. Whether you use bayes, SA, or something else, you've got to keep pace with the spammers or eventually their state of the art will supersede yours. Get sa-update and rules_du_jour running weekly in cron, set up greylisting, and you should see a marked improvement. I'm not even using Bayes with my SA setup and I do pretty well. I'm using Debian and Postfix, same as you.
posted by autojack at 2:24 PM on November 10, 2006 [1 favorite]

Oh, and I use Postgrey for my greylisting.
posted by autojack at 2:45 PM on November 10, 2006
