Need help squashing foreign spam
September 21, 2006 8:12 AM

I get tons of spam in languages other than English -- entire messages in other languages that I can't even read to determine what the spam is about. I'd like to make a simple set of email filters to whack out any email in Russian, Chinese, Japanese, and Korean. Can someone give me a single character from each language that represents the most commonly used letter? (like the letter "a" in English)

People I know never seem to send me stuff with other languages in it, so even if I go with the sledgehammer of looking for single popular characters, I don't think I'll get too many false positives.

Or should I try using language message headers instead? Do spammers stick to language-specific character sets?
posted by mathowie to Writing & Language (15 answers total)
Throw a sample paragraph in here or into a translation site such as Babel Fish.

I'm getting tons of Japanese spam myself.
posted by krautland at 8:19 AM on September 21, 2006

I'd block using the content-type header. If you have control over your mail server, you can also just use country-specific blacklists and block entire countries. This is the approach I use; the bounce message includes instructions for getting whitelisted in case I unintentionally bounce any human correspondents. In the last year or so, exactly zero senders in Asia have added themselves to my whitelist.
posted by kindall at 8:27 AM on September 21, 2006

I thought the letter 'e' was the most commonly used letter in English?
posted by sonofslim at 8:28 AM on September 21, 2006

It's perhaps a bit of a big hammer to apply, but I've found that filtering messages having any non-ASCII character (or any Base64 escaped character) in the Subject: header pretty much clobbers all of the non-ASCII spam.

I used to do this with a regular expression in Procmail, but SpamAssassin has long supported language filtering, so I no longer have a convenient filter expression to paste for you. Sorry.
posted by majick at 8:34 AM on September 21, 2006

(Or any MIME-escaped character, for that matter: /=\d\d/ appearing in the Subject: header matches a certain chunk of my non-ASCII spam.)
posted by majick at 8:37 AM on September 21, 2006
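
For anyone who wants to try the Subject: header approach outside Procmail, here is a minimal Python sketch. It is not majick's original filter. The function name `subject_looks_foreign` and the encoded-word pattern are assumptions made for illustration; the pattern follows the RFC 2047 `=?charset?B?...?=` / `=?charset?Q?...?=` form.

```python
import re

# RFC 2047 encoded-word, e.g. =?koi8-r?B?...?= or =?big5?Q?...?=
# (illustrative pattern, not taken from the thread)
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def subject_looks_foreign(raw_subject: str) -> bool:
    """Return True if the raw Subject: value contains any non-ASCII
    character or any MIME encoded-word (Base64 or quoted-printable)."""
    has_non_ascii = any(ord(ch) > 127 for ch in raw_subject)
    has_encoded_word = ENCODED_WORD.search(raw_subject) is not None
    return has_non_ascii or has_encoded_word


if __name__ == "__main__":
    print(subject_looks_foreign("=?koi8-r?B?8NLJ18XU?="))  # encoded-word
    print(subject_looks_foreign("Lunch on Friday?"))        # plain ASCII
```

This is the same big hammer described above: it also flags legitimate mail whose subject contains an accented letter, since mail clients encode those the same way.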

Here is an archived version of a now-gone page that addresses this issue for Procmail users.

Here's a simpler filter recipe based on it, which I found somewhere long ago:

:0 BD
* -1^1 .
* 2^1 =[0-9A-F][0-9A-F]
* 33^1 [¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿]
* 33^1 [àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]
* 33^1 =[A-F][0-9A-F]
| formail -i "Subject:[Asian character set]"

(not sure how well that'll appear in y'all's browsers)
posted by Eater at 8:48 AM on September 21, 2006
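
A rough Python sketch of how the recipe above appears to score a message body, assuming procmail's weighted scoring works as documented: each `w^1` condition adds `w` for every match, and the recipe fires when the total is positive. The byte ranges `\xa1-\xbf` and `\xe0-\xff` are an assumed reading of the two bracketed character classes, and the names `RULES`, `body_score` and `recipe_fires` are made up for illustration.

```python
import re

# (weight, pattern) pairs mirroring the recipe's five conditions.
# The two high-bit byte ranges are an assumption about what the
# bracketed character classes were meant to contain.
RULES = [
    (-1, re.compile(rb".")),                 # every body character costs 1
    (2, re.compile(rb"=[0-9A-F][0-9A-F]")),  # any quoted-printable escape
    (33, re.compile(rb"[\xa1-\xbf]")),       # raw high-bit bytes A1-BF
    (33, re.compile(rb"[\xe0-\xff]")),       # raw high-bit bytes E0-FF
    (33, re.compile(rb"=[A-F][0-9A-F]")),    # escapes for bytes A0-FF
]


def body_score(body: bytes) -> int:
    """Sum weight * number of matches over all rules."""
    return sum(weight * len(pattern.findall(body)) for weight, pattern in RULES)


def recipe_fires(body: bytes) -> bool:
    """The recipe's action runs only when the total score is positive."""
    return body_score(body) > 0


if __name__ == "__main__":
    print(body_score(b"Hello, this is plain English text."))
    print(body_score(b"=C4=E3=BA=C3"))
```

Under this reading, a body scores positive once roughly more than one character in 33 is a high-bit byte or a high-bit quoted-printable escape, so a message with only the occasional accented letter would normally stay below the threshold.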

SpamAssassin has an option like this.
posted by Loto at 8:51 AM on September 21, 2006

Eater, it looks here as if you are blocking a bunch of German and Danish characters as well...
posted by krautland at 8:52 AM on September 21, 2006

I've always wondered why ISPs can't handle this. My mailserver is hosted by Dreamhost. I want an option in their control panel that will allow me to block all e-mail with non-Western character sets, and it frustrates me that none exists. It seems that this simple act (along with blocking of image-only e-mails) would cut 50% of the spam I receive.
posted by jdroth at 8:56 AM on September 21, 2006

Best answer: Previously asked. :)

I still think my answer is good: look for gb2312, koi8-r, big5, and so on in the Content-Type header of your email. This works extremely well, is easy, and has zero false positives.
posted by jellicle at 8:59 AM on September 21, 2006
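
The Content-Type check can be sketched in Python with the standard `email` module. This is an illustration only: `gb2312`, `koi8-r` and `big5` come from the answer above, while the other charset names in the set, and the names `BLOCKED_CHARSETS` and `has_blocked_charset`, are assumptions added here.

```python
from email import message_from_string

# gb2312, koi8-r and big5 are from the answer above; the remaining
# entries are assumed additions for Korean and Japanese mail.
BLOCKED_CHARSETS = {
    "gb2312", "koi8-r", "big5",
    "euc-kr", "iso-2022-jp", "shift_jis",
}


def has_blocked_charset(raw_message: str) -> bool:
    """Return True if any MIME part declares a charset on the block list."""
    msg = message_from_string(raw_message)
    # get_charsets() yields one entry per part; entries are None when a
    # part has no charset parameter.
    charsets = {c.lower() for c in msg.get_charsets() if c}
    return bool(charsets & BLOCKED_CHARSETS)


if __name__ == "__main__":
    sample = (
        "From: someone@example.com\n"
        "Subject: hello\n"
        'Content-Type: text/plain; charset="KOI8-R"\n'
        "\n"
        "body\n"
    )
    print(has_blocked_charset(sample))
```

A message sent as UTF-8 would pass this check regardless of its language, so it only catches spam that declares one of the listed character sets.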

Yeah, the pasting of the recipe above misrendered the high-bit characters.
posted by Eater at 9:03 AM on September 21, 2006

Just use Outlook 2003 or higher :) It lets you allow/block messages by language.
posted by blindcarboncopy at 9:24 AM on September 21, 2006

Response by poster: jellicle, this is for Gmail, so I think I can still do that by searching the header for those phrases.
posted by mathowie at 9:59 AM on September 21, 2006

This is the nearest phonetic equivalent to 'a' in Korean : ㅏ
posted by stavrosthewonderchicken at 4:54 PM on September 21, 2006

I use Gmail, have the same problem, and have been able to block most non-English spam by creating single-character filters. I did this by scanning the text of the spam and trying to find characters which appeared more than once. Sometimes this worked and sometimes it didn't, but when I couldn't find a duplicate, I chose a random character & created a filter for that. This isn't a perfect system and doesn't always work on the first try, but it definitely helped cut down on spam, right away.
posted by jessicapierce at 8:54 AM on September 22, 2006
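
Where a filter can run code rather than match single characters, checking whole Unicode blocks avoids guessing which character is most common. A minimal Python sketch, assuming the message text has already been decoded to a string; the names `SCRIPT_RANGES` and `scripts_in` are made up for illustration, and the ranges are the standard Unicode blocks for each script.

```python
# Unicode block ranges (inclusive) for the scripts named in the question.
SCRIPT_RANGES = {
    "Cyrillic": [(0x0400, 0x04FF)],
    "CJK ideographs": [(0x4E00, 0x9FFF)],
    "Japanese kana": [(0x3040, 0x309F), (0x30A0, 0x30FF)],
    "Korean Hangul": [(0x1100, 0x11FF), (0x3130, 0x318F), (0xAC00, 0xD7AF)],
}


def scripts_in(text: str) -> set:
    """Return the names of all listed scripts that appear in the text."""
    found = set()
    for ch in text:
        code_point = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(low <= code_point <= high for low, high in ranges):
                found.add(name)
    return found


if __name__ == "__main__":
    print(scripts_in("ㅏ"))
    print(scripts_in("hello"))
```

Chinese and Japanese share the CJK ideograph block, so this can say that a message contains such characters but not which of the two languages it is written in.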
