Spam-b-gone
June 8, 2006 3:29 PM

Seeking comment spam uber advice.

I recently started a new website. Within days of talking about the under-construction site on my blog I started getting comment spam on the new site. Okay. Sure. I know it's all done by robots so hardly a surprise. And I wanted to start playing with captchas anyway, so I put one of those (freecap) in place. All that did was slow the spam down. Slowed it down a lot to be sure, but still not enough. Either they're brute-forcing the word list, using a middle-man attack, or OCRing the image.

For the first option I've already created my own word list. For the second I'm not sure what to do. And for the third I'm reluctant to obscure the image more. Good luck right? Ideas?

Now, after spending several hours of reading webpages about securing and breaking captchas, I've decided a) I already know more about this than I wanted to, b) I haven't even started to learn about other spam prevention methods, and c) I'm not much closer to solving my comment spam problem. I'm a geek, so learning about hacks is fun and all, but I'd really rather be doing content creation. So I'm hoping someone else can point me to the good stuff.

So I'm asking for some personal experiences with beating comment spammers. What works and what doesn't. If not captchas, then what?

To make things harder - I want to use whatever technique I settle on at all of my sites, and one of those sites wouldn't really work with user registration.

Also - I'd prefer to code a solution myself, so I'm really more interested in strategies that work rather than off-the-shelf products.
posted by y6y6y6 to Computers & Internet (27 answers total)
 
The clearest solution is to require registration for comments. However, that's not friendly to casual commenters.

Have you examined your logs carefully? To what extent is the IP address of the spam consistent from spam to spam, or from salvo to salvo? Repeat offenders can be blocked by IP, and damage from single-IP (or single-IP-range) salvos can be cleaned up pretty quickly. Both of those are reactive and not proactive, but depending on the nature of your spammers and your willingness to do cleanup, it may be the simplest route.
posted by cortex at 3:37 PM on June 8, 2006


Response by poster: IP blocking would work for a very small minority of the spams.

I'm also doing some pattern match blocking even if they do get through the captcha. That gets a good percentage. My next escalation was going to be changing the form HTML on a random basis. But I'm not sure that's worth the effort. The spammers have already proven they're willing to customize their bots for my homebrewed code. Adding one more hurdle (random form elements) seems like more trouble for me than for them.
posted by y6y6y6 at 3:46 PM on June 8, 2006


y6y6y6 - the spam scripts seem to be intelligent enough no matter what your form's HTML says. They're not targeting you personally, the scripts are just that good at spamming. I had the idea that someone was customizing his bots for my homebrewed code for a long time as well, but eventually figured out that actually, they weren't.

I use nothing but string-blocking for my homebrewed anti-spam solution. I have a file that contains a list of strings, one per line, and my comment engine refuses to accept anything containing any of those strings. I get about 1200 failed spam attempts a day, and anywhere from 400-800 legitimate comments. I have a captcha, but I only use it to protect email addresses, not for comment spam blocking. I deliberately avoid any measures that are expensive in terms of server performance, and this lets my cheap server handle truly enormous spam attacks (400 in 15 minutes, that kind of thing) with aplomb. For instance, the blacklist being in a simple file, not in a database. Also, I don't use regular expressions for the blacklisted terms, just string comparison.
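
A minimal Python sketch of this kind of string blocking, for illustration only. The function names and the one-term-per-line file layout are hypothetical, and matching case-insensitively is an assumption the description above doesn't confirm either way.

```python
from pathlib import Path


def load_blacklist(path):
    """Read one blocked string per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Lower-casing is an assumption; drop it for exact matching.
    return [line.strip().lower() for line in lines if line.strip()]


def is_blocked(comment, blacklist):
    """Plain substring comparison, no regular expressions."""
    text = comment.lower()
    return any(term in text for term in blacklist)
```

The comment engine would call `is_blocked(comment_text, blacklist)` before saving anything, and refuse the comment when it returns True.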
posted by evariste at 3:57 PM on June 8, 2006


You don't say what CMS you're using, sounds like it's homebrewed. I've been using a spam-avoidance plugin for Movable Type that uses Javascript to create a hidden hash on page load; if that's not submitted as part of the form, the comment gets rejected. So far, 100% success.

It's probably a lot more work, but one thing that seems to be very (but not 100%) effective is to look at statistics. If you suddenly get comments on an entry that hasn't received comments in a long time, it's more likely spam, and should go into the penalty box. Comment-spammers seem to avoid hitting new entries (where the spam would be more readily detected).
posted by adamrice at 4:03 PM on June 8, 2006


Best answer: Some ideas:

Set a cookie when sending the comment form, then check for the cookie when accepting the form submission in a script.

Add a hash of some sort to the form so that a new form must be requested each time a comment is posted. For example, concatenate the poster's IP address, the thread ID, and the date/time with a site-specific salt, take an MD5 hash of that, and put it in a hidden form field. (You'll also need to put the date in a form field as well -- don't worry, the spammer can't change it without breaking the hash, so you can trust the value.) When it's submitted you re-calculate the hash to make sure it's still valid, then check the date to make sure the form is not more than, say, 15 minutes old.
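
A rough Python sketch of the timestamped hash idea. It swaps in HMAC-SHA256 for the plain MD5-with-salt described above; that is a substitution, not what the post specifies. `SITE_SALT`, `make_token`, and `is_valid` are hypothetical names, and `issued_at` is assumed to be an integer Unix timestamp sent along in its own hidden field.

```python
import hashlib
import hmac
import time

SITE_SALT = b"change-me"   # hypothetical site-specific secret
MAX_AGE_SECONDS = 15 * 60  # the "say, 15 minutes" window


def make_token(ip, thread_id, issued_at):
    """Hash the poster's IP, the thread ID, and the form's issue time."""
    message = f"{ip}|{thread_id}|{issued_at}".encode("utf-8")
    return hmac.new(SITE_SALT, message, hashlib.sha256).hexdigest()


def is_valid(token, ip, thread_id, issued_at, now=None):
    """Recompute the hash, then check that the form is not too old."""
    now = time.time() if now is None else now
    expected = make_token(ip, thread_id, issued_at)
    if not hmac.compare_digest(expected.encode(), token.encode()):
        return False
    return 0 <= now - issued_at <= MAX_AGE_SECONDS
```

Changing the timestamp, the IP, or the thread ID in the submitted form produces a different hash, so the check fails.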

For extra credit use the hash as the NAME of your comment field, and include a dozen or so similar textareas with randomly-generated names the same length as the hash in the HTML source. (Hide these dummy fields using CSS.) If ANY of the dummy fields have text in 'em, throw the comment away. Only the field named for the correct hash should appear. :-D
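
The dummy-field check might look something like this in Python. It assumes the submitted form arrives as a dict of field name to text, and that the server either remembers the decoy names it generated (in a session, say) or can regenerate them. Both function names are hypothetical.

```python
import secrets


def make_decoy_names(real_name, count=12):
    """Random hex names the same length as the real field name."""
    names = set()
    while len(names) < count:
        # token_hex(n) yields 2*n hex characters; trim to the exact length.
        candidate = secrets.token_hex(len(real_name) // 2 + 1)[:len(real_name)]
        if candidate != real_name:
            names.add(candidate)
    return sorted(names)


def extract_comment(form, real_name, decoy_names):
    """Return the comment text, or None if any decoy field was filled in."""
    if any(form.get(name, "").strip() for name in decoy_names):
        return None
    return form.get(real_name)
```

A bot that fills in every textarea it finds trips the decoy check; a person only ever sees the one real field.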

Use an image as your SUBMIT button and check to make sure your posting script has received coordinates, a proper referrer, etc.

Have the URL for your form handler calculated by a JavaScript; use a dummy script as the ACTION on your FORM tag, and have the JavaScript replace the ACTION when they click SUBMIT.

I use some of these tactics in my own comment script.
posted by kindall at 4:06 PM on June 8, 2006


Oh yeah. I have to manually approve any comments that contain more than one URL, that's another good approach.
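
As a sketch in Python, assuming a "URL" means anything starting with http://, https://, or www. (the pattern and the function name are illustrative guesses, not the actual script):

```python
import re

# Match a whole URL at once so "http://www.example.com" counts only once.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def needs_manual_approval(comment, max_urls=1):
    """Hold the comment for review if it has more than max_urls links."""
    return len(URL_PATTERN.findall(comment)) > max_urls
```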
posted by kindall at 4:12 PM on June 8, 2006


I saw this article a while ago about a captcha using kittens, where users are shown 9 images and must click on the 4 kittens to submit. I don't have any experience using this, just thought I'd throw it out there.
posted by defcom1 at 4:22 PM on June 8, 2006


Response by poster: Kindall - Wow. Those are good.
posted by y6y6y6 at 4:28 PM on June 8, 2006


I let the spam come until it was ridiculous. Then, I changed the address of the form to submit comments and left the old form on the server but disabled the blog from displaying its comments. All the robots are going to the old form (100s of them a day) and none of the comments are displayed on the site. Been good for 6 mos now.
posted by dobbs at 4:35 PM on June 8, 2006


Response by poster: BTW - The main category of spam getting through these days is one where they use one link, a cheery message, and a common name. And since the link changes every day, a blacklist doesn't help much.
posted by y6y6y6 at 4:37 PM on June 8, 2006


Akismet is fantastic and super easy, and there are versions for lots of different systems.
posted by nylon at 4:41 PM on June 8, 2006


Asking a simple, obvious question ("What color is the sky?" or "What color is a banana?") with a text-input field will probably eliminate your spam.
posted by waldo at 4:44 PM on June 8, 2006


I find that 99% of my comment spam is for old archived posts, so I've required anything older than two weeks to be moderated. I still have to clean it out every now and then, but nothing much actually makes it through to my site. (I'm running a home-brewed system, and the only other tricks I use are rejecting anything with more than a few URLs and changing the name of my form elements a couple of times a year.)
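
The age rule is simple enough to sketch in a few lines of Python. The two-week cutoff comes from the description above; the function name and the use of naive `datetime` values are assumptions.

```python
from datetime import datetime, timedelta

MODERATION_AGE = timedelta(weeks=2)


def requires_moderation(post_published, now=None):
    """True when the post being commented on is older than the cutoff."""
    now = datetime.now() if now is None else now
    return now - post_published > MODERATION_AGE
```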

For what it's worth, I hate CAPTCHAs as a user, so I'm committed to not using them on my site.
posted by web-goddess at 4:46 PM on June 8, 2006


Add a hash of some sort to the form so that a new form must be requested each time a comment is posted. For example, concatenate the poster's IP address, the thread ID, and the date/time with a site-specific salt, take an MD5 hash of that, and put it in a hidden form field. (You'll also need to put the date in a form field as well -- don't worry, the spammer can't change it without breaking the hash, so you can trust the value.) When it's submitted you re-calculate the hash to make sure it's still valid, then check the date to make sure the form is not more than, say, 15 minutes old.

For extra credit use the hash as the NAME of your comment field, and include a dozen or so similar textareas with randomly-generated names the same length as the hash in the HTML source. (Hide these dummy fields using CSS.) If ANY of the dummy fields have text in 'em, throw the comment away. Only the field named for the correct hash should appear. :-D


Diabolical!
posted by evariste at 4:48 PM on June 8, 2006


What blog software are you using? Akismet comes with Wordpress now and it works great for me.
posted by IndigoRain at 4:49 PM on June 8, 2006


Response by poster: The CMS is homebrewed.
posted by y6y6y6 at 4:53 PM on June 8, 2006


Akismet is a web service, and I'd strongly recommend using it. A lot of smart people have spent a lot of time trying to battle comment spam, so why waste your time in an arms race?
posted by Firas at 4:59 PM on June 8, 2006


Not all Captchas are created equal. Some are easier to OCR than others. The freecap version, lacking broken letters, and having each letter wholly in one color, seems pretty easy to break.

The technique used in What The Font could probably be adapted to break it.

I particularly like the CSS-hidden dummy form field idea.
posted by Mr. Gunn at 5:04 PM on June 8, 2006


I got some spam a while back before adding CAPTCHAs; luckily I had comments set to mail to me and I compared that with IPs in the logs. The spams were all coming from the same IP, so I did a lookup and emailed the ISP with the evidence. A week later they wrote back to say they had talked it over with the customer and, not liking the responses, had cancelled the account.

Not the easiest solution from your standpoint, but I'd be lying if I didn't say I felt some satisfaction about it.
posted by Tuwa at 11:26 PM on June 8, 2006


You could always ask Matt Mullenweg (of Wordpress and Akismet fame), or Dr Dave (whose Spam Karma 2 plugin for Wordpress has eliminated 99.9% of spam from my wordpress blog). Those two guys do a dayam good job of spam catching (SK2 better than Akismet, IME), and probably have lots of good tips and pointers.
posted by antifuse at 2:53 AM on June 9, 2006


Response by poster: Great stuff folks. Thanks. I'll be implementing some of this, and I'll report back.

Thoughts -

Several people mentioned quarantining comments on old threads. Unfortunately the site I have that gets the most traffic also gets many legitimate comments on old posts.

I notice lots of people also suggest Akismet. But that's a 3rd party service. Over the years my experience with those has been almost 100% bad. The reason to reinvent the wheel is to ensure the wheel doesn't go bye-bye at any time and leave you SOL.

Also, yes, Freecap is OCR-able. But a) it's just one link in the chain, and b) it will be easy to take that code and rewrite it at some point to make it stronger. And remember - Even though captchas may be 50% or worse OCR-able, you still block a huge amount of spam just by having it. I would guess that the captcha and some rudimentary pattern match filtering has reduced my comment spam at least 95%.

But I think I have enough really good ideas here to build something fairly bullet proof. Of course I can't code against someone just manually spamming the comments, but being able to block the bots will be enough.
posted by y6y6y6 at 7:23 AM on June 9, 2006


Response by poster: Bonus question - The main attack I'm interested in countering right now is the middle-man attack on captchas.

Basically the spammer also runs a porn website, and on that site they require visitors to solve a captcha puzzle to view the porn. And of course they use your captcha. Now *that* is a cool hack.

I can timestamp hashes somehow, but what is a good time limit? Does anyone have some knowledge about how this attack actually plays out? If I put a 15 minute time limit on the comment form is that even going to do anything against this attack?

I already have my htaccess file set to disallow other sites from displaying my images.
posted by y6y6y6 at 7:46 AM on June 9, 2006


htaccess referer blocking will help against the man-in-the-middle attacks, but won't help against replay attacks. Make sure that you don't serve up your images with predictable names, or the MITM will just cache all your captcha images and farm them out locally. Randomize the filename, or better, the data and the filename so that the only useful images are those you serve in the context of a captcha query, and become invalid when that context has gone away.
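
One way to picture "invalid when that context has gone away" is a store of single-use captcha tokens, sketched here in Python. The class, its in-memory dict, and the 15-minute default lifetime are all illustrative assumptions; a real site would likely keep this in a session or database, and would serve each image under its random token rather than a predictable name.

```python
import secrets
import time


class CaptchaStore:
    """Maps an unguessable token to a captcha answer, good for one try."""

    def __init__(self, ttl_seconds=900):
        self.ttl_seconds = ttl_seconds
        self._pending = {}

    def issue(self, answer, now=None):
        """Register a new captcha and return its random token."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(16)
        self._pending[token] = (answer, now + self.ttl_seconds)
        return token

    def redeem(self, token, guess, now=None):
        """Check a guess; the token is discarded whether or not it matches."""
        now = time.time() if now is None else now
        entry = self._pending.pop(token, None)
        if entry is None:
            return False
        answer, expires_at = entry
        return now <= expires_at and guess.strip().lower() == answer.lower()
```

Because `redeem` removes the token on the first attempt, a cached image and its answer can't be replayed later.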
posted by soundslikeobiwan at 8:29 AM on June 9, 2006


You should take a look at the WP Hashcash source code, it's a plugin for Wordpress that implements a complex version of the JavaScript checksum described above. I run 8 weblogs and it blocks 99% of automated spam. I get one or two a day (after it blocks 700-1000) and those may well be manually entered.

I don't believe in CAPTCHAs, they cause me enough annoyance as a user that I avoid subjecting my visitors to them. If spam starts to get through my current system (and it will) I'll probably start asking an extra question (like "what color is a banana" above, but better yet specific to my site's topic.)

Generally, the more you do that makes your comments work differently from other sites, the better. It's not worth a spammer's trouble to attack your site individually.
posted by mmoncur at 8:42 AM on June 9, 2006


Response by poster: First of all, I hope I don't sound like I'm being dismissive with people's suggestions here. Even the ideas I can't use or don't like are great additions to the mix. I'm seeing my "at the end of the day" solution as a shotgun approach, and the more ideas the better. So thank you, thank you, thank you.

I think Freecap does a good job of preventing replay attacks, but I'm still mulling that over. The image is custom generated for each form, and is good for one use.

As for the folks saying they don't like captchas - Can I get some feedback on why you feel that way?

My personal objection to captchas has been that they are too hard to read and too easy to mis-enter. My solution for that was to start with a "good" captcha and then hack the code a bit to make it more user-friendly. I've made it a real word, all letters, all lower case, a wee bit less distorted. Is that better? (if you don't mind me doing some quasi self-linking you can see the image in question by following my profile and checking my blog comment form)

I know that some will say this captcha is too easy to OCR or brute force, but remember it's just one part. And the proof is in the numbers, already it's dried up a huge volume of spam. Previously I think I had about 4 bots doing about 100 spams a day just on my blog. I now have 1 bot doing about 2 spams a day.

Also - Once I get all this coded I'll be trying each part individually on my wife's blog. She talks about various medications a lot and is almost like a honey pot for comment spammers. We deleted over 20k spams from her site last weekend. Should be a good test.
posted by y6y6y6 at 9:11 AM on June 9, 2006


My only issue with Captcha is that it slows me down. This is not an issue for a one-off comment, but for any site at which I intend to comment more than three or four times (in short, any site I frequent at all), it chafes, and makes me less inclined to bother.
posted by cortex at 10:41 AM on June 9, 2006


y6y6y6 - Sam Ruby implemented CAPTCHA in a really interesting and friendly way; check it out. If I had to do it, I'd do it his way.
posted by evariste at 1:09 PM on June 9, 2006

