Best practice for stopping bandwidth thieves via captcha
January 18, 2010 3:47 AM   Subscribe

How do I properly use a captcha to protect file downloads from bandwidth leeches?

A friend has a simple PHP page, offering some (free) files for download. It's constantly getting hit by bots, driving up bandwidth costs. It is also possible that his files are getting deep-linked from other sites offering similar content. Blacklisting the offending IPs hasn't helped much, as they keep changing. I have been asked to help with the matter, and I'm looking for a lightweight solution.

This is the kind of problem that a captcha tries to solve, but I am not clear on how to apply a captcha to protect file downloads: Since, at the end of the validation process, the user will be pointed to some URL to download a file, what is to stop a bot from directly hitting the file, sidestepping the captcha?

I don't want to apply a full login scheme, since the content is essentially free and supporting a whole authentication mechanism sounds like overkill, but I am open to any other ideas.
posted by Dr Dracator to Computers & Internet (8 answers total) 3 users marked this as a favorite
what is to stop a bot from directly hitting the file, sidestepping the captcha?

You need to 1) make sure that the bot won't be able to see the URL without filling in the captcha, and 2) (preferably) make sure that the download URLs you hand out expire after a while.
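One stateless way to get expiring URLs is to sign the file name plus an expiry timestamp with a server-side secret, so the link can neither be forged nor reused forever. A minimal sketch (the script name, parameter names, and secret are placeholders, not anything from the thread):

```php
<?php
// Sketch: expiring, unforgeable download links via an HMAC signature.
// $secret is an assumption -- in practice use a long random value kept
// outside the webroot.

function make_download_url($file, $ttl, $secret) {
    $expires = time() + $ttl;
    $sig = hash_hmac('sha256', $file . '|' . $expires, $secret);
    return 'download.php?file=' . urlencode($file)
         . '&expires=' . $expires . '&sig=' . $sig;
}

function verify_download($file, $expires, $sig, $secret) {
    if (time() > (int)$expires) {
        return false;  // link has expired
    }
    // Recompute the signature; a bot can't forge it without the secret.
    return hash_hmac('sha256', $file . '|' . $expires, $secret) === $sig;
}
```

The download script would call `verify_download()` on the query parameters before serving anything; no server-side token storage is needed with this variant.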
posted by effbot at 3:55 AM on January 18, 2010

Best answer: In a managed download of this kind, the URL given to the user is typically not the actual URL of the end file. In other words, you'll probably want to move the destination files to a location outside the webroot.

Then you have a download script which basically validates the user (have they passed the captcha stage?), then gets the content of the requested file and serves it up. For this step you can use PHP's readfile function.
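A minimal sketch of such a script, assuming the captcha handler sets a session flag on success (the storage directory, session key, and parameter names are placeholders):

```php
<?php
// Sketch: serve files stored outside the webroot via readfile(), but only
// after the captcha check. Since the files have no direct URL, bots can't
// sidestep this script.

// Map a requested name to a real path, or null if it's invalid.
function resolve_download_path($storage, $requested) {
    $file = basename($requested);  // basename() strips any ../ traversal
    $path = $storage . '/' . $file;
    return ($file !== '' && is_file($path)) ? $path : null;
}

function serve_download($storage) {
    session_start();
    $requested = isset($_GET['file']) ? $_GET['file'] : '';
    $path = resolve_download_path($storage, $requested);

    // Assumes the captcha page set $_SESSION['captcha_ok'] on success.
    if (empty($_SESSION['captcha_ok']) || $path === null) {
        header('HTTP/1.1 403 Forbidden');
        return;
    }

    header('Content-Type: application/octet-stream');
    header('Content-Disposition: attachment; filename="' . basename($path) . '"');
    header('Content-Length: ' . filesize($path));
    readfile($path);  // read the file and write it straight to the client
}

// Entry point in download.php (directory is a placeholder):
// serve_download('/var/files');
```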
posted by le morte de bea arthur at 4:23 AM on January 18, 2010

When the captcha is successfully completed, you create a token which you then make part of the download URL for that user. When the user hits the download URL you first verify that the token is valid before sending the file. You can expire tokens by time (i.e. the link is only good for 5 minutes) or you can tie the token to the IP address that completed the captcha; or you can do a combination of both. You could also use a cookie for the token instead of making it part of the URL. The disadvantage there is that you screw users that want to use something different than their web browser (e.g. wget) to download the file.
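The scheme above might look something like this sketch, with an in-memory array standing in for whatever database or session store actually holds the tokens (all names are placeholders):

```php
<?php
// Sketch of the token scheme: mint a random token when the captcha is
// solved, remember its expiry and the solver's IP, and check both before
// serving the file.

function issue_token(array &$store, $ip, $ttl = 300) {
    // Hard-to-guess ID; a cryptographic RNG would be stronger still.
    $token = md5(uniqid(mt_rand(), true));
    $store[$token] = array('ip' => $ip, 'expires' => time() + $ttl);
    return $token;
}

function token_valid(array $store, $token, $ip) {
    if (!isset($store[$token])) {
        return false;  // unknown (or already purged) token
    }
    $t = $store[$token];
    return time() <= $t['expires'] && $t['ip'] === $ip;
}
```

The captcha page would call `issue_token()` and append the result to the download link (e.g. `download.php?token=...`); the download script then calls `token_valid()` before serving anything.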
posted by Rhomboid at 5:17 AM on January 18, 2010

Oh, and one caution about associating the token with the IP address: some users are behind an institutional transparent proxy, which pools all web requests for a group of people through one or more central caching proxies. This is a cost-saving measure because it cuts down on the external bandwidth the institution needs. However, if the institution is large enough there will be a pool of such proxies, with each request going to a randomly assigned member of that pool. The end result is that two requests from the same user can appear to originate from two different IP addresses. This breaks IP-based authentication schemes. And again, note that this is transparent proxying: there is nothing on the end user's computer that they can change or configure to get around it.
posted by Rhomboid at 5:23 AM on January 18, 2010

I might be misunderstanding slightly, but are you referring to indexing robots à la Google? If that is what you're talking about, point your friend towards the robots.txt file.
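For what it's worth, a robots.txt at the site root along these lines (the directory name is a placeholder) asks crawlers to skip the download area; note that only well-behaved bots honor it, so it won't stop a malicious one:

```
User-agent: *
Disallow: /files/
```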
posted by handle_unknown at 6:19 AM on January 18, 2010

Use a session-based dynamically-generated URL that verifies the session before sending the file. Simple.
posted by rr at 9:12 AM on January 18, 2010

handle_unknown has a point. Are your files getting random hits from crawling bots? There are a heap ton more than just Google's bot that roam the interwebs indexing sites. On average I bet your site gets crawled half a dozen times in a day by as many bots.
posted by Gainesvillain at 10:56 AM on January 18, 2010

Response by poster: Thanks for the suggestions - I'm marking le morte de bea arthur as best because readfile was the piece of the puzzle I was missing.

The robots file has already been fixed, but didn't help. The hosting provider thinks the hits are coming from a malicious bot, which to be honest I have a hard time believing. My guess is he's been hotlinked and doesn't know it; moving the files out of the document root and protecting them with a captcha should fix that either way.
posted by Dr Dracator at 12:01 AM on January 19, 2010
