Protecting mah bandwidth
February 26, 2007 8:12 PM

Website admin question: I want to put up a website with about 10000+ charts. I don't mind people copying images, but how can I protect against a site sucker barging in, scooping them up, and killing my bandwidth? I wish I could generate a bunch of static HTML pages with simple IMG tags, but I'm guessing this isn't possible. What might PHP have to offer for protecting my content from site suckers?
posted by calhound to Computers & Internet (11 answers total)
P.S. I should add that the HTML pages will probably have a lot of informational content and there will be 10000+ of those, so I guess at the -very- least I'll have to be displaying the pages as PHP with some sort of abuse checking, or monitoring for too many connections from one IP (through an Apache control?). Every now and then I've been on sites that ban me for a couple of hours for looking at too much content, so that's the direction I want to be heading in.
posted by calhound at 8:28 PM on February 26, 2007

Post the images on Flickr (or some external image hosting site) and link to them from their servers. Be mindful of any bandwidth allowances they might have, but let someone else bear the burden of serving the media. Alternatively, use a high-bandwidth media provider/service such as Ninesystems/Akamai.
posted by drinkspiller at 8:54 PM on February 26, 2007

flickr doesn't allow non-photographic images in public spaces. They'll suspend the account in a heartbeat.
posted by FlamingBore at 9:12 PM on February 26, 2007

Basically you're probably going to have to program something. There are a few approaches I might try:

* cookie based: send them a cookie that details how many charts they've used in the last X minutes. When it gets too high, deny them access. Trivially defeated by anyone who cares to.

* IP based: keep track of accesses over the last X minutes per IP. Block when the number gets too high. More difficult to defeat for anyone who doesn't have access to lots of different IPs. Kind of a pain in the ass if you have a lot of traffic since you'll end up tracking a lot of data.

* Login based: make users create a login, and login with it. Track usages over the last X minutes by login. Pain in the ass for casual users, not hard to defeat if you don't mind making multiple accounts, and probably just as much of a pain in the ass to track as IP based.
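The IP-based approach could be sketched in PHP roughly like this (the scratch directory, window, and limit are all assumptions; a real version would want file locking or a proper datastore):

```php
<?php
// Sketch of an IP-based throttle: one scratch file of hit
// timestamps per IP, pruned to the last $window seconds.

// Pure helper: count hits newer than ($now - $window) and compare
// against the limit. Kept separate so it's easy to test.
function over_limit($hits, $now, $window, $limit) {
    $recent = 0;
    foreach ($hits as $t) {
        if ($t > $now - $window) {
            $recent++;
        }
    }
    return $recent >= $limit;
}

// Glue for a live request: load this IP's hit log, deny with a 503
// if it's over the limit, otherwise record the hit and continue.
function throttle($dir, $window = 600, $limit = 100) {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $file = $dir . '/' . md5($ip);
    $hits = is_file($file) ? array_map('intval', file($file)) : array();
    $now  = time();

    if (over_limit($hits, $now, $window, $limit)) {
        header('HTTP/1.1 503 Service Unavailable');
        exit('Too many requests; try again in a few minutes.');
    }

    // Keep only recent hits so the per-IP file stays small.
    $keep = array();
    foreach ($hits as $t) {
        if ($t > $now - $window) {
            $keep[] = $t;
        }
    }
    $keep[] = $now;
    file_put_contents($file, implode("\n", $keep));
}
```

Call throttle() at the top of every page that serves a chart; the cookie-based variant is the same shape with the hit list stored client-side instead of in a file.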

Is the problem bandwidth? Your question implies that it is but then you go on to say that you wish you could provide the content with static HTML and img tags - which would have the same bandwidth requirements as on-the-fly generated charts. Maybe the real problem is computational resources? If so, actively caching everything may help. Precomputing commonly needed charts may help. I guess this really depends on whether you know ahead of time which charts to generate.

On re-reading it sounds like your content may actually be static, and you want to serve it that way, but you're afraid you won't be able to, because you need some kind of algorithm to detect abuse? In that case there are lots of options. Many webserver platforms offer auth filtering on any kind of document, static HTML or other. At the very least you could have the content static and have a single php-or-whatever-scripting-language processor that simply coughs up the document if you're allowed, and gives back nothing if you aren't.
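That last idea - static files behind a single scripted gate - could look something like this (the content directory and the abuse check itself are placeholders):

```php
<?php
// Hypothetical gatekeeper: every page request goes through this
// script, which serves the static file only if the caller passes
// the abuse check. is_abuser() stands in for whatever check you
// settle on (IP counting, sessions, etc.).

// Helper: map a requested page name onto the content directory,
// discarding any sneaky path components like "../".
function safe_page_path($base, $page) {
    return $base . '/' . basename($page) . '.html';
}

// Only meaningful when called over the web.
if (isset($_GET['page'])) {
    $path = safe_page_path('/var/www/charts', $_GET['page']);

    if (!is_file($path)) {
        header('HTTP/1.1 404 Not Found');
        exit;
    }
    if (is_abuser($_SERVER['REMOTE_ADDR'])) {  // your abuse check here
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
    header('Content-Type: text/html');
    readfile($path);
}
```

The pages themselves stay static on disk; only the gate is dynamic.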

I guess I'm not 100% clear on what sort of answer you're looking for. If you're not averse to paying someone, there are lots of people who could handle this issue for you. If you have programming skills, it shouldn't be too tough to cobble something together.
posted by RustyBrooks at 9:26 PM on February 26, 2007

how can I protect against a site sucker barging in, scooping them up, and killing my bandwidth?
I am confused too. Are you concerned about someone scraping your site and taking your content? Or are you concerned that this scraper will hand you a huge bandwidth bill?

If the scraper (wget or whatever) is not a kludge, its impact on your bandwidth will be no different than serving these files to your regular users (and will only happen once per file, if all goes as planned).

If you are concerned about not having somebody steal your files, you could do a number of things, depending on how you will make these files available to your "regular" users. First of all, disable directory listing. Don't have an index page that links to every file. Disable hotlinking via Apache. But of course, these things make it harder for regular people to interact with your stuff. A List Apart had an article on "smarter" hotlink prevention. Might be worth a look.
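In Apache terms, the first and last of those look roughly like this (the domain and extensions are placeholders; note some browsers and proxies send no Referer at all, hence the empty-Referer allowance):

```apache
# Turn off directory listings for the chart folders.
Options -Indexes

# Crude hotlink prevention: only serve images to requests whose
# Referer is your own site or empty. "example.com" is a placeholder.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(png|gif|jpe?g)$ - [F]
```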
posted by misterbrandt at 10:07 PM on February 26, 2007

Seconding Rusty's head-scratching.

Why can't you just generate flat HTML, and what advantage do you think you'd get by doing this? Are you saying you can't serve flat HTML because then you couldn't programmatically check for abuse?

Why do you imagine someone will want to grab all your content at once?

How popular are you thinking the site will be? If you're putting up 10,000 pages and someone sucks them all down one by one, how is that different from your site becoming popular and getting thousands of hits a day?

If I was going to do this kind of thing to you (and I have in the past!) I'd just add a polite pause to the script and get one page every five seconds. I'd have your whole site downloaded in just over 13 hours. Would you even notice?
posted by AmbroseChapel at 10:17 PM on February 26, 2007

I think his point is that while the typical user might only access, say, 0.5% of the 10,000 charts, a scraper might grab all 10,000 in a single swoop. That's a lot more usage, and clearly a quick way to top out on a bandwidth limit for a single pair of virtual (fake) eyeballs.

Naturally, you can build into PHP a simple session- or cookie-based script that will check for page-load requests over a certain window of time. (Store hits to a db, check for hits by IP address in a certain frame of time, etc.) If it's more than your preferred threshold, have them complete a math question or other CAPTCHA to proceed, and all's well.
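A bare-bones version of that session-based counter (the window, limit, and CAPTCHA URL are all made up for illustration):

```php
<?php
// Session-based page-load counter: prune old hits, record this one,
// and bounce to a CAPTCHA page once the caller crosses the line.

// Pure helper: drop timestamps older than the window.
function prune_hits($hits, $now, $window) {
    $recent = array();
    foreach ($hits as $t) {
        if ($t > $now - $window) {
            $recent[] = $t;
        }
    }
    return $recent;
}

// Only meaningful when running under a web server.
if (isset($_SERVER['REQUEST_URI'])) {
    session_start();
    $now    = time();
    $window = 300;  // seconds (assumed threshold)
    $limit  = 50;   // page loads per window (assumed)

    $hits   = isset($_SESSION['hits']) ? $_SESSION['hits'] : array();
    $hits   = prune_hits($hits, $now, $window);
    $hits[] = $now;
    $_SESSION['hits'] = $hits;

    if (count($hits) > $limit && empty($_SESSION['passed_captcha'])) {
        header('Location: /captcha.php');  // hypothetical CAPTCHA page
        exit;
    }
}
```

Sessions ride on a cookie, so like any cookie scheme this is trivially defeated by a scraper that discards cookies; pair it with an IP check if that matters.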
posted by disillusioned at 11:16 PM on February 26, 2007

bandwidth is dirt cheap these days. Host it with (or any webhost that oversells) and rest easy.
posted by Satapher at 11:33 PM on February 26, 2007

It isn't necessary to resort to PHP if you really would prefer to serve static HTML. Instead you can write a script that monitors the Apache logs for excessive access from single sources (I'd probably use a combination of IP & User-Agent as a crude source ID). The script could wake up every 15 minutes or so, assemble a list of abusers, and then write out a set of mod_rewrite instructions to redirect page requests from these users to a "Please stop" page and image requests to a very lightweight "Go Away" image.

Google might decide to crawl some or all of your pages, so if you implement any sort of abuse checking you might want to make sure you only penalize those who actually retrieve the charts, not just any page.
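The rules such a script writes out might look like this (the IPs stand in for whatever the log monitor flags; each RewriteCond block applies only to the rule immediately after it, hence the repetition):

```apache
# Autogenerated every 15 minutes by the log-watching script.
RewriteEngine On

# Chart images from flagged IPs get a tiny "go away" image.
RewriteCond %{REMOTE_ADDR} ^10\.1\.2\.3$ [OR]
RewriteCond %{REMOTE_ADDR} ^10\.4\.5\.6$
RewriteRule \.(png|gif|jpe?g)$ /go-away.gif [L]

# Pages from flagged IPs get the "please stop" notice.
RewriteCond %{REMOTE_ADDR} ^10\.1\.2\.3$ [OR]
RewriteCond %{REMOTE_ADDR} ^10\.4\.5\.6$
RewriteRule \.html$ /please-stop.html [L]
```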
posted by RichardP at 11:43 PM on February 26, 2007

There's an easier way.

Assuming your web host is running Apache, either

1. Limit the daily download from an IP with mod_cband.
2. Use mod_rewrite to block scraping.

Google either one and read up on how to use them; they're a bit too complicated to type up in a post.
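For the mod_rewrite route, one starting point is blocking the common mirroring tools by User-Agent (the list here is illustrative, and trivially spoofed, so treat it as a first line of defense only):

```apache
RewriteEngine On
# User-Agent strings of some well-known site suckers.
RewriteCond %{HTTP_USER_AGENT} (wget|HTTrack|WebZIP|Teleport) [NC]
RewriteRule .* - [F]
```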
posted by mphuie at 2:12 AM on February 27, 2007

If you put it on the internet, assume that someone will take all of it.
posted by aye at 12:23 PM on February 27, 2007

This thread is closed to new comments.