My website's scripts eat up too much bandwidth. How do I resolve this?
April 5, 2008 9:21 AM

My site's cron tasks eat up a lot of bandwidth. Can I change the paths from http to local? Or can I move the tasks to my home server?

One of my websites (the election one) runs a truckload of PHP to parse, mesh, filter, and cache an equally large number of external RSS feeds which are then stored and included in a reasonably lean (~10k) index.html. This creates a strange situation where the bandwidth consumed by the PHP (locally, but via http) far exceeds the actual bandwidth consumed by visitors. My web hosting provider has warned me about the excessive activity, but given the limited number of visitors, upgrading to a hosting deal with a higher bandwidth limit isn't really worthwhile for me.

How do I resolve this?

Current situation: a bunch of FeedForAll PHP scripts are executed using a Ruby-as-CGI script (which wackybrit kindly helped me with) called via cron at regular intervals, varying by the nature of the content: non-time-critical content is updated hourly, whereas latest news etc. is updated every 10 minutes (was 2 minutes, but I took it down a notch for now to avoid further problems with my host). My monthly bandwidth limit is 8 GB; actual visitor usage is a modest fraction of that, but the scripting alone was responsible for 37 GB (!) in March.

Several options:

1) The bandwidth is measured by the host using HTTP responses. If I run the scripts using local UNIX paths, e.g. /usr/gnfti/www/ etc., the scripts generate no HTTP traffic and thus don't count towards the quota. Of course they will still incur server load, but I have discussed this with the hosting company and it seems they're okay with it if I limit the use somewhat as a compromise. But then, not all of these scripts seem to be okay with local paths, most notably the keyword filter (rssFilter.php), which is responsible for 25% of the bandwidth on its own.

Any ideas how to make it work this way anyway?

2) I have a home server running WAMP on XP. Could I run the PHP on this box instead and have the output automatically upload to the remote server via FTP? Most of the bandwidth is to do with moving stuff around locally anyway; the actual output is on the order of kilobytes. I imagine this would involve running some sort of cron + FTP client/scheduler on the home box, but my knowledge in this area is limited.

Could this be done, and if so, how?

3) Lastly, if you have any suggestions on how to pull this off aside from the examples above, that too would be much appreciated.

Thanks in advance for any suggestions or insight you might have to offer, guys.
posted by goodnewsfortheinsane to Computers & Internet (18 answers total) 2 users marked this as a favorite
 
I think your best option for a proper solution is to move to another hosting firm; 8GB is nothing nowadays, with even cheap hosting plans offering hundreds of times more.

For now, how about adjusting the frequency of feed-fetching based on time of day and day of the week? You can probably halve your bandwidth requirements with that measure alone.
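
For example, something like this in the crontab (the command is just a placeholder for whatever currently kicks off the fetching) polls every ten minutes during weekday daytime hours and only hourly overnight and at weekends:

# every 10 minutes, 7am-11pm on weekdays
*/10 7-23 * * 1-5 /home/gnfti/bin/update-headlines
# hourly overnight on weekdays, and hourly all day at weekends
0 0-6 * * 1-5 /home/gnfti/bin/update-headlines
0 * * * 0,6 /home/gnfti/bin/update-headlines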
posted by malevolent at 9:50 AM on April 5, 2008


Search for Dreamhost in AskMe, that's how I chose them. I *love* dreamhost. My monthly bandwidth limit is 5TB. It's cheap as hell, too.
posted by popechunk at 10:12 AM on April 5, 2008


I've reread what you're doing a few times and it still doesn't click where you are burning so much bandwidth. It may be that one of your scripts is inefficient -- fetching complete feeds when all it may need to do is pull the headers to see if the feed has been updated. I do agree with the Dreamhost recommendation; the simplest fix might be to find another host. You may not want to redo what you've set up, but have you looked at Yahoo! Pipes for mashing the data up?
posted by bprater at 10:56 AM on April 5, 2008


There are a lot of ways to solve this problem, but the bottom line is that if you need X bytes of data from a remote site, you're going to have to pull it across the 'net. It should scare you that your host is willing to let you use "local" paths to fetch remote data because it suggests that they either a) don't understand what you're asking for or b) don't really understand how the internet works.

So the solution is to either reduce the amount of data you need to pull, reduce the frequency at which you're fetching it, find a new host, or some combination of the three.

Reducing the amount of data could mean using Yahoo! Pipes (or similar) to combine the feeds for you, only requesting headers first (to see if the feed has actually been updated), or cutting out some feeds that you really don't need.
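
For the headers-only check, something as simple as this (feed URL invented) gets you the Last-Modified header without pulling the whole body:

curl -sI http://example.com/somefeed.xml | grep Last-Modified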

The frequency is easy... do you REALLY need to update your headlines every 10 minutes? Are there some feeds that you're polling twice a day but only update twice a week? Could you use one of the cloud notification services like blo.gs to tell you exactly when a feed is updated... then the next time the cron runs it knows what it needs to pull and what it doesn't.

Switching hosts is probably the easiest solution, but be warned that shared hosting environments vary, and while one host might give you a whole mess of bandwidth, they might balk at the processor usage required for the scripts you're running.
posted by toomuchpete at 11:34 AM on April 5, 2008


Not sure about your specific problem, but in response to some of the answers above (Search for Dreamhost in AskMe, that's how I chose them. I *love* dreamhost. My monthly bandwidth limit is 5TB. It's cheap as hell, too.) -- some people would say that there's a reason it's cheap as hell, and don't love Dreamhost. So if you look to switch, make sure you do your research.
posted by inigo2 at 11:36 AM on April 5, 2008


Make sure your feed software supports HTTP Conditional GET.
posted by steveminutillo at 12:10 PM on April 5, 2008


Best answer: Yeah... I also use dreamhost for whatever it's worth, but I bet the home / wamp solution would work just fine.

It works exactly like you'd expect... wamp has command-line php, same as anything else. You'd probably have to write some quick batch files and run 'em from the windows scheduler, but it isn't that hard to deal with.
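
Roughly along these lines, say (the php.exe location depends on your WAMP install, and the script and output names here are made up):

rem update-feeds.bat
C:\wamp\php\php.exe -f C:\wamp\www\electicker\build-headlines.php > C:\wamp\www\electicker\headlines.html

and then something like

schtasks /create /sc minute /mo 10 /tn UpdateFeeds /tr C:\scripts\update-feeds.bat

to run it every ten minutes, assuming your flavour of XP has schtasks (otherwise the Scheduled Tasks control panel does the same job).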

If you want real cron support under windows, you can also grab cygwin and do all the stuff you'd normally do under unix. Within cygwin, you can even install ruby, php, and whatever else you need and probably move your current setup with only path adjustments and a final little script that ftp's data from your local directory to the remote one.
posted by ph00dz at 12:49 PM on April 5, 2008


Best answer: You say you could run the scripts using local paths? Meaning you're currently doing something like...
wget http://example.com/myscript.php in your crontab? Could you perhaps post an example line from your crontab?

If that's the case there are a number of things you can do to help yourself out.

First of all, it's quite likely that your feed fetching software may not be taking full advantage of the available http headers... i.e. the aforementioned http conditional get.
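
For reference, conditional GET just means sending back the Last-Modified / ETag values you received on the previous fetch, so the server can answer 304 Not Modified instead of resending the whole feed. On the wire it looks roughly like this (feed URL and ETag invented):

GET /somefeed.xml HTTP/1.1
Host: example.com
If-Modified-Since: Sat, 05 Apr 2008 09:00:00 GMT
If-None-Match: "abc123"

HTTP/1.1 304 Not Modified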

Secondly, it sounds as if outbound bandwidth in the form of the data the server fetches from other servers isn't necessarily the problem, but perhaps the output of the script. If you're wget-ing a script on your site that produces a ton of output while it populates whatever data on the backend, you're just senselessly throwing bandwidth away.

But then, not all of these scripts seem to be okay with local paths[. . .]
That can almost certainly be fixed with a minimal amount of tweaking; if you need help, send me a message.

Otherwise, any additional details you can provide may be helpful.
posted by Matt Oneiros at 1:51 PM on April 5, 2008


I'm a little confused at how this bandwidth calculation is done. You say:

The bandwidth is measured by the host using HTTP responses.

It sounds like your provider is just parsing their Apache logs for your site. But when you fetch an external resource, your hosting provider shouldn't see any HTTP responses (unless they're sniffing your traffic, in which case dump them). So how, exactly, is this bandwidth appearing to them?

Here is my speculation. Tell me if it's right:

1) Cron launches a script on your provider's shell.
2) The script requests a PHP file on your website over HTTP. Something like:

wget http://mywonderfulsite.com/internal/agregate.php

3) That PHP file gets the external feeds and dumps them as the web response.
4) The cron script saves that response as your index.html file.

If this is the case, then I can see why you're wasting bandwidth. First of all:

a) Have you done anything to make sure that no one else can get to http://mywonderfulsite.com/internal/agregate.php? A search engine accidentally indexing this would go nuts. Obviously the data updates all the time, so it would start crawling more and more often, eating up a ton of bandwidth.

b) The proper way to do this is to not make a HTTP call, but to just use the CLI PHP. For example, instead of:

wget http://mywonderfulsite.com/internal/agregate.php

You'd do:

/usr/bin/php5 -f /home/mywonderfulsite/www/internal/agregate.php

That way nothing hits the Apache logs. But that's assuming your host enabled CLI PHP.
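
So the crontab line might end up looking something like this (the php binary location is a guess, and I'm sticking with my made-up paths from above):

*/10 * * * * /usr/bin/php5 -f /home/mywonderfulsite/www/internal/agregate.php > /home/mywonderfulsite/www/index.html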
posted by sbutler at 2:23 PM on April 5, 2008 [1 favorite]


You don't say how much you're paying now. Just for purposes of comparison, you can get a little VPS (virtual private server) from Slicehost, for example, for $20/mo., which includes 100 gigs of bandwidth. If that's not excessive for your budget, that would probably save a lot of troubleshooting time.

If that is too much, then you might be able to cut down your bandwidth usage by using a client that supports gzip compression. mod_gzip is very frequently enabled on webservers, and it can cut your bandwidth by as much as 90% on text content like feeds. If you're using, for example, curl for the actual data transfer, simply passing it a "--compressed" switch might solve your problem.
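
In other words, something like this (feed URL invented):

curl --compressed -s -o somefeed.xml http://example.com/somefeed.xml

The --compressed switch just adds the Accept-Encoding header and transparently decompresses whatever comes back, so the rest of your pipeline doesn't have to care.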
posted by Malor at 2:31 PM on April 5, 2008


CGI is just a set of environment variables - if you set the right environment variables before running the scripts from the command line, they ought to work.
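
A rough, untested sketch of what that might look like for one of the scripts (binary name, paths, and parameters will vary per host; REDIRECT_STATUS is needed because the CGI binary otherwise refuses to run outside a web server):

export REDIRECT_STATUS=1
export GATEWAY_INTERFACE=CGI/1.1
export REQUEST_METHOD=GET
export SCRIPT_FILENAME=/home/gnfti/www/rssFilter.php
export QUERY_STRING='feed=http://example.com/somefeed.xml'
php-cgi > filtered.xml

Note that the output will start with the Content-type header block, which you'd want to strip before using it.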
posted by sergent at 3:18 PM on April 5, 2008


sbutler has the answer. That's the only reasonable scenario in which measuring bandwidth using "HTTP Responses" makes sense. Using local PHP instead of wget should solve your problem, but I can't believe the host considers that better for them.

Also, any host that measures bandwidth in that way (and not at a router) is incredibly primitive by today's hosting standards. And a bit strange. I'd strongly suggest getting a different host - I recommend Pair Networks.
posted by mmoncur at 2:09 AM on April 6, 2008


Best answer: I'd love to take a look at the PHP scripts which are causing this problem, but following the link to FeedForAll above, I see that you have to pay to even see the source code, which comes as something of a surprise.

I'd be very interested to know how the rssFilter.php script works, both because it's generating so much traffic and because it doesn't work from the command line.
posted by AmbroseChapel at 4:30 PM on April 6, 2008


Using local PHP instead of wget should solve your problem, but I can't believe the host considers that better for them.

It is better for them. Running wget http://yoursite.com/some.php kicks off a whole chain: wget itself, a DNS lookup, traffic on the loopback interface, the HTTP daemon, and finally the PHP interpreter.

Running PHP locally cuts out everything but the PHP interpreter. In a crowded, shared hosting environment this could make a big difference.

But yes, if you're running wget to grab data from your own, local domain, stop that. Especially if the internal traffic is counting against you. BUT... if all you're doing is running WGet and it's not outputting anything substantial, there's not going to be an increase in bandwidth used.

Using the WGet method counts the bandwidth the script sucks up PLUS the bandwidth of the output from the script. If your script has no output, the extra will be negligible.
posted by toomuchpete at 2:29 PM on April 7, 2008


Response by poster: Sorry it took me a while to reply - I've been busy with this and other things, and I have an annoying little bout of the flu.

Thanks for your input, all. Maybe I'll address some of your answers more specifically later, but for now I can summarize that I moved it all to my home server. I hope it proves to be reliable.

The problem was, the main script ran locally, but it called one or two other scripts that demanded http paths. I.e.

/home/gnfti/www/script.php?feed=http://example.com/filter.php?feed=http://example.com/mesh.php

I've worked with Pipes before and I love it, but it just doesn't update frequently enough for my needs. I still use it for some things on the site that don't need to be super-fresh, though.

Ambrose, I'll MefiMail you.

Thanks, guys!
posted by goodnewsfortheinsane at 11:48 AM on April 8, 2008


Best answer: Oh, and for those interested, the setup is as follows:

-Win XP
-WAMP (Apache, PHP)
-Ruby
-pycron

Ruby script is like

require 'open-uri'   # so open() can fetch http:// URLs
require 'net/ftp'    # for the FTP upload further down

# fetch the rendered headline fragment from the local WAMP stack
data = open("http://localhost/electicker/rss2html2_Flickr.php?XMLFILE=http://localhost/electicker/filter.php?feed=http://localhost/electicker/mesh-headlines.php&TEMPLATE=news-template-bw.html&NOFUTUREITEMS=1&MAXITEMS=7&ItemDescriptionLength=100").read
File.open(this_directory + "/headlines-bw.html", 'w') do |f|
  f.write data
end

puts "Fetched BW Headlines"


and having done this for all the entries:


ftp = Net::FTP::open("[$FTPSERVER]")
ftp.login("[$LOGIN]", "[$PASS]")
ftp.chdir("/www")
ftp.puttextfile(this_directory + "/headlines-bw.html")
ftp.puttextfile(this_directory + "[$ETC]")
ftp.puttextfile(this_directory + "[$ETC]")

[...]

ftp.close

puts "Uploaded All."


For now, I'm happy. Except that the host still wants me to pay for the bandwidth, even though they had previously assured me that the cron activity wouldn't count towards the limit. Oh well.
posted by goodnewsfortheinsane at 12:52 PM on April 8, 2008 [1 favorite]


Response by poster: Marked own answer as best for future reference.
posted by goodnewsfortheinsane at 2:29 PM on April 20, 2008


Response by poster: And for cross-reference, a post on this on the FFA forums.
posted by goodnewsfortheinsane at 2:31 PM on April 20, 2008


This thread is closed to new comments.