How Can I Write An Anonymizing Proxy In PHP?
September 12, 2007 8:57 PM

I would like to make an anonymous proxy using PHP (similar in concept to how http://anonymouse.org works). The difference between my project and Anonymouse is that my project is intended for private group use on a specific group of sites (note that this does not break the TOS of either my awesome webhost or the sites on which I'll be using the proxy).

The proxy needs to be able to do a few things. First and foremost, it needs to be able to retrieve a URL of my choice and store Cookies. I have a fairly good idea how to do that.
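
For the fetching half, a rough sketch in Python (one of the languages the question allows) might look like this. It uses only the standard library; `make_session`, `fetch`, and the User-Agent string are made-up names for illustration, and a real deployment would need one cookie jar per proxy user, persisted between requests.

```python
import http.cookiejar
import urllib.request


def make_session(user_agent="private-proxy-sketch/0.1"):
    """Build an opener that remembers cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Send a fixed User-Agent rather than the visitor's own.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch(opener, url, timeout=10):
    """Return (body bytes, content type) for url, storing any cookies set."""
    with opener.open(url, timeout=timeout) as response:
        return response.read(), response.headers.get_content_type()
```

Cookies set by the remote site end up in `jar` and are sent back automatically on later calls to `fetch` with the same opener.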

The part I'm having trouble with is URL rewriting (so, for example, <a href="http://www.google.ca"> will be rewritten so it links to http://www.myproxy.com/?link=http://www.google.ca instead of directly to Google), and also JavaScript stripping. One of the reasons I'm creating this is so my friends can access the site while on a locked-down computer: they can't disable JavaScript, and JavaScript can be cleverly written to avoid URL rewriting.
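
The rewriting of a single link can be sketched as follows, in Python. `proxied` is a hypothetical helper name; the proxy address is the one from the example above. Percent-encoding the target is an addition not shown in the example, made so that a query string inside the target URL survives the trip. Relative links are resolved against the page they came from, and anything that is not http or https (such as a `javascript:` link) is replaced with a dead link.

```python
from urllib.parse import quote, urljoin, urlsplit

PROXY_ENDPOINT = "http://www.myproxy.com/?link="  # address used in the question


def proxied(href, page_url):
    """Rewrite one link so that it points back through the proxy."""
    # Resolve relative links ("/about", "../x") against the current page.
    absolute = urljoin(page_url, href.strip())
    # Refuse javascript:, data:, mailto: and anything else non-HTTP.
    if urlsplit(absolute).scheme not in ("http", "https"):
        return "#"
    # Encode everything so "?" and "&" in the target stay part of the target.
    return PROXY_ENDPOINT + quote(absolute, safe="")


print(proxied("http://www.google.ca", "http://example.com/"))
```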

Is there any feasible way to ensure that the proxy remains truly anonymous? I've come up with two solutions so far:

1) Somehow strip all JavaScript from the document, as well as intercept all incoming URLs and rewrite them. I know how to do this on a basic level, but I'm sure there are cases I'm missing. Any suggestions?

(I am aware of the PHP proxy Poxy, but it says that its JavaScript stripping is imperfect.)

2) Since I know the sites that this will be used on, I could potentially write the proxy such that it acts as a screen scraper, getting all the useful info from the site itself, and writing it out in HTML-escaped form, using its own formatting. However, I'm worried about what might happen if the sites change layout suddenly. Is there a way to scrape HTML effectively so that it's not as sensitive to layout change?

3) Open to any other suggestions on how to write an anonymous PHP proxy (that runs on a shared host - so I can't do some fancy mod_rewrite trickery or anything and simulate a real proxy, unfortunately)

Also, feel free to substitute PHP with Perl, Python, or Ruby (or some other scripting language that can run server-side). I'm asking about PHP because it's the easiest to deploy - but if there are compelling arguments for another language, I'm open to that too!
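
One way to approach option 1 without chasing every edge case is to invert it: rather than listing what to strip, keep only an allowlist of tags and attributes and drop everything else. A rough Python sketch using the standard library's `html.parser` follows. The `ALLOWED` table, the `Sanitizer` class, and `sanitize` are illustrative assumptions, not taken from CGIProxy or Poxy, and the allowlist here is deliberately tiny.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import quote, urljoin, urlsplit

PROXY_ENDPOINT = "http://www.myproxy.com/?link="  # address used in the question

# Hypothetical allowlist: tag -> attributes that may be kept.
ALLOWED = {
    "a": {"href"}, "img": {"src", "alt"}, "p": set(), "br": set(),
    "b": set(), "i": set(), "ul": set(), "ol": set(), "li": set(),
    "table": set(), "tr": set(), "td": set(), "th": set(),
}
URL_ATTRS = {"href", "src"}
DROP_WITH_CONTENT = {"script", "style"}  # tag and everything inside it
VOID = {"br", "img"}                     # tags that take no closing tag


class Sanitizer(HTMLParser):
    def __init__(self, page_url):
        super().__init__(convert_charrefs=True)
        self.page_url = page_url
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unknown tags vanish; their text is still emitted, escaped
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag]:
                continue  # drops onclick, style, and anything unlisted
            value = value or ""
            if name in URL_ATTRS:
                value = self.proxied(value)
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    def proxied(self, href):
        absolute = urljoin(self.page_url, href.strip())
        if urlsplit(absolute).scheme not in ("http", "https"):
            return "#"  # javascript:, data:, and so on
        return PROXY_ENDPOINT + quote(absolute, safe="")


def sanitize(html_text, page_url):
    parser = Sanitizer(page_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Because the output is rebuilt from scratch, anything the sketch does not recognise is left out rather than passed through, so a missed case tends to cost formatting instead of anonymity.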
posted by mebibyte to Computers & Internet (7 answers total)
 
There's CGIProxy and EZproxy (which costs money, but the unregistered version can be useful).
posted by roue at 4:37 AM on September 13, 2007


A proper HTML parser like Ruby's Hpricot would probably be the best approach to screen-scraping with some layout independence and maintainability.

Such a parser would also be useful for rewriting: you can walk all the elements and find hrefs, JavaScript event handlers, script tags, and so on, more easily and accurately than by writing huge regexps to cope with every edge case.
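
As an illustration of that kind of walk, here is a small Python sketch built on the standard library's `html.parser` (not Hpricot). It only reports where scripts and URLs can hide in a page; `Audit`, `audit`, and the `URL_ATTRS` set are hypothetical names, and the attribute list is an assumption that a real proxy would need to extend.

```python
from collections import Counter
from html.parser import HTMLParser

# Attributes assumed to carry URLs; not an exhaustive list.
URL_ATTRS = {"href", "src", "action", "background"}


class Audit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.found["script tags"] += 1
        for name, value in attrs:
            value = (value or "").strip().lower()
            if name.startswith("on"):
                # onclick, onload, onmouseover, ...
                self.found["event handlers"] += 1
            elif name in URL_ATTRS:
                self.found["url attributes"] += 1
                if value.startswith("javascript:"):
                    self.found["javascript: URLs"] += 1


def audit(html_text):
    """Count the script-bearing and URL-bearing spots in a page."""
    parser = Audit()
    parser.feed(html_text)
    parser.close()
    return dict(parser.found)
```

Running `audit` over a page after it has been rewritten is a cheap check: any script tags, event handlers, or `javascript:` URLs still reported are cases the rewriter missed.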

Other potentially useful tools for Ruby are scrAPI and WWW::Mechanize, which might operate at a higher level for robust screen scraping.

There are probably some similar things for PHP, but I've not been keeping up with what libraries are popular or good there.
posted by Freaky at 4:47 AM on September 13, 2007


There is a CGI proxy written in Perl, available from John Marshall here. It can filter JavaScript and restrict access to certain sites. It has fulfilled all of my needs.
posted by DJWeezy at 5:05 AM on September 13, 2007


Where will the proxy be running, just out of curiosity? On a web hosting account/domain name registered under your name? I'm just wondering what level of privacy you can hope to achieve in this way. The point of third party privacy service providers is just that, that they're third parties. You're going to be your own privacy service provider?
posted by AmbroseChapel at 5:23 AM on September 13, 2007


Here's a PHP port of CGIProxy. I've used it. It is good. No longer in active development though.
posted by zackola at 6:20 AM on September 13, 2007


I can't read. Sorry about that.
posted by zackola at 6:20 AM on September 13, 2007


Response by poster: roue, zackola, DJWeezy: Unfortunately, both CGIProxy and its PHP port (Poxy) state that their JavaScript stripping is imperfect. This doesn't work for me :( Thanks anyway, though!

Freaky: Interesting. I've never used these libraries before - the HTML that's being generated isn't really all that semantic... there are no IDs or classes to watch out for. Is that a problem if the layout suddenly changes?
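
With no IDs or classes available, one hedge against layout changes is to key on the visible label text instead of on position. The Python sketch below assumes, purely for illustration, that the useful data sits in table rows of the form "label cell, value cell"; `RowCollector`, `extract_fields`, and `render` are hypothetical names. It fails loudly when an expected label disappears, and it writes the result out HTML-escaped so that any markup or script text in a value is inert.

```python
from html import escape
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self.row = None   # cells of the row being read, or None
        self.cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row, self.cell = [], None
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            # Collapse whitespace so reformatting does not change the text.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row, self.cell = None, None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def extract_fields(html_text, wanted):
    """Return {label: value} for the lowercase labels in wanted."""
    parser = RowCollector()
    parser.feed(html_text)
    parser.close()
    fields = {}
    for row in parser.rows:
        if len(row) >= 2:
            label = row[0].rstrip(":").lower()
            if label in wanted:
                fields[label] = row[1]
    missing = set(wanted) - set(fields)
    if missing:
        raise ValueError(f"layout changed? missing fields: {sorted(missing)}")
    return fields


def render(fields):
    """Write the fields out in the proxy's own, fully escaped markup."""
    return "\n".join(
        f"<p>{escape(label)}: {escape(value)}</p>"
        for label, value in fields.items()
    )
```

Reordering rows or wrapping cells in extra font tags does not affect this, but renaming a label does, which is why the missing-field check raises an error instead of returning partial data.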

AmbroseChapel: Yeah, I know that the domain has my registration info on it. That's fine. When I say anonymity, I mean that my actions on this site appear to be coming from my webserver instead of my client computer. I may also run this from one of my own servers, but am not quite sure yet.
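
For that definition of anonymity, the outgoing request also has to avoid passing along anything that identifies the client. A minimal Python sketch, assuming the proxy builds its outbound headers from scratch: `outbound_headers`, the `FORWARDABLE` tuple, and the User-Agent string are illustrative choices, not a complete policy.

```python
# Client headers assumed harmless enough to pass through.
FORWARDABLE = ("accept", "accept-language")
FIXED_USER_AGENT = "private-proxy-sketch/0.1"


def outbound_headers(client_headers):
    """Build headers for the remote site without leaking client details."""
    headers = {"User-Agent": FIXED_USER_AGENT}
    for name, value in client_headers.items():
        # Referer, Cookie, X-Forwarded-For, Via and the client's own
        # User-Agent are all dropped because they are not on the list.
        if name.lower() in FORWARDABLE:
            headers[name.title()] = value
    return headers
```

Cookies for the remote site would then come only from the proxy's own cookie store, never from the browser's request to the proxy.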

Thanks!
posted by mebibyte at 5:07 PM on September 13, 2007

