Dynamic content and clean URLs with Perl and Apache?
October 14, 2005 8:51 AM

I want to use a monolithic script to delegate all document requests on my website.

Or something like that. I feel like not knowing how to ask the question has stopped me from finding an answer on my own.

A quick sketch: I have a website-in-the-making, basically rolling my own blog. Let's call it website.com from now on. What I want is for any request for a URL on the site to resolve to a call to my delegate.cgi script, which will be in charge of serving up content dynamically according to that URL.

So http://website.com/ will trigger a call to the script. And http://website.com/20051014_0923 will do so as well. And http://website.com/recordings/ as well. And so on.

Using "clean"/"perennial" URL styling is important. I explicitly want to avoid http://website.com/node?20051014_0923 style URLs.

What the script chooses to display here based on that should be immaterial.

I know Perl well. I know just enough CGI to get things working. The site is running on Apache, which I have only passing familiarity with but which I can configure (or have configured by a friendly server-mate).

How can I accomplish this? (Commentary on why I'm trying to accomplish the wrong thing, or accomplish it the wrong way, is also welcome. But no, dammit, I don't feel like learning PHP at the moment.)
posted by cortex to Computers & Internet (16 answers total) 1 user marked this as a favorite
 
Best answer: I'd look into Apache's mod_rewrite. It allows you to specify in your .htaccess file how certain types of requests should be handled (using regular expressions). For example, my site handles everything from an index.cgi file that takes the appropriate action based on the passed module name. Rather than use 'ugly' URLs with the module name as a parameter, I have mod_rewrite rules to allow for cleaner-looking URLs.

For example:
RewriteRule   ^images/(.+)/(.+)$  index.cgi?mod=Images;act=Image;sect=$1;img=$2
RewriteRule   ^images/(.+)$   index.cgi?mod=Images;act=Section;sect=$1
would let you go to the /images/CoolPictures URL and see the index listing for that section, or to /images/CoolPictures/ReallyCoolPic.jpg to see the page for that image (which, in my case, is a scaled version of the image embedded in an XHTML page).
posted by Godbert at 9:01 AM on October 14, 2005 [1 favorite]


And to expand on what I said (because I didn't quite answer your question):

You can have multiple rewrite rules that end up going to the same file on the server. I have the "images" rule that goes to index.cgi (with the params) but I have rules for "page", "section", etc. that also go to index.cgi with a different set of parameters.

Unless you specifically add "[R]" to the rule, these are all "silent" redirects, meaning the user's browser will still display the clean URL. Adding "[R]" (for 'redirect', if I'm not mistaken) will redirect the browser to the modified URL, and they'll see the 'ugly' URL.
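For example, the same rule from above written as an external redirect (the [R] flag) would be:

RewriteRule   ^images/(.+)$   index.cgi?mod=Images;act=Section;sect=$1   [R]

and the browser's address bar would then show the index.cgi URL instead of the clean one.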

If my explanation doesn't quite make sense, my site is in my profile. It might help to click around some of the links to see what the URL bar shows, and then consider they all end up running from the same script underneath.
posted by Godbert at 9:07 AM on October 14, 2005 [1 favorite]


You can also do this with mod_perl. Specifically, you'd want a response handler.
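A bare-bones sketch of what that might look like under mod_perl 2 (the module name and paths here are just placeholders):

# httpd.conf -- hand every request to the handler module
<Location />
    SetHandler perl-script
    PerlResponseHandler My::Delegate
</Location>

# My/Delegate.pm
package My::Delegate;
use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Const -compile => qw(OK);

sub handler {
    my $r    = shift;              # the Apache request object
    my $path = $r->uri;            # e.g. "/20051014_0923" or "/recordings/"
    $r->content_type('text/html');
    $r->print("<html><body>You asked for $path</body></html>");
    return Apache2::Const::OK;
}

1;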
posted by sbutler at 9:18 AM on October 14, 2005


Install the Perl module HTML::Mason, which lets you define default templates to handle whole directory hierarchies and embed Perl code that can parse the path info. Or try Rails, which isn't Perl (it's built in Ruby, which is much more Perl-like than PHP) but uses /controller/method/id URL-based dispatching.
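To give a flavor of the Mason approach: a dhandler is a component that catches requests for any path with no component of its own, and $m->dhandler_arg hands you the leftover part of the URL. A minimal sketch of a dhandler sitting in the document root:

%# dhandler -- catches every URL that doesn't match a real component
% my $arg = $m->dhandler_arg;   # e.g. "20051014_0923" or "recordings/"
<html><body>
You asked for <% $arg %>
</body></html>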
posted by nicwolff at 9:19 AM on October 14, 2005


Set up your CGI as the 404 error handler. No need for mod_rewrite or any of that jazz.
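For reference, that's a single directive in httpd.conf or .htaccess (assuming the script is reachable at /delegate.cgi); Apache then invokes it with the originally requested URL in the REDIRECT_URL environment variable:

ErrorDocument 404 /delegate.cgi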
posted by kindall at 9:32 AM on October 14, 2005


Response by poster: kindall: any negative ramifications from search-engine spiders or other sorts of clients thinking that they have, in fact, not found something?
posted by cortex at 9:37 AM on October 14, 2005


Response by poster: (I ask because it's otherwise a very clever idea. That and Godbert's seem most in line with what I was conceiving.)
posted by cortex at 9:46 AM on October 14, 2005


Besides using mod_rewrite to rewrite to regular GET parameters, you can use the path_info method of your CGI object to get anything added after the real path to your CGI program. That is, if someone requests:

http://website.com/delegate.cgi/something

calling path_info would return "/something". You might use the path info as the ID number or title of a post, or parse it into additional data however you like.

You'll probably want to rewrite to hide the delegate.cgi from the URL anyway, though, so you might as well use GET format. It's just a handy thing to be aware of.
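A tiny sketch of that with CGI.pm (delegate.cgi being your hypothetical script):

#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $q    = CGI->new;
my $path = $q->path_info;          # "/something" for /delegate.cgi/something
print $q->header('text/html');
print "You asked for: $path\n";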

If you're looking to invest more time, you can also try Catalyst (CPAN, 6-month-old perl.com article), a Perl MVC web app framework similar to Rails. The documentation isn't so hot though, and it sounds like it would definitely be a learning project for you.

Catalyst can map URLs to your controller code by module path, global method name, or regex. Using regex, you get the additional URL fields passed as arguments right into your function, so you can have /controller/method/id URLs if you like.
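A rough idea of the Catalyst style, using the plain Local dispatch rather than a regex (controller and action names are made up):

package MyApp::Controller::Posts;
use strict;
use warnings;
use base 'Catalyst::Controller';

# handles /posts/view/20051014_0923 -- trailing URL parts arrive as arguments
sub view : Local {
    my ($self, $c, $id) = @_;
    $c->response->body("Showing post $id");
}

1;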
posted by markpasc at 9:47 AM on October 14, 2005


I've done exactly this thing for my web site, for the same reason: I want clean URLs and wanted to have a script serve up everything. All I had to do was put a very simple .htaccess file in my document root directory:

.htaccess:
RewriteEngine On
RewriteBase /
RewriteRule !^files/ source/main.php


The PHP script grabs the requested URL from the PATH_INFO. I just put in that !^files/ part so I could dump plain, boring files from the /files/ path without going through the script.
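Since cortex would rather not write PHP: the same trick ports straight to a Perl CGI. With an internal rewrite like that, the script can pull the originally requested URL out of the REQUEST_URI environment variable. A sketch (rule and script names made up):

# .htaccess
RewriteEngine On
RewriteBase /
RewriteRule !^files/ delegate.cgi

# delegate.cgi
#!/usr/bin/perl
use strict;
use warnings;

my $url = $ENV{REQUEST_URI} || '/';   # e.g. "/20051014_0923"
print "Content-Type: text/html\r\n\r\n";
print "<html><body>Content for $url goes here</body></html>\n";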
posted by Khalad at 10:18 AM on October 14, 2005


Best answer: The advantages of using the 404 script are manifold. First, you have full programmatic control of your URLs. Second, you can keep some directories and files "real" on your server -- for example, you could have an "images" directory that's served normally.

You will obviously have to have some logic in the 404 script to detect files that really aren't there and return an appropriate 404 page.

As for search spiders getting confused, they won't, because you won't actually return a 404 status code when you return content (200 is the one you want to use).

It's a perfectly reasonable solution. I know of at least one large site that used it -- they were a Web host that let users set up Web sites automatically. The pages were stored in a SQL database and a 404 script was used to look them up in the db because it was easier and faster to serve them from there than to actually create the pages in the Web server directory.

I use a similar trick on my own Web site, although all it does is redirect random subdomains of my main domain to www.jerrykindall.com (I have wildcard DNS but want the "www" to be canonical).
posted by kindall at 10:20 AM on October 14, 2005


kindall : As for search spiders getting confused, they won't, because you won't actually return a 404 status code when you return content (200 is the one you want to use).

Actually, in that case, the server would be returning a 404 status code; it just also sends content along with it, ostensibly for a custom-designed page to tell you the URL doesn't correspond to anything. (I just tested this, and it does indeed return a 404 status code; a user in a browser would never know, since the page still shows up, but search spiders would read it as a 404.)
posted by Godbert at 10:37 AM on October 14, 2005


Response by poster: Khalad:

Is there any reason to put that in a .htaccess in the document root instead of putting it in httpd.conf itself?
posted by cortex at 10:46 AM on October 14, 2005


(I just tested this, and it does indeed return a 404 status code; a user in a browser would never know, since the page still shows up, but search spiders would read it as a 404.)

You can set the proper status code from the script, and should. I mean it's not going to magically return the right status, obviously.
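In plain CGI terms that just means printing a Status header before anything else. A fragment of what the 404 script might do, assuming $content holds whatever your lookup produced (or undef if the URL really is bogus):

if (defined $content) {
    print "Status: 200 OK\r\n";          # found it: spiders see a normal page
} else {
    print "Status: 404 Not Found\r\n";   # genuinely missing: send a real 404
}
print "Content-Type: text/html\r\n\r\n";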

Another advantage over the Rewrite method is that this may work on servers other than Apache.
posted by kindall at 11:20 AM on October 14, 2005


HTML::Mason seconded. It can do everything you want. I use it for small projects all the time, and have used it for large projects. Salon.com uses it. HTML::Mason uses mod_perl, which embeds a Perl interpreter inside Apache, so the extra cost of a monolithic script is relatively small. :-)

I love mod_rewrite, but it is not a development environment. It's a last resort.
posted by ldenneau at 11:37 AM on October 14, 2005


Response by poster: mod_rewrite is now working as advertised. Thanks to Godbert, and thanks to all of you.
posted by cortex at 6:11 PM on October 14, 2005


The 404-as-CGI trick is especially useful for caching. The CGI gets called when an object doesn't exist; it can then determine that the object needs to be created and write it to the file named in the request. So the next time that URL is requested, the web server can serve that static file, which is very efficient. To keep things fresh you just periodically (or as needed) delete the generated files on disk, and the 404 handler takes care of recreating them on demand.

FAQ-o-matic is one web application that uses this method.

If done right (i.e. the CGI knows when to send a 404 and when to send a 200 and when to send a 304) it is undetectable to the user/search engine spider.
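The core of the trick, sketched in Perl -- the docroot path and build_page() here are placeholders, not anyone's real code:

#!/usr/bin/perl
# 404 handler that builds the page, writes it into the docroot so the next
# hit is served statically, and returns it for the current request.
use strict;
use warnings;

my $docroot = '/var/www/html';               # assumed document root
my $url     = $ENV{REDIRECT_URL} || '/';     # URL that triggered the 404

my $html = build_page($url);                 # hypothetical page generator

if (defined $html) {
    if (open my $fh, '>', "$docroot$url") {  # cache it for next time
        print {$fh} $html;
        close $fh;
    }
    print "Status: 200 OK\r\nContent-Type: text/html\r\n\r\n";
    print $html;
}
else {
    print "Status: 404 Not Found\r\nContent-Type: text/html\r\n\r\n";
    print "<html><body>Not found.</body></html>\n";
}

sub build_page {
    my ($url) = @_;
    return undef;   # stand-in: return page HTML, or undef for bogus URLs
}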
posted by Rhomboid at 11:07 PM on October 14, 2005

