Make me a cache money millionaire!
May 21, 2012 2:49 PM

Help me set up a simple 100% caching forward proxy for my home. I dicked around with squid and then apache+mod_proxy/mod_cache yesterday afternoon, and while they proxy beautifully, they don't seem to have much cache hit even on static content- almost or completely 0% cache hit.

First, I am at work so I'm going to be a touch fuzzy on the details, and can't implement anything till I'm home.

Goal:
I'm looking to set up a forward caching proxy for static content on my home network, mostly for browsing while on my home machines (not counting mobile devices like ipad, iphone, android) and largely from FF on Mac, which is my principal browsing option. That Mac is running FF with FoxyProxy Standard installed, and I set up rules such that "mostly" static content like jpg, gif, png, css, js and known static pages from whitelisted URLs/sites (even those with queries that I can trust to be minimally volatile for my purposes) are sent to the proxy to hopefully be cached for a few days, and thus avoid the RTT/lag of loading from the original site on repeat visits or browsing around, especially given how flaky my Comcast is. This is especially useful for sites like imgur, which I visit more often than is healthy and has all of those thumbnail images on the home page. I also have a couple of GM scripts that do preloading for sites like Craigslist etc, which on page reload would really cut down on traffic generated outside of my router.

Anything not matching these whitelist rules of *.jpg, etc will bypass the proxy altogether and load as normal. And yes, I am well aware of the risks, but honestly I trust my instincts and web knowledge, and ability to one-click disable foxyproxy if I suspect erratic behavior. And no, the browser's default behavior is not caching nearly enough for my tastes.

Setup:
Base machine is a 2008 Mac Pro running OSX 10.6 (Snow Leopard, I believe- definitely not Lion). I have VMWare Fusion 4 running a couple of Windows VMs and an Ubuntu 11.10 VM. I was setting up the proxies in my Win2k3 VM, simply because it was there and acts as little more than a VPN client for TS'ing to work, and tends to be running in the background as often as the Mac is powered on- which is to say, 24/7.

The ideal here, for my short term purposes, is a caching proxy on the Win2k3 VM or the Mac (the Linux is for experimenting, and is less stable/consistently there) where I can filter in the browser to effectively have a local disk cache that supplants the browser's in a way I can explicitly view and control.

Failures:
FoxyProxy is working fine when enabled, as I see the traffic going to the proxy only for those whitelists, and pages continue to work fine.

With both Squid and Apache mod_proxy/mod_cache/mod_disk_cache, they seem to work great as proxies, and even seem to create cache files... yet even for urls that aren't parameterized such as http://site.com/static/images/1234abcd.jpg, they both show evidence of cache miss despite repeat visits. Even just clicking forward/back shows the browser requests the content anew, the proxy logs show a cache miss (TCP_MISS in Squid, the SetEnv/CustomLog trick in Apache, and Netmon 3.x to confirm the outbound re-request by the proxy for content it ostensibly cached). Some content does get written to the cache folder, but doesn't appear to be used- the cache miss ratio is almost 100% in Apache, and exactly 100% in Squid.

Squid was a snap to install and set up, but looking at its logs while it proxied, it was doing a TCP_MISS 100% of the time- despite the cache folder being populated with *some* content. I tried adjusting the refresh_pattern and cache rules, and again this would result in content being written to disk... and then showing TCP_MISS in the logs 100% of the time on page reload (by reload I mean both F5, and simply revisiting the same URL in a new tab).

Because Apache is about as universal as it gets, I tried that after Squid failed, and it exhibited the same behavior: proxies for images fine, writes files to disk, so the browsing is seamless... but doesn't appear to actually use the cache on followup visits. I tried enabling just about every cache element in mod_cache including the multiple items to ignore certain headers and those that violate the HTTP standard and would normally be a bad idea if I wasn't whitelisting via FoxyProxy... but no dice: it still won't cache.

Outcome:
Basically, I want a 100% caching forward proxy that I can whitelist some types of traffic to (via FoxyProxy) and have them served from disk cache for N minutes/days (configurable) before expiring. Ostensibly, Apache should work fine for this, but while it's caching some files to disk, it doesn't then use them. I'd prefer to run the caching proxy on the Win2k3, but since I have Mac and Linux as options those would work as well- although the Linux is the most volatile as an OS, what with it being a VM and upgraded/rebuilt relatively often.
posted by hincandenza to Computers & Internet (5 answers total) 2 users marked this as a favorite
 
Have you checked to see what kinds of cache-related directives sites may be sending you in their HTTP headers? For instance, using the Live HTTP Headers Firefox plugin, I can see that Metafilter is sending a 'Cache-Control: private' directive. This designates the response as user-specific and instructs any caches not to place it into a multiuser shared cache. I imagine a lot of sites are probably doing that, and it could be that Apache and Squid are by default not caching it.
posted by RonButNotStupid at 3:22 PM on May 21, 2012
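
The same check can also be done on the proxy side instead of in the browser. A minimal sketch, assuming Apache with mod_log_config loaded; the format nickname cachehdrs and the log file name are hypothetical:

```apache
# Log the request line plus the cache-related response headers
# for each request that passes through the proxy.
# %{Header}o prints the named header from the response sent to the client.
LogFormat "%h \"%r\" %>s cc=\"%{Cache-Control}o\" exp=\"%{Expires}o\" vary=\"%{Vary}o\"" cachehdrs
CustomLog logs/cache-headers.log cachehdrs
```

A line showing cc="private" or cc="no-store" for an image that never produces a cache hit would point at the origin's headers rather than at the proxy setup.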


Seconding what Ron said. There are many sites that do not set cache expiration on static content, so the static content is effectively served with a "don't cache me" instruction. The sad thing is that it would be in their interest as well as yours to set their web server caching correctly so that their bandwidth costs go down and users experience fast-loading web pages. Big popular sites like Facebook or Twitter have caching strategies. Let me check Metafilter...

Here is what YSlow ( http://yslow.org/ ) says about Metafilter on the subject of caching:

There are 6 static components without a far-future expiration date.
(no expires) http://static.chartbeat.com/js/chartbeat.js
(no expires) http://connect.decknetwork.net/i/atmail_envelope.png
(2012/5/22) http://www.google-analytics.com/ga.js
(no expires) http://d217i264rvtnq0.cloudfront.net/styles/mefi/favicon.ico
(2012/4/14) http://www.metafilter.com/scripts/favorite_front031611-min.js
(no expires) http://connect.decknetwork.net/deckMF_js.php?...

I don't necessarily agree with YSlow's above criticism of Metafilter. It is just an example. Are you hoping to override the caching instructions given by the server? It seems like that would give you a long series of small, irritating problems to deal with.
posted by ErikH2000 at 4:14 PM on May 21, 2012


Response by poster: Ron: Right, but if you look at the mod_cache documentation, there are a number of directives, including CacheStorePrivate, which I specifically set to "On" (default is off), so in theory it should be caching even with the Cache-Control: private header.

The set of options on that page I basically enabled across the board as appropriate wherever they dictated when to forcibly override normal cache behavior. About the only one I didn't set was the CacheIgnoreHeaders because I wasn't sure which headers to specify. I'll look at the HTTP headers and see if any others are coming through, and if they can be explicitly overridden with CacheIgnoreHeaders.

ErikH2000: I also set the CacheIgnoreNoLastMod and CacheDefaultExpire to essentially handle the no expiration date issue. Although now we're getting to a place I'd have to double-check the conf, and since I don't leave my ssh open on my router unless I'm going on vacation, I can't pop back into my machine from work to double check right now.

Also, as I said in my initial writeup, I am aware of where these problems could crop up, and it's why I'd use such a proxy as a whitelist proxy for file types or specific sites, and simply click to disable FoxyProxy if anything seemed "irritating" or broken. As you say, a lot of sites don't make good use of caching, or have a number of small page elements that are more costly simply because of the new download than the actual bytes/sec time (and many sites do not have pipelining enabled, etc). I'm not oblivious to issues of scale and caching on web servers... which is why I trust myself to have a cache I control in front of my browsing experience, where I can whitelist sites or content types as desired. FoxyProxy supports RegEx whitelisting as well as simple wildcards.
posted by hincandenza at 4:22 PM on May 21, 2012
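
On CacheIgnoreHeaders: as far as the mod_cache documentation describes it, this directive lists headers that are left out of the stored copy of a response; it does not change whether the response gets cached in the first place. A minimal sketch, assuming the only goal is to keep cookies out of the disk cache:

```apache
# Do not write Set-Cookie into the cached copy of a response.
# This does not override how Cache-Control or Expires are handled.
CacheIgnoreHeaders Set-Cookie
```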


Response by poster: A thought occurred to me just now, that I have all my cache values in an httpd-cache.conf in /conf/extras, and an include line in the main httpd.conf... but is it possible the httpd.conf is overriding those values later in its own doc- I assume the include is in-place? I hadn't even checked, so for all I know the resultant settings are not what I think from /extras/httpd-cache.conf.

Is there an easy way to see what the run time config is on startup, perhaps through a verbose logging setting?
posted by hincandenza at 4:26 PM on May 21, 2012
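
On the run-time config question: one option, assuming mod_info is included in this Apache 2.2 build (an assumption, nothing in the thread confirms it), is to expose the server-info handler to the local machine only, which reports the configuration as the server actually parsed it. Raising LogLevel to debug should also make mod_cache write its caching decisions to the error log. A minimal sketch; the module path and the 127.0.0.1 restriction are assumptions:

```apache
# Assumption: mod_info ships with this build; adjust the module path if needed.
LoadModule info_module modules/mod_info.so

# Verbose error log; at debug level mod_cache reports
# why a response was or was not cached.
LogLevel debug

<Location /server-info>
    SetHandler server-info
    # Apache 2.2 access syntax: only the local machine may view this page.
    Order Deny,Allow
    Deny from all
    Allow from 127.0.0.1
</Location>
```

With this in place, requesting /server-info?config from the machine running Apache should list the directives as loaded, which would show whether anything in httpd.conf overrides the values from the included file.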


Response by poster: Just FYI, this is my Apache httpd-cache.conf:
# http://httpd.apache.org/docs/2.2/mod/mod_proxy.html
<IfModule mod_proxy.c>
ProxyRequests On
<Proxy *>
Order Deny,Allow
Deny from all
Allow from all
</Proxy>
ProxyVia On
</IfModule>
<IfModule mod_cache.c>
<IfModule mod_disk_cache.c>
CacheRoot E:/PROXYCACHE
CacheEnable disk /
CacheDirLevels 3
CacheDirLength 2
CacheIgnoreCacheControl On
CacheIgnoreNoLastMod On
CacheIgnoreQueryString On
CacheStoreNoStore On
CacheStorePrivate On
CacheMaxFileSize 100000000
CacheDefaultExpire 259200
CacheMaxExpire 432000
</IfModule>
ProxyTimeout 60
#NoProxy 192.168.*.*/255.255.*.*
# When acting as a proxy, don't cache the list of security update
CacheDisable http://security.update.server/update-list/
</IfModule>
# End of proxy directives

posted by hincandenza at 8:58 PM on May 21, 2012

