Is There a "Diff" for HTML That'd Save the Spent Resources for a Full Page Reload?
February 8, 2005 1:43 AM

On community blog sites like Metafilter, a lot of bandwidth seems to be consumed by redundant requests, like previewing comments or checking for new ones, where the entire page is reloaded. So when mosch mentioned the HTTPRequest javascript object in the thread on Google Maps, that got me thinking. Are there any ways to write code that can cut down on resending the same data? Some kind of 'diff' method for HTML?
posted by daksya to Computers & Internet (19 answers total)
this is a really neat idea - send instructions to modify the existing document model rather than a whole new document. however, i've not heard of anything and my google-fu is failing.

i suspect it's going to be inefficient in most cases, since html pages are not bandwidth hogs anyway (at least not the structure part - embedded images and other data are expensive, but would be needed anyway). but it would be nice to play with. bet someone writes a demo soon...

("dhtml" is the general umbrella term for this kind of thing (obviously), but generally without the server being involved)
posted by andrew cooke at 4:54 AM on February 8, 2005

You could do this in a page-specific way, coding the 'diff' functionality into individual pages (with DHTML and probably the HTTPRequest object you mentioned.) The page would 'know' how to refresh itself.

But if you mean something more general, with code that runs on the server and works for all pages, your hands are tied by the HTTP spec. Browsers are designed to request either header information or full pages. There's no HTTP command for "tell me just the parts that have changed." (You could write a custom browser and server that support this interface, but that would be outside the HTTP spec and would only work for people who installed your software.)
posted by deshead at 5:18 AM on February 8, 2005

RFC 3229 describes "delta encoding in HTTP," which is basically a diff over HTTP. I think that servers and browsers implementing this would be the most elegant way to control the updates you're talking about.
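
A rough sketch of what the server side of that exchange could look like, in Python. The header names (A-IM, IM, Delta-Base) and the 226 "IM Used" status come from RFC 3229; everything else is an assumption for illustration: the stored versions, the respond function, and especially the "x-unified-diff" token, which is a made-up delta format rather than one of the formats the RFC registers (such as vcdiff).

```python
import difflib

# Stored snapshots of one page, keyed by entity tag (hypothetical data).
VERSIONS = {
    '"v1"': "<p>comment 1</p>\n<p>comment 2</p>\n",
    '"v2"': "<p>comment 1</p>\n<p>comment 2</p>\n<p>comment 3</p>\n",
}
CURRENT_ETAG = '"v2"'
# Made-up token for this sketch; not a format registered by RFC 3229.
DELTA_FORMAT = "x-unified-diff"


def respond(request_headers):
    """Return (status, headers, body) for a GET on the page."""
    current = VERSIONS[CURRENT_ETAG]
    base_etag = request_headers.get("If-None-Match")
    accepted = [t.strip() for t in request_headers.get("A-IM", "").split(",")]

    # The client already holds the current version: nothing to send.
    if base_etag == CURRENT_ETAG:
        return 304, {"ETag": CURRENT_ETAG}, ""

    # The client holds a version we still have and accepts our delta format:
    # send only the difference, and say which version it applies to.
    if base_etag in VERSIONS and DELTA_FORMAT in accepted:
        delta = "".join(difflib.unified_diff(
            VERSIONS[base_etag].splitlines(keepends=True),
            current.splitlines(keepends=True),
            n=0,
        ))
        headers = {"ETag": CURRENT_ETAG, "IM": DELTA_FORMAT, "Delta-Base": base_etag}
        return 226, headers, delta

    # Otherwise fall back to the full page.
    return 200, {"ETag": CURRENT_ETAG}, current


status, headers, body = respond({"If-None-Match": '"v1"', "A-IM": DELTA_FORMAT})
```

A client that sends no A-IM header, or an entity tag the server no longer keeps, simply gets the ordinary full response, so old browsers would keep working.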
posted by revgeorge at 6:02 AM on February 8, 2005

Response by poster: since html pages are not bandwidth hogs anyway

In general, yes. But this should be helpful in bloglike browsing. That Google Maps thread is right now 86 comments and "weighs" 44 KB. After filling in the comment box, I press Preview, and that's 44 KB reloaded. Then another 44 KB as I press Post. Similarly, I visit the thread right now. That's 44 KB. An hour later, there are 8 new comments, and the original 44 KB + 3 KB is downloaded. From a hosting bandwidth quota point of view, I'm assuming a 'diff' system would cut down bandwidth quite a bit.
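
The arithmetic above, worked through in Python. The 44 KB, 3 KB and 86-comment figures are the ones quoted; the size of a single preview or post response (page size divided by comment count) is an assumption, and HTTP header overhead is ignored.

```python
PAGE_KB = 44.0          # thread size quoted above
NEW_KB = 3.0            # growth after an hour, quoted above
COMMENTS_ON_PAGE = 86
# Assumption: a preview or post response is about one average comment.
COMMENT_KB = PAGE_KB / COMMENTS_ON_PAGE


def session_kb(use_diff):
    """Total KB for: preview, post, one fresh visit, one revisit an hour later."""
    preview = COMMENT_KB if use_diff else PAGE_KB
    post = COMMENT_KB if use_diff else PAGE_KB
    first_visit = PAGE_KB  # a first load is always the full page
    revisit = NEW_KB if use_diff else PAGE_KB + NEW_KB
    return preview + post + first_visit + revisit


print(f"full reloads: {session_kb(False):.0f} KB")
print(f"with a diff:  {session_kb(True):.0f} KB")
```

Under those assumptions the four requests cost 179 KB with full reloads and roughly 48 KB with a diff, most of which is the one unavoidable full load.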

Can anyone well-versed with HTMLfu tell me if Metafilter-like thread pages could incorporate some such thing?
posted by daksya at 6:05 AM on February 8, 2005

it's possible. javascript has access to the structure of the document and can alter it. so you would associate javascript action with clicking on "submit" which would send the form and receive the preview data; the code would insert the preview data into the page and you would see it, without fetching the rest of the page. you could extend it to include extra posts too.
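
A minimal sketch of the server half of that, in Python: the preview request is answered with just the rendered comment, and the javascript on the page (not shown) would drop the fragment into the document. The function names, form field names and markup are all hypothetical.

```python
from html import escape


def render_comment(author, text):
    """Render one comment as an HTML fragment, escaping user input."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    byline = f'<span class="byline">posted by {escape(author)}</span>'
    return f'<div class="comment">{body}{byline}</div>'


def handle_preview(form):
    """Answer a preview request with the fragment only, not the whole thread."""
    return render_comment(form["author"], form["comment"])


fragment = handle_preview({"author": "daksya", "comment": "a <b> test\n\nsecond paragraph"})
```

The response is a few hundred bytes however long the thread is, which is the whole point.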

there's currently no standard and/or libraries that i, at least, know of for doing this, so you'd need to hand roll your own. but yes, in principle, it is possible. sorry if that wasn't clear from my first post.

(i've since read the google map thread and this is what they are doing. it doesn't make any difference, in principle, whether it's a preview post or a search result or a map image)
posted by andrew cooke at 6:16 AM on February 8, 2005

Well, short answer, yes it would be possible to code such a thing for Metafilter in much the same way as you are suggesting, and it is fairly straightforward. This is pretty much what GMail does; if you leave your inbox open, new mails will magically appear without you having refreshed the page, and only the data to render the new mail will have been downloaded.

The flip side though is that if you have this as an automatic process, you run the risk of creating more bandwidth and/or server strain than you would do in a static environment, what with all the requests and database hits looking for potential updates that may not exist and may not be required by the end user anyway.
Google have a lot of servers that can handle this load and turn around the requests at lightning speed, but the rest of us mortals don't have this luxury, and so you have to weigh up whether it is actually worthwhile in the long run.
posted by chill at 6:22 AM on February 8, 2005

Response by poster: The flip side though is that if you have this as an automatic process, you run the risk of creating more bandwidth and/or server strain than you would do in a static environment, what with all the requests and database hits looking for potential updates that may not exist and may not be required by the end user anyway.

If one limits such updates to user-initiated actions, such as pressing Preview or Post or visiting a page after some time, I suppose there are no downsides.

Preview and Post are code within a page. A link is just a link. I assume when I click on a link, the browser compares some type of header info from the server with the page in cache, like a timestamp, or alternatively sends some info from the cached page, relying on the server to send either the updated page or a "no need" msg. How would this work for revisiting a page? Wouldn't it download the whole thing? I suppose one way out is to leave whatever information is compared as static and simply have an 'Update' button on the (cached) page?
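
The timestamp comparison described here corresponds to HTTP's conditional GET: the browser sends If-Modified-Since, and the server answers either 304 Not Modified (the "no need" msg) or 200 with the whole page. A sketch of the server's decision in Python, where the function name and the way the page and its timestamp are passed in are assumptions:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(request_headers, page_html, last_modified):
    """Return (status, headers, body); last_modified must be an aware UTC datetime."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            cached_at = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            cached_at = None  # unparseable date: treat as if no header was sent
        if cached_at is not None and cached_at.tzinfo is not None:
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= cached_at:
                return 304, headers, ""
    return 200, headers, page_html


modified = datetime(2005, 2, 8, 6, 5, tzinfo=timezone.utc)
status, headers, body = conditional_get({}, "<html>...</html>", modified)
```

Note the all-or-nothing outcome: once a single comment has been added, the timestamp changes and the full page comes down again, which is why this alone does not give the 'diff' behaviour.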
posted by daksya at 6:46 AM on February 8, 2005

In late 2003 this was discussed for MeFi - I even still have the mock page I threw together here (write comment, preview, repeat) - the main issue missing from mine is the "on preview" phenomenon, where posts get added while you're writing. Still quite do-able. I think this landed on the stack of Matt's to-possibly-do MeFi pile.
posted by kokogiak at 6:54 AM on February 8, 2005

Apparently the magic underlying this and many similar tricks, like Google Suggest and (which predates Google Maps) is XMLHttpRequest. I don't know much more about it than that--I read about it here and there. But that should help get you started.
posted by adamrice at 7:00 AM on February 8, 2005

Check out Kuro5hin, it has done something similar to this for years. Read a story and choose "dynamic threaded" for the comments style. Then it will show all the top level messages, and you can click to expand one to follow a thread.

Roller Blog has a similar hidden comments feature. Check out this blog for an example.
posted by gus at 7:03 AM on February 8, 2005

The "cheap" way to do this is to add an iframe to the preview page, that contains a link to the full previous page. In theory the browser should pull the full page out of the cache, saving any hits.

It works well enough.
posted by smackfu at 7:07 AM on February 8, 2005

There's a WordPress plug-in (I use it on my blog) that provides live preview, saving that unnecessary reload that comes of hitting that preview button. I intend to implement something similar on all of my sites.
posted by waldo at 8:02 AM on February 8, 2005

Response by poster: Never rely on browser cache.

Come to think of it, yeah. What if comments are deleted, but I suppose even that can be managed.

Is anyone out there developing a modular object-oriented web document format?
posted by daksya at 8:03 AM on February 8, 2005

Although I am not a programming guru, the first thing that came to mind was memcached. It's what LiveJournal and other sites with many users and insane request-rates use to speed up response times while reducing bandwidth usage.

From the about page:
memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
If that sounds like what you're looking for, check it out.
posted by ElfWord at 9:16 AM on February 8, 2005

Memcached is server-side, though. LiveJournal still resends the whole page when you post a new comment. Memcached just lets them resend the page without another database call.
posted by nebulawindphone at 10:44 AM on February 8, 2005

ElfWord, memcached isn't what we're looking for in this situation. What memcached does is cache database queries so that the webserver can access the results of those queries without having to do another (more expensive) query. But it has nothing to do with getting that data to the client's webbrowser.

Or, said another way, memcached is designed to save CPU cycles, and has no effect on bandwidth, or the data transferred between server and client.

The problem we have is that we want the client to say "I've read up to comment n. Send me everything after that, but not the stuff you've already sent." For that, we need something on the client, such as some kind of javascript diff, or even better, clients and servers that support RFC 3229.
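
A minimal sketch of the server's side of that conversation, in Python. The Comment type and the comments_after function are hypothetical, and it assumes comment ids only ever increase; deleted comments are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    id: int
    author: str
    text: str


def comments_after(thread, last_seen_id):
    """Return the comments with an id greater than last_seen_id, oldest first."""
    return [c for c in sorted(thread, key=lambda c: c.id) if c.id > last_seen_id]


thread = [
    Comment(1, "andrew cooke", "this is a really neat idea"),
    Comment(2, "deshead", "You could do this in a page-specific way"),
    Comment(3, "revgeorge", "RFC 3229 describes delta encoding in HTTP"),
]
# The client has read up to comment 1, so only comments 2 and 3 go over the wire.
new_comments = comments_after(thread, last_seen_id=1)
```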

OP: Or what nebulawindphone said.
posted by cactus at 11:06 AM on February 8, 2005

re: preview, post, preview, post cycle....

Comment Live Preview would do this for you.
See it in action on a newer page.
posted by seanyboy at 4:34 PM on February 8, 2005

I had the same idea back when I played with JSRS which does client-server scripting for *all* browser engines. I didn't implement it, mostly because I figured that the scarce resource in a CMS/blog-type system is the underlying database spitting out the forum pages, not piping the HTML over the network: a live-update feature would mean that you would have to ask the DB to send back the new comments for a thread since the last post you have in the current "view" of that thread. Well, that translates to a specific SQL command in the back-end. Now, look at this new situation from the DB server's point of view: without dynamic updates the DB server sees a lot of SQL requests that are *exactly* identical: "send me all comments for this thread that you currently have". With dynamic updates the DB is seeing a bunch of *different* queries that must be examined separately. The first situation is much, much easier to optimize (i.e. keep in memory-cache) for the DB server. But I could be wrong, this needs to be smoke-tested to be answered conclusively.
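
The caching argument can be illustrated with a toy model in Python: a cache keyed by the exact query, ten readers of one thread, and a count of how many requests actually reach the database. The query tuples, thread id and last-seen ids are made up, and cache invalidation when a new comment arrives is ignored.

```python
def count_db_queries(requests):
    """Count the requests that miss a simple cache keyed by the exact query."""
    cache = set()
    misses = 0
    for query in requests:
        if query not in cache:
            cache.add(query)
            misses += 1
    return misses


# Ten readers of thread 42, each having last seen a different comment (hypothetical).
last_seen = [77, 78, 79, 80, 81, 82, 83, 84, 85, 86]
full_page = [("all_comments", 42) for _ in last_seen]
deltas = [("comments_after", 42, n) for n in last_seen]

print(count_db_queries(full_page))  # identical queries share one cache entry
print(count_db_queries(deltas))     # every distinct last-seen id is its own query
```

In this model the full-page requests cost one database query and the delta requests cost ten. One way around that, not measured here, would be to cache the full comment list once and slice out the new comments in application code.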
posted by costas at 7:45 PM on February 8, 2005

Response by poster: Thanks, everyone.
posted by daksya at 6:14 AM on February 9, 2005
