Client side screen scraping help?
August 26, 2005 6:17 AM

Can I implement a screen scraper app completely on the client side?

That is, can JavaScript or something similar make an HTTP request to a third-party site and then parse the results? I'm not wedded to any particular tech, but I would like to achieve this on the client side with no additional plug-ins.
Responses appreciated; this isn't my area at all.
posted by bystander to Computers & Internet (10 answers total)
 
Best answer: No. To prevent cross-site scripting attacks, browsers don't let JavaScript loaded from one server read a page loaded from any other server, whether in another frame or via XMLHttpRequest. (I'm glossing over a lot of details here.)
posted by nicwolff at 6:42 AM on August 26, 2005
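
A rough way to picture the rule described above is as a comparison of scheme, host, and port between the page running the script and the URL it wants to read. The Python sketch below is a simplified model only, not how any particular browser implements the check; the function names and the example.com / example.org URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Default ports, so that http://host/ and http://host:80/ compare as equal.
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return a (scheme, host, port) tuple for a URL (simplified model)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""          # urlsplit lowercases the hostname
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)

def same_origin(page_url, request_url):
    """True if a script on page_url would be reading from its own origin."""
    return origin(page_url) == origin(request_url)

# Hypothetical URLs for illustration only.
print(same_origin("http://example.com/scraper.htm", "http://example.com/data.htm"))
print(same_origin("http://example.com/scraper.htm", "http://example.org/article.htm"))
```

Under this model the first call is a same-origin request and the second is the cross-site case that the browser refuses.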


Does it have to be in-browser? There are lots of ways to do screen scraping with programs that run on the command line.
posted by RustyBrooks at 7:05 AM on August 26, 2005


Response by poster: Thanks nicwolff, I suspected there might be a security implication. Is there any way to get around this (maybe by getting explicit user input)? How about using a Flash or similar mini-app?
Any ideas would be welcome.
posted by bystander at 7:07 AM on August 26, 2005


Response by poster: Does it have to be in-browser? I think so. I want to distribute the screen scraping across a lot of clients, rather than have it all happen from a single server.
So a separate executable could work, but it would be pretty inelegant, and I suspect the non-IT types I want to target would find it a hassle.
posted by bystander at 7:11 AM on August 26, 2005


You could get around the security by using a proxy. For example, your screenscraper would be here:

http://yourserver.com/screenscraper.htm

It would be allowed to operate on urls of the form:

http://yourserver.com/proxy.cgi?address=url

where url could be any address, like:

http://nytimes.com/article2345.htm

All it would require is the proxy.cgi (or whatever) program, which would fetch any web content and make it appear to come from your site. That would let you get around the JavaScript cross-site scripting blocks.
posted by Turtle at 8:32 AM on August 26, 2005


I mean the screenscraper would work on URLs of the form:

http://yourserver.com/proxy.cgi?address=http://nytimes.com/article2345.htm
posted by Turtle at 8:35 AM on August 26, 2005
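
The proxy idea in the two posts above could be sketched roughly as follows, here as a small standalone Python server rather than a CGI script. This is a minimal sketch under assumptions, not a hardened implementation: the helper names (proxied_url, extract_target, ProxyHandler, serve), the port, and the restriction to http/https are all hypothetical choices, and only the address parameter and the yourserver.com/proxy.cgi URL shape come from the thread.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit, parse_qs, quote
from urllib.request import urlopen

ALLOWED_SCHEMES = {"http", "https"}  # assumption: refuse file:, ftp:, etc.

def proxied_url(target, proxy_base="http://yourserver.com/proxy.cgi"):
    """Build the same-site URL the in-page scraper requests instead of target."""
    return proxy_base + "?address=" + quote(target, safe="")

def extract_target(request_path):
    """Return the 'address' parameter, or None if missing or not http(s)."""
    values = parse_qs(urlsplit(request_path).query).get("address")
    if not values:
        return None
    target = values[0]
    if urlsplit(target).scheme not in ALLOWED_SCHEMES:
        return None
    return target

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = extract_target(self.path)
        if target is None:
            self.send_error(400, "missing or unsupported address parameter")
            return
        try:
            # Fetch the third-party page on the server side.
            with urlopen(target, timeout=10) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "text/html")
        except OSError:
            self.send_error(502, "upstream fetch failed")
            return
        # Relay the content so it appears to come from this server.
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Run the proxy until interrupted (hypothetical local port)."""
    HTTPServer(("127.0.0.1", port), ProxyHandler).serve_forever()
```

Calling serve() starts the proxy locally. The scraper page would then request proxied_url("http://nytimes.com/article2345.htm") rather than the article itself. Note that a proxy that fetches any address for any caller is open to abuse, so a real deployment would probably want to limit which sites it will fetch.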


Turtle's proxy suggestion is good, or you could code something up as a userscript for Greasemonkey (for Firefox). It has an XMLHttpRequest component that routes around the security roadblocks.
posted by kokogiak at 8:37 AM on August 26, 2005


IE6 supports XMLHttpRequest to other sites, but Firefox doesn't, and in any case allowing it is a glaring security hole.
posted by abcde at 9:07 AM on August 26, 2005


Flash is blocked from doing this for the same reason.
posted by bruceyeah at 10:09 AM on August 26, 2005


IE6 will allow XMLHttpRequest to other domains, but only if the originating site is on the list of trusted sites. I thought Firefox could do it, too, if given explicit permission, but I'm not sure what the mechanism is.
posted by cerebus19 at 10:45 AM on August 26, 2005

