Client side screen scraping help?
August 26, 2005 6:17 AM
Can I implement a screen scraper app completely on the client side?
That is, can JavaScript or similar make an HTTP request to a third-party site and then parse the results? I'm not wedded to any particular tech, but would like to achieve this on the client side with no additional plug-ins etc.
Responses appreciated, this isn't my area at all.
Does it have to be in-browser? There are lots of ways to do screen scraping with programs that run on the command line.
posted by RustyBrooks at 7:05 AM on August 26, 2005
Response by poster: Thanks Nicwolff, I suspected there might be a security implication. Is there any way around this (maybe by getting explicit user input)? How about using a Flash or similar mini-app?
Any ideas would be welcome.
posted by bystander at 7:07 AM on August 26, 2005
Response by poster: Does it have to be in-browser? I think so. I want to distribute the screen scraping across a lot of clients, rather than have it all happen from a single server.
So a separate executable could work, but it would be pretty inelegant, and I suspect the non-IT types I want to target would find it a hassle.
posted by bystander at 7:11 AM on August 26, 2005
You could get around the security by using a proxy. For example, your screenscraper would be here:
http://yourserver.com/screenscraper.htm
It would be allowed to operate on URLs of the form:
http://yourserver.com/proxy.cgi?address=url
where url could be any address, like:
http://nytimes.com/article2345.htm
All it would require is the proxy.cgi (or whatever) program, which would fetch any web content and make it appear to come from your site. That would get you around the browser's block on JavaScript reading pages from other sites (the same-origin restriction).
posted by Turtle at 8:32 AM on August 26, 2005
I mean the screenscraper would work on URLs of the form:
http://yourserver.com/proxy.cgi?address=http://nytimes.com/article2345.htm
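For what it's worth, here is a rough sketch of what such a proxy could look like. It is not a CGI script: it assumes Node.js 18 or later (for the built-in fetch), and the names targetFromRequestUrl, handleRequest and startProxy are made up for illustration. It only relays http and https addresses and treats everything as text, so it suits HTML pages rather than images. It also does nothing to stop strangers from using it as an open proxy, so restrict it before putting it anywhere public.

```javascript
// Hypothetical stand-in for proxy.cgi. Assumes Node.js 18+ (built-in fetch).
const http = require('http');

// Extract and validate the ?address= parameter from a request URL.
// Returns the target URL as a string, or null if it is missing or unsafe.
function targetFromRequestUrl(requestUrl) {
  // The base is only needed so the relative request path can be parsed.
  const parsed = new URL(requestUrl, 'http://yourserver.com');
  const address = parsed.searchParams.get('address');
  if (!address) return null;
  let target;
  try {
    target = new URL(address);
  } catch (err) {
    return null; // not an absolute URL
  }
  // Relay only plain web addresses, never file: or other schemes.
  if (target.protocol !== 'http:' && target.protocol !== 'https:') return null;
  return target.href;
}

// Fetch the target page server-side and hand it back from our own origin,
// so the browser treats it as same-site content.
async function handleRequest(req, res) {
  const target = targetFromRequestUrl(req.url);
  if (target === null) {
    res.writeHead(400, { 'Content-Type': 'text/plain' });
    res.end('Missing or invalid address parameter');
    return;
  }
  try {
    const upstream = await fetch(target);
    const body = await upstream.text();
    res.writeHead(upstream.status, {
      'Content-Type': upstream.headers.get('content-type') || 'text/html'
    });
    res.end(body);
  } catch (err) {
    res.writeHead(502, { 'Content-Type': 'text/plain' });
    res.end('Upstream fetch failed');
  }
}

// Nothing listens until this is called, e.g. startProxy(8080).
function startProxy(port) {
  return http.createServer(handleRequest).listen(port);
}
```

Calling startProxy(8080) starts it. On the client side, pass the address through encodeURIComponent before appending it, otherwise any query string in the target URL gets mixed up with the proxy's own parameters.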
posted by Turtle at 8:35 AM on August 26, 2005
Turtle's proxy suggestion is good, or you could code something up as a userscript for Greasemonkey (for Firefox). It has its own XMLHttpRequest component (GM_xmlhttpRequest) that routes around the security roadblocks.
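A minimal sketch of what such a userscript might look like, assuming a script manager that exposes GM_xmlhttpRequest (Greasemonkey 4 and later renamed it GM.xmlHttpRequest). The script name, the @include page, the target URL and the extractTitle helper are all placeholders for illustration, and the regex stands in for whatever parsing you actually need.

```javascript
// ==UserScript==
// @name        Scraper sketch (hypothetical)
// @namespace   http://yourserver.com/
// @include     http://yourserver.com/screenscraper.htm
// @grant       GM_xmlhttpRequest
// ==/UserScript==

// Pull the <title> text out of an HTML string.
// A regex is enough for a sketch; real scraping wants a proper parser.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// GM_xmlhttpRequest only exists inside the userscript sandbox,
// so check for it before making the cross-site request.
if (typeof GM_xmlhttpRequest === 'function') {
  GM_xmlhttpRequest({
    method: 'GET',
    url: 'http://nytimes.com/article2345.htm', // placeholder target
    onload: function (response) {
      console.log('Scraped title:', extractTitle(response.responseText));
    },
    onerror: function () {
      console.log('Request failed');
    }
  });
}
```

The catch is that every user has to install Greasemonkey and then the script, which counts as a plug-in for the purposes of the original question.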
posted by kokogiak at 8:37 AM on August 26, 2005
IE6 supports XMLHttpRequest to other sites, but Firefox doesn't, and anyway it's a glaring security hole to do so.
posted by abcde at 9:07 AM on August 26, 2005
Flash is blocked from doing this for the same reason.
posted by bruceyeah at 10:09 AM on August 26, 2005
IE6 will allow XMLHttpRequest to other domains, but only if the originating site is on the list of trusted sites. I thought Firefox could do it, too, if given explicit permission, but I'm not sure what the mechanism is.
posted by cerebus19 at 10:45 AM on August 26, 2005
This thread is closed to new comments.
posted by nicwolff at 6:42 AM on August 26, 2005