Image search engine scraper/downloader
May 16, 2006 10:08 PM

I need a web scraper script/proggie to download and thumbnailize images for a long list of search terms.


Input: Long list of search terms.

Output: corresponding images and thumbnails on my hard drive, arranged/accessed/organized by search term.
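
One way to read the "organized by search term" requirement is a folder per term, with separate subfolders for full-size images and thumbnails. The following is a minimal Python sketch of that layout only. The function names, the `downloads` root, and the `images`/`thumbs` subfolder names are all assumptions made for illustration, not anything specified in the question.

```python
import re
from pathlib import Path


def term_to_dirname(term: str) -> str:
    """Turn a search term into a filesystem-safe folder name (hypothetical helper)."""
    # Replace runs of anything other than letters, digits, '-' and '_' with '_'.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", term.strip())
    return cleaned.strip("_") or "unnamed"


def plan_layout(terms, root="downloads"):
    """Map each search term to its (images, thumbnails) folders."""
    layout = {}
    for term in terms:
        base = Path(root) / term_to_dirname(term)
        layout[term] = (base / "images", base / "thumbs")
    return layout


def create_layout(layout):
    """Create the planned folders on disk."""
    for images_dir, thumbs_dir in layout.values():
        images_dir.mkdir(parents=True, exist_ok=True)
        thumbs_dir.mkdir(parents=True, exist_ok=True)
```

With something like this, `plan_layout(["camel", "bactrian camel"])` would give each term its own pair of folders, and the downloading step could write into them.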

My already feeble google-fough (<-- note ignorance of even the spelling) is failing me.
posted by Moistener to Computers & Internet (6 answers total)
 
What operating system are you running?
posted by secret about box at 11:05 PM on May 16, 2006


Pretty trivial programming task, in a number of languages. But as Mikey says, we need to know where we're starting from.

It's also worth noting that you're not allowed to run scripts against Google, or rather, not without signing up for Google's developer program, getting a special key, and agreeing not to make more than a certain number of searches in a 24-hour period.

That's for the regular Google anyway, and I suppose it applies to Images as well.

I don't know what happens if you get caught cheating though.
posted by AmbroseChapel at 3:34 AM on May 17, 2006


Actually, I don't know that Google has an approved API for image searching. The venerable "Google API" only applies to standard web searches, and while they have an API for Maps and XML feeds for News and Groups, I think that for Images, Froogle, and various other features, scraping would be your only option.

Google probably wouldn't bother to notice if you did this, or take any action against you if they did notice, unless you're doing it on a massive scale, but (historically) they do block certain types of requests that seem scraper-like.
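
If the worry is looking scraper-like, one common precaution is to space requests out. Here is a rough Python sketch of a fixed minimum delay between requests. The `Throttle` class and its parameters are hypothetical names for illustration, and nothing here is known to satisfy any particular site's limits.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative only)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep requests min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before each fetch in the loop over search terms would keep the requests at least `min_interval` seconds apart.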
posted by staggernation at 6:58 AM on May 17, 2006


Response by poster: I'm using Windows XP. I can run perl or python if needed.
posted by Moistener at 10:54 AM on May 17, 2006


Well, here's a very quick and dirty Perl 'scraper':
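use LWP::UserAgent;
my $agent = LWP::UserAgent->new();
$agent->agent('Mozilla/5.0');    # pretend to be a browser
my $search_term = 'camel';
my $search = $agent->get('http://images.google.com.au/images?hl=en&btnG=Search+Images&q=' . $search_term);
# get() always returns a response object, so check the status explicitly
die $search->status_line unless $search->is_success;
my $google_code = $search->content();
# each match captures the original image URL ($1) and the thumbnail URL ($2)
while ($google_code =~ m|<a href=/imgres\?imgurl=([^&]+).*?<img src=/images\?q=tbn:[^:]+:([\S]+)|g) {
    print "original: $1;\nthumbnail: $2;\n";
}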
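This gets the first page of results and prints the URL of each thumbnail Google displays, along with the URL of the original image. Downloading the files, making thumbnails, and looping over the search terms is left as an exercise for the reader.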

But note line 3 where I had to set it to pretend to be Mozilla.

I had to do that because I got a 403 error without it -- Google definitely knew that someone using LWP was up to no good...
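
Since the poster can run Python too, here is a rough equivalent of the same idea using only the standard library: send a browser-like User-Agent, then pull out the URL pairs with the same regular expression. Treat it as a sketch. The function names are made up for illustration, and the pattern assumes the 2006-era Google Images markup that the Perl version was written against.

```python
import re
import urllib.request

# Pattern carried over from the Perl snippet; it assumes the 2006-era
# Google Images markup and is not expected to match other page layouts.
RESULT_RE = re.compile(
    r"<a href=/imgres\?imgurl=([^&]+).*?<img src=/images\?q=tbn:[^:]+:(\S+)"
)


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_pairs(html):
    """Return (original, thumbnail) tuples, matching the Perl script's $1 and $2."""
    return RESULT_RE.findall(html)
```

Something like `extract_pairs(fetch(search_url))` would then stand in for the Perl loop, with `search_url` built from the search term the same way as above.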
posted by AmbroseChapel at 5:08 PM on May 17, 2006 [1 favorite]


<taps thread>

Is this thing on?
posted by AmbroseChapel at 2:14 PM on May 20, 2006

