How do I collect thousands of band URLs?
December 13, 2005 2:32 PM

I have a huge list of band names in a text file. I need them turned into links to each band's website. Perhaps there is such a thing as a bulk "I'm Feeling Lucky" Google query? Plz help, CS gurus.

The list is big enough that doing it by hand is out of the question for any one person. I have limited capabilities in PHP and (of course) Pascal. I'm guessing it could be done through the Google API somehow (by simply taking the URL of the first hit for the band name), but I've never worked with it and don't know how much of a project that would be. The links need not be 100% accurate: some bands won't have websites, and the occasional mix-up when the first hit isn't the right one is OK too. Any suggestions?
posted by Count Ziggurat to Computers & Internet (16 answers total)
 
You can put that "I'm Feeling Lucky" in the URL.

E.g., a line that reads

Ani Difranco

gets changed to

http://www.google.co.nz/search?q=Ani%20Difranco&btnI=I'm%20Feeling%20Lucky
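
A minimal sketch of that rewrite, in Python for brevity (in PHP, rawurlencode() would do the job of quote()). The host and the q/btnI parameters are copied from the URL above; the names SEARCH_BASE and lucky_url are made up for illustration.

```python
from urllib.parse import quote, urlencode

# Host and parameter names are copied from the example URL above.
SEARCH_BASE = "http://www.google.co.nz/search"


def lucky_url(band_name: str) -> str:
    """Return an "I'm Feeling Lucky" search URL for one band name."""
    params = {"q": band_name, "btnI": "I'm Feeling Lucky"}
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return SEARCH_BASE + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(lucky_url("Ani Difranco"))
```

Note that the apostrophe in the button label comes out percent-encoded as %27 here, so the result differs slightly from the hand-typed URL but carries the same parameters.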
posted by holloway at 2:42 PM on December 13, 2005


Response by poster: Ah, OK. Can I get PHP to follow the link and collect the URL Google brings back?
posted by Count Ziggurat at 2:49 PM on December 13, 2005


Well yeah but why bother? Just make a page of links like that. Much simpler.

If you do want to scrape the redirect, look into using a PHP HTTP client and check the Location header in the response.
posted by holloway at 3:00 PM on December 13, 2005


Try an fopen() on the above URL, which will open the HTTP socket as a file stream. You should be able to parse out the URL Google redirects you to. It will probably be like

Location: http://www.dreamtheater.net/

in the HTTP header. If you need help writing this code I could take a stab at it and post example code.
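
For what it's worth, here is a rough sketch of the idea in Python rather than PHP. It assumes the server answers the search URL with a 3xx status and a Location header, as described above; whether a given search engine actually does that for automated requests is not something the sketch can promise. The function names redirect_target and find_location are hypothetical.

```python
import http.client
from typing import Iterable, Optional
from urllib.parse import urlsplit


def find_location(header_lines: Iterable[str]) -> Optional[str]:
    """Pick the redirect target out of raw "Name: value" header lines."""
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            return value.strip()
    return None


def redirect_target(url: str, timeout: float = 10.0) -> Optional[str]:
    """Request url once and return its Location header, or None.

    http.client does not follow redirects by itself, so a 3xx answer
    reaches us unchanged with the header still attached.
    The URL must already be percent-encoded (no raw spaces).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        if 300 <= response.status < 400:
            return response.getheader("Location")
        return None
    finally:
        conn.close()
```

find_location() is for the case where you already have the header block as text, e.g. find_location(["HTTP/1.1 302 Found", "Location: http://www.dreamtheater.net/"]); redirect_target() makes the request itself.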
posted by Khalad at 3:03 PM on December 13, 2005


I would take that big old text file of yours into Excel (or any spreadsheet software):

Column A: Band name
Column B: <a href="http://www.google.co.nz/search?q=
Column C: &btnI=I'm%20Feeling%20Lucky">
Column D: </a><br>
Column E: =concatenate(B1,A1,C1,A1,D1)

Copy the fields in Column E into an HTML file and off you go.
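
The same concatenation can be sketched as a short script if a spreadsheet isn't handy. This is only an illustration in Python: link_line and build_page are invented names, the search host is the one used earlier in the thread, and the band names are URL-encoded and HTML-escaped along the way.

```python
import html
from urllib.parse import quote, urlencode

SEARCH_BASE = "http://www.google.co.nz/search"  # host from the thread's example


def link_line(band_name: str) -> str:
    """Build one '<a href=...>name</a><br>' line, like Column E."""
    query = urlencode({"q": band_name, "btnI": "I'm Feeling Lucky"},
                      quote_via=quote)
    # Escape the URL for use inside an HTML attribute (& becomes &amp;).
    href = html.escape(SEARCH_BASE + "?" + query)
    return f'<a href="{href}">{html.escape(band_name)}</a><br>'


def build_page(names) -> str:
    """Turn an iterable of band names (one per line) into link lines."""
    return "\n".join(link_line(n.strip()) for n in names if n.strip())


if __name__ == "__main__":
    # In practice the names would come from the text file, one per line.
    print(build_page(["Ani Difranco", "", "Dream Theater"]))
```

Blank lines in the input are skipped, so a trailing newline in the text file does not produce an empty link.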
posted by bwilms at 4:01 PM on December 13, 2005


In addition to being less work for you, linking to Google will ensure that the links update when the bands change websites. If you are going to get each URL from Google, use the API. That's what it's there for. There are open source PHP implementations.

bwilms, band names would also need to be URL encoded, e.g. space becomes %20.
posted by scottreynen at 4:06 PM on December 13, 2005


I feel you on that. It's easy enough to add. Besides, it seems to work alright without the %20.
posted by bwilms at 4:11 PM on December 13, 2005


Do you want a script so you can keep doing this over and over, or is this a one time thing?

If it's a one time thing, let me know... I may be able to make use of my existing database of bands... It's pretty big at this point (as in about 7,500 bands big)...
posted by twiggy at 4:14 PM on December 13, 2005


bwilms: It works without the %20 because your browser does the work for you (and some browsers* won't). Remember to rawurlencode() the band names.

* like NETSCAPE 4
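
Spaces are the forgiving case; a band name with an & in it is where skipping the encoding really bites. A small Python sketch of what a server would parse out of the query string, with and without encoding (quote() with safe="" here behaves like PHP's rawurlencode() for these inputs; the function name and the btnI value are placeholders):

```python
from urllib.parse import parse_qs, quote


def q_as_parsed(band_name: str, encode: bool) -> str:
    """Return the q parameter as a server-side parser would see it."""
    q = quote(band_name, safe="") if encode else band_name
    query_string = f"q={q}&btnI=placeholder"
    # parse_qs splits on "&", so a raw "&" in the name ends q early.
    return parse_qs(query_string).get("q", [""])[0]


if __name__ == "__main__":
    name = "Belle & Sebastian"
    print(repr(q_as_parsed(name, encode=False)))  # name cut off at the "&"
    print(repr(q_as_parsed(name, encode=True)))   # full name survives
```

Without encoding, everything after the & is treated as a separate parameter, so the search would only be for the first word.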
posted by holloway at 4:26 PM on December 13, 2005


Response by poster: holloway: thanks for the suggestion, but it suits me to have the actual URL used in the link (for user-friendliness, non-dependence on Google, general nonhackiness.)
Khalad: That's exactly what I was looking for. I'll write it out myself. Thanks!
twiggy: That database is huuge (if it's not already there, it might be a good one for Projects). It's a one-time thing, for about a thousand bands. I'll try the PHP method, but if it doesn't work out (if Google turns out to be less accurate than I thought) I might take you up on it.

I'll post code here when I get around to it.
posted by Count Ziggurat at 4:32 PM on December 13, 2005


I would have used comic sans too but the man is keeping me down.
posted by holloway at 5:05 PM on December 13, 2005


Minor suggestion: toss an "official" (and maybe "home" as well) in the querystring if you're going to rely on "I'm feeling lucky". You'll be less likely to end up on a fan's geocities site for some more obscure artists.
posted by TimeFactor at 7:54 PM on December 13, 2005


Best answer: TimeFactor: I liked your suggestion at first, but in practice, tossing that in may exclude the band's site. Not all bands (in fact I'd venture to say not most) have the word "official" on their website.

A good thing to tack on, however, might be the word "band"...

Problem is, with the Google "I'm Feeling Lucky" search, sometimes it's gonna be better with "band", sometimes without; sometimes with "official", sometimes without, etc. It just depends on the text contained on the site.

Unfortunately, this will never be a perfect science, and the biggest problem will be that many Google searches will turn up sites that do not belong to the band or even relate to it at all...

This is why I'm offering to just run your list up against my database and give you the websites I have for bands -- we don't enter a website into our database unless it is actually for the band... (otherwise you get a link to a google search on the band profile page...)
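
Roughly, the matching described above could look like the sketch below (Python, purely illustrative). KNOWN_SITES is hypothetical stand-in data, not the real database, and normalize and link_for are invented names; the fallback mirrors the "link to a Google search" behaviour mentioned above.

```python
from urllib.parse import quote

# Hypothetical stand-in for a vetted band -> website table.
KNOWN_SITES = {
    "example band": "http://band.example.com/",
}


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so near-identical names match."""
    return " ".join(name.lower().split())


def link_for(name: str, known_sites=KNOWN_SITES) -> str:
    """Return the vetted site if there is one, else a plain search link."""
    site = known_sites.get(normalize(name))
    if site:
        return site
    return "http://www.google.com/search?q=" + quote(name.strip(), safe="")


if __name__ == "__main__":
    for band in ["  Example   BAND ", "Unknown Act"]:
        print(band.strip(), "->", link_for(band))
```

The point of the design is that a wrong link is never stored: anything not in the vetted table falls back to a search rather than a guess.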
posted by twiggy at 8:05 PM on December 13, 2005


MusicBrainz has an API. Most popular artists (but by no means all) have official sites linked to their profiles.
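
As a hedged illustration only: the sketch below assumes the JSON shape that MusicBrainz's present-day web service is understood to return for an artist lookup with URL relationships included (a "relations" list whose entries carry a "type" and a "url" with a "resource"). That shape is an assumption to check against the MusicBrainz documentation, the sample data is made up, and official_homepages is an invented name.

```python
from typing import List


def official_homepages(artist: dict) -> List[str]:
    """Collect URLs from relations typed "official homepage".

    Assumes the response layout described in the lead-in; entries
    that do not match it are skipped rather than raising.
    """
    urls = []
    for rel in artist.get("relations", []):
        if rel.get("type") == "official homepage":
            resource = rel.get("url", {}).get("resource")
            if resource:
                urls.append(resource)
    return urls


if __name__ == "__main__":
    # Made-up sample in the assumed shape, not real MusicBrainz data.
    sample = {
        "name": "Example Band",
        "relations": [
            {"type": "official homepage",
             "url": {"resource": "http://band.example.com/"}},
            {"type": "fanpage",
             "url": {"resource": "http://fans.example.org/"}},
        ],
    }
    print(official_homepages(sample))
```

An artist with no linked homepage simply yields an empty list, which matches the caveat above that not every artist has one.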

Note that the I'm Feeling Lucky approach has been used on certain websites, with sometimes amusing results. There are many band names that are also something else.
posted by dhartung at 8:37 PM on December 13, 2005


It's probably worth noting that you're not supposed to use Google programmatically, or rather, if you want to do that you should sign up for their API and get a key. You're allowed a thousand searches a day or something.

Also, I second the suggestion that you should be wary of just accepting what comes back when you do an "I'm feeling lucky", although of course you may well get links to a non-official site simply because it's better than a band's official site.

For some reason, bands tend very badly toward huge, ugly, stupid Flash sites.
posted by AmbroseChapel at 10:32 PM on December 14, 2005


Response by poster: The "I'm Feeling Lucky" approach didn't work out so great. It needed a whole lot of editing.
So twiggy, I'm emailing you the data. (BTW, thanks a lot for the offer.) It seems I was exaggerating with "around a thousand": in actuality, it's only a few hundred.
posted by Count Ziggurat at 5:57 PM on December 16, 2005

