Need help extracting links off of a webpage (RuneHQ to be exact). Need to put them in a database.
August 25, 2009 5:31 PM

Hey guys,

I am working on a little coding project related to an online game called RuneScape (yes, it's a legit program), and I am wondering how to go about extracting all the links I want from a webpage, so I can visit them all with my script.

The webpage in question is RuneHQ, and I would like to extract every link in the right-hand "Items" menu, which scrolls vertically forever.

An example item link looks like http://www.runehq.com/database.php?type=item&id=008016, but the ID number on the end does not increase in a logical fashion.

Does anyone have any ideas on how to extract only the links in the right-hand "Items" menu? It can be a direct scraping approach, or somehow querying their database.

I will eventually be using these links to do some trend analysis on RuneScape item prices, using Python with urllib2, BeautifulSoup, etc.
posted by Javed_Ahamed to Computers & Internet (13 answers total)
Quick and dirty:

curl http://www.runehq.com/databasesearch.php | grep square|cut -d "\"" -f 8|sed 's/\/database/http\:\/\/www.runehq.com\/databasesearch.php\?/g'
posted by Cat Pie Hurts at 5:45 PM on August 25, 2009


ack..typo. fixed:

curl http://www.runehq.com/databasesearch.php | grep square|cut -d "\"" -f 8|sed 's/\/database/http\:\/\/www.runehq.com\/databasesearch/g'
posted by Cat Pie Hurts at 5:46 PM on August 25, 2009


ack..typo!
posted by Cat Pie Hurts at 5:46 PM on August 25, 2009


uh, I'm about to test that out right now, but do you mind explaining how it works? I don't know what half of those commands do, sorry, little newbie :)
posted by Javed_Ahamed at 5:55 PM on August 25, 2009


ugh..borked it again (very sloppy):

curl http://www.runehq.com/databasesearch.php | grep square|cut -d "\"" -f 8|sed 's/\/database/http\:\/\/www.runehq.com\/database/g'|sed 's/amp;//g'
posted by Cat Pie Hurts at 5:56 PM on August 25, 2009


Since you're planning on using BeautifulSoup eventually, you might as well try to use that. The problem is that BeautifulSoup breaks its teeth on the poor HTML of the page you linked:

import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.runehq.com/databasesearch.php?db=item&query=&field=0&sort=price&order=desc")
soup = BeautifulSoup(page)
# raises: HTMLParseError: bad end tag: u"", at line 26, column 196

If you can fix that, you should be able to use the findAll function.
posted by spaghettification at 6:02 PM on August 25, 2009


using python and urllib2, beautifulsoup etc.

No offense, but if you're going to do stuff with the above, I would hope that you're familiar with general Unix text processing (ok..making a lot of assumptions..sorry).

If you're on Linux or OS X, the above line will work out of the box.
If you're on Windows, get thee to Cygwin.

From peering at the page source, the links in the item list all belong to class "square".

curl - grabs the webpage and feeds it to grep.

grep does pattern matching. Here, we're looking for "square" and piping the result to cut.

cut - slices up a line based on a delimiter: -d "\"" specifies a double quote mark (\ escapes it for the shell). -f 8 indicates that, using " as a delimiter, we only want to see the contents of field 8. Then we pipe the result to sed.

sed - a stream editor for text. Here, we tell it to search for the "/database..." string and prepend the site URL (because the actual link is just a relative reference). That gets piped to a 2nd sed statement to clean up some other junk (because I'm sloppy that way).

If you append > filename to the end, it will spit it all out to a text file.
posted by Cat Pie Hurts at 6:05 PM on August 25, 2009 [1 favorite]
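
For anyone who wants to stay in Python rather than shelling out, here is a rough equivalent of that pipeline. Note the class name "square" and the markup shape are assumptions taken from this thread, not verified against the live page:

```python
import re

# Sample markup shaped like the RuneHQ item menu as described above
# (class "square", relative hrefs with "&amp;" entities); the real
# page may differ.
sample = """
<a class="square" href="/database.php?type=item&amp;id=008016">item one</a>
<a class="square" href="/database.php?type=item&amp;id=001822">item two</a>
<a class="other" href="/somewhere-else">not an item</a>
"""

# Same idea as the grep | cut | sed pipeline: keep only the
# class="square" anchors, pull out the href, prepend the site root,
# and strip the "amp;" entity junk (like the second sed call).
hrefs = re.findall(r'<a class="square" href="([^"]+)"', sample)
links = ["http://www.runehq.com" + h.replace("amp;", "") for h in hrefs]
```

On the real page you would feed it the result of urllib2.urlopen(...).read() instead of the sample string, though a proper parser is still safer than a regex for messy HTML.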


Here's an example that works:

import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://ask.metafilter.com/131138/Need-help-extracting-links-off-of-a-webpage-RuneHQ-to-be-exact-Need-to-put-them-in-a-database")
soup = BeautifulSoup(page)
alllinks = soup.findAll("a", href=True)
print [link["href"] for link in alllinks]
posted by spaghettification at 6:12 PM on August 25, 2009


Thanks guys! Especially Cat Pie Hurts for explaining what the different commands did in a nutshell. I can usually find my way around Linux but know almost nothing about text processing except regexp. Thanks again guys!
posted by Javed_Ahamed at 6:14 PM on August 25, 2009


sed 's/\/database/http\:\/\/www.runehq.com\/databasesearch.php\?/g'

Protip: don't use / as your regex delimiter if you're trying to also match it. It looks like ass.

sed 's,/database,http://www.runehq.com/databasesearch.php\?,g'

Doesn't that look better?
posted by Rhomboid at 8:00 PM on August 25, 2009 [3 favorites]


Yeah, you need to use BeautifulSoup 3.0.7 to work with crappy malformed HTML, as it uses the much more liberal SGMLParser (which was removed in Python 3.0). He explains it here.
soup = BeautifulSoup(urllib2.urlopen(url))
for link in soup.body('a', 'menuItem'):
    item_name = link.renderContents()
    item_url = link['href']
    # do something with those...

posted by cj_ at 8:03 PM on August 25, 2009
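
One gotcha with any of these approaches: the hrefs come back relative (e.g. /database.php?type=item&id=...), so before handing them to urllib2 you'll want to join them against the site root. A small sketch using the stdlib's urljoin; the base URL and the href shape are taken from the links quoted in this thread:

```python
try:
    from urlparse import urljoin        # Python 2
except ImportError:
    from urllib.parse import urljoin    # Python 3

base = "http://www.runehq.com/databasesearch.php"
# Relative href shaped like the item links quoted in this thread.
relative = "/database.php?type=item&id=001822"
# urljoin resolves a root-relative path against the host of the base URL.
absolute = urljoin(base, relative)
```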


I'll throw a Ruby solution in the mix, too ;)
require 'rubygems'
require 'scrapi'

item = Scraper.define do
  process "a.menuItem", :title => :text, :link => "@href"
  result :title, :link
end 

rune = Scraper.define do
  array :items
  process "table.newleft2 a.menuItem", :items => item
  result :items
end

url = URI.parse("http://www.runehq.com/databasesearch.php")
items = rune.scrape(url)
items.each do |item|
  puts "#{item.title} (#{item.link})"
end
which returns:
'perfect' gold bar (/database.php?type=item&id=001822)
'perfect' gold ore (/database.php?type=item&id=001821)
'perfect' necklace (/database.php?type=item&id=001824)
'perfect' ring (/database.php?type=item&id=001823)
'voice of doom' potion (/database.php?type=item&id=003785)
1/2 anchovy pizza (/database.php?type=item&id=000504)
1/2 meat pizza (/database.php?type=item&id=000507)
1/2 p'apple pizza (/database.php?type=item&id=000509)
1/2 plain pizza (/database.php?type=item&id=001162)

posted by cdmwebs at 7:54 AM on August 26, 2009


thanks cdmwebs :), I haven't done any scraping in Ruby yet, so it's a nice example!
posted by Javed_Ahamed at 9:00 AM on August 26, 2009


This thread is closed to new comments.