Extracting Data from Myspace and creating a date-sorted list of gigs.
February 17, 2008 10:14 AM

How could I extract and combine the data from about 40 gig pages on Myspace (like this and this) and end up with a date-sorted list of all of the data?

Would it be easy or quick to do this once a week? The more automated this can be, the better. I don't really want an RSS feed, but rather a resulting list like the one below, which can be generated when I need it.

1/01/08: The Beatles: The Venue, London
1/01/08: The Verve: La Venue, Paris
2/01/08: The Beatles: The Venue, Manchester
2/01/08: The Rolling Stones: The Venue, York
2/01/08: The Beatles: The Venue, Skegness
4/01/08: The Kinks: The Venue, York
posted by takeyourmedicine to Computers & Internet (9 answers total) 1 user marked this as a favorite
 
I have done stuff like this using Python many times. This should be easy, since it looks like you just have to automatically visit a list of easily constructed URLs. Here's a rough outline; I haven't tested this. In the code below, bandnames.txt should be a file containing one band name per line.

import urllib2, re

bandnames = [line.strip() for line in file("bandnames.txt", "r")]
baseurl = 'http://collect.myspace.com/index.cfm?fuseaction=bandprofile.listAllShows&friendid=18786133&n='
output_file = file('outputdata.txt', 'w')
for bandname in bandnames:
    urlend = "+".join(bandname.split())  # "The Beatles" -> "The+Beatles"
    url = baseurl + urlend
    resp = urllib2.urlopen(url)
    html_code = resp.read()
    # you would have to design regular expressions (string patterns) to extract the
    # data you are looking for; do a Google search for "python regular expressions"
    # and learn how to extract dates and other strings
    occurrence = re.findall(r'someregularexpression', html_code)[0]
    output_file.write(occurrence + '\n')
output_file.close()
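
To get the final date-sorted list you're after, you'd sort the collected lines as a last step. Here's a rough, untested sketch of that, assuming each line of outputdata.txt ends up looking like "1/01/08: The Beatles: The Venue, London":

import time

lines = [line.strip() for line in file('outputdata.txt', 'r') if line.strip()]
# parse the leading d/mm/yy date so the sort is chronological rather than alphabetical
def gig_date(line):
    return time.strptime(line.split(':', 1)[0], '%d/%m/%y')
lines.sort(key=gig_date)
for line in lines:
    print line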

More documentation, including password authentication and so on, is at the following links:
http://therning.org/magnus/archives/270
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/391929

Hope that helps. Let me know if you have any questions. If you want to get more elaborate and store the data in XML or SQL, let me know and I can dig up some code that does that.
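
For what it's worth, the SQL route can be pretty simple with Python's bundled sqlite3 module. A minimal, untested sketch (the table layout here is made up for illustration; storing dates as YYYY-MM-DD means a plain ORDER BY sorts chronologically):

import sqlite3

conn = sqlite3.connect('gigs.db')
conn.execute('CREATE TABLE IF NOT EXISTS gigs (gigdate TEXT, band TEXT, venue TEXT)')
# hypothetical row; in practice the values would come from your regex matches
conn.execute('INSERT INTO gigs VALUES (?, ?, ?)', ('2008-01-01', 'The Beatles', 'The Venue, London'))
conn.commit()
for gigdate, band, venue in conn.execute('SELECT * FROM gigs ORDER BY gigdate'):
    print '%s: %s: %s' % (gigdate, band, venue)
conn.close()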
posted by lunchbox at 10:30 AM on February 17, 2008 [1 favorite]


Play around with Dapper to get the data you want, and if you need to reformat it, you can use the RSS feed from that and work it into Yahoo Pipes.

Or, write a script to screen scrape it and parse it using regular expressions.
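
For illustration, the parsing half of that approach can be just a few lines of Python. This is a toy example with made-up markup; the real Myspace pages would need their own pattern:

import re

# made-up markup standing in for a fetched gig page
html = '<td>1/01/08</td><td>The Venue, London</td>'
for date, venue in re.findall(r'<td>(\d{1,2}/\d{2}/\d{2})</td><td>([^<]+)</td>', html):
    print '%s: %s' % (date, venue)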
posted by bertrandom at 10:31 AM on February 17, 2008


I immediately noticed a few typos in my code (e.g. the baseurl should have a quotation mark at the beginning), but let's see what solutions other people come up with first.
posted by lunchbox at 10:33 AM on February 17, 2008


Here is a piece of PHP code I found somewhere that I used to grab the dates from a buddy's Myspace page and embed them in his personal page.

Fiddle a little with it and you have yourself a solution.
posted by petethered at 11:15 AM on February 17, 2008


Here's a Groovy script that I use to scrape stock quotes off MoneyCentral. It should be pretty easy to repurpose this to do what you're looking for if you've got a little programming background:

#!/usr/local/groovy/bin/groovy
// need to have the TagSoup jar in your classpath for this to work; it is better at parsing malformed HTML
// see: http://ccil.org/%7Ecowan/XML/tagsoup/

def symbols = ["EEM", "QQQQ", "SPY", "VFORX", "VPU", "VWO"]

def getQuotes(findSymbols = ["AAPL"]) {
    def url = new URL("http://moneycentral.msn.com/detail/market_quote?symbol=${findSymbols.unique().sort().join('+')}")
    def quotes = []

    url.withReader { reader ->
        def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parse(reader)
        // crappy html on moneycentral: the data is in the only table of class "t",
        // and the only rows in that table we care about have 6 cells
        def rows = html.'**'.grep { it.name() == "tr" && it.children().size() == 6 && it.parent().name() == "table" && it.parent().@class == "t" }
        // the first matching row holds the column headers; the rest are quote rows
        def headers = rows[0].th.collect { it.text() }
        rows[1..rows.size() - 1].each { row ->
            def quote = [:]
            headers.eachWithIndex { header, i -> quote[header] = row.td[i] }
            quotes << quote
        }
    }
    return quotes
}

getQuotes(symbols).each { println "${it['Symbol']}\t${it['Last']}\t${it['Change']}" }
posted by freshgroundpepper at 2:28 PM on February 17, 2008


hmm... it didn't deal with a double "less than" sign well in the paste. The append lines should read (with the actual << operator):

headers.eachWithIndex { header, i -> quote[header] = row.td[i] }
quotes << quote
}
}
posted by freshgroundpepper at 2:31 PM on February 17, 2008


Response by poster: I have no real programming knowledge, so I can't really make sense of this stuff. Petethered's solution seems to come closest, but I'm not sure how to fiddle with it to make it work (where to input the URLs, etc.). Further or more explicit help on this would be useful!
posted by takeyourmedicine at 2:59 PM on February 17, 2008


If you lack programming chops, you might hit up rentacoder.com and submit a project. Something like this could probably be done for less than $20.
posted by bprater at 4:36 PM on February 17, 2008


Here's an actively maintained Myspace gigs parser written in PHP. If you have no coding skills, use the web-based version of that with Yahoo! Pipes or Dapper, and you should be good to go.
posted by waxpancake at 4:41 PM on February 17, 2008

