Scrape groups.msn?
February 7, 2006 5:09 PM

Need advice on moving an MSN group - particularly, recovering/scraping existing messages.

I'm a member of a club that uses an MSN group as its central message board. This group is about four years old, so there are tens of thousands of individual messages.

I'd like to be able to (by some manner) download all of the messages that have been posted since the club's inception. The reasons are several and are listed in order of priority:

1 - To be able to move the club off of MSN and onto a private server running vBulletin or the like.
2 - To be able to set up a search function (I'm already using Bancado's MSN.Groups index builder, and it isn't that free).
3 - To be able to back up the messages.

Even if I could only get everything dumped into one huge text file, I'd be able to parse it out (I was a software developer in a former life).

Any help would be greatly appreciated.

Ed T.
posted by Lactoso to Computers & Internet (4 answers total)
Greasemonkey and XPath are your friends when it comes to screen-scraping.

If for some odd reason you don't like Greasemonkey, Perl's WWW::Mechanize can also be helpful.

(I've built scrapers from both.)
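
To give a rough idea of what an XPath-based scrape looks like, here is a minimal sketch in Python rather than Greasemonkey, using the limited XPath subset in the standard library's xml.etree.ElementTree. The page markup, the class names (msg, author, body) and the extract_messages helper are all made up for illustration; real MSN Groups pages will look different and are unlikely to be well-formed XML, so a more forgiving HTML parser would probably be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one page of a message board.
# The real pages will have different markup; adjust the XPath to match.
PAGE = """
<html><body>
  <table id="messages">
    <tr class="msg"><td class="author">Ed T.</td><td class="body">First post</td></tr>
    <tr class="msg"><td class="author">Pat</td><td class="body">A reply</td></tr>
    <tr class="nav"><td>Next page</td></tr>
  </table>
</body></html>
"""


def extract_messages(page):
    """Return (author, body) pairs for every table row marked class="msg"."""
    root = ET.fromstring(page)
    messages = []
    # ".//tr[@class='msg']" selects matching rows at any depth below the root.
    for row in root.findall(".//tr[@class='msg']"):
        # Inside each row, pick the cells by their (assumed) class attribute.
        author = row.findtext("td[@class='author']", default="").strip()
        body = row.findtext("td[@class='body']", default="").strip()
        messages.append((author, body))
    return messages


if __name__ == "__main__":
    for author, body in extract_messages(PAGE):
        print(f"{author}: {body}")
```

The navigation row is skipped because its class attribute does not match the predicate, which is most of the appeal of XPath for this kind of job: the selection logic lives in one short expression.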
posted by orthogonality at 6:30 PM on February 7, 2006

I like Beautiful Soup for quick and easy scraping. It really excels at picking out individual pieces of HTML - the concept is something like "First <td> with class=foo, then third <p> inside that, and print out the contents of each <a> in there with a href that matches /"
posted by pocams at 9:04 PM on February 7, 2006

Response by poster: Orthogonality & Pocams - Thanks! Great general directions to research. I'll post back with results.

Thanks again,
Ed T.
posted by Lactoso at 9:50 PM on February 7, 2006

Use Perl and WWW::Mechanize to dump the whole thing to something like mbox format. WWW::Mechanize::Shell is the best place to get started. Saved me hours of work.
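
For the "dump to mbox" step, a rough sketch in Python (not Perl) using the standard library's mailbox module might look like the following. It assumes the scraping is already done and each post is a dict with author, subject, date and body keys; that layout, the sample posts, the placeholder address and the write_mbox helper are all invented for illustration.

```python
import mailbox
import os
import tempfile
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime, formataddr

# Hypothetical scraped posts; a real scraper would fill this list.
POSTS = [
    {"author": "Ed T.", "subject": "Welcome",
     "date": datetime(2002, 3, 1, 12, 0, tzinfo=timezone.utc),
     "body": "First post."},
    {"author": "Pat", "subject": "Re: Welcome",
     "date": datetime(2002, 3, 2, 9, 30, tzinfo=timezone.utc),
     "body": "Glad to be here."},
]


def write_mbox(posts, path):
    """Append each post to an mbox file at path; return how many were written."""
    box = mailbox.mbox(path)  # creates the file if it does not exist yet
    try:
        for post in posts:
            msg = EmailMessage()
            # The board has no email addresses, so use a placeholder one.
            msg["From"] = formataddr((post["author"], "unknown@example.invalid"))
            msg["Subject"] = post["subject"]
            msg["Date"] = format_datetime(post["date"])
            msg.set_content(post["body"])
            box.add(msg)
        box.flush()  # make sure everything is written to disk
    finally:
        box.close()
    return len(posts)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "club.mbox")
        print(write_mbox(POSTS, target), "messages written")
```

Once the messages are in mbox form, most mail clients and indexing tools can read them, which covers both the backup and the search goals from the question.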
posted by singingfish at 4:47 AM on February 8, 2006
