How to scrape a web forum?
September 23, 2011 11:23 AM
How to scrape a web forum? I need help understanding the process.
I want to figure out how to scrape a website, specifically a forum. It’s a site that’s been around for a long time with a lot of knowledge, but over the last couple of years the owner has gotten progressively less active, and is AWOL now. I’m afraid it will go away once the domain expires, and I want a backup so I can access the information if I need to. It’s an Invision board forum, and you need a password to see the forum content, if that affects how it’s done.
I’ve been looking into how to do it, but all I’m finding is blackhat SEO sites talking about scraping and reusing content, which is not what I’m looking to do.
I read some posts here but I’m still left not fully understanding what steps I need to take. I looked at ScraperWiki (https://scraperwiki.com/ - recommended in this post: http://ask.metafilter.com/178622/Scraping-the-web) and I’m not sure exactly what I’m supposed to do. I’ve been trying to dig into the tutorials and the intro video, and I think it might be what I want, but now I’m more confused than when I started. I’m not even sure what to do for a data request if I decide to go that route.
I’m on a limited budget, but would consider software if it didn’t break the bank, so if there are any suggestions for software, I’d appreciate those too. I can usually hack my way around PHP and MySQL, but this isn’t like anything I’ve done before, leaving me more confused every time I search. Looking at ScraperWiki, I suspect it’s outside my skill level (but I’m willing to give it a shot if I can find some good tutorials).
I’m really looking for some guides and advice to understand it better, so anything you can share would be great.
Response by poster: I'm a little familiar with wget for FTP, and I have a Mac, so I could probably do it through Terminal. How would I use it to crawl and grab pages? Will it work for a password-protected section of the site? What will the output be?
posted by [insert clever name here] at 11:40 AM on September 23, 2011
The thing you want to do isn't called scraping. Screen-scraping is extracting specific data from poorly formatted sources online, like what you find when you search for that term.
You just want to make a cache of the entire site, so that you could (for instance) read it offline (or put it back online somewhere else, later).
You want a piece of software like this: http://www.httrack.com/
posted by tylerkaraszewski at 11:40 AM on September 23, 2011 [1 favorite]
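For reference, HTTrack can also be driven from the command line. A minimal sketch, assuming you just want to pull a single site into a local folder; the URL and output path are placeholders:

    # Mirror a site into ./forum-mirror. The "+" filter keeps the crawl
    # on the forum's own domain, and -v prints progress as it runs.
    httrack "http://example-forum.com/" -O "./forum-mirror" \
        "+*.example-forum.com/*" -v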
You really don't want to scrape (extracting and reusing data), you want to mirror/cache/archive. You might want to seek guidance from Archive Team, which was founded by jscott. They're very concerned about preserving endangered online communities.
posted by zsazsa at 11:42 AM on September 23, 2011
Yeah, you definitely want wget. Here's a tutorial.
posted by cdmwebs at 11:50 AM on September 23, 2011
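To make that concrete, a minimal wget mirror might look like the sketch below. The forum URL is a placeholder; --mirror turns on recursion with timestamping, --convert-links rewrites links so the copy browses offline, --page-requisites grabs images and CSS, and --adjust-extension saves pages with .html extensions.

    # A minimal sketch of an offline mirror; the URL is a placeholder.
    # --wait=1 is just politeness so the crawl doesn't hammer the server.
    wget --mirror --convert-links --page-requisites --adjust-extension \
         --wait=1 http://example-forum.com/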
Response by poster: I think I do still want to scrape it. I think. My plan was to run a local copy of the site so that if I needed to, I could do all the same searches.
HOWEVER, I'm wondering if maybe what I want to do is capture the pages, and then, if I ever did need to get them into a database, scrape them at that point? Does that sound more like what I'd want to do?
posted by [insert clever name here] at 11:52 AM on September 23, 2011
Response by poster: BTW, that's an excellent suggestion, contacting the Archive Team. I don't know if they'd be interested or not; it's a pretty niche site. But it's also been around since 2000, so maybe.
posted by [insert clever name here] at 11:53 AM on September 23, 2011
Best answer: I've successfully used SiteSucker for something similar (grabbing a bunch of hiking pages for offline access while in the "wilds").
http://www.sitesucker.us/mac/mac.html
posted by jeffch at 12:03 PM on September 23, 2011
If the forum requires a login, then you'll need to look at your cookie data in your browser and replicate it when you invoke wget.
Also, you're probably going to want to write exclusion rules that limit the recursive mirroring. Dynamic forums have tons and tons of links (e.g. "view member list", printable thread views, etc.) that can turn into complete tarpits for a spider.
"My plan was to run a local copy of the site so that if I needed to, I could do all the same searches."
What you're going to get is a completely static copy of the site. It will contain all the information on the page, but nothing more. No dynamic site functionality (posting, replying, searching, editing, favoriting, editing view settings, etc.) will work. All of that stuff requires the code on the server and there is no way for you to get that. That isn't to say that you couldn't rig up your own search feature, something generic that indexes HTML, but that would most likely involve much more work, such as running your own web server and scripting language. The nice thing about a static mirror is that you can just load it into your browser using file:// URLs without having to set any of that up.
posted by Rhomboid at 12:42 PM on September 23, 2011
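As a sketch of both points, building on the wget flags shown earlier: the cookie name and value below are placeholders you'd copy out of your own browser, and the act=... patterns are a guess at typical Invision URLs, so adjust them to what the forum actually uses.

    # Reuse the browser's login session and skip the link tarpits.
    # The Cookie header value is a placeholder; --reject-regex needs
    # wget >= 1.14 (older versions can approximate it with -R / -X).
    wget --mirror --convert-links --page-requisites --adjust-extension \
         --wait=1 \
         --header "Cookie: session_id=PASTE_VALUE_HERE" \
         --reject-regex 'act=(Login|Members|Print|Report)' \
         http://example-forum.com/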
You also might want to see if somebody has already done it for you. The Internet Archive tries to archive as many websites as possible. Check the Wayback Machine to see if they have already crawled it.
posted by bottlebrushtree at 2:00 PM on September 23, 2011
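You can also check from the command line: the Wayback Machine exposes a simple availability endpoint that returns JSON. The domain below is a placeholder.

    # Ask the Wayback Machine whether it holds any snapshot of the site;
    # an empty "archived_snapshots" object in the reply means no capture.
    curl "http://archive.org/wayback/available?url=example-forum.com"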
Best answer: Downloading an entire web site with wget.
As Rhomboid said, if you need a login it's going to need a little more effort, and running a local copy requires more than "get a static copy of the site".
posted by straw at 3:24 PM on September 23, 2011
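If copying cookies out of the browser is awkward, wget can also perform the login itself and keep the session for the mirror run. A sketch, assuming a typical POST login form; the login URL and field names are placeholders that vary between forum versions, so check the login page's HTML for the real ones.

    # Step 1: submit the login form and save the session cookies.
    # The URL and the username/password field names are placeholders.
    wget --save-cookies cookies.txt --keep-session-cookies \
         --post-data 'username=YOUR_NAME&password=YOUR_PASS' \
         -O /dev/null \
         'http://example-forum.com/index.php?act=Login&CODE=01'

    # Step 2: mirror the site using those saved cookies.
    wget --mirror --convert-links --page-requisites --adjust-extension \
         --load-cookies cookies.txt --wait=1 \
         http://example-forum.com/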
Response by poster: Rhomboid, straw - that's why I thought maybe I did want to do scraping. Because if I needed it searchable, I could dump a database into the same software it's running on locally. I'm running PHP and MySQL via MAMP, so that part isn't a problem. Right now I use the site for searching the archives. There are some active members lingering, but not really enough to generate much worthwhile new content.
Of course, I could be putting the cart before the horse. Maybe I just need to wget it, and try to convert it to a database later, if it does go away? What would be best to do in this scenario?
bottlebrushtree, I checked archive.org but they don't have it because it's blocked by a robots.txt. Yes, that would have made it much easier.
posted by [insert clever name here] at 6:07 PM on September 23, 2011
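One caveat related to that robots.txt: wget obeys robots.txt by default, so the same file that keeps the Wayback Machine out will also stop a wget mirror unless it's told otherwise.

    # For a personal backup of a site you have legitimate access to,
    # -e robots=off makes wget ignore the robots.txt that would
    # otherwise block the recursive crawl.
    wget --mirror -e robots=off --wait=1 http://example-forum.com/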
Best answer: You need a scrawler converter. I have used it in the past (invisionfree to phpbb) and it worked quite well. You may need to modify it a bit depending on the forum's skin though.
posted by Memo at 6:28 PM on September 23, 2011 [1 favorite]
Best answer: *crawler!
I mention that kind of script because it basically gives you an exact copy of the forum, with the ability to search posts through whatever message board software you want to use.
posted by Memo at 6:34 PM on September 23, 2011
Response by poster: Thanks, Memo! I'm checking out the crawler converter right now, and I still have some things to learn, but I think this will do the trick. I especially like it because I can then just run a local copy of phpBB and use it for searching if I need to. And since it would be in a database, I could later take the database and make something more useful for searching than forum discussions.
I'm also marking some of the wget suggestions as best answers; they would be a decent alternative. Actually, all the answers here were really good, and not things I was able to find via Google. Thanks everyone for the advice!
posted by [insert clever name here] at 10:48 AM on September 24, 2011
Response by poster: I couldn't use the crawler converter because it only works for InvisionFree, not the paid version. I am, however, using SiteSucker and it seems to be working. A bit imperfectly (some duplicates), but well enough that I think I'll be able to search it for future use.
posted by [insert clever name here] at 7:06 AM on December 14, 2011
This thread is closed to new comments.