Datamining the public web
July 30, 2007 1:52 AM

How do I build a data warehouse that scrapes data from public websites for my own use? Tools? Tips?

Hi. I would like to track apartments on a classifieds site and use the data to analyze the impact of different things on price. What I need is a tool or scripting language that would make it easy for me to spider the website and put the data in a database. Preferably this would be an open source solution.

I am also looking for good tools for extracting information out of longer pieces of text. For example, on the site I want to mine, users can leave comments on every object. I would like to be able to decide whether a comment is positive, negative or neither. I have seen this done on an online art site whose name I can't remember right now. The artist took blog posts and decided the mood of the writer by which words were used.
posted by ilike to Computers & Internet (15 answers total) 19 users marked this as a favorite
 
You might be able to find some scripts and other tools out there to help you gather some of this information, but I doubt there is an all-in-one tool that does all of it.

However, I am sure somebody could build one for the right price.

If you post to places like scriptlance.com, rentacoder.com, guru.com or any of the other freelance programming sites, you should be able to find somebody to build you something custom for a couple hundred dollars, or maybe even less.
posted by B(oYo)BIES at 2:06 AM on July 30, 2007 [1 favorite]


I did something similar with local listings... a local newspaper had this godawful, MS Frontpage-esque site, and I ended up using PHP with the functions strpos (find the position of a text string) and str_replace (replace ugly formatting tags), plus a cron job to run it periodically.
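
The same find-and-clean idea translates pretty directly to Python if that's your preference. Roughly this, untested, and the comment markers and tags are just placeholders for whatever the site actually uses:

    # Rough Python equivalent of the strpos / str_replace approach; markers are placeholders.
    import re
    import urllib.request

    html = urllib.request.urlopen("http://example.com/listings").read().decode("utf-8", "replace")

    start = html.find("<!-- listings start -->")          # like strpos()
    end = html.find("<!-- listings end -->", start)
    chunk = html[start:end]                               # naive: assumes both markers exist

    chunk = chunk.replace('<font color="red">', "").replace("</font>", "")   # like str_replace()
    prices = re.findall(r"\$[\d,]+", chunk)               # pull dollar amounts out of the chunk
    print(prices)

Stick that in a cron job, dump the results into a database, and you're most of the way there.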
posted by slater at 2:08 AM on July 30, 2007


I don't think a toolkit for this sort of thing exists, but there sure seems to be a market for one. There are a thousand ways to write the actual scraping code, but there's a real need for a higher-level control language and an app that makes "guesses" at what data you're looking for, assigns fields, etc.

I've wanted one myself many times, but always ended up writing a single-use, ugly-as-sin custom script instead. Too bad.
posted by rokusan at 2:34 AM on July 30, 2007


I'm glad to see your interest in this, since I'm actually thinking of building this tool. Building it is certainly doable, but I think the interesting part would be making the GUI usable enough that it's easy for non-coders.

By the way, if you've heard web people like me rant about the semantic web, and XML over HTML, this is what that would accomplish.
posted by cheerleaders_to_your_funeral at 2:54 AM on July 30, 2007


Best answer: Well, given that regular expressions can't parse SGML-based markup like HTML, this is not trivial. You need a real parser.

I know Python and BeautifulSoup make it very easy to work with the messy, real-world web.
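
A minimal sketch (this uses the bs4 package, and the "listing" and "price" class names are just placeholders for whatever the classifieds site actually uses):

    # Minimal BeautifulSoup sketch; "listing" and "price" are placeholder class names.
    from bs4 import BeautifulSoup

    with open("cached_listing_page.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    for listing in soup.find_all("div", class_="listing"):
        title = listing.find("h2").get_text(strip=True)
        price = listing.find("span", class_="price").get_text(strip=True)
        print(title, price)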
posted by cmiller at 3:18 AM on July 30, 2007


Best answer: Adrian Holovaty, the award-winning journalist/developer behind Django and chicagocrime.org (one of the earliest geodata-based mashups), recently released templatemaker, which does exactly this, in Python.

This kind of tool is generally called a screen-scraping (or just scraping) script. Here's a relatively recent article on screen-scraping in Perl. One of the most commonly used tools for doing this in Perl is Template::Extract by Audrey Tang. In a similar vein is Andy Lester's WWW::Mechanize. The article mentioned here references both of those tools, but introduces Yung-chung Lin's Fear::API, which seems to be more comprehensive and is probably closer to what you're looking for.
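
From memory, the templatemaker API looks roughly like this (unverified, so treat it as a sketch rather than gospel; the listing strings are made up):

    # templatemaker usage, roughly as I remember it from the announcement (unverified).
    from templatemaker import Template

    t = Template()
    t.learn('<b>2 rooms, 4500 kr/month</b>')
    t.learn('<b>3 rooms, 6200 kr/month</b>')

    print(t.as_text('!'))                                # '<b>! rooms, ! kr/month</b>'
    print(t.extract('<b>1 rooms, 3100 kr/month</b>'))    # ('1', '3100')

You feed it a few sample pages, it infers the template, and extract() pulls out whatever varies between them.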
posted by anildash at 5:27 AM on July 30, 2007 [2 favorites]


Response by poster: Thanks for all the suggestions. If I'm going to have to learn a programming language to do this, I would prefer Ruby or Python. Does anyone know of more tools like the ones anildash suggested, but for Ruby or Python?
posted by ilike at 5:48 AM on July 30, 2007


It seems most people doing this with Ruby are using Hpricot.
posted by bitdamaged at 6:07 AM on July 30, 2007


Best answer: Ruby scraping tools. I like ScrAPI myself.

Just for completeness, web-harvest. I can't vouch for it because I've never used it, but I like the look of it.
posted by vbfg at 6:20 AM on July 30, 2007


O'Reilly sells a book, "Spidering Hacks: 100 Industrial-Strength Tips & Tools," which you might find useful as background. The version I have is from 2003 and may be a little dated now on the library and tool front, but depending on how much experience you have it may still be worthwhile.
posted by bottlebrushtree at 9:15 AM on July 30, 2007


Best answer: I do this kind of thing a fair amount. If the page is really simple I'll use regexps; otherwise I'll use BeautifulSoup. Either way it's a real pain of hand-crafted parsing; the templatemaker Anil links to sounds like it could be a lot better.

Some tips for doing it well:

While you're developing, be sure you're working from cached local copies of the pages. It'll be faster and you won't tip off the website about what you're doing.
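
Something as simple as this does the job (a sketch; the URL and filename are made up):

    # Sketch: fetch a page once, then keep reusing the local copy while developing.
    import os
    import urllib.request

    def fetch_cached(url, cache_file):
        if not os.path.exists(cache_file):
            with urllib.request.urlopen(url) as resp, open(cache_file, "wb") as out:
                out.write(resp.read())
        with open(cache_file, "rb") as f:
            return f.read()

    html = fetch_cached("http://example.com/apartments?page=1", "page1.html")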

Build a good unit test suite for your scraper/parser. You're going to be tweaking it a lot.
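
Even a couple of assertions against a saved copy of a real page go a long way (a sketch; parse_listings, its module, and the fixture file are stand-ins for your own code):

    # Sketch: regression test against a saved page; parse_listings is your own parser.
    from myscraper import parse_listings   # hypothetical module

    def test_parse_listings():
        with open("fixtures/listing_page.html", encoding="utf-8") as f:
            listings = parse_listings(f.read())
        assert len(listings) == 25              # however many listings the saved page contains
        assert listings[0]["price"] == 4500     # spot-check one known value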

Websites may look unkindly on you taking their data. When fetching pages you may need to impersonate a web browser user agent so you aren't obvious. Also, be polite. Access pages slowly with random sleep intervals between fetches.

If you get an error fetching or parsing a page, never ever fetch the same page again right away. If you do, some day a bug will blow up, your scraper will loop, and it will spam page requests and make the server operator angry.
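
For the fetching side, something along these lines works (a sketch; the user-agent string and the delays are arbitrary):

    # Sketch of a polite fetcher: browser-ish user agent, random delay, no tight retry loop.
    import random
    import time
    import urllib.request

    HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; personal research crawler)"}

    def polite_fetch(url, min_delay=5, max_delay=15):
        time.sleep(random.uniform(min_delay, max_delay))    # don't hammer the server
        req = urllib.request.Request(url, headers=HEADERS)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except Exception as exc:
            # Log it and move on; never re-request the same URL in a loop.
            print("failed:", url, exc)
            return None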

BeautifulSoup is very slow. Like several seconds per 50k page. That may or may not matter to you depending on volume.

If you go the BeautifulSoup route, work from a good tree view of the HTML. Most HTML editors will give you one; so will BeautifulSoup's pretty-print. BeautifulSoup is really just a forgiving DOM-like system with lots of search capability, so you're going to be working with trees. You can often get to the data you want by searching for a tag with a CSS attribute to land in the right area of the page, but from there you're walking the tree.
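
In practice it looks something like this (the "results" class name is a placeholder):

    # Sketch: jump to a known anchor point via a class attribute, then walk the tree.
    from bs4 import BeautifulSoup

    with open("page1.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    table = soup.find("table", class_="results")   # land in the right area of the page
    for row in table.find_all("tr")[1:]:           # then walk it: skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        print(cells)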

If you're going to be scraping for more than a few days, invest in a monitoring system so you know immediately if the page format changes and breaks your parsers. BeautifulSoup isn't as brittle as regexp, particularly if you write the code carefully, but it's still going to break at some point.
posted by Nelson at 9:24 AM on July 30, 2007 [1 favorite]


You might like to take a look at Nutch. It's a Java-based open source search engine which uses Apache Lucene to index documents found online. I don't know if it will help with the analysis of the data gathered, but it can definitely cut down the time you spend spidering websites.

There are also tools such as Luke and Solr that allow you to do in-depth examinations of the index data that Lucene creates.
posted by talkingmuffin at 9:46 AM on July 30, 2007


You might want to try Solvent, part of the Simile Project.
posted by euphorb at 10:17 AM on July 30, 2007


Greenplum has created a massively parallel version of PostgreSQL. The latest version has a feature called web tables, which let you import remote structured data as a table and join it via SQL queries to internal tables and other web tables. It may be useful to you in your experiments.

Full disclosure: I have been a contractor for Greenplum in the past. Independent of that, though, I am quite impressed with the work they are doing. They are making their closed extensions available for free (as in free beer), so for your needs that might just work.

Although the technology is massively parallel and runs best on multi-machine clusters, it can also be run on a single machine while you're learning the software - you just don't get the advantage of the parallel stuff, that's all.
posted by nborwankar at 8:09 PM on July 30, 2007 [1 favorite]


twill is an excellent web-automation package for Python (which uses the aforementioned BeautifulSoup to parse its HTML). It's geared towards web testing, and comes with its own mini scripting language for that purpose, but it's very easy to access from a Python script as well if you need to. It could probably help you with a lot of the more tedious navigational aspects of spidering, and it can fill out forms for you as well.

As a note, if you are a polite netizen you should respect robots.txt when you crawl other people's sites; twill ignores it, but there's a simple parser you can use in the Python stdlib.
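
The stdlib check looks roughly like this (module path shown for current Python; in older versions it was just robotparser):

    # Checking robots.txt with the standard library before fetching.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("my-crawler", "http://example.com/apartments?page=1"):
        print("ok to fetch")
    else:
        print("robots.txt disallows this path")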
posted by whir at 5:42 PM on July 31, 2007


This thread is closed to new comments.