Scraping the web
February 16, 2011 11:54 AM

I keep hearing about web scraping. What tools are used to scrape content from the web?

Let's say I wanted to teach myself to scrape data from the web. What are the tools that one uses for that?

Are there any online tutorials out there for people interested in learning how this works?

posted by dfriedman to Computers & Internet (12 answers total) 29 users marked this as a favorite
A scripting language (e.g., Python), basic fluency in regular expressions, and knowing how to unravel the structure of XML documents will get you pretty far. It really depends on what you want to scrape. Basically, if you can look at the source code of a web page and deduce a reliable rule about where your particular data is located, then it's pretty easy to program a computer to follow that rule and extract the data. If you want to get high-volume and large-scale, things become harder (e.g., "scrape all of the NASDAQ stock prices ten times per second").
posted by qxntpqbbbqxl at 12:03 PM on February 16, 2011
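
[A minimal sketch of the "deduce a rule, then extract" idea from the answer above, using only Python's standard library. The sample markup, the class names, the ticker symbols and the prices are all made up for illustration; a real page would need its own rule.]

```python
import re

# Hypothetical page source: markup, class names and values are invented.
SAMPLE_HTML = """
<table>
  <tr><td class="symbol">AAA</td><td class="price">10.50</td></tr>
  <tr><td class="symbol">BBB</td><td class="price">20.25</td></tr>
</table>
"""

# The "rule": a price cell always directly follows its symbol cell.
ROW_RULE = re.compile(
    r'<td class="symbol">([^<]+)</td>\s*<td class="price">([0-9.]+)</td>'
)


def extract_prices(html):
    """Return a dict mapping symbol to price for every row matching the rule."""
    return {symbol: float(price) for symbol, price in ROW_RULE.findall(html)}


if __name__ == "__main__":
    print(extract_prices(SAMPLE_HTML))
```

[A rule like this is only as reliable as the page layout it was written for, so it will likely need adjusting whenever the markup changes.]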

You're going to need a scripting language of some sort, but there are now plenty of packages that will make your job easier. I've usually done scraping in Perl with my own regular expressions, but I've heard that Python plus Beautiful Soup will make your life a lot easier.
posted by eisenkr at 12:09 PM on February 16, 2011 [2 favorites]

I don't have recommendations for particular tools, but if you're building your own, an excellent understanding of regular expressions and DOM parsing is going to cover most cases.
posted by zippy at 12:19 PM on February 16, 2011
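
[One possible illustration of structure-aware parsing, as opposed to pure regular expressions. This sketch uses Python's built-in html.parser module, which is event-based rather than a full DOM; the class name LinkCollector and the helper collect_links are hypothetical names chosen for this example.]

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    """Parse an HTML string and return the links found, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

[Because the parser tracks tags instead of matching raw text, nested markup inside a link (such as bold text) does not break the extraction.]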

Best answer: Also check out ScraperWiki and its tutorials (languages they help you with: Python, PHP, & Ruby).
posted by brainwane at 12:32 PM on February 16, 2011 [4 favorites]

Beautiful Soup is so easy it feels like cheating.
posted by contraption at 1:07 PM on February 16, 2011 [1 favorite]

I actually went to a talk about this last week, because I was curious. Slides are here, under "R and Collecting Internet Data".
posted by madcaptenor at 1:09 PM on February 16, 2011

My go-to tools in recent years have been Ruby and hpricot.
posted by Zed at 1:45 PM on February 16, 2011

Most HTML responds to wget, grep and sed. If it's really messy, you want to pipe it through tidy (which is kind of what BeautifulSoup does). wget allows you to spoof user agents and fiddle with cookies for those "difficult" sites. Most custom Google maps are trivial to convert to geodata. Tabular PDFs respond well to pdf2xml, plain ones to pdftotext. Really intractable protected PDFs need printing to image files and OCR.

Really accurate scrapers seldom survive web redesigns. Data accuracy is usually in inverse proportion to data parser elegance.
posted by scruss at 2:09 PM on February 16, 2011
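
[The user-agent and cookie handling mentioned for wget can be approximated in Python's standard library. This is a rough sketch, not a wget equivalent; make_opener and fetch are hypothetical helper names, and the user-agent string is whatever the caller chooses to send.]

```python
import http.cookiejar
import urllib.request


def make_opener(user_agent):
    """Build an opener that sends a custom User-Agent and remembers cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib/x.y" header with the chosen string.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch(opener, url, timeout=10):
    """Request a URL through the opener and return the body as text."""
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

[Cookies set by one response are stored in the jar and sent back on later requests made through the same opener, which is usually what sites with login or session pages expect.]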

You can also use Yahoo! Pipes to pull in RSS feeds and the text of websites and output an RSS feed containing the scraped data you want from blogs and other websites with serialized content. There's a regular expressions module in there, as well as modules that'll organize feeds on the basis of pretty much any other field available in RSS.
posted by limeonaire at 4:13 PM on February 16, 2011 [1 favorite]
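
[The same filter-a-feed-with-a-regular-expression idea can also be sketched locally in Python. This is not how Yahoo! Pipes works internally; the sample feed, its URLs and the helper name filter_items are made up for illustration, and the code assumes a plain RSS 2.0 feed without namespaces.]

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; titles and links are invented.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Scraping with Python</title><link>http://example.invalid/1</link></item>
  <item><title>Gardening tips</title><link>http://example.invalid/2</link></item>
</channel></rss>"""


def filter_items(feed_xml, pattern):
    """Return (title, link) for each feed item whose title matches the pattern."""
    rule = re.compile(pattern, re.IGNORECASE)
    root = ET.fromstring(feed_xml)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        if rule.search(title):
            matches.append((title, item.findtext("link", default="")))
    return matches


if __name__ == "__main__":
    print(filter_items(SAMPLE_FEED, r"scrap"))
```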

It may not turn up on searches because they call it "spidering", but O'Reilly has a whole book about this.

Personally, I use Perl and WWW::Mechanize.
posted by AmbroseChapel at 5:47 PM on February 16, 2011

Essentially, you need to write a piece of software that pretends to be a web browser: it needs to (1) issue HTTP requests and (2) do something with the data that is returned. As long as we're throwing out specific tool suggestions, I like Groovy, HTTPBuilder and TagSoup.
posted by primer_dimer at 3:05 AM on February 17, 2011
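
[The two steps described above can be sketched in Python (rather than Groovy) with the standard library. The function names http_get and page_title are hypothetical, and pulling out the page title is just one example of "doing something" with the response.]

```python
import re
import urllib.request

# Matches the contents of the first <title> element, across line breaks.
TITLE_RULE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def http_get(url):
    """Step 1: issue an HTTP request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def page_title(url, fetch=http_get):
    """Step 2: do something with the returned data (here, extract the title)."""
    match = TITLE_RULE.search(fetch(url))
    if match is None:
        return None
    # Collapse runs of whitespace so multi-line titles come back on one line.
    return " ".join(match.group(1).split())
```

[Passing the fetch function as a parameter keeps the two steps separate, so the extraction logic can be tried on saved or hand-written HTML without touching the network.]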

Response by poster: Thanks for all the suggestions. ScraperWiki looks like the perfect place for me to get started, as I'm teaching myself Python already.

posted by dfriedman at 8:47 AM on February 17, 2011
