How do I scrape information from a website?
October 4, 2011 9:10 AM

I would like to pull some basic information from a site and export it into an Excel file (or similar), with each entry's details split across individual cells.

I'm a beginner at this and I literally don't even know where to start.

Right now I'm only worried about the text, but ultimately I would like to learn how to do that for images as well.

Basically, I want to take some information and put it into a spreadsheet so I can run some calculations on the numbers. Right now I have to do it manually, and while that's possible, I'd rather not have to do it every day (or at least want to minimize the time it takes).
posted by darkgroove to Computers & Internet (12 answers total) 10 users marked this as a favorite
 
I've never used it, but you could explore this: http://needlebase.com/
posted by nowoutside at 9:13 AM on October 4, 2011


I take it you are not a programmer at all? I tend to use Beautiful Soup in Python for this sort of thing, but doing it that way requires some basic programming skills.
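Just to give you a feel for what's involved, here's a rough, untested sketch that pulls the first HTML table on a page into a CSV file Excel can open. The URL, the table layout, and the filename are placeholders you'd swap for the real site, and it assumes the third-party beautifulsoup4 package is installed:

import csv
import urllib.request
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Placeholder URL: point this at the page you actually want to scrape.
url = "http://example.com/prices.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Assumes the data lives in the first <table> on the page; adjust the
# find() call (e.g. search by id or class) to match the real markup.
table = soup.find("table")

with open("scraped.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        writer.writerow(cells)

Run it once and open scraped.csv in Excel; each table cell lands in its own spreadsheet cell.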
posted by burnmp3s at 9:16 AM on October 4, 2011 [4 favorites]


If your needs are simple, you can actually scrape straight from Excel using a web query: http://www.mrexcel.com/tip103.shtml
posted by michaelh at 9:21 AM on October 4, 2011


This is the only thing I use Internet Explorer for - it can extract data from an HTML table into Excel! Open the site in Internet Explorer and right-click on the spot where the data is - you should see the option "Export to Microsoft Excel". The earlier commenters presented better long-term options, but this is a quick & dirty way that doesn't involve any programming or installation.
posted by beyond_pink at 9:21 AM on October 4, 2011


It really depends on what kind of data it is and how it's formatted.
posted by empath at 9:22 AM on October 4, 2011


Scraperwiki has pre-written scrapers for many sites.

OutWit Hub might also do what you want without the need for programming.
posted by James Scott-Brown at 9:25 AM on October 4, 2011


The first time I had to scrape a page, I used screen-scraper. Going through the tutorials was useful to understand the concepts behind scraping, and to get something functional cobbled together. Overall, I found it to be somewhat buggy, but good enough to get the job done.
posted by bessel functions seem unnecessarily complicated at 9:37 AM on October 4, 2011


Select the cells, Ctrl+C them, then go to Excel or Google Docs and paste.
posted by dripped at 11:02 AM on October 4, 2011


Response by poster: Excellent. I will try some of these out later.

I'm not a programmer and I don't have any formal knowledge, but I've been known to be able to 'read' HTML (to a very basic extent). So assuming the program is well documented, I should be able to figure it out.
posted by darkgroove at 11:48 AM on October 4, 2011


Response by poster: Needlebase looks like it does everything I need, but after spending about an hour with it, the results were less than impressive. One site wouldn't let me scrape it at all (Needlebase gave a warning: "this site does not allow scraping"). Next I tried a different site and it looked like everything would work as planned, but it was SO slow. We are talking 5 minutes to scrape two pages, which is unacceptable. It does still seem like a good option, though, so I'll do some trial and error.

Beautiful Soup. Looks complex, will explore.

OutWit Hub looked hard to use as well, but I didn't try to dig too deep with it.

These three options were great, thank you for that. None of the Excel options will work, as they are too basic. Additionally, I want the pages scraped daily, so manually copying and pasting anything isn't an option.
posted by darkgroove at 4:26 PM on October 4, 2011


Generally, in the past I have used this. Making these things work in the general case can actually be fairly difficult, depending on the site. I've also written JavaScript programs to do it, implemented as Firefox extensions or Greasemonkey scripts, for sites that require JavaScript to get all the data you want onto the page in the first place.

Alternatively, it is quite feasible to hire a programmer to write you a script that you run every day (or set up as a task that runs automatically every day).
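If you do go that route, the script itself doesn't need to be long. Here's an untested sketch of the general shape, using the same Beautiful Soup library mentioned above. The URL and the part that picks the numbers off the page are made up, and you'd hook it up to cron (or Windows Task Scheduler) to run once a day:

import csv
import datetime
import urllib.request
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

URL = "http://example.com/daily-numbers.html"   # placeholder page
OUTPUT = "daily_numbers.csv"                    # placeholder output file

def scrape():
    html = urllib.request.urlopen(URL).read()
    soup = BeautifulSoup(html, "html.parser")
    # Invented selector: replace with whatever elements actually hold your numbers.
    return [span.get_text(strip=True) for span in soup.find_all("span", class_="price")]

if __name__ == "__main__":
    # Append one dated row per run, so the spreadsheet builds up day by day.
    today = datetime.date.today().isoformat()
    with open(OUTPUT, "a", newline="") as out:
        csv.writer(out).writerow([today] + scrape())

Excel opens the resulting CSV directly, so the daily copy-and-paste step goes away.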
posted by tylerkaraszewski at 6:08 PM on October 4, 2011


I've written simple macros in VBA (which is built into Excel) that would open up web pages, find certain parts of the text and images, and then output them into certain cells in order to populate a database.

It didn't require anything super complex beyond reading up on the documentation for how to have VBA control Internet Explorer, plus some basic programming loops, but if you are unfamiliar with simple programming it might be a bit too hard for you.

I bet you could hire a programmer to write you a simple VBA macro in Excel. Excel macros have the advantage of an interface you are already used to: the programmer could make a button that you click which would automatically go and get all the information and then output it directly into the table you are using.
posted by vegetableagony at 9:15 PM on October 13, 2011

