Is this web scraping?
October 26, 2014 7:17 AM

So I'm doing a project where I want to do some analysis on a whole bunch of text from newspapers and magazines.

I want to download all this text into a corpus, which I will then analyze. What's this called? Is this web scraping? I know how to do the analysis, but I do not know how to automatically download stuff from the web.

Thanks
posted by MisantropicPainforest to Technology (10 answers total) 14 users marked this as a favorite
 
Yes, that's scraping. Are you familiar with R?
posted by travelwithcats at 7:28 AM on October 26, 2014 [1 favorite]


Screen scraping is when you manually copy text and paste it into another program for use.

You might be able to get a spidering tool to help you with this, though.
posted by Riverine at 7:28 AM on October 26, 2014


It is. There is a Wikipedia article about it.

One way (of many) to download Web pages is to use wget.
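
If you would rather stay inside Python for the download step, here is a minimal sketch of the same idea using only the standard library. The names `filename_for` and `download_all`, the `corpus_raw` directory, and the one-second pause are assumptions made for illustration, not anything a particular site requires.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def filename_for(url: str) -> str:
    """Derive a stable, filesystem-safe file name from a URL (hypothetical helper)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"


def download_all(urls, out_dir="corpus_raw", delay_seconds=1.0):
    """Save the raw bytes of each URL under out_dir and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = out / filename_for(url)
        if not target.exists():  # skip pages already fetched on an earlier run
            with urllib.request.urlopen(url, timeout=30) as response:
                target.write_bytes(response.read())
            time.sleep(delay_seconds)  # pause between requests to go easy on the server
        saved.append(target)
    return saved
```

You would call it with your own list of article URLs, for example `download_all(my_urls)`. Keeping the raw HTML on disk means you can redo the text extraction later without downloading everything again.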

One of the difficulties you'll probably run into sooner rather than later is separating "signal" (the content you're interested in for your project) from "noise" (the content you're not). HTML was designed for mark-up, with no thought for distinguishing content from, say, advertising.

Further, these days much content is delivered via JavaScript and Flash, which make matters even more challenging.
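
To make the signal/noise problem concrete, here is a rough sketch that keeps only the text inside paragraph tags and drops scripts, styles, and navigation. It uses Python's built-in `html.parser`; the class name `ParagraphExtractor`, the function `extract_text`, and the list of tags to skip are assumptions for illustration. Real news sites will need site-specific rules, and this will not see content that is loaded by JavaScript.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text found inside <p> tags, ignoring common non-content blocks."""

    # Tags whose contents are treated as noise (an assumption; tune per site).
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._in_p = False
        self._skip_depth = 0

    def _flush(self):
        # Collapse runs of whitespace and store the paragraph if it is non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            if self._in_p:  # previous <p> was never closed
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_p:
            self._flush()

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buffer.append(data)


def extract_text(html: str) -> str:
    """Return the page's paragraph text, one paragraph per blank-line-separated block."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Feeding it a saved page, for example `extract_text(open("page.html", encoding="utf-8").read())`, gives you plain text you can add to the corpus.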
posted by biersquirrel at 7:46 AM on October 26, 2014


Response by poster: Thanks! I use R a lot, but I'm thinking of doing this in Python because it all seems easier and there seem to be better documentation/tutorials/packages for doing it in Python.
posted by MisantropicPainforest at 7:52 AM on October 26, 2014


If you're doing scraping, I suggest you check out Kimono Labs and Parsehub, both of which are web/GUI-driven scraping solutions that look pretty intelligent and robust.
posted by suedehead at 8:17 AM on October 26, 2014


Python is a good choice for downloading lots of news articles. I believe the current favored library is scrapy. You'll find a lot of references to BeautifulSoup, too, which is not bad but slow and perhaps outdated.

Node.js is another language option; that can be nice because it makes it easy to run some JavaScript in the page while scraping.
posted by Nelson at 9:13 AM on October 26, 2014 [2 favorites]


I've done a fair bit of Python scraping using Requests + BeautifulSoup, and that's what I'd recommend. Another possibility, which I haven't tried yet, could be Portia.
posted by daisyk at 9:19 AM on October 26, 2014


Make sure that whatever you're doing is OK with the webmasters of the site, before they block you.
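
One quick check, which does not replace reading the site's terms or asking permission, is the site's robots.txt file. Below is a minimal sketch using Python's standard `urllib.robotparser`; the helper names `build_rules` and `allowed_urls` and the user agent string `corpus-bot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls, user_agent="corpus-bot"):
    """Keep only the URLs that the rules permit this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]
```

Against a live site you would instead call `set_url(...)` with the address of its robots.txt and then `read()` on the `RobotFileParser` before calling `can_fetch`. If the file sets a crawl delay, `rules.crawl_delay(user_agent)` returns it, so you can wait that long between requests.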
posted by dilaudid at 9:47 AM on October 26, 2014 [2 favorites]


Depending on your use case, Microsoft Excel has a little-known feature called Web Query that does this.
posted by anti social order at 10:23 AM on October 26, 2014 [1 favorite]


Depending on how much you are scraping and how fully you want to automate your workflow, one of the Chrome extensions might work, or maybe OutWit Hub for Firefox (which is what I've used for similar projects).
posted by pantarei70 at 12:09 PM on October 26, 2014

