Is this web scraping?
October 26, 2014 7:17 AM
So I'm doing a project where I want to do some analysis on a whole bunch of text from newspapers and magazines.
I want to download all of this text into a corpus, which I will then analyze. What's this called? Is this web scraping? I know how to do the analysis, but I do not know how to automatically download stuff from the web.
Thanks
Screen scraping is when you manually copy text and paste it into another program for use.
You might be able to get a spidering tool to help you with this, though.
posted by Riverine at 7:28 AM on October 26, 2014
It is. There is a Wikipedia article about it.
One way (of many) to download Web pages is to use wget.
One of the difficulties you'll probably run into sooner rather than later is separating "signal" (the content you're interested in for your project) from "noise" (the content you're not). HTML was designed for mark-up, with no thought for distinguishing content from, say, advertising.
Further, these days much content is delivered via JavaScript and Flash, which makes matters even more challenging.
posted by biersquirrel at 7:46 AM on October 26, 2014
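To illustrate the signal-versus-noise problem described above, here is a rough sketch that keeps only the text inside `<p>` tags and ignores scripts, styles, and everything else. It uses only Python's standard library. The class name `ParagraphExtractor`, the helper `extract_paragraphs`, and the sample HTML are made up for illustration, and the "article text lives in `<p>` tags" rule is an assumption that will not hold on every site.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text found inside <p> tags, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True while inside an open <p>
        self.skip_depth = 0        # >0 while inside <script> or <style>
        self.paragraphs = []       # finished paragraphs, in document order
        self._buffer = []          # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "p":
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self.in_paragraph and not self.skip_depth:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return a list of paragraph strings found in an HTML document."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page: an ad, a tracking script, and two real paragraphs.
SAMPLE = """
<html><body>
<div class="ad">Buy now!</div>
<p>First paragraph of the <b>article</b>.</p>
<script>var tracking = 1;</script>
<p>Second   paragraph.</p>
</body></html>
"""

for paragraph in extract_paragraphs(SAMPLE):
    print(paragraph)
```

Real news sites usually need a per-site rule (a specific container element or class) rather than a blanket "all `<p>` tags" filter, and pages that build their content with JavaScript will not expose it to a parser like this at all.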
Response by poster: Thanks! I use R a lot, but I'm thinking of doing this in Python because it all seems easier and there seem to be better documentation, tutorials, and packages for doing it in Python.
posted by MisantropicPainforest at 7:52 AM on October 26, 2014
If you're doing scraping, I suggest you check out Kimono Labs and Parsehub, both of which are web/GUI-driven scraping solutions that look pretty intelligent and robust.
posted by suedehead at 8:17 AM on October 26, 2014
Python is a good choice for downloading lots of news articles. I believe the current favored library is Scrapy. You'll find a lot of references to BeautifulSoup, too, which is not bad but slow and perhaps outdated.
Node.js is another language option; that can be nice because it makes it easy to run some JavaScript in the page while scraping.
posted by Nelson at 9:13 AM on October 26, 2014 [2 favorites]
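For the "automatically download stuff" part of the question, a bare-bones version of the download step might look like the sketch below. It deliberately uses only Python's standard library (`urllib`) rather than Scrapy or any other package named in this thread, so it is a stand-in for the idea and not how those libraries work. The function names, the `User-Agent` string, the one-second delay, and the hash-based file naming are all assumptions chosen for illustration.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def fetch(url, timeout=30):
    """Download one page and return its body decoded as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "corpus-builder-example"}  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def build_corpus(urls, out_dir, fetcher=fetch, delay=1.0):
    """Save the HTML of each URL into out_dir and return a {url: path} dict.

    `fetcher` can be swapped out (for example in tests) and `delay` is the
    pause in seconds between requests, so the target site is not hammered.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # pause between requests, not before the first
        # Derive a stable, filesystem-safe file name from the URL.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
        path = out_dir / name
        path.write_text(fetcher(url), encoding="utf-8")
        saved[url] = path
    return saved
```

Usage would be something like `build_corpus(list_of_article_urls, "corpus/")`, where `list_of_article_urls` is a list you supply. Saving the raw HTML first and extracting text in a separate pass means you can change the extraction rules later without downloading everything again.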
I've done a fair bit of Python scraping using Requests + BeautifulSoup, and that's what I'd recommend. Another possibility, which I haven't tried yet, could be Portia.
posted by daisyk at 9:19 AM on October 26, 2014
Make sure that whatever you're doing is OK with the webmasters of the site, before they block you.
posted by dilaudid at 9:47 AM on October 26, 2014 [2 favorites]
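One partial, automated way to respect a site's wishes is to read its robots.txt file before fetching anything. robots.txt is not mentioned in this thread, and it is not a substitute for reading the site's terms or asking permission, so treat this as one possible sketch. It uses `urllib.robotparser` from Python's standard library; the sample rules, the helper name `make_robot_checker`, and the user agent string are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "corpus-builder-example"  # hypothetical user agent name
robots = make_robot_checker(SAMPLE_ROBOTS)

# Ask whether a given URL may be fetched by this user agent.
print(robots.can_fetch(AGENT, "https://example.com/news/story.html"))
print(robots.can_fetch(AGENT, "https://example.com/private/notes.html"))

# Seconds the site asks crawlers to wait between requests, or None if unset.
print(robots.crawl_delay(AGENT))
```

Against a live site you would call `set_url(...)` and `read()` on the parser instead of passing in text, then check `can_fetch` for each URL before downloading it.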
Depending on your use case, Microsoft Excel has a little-known feature called Web Query that does this.
posted by anti social order at 10:23 AM on October 26, 2014 [1 favorite]
Depending on how much you are scraping and how fully you want to automate your workflow, one of the Chrome extensions might work, or maybe OutWit Hub for Firefox (which is what I've used for similar projects).
posted by pantarei70 at 12:09 PM on October 26, 2014
This thread is closed to new comments.
posted by travelwithcats at 7:28 AM on October 26, 2014 [1 favorite]