Scraping & saving?
November 19, 2009 3:13 PM

I have perhaps a thousand Delicious links (to documents in the SEC database). All of them could break at any time if the SEC changes the way it displays these documents. How do I automate the process of copying the contents of those documents so I can save them in a database?

I have checked previous questions and looked at web scraping software, but the web scraping/crawling/spidering software out there requires what looks a little too much like programming to me.

Is there an easy way to collect the documents? I am hoping to feed something the list of links and be done. Fair warning: if I can figure this out, then I will ask how best to save the documents in a database. I have considered using Mechanical Turk, or something, but I think this really ought to be a job for a machine. Free software solutions preferred, but willing to pay to make it easy for me to do...

Sample document:
posted by extropy to Computers & Internet (12 answers total) 4 users marked this as a favorite
Response by poster: sample document
posted by extropy at 3:15 PM on November 19, 2009

You don't indicate where your computer skills lie in the spectrum, but it sounds like what you want is a script that, for each link in your list, runs wget to suck down a copy of the page.
posted by axiom at 3:22 PM on November 19, 2009
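
A minimal sketch of the loop described above, written in Python with the standard library's urllib instead of wget. It assumes a plain text file with one URL per line; the function names `local_name` and `fetch_all`, the numeric filename prefix, and the one-second default delay are illustrative choices, not anything from the thread.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_name(url: str, index: int) -> str:
    """Build a local filename from the last path segment of the URL."""
    name = Path(urlparse(url).path).name or "index.html"
    # Prefix with the list position so two URLs ending in the same name do not collide.
    return f"{index:04d}_{name}"


def fetch_all(list_file: str, out_dir: str, delay_seconds: float = 1.0) -> list[str]:
    """Download every URL in list_file (one per line) into out_dir.

    Returns the URLs that could not be downloaded.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = Path(list_file).read_text().splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    failed = []
    for index, url in enumerate(urls, start=1):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                (out / local_name(url, index)).write_bytes(response.read())
        except (OSError, ValueError) as error:
            # Network and HTTP errors are OSError subclasses; a malformed URL raises ValueError.
            print(f"failed: {url} ({error})")
            failed.append(url)
        time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return failed
```

Calling something like `fetch_all("links.txt", "sec_docs")` (both names are placeholders) would then save one file per link and return any links worth retrying.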

Oh, and to address the database issue: why? Why not just store the files on disk in the normal fashion?
posted by axiom at 3:23 PM on November 19, 2009

Well, sed and regex may be too much like programming to you, but they're likely ideal. You need to download the list of links and feed it to ftp in a script.

But it sounds like you want something along the lines of WS_FTP.
posted by dhartung at 3:24 PM on November 19, 2009

Oh, and you'll want document management software, such as KnowledgeTree, to keep track of them once you have them.

I'm unsure why you think that the SEC is likely to change their format abruptly. They've been internet savvy for a long while. (I am also assuming a document, once filed at the SEC, remains unmodified. This is an issue of concern when making a local mirror.)
posted by dhartung at 3:30 PM on November 19, 2009

Best answer: This might be easier:

1. Export all your delicious bookmarks to an .html file (takes about 2 clicks under the Delicious "settings" tab) on your desktop.
2. Download the "DownThemAll" extension into Firefox.
3. Point DTA at the delicious .html file and have DTA spider all of the links.
4. Save resulting file to PDF or some other searchable archive.

Here's a link to the DTA manual on their spidering functions:
posted by webhund at 3:30 PM on November 19, 2009 [1 favorite]
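
For anyone who would rather turn the exported file into a plain list of links (for wget or a script), here is a rough Python sketch. It assumes the export is an ordinary HTML bookmark file in which each bookmark is an `<A HREF="...">title</A>` tag; `BookmarkLinks` and `extract_links` are made-up names for illustration.

```python
from html.parser import HTMLParser


class BookmarkLinks(HTMLParser):
    """Collect (url, title) pairs from the <a href="..."> tags in a bookmark export."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (url, title) pairs
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches too.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(export_html: str):
    """Return a list of (url, title) tuples found in the export."""
    parser = BookmarkLinks()
    parser.feed(export_html)
    parser.close()
    return parser.links
```

Writing the first element of each pair to a text file, one per line, gives a list suitable for feeding to a downloader; the titles could be kept for naming the saved files.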

Are the filenames unique?

If so, paste all links into a text file, one per line, then:

wget -i textfile.txt
posted by pompomtom at 3:43 PM on November 19, 2009
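
The uniqueness question can be checked before downloading. The Python sketch below approximates the naming by taking the last path segment of each URL and ignoring any query string, which is a simplification rather than wget's exact rule; `duplicate_filenames` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def duplicate_filenames(urls):
    """Return {filename: count} for names that more than one URL would produce."""
    names = [PurePosixPath(urlparse(url).path).name for url in urls]
    return {name: count for name, count in Counter(names).items() if count > 1}
```

An empty result suggests the names are unique; anything else lists the names that would clash and how many links share each one.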

Seconding wget. If you're on Windows, you'll need to install Cygwin and select wget during the install.
posted by sanko at 4:15 PM on November 19, 2009

wget is what I'd use too -- there is a Windows version that is just a standard exe, so Cygwin is not required. It means using the command line, but it is not difficult, and it is reliable, predictable and tweakable (you can throttle bandwidth, choose a delay between files, use passwords and more).
posted by Quinbus Flestrin at 5:19 PM on November 19, 2009

Response by poster: OK, I am going to start trying with DownThemAll. I have installed cygwin and wget as a Plan B.

... (20 minutes later with DownThemAll): Wow! Done!

Took me a while to figure out how to do the "pointing", but I have downloaded almost a thousand documents, named based on my Delicious descriptions, into a directory on my computer. That is something I have wanted to do, but feared, for years.

Thanks everyone for the help.

Next: the database!
posted by extropy at 6:00 PM on November 19, 2009
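
On the database step, one possible starting point is SQLite, which ships with Python. This is only a sketch under assumptions: the single `documents` table, its two columns, and the name `store_documents` are invented for illustration, and it stores each downloaded file whole, keyed by filename.

```python
import sqlite3
from pathlib import Path


def store_documents(doc_dir: str, db_path: str) -> int:
    """Copy every file in doc_dir into a SQLite table; return the number of files stored."""
    connection = sqlite3.connect(db_path)
    try:
        connection.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            " filename TEXT PRIMARY KEY,"
            " content BLOB NOT NULL)"
        )
        files = sorted(p for p in Path(doc_dir).iterdir() if p.is_file())
        with connection:  # commits on success, rolls back if an insert fails
            for path in files:
                # INSERT OR REPLACE makes re-running the import safe for existing names.
                connection.execute(
                    "INSERT OR REPLACE INTO documents (filename, content) VALUES (?, ?)",
                    (path.name, path.read_bytes()),
                )
        return len(files)
    finally:
        connection.close()
```

Because the filename is the primary key, running it again over the same directory updates rows instead of duplicating them. Keeping the files on disk as well, as suggested earlier in the thread, remains an option.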

If you're a Mac user, try DevonThink Pro (or DT Pro Office). Assuming you have these documents for research purposes of some sort, don't waste your time creating tags, indexing them, manually entering database fields, etc... Use DT's extremely fast and thorough search and AI functions instead.

Note: I'm not connected with DevonThink at all; just a very happy user.
posted by webhund at 11:57 AM on November 20, 2009

Seconding DT if you're a Mac user. You could easily script it to do this and save the URIs as PDFs, web archives, etc. Plus it will be searchable and can be organized.
posted by Brian Puccio at 5:12 PM on November 20, 2009
