Web scraping from our own database based on a CSV?
January 24, 2018 11:34 AM Subscribe

At my job we have information in a database but reports aren't available for every field. I'd like to use a list of record ID numbers in a CSV to visit a page and extract a piece of information to go back into the CSV (or in a new CSV but paired with the ID number) so I don't have to look up thousands of records and copy/paste the results. I don't know what platform to use/where to begin. Help is very much appreciated!

I asked this question a while ago; iMacros worked well for me until recently when they made it so you can't use CSVs unless you pay like $90 a month which there is no way my employer is going to do.

What's the best way to do extract information from our own database that we can only access through a website? The URL for each record is like "www.database.com/blah-blah-blah/contact-ID" and then on the page for the record is an email address with a status of something like subscribed or blocked or non-existing. I want to have a CSV with a list of contact records, run a program (or process or whatever), and have it look up each record and insert the email status into a CSV next to the number. I know this will take a while and I will build in wait times so I don't overload the system.

I like problem solving and puzzles so I'm happy to try to figure this out and in fact will probably enjoy it but how do I start? My programming experience is mostly theoretical and in browsers so even if I did write a program to do this I wouldn't really know how to run it on my computer.

Thank you so much for any help you can provide!

posted by Mrs. Pterodactyl to Computers & Internet (11 answers total)

This would be a relatively easy program to write in python.

Python itself can be downloaded from its website or installed via e.g. homebrew on OSX. Use the requests and beautiful soup libraries for fetching and parsing the data; python has built-in csv support.

The general structure of the program would be: open input CSV file, get the IDs and URLs you want to look up using python's csv support, fetch the data from the URL using requests, parse out the bit you want using beautiful soup, store all this info in a list or dict, upon completion write out either a separate CSV with just the data you fetched or an updated original + new data csv, done.
posted by beerbajay at 11:59 AM on January 24, 2018 [2 favorites]

Chiming in to beerbajay's answer, newcoder.io has tutorials with some well-explained Python code for (among other things) doing screenscraping and dealing with CSV files, so you might be able to figure something out with some of the code snippets there.
posted by A Robot Ninja at 12:05 PM on January 24, 2018

Yeah, ~~seconding~~ thirding Python. It's basically perfect for this kind of thing, and pretty easy to pick up even if you're a relatively novice programmer. I haven't used the above libraries, but those are pretty much the go-tos for interacting with web pages from Python. (Although there's also Selenium, which operates a little differently but will still let you extract data from a page. I've used it on a couple of small projects.)
posted by Mr. Bad Example at 12:05 PM on January 24, 2018

I'll try not to threadsit but I should say that I'd like to download as little on to my work computer as possible. Also, I tried to use python on a PC previously and getting it installed and working properly was EXHAUSTING. If python is really the best/only answer I'll do what I can but I'd especially appreciate anything with a low barrier to getting started.
posted by Mrs. Pterodactyl at 12:07 PM on January 24, 2018

Looks like the free version of iMacros will still work with CSVs, but only on internet explorer and 100 rows at a time, with 3 columns. Can you modify your workflow to work on IE?
posted by suedehead at 12:23 PM on January 24, 2018

> download as little on to my work computer as possible

If you mean the data from your internal site; python will 'forget' all of the downloaded data as soon as the program completes (or earlier depending on garbage collection). If you're worried about requests/beautiful-soup; these get installed as a part of the python installation, so they're not extra programs.

> getting it installed and working properly was EXHAUSTING

In my experience, setting up your development environment can be one of the more difficult parts of programming since it's dealing with a bunch of unknowns all at once rather than incremental improvement.

> low barrier to getting started

For 'real programming' python has a very low-barrier to entry. There are likely web-based tools that'd do some of this for you (I think yahoo pipes used to be able to do something like this), but they won't be able to access any sites on your office network, which is a good thing in general.

Another other option is to use macros or Visual Basic scripts inside of Excel which (I think) can download and parse html. VB is in my opinion harder to use than python, but this may integrate better into your workflow.
posted by beerbajay at 12:37 PM on January 24, 2018

Is there really no way to get your DBA or developers to run a query and give you the results? I know that in some organizations, this actually would be refused categorically, but I'd exhaust these possibilities before resorting to scraping the web to get my own data.
posted by thelonius at 12:54 PM on January 24, 2018

If your work will allow you to install the Linux Subsystem for Windows it makes installing and working with Python so much easier.
posted by coleboptera at 12:56 PM on January 24, 2018

To answer questions:

-We use an external database and can only access it via a website. I want to run these searches frequently and it takes the organization running the website a while to get back to me with stuff like this.

-I have to ask an external IT person, who is also not great about responding, to enter the administrator password to download anything on my work computer so the fewer things I have to download and install the better (so, for example, anything browser-based would be fantastic but I know that's not necessarily feasible).

With that in mind, Excel might be the answer, thanks! I'll look into that and see if I can use it before I try to get the IT company to let me install Python.
posted by Mrs. Pterodactyl at 12:58 PM on January 24, 2018 [2 favorites]

Check out OutWit Hub. It can be used as a standalone application, or as a Firefox add-on.

I've done exactly what you described, in similar circumstances. My method was to use Excel to build a list of the URLs I wanted data from (e.g. http://example.com/data/01.html, http://example.com/data/02.html, ...).

Then, I could use OutWit to "crawl" that page of links (I think it might be able to open a CSV, or I may have made an html page with the links).

Then you can set scraping rules, called a "catch" in OutWit terminology. It will pull out data from simple structures easily, or you can craft your own custom "scrapers" to pull out particular elements.
posted by reeddavid at 5:21 PM on January 24, 2018

I just remembered - if you already have Java installed on your computer you could also use OpenRefine for this. OpenRefine is massively useful for so many things.

Here is a reasonably in-depth tutorial on how to do the type of thing you're wanting to do. You don't need to actually install it - you can just run it from wherever you download it - but it does require that you already have Java installed. The trickiest bit will be the parsing of the HTML once you've fetched it but I've found trudging on with trial and error eventually gets it working.
posted by coleboptera at 9:25 PM on January 24, 2018

« Older Looking for examples of awesome-looking social... | Did getting a car in NYC make you happier? Newer »

This thread is closed to new comments.

Ask MetaFilter

Web scraping from our own database based on a CSV?
January 24, 2018 11:34 AM Subscribe

Tags

Share

Web scraping from our own database based on a CSV? January 24, 2018 11:34 AM Subscribe

Tags

Share

Web scraping from our own database based on a CSV?
January 24, 2018 11:34 AM Subscribe