Webpage to database for a nonprogrammer?
March 29, 2007 12:16 PM

How can a nonprogrammer make a database from a set of search query results?

I need to search the CRISP database and get the list of funded grants for a set of queries, then follow the link to each grant in the search results and copy some data from the grant details page into a spreadsheet or CSV file. The fields would be things like Name, Grant number, Grant title, etc. I don't know any Java, Perl, or anything else that I understand is usually employed for this kind of thing. I've messed around with some of the online web page scraper tools, but they seem to be mostly geared toward making RSS feeds for sites that don't offer them, or they assume familiarity with Java or Visual Basic. How would you recommend a nonprogrammer do this?
posted by Mr. Gunn to Computers & Internet (14 answers total) 4 users marked this as a favorite
 
What you're asking for unavoidably entails some heavy-duty coding; the script will need to parse the website markup to find the information you require. Any solution will need to be custom-made or heavily modified.
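
To give you a sense of what's involved, here's a rough sketch (in Python, purely for illustration) of what "parsing the markup" of a single grant page amounts to. The URL, the CSS selectors, and the field names are all invented; a real script would have to match CRISP's actual HTML:

    # Illustration only: the URL, selectors, and field names are placeholders,
    # not CRISP's real markup.
    import requests
    from bs4 import BeautifulSoup

    def grant_fields(url):
        """Fetch one grant detail page and pull out the fields of interest."""
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        return {
            "name": soup.select_one(".pi-name").get_text(strip=True),          # guessed selector
            "grant_number": soup.select_one(".grant-number").get_text(strip=True),
            "title": soup.select_one(".grant-title").get_text(strip=True),
        }

    print(grant_fields("https://example.gov/crisp/grant/12345"))  # hypothetical URL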

But there is one way a non-coder might be able to see it done with a minimum of effort. You'll probably need more than one, though.
posted by The Confessor at 3:43 PM on March 29, 2007 [1 favorite]


I wouldn't recommend a non-programmer do this. However, I think one of the first questions I ever saw on ask.metafilter dealt with something very similar to what you want, so you might try searching the archives.
posted by rhizome at 3:46 PM on March 29, 2007


This is sorta kinda like asking "how can a non-electrician install a breaker box?" only with less possibility of death.

A non-programmer can't (The Confessor's suggested tack notwithstanding).
posted by Zed_Lopez at 4:04 PM on March 29, 2007


I'm completely out of my element here, but I note that some of this information is available directly in CSV format. Granted, even if you can find a relevant CSV on the OER site, you'd need to find a way to approximate your CRISP searches within your spreadsheet program, which could be awkward.
posted by coined at 4:20 PM on March 29, 2007


Response by poster: Well, I have been able to use the Solvent extension to generate the necessary JavaScript to let the PiggyBank extension scrape the pages. Unfortunately, I still have to visit each results page in order to run my CRISP scraper. Now I just need to figure out how to first grab a list of links, then run this script programmatically on each.

Hopefully I can continue to cheat death.
posted by Mr. Gunn at 4:49 PM on March 29, 2007


Best answer: OK, I might be able to help you. What are the search criteria? (Email me if you want this to remain private).
posted by Civil_Disobedient at 5:08 PM on March 29, 2007


As long as you realize that you're running a serious risk of turning into a programmer by doing all this, you'll be fine.
posted by flabdablet at 5:14 PM on March 29, 2007


It's too late for me.

Anyway, I figured out how to get the max results in one go, so just feed me the criteria and I'll get you the data.
posted by Civil_Disobedient at 5:20 PM on March 29, 2007


Best answer: Dapper?
posted by wackybrit at 7:56 PM on March 29, 2007


Response by poster: Civil_Disobedient: "What are the search criteria?"

Thanks for your kind offer, C_D. Sorry I couldn't reply sooner; I was getting ColdFusion errors trying to load this page.

The search parameters I'm using are "mesenchymal stem cell", "Adult stem cell", "marrow stromal cell", "stromal stem cell", and "msc stem cell". Each of those phrases goes in a separate search using AND logic, no stemming, with % or All in the other fields. Each query generates around 300-400 results, and they overlap. The thesaurus function is, in theory, the way to do this instead of running redundant, slightly different queries, but in practice it's useless. PiggyBank will show me the number of times each name or grant number has been saved, so that would let me make a list of unique grant numbers, but there's probably a way to do it by just converting the RDF to CSV and sorting by name in Excel, which is how this non-programmer wrangles data currently.
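
(For what it's worth, the dedup step I'm imagining looks roughly like this, assuming each query ends up as its own CSV with a grant-number column; the file pattern and the column name below are placeholders, not anything the tools actually produce:)

    # Merge the per-query CSVs and keep one row per grant number.
    # The file pattern and the "grant_number" column name are placeholders.
    import csv
    import glob

    seen = set()
    unique_rows = []
    for path in sorted(glob.glob("crisp_query_*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["grant_number"] not in seen:
                    seen.add(row["grant_number"])
                    unique_rows.append(row)

    if unique_rows:
        with open("grants_unique.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(unique_rows[0].keys()))
            writer.writeheader()
            writer.writerows(unique_rows)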

Civil_Disobedient: "Anyway, I figured out how to get the max results in one go, so just feed me the criteria and I'll get you the data."

In the spirit of "teaching a man to fish," would you send me the script you're using, or could you perhaps point me in the right direction with my PiggyBank script? I can use Solvent to generate a script to get all the links, and I have the scraper for the details page; I just don't know how to combine the two.

flabdablet: "As long as you realize that you're running a serious risk of turning into a programmer by doing all this, you'll be fine."

That wouldn't be a bad thing. A little functional scripting ability would make my life much easier. I'm not doing hardcore bioinformatics in my work, but I'm drifting towards experiments that generate fairly large data sets, and Excel's charting and statistics capabilities are becoming less and less useful. If we had someone in our group who did hardcore bioinformatics, I could work with them and wouldn't have to learn all this on my own, but hell, I'm still explaining to these guys why it sucks to move data around the lab on Zip disks.
posted by Mr. Gunn at 6:02 AM on March 30, 2007


Response by poster: wackybrit: "Dapper?"

I tried Dapper; if you look, there's a CRISP "Dapplication" that I was playing with (it might still be sandboxed), and I might try it again, but the problem I ran into was that I didn't know how to deal with submitting the parameters for the drop-down boxes or with fetching the details pages afterward. The form is kind of weird, and it requires every box to be filled in or it just hangs on searching forever. Maybe I could look at the URL it's sending and pick out what to use in Dapper, so I guess it's worth another try, but I'm not sure how to go about fetching the details pages after that. It would be really nice to get a notification when a new entry was added, though.
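
(If I do give it another shot, my plan is to copy whatever the search form actually sends and replay it directly, which I gather would look something like this; every parameter name and value below is a guess standing in for whatever shows up in the real URL:)

    # Replay the search form directly instead of filling it in by hand.
    # Every parameter name and value here is a stand-in for whatever the
    # real CRISP form puts in its request.
    import requests

    params = {
        "query_text": "mesenchymal stem cell",  # hypothetical field names
        "search_logic": "AND",
        "fiscal_year": "All",
        "activity": "All",
    }
    resp = requests.get("https://example.gov/crisp/results", params=params)
    print(resp.url)          # the assembled URL, to compare against what the browser sends
    print(resp.status_code)  # quick check that the request went through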
posted by Mr. Gunn at 6:07 AM on March 30, 2007


Best answer: I finally got it figured out. I had to splice the code that scrapes the details pages into the code that scrapes the page of search result links. Since I really don't know a lick of JavaScript, I basically had to stick bits of code in different places in the template code they provided, try to rename variables, and just see what happened.
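
(The actual code is Solvent-generated JavaScript running inside PiggyBank, so I won't paste it here, but the shape of the splice is basically this, written out in Python with every URL and selector made up:)

    # Structural outline of the spliced scraper, not the actual Solvent/PiggyBank
    # JavaScript; every URL and selector below is invented.
    import csv
    import requests
    from bs4 import BeautifulSoup

    def scrape_detail(url):
        """The original details scraper: pull the fields off one grant page."""
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        return {
            "grant_number": soup.select_one(".grant-number").get_text(strip=True),
            "title": soup.select_one(".grant-title").get_text(strip=True),
        }

    def scrape_results(url):
        """The link scraper, with the details scraper spliced into its loop."""
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        return [scrape_detail(a["href"]) for a in soup.select("a.result-link")]

    rows = scrape_results("https://example.gov/crisp/results?terms=msc+stem+cell")
    with open("grants.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["grant_number", "title"])
        writer.writeheader()
        writer.writerows(rows)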

Then I used Babel to convert the data and used their "preview in Exhibit" to display it in CSV. You can see my CRISP Multi Scraper here.

Although the documentation for Solvent is OK, I don't think I could have done it without wingerz's Solvent tutorial, so if anyone knows him, tell him thanks for me.
posted by Mr. Gunn at 3:09 PM on March 30, 2007


Hey, Mr. Gunn, glad to hear you got it figured out! I just got your message after work today. :)
posted by Civil_Disobedient at 4:26 PM on March 30, 2007


Congrats on cheating death! (Are you sure you're not a programmer by now?)
posted by Zed_Lopez at 10:19 PM on March 30, 2007

