What is this magical OAI of which they speak?
June 9, 2010 7:25 AM
Help me acquire some data from an OAI/PMH database. For science!
My university has a dataset I would like to mine (a coauthorship network in particular). Basically, what I would like is to produce a text file formatted as
author1, author2, ..., authorN, journal name
with maybe some other data to be included. Once I get the data, I can do all kinds of fun math! All of the data is available via OAI-PMH, so I just need to craft some sort of query to get that data. I was sent a link to the OAI protocol, and a sample query (warning: slow load time) that gives me a large quantity of data.
Unfortunately, I am a bit lost on how to structure my query. Entirely lost, in fact. I can get back to the guys who maintain the database for more help, but I'd rather have a slightly better understanding of what is going on before I bother them. So, I have a couple of questions:
(1) Where can I read a bit more about how to structure these queries? The difficulty is that it needs to be a bit more basic than the protocol link above, which I found entirely unhelpful. What kinds of keywords can I put into Google to make this make sense? It seems like it must be simple to extract "author," right?
(2) Once I get this data, how do I convert it into a format I can use? The example is quite messy, and to use the data I would need it in a more regular format. The protocol link seems to suggest that I will end up with XML-formatted data, which I also don't really know how to work with.
Any wisdom you could impart would be greatly appreciated, but mostly I'm looking for a good place to start to understand this for myself. Please help me figure out the basics, so I don't have to ruin some poor developer's day with stupid questions!
tl;dr: How do I convince Google to teach me to use OAI? And how do I format the results once I get them from the database?
I wrote an OAI-PMH interface, but it was years ago so I'm a little rusty.
As I recall, OAI-PMH is not really a "database" system. That is, you can't use it to query data in the normal sense. Instead, it's a method by which metadata-sharing systems can synchronize. So if System A has had some new data loaded into it, System B can come along and say "give me metadata for all records created in the last week."
One possibility for you would be to extract/save all the metadata (which may be what your sample query above is doing) and extract the stuff you want from that dataset. OAI-PMH uses XML for its responses, so instead of Googling for info about the protocol you'd look for info about extracting data from XML documents. For that, you'd probably want to use XSLT. I don't think there's an option to not use XML, unfortunately, but I could be wrong about that.
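If XSLT feels like too much to learn at once, the same extraction can be sketched in a few lines of Python with the standard library. This is only a rough sketch under some assumptions: that the repository serves plain Dublin Core (`oai_dc`), that authors sit in `dc:creator`, and that the journal name sits in `dc:source`. That last one varies between repositories, so check a real record first. The sample XML and the function name `records_to_rows` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample response, trimmed to the parts that matter here.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:creator>John Smith</dc:creator>
          <dc:creator>Jane Doe</dc:creator>
          <dc:source>Journal of Examples</dc:source>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def records_to_rows(xml_text):
    """Return one [author1, ..., authorN, journal] list per record."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        authors = [
            el.text.strip()
            for el in record.iterfind(".//dc:creator", NS)
            if el.text
        ]
        # Assumption: the journal name lives in dc:source.
        source = record.find(".//dc:source", NS)
        journal = source.text.strip() if source is not None and source.text else ""
        if authors:  # skip records without authors (e.g. deleted records)
            rows.append(authors + [journal])
    return rows


for row in records_to_rows(SAMPLE):
    print(", ".join(row))
```

Real records often store names as "Smith, John", in which case a comma-separated line becomes ambiguous; a tab or semicolon delimiter would be safer there.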
mefi-mail me if you need more info.
posted by lex mercatoria at 12:06 PM on June 9, 2010
This seems to be the standard tutorial for OAI-PMH. It's kind of technical but I think the information you need is probably in there. Basically there are six "verbs" (Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord) that you use to get different types of information from the repository; your sample query uses ListRecords to get, well, a list of records available for harvesting. You can limit your results by date range (the from and until arguments) and by set, if the repository defines sets. To avoid clobbering the server with a request for a single enormous XML file, it might be worth using ListIdentifiers to get a list of identifiers for the records you want, rather than the full records, and then using GetRecord to pull each record individually.
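To make the verbs a bit more concrete, here is a small Python sketch that only builds the request URLs; it does not contact any server. The base URL, the record identifier, and the helper name `oai_url` are hypothetical placeholders, so substitute whatever your repository actually uses. The argument names (`verb`, `metadataPrefix`, `from`, `until`, `identifier`) come from the protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the base URL your repository gave you.
BASE_URL = "https://repository.example.edu/oai"


def oai_url(verb, **params):
    """Build an OAI-PMH request URL for a verb plus its arguments."""
    query = {"verb": verb}
    for key, value in params.items():
        # 'from' is a Python keyword, so callers pass 'from_' instead.
        query[key.rstrip("_")] = value
    return BASE_URL + "?" + urlencode(query)


# Ask the repository to describe itself.
print(oai_url("Identify"))

# List identifiers of records changed within a date range.
print(oai_url("ListIdentifiers", metadataPrefix="oai_dc",
              from_="2010-06-01", until="2010-06-09"))

# Fetch a single record by its (made-up) identifier.
print(oai_url("GetRecord", identifier="oai:example.org:1",
              metadataPrefix="oai_dc"))
```

Pasting one of the printed URLs into a browser, with the real base URL swapped in, is probably the quickest way to see what each verb returns.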
Keep in mind that OAI-PMH is a metadata harvesting protocol, not a search protocol. It's designed to help metadata aggregators collect all available records, not to help people search for relevant records within the dataset. If you are trying to use it to do a search (e.g., retrieve all records where the author is John Smith), well, I don't think you can do that directly -- you'd have to grab the complete dataset (the XML output of your sample query, containing all available records) and extract your results from it separately.
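Grabbing the complete dataset usually takes more than one request, because a repository may split a long ListRecords answer into pages and hand back a resumptionToken for the next page. A rough Python sketch of that loop follows; the function names `fetch_url` and `harvest` are made up for illustration, and it does no error handling or politeness delays, which a real harvest would want.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def fetch_url(url):
    """Download one response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", fetch=fetch_url):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        yield from root.iterfind(".//oai:record", OAI_NS)
        token = root.find(".//oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one, marks the last page
        # A resumption token is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Calling `harvest` with the repository's base URL and looping over the result gives you every record, which you can then filter by author or anything else on your own machine. The `fetch` parameter exists only so the loop can be tried against canned responses before pointing it at a live (and slow) server.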
As for dealing with the resulting XML ... I dunno. If it was me I'd hack something together in Perl. There may be more user-friendly ways of dealing with the data, but I'm not aware of them.
posted by twirlip at 12:00 PM on June 9, 2010