Clinical information database design questions
March 30, 2009 5:07 PM   Subscribe

I've been tasked with making a large amount of clinical research information we have easier to search and maintain. I have some questions about the best way to organize a database to do this, and in so doing see how much of my time I've wasted so far.

First off I'm using Access 2000--that's what our organization provides. Second, my database skills are marginally more than basic. I know how to make tables, form, queries, etc with Access. I have some idea about how to design relatively simple relational databases, but I'm having trouble determining if what I've done so far with this one is reasonable or if there's something better.

The data I have consists of clinical information for a number of research subjects. The data consists of things that are simple observations (e.g. temperature, weight, etc.) but also things that are more complicated like blood electrolyte measurements that contain many parts for a single "observation." In addition to observations there are also treatments like drug or fluid injections. The data as I get it is basically a list of observations and treatments for each subject organized by day. It's more or less a diary for each subject. A big unsearchable diary.

What I've done so far is to create a number of tables to sort of compartmentalize everything. First I have a table for all the research subjects. For each subject there are 10 or so different observations or types of treatments so I've made tables for all of those. What I've ended up with is a relationship diagram (in Access) that looks like a pinwheel. The field that ties all the tables together is the subject ID. It's present in every table except the ones I use for lookup lists. So, I have lots of one to many links from the subjects table to all my other tables.

Is this the best way to organize this type of log or journal type information? I thought I pretty well understood the concept of normalization and relationships--primary keys, etc.--after going through it again recently, but when I try to apply that to this database it seem that maybe I don't understand it as well as I thought. One problem I'm having trouble with is that there's very little unique in any of the tables' records except the research subject table. I set up artificial primary keys for the tables since for most of them I'd have to use all or most of the fields to identify a record as unique.

I'm thinking there must be a better way to organize the data. Unfortunately I haven't found a discussion anywhere of this kind of database. Or, else I haven't been able to see the many video store and shipping company examples as the same as what I need.
posted by sevenless to Computers & Internet (8 answers total)
What will you be doing with the data?

If anyone is going to make any decisions based on it (and why make it available otherwise?), then the regulations covering Good Clinical Practice in your country likely apply.

This means that the system needs to be developed, validated and use by qualified and experienced individuals among other things, and that the data manipulations be recorded in an audit trail.

If the data will result in analyses and electronic records used to support conclusions submittable to FDA or other countries' regulatory authorities, then 21 CFR Part 11 will possibly also apply.

You must also comply with any stipulations of the ethics boards who sanctioned the gathering of the data, and these will likely include rules surrounding who may see the data, what privacy rules apply, etc.

There will (should) also have been a Trial Protocol Document which describes acceptable / intended analyses.

That all said, for clinical data, you are typically going to create a Demography table to contain the basic information about each research subject. This sounds like the center of your pinwheel, so this is reasonable.

There should also be a set of data relating to the History of each research subject (pre-existing conditions, risk factors, selection and exclusion criteria for the trial, etc.).

All other data should be organized by date/time and in tables according to type. If it helps, these typically fall into the categories of: Treatment, Concomitant Conditions/Adverse Events, Efficacy Measures and data describing at which point the patient dropped-out or completed the trial (Follow-up information).

It is also helpful to review the various measurements to determine whether they are true Efficacy Measures (i.e., indicators that treatment is having a desired, good effect) or whether they are Safety Measurements (i.e., things monitored that might reveal a unexpected or deleterious effect).
posted by blue_wardrobe at 5:46 PM on March 30, 2009

Where is this data coming from, and is it electronic? If it's clinical data from a CRO, odds are it's been through SAS at some point. You might be able to parasitize the existing table design if this is the case. There is a PROC SQL that is fairly easy to use which treats the SAS tables like database tables and can be useful for export via ODBC. Also, you might want to peer at some CDISC docs in case this data might be going to the FDA electronically at some point.

This sounds like a really large undertaking unless you are making pie charts that are for entertainment only. I assume you have no budget for a commercial solution that meets all the regs?

If the questions your audience is asking are simple, like "Where is patient report 4567?" you might be better off using an indexer on the documents with dumb keyword searches.
posted by benzenedream at 6:25 PM on March 30, 2009

I used to work for a company that does almost exactly this: InvivoData
posted by odinsdream at 7:06 PM on March 30, 2009

I can't speak to the legal requirements.

As far as database design goes, you're appear to be doing very well as far as Normalized database design goes. Very well indeed if you're an amateur. If your relationships form a "pin-wheel", so be it; it just means that there are no relations between entities in your model, other than the relation of each measure to the patient on which/for which it was measured.

However, you should know that clinical data is often models according to an Entity-Attribute-Value model, in which each row is a that tuple, Attributes are heterogeneous and essentially "untyped", and the type and range of the Value depends on the Attrubue. (Conceptually untyped, that is; in practice that means a string/varchar, as it can represent any type.) E.g.,

PatientID, Attribute, Value
1, 'temperature', '98 Celsius'
1, 'handidness', 'right'
2, 'eye-color', 'blue'
2, 'complaint', 'fatigue'

etc. The idea is, since what we're recording is so broad, we essentially can't effectively model it as we'd "normally" (pun intended) do in a database, and even if we could, we'd end up with unmanageably broad tables full of nulls. So we just record "something happened", possibly adding a timestamp of when it happened. Columns become rows, and ad-hoc.

So your pinwheel may indicate that there's not enough structure to extract. On the other hand, that you've managed to normalize your data into categories (tables) and look-ups may indicate that for your data, there is enough structure to make a model.

As to searching, well, an E-A-V table is easy to search. The problem isn't finding rows, it's finding related rows (or conversely, the problem not finding what's extraneous to your needs). (Andrew Koenig tells us that "Abstraction is selective ignorance" -- in E-A-V you can't get Abstract, because it's difficult to ignore, to see the forest for the trees.) We can easily look up "electrolyte" or "sodium" or "potassium", but only our own knowledge (external to the database) informs us that there is any relation between these things. They're purely tags, and could as well be represented as "xyzzy", "plugh", and "foobar", or by their Chinese or French equivalents.

Without additional structure (such as you've added) or external knowledge, there's no way to ask the database, "show me all findings related to electrolytes". Even if we supply, with external knowledge, a list of electrolytes (in the language and form that matches the tags in the database), the E-A-V database has no way to match those terms without simply grinding through every attribute (possibly for some subset of entities) (other than very generic and so not particular useful conventions, such as an alphabetical index of attributes).

I hesitate to ever recommend Wikipedia for anything, but the Wikipedia article on E-A-V is comprehensive and well worth reading -- I suspect because it's primarily the work of one author. I'd read it now if I were you, as it seems (as with most worthwhile articles there) to have fallen afoul of the Wikipedia management precisely because it's informative without adhering to the Wikipedia conventions of being dry and turgid and mostly useless to anyone who doesn't already know everything the article explains.
posted by orthogonality at 7:08 PM on March 30, 2009

Thanks everyone for the responses so far. Just to clarify a little:

1. I'm not making a database designed to record ongoing clinical activity or to help make clinical decisions. The data available to me is all retrospective. I used the word clinical purely to help describe the kind of data I have. I'm just trying to put the data into a more usable form.

2. I don't have access to the database that is actually used to originally record the data. All I have is a primitive web interface that lists entries for each procedure, etc. and the date. I have to click on the entry to see the details. In other words for now I'll be entering the data mostly by hand. Well, hopefully soon by "I," I mean "student helpers."
posted by sevenless at 9:02 AM on March 31, 2009

> In other words for now I'll be entering the data mostly by hand. Well, hopefully soon by "I," I mean "student helpers."

Please don't do that. You're shooting yourself in the foot. Go back to the people who gave you this task and tell them that in order to make it work you need better access to the dataset. If they have to scrub it somehow before giving it to you, fine, but copy-paste is a guaranteed way to make your database that you build have more errors than the original data.
posted by tarheelcoxn at 9:26 AM on March 31, 2009

Hmm. Well, I didn't mean "clinical decisions" - I meant decisions relating to the efficacy and safety of the treatment or medicinal product. It can still all be retrospective, and still be subject to controls required by 21 CFR Part 11 (or other equivalents) for electronic records in clinical studies.

Anyway - that doesn't actually answer your question, so I'll rest my case.

I hope the other pointers were helpful.
posted by blue_wardrobe at 3:22 PM on March 31, 2009

All I have is a primitive web interface that lists entries for each procedure, etc. and the date. I have to click on the entry to see the details. In other words for now I'll be entering the data mostly by hand. Well, hopefully soon by "I," I mean "student helpers."

Please don't do this. Even if you can't get access to the original database, screen scraping is better than re-entry. You will at least double your error rate in the data by re-typing, plus you will be wasting human time on things better done by computers. At least get the student helpers to use the proper downloading tools so that they aren't learning Mavis Beacon Teaches Clinical Data Typing.

Look up tools such as wget (available for Linux, Windows, and Mac) or curl to copy the data served via the web. There are also Firefox plugins that will do the same thing (Downthemall) if the web data is behind some sort of funky authentication barrier.
posted by benzenedream at 3:46 PM on March 31, 2009 [1 favorite]

« Older Violent action ensues.   |   Books for the Middle of Nowhere Newer »
This thread is closed to new comments.