In, Out, Shake-it-all about... I need a methodology and/or software application (for analyzing differing input, storage and output concepts) that I can hand on a silver-platter to my analysts...
September 1, 2009 8:40 AM   Subscribe

In, Out, Shake-it-all about... I need a methodology and/or software application (for analyzing differing input, storage and output concepts) that I can hand on a silver-platter to my analysts...

I am looking for a scientific / semi-scientific approach to analysis of the following problem:

I have a database of information of incidents relating to medicinal products and their specifications. The inputs to the database come from multiple sources for each incident and for each specification, with different ideas about how to represent certain concepts. The outputs of the database get printed on various reports and forms, whose format is specified by various governments, that also use differing concepts. Let's assume for the sake of this discussion that there are no agreed published mappings between the different concepts. (For examples of clashing concepts, see end.)

Goals:
1) Minimize / simplify data entry from variable quality source data;
2) Minimize duplicative storage;
3) Ease ad hoc analysis;
4) Generate compliant reports and forms;

It seems obvious to me that you have to store a 'super-concept' that is more precise than all of the input and output concepts. But because the input and output concepts could be re-defined at any point in time, the stored super-concept also needs to be able to be re-defined more precisely at any point in time. And we need mappings from the input and output concepts to the super-concept that can be tracked over time. But is there a standardized or at least well-accepted approach to analysis of this problem, or to the problem itself? I would like to hand my analysts the right approach 'ready-baked', or show them systems that tackle this problem effectively. Today, their analysis is too simplistic (a field for every concept even if it is duplicative), or inconsistent (some concepts insufficiently represented). Result: upset users.

Bottom line... What methodology can I point them to? Preferably something that exists and is presented in a straightforward manner. Is there existing software to manage the analysis?

Bonus points... Since this field cannot be unique to this industry, are there good examples of systems that already effectively use input to storage to output mappings and make them obvious to the user and auditor alike?

===============

Examples of concept clashes:
DRUG COLOR: the US government might have a list of ten named colors; the UK government might have a list of thirty colors; another government might use RGB values; any of these lists may change at any time;

PATIENT AGE: classically you use number/unit combinations, such as 43 years, 2 months or 3 weeks, etc. But you might receive or have to output a category (defined differently by different governments), such as 'Adult', 'Neonate', 'Pubescent', etc. There are overlapping concepts such as 'Senior Citizen' which might differ in age range by jurisdiction and sex.

PRODUCT CATEGORY: In the US, ‘vaccines’ are in a regulatory category all their own, whereas in the EU, ‘vaccines’ are a sub-category of ‘drug’ products. There are known products which are classified as a 'drug' in one jurisdiction, and as a 'device' in others - sometimes arbitrarily. The product category tends to drive which version of a concept one needs for all the other variables.

HOSPITALIZATION NEEDED / EXTENDED: This is one checkbox on some reports and forms which we receive or send, and two on others. It depends on the form or report. No other real logic applies. Just governments being arbitrary.

LABELED USE / OFF-LABEL USE: If a drug product is used as stated on the label, then it's 'Labeled Use'. If not, then it's 'Off-Label Use'. But for vaccines, an additional, overlapping concept is whether the vaccination was given as part of the recommended ‘primary’ course (in other words in the appropriate number, separated appropriately by time, and within the appropriate patient age range for each administration). This is called 'Primary Use'. But there is no accepted term for non-'Primary Use' - the box 'Primary Use' is simply left unchecked. non-'Primary Use' equates to 'Off-Label Use' in the US, but elsewhere they are separate but analogous concepts.

DEVICE TYPE: has a list which is specified in the US by the regulatory guidance as ‘any term you like that is not too generic’ (I love regulatory guidance). Elsewhere the list is defined as a few broad names that today is a subset of the US list, but might not stay that way.

Many, many others.
posted by blue_wardrobe to Computers & Internet (5 answers total)
 
It seems obvious to me that you have to store a 'super-concept' that is more precise than all of the input and output concepts.

You've got that backwards. It seems obvious to me you need store these to the least common denominator. Essentially, your database will be a series of claims about existence in a set.

Example: if UK recognizes seven shades of blue and the US only one, your database will have to discard the extra information in the UK, and store just "blue". Hence, a database drug color of blue isn't a specific shade of blue, but a set of blues.

It can be tempting to just add all the possible values for a field, but you're going to be biasing the database -- a ton of "blue" pills and a few potentially outlier "cerulean" from a different dataset.
posted by pwnguin at 4:49 PM on September 1, 2009


Best answer: Also, Wikipedia on the subject they call Data Integration. And indeed, they mention some complicated set theory I don't feel like discovering today.
posted by pwnguin at 5:07 PM on September 1, 2009


Response by poster: @pwnguin: I don't understand why I have it backwards. If my tablet is "cerulean" according to the UK, and "blue" according to the US, I need to be able to submit a report to the US authorities using just "blue", and a report to the UK authorities using "cerulean". I am not allowed to approximate. So (at least for submissions of official documents) I think I *do* need a superset of all possible values, and perhaps a mapping that says "For US, RGBxyz should be output as 'Blue'", "For UK, RGBxyz should be output as 'Cerulean'".

If I'm wrong, please enlighten me.
posted by blue_wardrobe at 7:27 PM on September 1, 2009


Perhaps I'm missing something here; I assumed the purpose was to generate some sort of analysis of the data in aggregate. A question: if an input document from the US lists a pill as "blue", how do you describe it on reports for the UK?
posted by pwnguin at 8:13 PM on September 1, 2009


Response by poster: I omitted to mention that I also have access to the company's own precise specifications. So, I can usually ascertain the color as stated in Pantone, RGB, etc. But I have to use euphemisms and approximations when sending reports to each agency.

But you are right, I will want to do aggregate analysis too, and that's important.

Your link to data integration certainly points the way. Thanks for that. I'm now wading through LAV and GAV concepts.
posted by blue_wardrobe at 9:51 PM on September 1, 2009


« Older 'kodomo no shiro' (national children's castle) -...   |   Videocamera Recommendations? Newer »
This thread is closed to new comments.