Sliding hierarchy? Hypercube? How do I structure this data?
September 30, 2014 3:04 AM

I have some data, it happens to be medically-oriented but that's probably not too important. How can I structure it? It appears to be intrinsically hierarchical, but the hierarchy can be organised in various ways. Is there a design pattern for this problem?

The same information can be organised in many different and equally valid hierarchies.

This example happens to be about mechanisms of pain. I'll probably be listing various other similar datasets, so I need a decent model.

(Oh, if someone has done this already, feel free to point it out. I shall be conducting a literature search at some point, but I haven't spotted anything that collects this all in one place yet.)

This is just an illustrative draft, so any errors aren't important; they'll be ironed out later. Here's a simple illustration of the problem, wherein I slice the same data in two different ways:

Link to image of two hierarchies.

This data is very sparse, and will become increasingly complex as I add further sub-categories into the mix.

Is a hierarchy the correct way of modelling this data? I'm beginning to suspect not. However, under certain circumstances a hierarchy becomes a sensible way of visualising the data. For example: "Show me all the sensitisation disorders of the central nervous system." Or "show me all peripheral nervous system components involved in allodynia."

A hierarchy seems like a way of projecting this information into two dimensional form, whereas actually the info should be modelled in more than two dimensions. Does that seem feasible?

I'm open to any structuring system at all. Spreadsheets, entity relationships, pan-dimensional directory listings, anything. However it should preferably be manipulable with standard free tools, so if it requires an obscure expensive solution, that's probably out of my reach.

As mentioned above, I'd like to query this data e.g. "Show me all X where Y structured as a hierarchy". It seems that I'll be drawing an ERD at some point. Ho hum.

Any thoughts?
posted by ajp to Grab Bag (16 answers total) 6 users marked this as a favorite
 
You need two axes of differentiation. What you have are different data types. I would use color or symbols to differentiate the different types of disorders.
posted by sonic meat machine at 4:17 AM on September 30, 2014


Try playing with the data in Tableau. Download the free trial.
posted by oceanjesse at 4:40 AM on September 30, 2014


Take a look at some ontologies like the Human Phenotype Ontology or Gene Ontology for inspiration on how similar data gets organized. (Actually, if you're lucky, your data may already be represented in an existing ontology.)
posted by penguinicity at 5:03 AM on September 30, 2014


Response by poster: @sonic meat machine: There will be more than two axes eventually. Adding just one more would make the hierarchies far larger than would easily fit here. Imagine adding "pathologies" into the illustrations above. This could produce:

Sensitisation -> PNS -> fibre -> allodynia -> inflammation

Sensitisation -> PNS -> fibre -> allodynia -> ionising radiation

...or...

Inflammation -> PNS -> fibre -> allodynia

Inflammation -> PNS -> nerve root -> radiculopathy

...or...

PNS -> ionising radiation -> fibre -> allodynia

PNS -> inflammation -> nerve root -> radiculopathy

...and so on. Each representational hierarchy is valid, but none is the definitive model.

@oceanjesse Thanks for the recommendation. There are many, many data modelling tools, and I'm familiar with a bunch of them. The first step is the most important: knowing how to organise this data.

I'm edging towards some sort of data-warehouse type model. Hmm.
posted by ajp at 5:08 AM on September 30, 2014


Response by poster: @penguinicity That's very interesting. The Phenomizer appears to list a complete ontology of "abnormalities" such as this ontology of allodynia. However it's strictly structured, and cannot apparently be "re-sliced" as in my examples above. Very close though, thanks.
posted by ajp at 5:17 AM on September 30, 2014


Best answer: I'm not sure if this will work for you, but in the past with data like that I used a Directed Cyclic Graph. For visualization I used something like this radial graph which is a visualization of a Directed Acyclic Graph (DAG). The difference for the DCG is that any sub-node could also be the parent node of other nodes, and so looping can occur.

I think a search/filter coupled with an algorithm for traversing the graph could yield interesting visualizations.

I was able to find this graphviz online tool to illustrate the concept. You can paste in the following representation of your data.

digraph g{
s -> p -> f -> a -> i
s -> p ->f -> a -> ir
i -> p -> f -> a
i -> p -> n -> r
p -> ir -> f -> a
a
p -> i -> n -> r

s [label="Sensitisation"];
p [label="PNS"];
ir [label="ionizing radiation"];
i [label="inflammation"];
f [label="fibre"];
a [label="allodynia"]
n [label="nerve root"]
r [label="radiculopathy"]
}

And you can begin to see some interesting patterns.
For large sets of data, a filter could be constructed, for example:

Searching for Inflammation and allodynia - meaning that only paths starting at inflammation or allodynia are created, yields:

digraph g{
a -> i
a -> ir
i -> p -> f -> a
i -> p -> n -> r
i -> n -> r

s [label="Sensitisation"];
p [label="PNS"];
ir [label="ionizing radiation"];
i [label="inflammation"];
f [label="fibre"];
a [label="allodynia"]
n [label="nerve root"]
r [label="radiculopathy"]
}

Pasting this into that online tool shows that there is no way to get to sensitisation from either of the inflammation or allodynia nodes.
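
If you would rather check that reachability claim in code than by eye, here is a minimal Python sketch. It assumes the six paths from the first digraph above, written with the same short node names; the helper names `build_edges` and `reachable` are my own and not part of graphviz.

```python
from collections import defaultdict

# Paths copied from the first digraph above, using its short node names.
PATHS = [
    "s p f a i",
    "s p f a ir",
    "i p f a",
    "i p n r",
    "p ir f a",
    "p i n r",
]

def build_edges(paths):
    """Turn each path into directed edges between consecutive nodes."""
    edges = defaultdict(set)
    for path in paths:
        nodes = path.split()
        for src, dst in zip(nodes, nodes[1:]):
            edges[src].add(dst)
    return edges

def reachable(edges, starts):
    """Return every node reachable from the start nodes (iterative DFS)."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

edges = build_edges(PATHS)
found = reachable(edges, ["i", "a"])  # start at inflammation and allodynia
print(sorted(found))
print("s" in found)  # is sensitisation reachable?
```

The same traversal could drive the filter: keep only the edges whose source is in `found`, then write them back out as a digraph.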

Hopefully this can serve as a launch point to get more precisely what you need.
posted by forforf at 6:13 AM on September 30, 2014 [4 favorites]


Minor corrections due to the editing window closing on me (none change the results):

The first set of paths should be:
digraph g{
s -> p -> f -> a -> i
s -> p ->f -> a -> ir
i -> p -> f -> a
i -> p -> n -> r
p -> ir -> f -> a
p -> i -> n -> r

and the second set should be
digraph g{
a -> i
a -> ir
i -> p -> f -> a
i -> p -> n -> r
a
i -> n -> r
posted by forforf at 6:20 AM on September 30, 2014 [1 favorite]


OK, after seeing that the hierarchy will grow larger, I will take a different tack. I would say that you have the beginnings of a relational database... and from a database, you can build graphs (similar to what forforf is saying). Excuse my lack of terminology here, but here is an attempt:

NervousSystemDivisions (Central, Peripheral, ...)
NervousSystemComponents (Spine, Cerebrum, ...)
DisorderCategory (Sensitization...)
Pathologies (Inflammation, Ionizing Radiation...)

Then you express your relationships as, well, relationship tables:

DivisionComponents: NervousSystemDivisions × NervousSystemComponents
ComponentDisorders: NervousSystemComponents × DisorderCategory
DisorderPathologies: DisorderCategory × Pathologies

You can then use SQL to build ad-hoc queries across the data and present it however you'd like.
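
As a rough illustration of that idea, here is a sketch using Python's built-in sqlite3 module. The table and column names are simplified assumptions rather than a finished schema (it links components to individual disorders instead of disorder categories), and the rows are taken from the example paths earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE divisions  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE components (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE disorders  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
-- Relationship tables: one row per valid pairing.
CREATE TABLE division_components (division_id INTEGER, component_id INTEGER);
CREATE TABLE component_disorders (component_id INTEGER, disorder_id INTEGER);

INSERT INTO divisions  VALUES (1, 'PNS');
INSERT INTO components VALUES (1, 'fibre'), (2, 'nerve root');
INSERT INTO disorders  VALUES (1, 'allodynia'), (2, 'radiculopathy');
INSERT INTO division_components VALUES (1, 1), (1, 2);
INSERT INTO component_disorders VALUES (1, 1), (2, 2);
""")

def disorders_in_division(division):
    """List (component, disorder) pairs for one nervous system division."""
    rows = conn.execute(
        """
        SELECT c.name, d.name
        FROM divisions v
        JOIN division_components dc ON dc.division_id = v.id
        JOIN components c           ON c.id = dc.component_id
        JOIN component_disorders cd ON cd.component_id = c.id
        JOIN disorders d            ON d.id = cd.disorder_id
        WHERE v.name = ?
        ORDER BY c.name
        """,
        (division,),
    )
    return rows.fetchall()

print(disorders_in_division("PNS"))
```

Changing the join order or the WHERE clause gives a different "slice" of the same rows, which is the point of keeping the relationships in their own tables.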
posted by sonic meat machine at 6:25 AM on September 30, 2014


If what you mean by modeling is just "Show me all the sensitisation disorders of the central nervous system," then a relational database seems the obvious way to go.

If you have actual data (i.e. this person took 3.98 seconds to complete some task and they or their exciting lesions have the following characteristics...) that you need to model statistically, then you have alternatives. You could leave the data in a SQL database, but you'd (probably) need to dump it to a flat file to analyze since most statistical software wants to eat flat files. Or you could just create the file as a flat file with columns for nervous system division, nervous system component, disorder category, etc. If analyzing with these in mind were appropriate or required, you'd probably be looking at a multilevel or cross-classified model, both of which are well-described online for most popular statistical software.
posted by ROU_Xenophobe at 7:03 AM on September 30, 2014


Response by poster: Lots of useful stuff here, but I think @forforf is closest with directed cyclic graphs. Thanks for that, this does indeed look like a graphing problem.

I'm going to look into whether nodes can be "typed" (e.g. by tissue, or pathological mechanism, or whatever) and if edges can be directional and labelled (e.g. "A causes B", "B is caused by A", "C is made of D", "D is affected by B" etc).
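
To make that concrete for myself, here is a rough Python sketch of typed nodes and labelled, directional edges. The `Node` and `Edge` classes, the `INVERSE` table and the node kinds are all made up for illustration, and the two "causes" facts come from my draft data, so treat them as placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # e.g. "component", "symptom", "pathology"

@dataclass(frozen=True)
class Edge:
    source: Node
    label: str  # e.g. "causes", "affects"
    target: Node

# Hypothetical inverse labels so each edge can be read in either direction.
INVERSE = {"causes": "is caused by", "affects": "is affected by"}

def with_inverses(edges):
    """Return the edges plus a reversed, relabelled copy of each one."""
    result = list(edges)
    for e in edges:
        result.append(Edge(e.target, INVERSE[e.label], e.source))
    return result

def neighbours(edges, node, label):
    """Names of nodes reached from `node` along edges with this label."""
    return sorted(e.target.name for e in edges
                  if e.source == node and e.label == label)

inflammation = Node("inflammation", "pathology")
radiation = Node("ionising radiation", "pathology")
allodynia = Node("allodynia", "symptom")

graph = with_inverses([
    Edge(inflammation, "causes", allodynia),
    Edge(radiation, "causes", allodynia),
])
print(neighbours(graph, allodynia, "is caused by"))
```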
posted by ajp at 8:19 AM on September 30, 2014


What you describe with typed nodes and edges is very similar to what RDF is intended to encode.

RDF is a W3C standard for describing graphs with typed nodes and edges. There are a number of tools that allow queries to be run against a set of RDF statements. The main query language is SPARQL. There is a more advanced language called OWL that can be used for something resembling a query.

SPARQL 1.1 defines property paths which would allow simpler expression of join queries across SQL relationship tables. This allows transitive queries such as, get me all ?D where ?D is affected by B OR ?D is affected by something transitively affected by B. No doubt there is some way to do this in SQL DBs; SPARQL just makes it less verbose and more standardized.
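
To show what such a transitive query computes, here is a plain-Python sketch with no RDF library involved. The triples and the `transitive_objects` helper are placeholders I made up; the result is roughly what a one-or-more (`+`) property path over an "affects" predicate would return.

```python
# Triples are (subject, predicate, object), loosely mirroring RDF statements.
# The letters are placeholders, as in the "D is affected by B" example.
TRIPLES = [
    ("B", "affects", "D"),
    ("D", "affects", "E"),
    ("A", "causes", "B"),
]

def transitive_objects(triples, start, predicate):
    """Everything reachable from `start` by following `predicate` one or more times."""
    found = set()
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == predicate and o not in found:
                found.add(o)
                frontier.append(o)
    return found

print(sorted(transitive_objects(TRIPLES, "B", "affects")))
```

A real triple store would index the statements rather than scan the whole list on every step, but the answer is the same.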

Generally you'd take your data, encode it as RDF, such as the turtle RDF language, then put it in a tool like Fuseki, and then run SPARQL queries against that. You could also configure Fuseki to process OWL as it doesn't do that by default.

I hesitate to mention it since the tools and specs are less mainstream than SQL. RDF can be a bit tricky to work with and much is obscurely documented. Feel free to mail me questions.
posted by bdc34 at 9:12 AM on September 30, 2014 [4 favorites]


The concern I'd have with DCGs is that graphviz is a visualization tool, not a querying or database tool. You can model DCGs in an RDBMS or in a graph database as bdc34 describes, but whichever of those you choose, you end up with a data store that you can manipulate rather than a document which is relatively static. (That is to say, graphviz is a nice tool, but its output is essentially the same as something like Visio or OmniGraffle.)
posted by sonic meat machine at 9:41 AM on September 30, 2014 [1 favorite]


Sparse data plus unpredictable relationships between entities probably disqualify relational databases if the amount of data is non-trivial.

Honestly, you've got a data modeling problem. I agree with bdc34 - RDF/SPARQL is probably the best way to go, here. There are open source RDF/SPARQL tools out there, but you're going to need to fiddle.

Tools that handle this well are not cheap, and definitely not free. Actually - that's true, in general - software or otherwise.
posted by NoRelationToLea at 10:06 AM on September 30, 2014


Instead of thinking of them as 'axes', think of them as filters.

My claim is that tags (or labels) get you most of the way there, and that you model the smarts / hierarchy at the search layer. I would treat the labels as a 'bag of (special) words', and use something like SOLR (which I like to prototype with Lunr) as the search mechanism.

I don't understand what the hierarchy of this is supposed to be:

Sensitisation -> PNS -> fibre -> allodynia -> ionising radiation

To me, that sounds like a CAUSE of a SYMPTOM, involving several AREAS. Modeling that with a forced 'direction' seems weird. The direction seems like an artefact of the search / display rather than inherent.


I would write it as:

class:sensitization part:pns part:fibre disease:allodynia cause:radiation

And offer easy (special) searches on classes, parts, diseases, and causes.
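
Before reaching for a search engine, you could prototype the idea in a few lines of Python. This is only a sketch: the records are my reading of the example paths in this thread, and `search` and `values` are hypothetical helpers, not SOLR or Lunr calls.

```python
# Each record is a set of namespaced tags, in the "namespace:value" style above.
RECORDS = [
    {"class:sensitization", "part:pns", "part:fibre",
     "disease:allodynia", "cause:radiation"},
    {"class:sensitization", "part:pns", "part:fibre",
     "disease:allodynia", "cause:inflammation"},
    {"part:pns", "part:nerve-root", "disease:radiculopathy",
     "cause:inflammation"},
]

def search(records, *required):
    """Return the records that carry every required tag."""
    wanted = set(required)
    return [r for r in records if wanted <= r]

def values(records, namespace):
    """Collect the distinct values under one namespace, e.g. 'disease'."""
    prefix = namespace + ":"
    return sorted({t[len(prefix):] for r in records for t in r
                   if t.startswith(prefix)})

hits = search(RECORDS, "part:pns", "cause:inflammation")
print(values(hits, "disease"))
```

Any ordering or nesting then happens at display time, which fits the idea that direction is an artefact of the search rather than of the data.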



If you want to go RDBMS, what are the entities here? I see:

DISEASE many:many CAUSES
DISEASE many:many PARTS
DISEASE many:many ATTRIBUTES (whatever 'sensitization' is).


(RDF may be the Right answer. I personally find it very hard to actually work with.)
posted by gregglind at 12:01 PM on September 30, 2014


Response by poster: @gregglind The issue is exactly what you've noticed, that the hierarchy changes depending on how the data is sliced. The best model will be able to represent any (valid) direction.

In the example you chose:

Sensitisation -> PNS -> fibre -> allodynia -> ionising radiation

...this represents:

Sensitisation [is a pathology of system] PNS. [If it involves the component] fibre [it may result in] allodynia. [One cause of this may be] ionising radiation.

That's a bit arse-about-face but it's still valid. I don't want to "force" any one direction, I want to be able to represent all valid directions.

I think this suggests that starting from a hierarchical structure is the wrong approach. While many hierarchies could be "projected" from the underlying dataset, that dataset has to be "multidimensional".
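
A small Python sketch of what I mean by "projecting": keep flat rows with one value per dimension, then nest them in whatever order a given view needs. The dimension names (`category`, `system` and so on) and the `project` helper are my own placeholders, and the rows are from the draft examples above.

```python
# Each row is one fact with a value per dimension (flat, no inherent order).
ROWS = [
    {"category": "sensitisation", "system": "PNS", "component": "fibre",
     "symptom": "allodynia", "pathology": "inflammation"},
    {"category": "sensitisation", "system": "PNS", "component": "fibre",
     "symptom": "allodynia", "pathology": "ionising radiation"},
]

def project(rows, order):
    """Nest the rows into a tree whose levels follow `order`."""
    tree = {}
    for row in rows:
        level = tree
        for dim in order:
            # Descend one level, creating the branch if it is new.
            level = level.setdefault(row[dim], {})
    return tree

by_category = project(
    ROWS, ["category", "system", "component", "symptom", "pathology"])
by_pathology = project(
    ROWS, ["pathology", "system", "component", "symptom"])
print(by_category)
print(by_pathology)
```

Neither tree is stored anywhere; each is just one view derived from the same rows, so no single hierarchy has to be the definitive one.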

I've started poking around with RDF. Thus far, it does indeed seem painful!
posted by ajp at 7:56 AM on October 1, 2014


Not to threadsit, but 'tags with namespaces' might be a way to go then :)
posted by gregglind at 2:39 PM on October 1, 2014

