XML - Why won't you play nice?
September 21, 2010 1:36 PM Subscribe
XML_Torture: Postgre database with massive data stored as XML. I need to export the XML, along with some other data, to PDF. I need to do this for around 15,000 records. And by next Thursday.
For reasons I can't begin to explain, I have a Postgre database from an Electronic Medical Records (EMR) system containing 15,000 summary records stored as XML that need to be transformed into PDF for import into a new EMR. In addition to these summary records, there is other data (non-XML) that needs to be collated into separate PDF files. Below is a basic sketch of what the output needs to look like:
LastName (stored separately as text)
FirstName (stored separately as text)
DOB (stored separately as INT8)
AcctNumber (stored separately as text)
Summary Data (stored as an XML file)
- Active Problem
- Inactive Problem
- Allergies
- Immunizations
- Additional fields
The problem is that I can't seem to find a tool or script that will allow me to take one of these files and transform it into either a new table with individual fields (from the XML document) that I could import into Crystal Reports or straight to PDF. I've tried to use Altova to strip out the data from the XML and insert it into a new database table but kept getting a cryptic error that can't be resolved. The majority of the data stored in XML is in unformatted text blocks that make it impossible to import into Excel. Below is an example for reference (most fields will contain quite a bit more data):
CurrentProblem (fieldname/field description):
- Kidney Failure 1931
- No brainz
dga aah
Too many Waffle House trips
I've looked at some of the past AskMe questions about XML transformation and have tried some of the solutions with no success. If it helps I'm using Windows and can purchase software if necessary. What would be ideal for me would be to extract the data from XML and insert it into a new table so I could use it for a Crystal Report and for further analysis. I'm just at my wits end trying to figure out how to extract it in a usable format.
For reasons I can't begin to explain, I have a Postgre database from an Electronic Medical Records (EMR) system containing 15,000 summary records stored as XML that need to be transformed into PDF for import into a new EMR. In addition to these summary records, there is other data (non-XML) that needs to be collated into separate PDF files. Below is a basic sketch of what the output needs to look like:
LastName (stored separately as text)
FirstName (stored separately as text)
DOB (stored separately as INT8)
AcctNumber (stored separately as text)
Summary Data (stored as an XML file)
- Active Problem
- Inactive Problem
- Allergies
- Immunizations
- Additional fields
The problem is that I can't seem to find a tool or script that will allow me to take one of these files and transform it into either a new table with individual fields (from the XML document) that I could import into Crystal Reports or straight to PDF. I've tried to use Altova to strip out the data from the XML and insert it into a new database table but kept getting a cryptic error that can't be resolved. The majority of the data stored in XML is in unformatted text blocks that make it impossible to import into Excel. Below is an example for reference (most fields will contain quite a bit more data):
CurrentProblem (fieldname/field description):
- Kidney Failure 1931
- No brainz
dga aah
Too many Waffle House trips
I've looked at some of the past AskMe questions about XML transformation and have tried some of the solutions with no success. If it helps I'm using Windows and can purchase software if necessary. What would be ideal for me would be to extract the data from XML and insert it into a new table so I could use it for a Crystal Report and for further analysis. I'm just at my wits end trying to figure out how to extract it in a usable format.
The task I am hearing here is "I need to transfer some data from one EMR to another EMR". Is the PDF step the only way the other db can import it? Can it read CSV? This is the sort of question where keeping the main task in mind, and knowing who the players are (what is EMR1, and EMR2) might make a huge difference. PDF as a middle layer sounds... unusual... to me.
For transforming the xml into fields, lots of scripting languages can help. Sorry to be vague, but the question is vague!
posted by gregglind at 2:17 PM on September 21, 2010
For transforming the xml into fields, lots of scripting languages can help. Sorry to be vague, but the question is vague!
posted by gregglind at 2:17 PM on September 21, 2010
XSL-FO is designed for custom transformation of XML to formatted print including PDF.
Agreed with gregglind that PDF being a required middle ground would be odd. If I was spec-ing my desires for moving data from one platform to another PDF as a middle man would be near my last choice.
posted by Babblesort at 2:23 PM on September 21, 2010
Agreed with gregglind that PDF being a required middle ground would be odd. If I was spec-ing my desires for moving data from one platform to another PDF as a middle man would be near my last choice.
posted by Babblesort at 2:23 PM on September 21, 2010
This is definitely the kind of thing where you probably have to write a special purpose script, unless of course there is something existing that is specifically designed for this. The problem isn't just formatting XML, it's that you have a mix of XML and table columns. One approach would be something in perl or whatever that would read each row out of the DB and add the Name/DOB/Account stuff to the XML schema and write it all out in a combined XML blob. Then you could use whatever XML tools you want to format that, XSLT/CSS, etc. Or you could do the XML parsing in the script so that you output everything as plain text or HTML or something and then use whatever program you want to create PDF from that. All of this is pretty straightforward to someone that's worked with XML and scripting, so I don't think it would take too long.
posted by Rhomboid at 2:27 PM on September 21, 2010
posted by Rhomboid at 2:27 PM on September 21, 2010
gregglind: "For transforming the xml into fields, lots of scripting languages can help. "
You can do this in postgres itself with the xpath query built-in.
posted by beerbajay at 2:35 PM on September 21, 2010
You can do this in postgres itself with the xpath query built-in.
posted by beerbajay at 2:35 PM on September 21, 2010
Is the XML data in a column of type
posted by nicwolff at 2:41 PM on September 21, 2010
xml
? Recent versions of Postgres offer the xpath()
function with which you can select text out of an XML value.posted by nicwolff at 2:41 PM on September 21, 2010
Not taking anything away from the other suggestions offered above, but based on your description of the tasks at hand I am fairly certain that you could also do this in FileMaker 11 Advanced. (Disclaimer: I am a FileMaker developer and tend to see it as a Swiss Army knife for these sorts of situations; YMMV.) I am not suggesting that FMP11 is the only way to do this, or the best way, or the quickest way (though it may be). But based on the details you provided, I am fairly certain one could do this in FMP11 without too much struggling.
FMP11 will import XML directly, though you will likely need to define an XSL style sheet for field mapping on import. Once the data is in FMP, you can use FMP's native scripting environment to loop through the records and save the output as a series of PDF files (using the "Save Records as PDF" script step.) Output file names can be set dynamically in the scripting environment using variables.
Again, not suggesting this is the only/best/quickest way to do this, but after reading your question I would suggest that this is a user-friendly option. FileMaker Pro 11 is available as a free 30 day trial from FileMaker.com, though I'm not sure if FileMaker Pro Advanced is available in a trial version. You would probably do better with the Advanced version, which has a script debugger, more robust import capabilities, and some other developer-friendly features. FMP 11 Pro vs Advanced features are compared here. AFAIK, the only limitation of the trial version is its 30 day time bomb, so you could download it and poke at it for an hour or two and see if it does what you need it to.
posted by mosk at 2:56 PM on September 21, 2010
FMP11 will import XML directly, though you will likely need to define an XSL style sheet for field mapping on import. Once the data is in FMP, you can use FMP's native scripting environment to loop through the records and save the output as a series of PDF files (using the "Save Records as PDF" script step.) Output file names can be set dynamically in the scripting environment using variables.
Again, not suggesting this is the only/best/quickest way to do this, but after reading your question I would suggest that this is a user-friendly option. FileMaker Pro 11 is available as a free 30 day trial from FileMaker.com, though I'm not sure if FileMaker Pro Advanced is available in a trial version. You would probably do better with the Advanced version, which has a script debugger, more robust import capabilities, and some other developer-friendly features. FMP 11 Pro vs Advanced features are compared here. AFAIK, the only limitation of the trial version is its 30 day time bomb, so you could download it and poke at it for an hour or two and see if it does what you need it to.
posted by mosk at 2:56 PM on September 21, 2010
Echoing statements above, it makes no sense to use PDF as an intermediary format. PDF is a terminal target format, for printing or displaying documents which must look the same on every viewer's computer or printer.
That said, answering the question as stated:
Personally, I would do this in Python using SQLAlchemy (or plain psycopg2) to pull data from the Postgres database, Beautiful Soup to parse the XML (probably), and ReportLab to output PDF reports. Alternately, you could use the csv module to output a .csv file.
posted by sonic meat machine at 3:08 PM on September 21, 2010 [1 favorite]
That said, answering the question as stated:
Personally, I would do this in Python using SQLAlchemy (or plain psycopg2) to pull data from the Postgres database, Beautiful Soup to parse the XML (probably), and ReportLab to output PDF reports. Alternately, you could use the csv module to output a .csv file.
posted by sonic meat machine at 3:08 PM on September 21, 2010 [1 favorite]
Response by poster: Is the PDF step the only way the other db can import it?
Unfortunately, yes. The problem with this data is that none of it is structured, so we couldn't map any of the relevant data to the second EMR even if we wanted to. We had thought at first that we might be able to just import the problem list data without ICD-9 codes but were told by the EMR2 vendor that we have to have certain data for them to map it. Since in most EMRs the problem lists (i.e. active diseases/conditions) are structured so that there's an ICD-9 code, that provider that added it, the onset date, and the problem status with each problem; in this EMR there's just an "Active Problems" section that's a giant unformatted, freetext block. I've actually migrated clinical data from two other EMRs without incident so I was a little surprised when I found out this particular one is a glorified word processor.
Is the XML data in a column of type xml?
No, unfortunately, it's stored as
If I was spec-ing my desires for moving data from one platform to another PDF as a middle man would be near my last choice.
In this case most of the patients already have a chart in EMR2 with the clinical data, our providers just want to have a copy of the "Clinical Summary" from EMR1 for reference.
you can use a program called Prince.
Thanks for this, it looks like it might be what I was looking for.
posted by playertobenamedlater at 3:20 PM on September 21, 2010
Unfortunately, yes. The problem with this data is that none of it is structured, so we couldn't map any of the relevant data to the second EMR even if we wanted to. We had thought at first that we might be able to just import the problem list data without ICD-9 codes but were told by the EMR2 vendor that we have to have certain data for them to map it. Since in most EMRs the problem lists (i.e. active diseases/conditions) are structured so that there's an ICD-9 code, that provider that added it, the onset date, and the problem status with each problem; in this EMR there's just an "Active Problems" section that's a giant unformatted, freetext block. I've actually migrated clinical data from two other EMRs without incident so I was a little surprised when I found out this particular one is a glorified word processor.
Is the XML data in a column of type xml?
No, unfortunately, it's stored as
text
and not xml
.If I was spec-ing my desires for moving data from one platform to another PDF as a middle man would be near my last choice.
In this case most of the patients already have a chart in EMR2 with the clinical data, our providers just want to have a copy of the "Clinical Summary" from EMR1 for reference.
you can use a program called Prince.
Thanks for this, it looks like it might be what I was looking for.
posted by playertobenamedlater at 3:20 PM on September 21, 2010
If you don't want to pay for PrinceXML (and it isn't cheap), there's an excellent open-source project called wkhtmltopdf that uses the WebKit engine to render HTML/CSS input to PDF.
posted by nicwolff at 3:40 PM on September 21, 2010
posted by nicwolff at 3:40 PM on September 21, 2010
What programming languages do you know? This task could be as simple as a Perl oneliner:
This parses some XML like:
and produces output like:
Doing this with the database values is just a minor modification; iterate over the rows, extract the data you need from columns, parse the XML, extract the data from the XML, print what you need (or create a PDF, if that what you really think you need).
Anyway... Simple Matter Of Programming.
posted by jrockway at 4:42 PM on September 21, 2010
perl -MXML::Simple -E '%a=%{XMLin("-")}; say join ",", @{$_}{qw/active inactive allergies .../} for @{$a{whatever}}'
This parses some XML like:
<root>
<whatever>
<active>active 1</active>
<inactive>inactive 1</inactive>
<allergies>kittens</allergies>
<...>...</...>
</whatever>
<whatever>
<active>active 2</active>
<inactive>inactive 2</inactive>
<allergies>dogs</allergies>
<...>...</...>
</whatever>
</root>
and produces output like:
active 1,inactive 1,kittens,...
active 2,inactive 2,kittens,...
Doing this with the database values is just a minor modification; iterate over the rows, extract the data you need from columns, parse the XML, extract the data from the XML, print what you need (or create a PDF, if that what you really think you need).
Anyway... Simple Matter Of Programming.
posted by jrockway at 4:42 PM on September 21, 2010
This thread is closed to new comments.
Prince can be invoked from the command line, so you could write a shell script to invoke it consecutively on each of your XML files (after exporting them from the database, I guess).
So, that's one option: XML + CSS --> PDF.
posted by Paquda at 2:09 PM on September 21, 2010