How to query XML like SQL?
February 15, 2006 5:10 PM   Subscribe

I'm looking for a software tool that will ingest a large XML file, and allow me to write sql-like queries against it for both summary reporting, as well as detailed reports.

I've been tasked with trying to extract some statistical information about data in some XML's we've been generating over the last year and a half. The suggestion from dev is to construct some horrific grep that pulls pre/post lines, and iterating thru that's output with more series of greps.

I've looked into XPath and XQuery, and they both seem to be starts along the road I want, but what I'd really like is something that ingests my XML, and lets me write SQL against it. Ever heard of any such beast out there?

I have two tasks I need to accomplish, one is to get a count of a certain kind of element, where it has a specific value. The second is to get the list of IDs for those elements.
posted by nomisxid to Computers & Internet (27 answers total)
 
SQLXML probably does most of what you want. Not free though.
posted by unSane at 5:15 PM on February 15, 2006


Response by poster: bitrot, what do you think a relational database is, except a way to organize data files?
posted by nomisxid at 5:26 PM on February 15, 2006


Response by poster: unSane, that looks like something useful at first glance, but deeper reading makes it seem that all they are doing is storing the XML in blob fields, and then doing xquery/xslt translations on that. If I knew how to write xslt/xquery well enough to do that, I wouldn't need a DB in the first place. My whole goal is to be able to use SQL, which I already know.
posted by nomisxid at 5:35 PM on February 15, 2006


"we've been generating"

Mj dlwssO ut jylp ypd has bww; dw;wpagu;d gwh XMI, dwg gwh paf rag ghag gwh XMI's dw;wpagwr tpym. Ghag'ii bw wasuwp gy kpycwss.

My guess: if your org has been generating the XML, get the raw data that the XML's generated from. That'll be easier to process.
posted by orthogonality at 5:37 PM on February 15, 2006


Since XML does not generally map to SQL.

Is this really true? I understand any tree or graph structure can be represented by a matrix, and a matrix is a tabular form, which is pretty much what we're talking about manipulating when we're talking about SQL.

In any case, though, I agree with this:

If your XML file happens to be organized in such a way that it easily maps to relational constructs, then you could easily build a SQL importer for your particular XML file.

And suspect this suggestion is going to be the easiest.
posted by weston at 5:51 PM on February 15, 2006


Response by poster: ortho, the goal is to give a report on what WAS sent out, not report on what data SHOULD have been sent out. Yes, I could back-into the results from the source Db, but if management found out, they'd be peeved since it's not what they asked for...


bitrot, forgive my xml-ignorance, but since XML is heirarchical, I don't see how you could create an XML that couldn't be directly translatable to a relational model. Perhaps not efficient, but impossible? I don't buy that. A relational DB doesn't require relations, it just supports them. I can load a dozen completely unrelated data sets into a single DB.
posted by nomisxid at 5:52 PM on February 15, 2006


Relational databases aren't in the slightest bit hierarchical.

Apart from trivially simple scenarios, I'm having trouble thinking of any kind of XML file that maps meaningfully onto a relational model. Unless each XML file is a simple list of values that corresponds directly to a row in a database (and therefore isn't hierarchical), it can't be done.
posted by cillit bang at 6:17 PM on February 15, 2006


More helpfully, I think you should forget about SQL. What you want to do is trivially simple to do with XPath.
posted by cillit bang at 6:26 PM on February 15, 2006


You want XSL. Not even vaguely SQL-like, but it's the right tool for the job.
posted by ook at 6:46 PM on February 15, 2006


I have two tasks I need to accomplish, one is to get a count of a certain kind of element, where it has a specific value. The second is to get the list of IDs for those elements.

Leaving aside the fascinating philosophical discussion we're having -- your XML data may map very well onto SQL, but we don't know, we haven't seen it -- that sounds like a pretty simple job you could do in Perl. You could do it the Proper Way by using XML parsing modules, or you could just do it by brute force pattern matching.

I know you want to do it in SQL, and you could probably import your data into an SQL database and then do your processing in that database, but, by the time you've written some code to put it into your SQL database then, what the hell, you could have written some code to get your answers anyway.

How about you link to, or post, a sample of the data?
posted by AmbroseChapel at 7:12 PM on February 15, 2006


Response by poster: cillit - I said "XML is heirarchical", not relational DBs. That being said, your statements make no sense. I can't think of ANY realworld xml that doesn't map meaningfully to the relational model. The standard example for XML is the card catalog, also a classic relational DB example. I'm curious what XMLs you've seen that don't contain a collection of structured data.


bitrot - You wouldn't end up with one table per element, you'd end up with one table with as many rows as you have occurences of that element-type. Only a new element-type would require a new table.

< data>
< element 1 name="row1">
< column1>value1 < /c1>
< column2>value2 < /c2>
< /e>
< element 2 name="row2">
< c1>value3< /c1>
< c2>value4< /c2>
< /e>
< element 2 name="row1">
< c1> value5 < /c1>
< /e>
< /data>

This directly corresponds to 2 tables. The first has two rows, each with two columns. the second table has one row, with only one column.

ook - XSLTs sound spiffy, but seems a high learning curve for a one-off project, hence the desire to reuse the SQL knowledge I already have.

Thanx all for the input, even the derails were interesting. Looks like MS has a new 'feature' for querying arbitrary datasources, especially XML, via SQL.

http://msdn.microsoft.com/netframework/future/linq/
posted by nomisxid at 7:24 PM on February 15, 2006


Response by poster: argh. that xml snippet got all garbled. it should read

element type="1" name="row1"
element type="1" name="row2"
element type="2" name="row1"

where appropriate =p
posted by nomisxid at 7:27 PM on February 15, 2006


Best answer: Didn't read the original poster's subcomment later: My whole goal is to be able to use SQL, which I already know.

You're trying to drive a screw with a hammer. Get yourself a screwdriver and learn how to use it, you'll save yourself a lot of time.

two tasks... one is to get a count of a certain kind of element, where it has a specific value

[xsl:value-of select="count(//foo[@bar='baz'])" /]
will count all [foo] nodes whose bar attribute is equal to baz. Adjust as necessary for your data.

The second is to get the list of IDs for those elements.
[xsl:for-each select="//foo[@bar='baz']"][xsl:value-of select="@id"], [/xsl:for-each]
posted by ook at 7:29 PM on February 15, 2006


I can't think of ANY realworld xml that doesn't map meaningfully to the relational model.

Wow. That's just... wow. How about every XHTML file on earth?
posted by ook at 7:36 PM on February 15, 2006


Response by poster: ook - is there a way to test CONTAINS or BEGINS-WITH style clauses? IE, counting foo nodes where value begins with 'baz'?
posted by nomisxid at 7:37 PM on February 15, 2006


Best answer: I have to agree that you want XPath, and if you want to do queries that XPath won't support, you want XQuery.

Berkeley DB XML supports XQuery and XPath. I'm not sure how well it works; the only XQuery database I've used is a huge commercial product.
posted by nev at 7:39 PM on February 15, 2006


Best answer: nomisxid: XPath has a contains() and a starts-with(). See the XPath Language Recommendation.
posted by nev at 7:42 PM on February 15, 2006


The XPath equivalents are starts-with() and contains().

The fact that you need those, however, implies that your XML schema is badly in need of a redesign. Which, given that your dev group's suggestion was to use grep to solve this problem, is perhaps not too surprising :)
posted by ook at 7:44 PM on February 15, 2006


Response by poster: ook, isn't xhtml the PRESENTATION of data, not storage of data? It is related to XML, but they are not the same thing, nor used for the same purposes.

That being said, I don't buy it, even for xhtml. Show me your xhtml that isn't a structured document. The whole point of a markup language is to present your data in a non-random order so that it has meaning. Every example I see of xhtml is just reading data from a datasource and presenting it on a webpage.
posted by nomisxid at 7:48 PM on February 15, 2006


Response by poster: thanx nev, I was looking at this stupid page,
http://www.w3schools.com/xpath/xpath_operators.asp

which made it seem like the handful of operators they listed was the definitive list.
posted by nomisxid at 7:51 PM on February 15, 2006


XML is a hierarchy of nodes with types of Text Node, Element Node, and therefore...

Table 1: Documents
[DocumentId | DocumentName | FileName]
1 | My xml document | index.xml]

Table 2: Element Nodes
[DocumentId | NodeId | ParentNodeId | Namespace | ElementName | SiblingNodeOrder]
1 | 1 | null | Element | http://w3.org/... | root | 0
1 | 2 | null | Element | null | paragraph | 0
1 | 3 | null | Element | null | paragraph | 1

Table 3: Attributes
[NodeId | Namespace | Name | Value]
3 | null | style | text-decoration: blink

Table 4: Text Nodes
[NodeId | Value | SiblingNodeOrder]
2 | This is stupid
3 | THE ONLY PEOPLE WHO WOULD TRY TO DO THIS ARE IDIOTS WHO HAVE BEEN RESTING ON THEIR LAURELS AND POSTING ON SLASHDOT FOR THE LAST FIVE YEARS WITHOUT LEARNING WHAT THE XML IS BUT WHY REASON WITH THOSE THICK HEADS THEY PROBABLY THINK VIRTUAL MACHINES ARE OVERKILL AND UNICODE IS A FAD | 0
posted by holloway at 8:07 PM on February 15, 2006


Heh -- this is kind of amusing... I've worked with a couple of different shops while they made the transition from everything-is-a-table kind of thinking to finally "getting" xml, and it's kind of interesting how difficult it is for people on each side of the divide to communicate with each other. The fact that you think of XHTML as a fundamentally different thing from XML is a common fallacy: they're both just data. What they're used for is irrelevant. (Anyway, XHTML was just an example. Very few real-world XML structures are easy to translate into a relational database; if they could be done that way, they would have been in the first place. Try the IMS QTI as a different example if you prefer.)

Show me your xhtml that isn't a structured document.

This word "structured", I do not think it means what you think it means.

The whole point of XML is that it can contain arbitrary structures which are not necessarily easily mapped to a table-like structure. (Yes, in theory it's possible to map anything to anything else. In theory it's possible to hammer in a screw with a hammer, too. But you wouldn't want to.)

To translate XML into SQL-terms, think of the XML tree as encapsulating all the one-to-one and one-to-many relationships you would normally find in a complex relational database, in a single file. Instead of using a foreign key to associate data across tables, you place related nodes inside of other nodes. Yeah, in theory you could normalize any given XML tree into a collection of relational tables that would meaningfully capture all the structure, but it'd be a huge waste of time, because that structure is already inherent in the XML to begin with, in a much simpler format.

That [data][element]...[/element][/data] example you listed above is classic "I don't really understand XML yet" XML... it captures the data, but loses the structure. The structure is the important part.
posted by ook at 8:26 PM on February 15, 2006


ook: good stuff man.

nomisxid: To answer reassure you, ook is giving great advice so listen to him. Xpath, XQuery, and software that understands XML structures are what you should be using. Learn those tools. Trust ook, don't trust yourself.

Don't use string-based parsers such as grep because they can't even reliably answer whether an xhtml page has the word "select" in it... it'll get confused over <select> vs a text node with the word. That may be a slightly contrived example but you'll soon get the idea that trimming the length of text in hyperlinks is very difficult and unreliable with string-based parsers.

And find the person who told you to use grep and violently kick them in the balls. Wait for them in the parking lot, wear a mask, and have an escape route planned. They must not be allowed to breed.
posted by holloway at 8:38 PM on February 15, 2006


I'm sure there are a lot of XML documents which could very easily be translated into an SQL database.

Take this example from W3Schools introduction to XSLT:

http://www.w3schools.com/xsl/cdcatalog.xml

Excerpt:
<catalog>
<cd>
<title>Empire Burlesque</title>
<artist>Bob Dylan</artist>
<country>USA</country>
<company>Columbia</company>
<price>10.90</price>
<year>1985</year>
</cd>
<cd>
<title>Hide your heart</title>
<artist>Bonnie Tyler</artist>
<country>UK</country>
<company>CBS Records</company>
<price>9.90</price>
<year>1988</year>
</cd>
[etc]
<catalog>


It's relationships are so simple that of course you could transfer it into a database in five minutes.

You'd have a "titles" table, a "titles_artists" linking table, a "titles_companies" linking table, a couple more and you're done. Do a join to get all artists on Columbia, or all albums by Bob Dylan, or all albums by Bob Dylan on Columbia and so on.

But that's a deceptively simple example. A CD has a title, an artist and a couple of other things. It's nested one level deep and that's it.

Because of the arbritrary levels of complexity which are possible in XML, you have no way of knowing if any given XML file could be easily translated. What about a file with nesting seventeen levels deep, with attributes at each level? Do you end up with 172 tables to describe all the relationships? Maybe it's 1717 but whatever, it's too many.

You'd either have to use human judgement to decide which relationships were important and discard the rest or end up creating thousands of tables to describe the relationships between different bits of data. XSLT and XPath are obviously much better ways to get at the data matching any given relationship.
posted by AmbroseChapel at 9:27 PM on February 15, 2006


isn't xhtml the PRESENTATION of data, not storage of data? It is related to XML, but they are not the same thing

They're not "related", they're the same thing, or rather, XHTML is XML.

XML is extensible.

You can write your own languages in it.

One of the languages which has been written in XML is XHTML, and yes, it looks a lot like HTML, but that's just to make life easier for everyone.
posted by AmbroseChapel at 9:33 PM on February 15, 2006


It's not SQL, but there are easier ways to do this kind of ad hoc reporting on the contents of XML files.

For example, if you have access to a ruby interpreter (>1.8), there's a pure ruby XML parser included with the standard distribution - REXML.

You can then get what you need with only a few lines of ruby code.

The tutorial for REXML is pretty clear.

This approach is still using XPATH to find the nodes you want, but then you can use general ruby to do the calculations you need.

For example, you could read the appropriate elements into a hash if they match your value, then just take the length off the hash.

Without your XML structure, I can't give you the code for this, but it's almost certainly fewer than 5 lines of code.
posted by Caviar at 9:33 PM on February 15, 2006


Best answer: There's probably enough info here to get the job done, but I thought I'd mention that many ETL (extract transform load) tools can move from xml to a relational database. I found a list open source ETL tools, but I haven't tried them myself. They might quickly get it into the DB where you can use SQL instead of learning something new, if time is short.
posted by RobotAdam at 11:03 PM on February 15, 2006


« Older Government Grants -- A Scam?   |   iPhoto sort by resolution? Newer »
This thread is closed to new comments.