How to query XML like SQL?
February 15, 2006 5:10 PM   Subscribe

I'm looking for a software tool that will ingest a large XML file, and allow me to write sql-like queries against it for both summary reporting, as well as detailed reports.

I've been tasked with trying to extract some statistical information about data in some XML's we've been generating over the last year and a half. The suggestion from dev is to construct some horrific grep that pulls pre/post lines, and iterating thru that's output with more series of greps.

I've looked into XPath and XQuery, and they both seem to be starts along the road I want, but what I'd really like is something that ingests my XML, and lets me write SQL against it. Ever heard of any such beast out there?

I have two tasks I need to accomplish, one is to get a count of a certain kind of element, where it has a specific value. The second is to get the list of IDs for those elements.
posted by nomisxid to Computers & Internet (31 answers total)
 
SQLXML probably does most of what you want. Not free though.
posted by unSane at 5:15 PM on February 15, 2006


XML is a standard way to organize data files, it isn't a relational database. You could implement SQL with an XML data store, but the inverse is not always true.
posted by b1tr0t at 5:16 PM on February 15, 2006


bitrot, what do you think a relational database is, except a way to organize data files?
posted by nomisxid at 5:26 PM on February 15, 2006


unSane, that looks like something useful at first glance, but deeper reading makes it seem that all they are doing is storing the XML in blob fields, and then doing xquery/xslt translations on that. If I knew how to write xslt/xquery well enough to do that, I wouldn't need a DB in the first place. My whole goal is to be able to use SQL, which I already know.
posted by nomisxid at 5:35 PM on February 15, 2006


"we've been generating"

Mj dlwssO ut jylp ypd has bww; dw;wpagu;d gwh XMI, dwg gwh paf rag ghag gwh XMI's dw;wpagwr tpym. Ghag'ii bw wasuwp gy kpycwss.

My guess: if your org has been generating the XML, get the raw data that the XML's generated from. That'll be easier to process.
posted by orthogonality at 5:37 PM on February 15, 2006


A relational database is for organizing data into tables and rows, with relationships between columns in various tables. XML is more general. It provides a tagging structure for data fields, but does not require as much structure as a relational database does. You can use an XML schema, but you don't have to.

If your XML file happens to be organized in such a way that it easily maps to relational constructs, then you could easily build a SQL importer for your particular XML file. Since XML does not generally map to SQL, you won't find a tool that lets you make SQL queries on arbitrary XML files.

An extreme analogy: Ferraris are made of steel and aluminum, but not every pile of metal is a fast sports car. Some piles of metal happen to be Ferraris (or even Ford Focuses), but you cannot assume that all piles of metal are cars. If you have a pile of metal, you might be able to get an expert to make it into a car, but not always (mercury, though metal, isn't good for making cars). Similarly, some XML files might map to the relational model, but not all do. If a given file does happen to map to the relational model, you can get someone to build an importer for you. If the XML does not map to the relational model then you won't be able to query it with SQL.
posted by b1tr0t at 5:41 PM on February 15, 2006


If your data does map to the relational model well, then setup MySQL, define the schema, and write some perl (or whatever your favorite language is) to extract the XML data and insert it into your SQL database. Then you can run SQL queries on it.
posted by b1tr0t at 5:45 PM on February 15, 2006


Since XML does not generally map to SQL.

Is this really true? I understand any tree or graph structure can be represented by a matrix, and a matrix is a tabular form, which is pretty much what we're talking about manipulating when we're talking about SQL.

In any case, though, I agree with this:

If your XML file happens to be organized in such a way that it easily maps to relational constructs, then you could easily build a SQL importer for your particular XML file.

And suspect this suggestion is going to be the easiest.
posted by weston at 5:51 PM on February 15, 2006


ortho, the goal is to give a report on what WAS sent out, not report on what data SHOULD have been sent out. Yes, I could back-into the results from the source Db, but if management found out, they'd be peeved since it's not what they asked for...


bitrot, forgive my xml-ignorance, but since XML is heirarchical, I don't see how you could create an XML that couldn't be directly translatable to a relational model. Perhaps not efficient, but impossible? I don't buy that. A relational DB doesn't require relations, it just supports them. I can load a dozen completely unrelated data sets into a single DB.
posted by nomisxid at 5:52 PM on February 15, 2006


Relational databases aren't in the slightest bit hierarchical.

Apart from trivially simple scenarios, I'm having trouble thinking of any kind of XML file that maps meaningfully onto a relational model. Unless each XML file is a simple list of values that corresponds directly to a row in a database (and therefore isn't hierarchical), it can't be done.
posted by cillit bang at 6:17 PM on February 15, 2006


More helpfully, I think you should forget about SQL. What you want to do is trivially simple to do with XPath.
posted by cillit bang at 6:26 PM on February 15, 2006


bitrot, forgive my xml-ignorance, but since XML is heirarchical, I don't see how you could create an XML that couldn't be directly translatable to a relational model. Perhaps not efficient, but impossible? I don't buy that. A relational DB doesn't require relations, it just supports them. I can load a dozen completely unrelated data sets into a single DB.

Technically, you are correct. You could trivially translate from any XML file to SQL as follows:

For each element in the XML file, create a new table with a single column of the type of the data element. The name of the table is the "path" of tags from the root to the element, along with any uniquely identifying tag elements. Insert the data from the element into the (only) row. Continue.

At the end, you have a valid SQL database, but you can't make any useful queries on it.
posted by b1tr0t at 6:35 PM on February 15, 2006


You want XSL. Not even vaguely SQL-like, but it's the right tool for the job.
posted by ook at 6:46 PM on February 15, 2006


I have two tasks I need to accomplish, one is to get a count of a certain kind of element, where it has a specific value. The second is to get the list of IDs for those elements.

Leaving aside the fascinating philosophical discussion we're having -- your XML data may map very well onto SQL, but we don't know, we haven't seen it -- that sounds like a pretty simple job you could do in Perl. You could do it the Proper Way by using XML parsing modules, or you could just do it by brute force pattern matching.

I know you want to do it in SQL, and you could probably import your data into an SQL database and then do your processing in that database, but, by the time you've written some code to put it into your SQL database then, what the hell, you could have written some code to get your answers anyway.

How about you link to, or post, a sample of the data?
posted by AmbroseChapel at 7:12 PM on February 15, 2006


cillit - I said "XML is heirarchical", not relational DBs. That being said, your statements make no sense. I can't think of ANY realworld xml that doesn't map meaningfully to the relational model. The standard example for XML is the card catalog, also a classic relational DB example. I'm curious what XMLs you've seen that don't contain a collection of structured data.


bitrot - You wouldn't end up with one table per element, you'd end up with one table with as many rows as you have occurences of that element-type. Only a new element-type would require a new table.

< data>
< element 1 name="row1">
< column1>value1 < /c1>
< column2>value2 < /c2>
< /e>
< element 2 name="row2">
< c1>value3< /c1>
< c2>value4< /c2>
< /e>
< element 2 name="row1">
< c1> value5 < /c1>
< /e>
< /data>

This directly corresponds to 2 tables. The first has two rows, each with two columns. the second table has one row, with only one column.

ook - XSLTs sound spiffy, but seems a high learning curve for a one-off project, hence the desire to reuse the SQL knowledge I already have.

Thanx all for the input, even the derails were interesting. Looks like MS has a new 'feature' for querying arbitrary datasources, especially XML, via SQL.

http://msdn.microsoft.com/netframework/future/linq/
posted by nomisxid at 7:24 PM on February 15, 2006


argh. that xml snippet got all garbled. it should read

element type="1" name="row1"
element type="1" name="row2"
element type="2" name="row1"

where appropriate =p
posted by nomisxid at 7:27 PM on February 15, 2006


Didn't read the original poster's subcomment later: My whole goal is to be able to use SQL, which I already know.

You're trying to drive a screw with a hammer. Get yourself a screwdriver and learn how to use it, you'll save yourself a lot of time.

two tasks... one is to get a count of a certain kind of element, where it has a specific value

[xsl:value-of select="count(//foo[@bar='baz'])" /]
will count all [foo] nodes whose bar attribute is equal to baz. Adjust as necessary for your data.

The second is to get the list of IDs for those elements.
[xsl:for-each select="//foo[@bar='baz']"][xsl:value-of select="@id"], [/xsl:for-each]
posted by ook at 7:29 PM on February 15, 2006


I can't think of ANY realworld xml that doesn't map meaningfully to the relational model.

Wow. That's just... wow. How about every XHTML file on earth?
posted by ook at 7:36 PM on February 15, 2006


ook - is there a way to test CONTAINS or BEGINS-WITH style clauses? IE, counting foo nodes where value begins with 'baz'?
posted by nomisxid at 7:37 PM on February 15, 2006


I have to agree that you want XPath, and if you want to do queries that XPath won't support, you want XQuery.

Berkeley DB XML supports XQuery and XPath. I'm not sure how well it works; the only XQuery database I've used is a huge commercial product.
posted by nev at 7:39 PM on February 15, 2006


nomisxid: XPath has a contains() and a starts-with(). See the XPath Language Recommendation.
posted by nev at 7:42 PM on February 15, 2006


The XPath equivalents are starts-with() and contains().

The fact that you need those, however, implies that your XML schema is badly in need of a redesign. Which, given that your dev group's suggestion was to use grep to solve this problem, is perhaps not too surprising :)
posted by ook at 7:44 PM on February 15, 2006


ook, isn't xhtml the PRESENTATION of data, not storage of data? It is related to XML, but they are not the same thing, nor used for the same purposes.

That being said, I don't buy it, even for xhtml. Show me your xhtml that isn't a structured document. The whole point of a markup language is to present your data in a non-random order so that it has meaning. Every example I see of xhtml is just reading data from a datasource and presenting it on a webpage.
posted by nomisxid at 7:48 PM on February 15, 2006


thanx nev, I was looking at this stupid page,
http://www.w3schools.com/xpath/xpath_operators.asp

which made it seem like the handful of operators they listed was the definitive list.
posted by nomisxid at 7:51 PM on February 15, 2006


XML is a hierarchy of nodes with types of Text Node, Element Node, and therefore...

Table 1: Documents
[DocumentId | DocumentName | FileName]
1 | My xml document | index.xml]

Table 2: Element Nodes
[DocumentId | NodeId | ParentNodeId | Namespace | ElementName | SiblingNodeOrder]
1 | 1 | null | Element | http://w3.org/... | root | 0
1 | 2 | null | Element | null | paragraph | 0
1 | 3 | null | Element | null | paragraph | 1

Table 3: Attributes
[NodeId | Namespace | Name | Value]
3 | null | style | text-decoration: blink

Table 4: Text Nodes
[NodeId | Value | SiblingNodeOrder]
2 | This is stupid
3 | THE ONLY PEOPLE WHO WOULD TRY TO DO THIS ARE IDIOTS WHO HAVE BEEN RESTING ON THEIR LAURELS AND POSTING ON SLASHDOT FOR THE LAST FIVE YEARS WITHOUT LEARNING WHAT THE XML IS BUT WHY REASON WITH THOSE THICK HEADS THEY PROBABLY THINK VIRTUAL MACHINES ARE OVERKILL AND UNICODE IS A FAD | 0
posted by holloway at 8:07 PM on February 15, 2006


Heh -- this is kind of amusing... I've worked with a couple of different shops while they made the transition from everything-is-a-table kind of thinking to finally "getting" xml, and it's kind of interesting how difficult it is for people on each side of the divide to communicate with each other. The fact that you think of XHTML as a fundamentally different thing from XML is a common fallacy: they're both just data. What they're used for is irrelevant. (Anyway, XHTML was just an example. Very few real-world XML structures are easy to translate into a relational database; if they could be done that way, they would have been in the first place. Try the IMS QTI as a different example if you prefer.)

Show me your xhtml that isn't a structured document.

This word "structured", I do not think it means what you think it means.

The whole point of XML is that it can contain arbitrary structures which are not necessarily easily mapped to a table-like structure. (Yes, in theory it's possible to map anything to anything else. In theory it's possible to hammer in a screw with a hammer, too. But you wouldn't want to.)

To translate XML into SQL-terms, think of the XML tree as encapsulating all the one-to-one and one-to-many relationships you would normally find in a complex relational database, in a single file. Instead of using a foreign key to associate data across tables, you place related nodes inside of other nodes. Yeah, in theory you could normalize any given XML tree into a collection of relational tables that would meaningfully capture all the structure, but it'd be a huge waste of time, because that structure is already inherent in the XML to begin with, in a much simpler format.

That [data][element]...[/element][/data] example you listed above is classic "I don't really understand XML yet" XML... it captures the data, but loses the structure. The structure is the important part.
posted by ook at 8:26 PM on February 15, 2006


ook: good stuff man.

nomisxid: To answer reassure you, ook is giving great advice so listen to him. Xpath, XQuery, and software that understands XML structures are what you should be using. Learn those tools. Trust ook, don't trust yourself.

Don't use string-based parsers such as grep because they can't even reliably answer whether an xhtml page has the word "select" in it... it'll get confused over <select> vs a text node with the word. That may be a slightly contrived example but you'll soon get the idea that trimming the length of text in hyperlinks is very difficult and unreliable with string-based parsers.

And find the person who told you to use grep and violently kick them in the balls. Wait for them in the parking lot, wear a mask, and have an escape route planned. They must not be allowed to breed.
posted by holloway at 8:38 PM on February 15, 2006


I'm sure there are a lot of XML documents which could very easily be translated into an SQL database.

Take this example from W3Schools introduction to XSLT:

http://www.w3schools.com/xsl/cdcatalog.xml

Excerpt:
<catalog>
<cd>
<title>Empire Burlesque</title>
<artist>Bob Dylan</artist>
<country>USA</country>
<company>Columbia</company>
<price>10.90</price>
<year>1985</year>
</cd>
<cd>
<title>Hide your heart</title>
<artist>Bonnie Tyler</artist>
<country>UK</country>
<company>CBS Records</company>
<price>9.90</price>
<year>1988</year>
</cd>
[etc]
<catalog>


It's relationships are so simple that of course you could transfer it into a database in five minutes.

You'd have a "titles" table, a "titles_artists" linking table, a "titles_companies" linking table, a couple more and you're done. Do a join to get all artists on Columbia, or all albums by Bob Dylan, or all albums by Bob Dylan on Columbia and so on.

But that's a deceptively simple example. A CD has a title, an artist and a couple of other things. It's nested one level deep and that's it.

Because of the arbritrary levels of complexity which are possible in XML, you have no way of knowing if any given XML file could be easily translated. What about a file with nesting seventeen levels deep, with attributes at each level? Do you end up with 172 tables to describe all the relationships? Maybe it's 1717 but whatever, it's too many.

You'd either have to use human judgement to decide which relationships were important and discard the rest or end up creating thousands of tables to describe the relationships between different bits of data. XSLT and XPath are obviously much better ways to get at the data matching any given relationship.
posted by AmbroseChapel at 9:27 PM on February 15, 2006


isn't xhtml the PRESENTATION of data, not storage of data? It is related to XML, but they are not the same thing

They're not "related", they're the same thing, or rather, XHTML is XML.

XML is extensible.

You can write your own languages in it.

One of the languages which has been written in XML is XHTML, and yes, it looks a lot like HTML, but that's just to make life easier for everyone.
posted by AmbroseChapel at 9:33 PM on February 15, 2006


It's not SQL, but there are easier ways to do this kind of ad hoc reporting on the contents of XML files.

For example, if you have access to a ruby interpreter (>1.8), there's a pure ruby XML parser included with the standard distribution - REXML.

You can then get what you need with only a few lines of ruby code.

The tutorial for REXML is pretty clear.

This approach is still using XPATH to find the nodes you want, but then you can use general ruby to do the calculations you need.

For example, you could read the appropriate elements into a hash if they match your value, then just take the length off the hash.

Without your XML structure, I can't give you the code for this, but it's almost certainly fewer than 5 lines of code.
posted by Caviar at 9:33 PM on February 15, 2006


There's probably enough info here to get the job done, but I thought I'd mention that many ETL (extract transform load) tools can move from xml to a relational database. I found a list open source ETL tools, but I haven't tried them myself. They might quickly get it into the DB where you can use SQL instead of learning something new, if time is short.
posted by RobotAdam at 11:03 PM on February 15, 2006


« Older Government Grants -- A Scam?   |   iPhoto sort by resolution? Newer »
This thread is closed to new comments.