How to split a large XML file into more manageable pieces?
January 30, 2015 3:33 PM

I have a large XML file with 30+ million elements. For reasons, I'd like to break it down into more manageable chunks, say like 1 million elements per file or 30 files or something like that. Are there any utilities that can do this?
posted by disaster77 to Computers & Internet (10 answers total) 4 users marked this as a favorite
 
Do you mean one million transactions per file? I'm not sure what you mean by elements, but if you want n lines per file, you can use the UNIX utility split to do it.

Some examples
posted by viramamunivar at 3:53 PM on January 30, 2015


It would help to know more about the structure of the document. If those 30 million elements are wrapped in a collection of parent elements, then you need a parser that understands XML so that tags get closed properly and the like.

If it just so happens that you have one closed element per line, then something super simple like split can totally do this. If not, well, you're scripting an XML parser.
posted by mce at 3:58 PM on January 30, 2015


This is probably 2-20 lines of Perl for someone who knows what they're doing.
posted by Doofus Magoo at 4:18 PM on January 30, 2015 [1 favorite]


I came to echo Doofus. Perl has a nice interface to libxml which can make it pretty easy, if you have someone with a grasp of Perl. Nodes are addressed with a simple nodepath, and you'd just need to pick the level at which to make the splits.
posted by anadem at 4:31 PM on January 30, 2015


How complicated this gets depends a lot on whether you can fit the whole file into memory or not.
posted by dilaudid at 5:42 PM on January 30, 2015 [1 favorite]


Depending on the way it's formatted and how regular the data is, you might not need a real parser, and you might not need to read it all into memory. For instance, a file like this will frequently have a prologue and epilogue of a known number of lines, with all of the records in between taking a fixed number of lines each.
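
A minimal sketch of that line-based approach in Python, under some loud assumptions: the prologue is a known number of lines, the epilogue text is known in advance, and every record takes the same number of lines. The function name split_records and all of its parameters are made up for illustration. It reads the input once and never holds more than one chunk in memory.

```python
from collections import deque
from itertools import islice


def split_records(lines, header_count, footer, lines_per_record, records_per_chunk):
    """Yield lists of lines, each one header + whole records + footer.

    lines: any iterable of lines, consumed once (for example an open file).
    header_count: number of prologue lines (assumed known).
    footer: list of epilogue lines (assumed known in advance).
    """
    it = iter(lines)
    header = list(islice(it, header_count))
    body_limit = lines_per_record * records_per_chunk
    held = deque()  # holds back the last len(footer) lines so the original epilogue never lands in a body
    body = []
    for line in it:
        held.append(line)
        if len(held) > len(footer):
            body.append(held.popleft())
            if len(body) == body_limit:
                yield header + body + footer
                body = []
    if body:
        yield header + body + footer
```

Each yielded list can be written straight to its own output file with writelines. If the records are not actually regular, this will happily cut an element in half, so it is only worth trying when the file is known to be machine-generated in a fixed layout.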
posted by Horselover Fat at 6:43 PM on January 30, 2015


You don't want to pull that whole mess into memory, so you want to use a stream-oriented parser like Expat, which will parse XML as it's read and call functions that have been registered for various events, e.g. when an element is closed. Here's someone's Gist of a Python script that uses xml.parsers.expat to do exactly what you're after.
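
To give a feel for the event-driven style, here is a small sketch using Python's xml.parsers.expat. It is not the linked Gist, and the function name count_closed is hypothetical. It registers a handler that fires whenever an element is closed and counts the ones with a given name, reading the input in pieces rather than all at once.

```python
import xml.parsers.expat


def count_closed(fileobj, tag):
    """Count elements named `tag` by listening for expat's end-of-element events.

    fileobj must be a binary file-like object; expat reads it in chunks.
    """
    counts = {"closed": 0}

    def end_element(name):
        # Called by the parser each time a closing tag (or self-closing tag) is seen.
        if name == tag:
            counts["closed"] += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.EndElementHandler = end_element
    parser.ParseFile(fileobj)
    return counts["closed"]
```

For example, count_closed(open("books.xml", "rb"), "book") would walk the whole file without building a tree. A splitter follows the same pattern, except the handlers write output and start a new file every so many closed elements.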
posted by nicwolff at 7:09 PM on January 30, 2015


Actually, that Gist is not that well written. I can do better than that: save this Gist as XML_breaker.py, then run e.g.
python XML_breaker.py books.xml book 1000
to split books.xml into files books1.xml, books2.xml, &c, each containing at most 1000 <book> elements. This should work on any size file.
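
For readers who can't get at the Gist, here is a rough sketch of the same idea. It is not the Gist's code: it swaps in xml.etree.ElementTree.iterparse from the standard library, and the function name split_xml and its parameters are invented for illustration. It assumes the elements being counted are direct children of the root, and it drops any other children of the root.

```python
import xml.etree.ElementTree as ET


def split_xml(source, tag, per_file, prefix):
    """Write prefix1.xml, prefix2.xml, ... with at most per_file <tag> elements each.

    Assumes the <tag> elements are direct children of the root element.
    Returns the list of file names written.
    """
    written = []
    batch = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    root_tag, root_attrib = root.tag, dict(root.attrib)

    def flush():
        # Wrap the current batch in a copy of the root element and write it out.
        wrapper = ET.Element(root_tag, root_attrib)
        wrapper.extend(batch)
        name = f"{prefix}{len(written) + 1}.xml"
        ET.ElementTree(wrapper).write(name, encoding="utf-8", xml_declaration=True)
        written.append(name)
        batch.clear()

    depth = 1  # we are now inside the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 1:  # a direct child of the root has just been closed
            if elem.tag == tag:
                batch.append(elem)
            root.clear()  # detach finished children so memory use stays flat
            if len(batch) == per_file:
                flush()
    if batch:
        flush()
    return written
```

Calling split_xml("books.xml", "book", 1000, "books") would mirror the command above. Only one batch of elements is held in memory at a time. Namespaced documents would need the tag given in ElementTree's {uri}name form.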
posted by nicwolff at 9:29 PM on January 30, 2015 [1 favorite]


To follow up on nicwolff's comment, you specifically want your XML parser to be a SAX parser, which processes the XML as a stream, handling elements as they come in. The other main type of parser follows the DOM model, where the entire XML document is loaded into memory before being processed.

You don't want DOM parsing for a dataset that has 30+ million elements. For one, you'll be waiting for the entire document to be loaded into memory before any work gets done, and for another, this only works at all if you have enough memory to hold it all.

A SAX parser can be set up to trigger an event handler or function when a particular element is found and processed. In addition to requiring fewer resources, events can be handled in parallel (unless you need to preserve ordering). You might have a pool of threads that write out elements until some number have been handled.
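
As an illustration of the handler idea, here is a minimal single-threaded sketch using Python's built-in xml.sax rather than libxml. The class name ElementTextHandler and the helper element_texts are hypothetical. The handler calls a function of your choosing every time a particular element is closed, passing along the text it contained, and it assumes the target element is not nested inside itself.

```python
import io
import xml.sax


class ElementTextHandler(xml.sax.ContentHandler):
    """Call callback(text) each time an element named `tag` is closed."""

    def __init__(self, tag, callback):
        super().__init__()
        self.tag = tag
        self.callback = callback
        self.parts = None  # None while we are outside the target element

    def startElement(self, name, attrs):
        if name == self.tag:
            self.parts = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them.
        if self.parts is not None:
            self.parts.append(content)

    def endElement(self, name):
        if name == self.tag and self.parts is not None:
            self.callback("".join(self.parts))
            self.parts = None


def element_texts(xml_bytes, tag):
    """Return the text of every <tag> element, in document order."""
    found = []
    xml.sax.parse(io.BytesIO(xml_bytes), ElementTextHandler(tag, found.append))
    return found
```

The callback here just appends to a list, but it could equally hand the work to a queue that worker threads drain, which is where the parallelism would come in.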

If you have some experience with C and you will need to do this work often, you might learn a bit about libxml, which is generally the fastest and lowest-memory SAX parser out there. There's a bit of a learning curve, but it pays off if this is a recurring job.
posted by a lungful of dragon at 12:23 AM on January 31, 2015 [2 favorites]


I would call what you want "xml split", and if you Google for that phrase you will find a bunch of utilities that do it. This StackExchange answer has a good overview of a few of them.
posted by Nelson at 3:41 AM on January 31, 2015

