Resources for creating a good, extensible file format
September 4, 2009 10:35 PM
My team and I are writing a computer application for which aspects of the file format are very likely to change in future releases. The format is in XML. Does anyone know any resources on the web that lay down general principles about how to make a file format that is as extensible as possible for the future (and minimizes the possibility of cross-version breakage)?
1) Include a filespec version number in the file itself. Don't try to do this based on magic numbers or format analysis or whatever. Just add a <formatversion>1.0</formatversion> element. Or, add a version attribute to the document element.
2) Write a conversion tool or module.
So, you have a v1.0 file spec. You write a module to read/write it, and you name it FileServicesV1. Then, for your v2.0 release, you do not change the FileServicesV1 module, but rather add a FileServicesV2 module. Then, when you encounter a v1.0 file, you pull it in with FileServicesV1 into your internal representation. When it comes time to save it, output it with your FileServicesV2 module.
You can also make this an external tool, or a Save As... with a "some formatting will not be saved" kind of message, if you want to give your users the option.
posted by Netzapper at 10:57 PM on September 4, 2009
IBM.com has some good documents on "Principles of XML Design" you can Google.
posted by lsemel at 12:19 AM on September 5, 2009
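Netzapper's module-per-version suggestion above is easy to sketch in Python with the standard library's xml.etree.ElementTree. The element names, the reader functions, and the dict-based document representation below are invented for illustration; only the dispatch-on-version idea comes from the comment.

    # Sketch of the "one reader module per format version" approach.
    # Element names, reader functions, and the document dict are hypothetical.
    import xml.etree.ElementTree as ET

    def read_v1(root):
        # v1.0 stored the title as a child element
        return {"title": root.findtext("title", default=""), "tags": []}

    def read_v2(root):
        # v2.0 moved the title to an attribute and added <tag> elements
        return {
            "title": root.get("title", ""),
            "tags": [t.text for t in root.findall("tag")],
        }

    READERS = {"1.0": read_v1, "2.0": read_v2}
    CURRENT_VERSION = "2.0"

    def load(path):
        root = ET.parse(path).getroot()
        version = root.get("version", "1.0")  # version attribute on the root element
        reader = READERS.get(version)
        if reader is None:
            raise ValueError("unsupported format version: " + version)
        return reader(root)

    def save(doc, path):
        # Always write the newest format, no matter which version was loaded.
        root = ET.Element("document", version=CURRENT_VERSION, title=doc["title"])
        for tag in doc["tags"]:
            ET.SubElement(root, "tag").text = tag
        ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

Old readers stay frozen once a version ships; only the newest writer ever changes.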
I've done what Netzapper suggests time and time again. Some things I do anytime I'm designing a file or protocol format:
1) Include a version number at the start. For example, Apple's XML plist format has the format version as an attribute on the root element. And modularize your parser so that you can easily handle loading multiple versions.
2) You're using XML, so create a namespace for your elements. The namespace should include some version too... W3C likes dates. This will greatly help if you want to later add standard elements from other namespaces (OOo perhaps).
3) Don't store unsigned integers. Some languages, like Java, only have signed types. And some, like PHP, handle signed overflow poorly. If you need to store an unsigned 32-bit value, write the spec with a signed 64-bit value.
4) Use a good date format. It's tempting on Linux to use the Unix epoch, but the problem is that different systems have different epochs. I prefer RFC 1123-style dates ("Sun, 06 Nov 1994 08:49:37 GMT"). Since they're widely used in HTTP, most systems have a library that can correctly parse and generate them.
5) Anytime you store a hash, pass along some type of identifier for what hash type it is. For example, for MD5 you might store the string "<hash type="md5">e19c1283c925b3206685ff522acfe3e6</hash>". Recent years have seen some popular hashes fall, and having some extra agility in this respect helps.
6) Plan for Unicode support if you haven't already! Actually, XML helps a lot in this area because it should be easy to write documents in different character sets. The problem is with your parser. But it pays now to know how you're going to handle this problem. For Linux I'd choose UTF-8. I think on OS X and Windows UTF-16 is much more popular. For the love of God, don't choose ASCII and encode all "special characters" as entities.
7) This should be an obvious one, but many people fail at it: put all your document structure in the XML. I've seen examples where people move from CSV to XML (because it sounds more modern) and then end up with stuff like this:
<root>
  <row>value1a,value2a,value3a</row>
  <row>value1b,value2b,value3b</row>
</root>
That's all I can think of for now. But it should future-proof your format to a good extent.
posted by sbutler at 2:04 AM on September 5, 2009
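A minimal sketch of points (1), (2), (4), and (5) from sbutler's list, using only Python's standard library. The namespace URI, element names, and values are invented for illustration:

    # (1) version attribute, (2) versioned namespace, (4) RFC 1123 dates,
    # (5) hashes tagged with their algorithm.
    import hashlib
    from datetime import datetime, timezone
    from email.utils import format_datetime, parsedate_to_datetime
    import xml.etree.ElementTree as ET

    NS = "http://example.com/myapp/2009/09"   # versioned, date-style namespace
    ET.register_namespace("", NS)             # serialize it as the default namespace

    root = ET.Element("{%s}project" % NS, version="1.0")   # version on the root

    # RFC 1123 / RFC 2822 dates round-trip with a standard library
    created = ET.SubElement(root, "{%s}created" % NS)
    created.text = format_datetime(datetime.now(timezone.utc), usegmt=True)
    when = parsedate_to_datetime(created.text)

    # hashes always carry an algorithm identifier
    payload = b"hello world"
    digest = ET.SubElement(root, "{%s}hash" % NS, type="sha256")
    digest.text = hashlib.sha256(payload).hexdigest()

    print(ET.tostring(root, encoding="unicode"))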
Ohh... and (8) resist the temptation to roll your own XML reader or writer. There are many, many good XML parsers out there. And for writing, choose something that makes you use the DOM model for generating a document, serializing at the end (assuming your document isn't too large to be represented in memory).
A good DOM representation takes more code than simple print statements, but it also makes sure your document will end up well formed, with special characters escaped and all the mandatory parts nicely in place. Trust me, it's worth the extra work.
I've been doing a lot of work recently with libxml2 via Perl, and it's been surprisingly pleasant.
posted by sbutler at 2:09 AM on September 5, 2009
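To illustrate the escaping point, a small sketch with Python's ElementTree (the element name and text are made up): building the tree and serializing at the end escapes special characters automatically, where a naive print statement emits malformed XML for the same data.

    # Tree-build-then-serialize handles escaping for you.
    import xml.etree.ElementTree as ET

    note = ET.Element("note")
    note.text = 'Ben & Jerry < "Häagen-Dazs"'   # & and < must be escaped

    print(ET.tostring(note, encoding="unicode"))
    # <note>Ben &amp; Jerry &lt; "Häagen-Dazs"</note>

    # The hand-rolled equivalent silently produces malformed XML:
    print("<note>" + 'Ben & Jerry < "Häagen-Dazs"' + "</note>")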
What sbutler said about using a prebuilt parser. I've spent a fair bit of time swearing at a tool I'm forced to use whose data importer uses a hand-rolled XML not-really-a-parser, and it's absolutely shitful - it does stuff like quietly failing if the sub-elements of a STUDENT element occur in an unexpected order. Grrrrrr.
posted by flabdablet at 8:03 AM on September 5, 2009
Nthing the advice to use library XML reading/writing functionality; that's the real reason XML is preferable to most other formats these days: you don't have to waste time rolling your own parsing logic.
Also, make sure that any app reading the file does not make any more assumptions about the file content than it has to. For example, if it just needs to read one element out of the file, it should use something like XPath (xml.Read("someNode/otherNode")) rather than, say, reading all of the nodes serially and then getting the 11th one.
posted by burnmp3s at 8:36 AM on September 5, 2009
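A concrete version of that suggestion, using the limited XPath support in Python's ElementTree; the sample document is invented. The path-based lookup keeps working when siblings are reordered or new elements are added, while the positional lookup silently breaks:

    # Look elements up by name/path, not by position.
    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <project version="1.0">
      <settings>
        <author>someone@example.com</author>
        <title>My Project</title>
      </settings>
    </project>
    """)

    # Fragile: depends on every element keeping its exact position.
    title_by_position = doc[0][1].text

    # Robust: survives reordering and newly inserted siblings.
    title_by_path = doc.findtext("settings/title")

    print(title_by_position, title_by_path)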
XML doesn't really solve any problems. It's just syntax. However, it gives you a very useful tool: tag names. Structure your data well with tag names that make semantic sense and are well defined. Don't ever change the meaning of a tag. For example, say you have <widgetColor>#8080ff</widgetColor> as a tag in your document, with the color being an RGB triple. If you change your app later to support an alpha channel, so that colours are now 4-tuples, go ahead and rename the tag to something like widgetColorWithAlpha. That will make format changes more explicit.
Automated tests are a great way to make sure your new code works right with old file formats. Invest the time in the beginning to build a good test harness + suite for reading files. Never ever delete old test files; make sure your code always interprets the old ones the right way.
posted by Nelson at 8:58 AM on September 5, 2009
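One way to build the harness Nelson describes: keep a frozen sample file for every format version ever shipped and run the current loader over all of them. A pytest-style sketch, where the load() function, the fixture directory, and the asserted fields are all hypothetical:

    # Regression tests over frozen sample files, one or more per released
    # format version. load() and the fixture layout are hypothetical.
    from pathlib import Path
    import pytest

    from myapp.fileformat import load   # hypothetical loader

    FIXTURES = sorted(Path("tests/fixtures").glob("*.xml"))

    @pytest.mark.parametrize("path", FIXTURES, ids=lambda p: p.name)
    def test_old_files_still_load(path):
        doc = load(path)
        # Every version of the format must yield the same core fields.
        assert doc["title"]
        assert "created" in doc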
This thread is closed to new comments.