What's the first step to extracting data from an unknown binary format?
October 3, 2006 4:20 PM
I have a string of binary data that's well under 1k in length (so fairly short) from a proprietary source that has no documented information on its particular format. I have a general idea of what it might contain (like a month, day, year and unique id, maybe more) but no idea where to start if I wanted to discern the format and other contents of this data. I've tried googling around but I just can't come up with the right description of my task to find anything useful. Where do I start?
There's a command in UNIX, and thus on Mac OS X, called "strings":
$ strings -filename-
posted by vacapinta at 4:36 PM on October 3, 2006
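If you'd rather script it, or you're on a machine without strings, here is a rough Python sketch of the same idea: pull out runs of printable ASCII and report where they start. The function name and the 4-character minimum are my own choices for illustration, and the sample bytes are made up.

```python
import re

def printable_runs(data: bytes, min_len: int = 4) -> list[tuple[int, str]]:
    """Return (offset, text) for each run of printable ASCII at least min_len long."""
    # 0x20-0x7e is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]

# Made-up sample: two readable runs separated by binary junk ("ab" is too short to report).
sample = b"\x01\x02HELLO\x00\x07ab\x00record-42\xff"
for offset, text in printable_runs(sample):
    print(f"{offset:4d}  {text}")
```

The offsets are useful later: once you know where the readable text sits, the bytes just before it are good candidates for length or type fields.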
My approach depends on knowing a lot about what the device does. If you know what should be in the data, it's easier to find it.
You'll need a hex editor and either the actual device that made it or access to the device that made the record.
A good technique is to make more than one data record. After changing ONE thing in the data, look for differences. Repeat this process, and before long, you'll associate changes with regions. Eventually, it MIGHT yield its secrets. It's a slow process and there are no guarantees.
A good hex editor is essential. I use WinHex 10.6 by Stefan Fleischmann. Great software.
One other little detail... don't give up. When you are out of ideas, let it sit for a few days and go back to it. Plan on a long project.
posted by FauxScot at 4:45 PM on October 3, 2006
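The "change ONE thing and look for differences" step is easy to automate. A minimal sketch in Python, assuming you have two records of the same length; the function names and the two sample records are invented for illustration.

```python
def diff_offsets(a: bytes, b: bytes) -> list[tuple[int, int, int]]:
    """Return (offset, byte_in_a, byte_in_b) for each position where a and b differ."""
    # zip stops at the shorter input, so compare lengths separately if they differ.
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

def group_ranges(offsets: list[int]) -> list[tuple[int, int]]:
    """Collapse sorted offsets into inclusive (start, end) runs of adjacent bytes."""
    ranges: list[tuple[int, int]] = []
    for off in offsets:
        if ranges and off == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], off)
        else:
            ranges.append((off, off))
    return ranges

# Two hypothetical records captured before and after changing one setting.
record_a = bytes.fromhex("01000a03d607ff")
record_b = bytes.fromhex("01000a04d607fe")

diffs = diff_offsets(record_a, record_b)
for offset, x, y in diffs:
    print(f"offset {offset}: {x:02x} -> {y:02x}")
print(group_ranges([offset for offset, _, _ in diffs]))
```

A region that changes along with the thing you changed is probably that field. A region near the end that changes every time, whatever you touched, is a candidate for a checksum or counter.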
I hope you are aware that things like 'month' and 'day' are often not stored explicitly; instead, a 32-bit number called Unix time is stored. This represents the number of seconds since Jan 1, 1970 (UTC).
You will have to scan each group of 4 bytes and see if you can find sensible dates. Also you may have to do some byte swapping if there may be endian differences.
posted by MonkeySaltedNuts at 5:16 PM on October 3, 2006
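Scanning every 4-byte window in both byte orders is tedious by hand but short in code. A sketch in Python, assuming unsigned 32-bit Unix time and a date window you pick yourself; the function name, the window, and the sample record are all made up for the example.

```python
import struct
from datetime import datetime, timezone

def find_timestamps(data: bytes, earliest: datetime, latest: datetime):
    """Return (offset, byte_order, datetime) for each 4-byte window that
    decodes to a Unix time between earliest and latest."""
    lo, hi = int(earliest.timestamp()), int(latest.timestamp())
    hits = []
    for offset in range(len(data) - 3):
        # Try both byte orders, since the endianness is unknown.
        for label, fmt in (("little", "<I"), ("big", ">I")):
            (value,) = struct.unpack_from(fmt, data, offset)
            if lo <= value <= hi:
                when = datetime.fromtimestamp(value, tz=timezone.utc)
                hits.append((offset, label, when))
    return hits

# Made-up record: two junk bytes, a little-endian timestamp, two padding bytes.
stamp = datetime(2006, 10, 3, 16, 20, tzinfo=timezone.utc)
record = b"\xaa\xbb" + struct.pack("<I", int(stamp.timestamp())) + b"\x00\x00"

window = (datetime(2000, 1, 1, tzinfo=timezone.utc),
          datetime(2010, 1, 1, tzinfo=timezone.utc))
for offset, order, when in find_timestamps(record, *window):
    print(offset, order, when.isoformat())
```

Keep the window tight. With a wide range, plenty of random 4-byte values will look like plausible dates, so a hit is a lead to check against a second record, not proof.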
It would vastly help you if you had more than one such string. Then you'd have an easier time looking at differences. Especially if it has a CRC number in it (if it does, you can't change its bits and see what the change does... if it doesn't have a checksum, you can change things and note where the differences pop up).
posted by lundman at 5:43 PM on October 3, 2006
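One way to test for a checksum is to guess an algorithm and see whether any field matches. The sketch below, in Python, only tries one guess: a standard CRC-32 (the zlib one) computed over everything that precedes the field. That is purely an assumption; a real format might use CRC-16, a different polynomial, a plain byte sum, or cover a different range. The function name and sample record are invented.

```python
import struct
import zlib

def find_crc32_fields(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, byte_order) wherever a 4-byte field equals the
    zlib CRC-32 of all the bytes before it."""
    hits = []
    for offset in range(1, len(data) - 3):
        expected = zlib.crc32(data[:offset])
        for label, fmt in (("little", "<I"), ("big", ">I")):
            (stored,) = struct.unpack_from(fmt, data, offset)
            if stored == expected:
                hits.append((offset, label))
    return hits

# Made-up record: a payload followed by its big-endian CRC-32.
payload = b"2006-10-03 id=42"
record = payload + struct.pack(">I", zlib.crc32(payload))
print(find_crc32_fields(record))
```

No hits does not mean there is no checksum, only that this particular guess was wrong.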
Good suggestions from people so far. Here's some stuff I look for too:
1) Sometimes in the first couple bytes you'll get a version number for the data format. If you see something that's of a low count, unrelated, and static then this could be what it is.
2) Many flexible formats store a field length first. So if I wanted to insert a 4 byte integer value of 10, a dump might look like: 0x00 0x04 0x00 0x00 0x00 0x0A. Keep an eye out for these because they greatly help decoding if you have them (i.e., even if you don't know what a field means, at least you know how long it is).
3) Strings with a 0x00 between each character are most likely UTF-16 encoded. Also, unless there's a field length, you'll probably find a trailing 0x00 (a null-terminated string).
Good luck!
posted by sbutler at 6:11 PM on October 3, 2006
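To try the length-prefix idea on real data, a short Python sketch that walks a buffer assuming every field is a 2-byte big-endian length followed by that many bytes, as in the 0x00 0x04 0x00 0x00 0x00 0x0A example above. The prefix size and byte order are guesses you would vary; the function name and the second sample field are made up.

```python
import struct

def parse_length_prefixed(data: bytes, prefix_fmt: str = ">H") -> list[tuple[int, bytes]]:
    """Split data into (offset, payload) fields, each preceded by a length prefix.

    Raises ValueError if the lengths do not line up with the end of the data,
    which usually means the guess about the layout is wrong."""
    prefix_size = struct.calcsize(prefix_fmt)
    fields = []
    pos = 0
    while pos + prefix_size <= len(data):
        (length,) = struct.unpack_from(prefix_fmt, data, pos)
        start = pos + prefix_size
        end = start + length
        if end > len(data):
            raise ValueError(f"field at offset {pos} claims {length} bytes, runs past the end")
        fields.append((pos, data[start:end]))
        pos = end
    if pos != len(data):
        raise ValueError(f"{len(data) - pos} trailing bytes after the last field")
    return fields

# The 4-byte integer 10 from the example, then a UTF-16 string with a 6-byte length.
dump = bytes([0x00, 0x04, 0x00, 0x00, 0x00, 0x0A]) + b"\x00\x06" + "abc".encode("utf-16-be")
fields = parse_length_prefixed(dump)
print(int.from_bytes(fields[0][1], "big"))
print(fields[1][1].decode("utf-16-be"))
```

If the parse reaches the end of the data exactly, that is decent evidence the layout guess is right. If it raises, try a 1-byte or 4-byte prefix, or little-endian ("<H").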
Sounds like it probably wouldn't be applicable to this particular project, but checking for magic numbers is always a good idea.
posted by BaxterG4 at 6:39 PM on October 3, 2006
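A magic-number check is just a comparison of the first few bytes against known signatures. A small Python sketch with a handful of well-known ones; the table is deliberately tiny and the function name is made up. A proprietary format won't be in any such table, but a match can reveal that the blob is really a wrapped or compressed standard format.

```python
from typing import Optional

# A few widely documented file signatures (nowhere near a complete list).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b": "gzip stream",
    b"%PDF-": "PDF document",
}

def identify(data: bytes) -> Optional[str]:
    """Return a description if data starts with a known signature, else None."""
    for magic, description in SIGNATURES.items():
        if data.startswith(magic):
            return description
    return None

print(identify(b"\x1f\x8b\x08\x00"))
print(identify(b"\x00\x01\x02\x03"))
```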
I'm thinking along the same lines as sbutler. A lot of data is encoded starting with a byte or word describing how many bytes the next data element is. That might be followed by another byte or word giving the size of the following data, and on to the end.
If your data is organized that way it should be easy to decompose it into discrete elements. Even better if there is some text included. If you view it in a hex editor, you should be able to pick out meaningful chunks and hopefully get a better feel for its internal organization.
posted by hwestiii at 7:06 PM on October 3, 2006
If you haven't already, try the *nix file command. It recognizes lots of common modern and legacy formats. (which may not help you here, but it's worth a shot, right?)
posted by chrisamiller at 7:21 PM on October 3, 2006
I tell ya what. Back in the olden days before massive international outsourcing and high-level script style coding, part of my work involved reverse engineering data and code from binary files. And on the very rare occasion, it still does.
Consider this alternate solution if you remain stuck after trying the hex editor and strings type tools, or you can't figure out an effective way to crack your particular nut. If your data isn't proprietary, feel free to send the file to me along with a bit of context, and I'll give it an hour or so of a professional lookover (inasmuch as any vestige of my talent remains). Completely free of debt or obligation. If not for the simple challenge, then just to prove I've not yet sunk into full-blown senility. No guarantees of satisfactory results, of course.
posted by mdevore at 8:13 PM on October 3, 2006
By proprietary, I, of course, mean unable to be redistributed to anyone under any circumstances. Not the proprietary of your original message. Apparently senility is already encroaching.
posted by mdevore at 8:17 PM on October 3, 2006
Yeah, post the file somewhere and let everyone have a crack at it.
posted by iconjack at 9:50 PM on October 3, 2006
On Windows, the utility 'fc' (file compare) is built in. It will do a binary compare with the /b switch. I'm sure there are better tools out there, but there may not be many better free ones. Certainly good enough for initial work.
You will need a hex editor.
posted by RikiTikiTavi at 10:48 PM on October 3, 2006
Thanks guys, these are some great suggestions for starting locations. I'll report back if I successfully find anything.
posted by authenticgeek at 3:06 PM on October 4, 2006
I had to do this a number of years ago to write a converter for an unknown binary format.
At least the last time I did this, it was incredibly useful to realize that binary formats may be encoded as chained block lists where each block is 256 or 512 bytes and the last byte is a pointer to the next block.
Also very useful to remember is that the output you're looking at may have the byte orders reversed, depending on the encoding.
od is your friend.
posted by Caviar at 6:22 PM on October 5, 2006
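If od or a hex editor isn't handy, a dump is only a few lines of Python. This sketch prints an offset column, the bytes in hex, and a printable-ASCII column; the layout is my own and is not meant to match od's output exactly. The sample record is made up.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format data as lines of: offset, hex bytes, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the text column lines up on a short last row.
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text}")
    return "\n".join(lines)

record = b"\x01\x00\x04\x00\x00\x00\x0aHELLO\x00\xd6\x07"
print(hexdump(record, width=8))
```

Narrowing the width to a suspected record or field size (8 here) often makes repeating structure line up vertically, which is hard to see in a fixed 16-byte dump.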
posted by Paris Hilton at 4:21 PM on October 3, 2006