Mixing Text and Data
November 8, 2015 6:57 AM
Are there any file / log formats that mix data and text in a readable, flexible way? Something like Markdown + binary blobs? Or JSON with interspersed strings (maybe as comments)? Do Jupyter notebooks have a more static equivalent? Question purposefully vague because I'm not finding much and so exploring wider possibilities.
I want to document an experimental, computational process (think genetic algorithms). It would be nice if my logs / output could also be parsed and data extracted for re-use later. Some of the structures are quite large, so compressed binary blobs might be useful. But code might change, so I guess I need to worry about versions and am wary of opaque serialisation. The more I think about it, the more complicated it gets. So looking for existing solutions / examples.
certainly relevant - i am looking for any idea, no matter how odd.
(i did think of xml - i think a few years ago i would even have used it without question - but it's not so readable and, at least for standard tools, well-formedness means that it's not great for an open-ended log where you're not sure what the end is, or if you're there yet).
posted by andrewcooke at 7:55 AM on November 8, 2015
Base64 encodes binary data to ASCII text. So you could compress your data with gzip, then base64-encode it, and then stick it in your json, perhaps alongside a version number or other metadata.
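A minimal sketch of that idea, assuming one JSON record per log line. The field names (`version`, `note`, `encoding`, `blob`) and the `FORMAT_VERSION` constant are made up for illustration, not part of any standard:

```python
import base64
import gzip
import json

FORMAT_VERSION = 1  # hypothetical tag; bump it when the payload layout changes


def encode_blob(data: bytes, note: str) -> str:
    """Return one JSON line holding readable text plus a compressed blob."""
    record = {
        "version": FORMAT_VERSION,
        "note": note,
        "encoding": "gzip+base64",
        "blob": base64.b64encode(gzip.compress(data)).decode("ascii"),
    }
    return json.dumps(record)


def decode_blob(line: str) -> bytes:
    """Recover the original bytes from a line written by encode_blob."""
    record = json.loads(line)
    if record["version"] != FORMAT_VERSION:
        raise ValueError(f"unsupported version: {record['version']}")
    return gzip.decompress(base64.b64decode(record["blob"]))


original = bytes(range(256)) * 100
line = encode_blob(original, "generation 12 population")
assert decode_blob(line) == original
```

Base64 adds roughly a third to the compressed size, so this probably suits small or medium blobs better than very large ones.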
posted by rustcrumb at 7:56 AM on November 8, 2015
I think you're confounding a few things.
One is how to include binary data in a text format like XML or JSON or whatever. There's lots of solutions for that, none awesome. For JSON or XML you typically end up with base64 encoding if you insist on embedding the binary data in the text file. Often it's better to put the binary data in a separate file and link to it / refer to it in the text.
The second is the usability of having a 3MB blob stuffed in the middle of 10k of important text. Obviously you don't want to render out a hex dump of the blob! The usual solution for that is to somehow elide the binary blob, just indicate it's there. Again, externally linking the binary data usually results in a more usable text file.
The final thing you want to do is make some reproducible collection of code + data. That's admirable! Again, I think the best solution here is some external storage of the binary data with links. On top of that you want some versioning to tie it all together. A git repo is a good way to do this, or else something ad-hoc involving version numbers or checksums on the binary data. git is not great at versioning binary data, but for < 100MB it's fine. Beyond that you should consider something like LFS or Dat.
Long story short, I think Jupyter notebook users do solve this problem all the time. This gallery of notebooks should give you some inspiration. Here's an example I did. Note cell 2: my dataset there is referenced externally, from a URL. I didn't bother trying to version that data. To make this truly reproducible I should make some guarantee that data URL is permanently archived. Or better yet, include it in a git repo along with the Notebook.
It's worth noting that a Notebook viewer like this isn't running live code. That web page is effectively static HTML. (And note the images; they are binary blobs included in JSON in the Notebook format.) There's an implication with a Notebook that someone else could download the code and run it and get the same results, but the viewer format isn't actually running the code. And reproducibility does require archived data. Along with archived versions of third party libraries, for that matter.
posted by Nelson at 8:12 AM on November 8, 2015
You might look into formats that package together multiple files. That way the set is easy to pass around and keep separate from others, but the individual data is more easily accessible. There appear to be about thirty different file formats based on ZIP archives. An .epub document is just a zip of HTML, image, data, etc. files. OpenOffice documents are similar, using XML as a base. On OS X a "package" is really a folder (of arbitrary content) that is presented as a single file. There are probably others, but those are the two that jump out at me. (The way we've always done something similar is to use folders and a good naming convention. It's not ideal, but it beats needing special tools to read or extract data from them.)
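As a rough sketch of the ZIP-based approach, Python's standard `zipfile` module can bundle readable notes, metadata and raw blobs into one archive. The layout used here (`notes.txt`, `metadata.json`, a `data/` folder) and the function names are assumptions for illustration only:

```python
import io
import json
import zipfile


def write_bundle(target, notes: str, metadata: dict, blobs: dict) -> None:
    """Pack readable notes, JSON metadata and raw blobs into one ZIP archive."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("notes.txt", notes)
        bundle.writestr("metadata.json", json.dumps(metadata, indent=2))
        for name, data in blobs.items():
            bundle.writestr(f"data/{name}", data)


def read_blob(source, name: str) -> bytes:
    """Pull a single blob back out without unpacking the whole archive."""
    with zipfile.ZipFile(source) as bundle:
        return bundle.read(f"data/{name}")


buffer = io.BytesIO()  # in-memory for the example; a file path works as well
write_bundle(buffer, "Run 1 notes", {"version": 1}, {"population.bin": b"\x00\x01"})
assert read_blob(buffer, "population.bin") == b"\x00\x01"
```

Any ordinary unzip tool can open the result, which keeps the text readable without special software.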
posted by Ookseer at 10:59 AM on November 8, 2015
Maybe you can just store your log as HTML, with link tags for the binary blobs.
It's easy to generate and easy to browse on any device.
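A small sketch of what generating such a log might look like, assuming each entry is one paragraph and each blob lives in its own file. The function name `html_entry` and the `[data]` link text are invented for the example:

```python
import html


def html_entry(text: str, blob_path: str = "") -> str:
    """Return one HTML paragraph, optionally linking to an external blob file."""
    entry = f"<p>{html.escape(text)}"
    if blob_path:
        href = html.escape(blob_path, quote=True)
        entry += f' <a href="{href}">[data]</a>'
    return entry + "</p>\n"


print(html_entry("generation 3: fitness < 0.5", "blobs/pop3.bin"), end="")
```

Each entry is a complete paragraph on its own, so lines can be appended to the log file as the run progresses.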
posted by mbrock at 1:36 PM on November 8, 2015
You might consider referencing your binaries using a binary repository, like artifactory, instead of shoving the content in the json.
posted by rockindata at 1:58 PM on November 8, 2015
thanks! i'll think those over some.
posted by andrewcooke at 5:12 PM on November 8, 2015
what i eventually ended up doing, and which worked ok, was separating the text and data, and using hashes in the text.
so the data were saved in one log file / database / table, indexed by their hash. and then referred to in the human readable logs by their hash. where the hash was, iirc, the first 8 bytes of an sha1 hash, printed as hex.
that, plus some simple functions / regexps let me parse the logs, retrieve appropriate data, and repeat interesting parts of the experiment.
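a rough sketch of how that scheme could look, not the original code. the sqlite table, the `[data:...]` marker syntax and the function names are all assumptions made for the example; only the hash (first 8 bytes of sha1, as hex) follows the description above.

```python
import hashlib
import re
import sqlite3

# Hypothetical marker syntax for a blob reference inside the readable log.
REF = re.compile(r"\[data:([0-9a-f]{16})\]")


def short_hash(data: bytes) -> str:
    """First 8 bytes of the SHA-1 digest, printed as 16 hex characters."""
    return hashlib.sha1(data).digest()[:8].hex()


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the blob table; in-memory by default for the example."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS blobs (hash TEXT PRIMARY KEY, data BLOB)")
    return db


def save(db: sqlite3.Connection, data: bytes) -> str:
    """Store data under its hash and return the marker to write in the log."""
    key = short_hash(data)
    db.execute("INSERT OR IGNORE INTO blobs VALUES (?, ?)", (key, data))
    db.commit()
    return f"[data:{key}]"


def load_all(db: sqlite3.Connection, log_text: str) -> dict:
    """Find every marker in the log text and fetch the matching blob."""
    found = {}
    for key in REF.findall(log_text):
        row = db.execute("SELECT data FROM blobs WHERE hash = ?", (key,)).fetchone()
        if row is not None:
            found[key] = bytes(row[0])
    return found


db = open_store()
log = "generation 7: best individual " + save(db, b"\x01\x02\x03") + "\n"
print(load_all(db, log))
```

saving the same data twice gives the same marker, so repeated structures are stored only once.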
posted by andrewcooke at 5:06 AM on December 10, 2015
This thread is closed to new comments.
Would an XML-based solution work? Our organization created an XML-based format dedicated to our results, where the data being returned is defined by the DTD, so you can cover changes in the format with updated / documented DTDs which addresses the opaqueness I think.
Examples of this usage can be found here.
posted by SquidLips at 7:45 AM on November 8, 2015