Archiving lots of data files?
November 17, 2011 4:38 PM

Professional help on archiving a large number of data files with metadata.

I'll be accumulating about 5000 raw data files, each of which will have a good chunk of metadata associated with it. Unfortunately, binding the metadata directly to the data file is not feasible, so it's going to have to exist independently.

Right now the plan is to put a read-only version of each data file in a directory, accompanied by a copy of its metadata file under document control. A central source will collate all of the metadata files to make a searchable index.
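
As a rough illustration of the collation step, I'm picturing something along these lines (the directory layout, metadata format, and field names are all placeholders; nothing is decided yet):

import json
from pathlib import Path

# Walk the archive, read each metadata file, and build one searchable index.
# Assumes metadata sits as JSON next to each read-only data file; both the
# layout and the field names are placeholders.
ARCHIVE_ROOT = Path("/archive")

index = []
for meta_path in ARCHIVE_ROOT.rglob("*.meta.json"):
    with open(meta_path) as f:
        record = json.load(f)
    record["metadata_file"] = str(meta_path)
    index.append(record)

# Dump the collated index somewhere a search tool (or a database loader) can use it.
with open(ARCHIVE_ROOT / "index.json", "w") as f:
    json.dump(index, f, indent=2)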

I know I'm not the first to face this particular problem. Has anyone else here done it? Is my current plan going to deliver a world of hurt later?

Thanks.
posted by Tell Me No Lies to Technology (12 answers total) 3 users marked this as a favorite
 
Response by poster: Forgot to add: if anyone knows of an existing database solution that takes on this problem, please do suggest it.
posted by Tell Me No Lies at 4:47 PM on November 17, 2011


Too bad binding it is not feasible (why is that, incidentally?), because I'd suggest merging them into an HDF- (or even XML-) based file format. Otherwise, I dunno, relational database?
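
For example, something roughly like this with h5py (just a sketch; the dataset and attribute names are made up):

import h5py
import numpy as np

# Bundle one raw data file and its metadata into a single HDF5 container.
# Dataset and attribute names here are illustrative, not a proposed schema.
with open("capture_0001.dat", "rb") as f:
    raw = np.frombuffer(f.read(), dtype=np.uint8)

with h5py.File("capture_0001.h5", "w") as h5:
    dset = h5.create_dataset("raw_bytes", data=raw)
    dset.attrs["collected_by"] = "lab workstation 3"
    dset.attrs["notes"] = "example metadata attribute"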
posted by zomg at 4:48 PM on November 17, 2011


Response by poster: Too bad binding it is not feasible (why is that, incidentally?).

A lot of programs will be reading the data files directly, and upgrading them all to recognize a new file format will be problematic.
posted by Tell Me No Lies at 4:52 PM on November 17, 2011


HDF5
posted by Blazecock Pileon at 5:03 PM on November 17, 2011


This might be really naive, but what about storing the metadata in the document properties information, and using Spotlight on a Mac (or something similar on another OS) to search the folder for stuff in the properties field when necessary?
posted by lollusc at 5:03 PM on November 17, 2011


But, basically, zomg's suggestion of a relational database would work. Keep the data on the filesystem, and store the metadata in the database. Each metadata record points to the location on the file system where the original file is kept.
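
Schematically, something like this (SQLite just for illustration; the table and column names are guesses):

import sqlite3

# Metadata lives in the database; each row points at the read-only file on disk.
# Table and column names are illustrative only.
con = sqlite3.connect("archive.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS data_files (
        id        INTEGER PRIMARY KEY,
        file_path TEXT NOT NULL UNIQUE,   -- location of the original file
        checksum  TEXT,                   -- integrity check value
        notes     TEXT
    )
""")
con.execute(
    "INSERT OR IGNORE INTO data_files (file_path, checksum, notes) VALUES (?, ?, ?)",
    ("/archive/data/capture_0001.dat", "placeholder-checksum", "example row"),
)
con.commit()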
posted by Blazecock Pileon at 5:06 PM on November 17, 2011


Best answer: Yes, it is going to deliver a world of hurt. I actually work on a software platform that does exactly what you want (and much more); unfortunately, it is aimed at the enterprise market and starts in the six digits. The solution we build is based on a relational database, combined with a few million lines of code and some very heavy-duty hardware arrays. While I am not going to try to sell you on that, if this is a project of real importance and you want it to be around for a while, I would seek a professional solution. I have worked on private cloud and archiving software for years, and I would not attempt to do this myself for a serious project. The reason our product exists, and does very well, is that companies inevitably come to the same conclusion, often after investing significant effort trying to do it themselves. When you are thinking about archiving, you have to think about long-term solutions, and there are a lot of non-trivial problems in this space that have already been solved.

One possible approach to the file metadata problem is to create a mirror filesystem. So, instead of writing your file to /some_directory/filename, you write it to /data/some_directory/filename, and then keep the metadata under /metadata/some_directory/filename (there's a quick sketch of this layout at the end of this comment). This keeps the metadata cleanly separated from the actual data while preserving the directory structure. Honestly, though, I can tell you right now that this is not going to end well. At the very least, seek help on Stack Overflow and explore the open source solutions in this space. If you try to build a naive solution from scratch, you are going to regret it.
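
Here is that sketch of the mirror layout (the directory names and the ".meta" suffix are just placeholders):

from pathlib import Path

# Derive the mirror-tree metadata path for a given data file:
#   /data/<subdir>/<name>  ->  /metadata/<subdir>/<name>.meta
DATA_ROOT = Path("/data")
META_ROOT = Path("/metadata")

def metadata_path_for(data_path: Path) -> Path:
    relative = data_path.relative_to(DATA_ROOT)
    return META_ROOT / relative.parent / (relative.name + ".meta")

data_file = DATA_ROOT / "some_directory" / "filename"
meta_file = metadata_path_for(data_file)
meta_file.parent.mkdir(parents=True, exist_ok=True)  # keep the two trees in sync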
posted by sophist at 5:44 PM on November 17, 2011


Just some things to think about...

Data Integrity: How will you ensure that the data you have stored has not become corrupt? Archiving typically implies storage for the long haul, which means disks will die and filesystems can become corrupted. If you are using a database, you can hash the data and store that value in the table, then periodically check to make sure the data is still intact. What if it isn't, though? Then you need to keep another copy for backup. Where do you keep the copy: on separate drives in the same machine, or at another site? How do you keep track of where the copies are? How do you restore from a backup? How do you keep track of the copies in the database? What if the database itself becomes corrupted? You might not think these concerns are realistic. Will you (or your successor) still feel that way in 10 years? In 20?
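
For the hashing piece, the idea is roughly this (a sketch; SHA-256 is just one reasonable choice):

import hashlib
from pathlib import Path

def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large captures don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Store the digest alongside the metadata when the file is archived, then
# periodically recompute it and compare to detect silent corruption.
stored = file_digest(Path("/archive/data/capture_0001.dat"))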

Data Access & Scale: How will the files be accessed? Again, archives are intended to be around for a long time, and protocols and standards can shift. Your organization's infrastructure might change as well. Are you building everything around a network-mounted filesystem? Does it support NFS, or CIFS, or both? Which versions? Do you need to deal with authentication? How many mounts can your machine support? I would recommend serving the files over HTTP if possible, but that adds another layer of complexity, such as running a web server. You say you only have 5000 files. Will there be more in the future? Will your database design grow to support them? How will it perform if many people are accessing the files, searching, or writing at the same time?
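
As a trivial illustration of the HTTP route (Python stdlib only; the directory and port are placeholders, and a real deployment would want a proper web server and authentication in front):

from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Read-only HTTP access to the archive tree; no authentication, so this is
# only a starting point, not a production setup.
handler = partial(SimpleHTTPRequestHandler, directory="/archive/data")
ThreadingHTTPServer(("0.0.0.0", 8080), handler).serve_forever()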

Read Only: You say the files will be stored read only. Does that mean they can never be deleted? What about by the root/admin? Are there compliance issues involved?
posted by sophist at 6:33 PM on November 17, 2011 [1 favorite]


Stick it all in a relational DB. Yes, the files too, in a binary column – if your DB can't handle this, switch to PostgreSQL. This gives you indexable, searchable metadata, and perfect data integrity.
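
Roughly like this with psycopg2 (the table and columns are hypothetical):

import psycopg2

# Store the raw file itself in a bytea column next to its metadata.
# Table and column names are made up for illustration.
conn = psycopg2.connect("dbname=archive")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS captures (
            id       serial PRIMARY KEY,
            name     text UNIQUE NOT NULL,
            notes    text,
            contents bytea NOT NULL
        )
    """)
    with open("capture_0001.dat", "rb") as f:
        cur.execute(
            "INSERT INTO captures (name, notes, contents) VALUES (%s, %s, %s)",
            ("capture_0001.dat", "example row", psycopg2.Binary(f.read())),
        )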
posted by nicwolff at 7:52 PM on November 17, 2011


Can you elaborate a little bit on what the data portion is? What kind of files are you working with? Given that info there may be some more (or fewer) options. Bonus points if you can describe some of the key metadata fields. Extra extra bonus if you can explain how the metadata will be utilized to search and organize the data.
posted by dgran at 7:30 AM on November 18, 2011


Response by poster: The files are pcaps (captures of intercepted IPv4 and IPv6 traffic) generally collected at the client end of network protocol sessions. Each pcap is ideally intended to contain solely the packets for an individual application.

pcaps can exist in three states (raw, cooked, and pristine) based on the amount of post-processing that has been done to whittle them down to their ideal state.

Some of the metadata is:

Permanent data file location
Permanent data file cksum
OS
Setup Notes
tags

Evaluation Rules version
Total bytes
Total flows
Total unknown bytes
Total unknown flows

App1 client app
App1 client version
App1 total bytes
App1 total flow

App2 ...


The data will be used by different people in different ways. QA will be looking for pcaps of applications they want to test. Development and Marketing will be tracking the percentage of flows identified. Research will be tracking the (OS, Application, Version) tuples to see what areas we're weak in.
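
For example, the flows-identified number that Development and Marketing care about would come out of a query along these lines, assuming the collated metadata ends up in a SQL table (schema and column names are tentative):

import sqlite3

# Example of the kind of query the metadata index needs to support:
# overall percentage of flows identified. Schema is tentative.
con = sqlite3.connect("archive.db")
row = con.execute("""
    SELECT 100.0 * (SUM(total_flows) - SUM(total_unknown_flows)) / SUM(total_flows)
    FROM pcap_metadata
""").fetchone()
print(f"flows identified: {row[0]:.1f}%")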
posted by Tell Me No Lies at 11:28 AM on November 18, 2011


I would second the suggestion by nicwolff. Store each pcap in one database column, using the other columns for the additional metadata. You will likely need to write a script to parse the files and load each record, but it will give you the most versatility.
posted by dgran at 11:37 AM on November 18, 2011


This thread is closed to new comments.