What storage solution am I looking for?
April 9, 2014 8:04 AM

This is for my lab. Setup: two computers, both Mac Pros. Each runs the same psychology experiment software. (Python-based, FWIW.) Our subjects have multiple sessions, and may use one computer or the other on subsequent days. We have multiple experiments running at once. We use Unison to keep their data directories synchronized (it's better than running two rsync cron jobs) but it's starting to get temperamental. I know there must be a better solution, and I feel like it's hardware based.

The basics:
  • Keeping people on one computer or the other is a non-starter. (They're tied to EM-shielded soundproof booths, and scheduling would be a nightmare.)
  • Participants come in once a day at most.
  • They have their own data directories; hierarchy is experiment/participant/session.
My ideal solution would be an eSATA RAID drive that the computers share, so it's mounted on both like a normal internal drive. I see issues, which may seem unfounded to someone who knows this technology better:
  • Does this type of drive even have multiple eSATA ports? Most are NAS, and I'm worried about speed because...
  • Millisecond-level timing is key in these experiments, so increasing delay in disk access would be very bad. Therefore...
  • I don't want to use USB interfaces and...
  • I don't want to go through the building's network. The switches are 100 Mbps. Can I set up two LANs in OS X (10.6, unfortunately...): one just for this shared storage and one for actual networking?
  • These computers still need access to the building's LAN; they sync to our cluster overnight.*
  • We may be adding a third computer within the lifespan of whatever solution we get now.
  • We'll never be running the same participant on both computers at once, but I'm still leery about conflicts or trying to write to the same file.
* I know that I could rsync each testing computer to the cluster overnight, and then sync back down to each one. This seems like even more of a pain. The version on the cluster always needs to be the canonical copy. Also, it's in a different part of the building, so LAN is an issue again.

Something better that I'm not thinking of? Price isn't really a factor, but it's taxpayer money so I need to keep it reasonable. Can I solve this for less than $1k? (Ok for HDDs to be extra.)
posted by supercres to Computers & Internet (9 answers total) 5 users marked this as a favorite
Best answer: If you need latency lower than 500 ms to provide access to a common file share, then a SAN (not NAS, but SAN) is what you need, though it will be pricey (well beyond $1k).

Consider solving your problem in a different way. Instead of synchronizing files on one drive, have the data collection procedure on both systems record identifiers (say, timestamps, test subject IDs, and machine IDs) alongside the other results.

These identifiers can be pushed to a database on a third system, so that test subjects can use either machine and you can still collate test results for a given subject and test, regardless of which of the two (or both) machines have been used.

Additionally, if necessary, you can have your Python scripts act as database clients, pulling previous test results from the database, if the subject is continuing a test in progress.
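A minimal sketch of the idea above, in Python since the experiment software is Python-based. The table layout, column names, and the helper names `open_db`, `record_session`, and `next_session` are all hypothetical, and `sqlite3` stands in here for whatever database the third system would actually run. The uniqueness constraint is one way a repeated session could be made to fail loudly instead of silently.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per finished session, keyed the same way as
# the experiment/participant/session directory hierarchy.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    experiment  TEXT    NOT NULL,
    participant TEXT    NOT NULL,
    session     INTEGER NOT NULL,
    machine_id  TEXT    NOT NULL,
    finished_at TEXT    NOT NULL,
    UNIQUE (experiment, participant, session)
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_session(conn, experiment, participant, session, machine_id):
    """Store one finished session; raises sqlite3.IntegrityError on a repeat."""
    finished_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            "INSERT INTO sessions VALUES (?, ?, ?, ?, ?)",
            (experiment, participant, session, machine_id, finished_at),
        )


def next_session(conn, experiment, participant):
    """Return the session number this participant should run next."""
    row = conn.execute(
        "SELECT MAX(session) FROM sessions "
        "WHERE experiment = ? AND participant = ?",
        (experiment, participant),
    ).fetchone()
    return (row[0] or 0) + 1
```

Either machine could call `next_session` before starting and `record_session` when the subject finishes, so the machine a subject used on a given day stops mattering.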

Providing shared access to one volume and without using a slow NAS is going to be expensive (SAN). Consider rethinking how you are managing data to remove this dependency.
posted by Blazecock Pileon at 10:58 AM on April 9, 2014 [1 favorite]

Seconding a database. If I were building this, I'd collect the data on disk in files during the experiment, then after the subject finishes a session, write the data to a table in a database, so the time it takes to store the results doesn't matter. I'd probably save the local data files in case I needed to reconstruct the DB, and disk space is cheap.
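Roughly what that after-the-session step could look like, as a sketch only. It assumes a hypothetical log format of one tab-separated `timestamp<TAB>event` pair per line, and the `events` table and the function name `import_session_log` are made up for illustration; the real log files will look different.

```python
import csv
import sqlite3


def import_session_log(conn, log_path, experiment, participant, session):
    """Copy one finished session's log file into the database.

    Runs after the session is over, so how long it takes does not matter.
    Returns the number of rows stored.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "experiment TEXT, participant TEXT, session INTEGER, "
        "timestamp_ms REAL, event TEXT)"
    )
    # Assumed format: exactly two tab-separated fields per line.
    with open(log_path, newline="") as f:
        rows = [
            (experiment, participant, session, float(ts), event)
            for ts, event in csv.reader(f, delimiter="\t")
        ]
    with conn:  # one transaction: the whole session is stored, or none of it
        conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)
```

The local log file is left untouched, so it remains available for rebuilding the DB.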

I'm currently working on a Python-based system that extracts data from Google Analytics and stores it locally for reporting purposes. Some of the archive files are sqlite database files [library built into Python, file-based, no server required]. I can imagine easily extending this to copy the resulting sqlite DB files to another machine so both machines had access to the historical archive and a guy could pick up a session on either machine.
posted by chazlarson at 11:23 AM on April 9, 2014

Response by poster: Thanks. A software solution has definitely been on our minds, but there's entropy of conflicts that creeps in. It needs to be very reliable and very easy to fix (I probably should have mentioned that).

I'm actually running some experiments to see how much latency would be a problem. The software would still run locally, it would just write data to a mounted network drive. And not even a whole lot of it; mostly mono wav files and text log files. The timestamps should already be in memory. I just worry that the experiment library relies for some arcane reason on low-latency read/write, which is entirely possible. (We also don't want to mess with the software too much, since these are ongoing multi-year experiments.)
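For anyone curious, the kind of check I mean could be sketched like this. It is not part of the experiment library; `time_writes` and `summarize` are names made up here, the payload size and write count are arbitrary, and it assumes Python 3 (on an older interpreter `time.time()` would have to stand in for `time.perf_counter()`). Running it once against a local directory and once against the mounted network drive gives two sets of numbers to compare.

```python
import os
import statistics
import tempfile
import time


def time_writes(directory, n_writes=200, payload_size=4096):
    """Return per-write latencies (ms) for small appends to a file in `directory`."""
    payload = b"x" * payload_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, suffix=".latency_test")
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_writes):
                start = time.perf_counter()
                f.write(payload)
                f.flush()  # hands the data to the OS; does not force it to disk
                latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.remove(path)  # clean up the scratch file
    return latencies


def summarize(latencies):
    """Median, 99th-percentile, and worst-case latency in milliseconds."""
    ordered = sorted(latencies)
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[int(0.99 * (len(ordered) - 1))],
        "max_ms": ordered[-1],
    }
```

The worst-case figure is probably the one to watch, since a single slow write in the middle of a trial matters more than the average.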

SAN was the term I was missing, BP. Thanks!
posted by supercres at 12:37 PM on April 9, 2014

Response by poster: To give you an idea of how easy it needs to be to fix: there are only a couple people in the lab (me included) who know how to run the rsync to copy over directories if Unison fails.
posted by supercres at 12:38 PM on April 9, 2014

Response by poster: Possibly unimportant last comment, this time about why after-the-fact syncs aren't great: conflicts are basically the worst thing that could happen. Say someone runs session 1 on computer 1, then we try to run their session 2 on computer 2. For whatever reason (cron issue, Unison issue/conflict, network issue, hardware issue) the sync doesn't work. Then they end up running session 1 again on computer 2.

It's very hard to sort out after the fact what to do about that, both scientifically and organizationally. And they run 20+ sessions. So failures need to be extremely rare and LOUD.

We often don't find out until someone says, halfway through, "I think I ran this session already..." At that point all we can do is rename one data folder as a backup, rsync over the other one, and start them again. And if we don't rsync -a, more conflicts.
posted by supercres at 12:51 PM on April 9, 2014

Best answer: Hard disks have terrible latency (many orders of magnitude more than most things in computing, including the network), but software is built to hide this. The disk latency shouldn't affect the experiment. If the Python code doesn't contain things like fsync calls (it probably doesn't), disk latency is irrelevant. To address the rest of the question, I recommend a NAS. You certainly don't want or need a SAN; these are very expensive beasts and the difference is irrelevant to your usage.
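To illustrate the distinction, here is a sketch of the two write patterns. The function names are made up, and this is not taken from the experiment software. The first pattern is what most logging code does; the second is the kind of thing to look for in the source. Behavior on network mounts varies (some flush to the server on close), so treat this as a rough picture rather than a guarantee.

```python
import os


def buffered_write(path, data):
    """Typical pattern: returns once the data has been handed to the OS.

    The OS writes it out to the device later, in the background.
    """
    with open(path, "ab") as f:
        f.write(data)


def durable_write(path, data):
    """fsync pattern: asks the OS to push the data to storage before returning.

    This is the variant whose timing depends directly on device latency.
    """
    with open(path, "ab") as f:
        f.write(data)
        f.flush()             # Python's buffer -> OS
        os.fsync(f.fileno())  # OS cache -> storage device
```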
posted by Tobu at 1:11 PM on April 9, 2014 [1 favorite]

> We often don't find out until someone says, halfway through, "I think I ran this session already..."

You probably should think hard about a database and clients that interact with the database. Timestamps and other keys will help you avoid this scenario. Consider looking into FileMaker Pro or Access to get something a bit simpler to set up and manage.
posted by Blazecock Pileon at 1:13 PM on April 9, 2014

Response by poster: It's all back-end for our experiment libraries. There's a "state" that gets loaded when they start, which contains their place in the experiment (i.e., session number), among other things. But if the state doesn't sync... (This is the software, btw. Methods paper here (PDF).)

It's fairly robust, and it just started getting conflict-y. I'm just trying to end the yearly Spring cleaning it seems to need, considering our use pattern.

Think NAS is going to be the thing to do. Just need to run some diagnostics first.

Thanks all!
posted by supercres at 1:51 PM on April 9, 2014

Best answer: Ah, this isn't software you wrote. There you go.

On this:
> why after-the-fact syncs aren't great

The thing is that your problem state really only comes up [in the "sync-it-after-they're-done" model] if someone saves state in one room, then runs over to the other room and starts again.

I think if the cron/unison/etc. situation is getting glitchy for some unknown reason, adding a NAS into the mix is just going to make things worse, not better.

All the rsync and cron stuff can be wrapped in error handling to tell anybody what went wrong, and retry if appropriate. No one needs to understand how to run rsync. No one should be running rsync. There should be a "do_all_the_things.sh" script that anyone can double-click that spits out a "Hey, everything's awesome!" or "Whoops! Couldn't find the backup location!" that no one can miss.
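A rough sketch of such a wrapper, written in Python rather than shell to match the experiment software. The function names, retry count, and delay are assumptions, and the command passed in is whatever sync command the lab already uses. It requires Python 3.7 or later for `capture_output`.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=3, delay_s=5.0):
    """Run `cmd` up to `attempts` times; return (ok, human-readable message)."""
    last_error = ""
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except OSError as exc:  # e.g. the command itself is not installed
            last_error = str(exc)
        else:
            if result.returncode == 0:
                return True, "Hey, everything's awesome!"
            last_error = result.stderr.strip() or "exit code %d" % result.returncode
        if attempt < attempts:
            time.sleep(delay_s)  # give a flaky network a moment before retrying
    return False, "Whoops! Sync failed after %d attempts: %s" % (attempts, last_error)


def main(cmd):
    """Print the outcome where nobody can miss it and return an exit status."""
    ok, message = run_with_retries(cmd)
    print(message, file=sys.stdout if ok else sys.stderr)
    return 0 if ok else 1
```

Calling `main` with the existing rsync or Unison command line as a list of arguments, and passing its return value to `sys.exit`, would let cron or a double-clickable launcher pick up the failure as well.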
posted by chazlarson at 5:51 PM on April 9, 2014
