Shipping large amounts of data to a lot of customers?
September 30, 2005 3:08 PM

I need to get 40-60 gigs of data to about 200 customers a month. Currently I am using DVDs and am finding it a total hassle. I am considering hard drives but I think they will be more or less an equivalent hassle. Is there some other way? Downloads seem a little unfeasible due to the size but I'd be interested in hearing anything.
posted by xmutex to Computers & Internet (19 answers total)
I don't understand if you have to move 40-60 gigs total (about 200-300 megs per client) or 40-60 gigs for each client. Anyway, if your budget allows for hard drives, I think the hassle will be negligible compared to burning this quantity of DVDs. Besides the fact that hard drives can be returned and reused, depending on your budget the whole process can be almost completely automated.
posted by nkyad at 3:33 PM on September 30, 2005


Sorry, I don't understand if you have to move...
posted by nkyad at 3:33 PM on September 30, 2005


Rev disks would work. You would have to get your customers to install rev drives though.
posted by ryanissuper at 3:35 PM on September 30, 2005


Clarification: 40-60 gigs per client per month.
posted by xmutex at 3:38 PM on September 30, 2005


First of all, I have the same question as nkyad.

Second, is this 40-60 gigs/month of unique data? Or is it a dataset 40 to 60 gigs in size, which has a smaller set of monthly diffs that could be extracted, shipped, and applied via a script at your customer sites? Big difference in the feasibility of a network-based solution, so you should investigate this thoroughly, if you don't know.

Third, are you willing to employ third parties for the data duplication task? If so, can you and your customers accept a different media format than DVD? If so, tape duplication may be your easiest and most cost effective strategy. You cut a tape, FedEx it, and 24 hours later, your customers have copies, and you have your original back along with an invoice.

Of course, you can use a duplication service with your DVD media, but if I were your customer, and you were sending me 60 gigs a month on DVD, we'd already be talking about alternatives....
posted by paulsc at 3:42 PM on September 30, 2005


I take it tapes are out as an option?

It seems like there should exist some way to do what you're looking to do fairly easily with external USB 2.0 HDs and avoid a duplicating machine. Plug in a long set of them and just one or two machines work through replicating an image/structure on all of them. Even if the program doesn't exist it would be trivial to write.
posted by rudyfink at 3:45 PM on September 30, 2005


I was just about to mention USB hard drives. You can get one for probably under $200. Have them send them back as you send a new one out (so two HDs for probably a little under $400 per customer) and you're set. You'll probably save money once you count the time, hassle, and wear and tear of DVD-Rs.

Downloads wouldn't be totally unfeasible. I can get at least 400KB/s on my home connection; any decent business should have something equivalent. Get a couple of large connections and a few servers and you should be able to pull it off. It would probably take 2 days to download, but unless you're overnighting, that's about the same amount of time.
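
A quick back-of-the-envelope check of that estimate, as a rough sketch. It assumes a sustained 400 KB/s with no protocol overhead and decimal units (1 GB = 1,000,000 KB); the function name is made up for illustration.

```python
def download_days(size_gb: float, rate_kb_per_s: float) -> float:
    """Days needed to download size_gb at a sustained rate_kb_per_s."""
    seconds = (size_gb * 1_000_000) / rate_kb_per_s  # 1 GB = 1,000,000 KB (decimal)
    return seconds / 86_400  # 86,400 seconds per day


for size in (40, 60):
    print(f"{size} GB at 400 KB/s: {download_days(size, 400):.1f} days")
```

Under those assumptions, 40 GB takes a bit over a day and 60 GB a bit under two, so "about 2 days" is the right ballpark for the upper end.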
posted by geoff. at 3:54 PM on September 30, 2005


The standard way is tape -- there are plenty of tape drives that can handle that load easily, and it has become the standard way to send massive amounts of experimental data around the world. SETI@home ships a DLT every day, with about 40GB worth of data.

Tape in cases is also nicely transportable.

The downside -- high capacity tape drives aren't cheap, and you need to have one on each end -- but when you're moving 8-12 Terabytes a month around, tape starts looking like the right answer. The cheapest that you'd be able to work with is DAT72 (which should be DDS5, such is marketing) which holds 72GB, and spools about 20GB an hour.

If that's not enough, I'd skip DLT in favor of LTO, though, which is faster and holds more -- the lower end LTO-200 holds 200GB, and writes out about 50GB an hour. If time to write is an issue, I'd look at AIT (the base tape holds 100GB, runs at 80GB/hr) rather than the higher end LTO drives -- not that they aren't awesome (the top end tapes hold 800GB, the top end drives write them at 500GB/hr!) but the cost may well kill you.

I suspect a real issue is time to create media, but I'm thinking that DAT72 is good enough, though LTO or AIT give you room if your datasets are growing, and whose aren't? DATs are the cheapest tapes to buy, and the smallest physically, thus, they're easier to deal with, and the hardware is also the cheapest.
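
To put rough numbers on the time-to-create-media question, here is a small sketch using only the write speeds quoted in this post (they are not checked against vendor specs). It assumes each of the 200 clients gets a unique 60GB tape, that the drive streams at full speed, and it ignores load, rewind, and verify time. The names are illustrative.

```python
# Write speeds in GB/hour as quoted in this post (not vendor specs).
DRIVE_GB_PER_HOUR = {"DAT72": 20, "LTO": 50, "AIT": 80}


def drive_hours_per_month(gb_per_client: float, clients: int, gb_per_hour: float) -> float:
    """Total hours of writing on one drive if every client gets a unique tape."""
    return gb_per_client * clients / gb_per_hour


for name, speed in DRIVE_GB_PER_HOUR.items():
    hours = drive_hours_per_month(60, 200, speed)
    print(f"{name}: {hours:.0f} drive-hours/month")
```

A 30-day month has 720 hours, so on these figures a single DAT72 drive would be writing almost around the clock (600 hours), while the faster drives leave real headroom (240 and 150 hours).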

With tape libraries, you can automate the tape loading, and just have a machine generate tapes -- more capital cost, to be sure, but letting the machine juggle 15 tapes means you spend less time swapping them.

(Aside: if I EAT TAPES is visiting, you might have problems.)

Finally, there are two interfaces worth mentioning for these tape drives: SCSI and Fibre Channel. If you already have the latter, use it if you at all can (and I suspect you do, if you're generating that sort of data in a month). Don't think that USB can handle this sort of data.

The other trick to tapes is making sure the tape never stops while writing -- that really slows down the data rate, as the tape repositions. You want at least U-160 on a PCI-X slot, better is U-320 or FC, better still is that on a PCI-E 8X slot, a bunch of RAM in the machine running the tapes, and you might want to think about doing disk to disk to tape, putting some very fast SCSI or FC drives in the server running the tapes.

I only mention this because if 40-60GB * 200 clients is your problem now, I don't see it getting easier in the future.

Other than tape, the only other way would be dedicated point to point links -- a T-1 could handle the load, if you can stream it; if you need to batch it, an OC-3 would easily cope. If you look at the cost, though, tape will be way cheaper.
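
For a sense of scale, a sketch of how much data each kind of link could carry in a month if kept fully saturated. It uses the nominal line rates (1.544 Mbit/s for a T-1, 155.52 Mbit/s for an OC-3), decimal gigabytes, and assumes no framing or protocol overhead, so real throughput would be somewhat lower.

```python
# Nominal line rates in bits per second.
T1_BPS = 1_544_000
OC3_BPS = 155_520_000


def gb_per_month(line_rate_bps: int, days: int = 30) -> float:
    """Decimal gigabytes a link can carry in `days` if kept fully saturated."""
    bytes_per_second = line_rate_bps / 8
    return bytes_per_second * 86_400 * days / 1e9


print(f"T-1:  {gb_per_month(T1_BPS):,.0f} GB/month")
print(f"OC-3: {gb_per_month(OC3_BPS):,.0f} GB/month")
```

That comes to roughly 500 GB a month for a T-1, comfortably above 60GB for one client when streamed continuously, and roughly 50 TB a month for an OC-3, which covers the whole 8-12 Terabytes.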

The big problem with either answer is the hardware at the remote end. DVD readers are cheap, DAT72 isn't (and LTO 900 and OC-3s are even worse.)
posted by eriko at 4:09 PM on September 30, 2005


The unique dataset idea is a great one. We did that with one of our clients. We actually do an SQL dump of their database, diff it with subversion, and commit it into our repository every few minutes. That way we not only have a low-bandwidth backup that's almost up to the minute at a remote location, but we can roll our *backups* back to any point.

If it's a non-unique dataset, just script a yank of the diffs out of your server and patch and import your local file.
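
One way to find out how big the monthly diffs really are, sketched under assumptions: the dataset is a directory tree of files, and you keep last month's manifest around to compare against. This is not the subversion setup described above, just a simpler stand-in, and the function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; read in chunks for very large files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Files that are new, or whose contents differ from the old manifest."""
    return sorted(name for name, digest in new.items() if old.get(name) != digest)
```

Only the files returned by `changed_files` would need to be shipped. Deleted files are not reported here and would need separate handling.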
posted by SpecialK at 4:09 PM on September 30, 2005


Oog, missed a point that paulsc didn't. My statements are based on 40-60GB *unique* data per client per month. If it's the same 40-60GB of data going to 200 clients: one tape drive, one tape duplicating service, one FedEx account.

You'll still want a nice fast SCSI card on a PCI-X or -E bus, though.
posted by eriko at 4:13 PM on September 30, 2005


If you do choose to go with tape, btw, I remember when we had to do a similar chore for a client. We bought a set of tape jukes, had 3 drives in each and a 24 tape capacity (I think). We'd leave it on overnight to do the duplication and it'd yoink a tape out of its slot, put it in the drive, wait till it was done, pull it out of the drive, put it back in its slot, move to the next empty tape, put it in the drive, etc. etc. etc. Then in the morning we'd shut the juke down (not supposed to do this, but it was the only way we could load and unload 24 at a time) and pull all the tapes out and refill it. We could do about 48 tapes a day per juke that way.

Another way we did that kind of chore, but with a larger dataset (we only had DAT72 tapes), was to use RAID 1 mirroring. It was so bad, but we just maintained a server with the dataset on a partition in a RAID 1 array, and then just jacked a hotswap drive in and mirrored it. When it was done, we shipped the drive off to the customer in the caddy ... they had the same server as we did, so they just downed the server, yoinked their data partition drive out and put the new one in there.
posted by SpecialK at 4:15 PM on September 30, 2005


Assuming you are sending each client the same 40-60 GB, why not host a torrent and let the clients help each other with the distribution?
posted by mullingitover at 4:17 PM on September 30, 2005


You might want to look at rsync over ssh. That way you're only shipping the bits that differ, and you're on a protocol that can resume failed transfers without issue.
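
A minimal sketch of what that could look like when scripted. The rsync flags are standard ones; the source path and destination host in the example are hypothetical placeholders, and the helper names are made up for illustration.

```python
import subprocess


def rsync_command(src_dir: str, dest: str) -> list[str]:
    """Build an rsync-over-ssh command line for mirroring src_dir to dest."""
    return [
        "rsync",
        "-a",          # archive mode: recurse and preserve permissions, times, links
        "-z",          # compress data in transit
        "--partial",   # keep partially transferred files so a retry can resume
        "-e", "ssh",   # use ssh as the remote shell
        src_dir.rstrip("/") + "/",  # trailing slash: copy the directory's contents
        dest,
    ]


def sync(src_dir: str, dest: str) -> None:
    """Run the transfer; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(rsync_command(src_dir, dest), check=True)


if __name__ == "__main__":
    # Hypothetical path and host, for illustration only; nothing is transferred here.
    print(" ".join(rsync_command("/data/monthly", "customer@example.com:/incoming/monthly/")))
```

Calling `sync(...)` from a monthly cron job would send only what changed since the last run, assuming the customer keeps the previous copy in place.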

If you must use DVDs, at a minimum you should buy a robotic changer to burn them all for you. You can pick one up for under $2k, which should pay for itself in short order.
posted by I Love Tacos at 4:36 PM on September 30, 2005


If your data won't compress much, like jpgs, the DAT72 is really only 36GB. They use the 2:1 average compression estimate to make it appear larger.

I get about 23GB on a 20/40 DAT tape that backs up mostly jpgs, instead of the 40GB it is rated for.
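
A small sketch of what that means for media counts per client. It assumes 60GB of data per client, the 36GB native DAT72 figure above, and 4.7GB for a single-layer DVD; the function name is illustrative.

```python
import math


def media_needed(data_gb: float, native_gb: float, compression_ratio: float = 1.0) -> int:
    """Pieces of media needed for data_gb, given native capacity and achieved compression."""
    return math.ceil(data_gb / (native_gb * compression_ratio))


print("DAT72, incompressible data:", media_needed(60, 36))
print("DAT72, if 2:1 really held: ", media_needed(60, 36, 2.0))
print("Single-layer DVD:          ", media_needed(60, 4.7))
```

So incompressible data turns a one-tape job into a two-tape job per client, which is still far fewer pieces than the 13 DVDs the same 60GB would need.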
posted by jjj606 at 5:04 PM on September 30, 2005


Does xmutex have too much data to use .torrent files efficiently for distribution? I'm curious because BitTorrent was my first thought as an answer.
posted by lambchop1 at 5:27 PM on September 30, 2005


This sounds like the perfect environment for bittorrent. Especially if all of the clients are getting the same data sets - release all of the data for the clients on the same day, and let them handle distribution for you.
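
Part of why letting clients serve each other is workable: BitTorrent splits the payload into fixed-size pieces and publishes a SHA-1 hash for each piece, so a client can check every piece it receives from a peer. A simplified sketch of that idea follows; it is not a real .torrent builder, and the function names are made up.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    ]


def verify_piece(index: int, piece: bytes, expected: list[bytes]) -> bool:
    """Check a piece received from a peer against the published hash list."""
    return hashlib.sha1(piece).digest() == expected[index]
```

A corrupted or tampered piece fails `verify_piece` and can simply be fetched again from another peer.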
posted by zerokey at 5:31 PM on September 30, 2005


FedEx beats bittorrent for really massive files, especially with only a few recipients. With enough customers and their willingness to participate with each other it might work, especially since your files aren't really that big. Many scientists are in the habit of shipping huge hard drives back and forth to each other. As long as you can count on getting them back it's hard to beat a hard drive for data capacity. [and what b1tr0t said]
posted by caddis at 6:41 PM on September 30, 2005


With enough customers etc. bittorrent might work I meant.
posted by caddis at 6:42 PM on September 30, 2005


The problem with Bittorrent is then everyone knows all the other customers, which can potentially cause huge problems (especially since you end up with competitors serving to one another!).

Very different, though, if we're talking about something like doctors/hospitals/different departments at the same company.

Incidentally, 60GB drives are cheap, and you could use USB2 enclosures and perhaps some extra USB2 hubs on your PCI bus to flood them with data (hard drives fill at about 40MB a sec off straight UDMA; never tried with USB2). What we did, though, when we had to image a lab was use a tool called UDPCast to flood every drive simultaneously. UDPCast actually gets 95MBit on a 100Mbit LAN, *reliably*. Nothing else comes close.

Tapes are great, but there's really something to be said for $38 drives (see Froogle) and little startup cost.
posted by effugas at 7:57 AM on October 1, 2005

