daddy, i want more io
March 1, 2011 10:12 AM
How to maximize IO for a fairly large dataset to a single machine?
We have some seriously IO-bound processes and a huge imbalance between cores and IO capacity. Currently we are using NFS via a NetApp filer to approximately 10 machines, each with 24 fast cores. Each core can realistically process about 100MB/sec of incoming IO, and currently our NetApp is driving an aggregate of about 800MB/sec, or ~80MB/sec per node, meaning we are using roughly 5% of the available CPU.
The size of the dataset is around 40TB total.
What are some good ways to maximize IO and minimize total processing time? I'm leaning towards direct-attached FC or PCIe storage on a single machine and trying to saturate that one box instead of going over NFS, but outside of big SAN investments I don't see 40TB fitting on PCIe/SSD drives.
The RAMSAN / TMS systems are nifty, but at $30/gig they are outrageously priced.
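A quick sketch of the arithmetic in the question, using only the figures given above (the exact utilization percentage depends on how you count per-core throughput):

    # Back-of-envelope numbers taken from the question.
    NODES = 10                  # machines mounting the NetApp over NFS
    CORES_PER_NODE = 24         # fast cores per machine
    MB_PER_SEC_PER_CORE = 100   # what each core can realistically chew through
    DELIVERED_MB_PER_SEC = 800  # aggregate the filer is currently driving

    demand = NODES * CORES_PER_NODE * MB_PER_SEC_PER_CORE   # 24,000 MB/s
    per_node = DELIVERED_MB_PER_SEC / NODES                  # ~80 MB/s per node
    utilization = DELIVERED_MB_PER_SEC / demand              # a few percent

    print(f"aggregate demand  : {demand:,} MB/s")
    print(f"delivered per node: {per_node:.0f} MB/s")
    print(f"fraction of CPU fed: {utilization:.1%}")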
A MapReduce-style approach would probably require a large code change, but it takes the stance that processing is quicker to move than data. It would distribute your dataset and move the processing from node to node based on what data is required.
See also MapReduce (Wikipedia)
posted by ibfrog at 10:36 AM on March 1, 2011
I'm a little fuzzy on where exactly the bottleneck is here.
Is it the total bandwidth on your storage array? If so, you'll need a killer SAN or something.
Is it the network? I expect not, as switched gigabit Ethernet is pretty cheap.
Is it NFS? It's not exactly the bestest and newest distributed filesystem.
Assuming it's the array bandwidth, I'd suggest maybe something like HDFS, the Hadoop distributed filesystem. But that's only relevant if you have storage & CPU distributed evenly through your cluster. Depending on your topology it may not quite work. But I'm no expert on this stuff.
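If Hadoop is on the table, the lowest-friction entry point is Hadoop Streaming, which pipes each HDFS block through the stdin/stdout of any executable, so the map tasks can run on the nodes that already hold the data. A minimal, hypothetical mapper/reducer pair in Python (the tab-separated field layout and the filter condition are invented for illustration):

    #!/usr/bin/env python3
    # mapper.py -- hypothetical Hadoop Streaming mapper. Hadoop tries to run
    # each map task on the node that already stores the HDFS block, so the
    # bulk of the dataset never has to cross the network.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] == "ERROR":   # invented filter
            print(f"{fields[0]}\t1")                    # emit key<TAB>count

    #!/usr/bin/env python3
    # reducer.py -- sums the per-key counts; input arrives sorted by key.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = key
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

You'd launch it with the hadoop-streaming jar, pointing -input/-output at HDFS paths and -mapper/-reducer at the two scripts.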
posted by GuyZero at 10:54 AM on March 1, 2011
Reading between the lines, ibfrog, I believe the OP has done that -- they are parallel-processing a massive data set. The problem they are running into is IO distribution of the data set to the various processors.
My thoughts are: what are the data transfer limits with current technology?
On a single machine, a SATA hard drive can't push more than about 1.5 Gbit/sec sustained, whereas SSDs can push around 500 MB/s (SATA wiki).
That wiki link has a table towards the bottom that shows a bus comparison as well.
Distributing the data over an array of disks is necessary since each disk has limited throughput. Then it's a matter of how many disks it takes to saturate a bus interface, and then there's how fast the "receiver" of the data can actually process it.
Sure, each core can handle 100MB/sec, but if the data is on a filer and you're running NFS over gig Ethernet, where does the bottleneck really happen? On the filer (pulling data off the disk)? On the network/switch? On the Ethernet card of the receiving machine? You'll always have to look at the "slowest" link in the chain, and that will be the bottleneck. I think you'll be stuck with having lots of idle CPU, since IO has been, is, and probably always will be the bottleneck in computing.
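A toy illustration of that weakest-link view, with made-up placeholder numbers for each hop (swap in measured figures from your own chain):

    # Weakest link in one node's read path. All values are illustrative
    # placeholders in MB/s -- measure your own.
    chain = {
        "disks behind the filer":  1600,
        "filer NFS/CPU":           1000,
        "filer network uplink":     125,   # a single GigE port
        "switch backplane share":   400,
        "client GigE NIC":          125,
        "client CPU demand":       2400,   # 24 cores x 100 MB/s
    }

    bottleneck = min(chain, key=chain.get)
    print(f"bottleneck: {bottleneck} at {chain[bottleneck]} MB/s")
    print(f"idle CPU under that bottleneck: "
          f"{1 - chain[bottleneck] / chain['client CPU demand']:.0%}")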
(Then there's a different discussion about how often you'd run this 40TB job, which might drive answers in one direction or another.)
posted by k5.user at 10:57 AM on March 1, 2011
Are these Intel-type PCI/DDR/etc. machines? FC is probably a good direction to look in, but you really have to look at capacity and contention on the NFS/NetApp side.
However, this is fairly easy to estimate on paper. Every link in your I/O chain should have pretty well-defined capacities, so you can pretty much boil it down to a multivariate calculation and even further down to something like dollars per terabyte-hour of processing. PCI has a capacity, SATA has another limit, RAM another, Ethernet/FC, etc. all the way down the line from the processing CPU back to the filer.
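One way to sketch that dollars-per-terabyte-hour idea; every capacity and cost below is an invented placeholder, the point is only the shape of the calculation:

    # Effective throughput is the minimum capacity in the chain; time to make
    # one pass over the dataset follows from that; cost is amortized hardware
    # spend over the hours the job runs. All numbers are invented.
    DATASET_TB = 40
    link_caps_mb_s = [1000, 125, 800]   # e.g. filer, network, client bus
    hardware_cost_usd = 150_000         # hypothetical amortized spend

    effective_mb_s = min(link_caps_mb_s)
    hours = DATASET_TB * 1_000_000 / effective_mb_s / 3600

    print(f"effective throughput: {effective_mb_s} MB/s")
    print(f"one full pass over {DATASET_TB} TB: {hours:.1f} hours")
    print(f"cost per terabyte-hour: "
          f"${hardware_cost_usd / (DATASET_TB * hours):.2f}")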
posted by rhizome at 12:29 PM on March 1, 2011
The trick with MapReduce / Hadoop is to properly distribute the tasks to where the data lives. Some day I'll find a way to study this and get paid for it, but for now you'll only get my speculation and simple math.
Independent numbered thoughts:
1. First step is making sure you're getting what you paid for from that NetApp. 800MB/sec sounds fast, but assuming ~100MB/sec drives that's only about 8 active disks' worth. Benchmark the simplest I/O app you can think of (bonnie++?) under the simplest (fastest) conditions you can, so you get an upper limit on where you are (a crude sequential-read sketch follows this list).
2. Slowly increase complexity of your setup and try to identify the major dropoff in I/O.
3. It'll take a lot of disks to hold 40TB. SSD is still cost prohibitive, I'd think.
4. This is the downside to those mega core machines; lots of CPU and RAM on hand, disk From The Cloud. Which is handy for vendors, because now it's not their fault when you can't supply the I/O to fuel it. Under the throughput assumption in item 1, you need at least 1 drive per core. More if you want failure tolerance (one usually does).
5. If you go the build-it-yourself route, make sure you perf-tune the filesystem. There are lots of slow defaults, like atime and ordered journaling. Depending on your needs, relaxing writeback/journaling guarantees might be a huge saver. See libeatmydata.
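On item 1: bonnie++ (or plain dd) is the usual tool, but even a few lines of Python will give you a rough ceiling for sequential reads off a given mount. The path and chunk size below are placeholders; use a file large enough to defeat client-side caching:

    # Crude sequential-read benchmark: stream a big existing file off the
    # NFS mount and report MB/s.
    import time

    PATH = "/mnt/netapp/some_big_test_file"   # hypothetical test file
    CHUNK = 8 * 1024 * 1024                   # 8 MB reads

    start, total = time.time(), 0
    with open(PATH, "rb", buffering=0) as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            total += len(buf)
    elapsed = time.time() - start
    print(f"read {total / 1e6:.0f} MB in {elapsed:.1f}s "
          f"-> {total / 1e6 / elapsed:.0f} MB/s")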
posted by pwnguin at 12:37 PM on March 1, 2011
Oh, and a friend points out that your network is probably the limiting factor with the current setup. Gigabit Ethernet translates to roughly 800Mbit/sec of usable throughput, i.e. about 100MB/sec per link, which is right around the ~80MB/sec each node is seeing. I don't know enough about these things to say whether you can add more cards to the NetApp and reorganize the network to improve throughput, but it seems worth looking into.
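To put rough numbers on that (a sketch; the 80% usable-throughput factor is an assumption):

    # How many network ports would it take to keep one 24-core node fed?
    # GigE is ~125 MB/s raw, 10GigE ~1250 MB/s raw.
    CORES, MB_PER_CORE = 24, 100
    GIGE_MB_S, TEN_GIGE_MB_S = 125, 1250
    EFFICIENCY = 0.8                    # assumed protocol/packet overhead

    need = CORES * MB_PER_CORE          # 2400 MB/s per node
    print(f"per-node demand: {need} MB/s")
    print(f"GigE ports needed per node:   {need / (GIGE_MB_S * EFFICIENCY):.0f}")
    print(f"10GigE ports needed per node: {need / (TEN_GIGE_MB_S * EFFICIENCY):.1f}")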
posted by pwnguin at 1:47 PM on March 1, 2011
What is the data access pattern -- mostly random, or mostly sequential? Does each processing node need access to all the data, or can they chew on a well-defined subset? Is the data compressed? How well does it compress? How many spindles are in the NetApp? It sounds like you need hundreds.
Not that all of that is going to matter though. It sounds like your network can keep roughly 8 cores fed. If you continue using centralized storage, it sounds like you need to have at least 3 bonded GigE ports per node. Further, it sounds like you want at least 2 10GigE ports on the NetApp, and suitable switch capacity to feed at least 30 GigE ports.
If your data compresses reasonably well, and it makes sense to compress it, you might cut your I/O needs in half or so.
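A quick way to estimate whether compression would buy anything, run against a sample of the real data (the path is a placeholder):

    # Compress a sample of the dataset in memory and report the ratio. A 2:1
    # ratio means every byte moved over the wire feeds two bytes of work to
    # the cores, at the cost of CPU spent decompressing.
    import gzip

    SAMPLE = "/mnt/netapp/sample_chunk.dat"   # hypothetical sample file

    with open(SAMPLE, "rb") as f:
        raw = f.read(256 * 1024 * 1024)       # first 256 MB is plenty for a feel
    compressed = gzip.compress(raw, compresslevel=1)   # fast level
    ratio = len(raw) / len(compressed)
    print(f"sample compresses {ratio:.1f}:1 "
          f"-> network I/O drops to ~{100 / ratio:.0f}% of today's")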
posted by Good Brain at 2:35 PM on March 1, 2011
Scratch that, my math slipped a decimal on the network stuff.
posted by Good Brain at 2:42 PM on March 1, 2011
This thread is closed to new comments.