What is a good introduction to parallel programming solutions?
April 9, 2009 8:02 AM Subscribe
What is a good way/book/research paper/method to learn how to take existing algorithms and solutions, reimagined as for use with a parallel processor? CUDA experts especially.
One of my primary roles at my current company is quantitative analysis. We're going to be investing in a "test box" with a couple high end NVidia cards for the purpose of trying out CUDA or OpenCL. There are several limitations for CUDA programming, such as no recursion, limited memory access, etc. I'm not necessarily looking for CUDA specific advice, more general parallel programming advice.
I'm a pretty good quantitative programmer. I handle higher level math pretty well and haven't had trouble applying mathematical concepts in code (with the help of my Numerical Recipes book of course). However, parallel solutions are foreign to me and I can't seem to wrap my head around the nuances of re imagining existing solutions in such a way. This is especially true if you have some dependent data between "threads"... I'm not sure where the speedup comes from in that case.
I'm specifically looking for a general survey of common algorithms rewritten in a parallel fashion. I doubt there's anything CUDA specific, so that's not a huge issue.
PS I've already ordered The Art of Multiprocessor Programming and Parallel Scientific Computing in C++ and MPI, although MPI is not the goal I figure I can abstract something good from the book (thumbed through some pages on Google Books). These are the books my boss and I decided to buy first after hours of googling and reading Amazon reviews.
PPS - If anyone simply wants to relate a CUDA story here, feel free.
One of my primary roles at my current company is quantitative analysis. We're going to be investing in a "test box" with a couple high end NVidia cards for the purpose of trying out CUDA or OpenCL. There are several limitations for CUDA programming, such as no recursion, limited memory access, etc. I'm not necessarily looking for CUDA specific advice, more general parallel programming advice.
I'm a pretty good quantitative programmer. I handle higher level math pretty well and haven't had trouble applying mathematical concepts in code (with the help of my Numerical Recipes book of course). However, parallel solutions are foreign to me and I can't seem to wrap my head around the nuances of re imagining existing solutions in such a way. This is especially true if you have some dependent data between "threads"... I'm not sure where the speedup comes from in that case.
I'm specifically looking for a general survey of common algorithms rewritten in a parallel fashion. I doubt there's anything CUDA specific, so that's not a huge issue.
PS I've already ordered The Art of Multiprocessor Programming and Parallel Scientific Computing in C++ and MPI, although MPI is not the goal I figure I can abstract something good from the book (thumbed through some pages on Google Books). These are the books my boss and I decided to buy first after hours of googling and reading Amazon reviews.
PPS - If anyone simply wants to relate a CUDA story here, feel free.
Well, you've probably already looked at this, but the CUDA sdk has several white papers on how they parallelized different algorithms. They were helpful to me when I started to use CUDA.
Some algorithms are going to be pretty difficult to adapt to CUDA for the same reason that it makes some of them really fast: the heavily SIMD architecture. One thing to watch out for is that some of the general parallel programming advice is not going to be valid. The same things you do to get an algorithm to run on a cluster is not the same as programming in CUDA, because a fat node is not comparable to a stream processor with a limited instruction set.
It may be useful to do only part of your algorithm on the GPU. For example, say you have an image processing algorithm that converts images to Lab color space, quantizes them and clusters them. The color space conversion would be trivial in CUDA and you would get a good speedup. The quantization might be a little harder to implement, and the clustering would probably be a lot harder and maybe not worth it in terms of speed. (although there is a CUDA clustering project on Google Code, so what do I know?)
posted by demiurge at 8:41 AM on April 9, 2009
Some algorithms are going to be pretty difficult to adapt to CUDA for the same reason that it makes some of them really fast: the heavily SIMD architecture. One thing to watch out for is that some of the general parallel programming advice is not going to be valid. The same things you do to get an algorithm to run on a cluster is not the same as programming in CUDA, because a fat node is not comparable to a stream processor with a limited instruction set.
It may be useful to do only part of your algorithm on the GPU. For example, say you have an image processing algorithm that converts images to Lab color space, quantizes them and clusters them. The color space conversion would be trivial in CUDA and you would get a good speedup. The quantization might be a little harder to implement, and the clustering would probably be a lot harder and maybe not worth it in terms of speed. (although there is a CUDA clustering project on Google Code, so what do I know?)
posted by demiurge at 8:41 AM on April 9, 2009
Response by poster: Good answers so far, I do appreciate the input as I'm rather clueless about this whole thing. I also found this: CUDA textbook
posted by teabag at 8:49 AM on April 9, 2009
posted by teabag at 8:49 AM on April 9, 2009
"This is especially true if you have some dependent data between "threads"... I'm not sure where the speedup comes from in that case."
There is no straightforward algorithm for converting a serial algorithm to a parallel one. Each problem has a "degree" to which it can be parallelized, and your job as an implementer is to allow as much of that as possible. In pathological cases, there is no speedup.
When one thread's computes on inputs from another thread, you're in a bit of a pickle. You can set up a pipeline and each thread computes something different. If you have a lot of data to chew through, this can be pretty effective, but it depends on how high the cost to transfer the inputs and outputs are.
I don't know much about CUDA, but if it's anything like the older graphics pipelines, you basically have a similar concept. In times past, shaders were set up to do the same calculation (calculating normal vectors, etc) on lots and lots of data (vertices).
Where this sort of model fails is when the first input of the second stage relies on the last input of the first stage. No processing on the second stage can be done until the first stage completes, and you have zero speedup. This is the aforementioned pathological case.
posted by pwnguin at 1:19 PM on April 9, 2009
There is no straightforward algorithm for converting a serial algorithm to a parallel one. Each problem has a "degree" to which it can be parallelized, and your job as an implementer is to allow as much of that as possible. In pathological cases, there is no speedup.
When one thread's computes on inputs from another thread, you're in a bit of a pickle. You can set up a pipeline and each thread computes something different. If you have a lot of data to chew through, this can be pretty effective, but it depends on how high the cost to transfer the inputs and outputs are.
I don't know much about CUDA, but if it's anything like the older graphics pipelines, you basically have a similar concept. In times past, shaders were set up to do the same calculation (calculating normal vectors, etc) on lots and lots of data (vertices).
Where this sort of model fails is when the first input of the second stage relies on the last input of the first stage. No processing on the second stage can be done until the first stage completes, and you have zero speedup. This is the aforementioned pathological case.
posted by pwnguin at 1:19 PM on April 9, 2009
I remember, I remember.... way back in the early 90s we were poised for parallel programming to take over the world. It never happened in the way we envisaged, of people crafting brilliant code. (Yes, I know much modern machinery uses massive parallelism, but not individually hand-crafted.)
Part of the story is that writing parallel code turned out to be much harder than was anticipated. Don't feel bad about finding it hard to wrap your head round it, and do use all the tools available to guard against human error.
posted by Idcoytco at 2:43 PM on April 9, 2009
Part of the story is that writing parallel code turned out to be much harder than was anticipated. Don't feel bad about finding it hard to wrap your head round it, and do use all the tools available to guard against human error.
posted by Idcoytco at 2:43 PM on April 9, 2009
I stumbled across this the other day:
GPGPU stands for General-Purpose computation on Graphics Processing Units. Graphics Processing Units (GPUs) are high-performance many-core processors that can be used to accelerate a wide range of applications.
posted by bdc34 at 3:25 PM on April 9, 2009
GPGPU stands for General-Purpose computation on Graphics Processing Units. Graphics Processing Units (GPUs) are high-performance many-core processors that can be used to accelerate a wide range of applications.
posted by bdc34 at 3:25 PM on April 9, 2009
Also, demiurge: by clustering, are you talking about an algorithm like k means clustering? Because that's an algorithm that begs to be done in parallel.
posted by bdc34 at 3:32 PM on April 9, 2009
posted by bdc34 at 3:32 PM on April 9, 2009
I'm specifically looking for a general survey of common algorithms rewritten in a parallel fashion.
Try The Sourcebook of Parallel Computing and Introduction to Parallel Computing. You might find Patterns for Parallel Programming to be useful, also, but the first two books are definitely more of what you're looking for.
For CUDA programming in particular, those books may or may not be helpful. CUDA is only loosely related to standard MPI / OpenMP / pthreads / ... programming.
NCSA/GLCPC's CUDA workshop from last summer (1 week course) and UIUC's CUDA class (1 semester course) are great resources. David Kirk (NVIDIA Chief Scientist) was a co-instructor for both of these and helped develop the material. Both give excellent introductions to the basics of programming with CUDA.
Once you understand the basics, the example code in the SDK is very helpful. For day-to-day references, the Programming Guide, Reference Manual, and CUDA forums are a great help.
ryanr:
Realize that CUDA programming is much different than parallel computing in general. Learning about MPI isn't going to help you much, if at all. GPGPU programming is its own little world of weird, IMO.
This is very true. CUDA is a strange thing.
Mefi Mail me for stories, I work in a research group that does (and has published) a substantial amount of CUDA work.
posted by doowod at 8:01 PM on April 9, 2009 [3 favorites]
Try The Sourcebook of Parallel Computing and Introduction to Parallel Computing. You might find Patterns for Parallel Programming to be useful, also, but the first two books are definitely more of what you're looking for.
For CUDA programming in particular, those books may or may not be helpful. CUDA is only loosely related to standard MPI / OpenMP / pthreads / ... programming.
NCSA/GLCPC's CUDA workshop from last summer (1 week course) and UIUC's CUDA class (1 semester course) are great resources. David Kirk (NVIDIA Chief Scientist) was a co-instructor for both of these and helped develop the material. Both give excellent introductions to the basics of programming with CUDA.
Once you understand the basics, the example code in the SDK is very helpful. For day-to-day references, the Programming Guide, Reference Manual, and CUDA forums are a great help.
ryanr:
Realize that CUDA programming is much different than parallel computing in general. Learning about MPI isn't going to help you much, if at all. GPGPU programming is its own little world of weird, IMO.
This is very true. CUDA is a strange thing.
Mefi Mail me for stories, I work in a research group that does (and has published) a substantial amount of CUDA work.
posted by doowod at 8:01 PM on April 9, 2009 [3 favorites]
I was thinking of hierarchical clustering, but as I saw from the link I posted, there's some code available that does it already.
posted by demiurge at 9:12 AM on April 10, 2009
posted by demiurge at 9:12 AM on April 10, 2009
Hi,
I've worked quite a bit with CUDA over the last year and a half. It's nice, I like it. I think my advice is to start with the matrix multiplication example in the CUDA Programming Guide, read, disassemble, and reimplement it until you've understood it well enough to re-implement it.
Then start looking at implementations of things similar to what you're doing.
Also, the "doctors' method" of learning works exceedingly well: See one, do one, teach one.
I've been looking at a GLSL / Cg / CUDA implementation of some code similar to what I'm working on. The GLSL / Cg stuff is completely unintelligible to me, and seems awkward to work with. CUDA is hard to wrap your head around, but as a tool / working environment, it's quite acceptable.
posted by krilli at 6:42 AM on May 5, 2009 [1 favorite]
I've worked quite a bit with CUDA over the last year and a half. It's nice, I like it. I think my advice is to start with the matrix multiplication example in the CUDA Programming Guide, read, disassemble, and reimplement it until you've understood it well enough to re-implement it.
Then start looking at implementations of things similar to what you're doing.
Also, the "doctors' method" of learning works exceedingly well: See one, do one, teach one.
I've been looking at a GLSL / Cg / CUDA implementation of some code similar to what I'm working on. The GLSL / Cg stuff is completely unintelligible to me, and seems awkward to work with. CUDA is hard to wrap your head around, but as a tool / working environment, it's quite acceptable.
posted by krilli at 6:42 AM on May 5, 2009 [1 favorite]
Some algorithms are going to be pretty difficult to adapt to CUDA for the same reason that it makes some of them really fast: the heavily SIMD architecture. One thing to watch out for is that some of the general parallel programming advice is not going to be valid. The same things you do to get an algorithm to run on a cluster is not the same as programming in CUDA, because a fat node is not comparable to a stream processor with a limited instruction set.Actually, in my experience, each of the individual multiprocessors in CUDA is fat enough to be considered a separate node. And the instruction set isn't that limited, really.
Some things you can speedup like a 100 times, if all ~16 thread units on all ~8 multiprocessors are kept busy. If you only manage to keep one thread alive on each multiprocessor, you still have eight cores chugging along. Even though they're a bit "challenged" compared to a full, modern x86 core.
So don't worry about getting 100% pipeline utilisation in 100% of your pipeline. CUDA is sufficiently flexible, compared to the old GPGPU stuff, that your code can be quite lumpy and still be respectably fast.
In short, it's my experience that LESS stuff will have to be done on the main CPU than you might think. It's probably worth it to have all the stuff running on the GPU.
posted by krilli at 7:12 AM on May 5, 2009
This thread is closed to new comments.
Nvidia's documentation and examples are pretty good. They cover relevant examples like FFTs and Black-Scholes option pricing.
posted by ryanrs at 8:32 AM on April 9, 2009