Best language for parallel code
September 20, 2006 5:25 AM Subscribe
I have an embarrassingly parallel problem for which I currently have a working algorithm prototyped in a relatively high level language (IDL). It's time I properly parallelised it however, and also rewrote it with something I can optimise a bit more easily. What languages and resources are available?
Key things I'm looking for are:
a) freeness, cost in particular but also in the open sense
b) good array mangling - something that will easily let me mangle matrices is a bonus
c) portability - I may run this on any number of unix-like OS's and architectures
d) minimal memory management - I don't want to have to go through a load of debug cycles fixing memory leaks and overflow errors, even if it does mean it's going to be a bit less optimal
e) documentation - ideally in the form of a good book I can just order off Amazon or wherever.
I'm aware I'm unlikely to get all those things in one package. Fortran and OpenMP look like the best bet so far, but I suspect I'd need a book on it and I'm not sure how gcc is doing on the OpenMP front. I could write a CLI and work some cleverness to call multiple processes of it but it's somewhat less tidy to do things that way. Have I missed anything in my Googling on the matter?
(By the way I should add that I'm less bothered about an ability to run across a network - I'm not likely to actually need to automate spreading it across multiple boxes. I'm really after a language where I can simply write a loop and mark it out as being something for which each iteration is independent, and where I can multiply two large arrays and have the processors share the work magically).
Key things I'm looking for are:
a) freeness, cost in particular but also in the open sense
b) good array mangling - something that will easily let me mangle matrices is a bonus
c) portability - I may run this on any number of unix-like OS's and architectures
d) minimal memory management - I don't want to have to go through a load of debug cycles fixing memory leaks and overflow errors, even if it does mean it's going to be a bit less optimal
e) documentation - ideally in the form of a good book I can just order off Amazon or wherever.
I'm aware I'm unlikely to get all those things in one package. Fortran and OpenMP look like the best bet so far, but I suspect I'd need a book on it and I'm not sure how gcc is doing on the OpenMP front. I could write a CLI and work some cleverness to call multiple processes of it but it's somewhat less tidy to do things that way. Have I missed anything in my Googling on the matter?
(By the way I should add that I'm less bothered about an ability to run across a network - I'm not likely to actually need to automate spreading it across multiple boxes. I'm really after a language where I can simply write a loop and mark it out as being something for which each iteration is independent, and where I can multiply two large arrays and have the processors share the work magically).
Best answer: Well, it all depends on what hardware you want to run it on - a cluster of small (1-2 cpu) machines, big SMP or NUMA box (16-32 cpu) or something wilder like a CM-5. Each architecture has a different approach to program construction.
Since you want to run it on unix machines, I'll assume a cluster. Personally I'd recommend C++ or Java, your choice depending on which you are more comfortable with. Certainly Java will be safer unless you are really down with C++. Threading is dangerous unless you understand it; the vast majority of people who think they understand it really don't.
Java with its monitors on every object does make it much easier to write a correct multi-threaded program than C++ does. Those who claim Java is slow aren't really paying attention - a modern JVM like HotSpot does JIT compilation and dynamic optimisation, often getting to the same speed as C and occasionally better since the garbage collector can cluster objects by reachability (therefore access patterns... sometimes) and improve CPU cache hit rates. This is true mostly for complex data structures, not so much uniform stuff like big arrays.
I'm not sure what Java has for linear algebra but there would have to be some nice classes out there somewhere. C++ has STL and MTL which both rock, not to mention BLAS and linpack and probably a bunch more.
Building a distributed implementation can also be difficult; if you even think about message passing stuff, there's a lot you probably want to know about wave algorithms, elections, distributed termination detection, etc, etc. Certainly if you take that route, look into MPI. OpenMP is distributed shared memory and therefore likely much easier to program (depending on your problem) but if you aren't careful you can incur terrible costs by thrashing pages back and forth between machines.
Since you say it's embarrassingly parallel.... can you statically split the workload into small chunks and then quickly/efficiently recombine (if necessary) the results from lots of separate runs? That'd be the lowest-risk and likely highest-performance option by far if it truly is very parallel.
The biggest technical risks (bug opportunities) in a program like this are the places where your units of concurrency, i.e. threads on an SMP box or sites in a distributed program, interact. If you can get rid of all interactions and I do mean all, the code will be much much more reliable. Less sexy, yes, but more reliable. It's much harder to assure yourself that a concurrent or distributed program is correct than a simple single-threaded thing that operates in isolation on a small part of your problem space.
Your desire to have the program auto-parallelised is not a dream easily realised. There are some auto-parallelising fortrans out there but they're definitely not portable and fortran is pretty horrible to write. Likewise C* for the Thinking Machines series - equally unportable. You're most likely going to have to suck it up and explicitly parallelise. Sorry.
Perhaps you can express it using transactions? There are distributed object databases out there supporting transactions; with those you can run mutators (the computation) on a bunch of sites and have them operate consistently over a central wad of data. It's a lot harder to write incorrect code with ACID transactions but you can pay a big performance cost on it depending on how smart the object database is. That could turn disgusting though.
If you could give us more details of the application and the systems you want to run it on, that'd probably help. Or drop me a line (email findable via like in profile) - I'm a grad student working in distributed systems. I may not have a chance to look into this thread for a while.
posted by polyglot at 7:55 AM on September 20, 2006
Since you want to run it on unix machines, I'll assume a cluster. Personally I'd recommend C++ or Java, your choice depending on which you are more comfortable with. Certainly Java will be safer unless you are really down with C++. Threading is dangerous unless you understand it; the vast majority of people who think they understand it really don't.
Java with its monitors on every object does make it much easier to write a correct multi-threaded program than C++ does. Those who claim Java is slow aren't really paying attention - a modern JVM like HotSpot does JIT compilation and dynamic optimisation, often getting to the same speed as C and occasionally better since the garbage collector can cluster objects by reachability (therefore access patterns... sometimes) and improve CPU cache hit rates. This is true mostly for complex data structures, not so much uniform stuff like big arrays.
I'm not sure what Java has for linear algebra but there would have to be some nice classes out there somewhere. C++ has STL and MTL which both rock, not to mention BLAS and linpack and probably a bunch more.
Building a distributed implementation can also be difficult; if you even think about message passing stuff, there's a lot you probably want to know about wave algorithms, elections, distributed termination detection, etc, etc. Certainly if you take that route, look into MPI. OpenMP is distributed shared memory and therefore likely much easier to program (depending on your problem) but if you aren't careful you can incur terrible costs by thrashing pages back and forth between machines.
Since you say it's embarrassingly parallel.... can you statically split the workload into small chunks and then quickly/efficiently recombine (if necessary) the results from lots of separate runs? That'd be the lowest-risk and likely highest-performance option by far if it truly is very parallel.
The biggest technical risks (bug opportunities) in a program like this are the places where your units of concurrency, i.e. threads on an SMP box or sites in a distributed program, interact. If you can get rid of all interactions and I do mean all, the code will be much much more reliable. Less sexy, yes, but more reliable. It's much harder to assure yourself that a concurrent or distributed program is correct than a simple single-threaded thing that operates in isolation on a small part of your problem space.
Your desire to have the program auto-parallelised is not a dream easily realised. There are some auto-parallelising fortrans out there but they're definitely not portable and fortran is pretty horrible to write. Likewise C* for the Thinking Machines series - equally unportable. You're most likely going to have to suck it up and explicitly parallelise. Sorry.
Perhaps you can express it using transactions? There are distributed object databases out there supporting transactions; with those you can run mutators (the computation) on a bunch of sites and have them operate consistently over a central wad of data. It's a lot harder to write incorrect code with ACID transactions but you can pay a big performance cost on it depending on how smart the object database is. That could turn disgusting though.
If you could give us more details of the application and the systems you want to run it on, that'd probably help. Or drop me a line (email findable via like in profile) - I'm a grad student working in distributed systems. I may not have a chance to look into this thread for a while.
posted by polyglot at 7:55 AM on September 20, 2006
goddamnit I swear I previewed... link in profile.
posted by polyglot at 7:56 AM on September 20, 2006
posted by polyglot at 7:56 AM on September 20, 2006
Erlang was created more or less soley to deal with concurrency. It's a functional language that the Lispaholics over at Reddit are going gaga for, and I know it matches a and c on your list.
Cavaet: I've never actually used it, I'm only aware of it. Don't take this as a recommendation to go with Erlang as your final decision, but to add it to the list of languages to look into.
posted by revgeorge at 7:59 AM on September 20, 2006
Cavaet: I've never actually used it, I'm only aware of it. Don't take this as a recommendation to go with Erlang as your final decision, but to add it to the list of languages to look into.
posted by revgeorge at 7:59 AM on September 20, 2006
Response by poster: It'd be basically SMP boxes from 2-8 cpus. Possibly boxes with 16 or more cpus. If I needed to use more boxes it's unlikely to be ever more than half a dozen on one-off runs for which the workload can be split. Also, it may end up being released in some form or another to other users, and they'll probably just want to compile it and run it on one box without having to set up any extra infrastructure for it.
When you say:
can you statically split the workload into small chunks and then quickly/efficiently recombine (if necessary) the results from lots of separate runs? That'd be the lowest-risk and likely highest-performance option by far if it truly is very parallel.
that's exactly the deal. There'd be very little message passing or anything like that on the inside. It's being passed a list of a few million things and being asked to do the same thing independently to each.
I suspect the best thing to do is write a program to crunch, and a script to do the management of a bunch of these programs and collate the data, but it'll probably end up with me writing something less usable than something using more general methods than something someone's already written to solve such problems in general. Mainly cos I write bad code :-)
Also there are some sizeable lookup tables so shared memory would be an advantage for that reason.
Portability's an issue because the resources I have available are a few Intel Linux boxes with 1 to 4 cpus, and an SGI Altix or two, and might end up being run on Macs or other systems in the future (Windows is much more unlikely though).
posted by edd at 8:10 AM on September 20, 2006
When you say:
can you statically split the workload into small chunks and then quickly/efficiently recombine (if necessary) the results from lots of separate runs? That'd be the lowest-risk and likely highest-performance option by far if it truly is very parallel.
that's exactly the deal. There'd be very little message passing or anything like that on the inside. It's being passed a list of a few million things and being asked to do the same thing independently to each.
I suspect the best thing to do is write a program to crunch, and a script to do the management of a bunch of these programs and collate the data, but it'll probably end up with me writing something less usable than something using more general methods than something someone's already written to solve such problems in general. Mainly cos I write bad code :-)
Also there are some sizeable lookup tables so shared memory would be an advantage for that reason.
Portability's an issue because the resources I have available are a few Intel Linux boxes with 1 to 4 cpus, and an SGI Altix or two, and might end up being run on Macs or other systems in the future (Windows is much more unlikely though).
posted by edd at 8:10 AM on September 20, 2006
Perl has only had stable threading since 5.8.0, but it seems to match your criteria fairly well.
a. Check and check.
b. Check, but arrays of arrays and hashes of hashes and combinations thereof have to be expressed with references, but it isn't such a big deal once you get a handle on the dereferencing syntax. See some O'Rly on perl matrix ops here: http://www.unix.org.ua/orelly/perl/advprog/ch02_02.htm
c. Check.
d. Perl uses reference counts to free memory so it does leak if you have circular references, but they're (afaik) the only time you have to do anything to patch leaks until you get into C operations on perl internals and you have to manage your own reference counts.
e. Documentation available online is very good, also with great resources like perlmonks and a generally very good community to answer specific problems. O'Rly Perl Best Practices (or the camel book) and the standard docs are a probably the only book you need if you know other languages.
Also, you can use the Inline::*modules to make your life easier... You can drop in to python, eg, if it seems to translate parts of your algorithm more plainly, and when it comes time to tighten up an refactor you can move the most expensive operations to optimized C. Other advantages include a strong standard profiler and testing suite, and a large barn full of wheels you don't have to reinvent in CPAN.
posted by moift at 8:38 AM on September 20, 2006
a. Check and check.
b. Check, but arrays of arrays and hashes of hashes and combinations thereof have to be expressed with references, but it isn't such a big deal once you get a handle on the dereferencing syntax. See some O'Rly on perl matrix ops here: http://www.unix.org.ua/orelly/perl/advprog/ch02_02.htm
c. Check.
d. Perl uses reference counts to free memory so it does leak if you have circular references, but they're (afaik) the only time you have to do anything to patch leaks until you get into C operations on perl internals and you have to manage your own reference counts.
e. Documentation available online is very good, also with great resources like perlmonks and a generally very good community to answer specific problems. O'Rly Perl Best Practices (or the camel book) and the standard docs are a probably the only book you need if you know other languages.
Also, you can use the Inline::*modules to make your life easier... You can drop in to python, eg, if it seems to translate parts of your algorithm more plainly, and when it comes time to tighten up an refactor you can move the most expensive operations to optimized C. Other advantages include a strong standard profiler and testing suite, and a large barn full of wheels you don't have to reinvent in CPAN.
posted by moift at 8:38 AM on September 20, 2006
Response by poster: Perl's good, but as you hint at I'd probably have to drop down to C for the main computation loops, given my experience of the speed of perl arrays - all that pointer following can take its toll I think - and then I suspect I'd end up moving most of the array operations out of perl and I lose any advantage in source clarity and so on in the process.
Perl's great for some tasks, and I use it an awful lot myself, but I don't think it's the tool for the job here, except maybe as the script that handles daughter processes at quite a removed level.
Current thinking is therefore a minimal C or Fortran (I know it's ugly, but I already know the language, and in my field other users will understand the source) program, maybe with some shared memory thrown in between threads/processes if it's done easily enough, and a high level scripting language to split data and recombine output.
posted by edd at 9:14 AM on September 20, 2006
Perl's great for some tasks, and I use it an awful lot myself, but I don't think it's the tool for the job here, except maybe as the script that handles daughter processes at quite a removed level.
Current thinking is therefore a minimal C or Fortran (I know it's ugly, but I already know the language, and in my field other users will understand the source) program, maybe with some shared memory thrown in between threads/processes if it's done easily enough, and a high level scripting language to split data and recombine output.
posted by edd at 9:14 AM on September 20, 2006
I'm really after a language where I can simply write a loop and mark it out as being something for which each iteration is independent, and where I can multiply two large arrays and have the processors share the work magically
What I remember from college (and this was 6 years ago), there was version of Fortran called Parallel Fortran that did this. Basically you took a working Fortran program and you inserted compilier directives into the source code using Fortran comments and voila..you had parallelism. I don't think anyone has written a pre-processor like system for Java. In which case, I would seriously consider a hard core threading library like java.util.concurrent. There is non-Sun provided version for older versions of Java. They do have a cool ThreadPoolExecutor that will take chunks of work and execute it in a pool of threads. You'd have to do the work to queue up the chunks of work in the executor and then wait for the pool to be done, but all the messy threading details are hidden under the covers.
posted by mmascolino at 11:41 AM on September 20, 2006
What I remember from college (and this was 6 years ago), there was version of Fortran called Parallel Fortran that did this. Basically you took a working Fortran program and you inserted compilier directives into the source code using Fortran comments and voila..you had parallelism. I don't think anyone has written a pre-processor like system for Java. In which case, I would seriously consider a hard core threading library like java.util.concurrent. There is non-Sun provided version for older versions of Java. They do have a cool ThreadPoolExecutor that will take chunks of work and execute it in a pool of threads. You'd have to do the work to queue up the chunks of work in the executor and then wait for the pool to be done, but all the messy threading details are hidden under the covers.
posted by mmascolino at 11:41 AM on September 20, 2006
Best answer: There'd be very little message passing or anything like that on the inside. It's being passed a list of a few million things and being asked to do the same thing independently to each.
In that case, my approach would be to write the smallest, simplest program that is given a parameter identifying the part of the problem space and produces its answer. You can then make a wrapper that instantiates n copies of it, each in a thread which takes a larger parameter space and splits it across the n threads. If necessary, you can write a shell/perl script that uses [rs]sh to run a bunch of copies of this program on a list of machines.
Go with whatever language you're most comfortable with. Since you claim to write bad code, I'd say definitely go with Java because it's fails in much less mysterious ways when you do something wrong; you can't randomly overwrite other parts of your program because you went 1 off the end of an array.
You probably want to look up the producer-consumer problem, it's what you'll want to use to division up the work for your worker threads and perhaps collate the results. There are known-good solutions involving a circular buffer and semaphores; if you like I can email you some code for that part and it should be the only place where threads interact.
Make absolutely sure your shared lookup tables truly are lookup only and never modified :)
posted by polyglot at 3:34 PM on September 20, 2006
In that case, my approach would be to write the smallest, simplest program that is given a parameter identifying the part of the problem space and produces its answer. You can then make a wrapper that instantiates n copies of it, each in a thread which takes a larger parameter space and splits it across the n threads. If necessary, you can write a shell/perl script that uses [rs]sh to run a bunch of copies of this program on a list of machines.
Go with whatever language you're most comfortable with. Since you claim to write bad code, I'd say definitely go with Java because it's fails in much less mysterious ways when you do something wrong; you can't randomly overwrite other parts of your program because you went 1 off the end of an array.
You probably want to look up the producer-consumer problem, it's what you'll want to use to division up the work for your worker threads and perhaps collate the results. There are known-good solutions involving a circular buffer and semaphores; if you like I can email you some code for that part and it should be the only place where threads interact.
Make absolutely sure your shared lookup tables truly are lookup only and never modified :)
posted by polyglot at 3:34 PM on September 20, 2006
Response by poster: By way of an update, I rewrote in C using OpenMP. It turned out a coworker uses OpenMP which helped a fair bit.
End result is a program that's scriptable in the way polyglot suggested so that it can be split across multiple machines but is also parallel at a lower level for SMP systems with a compiler that supports it.
posted by edd at 12:26 AM on December 13, 2006
End result is a program that's scriptable in the way polyglot suggested so that it can be split across multiple machines but is also parallel at a lower level for SMP systems with a compiler that supports it.
posted by edd at 12:26 AM on December 13, 2006
This thread is closed to new comments.
posted by feloniousmonk at 7:22 AM on September 20, 2006