While this is going on, I'm growing older and older.
June 4, 2010 12:47 PM

Is it possible to distribute image conversion jobs across a cluster of machines? Either using Imagemagick/mogrify or some other method?

At my place of work we often have a need to convert large piles of images from one format to another. Usually this is TIFFs to JPEG2000, which we do right now using ImageMagick's mogrify command on a Linux machine. However, because we have so many TIFFs and we're just using one machine to do the mogrify, it's taking foooooor-ever. Like weeks. Is there a (reasonably) easy way to parallelize this sort of task? Either using mogrify or something else?

I've had people mention Hadoop and one person mention Parallel Python, but I've been unable to find specific, followable examples of image processing. Do you know anything that would help?
posted by the dief to Computers & Internet (13 answers total) 3 users marked this as a favorite
 
Do you need to do this continuously or in batches? If you do it in batches then it's dead simple:

1. Give each machine a number.
2. Generate a list of images to be converted.
3. Divide the list into a number of sections equal to the number of machines.
4. Each machine runs the conversion script you're already using on its section of the list.

So if you have 3 machines you'd number them 1, 2, and 3, then machine 1 would take the first third of the list, etc.
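For example, a rough sketch of the batch version, with made-up paths (/data/tiffs, filelist.txt) and the mogrify invocation you're already using:

    # Build one master list on the shared storage, then split it three ways.
    find /data/tiffs -name '*.tif' > filelist.txt
    lines=$(wc -l < filelist.txt)
    split -l $(( (lines + 2) / 3 )) filelist.txt filelist.part.   # -> .aa .ab .ac

    # On machine 1 (machines 2 and 3 use .ab and .ac):
    while read -r f; do
        mogrify -format jp2 "$f"
    done < filelist.part.aa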
posted by jedicus at 12:54 PM on June 4, 2010


Response by poster: Damn, see, I didn't even think about an NFS solution. That would probably be way easiest, huh?
posted by the dief at 12:56 PM on June 4, 2010


Nthing what everyone else has said: this task can be easily parallelized by splitting the large pile of files into subsets and using your existing method to convert each subset on a different machine. If you want to get really fancy, you can write a script for each machine that keeps track of which images have already been converted and dynamically selects unconverted ones to work on.
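A rough sketch of that dynamic version over a shared mount; the paths are invented, and mkdir stands in as a poor man's atomic lock so two machines don't grab the same file:

    for f in /shared/tiffs/*.tif; do
        # If the lock directory already exists, another machine claimed this file.
        mkdir "/shared/locks/$(basename "$f").d" 2>/dev/null || continue
        mogrify -format jp2 -path /shared/jp2 "$f"
    done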
posted by burnmp3s at 12:57 PM on June 4, 2010


NFS is not designed for this use case. Clusters where people expect a lot of simultaneous I/O against the same storage from many different machines use high-performance distributed file systems like Lustre. If you have hundreds of machines reading from and writing to one NFS mount at the same time, your file server will probably crash.

There's nothing wrong with a simple solution, but make sure you scale up gradually so you have an idea of how it will affect the performance of your whole system.
posted by grouse at 1:09 PM on June 4, 2010


Best answer: I see you already got a solution, but maybe this would be useful for someone else googling for an answer. If you do this all the time, you might want to look into the Sun Grid Engine. We use it to run multiple jobs in parallel across an entire cluster. It's smart enough to do the load balancing itself across the network. I think the version we are using is this open source version. I can't tell you how easy it is to install, but it is fairly easy to use.
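For a conversion run like this, an SGE array job is a natural fit. A rough, untested sketch, assuming a filelist.txt of TIFF paths like the one suggested above:

    # convert_one.sh -- array-job script: task N converts line N of filelist.txt.
    # Submit with:  qsub -t 1-$(wc -l < filelist.txt) convert_one.sh
    #$ -S /bin/bash
    #$ -cwd
    f=$(sed -n "${SGE_TASK_ID}p" filelist.txt)
    mogrify -format jp2 "$f"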
posted by bluefly at 1:15 PM on June 4, 2010


Presumably you'd reduce the number of transactions on the network server by having it create a tarball of TIFFs for each converting machine to pull down, and having each converting machine push back a tarball of the resulting JPEG2000s.
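Something along these lines, with invented file and host names:

    # On the server: bundle one chunk of the TIFF list per converting machine.
    tar czf chunk1.tar.gz -T filelist.part.aa

    # On converting machine 1: pull, convert, push the results back.
    scp server:chunks/chunk1.tar.gz . && tar xzf chunk1.tar.gz
    find . -name '*.tif' -exec mogrify -format jp2 {} +
    find . -name '*.jp2' | tar czf chunk1-jp2.tar.gz -T -
    scp chunk1-jp2.tar.gz server:results/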
posted by jedicus at 1:16 PM on June 4, 2010


Rocks Clusters is a turnkey SGE/Cluster installation. It's super simple to get running. Install Rocks on a pile of old PCs, turn them on, and bingo, instant cluster! If this is not something you want to commit resources for, you can install SGE separately.
posted by Geckwoistmeinauto at 1:23 PM on June 4, 2010 [1 favorite]


Best answer: My first thought was that this is the sort of use case bashreduce has in mind, but it seems designed to process one big file rather than a set of files.

NFS may work if the workload is mainly CPU-bound; you need to consider whether the current setup is short on disk speed or on CPU. Systems like Lustre are intended to function at supercomputer scales, where a single beefy file server can't keep up, and systems like MapReduce use such filesystems to distribute workloads to where the data is. Basically, the goal is to marry CPU and I/O so that when you scale up you get more of everything you need.

Also, you might look at benchmarking ImageMagick vs. GraphicsMagick. GraphicsMagick has a lot of performance-related features that might help you scale up even with just the existing system, and if you need to purchase hardware for this cluster, you'll get more done per computer.
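A quick, informal way to compare the two on one representative file (sample.tif is just a stand-in for one of your TIFFs):

    time convert sample.tif sample-im.jp2       # ImageMagick
    time gm convert sample.tif sample-gm.jp2    # GraphicsMagick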
posted by pwnguin at 1:38 PM on June 4, 2010


I've done the segmented-batching-over-nfs-share thing. Works fine.

Next time, I want to do it with simple web services. The server machine runs a very simple web server that exposes two functions: getUnprocessedFile(), which returns the path of a file on its NFS share, and markFileAsProcessed(file), which marks that file as processed. If a file isn't marked as processed within a timeout of being handed out, it goes back on the unprocessed list.

The worker machines then use a super simple web service client to get unprocessed files, process them, and then return the results somehow: either right away, by saving to the NFS share, or as a batch later.

This can be done with surprisingly simple Ruby scripting. A colleague has done something like this at work.
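The worker loop could even stay in shell with curl; a sketch against the hypothetical endpoints described above (server name and mount point invented):

    # Keep asking the server for work until it has nothing left to hand out.
    # Assumes the server returns a path relative to the shared mount.
    while f=$(curl -sf http://server:8080/getUnprocessedFile); do
        [ -n "$f" ] || break
        mogrify -format jp2 "/mnt/share/$f"
        curl -sf "http://server:8080/markFileAsProcessed?file=$f" > /dev/null
    done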
posted by krilli at 2:58 PM on June 4, 2010


A solution that will scale better than NFS for you might be GlusterFS. You could distribute your data across many machines and all of them would be able to see the entire dataset. Memail me if you want some help with setting it up.

(Disclosure: I'm one of the developers of GlusterFS).

Is self-linking ok in answers?
posted by Idle Curiosity at 4:40 PM on June 4, 2010 [2 favorites]


For this work, I would write a script to fire off some number of shell script jobs to Sun Grid Engine or another cluster scheduler that manages the nodes. Each node works on a conversion job, puts the end result on the NFS share, then SGE feeds it another job, and so on.

SGE has a lot of options. You can set up a job that waits for all the others to finish, for example, to do collation or whatever else you might want at the end. You also have control over the task list if you want to suspend the queue or delete jobs.
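A rough sketch of that pattern with invented script and job names:

    # Submit the conversions as a named array job, then hold a collation job
    # until every task in it has finished.
    qsub -N convert -t 1-2000 convert_one.sh
    qsub -hold_jid convert collate.sh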
posted by Blazecock Pileon at 5:48 PM on June 4, 2010


"Oh, and kick all of them off from a script calling a bunch of SSH sessions up at once."

SSH is probably a bad idea unless each session works on a batch of files in serial. If the asker has 2000 conversion tasks and two nodes, firing off 1000 SSH remote shell commands at once to each node will not work. Instead, fire off two jobs, each doing 1000 conversions in serial.
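For a couple of nodes, that can be as simple as the following sketch (hostnames and list paths invented):

    # One ssh session per node; each chews through its own pre-split chunk in serial.
    for n in node1 node2; do
        ssh "$n" 'while read -r f; do mogrify -format jp2 "$f"; done < /shared/filelist.'"$n" &
    done
    wait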
posted by Blazecock Pileon at 5:52 PM on June 4, 2010


Nthing Sun Grid Engine. It might be overkill if you just want to do this one task, though; it might be easier to hack together some shell scripts plus NFS for sharing the files between machines.
posted by kenliu at 8:50 PM on June 4, 2010

