Advice/tools/theory for cluster scheduling?
October 12, 2017 5:59 PM

We use a computing cluster (running SLURM). Recently the cluster has been hitting capacity, leaving people unable to run even small jobs. I am looking for guidance about how to set fair usage limits, both practically and in theory (optimal allocation of resources given the randomness in needs, etc.). I am not the administrator, just trying to learn about the topic.
posted by molla to Computers & Internet (2 answers total)
I'm neither a SLURM person (first I've heard of it) nor an HPC person, but I do wrangle computers in return for rent money. As a follow-up, I would ask for more details on the nature of the contention: is the cluster always running at 100%, full out, round the clock, so nobody can get a toe in because high priority jobs always have dibs? That is a different issue than if, say, it's slammed at peak times like the start of the month or midterms, with too many jobs trying to run serially (one after another) so even tiny jobs wait weeks.

It looks like SLURM has a couple of options for optimizing the scheduling of jobs, such as backfilling (tucking low priority/short jobs into the windows between higher priority jobs) and gang scheduling. Gang scheduling looks like a leapfrogging-suspension sort of deal: it kicks off job A on resources X, lets it run a while, then suspends A and kicks off B on resources X, lets that run a while, then suspends B and resumes A, and so on until the first job completes. That way no job is completely blocked out even on a contentious cluster; each has a chance to make at least some slow progress rather than no progress. So there may be things that can be done even without setting quotas on resource usage (like User U can only use 25% of the cluster at any given time even if it's sitting idle otherwise), but you'll have to find the administrators to talk about what they have already set up before you can figure out a strategy.
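To make the leapfrogging idea concrete, here is a toy sketch in Python. It is not how SLURM implements gang scheduling; it only assumes that each job needs a fixed number of work units, that all jobs share one set of resources, and that the scheduler hands out fixed-length turns. The function name `gang_schedule` and the `timeslice` parameter are made up for illustration.

```python
from collections import deque


def gang_schedule(jobs, timeslice):
    """Toy time-slicing model (hypothetical, not SLURM's implementation).

    jobs: dict mapping job name -> work units still needed.
    timeslice: maximum units a job runs before it is suspended.
    Returns a list of (job name, start time, end time) slices.
    """
    queue = deque(jobs.items())
    clock = 0
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(timeslice, remaining)
        timeline.append((name, clock, clock + run))
        clock += run
        remaining -= run
        if remaining > 0:
            # Suspend the job and send it to the back of the line.
            queue.append((name, remaining))
    return timeline


# A long job A and a short job B sharing the same resources.
for name, start, end in gang_schedule({"A": 5, "B": 2}, timeslice=2):
    print(f"{name}: {start} -> {end}")
```

In this toy run, the short job B finishes at time 4 instead of waiting until time 5 for A to complete, at the cost of A finishing later than it would have if it had run alone.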

Probably the best thing to do is start out by looking at the statistics for current usage: % of jobs that get immediate service, % of jobs with wait times of 1-x minutes, % of jobs with wait times from x minutes up to expiry, % of jobs that expire without being serviced; top reasons for delay or expiry; average % resource utilization by jobs in the categories Needed Immediately, Needed Soon, Needed Whenever; peak and trough job submission times and how those relate to when the jobs are actually scheduled; and relative usage of different resources (CPU time, memory, etc.).

If you can back up "the cluster is busy" with "20% of our jobs never run to completion because the usage patterns show people bulk load up their jobs Tuesday mornings, but the jobs expire from the run queue and we become 50% idle on Saturday through the rest of the weekend" or "20% of jobs that are not completed need 10G of memory or less, because they are blocked by long-running jobs that need 64G+" or something similar, you'll be much more likely to be able to find ways to optimize. If you can tie value to those statistics (like, say, 5 failed jobs this quarter limited our ability to generate data to support grant proposals worth $XX), it will be easier to make a case to spend money to add capacity.

If you really do have to set resource limits, those stats will tell you what people are actually using now and therefore which changes may have an effect. And once you start grabbing those stats, set up monitoring for them so you can watch for patterns and make sure your changes are having the desired outcome. Open XDMoD looks like a decent starting point.
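As a rough illustration of the wait-time breakdown, here is a small Python sketch. It assumes you can get, for each job, a submit time and a start time in minutes (for example from the cluster's accounting records), with `None` as the start time for jobs that never ran. The record layout, the function name `wait_time_summary`, the bucket names, and the 10-minute cutoff are all assumptions for the example, not anything SLURM defines.

```python
def wait_time_summary(jobs, threshold_minutes=10):
    """Bucket jobs by how long they waited before starting.

    jobs: list of (submit_minute, start_minute) pairs; start_minute is
          None for jobs that expired without ever running (assumed format).
    Returns a dict of bucket name -> percentage of all jobs.
    """
    counts = {"immediate": 0, "short_wait": 0, "long_wait": 0, "never_ran": 0}
    for submit, start in jobs:
        if start is None:
            counts["never_ran"] += 1
        elif start - submit == 0:
            counts["immediate"] += 1
        elif start - submit <= threshold_minutes:
            counts["short_wait"] += 1
        else:
            counts["long_wait"] += 1
    total = len(jobs)
    if total == 0:
        return {bucket: 0.0 for bucket in counts}
    return {bucket: 100 * n / total for bucket, n in counts.items()}


# Made-up sample data: (submit minute, start minute or None).
sample = [(0, 0), (0, 5), (10, 100), (20, None)]
for bucket, pct in wait_time_summary(sample).items():
    print(f"{bucket}: {pct:.1f}%")
```

Numbers like these, tracked over a few weeks, are the kind of evidence that turns "the cluster is busy" into something the admins can act on.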

Since you're not the administrator, you may be limited here to actions like stopping by the admins and asking where their dashboards are so you can get an idea of how to schedule things more optimally for yourself, and launching from there into a discussion of how to improve your job success rate.

Incidentally, the order of things I'd try: first configuration and software optimizations, then beg for money to expand, then impose limits. If nothing else, people haaaate quotas, so for customer service purposes, it's good practice to say "we tracked these datapoints and tried X, Y, and Z to fix the contention, before we had to set up these limits, but we continue to monitor usage patterns to see if we can optimize further."
posted by sldownard at 3:45 AM on October 13

It sounds like you don't want a technical answer. That's good.

This sounds more like a process and people problem than a technical problem (think of a resource that everyone wants to use, where some are being left out and everyone thinks their need is most important).

If the answer does not include "buy more resources", it becomes "how to share this, fairly", and that's the process/people problem.

Should everyone get equal opportunity for the resource? What's considered a fair allocation of time? Do we allow line jumping? (If so, under what conditions?) Should whoever gets it next be in some priority order? (If so, how is priority determined?) Are a small number of people/groups/jobs taking the majority of the time?

These are questions a PM type person would pull together in a meeting, and you get a process out of that. The people running the system would stick to the process. Complaints go to the PM, i.e. when someone says "hey, can you slip my job to the start of the queue?" you say, Grumpy Cat style, "no."
posted by k5.user at 8:22 AM on October 13
