What to look for in a workhorse PC
August 28, 2018 11:41 AM

Our department is going to get a new desktop PC that we'll use to run simulation studies, to do some machine learning (latent Dirichlet allocation and random forests, primarily), and to solve some optimization problems. What specs should we consider?

I work in a small department of four psychometricians and one data analyst at a medical licensure/certification organization. With some frequency we undertake tasks that are computationally demanding, especially for our work-issued laptops. For instance, when we build exams, our laptops may run for some time solving that optimization problem. My research focuses on topic modeling, and fitting those models + cross validation can take hours, even when taking advantage of parallel processing. Finally, we run semi-frequent simulation studies that are especially resource intensive. We program a lot (R & SAS), but we are not programmers, and none of us are especially hardware savvy.

For the time being, cloud computing services are not an option. We have, however, gotten approval to get a dedicated desktop PC. Our IT department will acquire the PC, but we will have considerable input on the specifications.

What should we consider as we put our request together for IT?
posted by kortez to Computers & Internet (9 answers total) 4 users marked this as a favorite
Hi! IT guy at a university here. Couple questions: does your university have a high-performance computing cluster? HPC they call it sometimes. This is a service that some large R1 places have that is intended to support exactly these kinds of needs. If not, you can also look at Open Science Grid, that might also help you get there.

But to answer your question, there are some easy basics. First, it's well known that SAS writes to disk ALL THE TIME compared to other stat programs, so you're going to want a computer with fast (and ideally redundant) storage. Any SSD drives in a RAID array should solve that problem for you pretty well.

Second, RAM. Can't have too much RAM for these tasks. I'd shoot for 128 or 256 GB if you can get it, in the highest speed available.

Processor: yes, the quicker, the better. The number of cores only really makes a difference if your analysis is composed of problems that are parallelizable. I'm not strong enough in this area to tell you which is which, but any competent stats person should be able to let you know if your analysis requires good parallel processing capabilities. If it does, more cores is better. I mean, you're probably going to get a lot of cores anyway, but just FYI. Couple other things here: of course faster processors are better, but also if you are going for raw speed, the amount of on-chip cache memory makes a significant difference.
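
To put rough numbers on the cores question, here is a minimal Python sketch of Amdahl's law, the standard upper bound on speedup when only part of a job can run in parallel. The parallel fractions and core counts in the loop are made-up illustrations, not measurements of any real analysis.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a job can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part always takes the same time; only the rest is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: 50%, 90%, and 99% of the runtime parallelizable.
    for p in (0.5, 0.9, 0.99):
        row = ", ".join(
            f"{n} cores: {amdahl_speedup(p, n):.1f}x" for n in (4, 8, 16, 32)
        )
        print(f"{p:.0%} parallel -> {row}")
```

The takeaway is that a job that is only half parallelizable can never run more than twice as fast, however many cores you buy, which is why clock speed still matters for mostly serial work.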

Based on all of this, what you are looking at is what has traditionally been called a workstation. Saying "workstation" will help your IT folks find options for you that will meet your needs; workstation-class computers often come with the ability to have a LOT of RAM (I think 32/64 GB is upper end for a regular PC these days) and can also get you access to processors with more cores and larger on-chip caches.

One more note about parallel processing: if that is a big focus, look into getting a graphics card that can do parallel processing for you. They even sell graphics cards that don't have a graphics output, just because GPUs are so good at matrix math at speed. You do this stuff using CUDA or other special-purpose tools.

Okay, nerd signing off here. Have fun!
posted by rachelpapers at 12:03 PM on August 28, 2018 [6 favorites]

How big are your LDA models? Can you tell if they're hitting swap or are they small enough to fit in RAM? (This assumes they're trained all-at-once, not online.) For me (using MALLET), having enough RAM is key; I often have to specify running java with 32+ GB.
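
A back-of-the-envelope way to check whether a document-term matrix will fit in RAM is sketched below in Python. It assumes 8-byte values and, for the sparse case, a compressed sparse row (CSR) layout with 4-byte indices; real packages (in R or elsewhere) may store things differently and make temporary copies, so treat the result as a lower bound. The corpus sizes are hypothetical.

```python
GIB = 2**30  # bytes per gibibyte


def dense_matrix_gib(rows: int, cols: int, bytes_per_cell: int = 8) -> float:
    """Memory for a dense rows x cols matrix, every cell stored."""
    return rows * cols * bytes_per_cell / GIB


def sparse_csr_gib(rows: int, nonzeros: int,
                   value_bytes: int = 8, index_bytes: int = 4) -> float:
    """Memory for a CSR matrix: one value and one column index per nonzero,
    plus rows + 1 row pointers."""
    total = nonzeros * (value_bytes + index_bytes) + (rows + 1) * index_bytes
    return total / GIB


if __name__ == "__main__":
    # Hypothetical corpus: 100,000 documents, 50,000-term vocabulary,
    # about 200 distinct terms per document.
    docs, vocab, terms_per_doc = 100_000, 50_000, 200
    print(f"dense:  {dense_matrix_gib(docs, vocab):.2f} GiB")
    print(f"sparse: {sparse_csr_gib(docs, docs * terms_per_doc):.2f} GiB")
```

If the estimate lands anywhere near the machine's physical RAM, expect swapping or an out-of-memory failure once the model's own working copies are added on top.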

Large SSD for OS, apps, and swap space, large HD array (maybe RAIDed) for data storage. I'm unsure about the state of GPU ML in R, but getting a nice video card in there will help with deep learning.

If you're not optimizing for parallelization or using libraries that do it well, a faster fewer-core processor will serve you a lot better.

posted by supercres at 12:04 PM on August 28, 2018

Totally tangential thought: there's some initial startup headaches to Amazon AWS, but it's often more cost effective than owning the metal (blog on AWS+R, AWS+SAS).
posted by supercres at 12:08 PM on August 28, 2018 [1 favorite]

There's not really enough information in the question to answer this well. How big are your datasets? Will you store them permanently on this system or do they live in an institutional storage area? How fast is your interconnect to your data? Will you be doing ML with lots of images? What expertise is available to set up and administer your system? Will this system live under a desk in a department or can it live in a datacenter? What is your budget?

I agree with others about looking into institutional HPC services first. There might even be a free tier you can use.

If you are committed to buying your own hardware, I'd look into a small cluster (say 2 or more) of 1U rack-mounted systems, each in the ballpark of dual CPUs (so 32 cores total), 128GB RAM, and 4x8TB disks in RAID0+1. This will cost ~$4K per 1U system.
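
For anyone sizing the storage, the usable capacity of common RAID levels can be worked out with a few lines of Python. This is a rough sketch that assumes equal-size disks and ignores filesystem and controller overhead; the function name is made up for illustration.

```python
def raid_usable_tb(level: str, disks: int, disk_tb: float) -> float:
    """Approximate usable capacity in TB for common RAID levels."""
    name = level.replace(" ", "").upper()
    if name.startswith("RAID"):
        name = name[len("RAID"):]
    if name == "0":            # striping only, no redundancy
        minimum, usable = 2, disks * disk_tb
    elif name == "1":          # every disk holds a full copy
        minimum, usable = 2, disk_tb
    elif name == "5":          # one disk's worth of parity
        minimum, usable = 3, (disks - 1) * disk_tb
    elif name == "6":          # two disks' worth of parity
        minimum, usable = 4, (disks - 2) * disk_tb
    elif name in ("10", "1+0", "0+1"):  # striping plus mirroring
        if disks % 2:
            raise ValueError("mirrored stripes need an even number of disks")
        minimum, usable = 4, disks * disk_tb / 2
    else:
        raise ValueError(f"unsupported RAID level: {level!r}")
    if disks < minimum:
        raise ValueError(f"RAID {name} needs at least {minimum} disks")
    return usable


if __name__ == "__main__":
    for lvl in ("0", "5", "6", "0+1"):
        print(f"RAID {lvl}: {raid_usable_tb(lvl, 4, 8)} TB usable from 4 x 8 TB")
```

With the 4x8TB RAID0+1 layout suggested above, half of the raw 32 TB goes to mirroring, leaving 16 TB usable per system.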

(I buy and operate computer clusters for astronomical data processing, machine learning, etc.)
posted by ldenneau at 12:32 PM on August 28, 2018 [2 favorites]

Not an R expert, but I think Linux can use multiple cores with R, and Windows can't. (there might be intricacies to this that I don't understand) So whether you invest in more or fewer cores may be tied to which OS you're planning on using.
posted by condour75 at 12:33 PM on August 28, 2018

One thing you want to make sure of: Of all the gobs of RAM others recommend, make sure it is ECC RAM - that means error correcting. For most consumer level stuff it's not necessary, but for big science type applications it's a must. This will further constrain the possibilities because most processors do not support ECC memory.
posted by dbx at 2:10 PM on August 28, 2018

In 2018 you'd be nuts not to get a good NVIDIA GPU. (Has to be NVIDIA brand, AMD GPUs aren't yet widely supported for general purpose computation.) It's seriously magical: if you can get an algorithm to work on it, I have seen things go 50 to 1000 times faster.

Unfortunately I don't think LDA or random forests are all that well suited to the GPU. (One ought to be able to do LDA using variational inference, probably, but I haven't seen how?) There should be R libraries that will let you easily speed up matrix operations, at the very least. And there are high quality R bindings to TensorFlow if you want to get into deep learning, which everybody does right now.

My half-informed understanding is that AMD Ryzen CPUs are what you want right now, even more so if you can take advantage of parallelism. The only case where Intel beats them is single-threaded processes bound by CPU time of floating point SIMD operations (AMD doesn't yet support AVX512).

I am reliably told that most people should NOT buy the Threadripper 2990WX, because it's not as fast as you'd hope unless you have a program that can take advantage of its weird design. Other AMD models are good, though.
posted by vogon_poet at 3:14 PM on August 28, 2018 [1 favorite]

(It occurs to me that it might not be "nuts" not to buy the GPU, but I'd strongly consider it. It is worth the trouble, in my experience, if there's a GPU version of what I need to do.)
posted by vogon_poet at 3:19 PM on August 28, 2018

Thank you all for the feedback so far. This is incredibly helpful.

"Does your university have a high-performance computing cluster?" Unfortunately, an HPC solution won't work at this point. We're a smallish nonprofit, not a university, so we don't have those resources on site. And for the time being, a solution like AWS or Azure is not an option. Perhaps the best way to think about this is that we have the opportunity to get a pretty darn good upgrade over our Dell Inspirons. The workstation suggestion seems spot on. Thanks for all the detail; I feel like this provides a good starting point for our discussion with IT.

"How big are your LDA models? Can you tell if they're hitting swap or are they small enough to fit in RAM?" So far, they're small enough to fit in RAM. For LDA/topic modeling, I'm working exclusively in R on a PC, which means (as I understand things) if the models were too big for the available RAM, it'd crash.

ldenneau, your questions are helpful. The data sources (text documents) I'm using for topic modeling aren't all that large, nor are the byproduct data sets, such as document-term matrices. The datasets I'm using for the random forest models are considerably larger, at 6-7 GB, but far smaller than the datasets you're working with, I suspect. Datasets will live on this machine. Our IT department will set up and administer our system, and it'll most likely live under a desk in our office space. I don't know what the budget will be, but I'd guess under 4K.

"In 2018 you'd be nuts not to get a good NVIDIA GPU." Parallelization is important to us. I use the foreach and doParallel packages in R to distribute repeated code across cores (k-fold cross-validation is well suited to parallelization), and I know some of my colleagues are interested in the same. I'd thought the trick with GPUs was getting the data on and off them (not being a hardware person, I think it seemed like magic), but the feedback here has led me to the gpuR package. We'll look into this more.
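
For readers wondering why k-fold cross-validation spreads across cores so easily, here is a minimal Python sketch of the pattern. It is an analogue of the foreach/doParallel approach, not a translation of it: each fold is an independent job, so the folds can be handed to separate worker processes. The "model" is a placeholder that just predicts the training mean, and all function names are made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import fmean


def make_folds(n_items, k):
    """Split indices 0..n_items-1 into k interleaved folds."""
    return [list(range(i, n_items, k)) for i in range(k)]


def score_fold(job):
    """Fit on everything outside the fold, then score on the fold.

    Placeholder model: predict the mean of the training values and
    report mean squared error on the held-out values.
    """
    values, held_out = job
    held = set(held_out)
    train = [v for i, v in enumerate(values) if i not in held]
    prediction = fmean(train)
    return fmean([(values[i] - prediction) ** 2 for i in held_out])


def cross_validate(values, k=5, workers=1):
    """Return the k per-fold errors, computed serially or across processes."""
    if not 2 <= k <= len(values):
        raise ValueError("k must be between 2 and the number of values")
    jobs = [(values, fold) for fold in make_folds(len(values), k)]
    if workers == 1:
        return list(map(score_fold, jobs))
    # Each fold is independent, so the jobs can run on separate cores.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_fold, jobs))
```

Calling `cross_validate(data, k=10, workers=4)` would spread ten folds over four processes; in a script, make that call under an `if __name__ == "__main__":` guard so worker processes can start cleanly. With k folds, more than k workers buys nothing, which is one reason a very high core count only pays off when there are many independent jobs.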
posted by kortez at 7:59 PM on August 28, 2018
