I can bake a loaf of bread, but I don't know where to buy flour
January 22, 2010 10:48 AM

I am a graduate student who works with a lot of very computationally sophisticated people. I'm doing okay with the content of my work (which is focused on computational modeling only to a certain degree) and feel comfortable acquiring new high-level skills. But there are a lot of basic skills, vocabulary, and background knowledge that I don't have, and it's slowing me down! What do I need to know, and how do I learn it efficiently?

I didn't major in computer science, but took just enough CS classes to get me off the ground for some computationally basic out-of-class research using Python. At this point I feel comfortable using (for instance) Matlab, Python, a little bit of Scheme, and various domain-specific tools, although I have no doubt my coding practices would make real computer scientists cringe - I don't have a good sense of how to make this kind of work efficient or particularly well-designed. I work on a Windows machine, had a Cygwin Linux system in the past, and am currently setting up a VM of Ubuntu. An example of the kind of issue I'm talking about: when I took my very first computer programming class, we spent the first day learning about for loops and so on, but what I didn't know was where I ought to write my code (i.e. not in a Word document) and how to get my code to execute. I've got that one covered now :), but this is the type of 'background knowledge' I'm talking about. Actually, come to think of it, I write my code in either Emacs or the in-language text editors, and mostly end up running things via single commands in a Linux shell. Is this what the kids are doing these days?

Part of the problem is that I'm not sure what-all I need to learn. I've taught myself a lot of what I know, outside of a community of other computery people, so I have the sense there are a lot of practices and shortcuts that I'm simply not aware of. Often I'm aware of a gap in my knowledge only because I hear other students using terms that are only loosely meaningful to me. Some of these issues are prerequisites for getting a certain type of project started (which I've in the past muddled through very slowly, depending on line-by-line instructions or wasting a lot of time), and some are simply ways of making my work more efficient. So I guess my question is, as a person who does computational modelling, or other types of programming, what are the nuts-and-bolts things that you need to know to do your work well and quickly? What is the best way to go about acquiring this 'background knowledge' that I missed by not being immersed in an undergrad computer science program?

Examples of concepts/things I am very fuzzy on and want to know how to do:
- People in the lab run their programs 'on the cluster'. From step zero, how would I get a program I've written/run on my own computer to 'run on the cluster'?
- Better/more efficient use of Linux commands, particularly linking between different programs. I have a feeling I'm not automating things nearly as well as I could be.
- Remote access to servers and FTP in general is a big black box to me. I have done it in the past but have no idea how it works.
- IP addresses and access to the internet - what does it mean/why is it the case that my VMware will allow connection to the internet via an ethernet cable but not a wireless network?
- (relatedly) I have a piece of software on my Lenovo called 'Access Connections' which at the very least is the GUI that manages connecting to a network. How would I go about finding out if this software is interfering with the 'native' networking processes on my machine?

I know that the obvious answer is to ask my fellow students about all this, but I feel very stupid doing so (and often have no idea what question to ask), and don't want to be a bother about 'non-research questions' when I also work with them and need their help on actual research questions. There is no technical assistant/lab manager in the lab.

Anonymous because these are things I feel very silly not knowing after 4 years of working with/around computers.
posted by anonymous to Computers & Internet (6 answers total) 6 users marked this as a favorite
Some of these things I agree could be better learned on the side, but:

From step zero, how would I get a program I've written/run on my own computer to 'run on the cluster'

I doubt this is something most undergrads come in knowing how to do -- and I think most cluster setups are pretty idiosyncratic to the department anyways, so knowledge won't exactly transfer. So feel no shame in asking other grad students about this. In fact I wouldn't be surprised if there is some documentation in your department on a wiki or something.
posted by advil at 10:59 AM on January 22, 2010 [1 favorite]

The kids these days are using IDEs like Eclipse and considering their Emacs-using command-line-running elders to be outdated old farts. I'm an outdated old fart, myself, and have yet to see what they're doing that I'd want to and can't in Emacs.

"on the cluster" could mean a bunch of different things.

Everyone is always in a perpetual state of feeling they could be better at using the UNIX command-line tools and setting up pipelines. You should know at least one text munging scripting language -- sed & awk or Perl or Ruby; you should know find and xargs and grep; you should know man and look up the manpages of the commands you use. Even simple things like cp and ls have interesting command-line parameters. Consider getting O'Reilly's UNIX Power Tools.
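
To make the find/xargs/grep combination concrete, here is a rough Python sketch of what a pipeline along the lines of `find . -name '*.py' | xargs grep -n TODO` does: walk a directory tree, keep the files whose names match, and print the matching lines with their line numbers. The function name `grep_tree` and the `TODO` pattern are made up for illustration; the shell version is shorter, which is the point of learning the tools.

```python
import os
import re


def grep_tree(root, pattern, suffix=".py"):
    """Yield (path, line_number, line) for matching lines in files under root."""
    regex = re.compile(pattern)
    # os.walk plays the role of `find`: it visits every directory below root.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue  # like `-name '*.py'`: skip files that do not match
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                # This inner loop plays the role of `grep -n`.
                for number, line in enumerate(handle, start=1):
                    if regex.search(line):
                        yield path, number, line.rstrip("\n")


if __name__ == "__main__":
    for path, number, line in grep_tree(".", r"TODO"):
        print(f"{path}:{number}:{line}")
```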

FTP is an old insecure (password is passed over the net in plaintext) protocol. Forget it and move on to sftp, file transfer over SSH. ssh is well worth knowing -- you can read up on it at the openssh website. For your Linux VM, check out sshfs (mount remote filesystems over ssh such that they appear to be local and you can edit them with your locally running Emacs instance) and yafc (a file transfer client that supports sftp that's much nicer than the sftp client.) Things you'll want to familiarize yourself with: public key login (and ssh-add and ssh-agent) and ssh tunneling.
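
As a rough illustration of the three everyday uses of SSH (copy a file over, run a command remotely, open a tunnel), the sketch below only builds the command lines as argument lists and prints them; nothing is executed. The user `alice`, the host `cluster.example.edu`, the file names, and the port numbers are all made-up placeholders, and the helper function names are hypothetical.

```python
import shlex


def remote(user, host):
    """Format the user@host string that ssh and scp expect."""
    return f"{user}@{host}"


def upload_command(local_path, user, host, remote_path):
    """scp copies a local file to the remote machine over SSH."""
    return ["scp", local_path, f"{remote(user, host)}:{remote_path}"]


def run_command(user, host, remote_command):
    """ssh followed by a command runs that command on the remote machine."""
    return ["ssh", remote(user, host), remote_command]


def tunnel_command(user, host, local_port, remote_port):
    """-L forwards a port on this machine to a port on the remote host;
    -N tells ssh not to run a remote command, so it only holds the tunnel open."""
    return ["ssh", "-N", "-L", f"{local_port}:localhost:{remote_port}",
            remote(user, host)]


if __name__ == "__main__":
    # Hypothetical user, host, and paths, for illustration only.
    examples = [
        upload_command("model.py", "alice", "cluster.example.edu", "work/"),
        run_command("alice", "cluster.example.edu", "python work/model.py"),
        tunnel_command("alice", "cluster.example.edu", 8080, 80),
    ]
    for argv in examples:
        print(shlex.join(argv))
```

To actually run one of these, the list could be passed to `subprocess.run(argv, check=True)`, or the printed line typed into a shell.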

Access to the Internet is a very big subject. I'm guessing your VMware could be configured to access the Internet over the host box's wireless, but maybe it needs to be told what network interface to use (on Linux boxes, these tend to be called things like eth0 and wlan0.) But maybe it can't -- VMs are funky in the details.
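
To see which interface names a given machine actually has, a minimal sketch using Python's standard `socket.if_nameindex()` (available on Linux, and on Windows from Python 3.8) looks like this. The helper name `interface_names` is made up for illustration, and the names printed will differ from machine to machine.

```python
import socket


def interface_names():
    """Return the names of the network interfaces the operating system reports."""
    # if_nameindex() returns (index, name) pairs; only the names are kept here.
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    for name in interface_names():
        print(name)
```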

But, y'know what... asking people questions is really the best way. Judiciously dole out questions to your colleagues; follow up with questions on web forums and IRC channels.
posted by Zed at 11:15 AM on January 22, 2010

I agree with advil. What you seem to be describing is so far in the realm of applied computing, just reading up on it isn't going to help. Your colleagues' computational exercises are probably more ad hoc than you expect.

So learn this stuff in an applied way. You might have data that are so huge it would be worthwhile to crunch it on a cluster, but you won't know 'til you try crunching it on your laptop. Write your code. If you think your code is bad or simply more brute force than it needs to be, then you've got a specific question to ask a colleague. If they think running on the cluster is going to be the best way, that will involve transferring data to a remote machine, so you'll get up to speed on remote access. And adapting the code to run in parallel. You'll be learning and getting work done all at once.

Also, this blog has lots of keen examples of what you can do on the Linux command line.

posted by bendybendy at 11:41 AM on January 22, 2010

Like Zed said "on the cluster" could mean different things to different people / at different places, but in my department, we have a powerful computing cluster made up of a bunch of badass computers all wired together that lets us run resource intensive analyses. It's kind of a pain to deal with, because we don't have the permissions to modify much (if any) of the software that is stored on the cluster, but you can use the processing power of the cluster to run locally stored programs too. Someone in your lab/dept will probably be happy to help you (especially if it increases the overall output of your group).
posted by solipsophistocracy at 12:10 PM on January 22, 2010

anonymous: "People in the lab run their programs 'on the cluster'. From step zero, how would I get a program I've written/run on my own computer to 'run on the cluster'"

This is a 9 or 10 on the difficulty scale, depending on whether the cluster runs the same OS as your own computer. At this time there is no simple algorithm to convert a single-threaded program to a multithreaded one. Typically there are undergraduate and graduate level courses on such things. Loading it onto the cluster is easy if you just want server-sized RAM and disk speed; just ask your peers or the cluster admin for the documentation. If there are terms you don't understand, hit up Wikipedia and the book I reference below (Tanenbaum).

However, in the experience of my friends who administer the cluster, NONE OF YOUR COHORTS KNOW HOW EITHER. It is the curse of computing, that all the people who really need massive amounts of compute power to do research have no idea how to harness it.

If you wish to convert your programs, I suggest locating a course on cluster computing and having a chat with the instructor about whether or not you'd be successful and any remedial steps you could take.

The other items are far simpler in nature.

- UNIX pipes hook text processing programs together, but it might not cut it for computational science if your data isn't text.
- Remote access is something you do daily via the WWW. In fact, you can use a remote access program called telnet to connect to a website, if you're clever. We don't use telnet for remote access anymore for the same reasons we don't use FTP. Fortunately, SSH is used to replace BOTH FTP and telnet. So just learn SSH.
- The best reference for networking all around is Tanenbaum's "Computer Networks". Pretty much everything is covered to some degree, with lots of references if you need more depth. It covers network protocols like FTP and telnet/SSH, and countless more.
- I have no clue what you're attempting to do with VMware, but as far as I know the virtual machines in VMware don't emulate WiFi. It's not really needed since it's all fake hardware anyways.
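
The telnet-to-a-website point above can be sketched in Python: open a plain TCP connection, type out an HTTP request by hand, and read back whatever the server sends. So that it runs without touching the network, this sketch starts a tiny throwaway web server on the local machine first. The names `fetch_raw`, `HelloHandler`, and `demo` are invented for illustration.

```python
import http.server
import socket
import threading


def fetch_raw(host, port, path="/"):
    """Send a hand-written HTTP request over a plain TCP socket; return the reply."""
    # This is the same text one would type into a telnet session.
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """A throwaway local web server so the demo needs no Internet access."""

    def do_GET(self):
        body = b"hello from the server\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo():
    # Port 0 asks the operating system for any free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch_raw("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The printed reply shows the status line and headers that a browser normally hides, followed by the page body.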
posted by pwnguin at 4:32 PM on January 22, 2010

"People in the lab run their programs 'on the cluster'. From step zero, how would I get a program I've written/run on my own computer to 'run on the cluster'"

Running on the cluster means that the program is running simultaneously on multiple processors or computers. In the lab I used to be in, we used MPI to do this, which is fairly common, as far as I know. With MPI, there is no memory shared between instances of your program. Instead, you write code that communicates from one processor to the others. The code itself is actually strikingly simple; it's when you start realizing that everything is happening at the same time (concurrency) that it becomes complicated. If you wanted (or needed) to dive into some MPI, Using MPI is a fantastic and easy-to-understand book on the subject. It starts off with a delightfully simple example, so you might consider at least checking it out at the library.

As far as how to start the program on your cluster, that probably varies from cluster to cluster. At ours, you logged into a shell (SSH) account, and then typed mpirun -n 500 ./program_name, which says "run program_name on 500 processors".
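
The message-passing idea can be tried out on a single machine without MPI. The sketch below is not MPI: it is a stand-in that uses only Python's standard library, starting several separate Python processes that share no memory and talk to the parent only through messages (JSON over pipes). The function names, the sum-of-squares task, and the default of four workers are all made up for illustration.

```python
import json
import subprocess
import sys

# Code each worker process runs: read a chunk of numbers from stdin,
# write the sum of their squares to stdout. Workers share no memory.
WORKER_SOURCE = """
import json, sys
chunk = json.load(sys.stdin)
json.dump(sum(x * x for x in chunk), sys.stdout)
"""


def split(data, n_workers):
    """Deal the items out round-robin into n_workers chunks."""
    return [data[i::n_workers] for i in range(n_workers)]


def parallel_sum_of_squares(data, n_workers=4):
    # Start every worker first so that they all run at the same time.
    workers = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(n_workers)
    ]
    # Send each worker its own chunk as a message.
    for proc, chunk in zip(workers, split(data, n_workers)):
        json.dump(chunk, proc.stdin)
        proc.stdin.close()
    # Receive one partial result from each worker and combine them.
    partials = []
    for proc in workers:
        partials.append(json.loads(proc.stdout.read()))
        proc.stdout.close()
        proc.wait()
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

On a real cluster the launcher (mpirun in the example above) starts the copies of the program and the MPI library carries the messages between machines, but the shape is the same: split the work, send it out, collect the partial results.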

I know that the obvious answer is to ask my fellow students about all this, but I feel very stupid doing so.

In my experience, everyone in grad school basically feels like they don't know what they're doing. This means that when someone asks you a question that you know the answer to, it's like the best thing ever, because you get to feel like you know something. Furthermore, the people who ask the "stupid" questions early on will be able to answer more of them later, thus contributing to their feelings of awesomeness. I highly suggest you just ask.
posted by !Jim at 7:38 PM on January 23, 2010
