How can I improve my problem solving skills?
July 9, 2013 10:06 AM

I've almost finished my first year working as a system administrator. When I compare myself to my much more experienced colleagues, I think the biggest difference between us is how much better they are at problem solving and debugging. How can I get better at this aspect of my job?

I know that to a certain extent, this is to do with how much better they know our company's systems and their 6 - 10 years greater experience, and so the gap will naturally close over time, but I'm looking for concrete steps I can take to help speed things along. I like to do a good job, to keep learning new things and improving, and this seems like the next thing I need to develop.
posted by jonrob to Work & Money (18 answers total) 17 users marked this as a favorite
I'm a computer programmer/analyst. My job is basically problem solving. We recently had a co-op student working here who was positively terrible at problem solving and debugging and I had to spend basically the entire 2 months trying to teach him how to do this.

My suggestions are:
- before you work on solving the problem, make sure you fully UNDERSTAND the problem. Understand exactly what it SHOULD do, and understand exactly what it IS doing and WHY it is wrong. Seriously, this is the biggest thing.
- follow the system through its entire flow. If there is a problem with printing receipts, follow through the whole purchasing process from start to finish. Don't just look at the receipt, look at the whole process. This often will give you a hint at where the problem lies.
- check the really obvious/simple/small things first. Don't always assume it is some crazy convoluted bug that is causing your problem.
- look at your problem backwards, if you know what I mean. Start with the end result you're looking for and work backwards to the start of the process.
- When in doubt, look further back in the process. If there are a bunch of other things that had to happen before you get to your problem, look at all those previous things as well. For example, if the client's name is always being printed out last name/first name instead of first name/last name, take it alllll the way back and look at how their name was put into the system in the first place.
- know when to ask for help. Don't waste your time spinning your wheels on the same problem for weeks. If you've hit a total standstill you're allowed to ask for help, even if that help is just a quick discussion about the issue and some brainstorming.
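
The "follow the whole flow" advice can be sketched in code. This is only an illustration under made-up assumptions: the three stage functions below are hypothetical stand-ins for the name example (they are not from any real system), and the bug is planted in the first stage on purpose.

```python
from typing import Any, Callable, List, Tuple


def parse_input(raw: str) -> dict:
    # Hypothetical first stage. Bug planted for illustration: the form
    # sends "First Last", but the fields are read in the wrong order.
    last, first = raw.split(" ", 1)
    return {"first": first, "last": last}


def store_record(record: dict) -> dict:
    # Hypothetical stand-in for a database round trip.
    return dict(record)


def format_for_print(record: dict) -> str:
    # Hypothetical last stage: the only one the user ever sees.
    return f"{record['first']} {record['last']}"


def trace(stages: List[Callable[[Any], Any]], value: Any) -> List[Tuple[str, Any]]:
    """Run value through each stage, recording every intermediate result."""
    history = []
    for stage in stages:
        value = stage(value)
        history.append((stage.__name__, value))
    return history
```

Calling trace([parse_input, store_record, format_for_print], "Ada Lovelace") and reading the recorded values from the top shows the fields are already swapped after parse_input, so the print stage that shows the symptom is not where the fix belongs.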
posted by PuppetMcSockerson at 10:50 AM on July 9, 2013 [1 favorite]

I work in (technical) manufacturing and have led my fair share of root-cause analysis. The biggest thing by far is to: ask "obvious" intelligent questions, follow an intelligent line of reasoning, and do some intelligent investigating. Many people are very lazy and just want the answer spelled out for them, and it drives me nuts.

But common sense ain't so common, so learn about 8D problem solving methods (six sigma, FMEA etc). What is the problem? Define what isn't working. Then think. What could cause this problem to show up? That's your hypothesis. How can we test and investigate whether this is indeed the problem? That's your experiment. Gather the data. Then analyze: given my hypothesis and my data, is my hypothesis valid? Simple simple stuff but it's amazing how scattered so many professionals can be. They get ahead of themselves I think, or they're afraid to start.

Just keep a bright & open mind, and try to connect the dots and don't be afraid to ask questions in order to connect those dots.

And document your work so people can see how awesomely you connected those dots!
posted by St. Peepsburg at 10:55 AM on July 9, 2013 [2 favorites]

Change one thing at a time. For any change, figure out beforehand what you are doing, and what you think it's going to do. Then, look at what it *did* do and see what that tells you about what's actually going on.

Describe the problem carefully before you dive in and start changing things. Not "hey, the internet is broken", but "Okay, I can ping my default gateway by IP but I can't reach anything by name", or "I have link light and tcpdump shows some traffic but I can't reach anything by name and IP."
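
A rough sketch of turning "the internet is broken" into a precise description. Assumptions to note: the probes use a TCP connection rather than an ICMP ping (so a host that answers ping but has the chosen port closed will look unreachable), and the function and parameter names are made up for this example.

```python
import socket


def can_connect(ip: str, port: int = 53, timeout: float = 2.0) -> bool:
    """True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(hostname: str) -> bool:
    """True if the local resolver returns an address for hostname."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False


def describe(gateway_ok: bool, remote_ip_ok: bool, dns_ok: bool) -> str:
    """Turn three yes/no observations into a specific problem statement."""
    if not gateway_ok:
        return "Cannot reach the default gateway by IP; check link, cabling, local addressing."
    if not remote_ip_ok:
        return "Gateway reachable by IP but nothing beyond it; check routing or the upstream link."
    if not dns_ok:
        return "Can reach hosts by IP but cannot reach anything by name; check DNS."
    return "IP connectivity and name resolution both work; the problem is elsewhere."
```

Feed describe() the results of can_connect() against your own gateway and a known remote IP, plus can_resolve() for a name you expect to work. The point is the output sentence, which is the kind of description worth writing down before changing anything.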
posted by rmd1023 at 11:14 AM on July 9, 2013 [2 favorites]

You don't have the experience some of your colleagues do, and that makes them invaluable resources that you should use wisely. They will be more willing to spend time educating you if you make yourself helpful to them and save them some troubleshooting time. You might not know the systems well enough to locate the problem yourself every time, but you can get good at characterizing the problem, narrowing down the general part of the system you think it's coming from, documenting all that work, then asking the right colleague (the one who knows that particular area the best) for assistance once the preliminary work is already done.

And never ever trust other people's descriptions of the problem, especially if they're not fellow admins. Even techy people can be surprisingly unsophisticated when it comes to stuff that's not in their exact field of interest.
posted by contraption at 11:17 AM on July 9, 2013 [1 favorite]

My overarching strategy is to slowly reduce big problems into smaller, simpler problems.

To do this, it helps to start isolating variables. Think about each step of the broken process or each component of the broken system, and figure out what you can remove without affecting the problem. Once you've removed everything that's working, you should have a much better idea of where the true problem lies.
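
One way to picture "remove everything that's working" is the sketch below. It assumes you can switch components off and re-run a check; the still_fails callback is a hypothetical stand-in for whatever reproduces your problem.

```python
from typing import Callable, List, Sequence


def minimize(components: Sequence[str],
             still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Drop components one at a time, keeping a removal only if the failure persists."""
    remaining = list(components)
    if not still_fails(remaining):
        raise ValueError("cannot reproduce the failure with everything in place")
    for item in list(remaining):
        trial = [c for c in remaining if c != item]
        if still_fails(trial):
            remaining = trial  # item made no difference, so leave it out
    return remaining
```

What comes back is the set of components that each matter to the failure. It removes one thing per step on purpose, which is slower than bisecting but keeps every observation easy to interpret.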

Admittedly, a lot of this is developed through experience and intuition, but I find it extremely helpful to apply scientific principles to everyday problem solving. Form a hypothesis, determine a way to test your hypothesis, isolate and don't confound your variables, and don't fall into the correlation/causation trap. Also be aware that your "problem" may simply be a desired behavior of your system, or a symptom of a completely tangential issue.

It's helpful to check for obvious solutions first, and to recall prior experiences when you're solving problems. However, it's incredibly important not to get hung up on either of these two things. Past experience is great, but *so* many people in IT will simply assume that New Problem = Old Problem without taking any basic steps to confirm that this is indeed the case. Tons of IT problems exhibit very similar symptoms, and it's dangerous to trust your gut feelings when initially encountering a problem. Take the time to confirm that you're actually experiencing the problem that you're trying to fix.

Similarly, "obvious" potential causes of a problem are always a good place to start your investigation, but should not be implicitly assumed to be the root cause. Yesterday, one of the routers in my office started acting up, and generated a ton of weird connectivity issues for us. While this was being fixed, another one of our systems went down. While troubleshooting the issue, our engineer immediately (and correctly) mentioned our network issues to the vendor's support technician. Instead of verifying that the network issues were indeed the cause of (or even related to) the failure of this system, they ran in circles for an hour, being perplexed about why the system wasn't logging any network issues, without actually pausing to consider that there were no network issues, and that the system wasn't even connected to the broken router. Instead, the problem had an extremely obvious solution that we would have checked for and corrected in 5 minutes on any other day.
posted by schmod at 11:22 AM on July 9, 2013

St. Peepsburg has a great point. Knowing to ask the right questions is so, so, so key. (Corollary: Don't expect your users to have accurately reported the problem at hand)

During the aforementioned router snafu, we initially got a lot of reports saying "The internet is down."

In reality, we couldn't reach Google (and anybody else close to their backbone). Fortunately, we knew to ask the right questions that quickly isolated our problem as a router issue:
"Oh, you can't get to Google? Could you humor me and try [intranet website]? Oh, that works? How about Bing?"
posted by schmod at 11:28 AM on July 9, 2013

Don't make unwarranted assumptions. Ensure that you can justify each and every assumption you do make. Especially, as @contraption notes, do not take 3rd party analysis at face value. Much time is wasted looking in the wrong places because people simply "know" it cannot be due to X, because X doesn't do that. Eventually, out of desperation they decide to double check X and, lo and behold ...

Document everything you do, fully. When desperate, look for holes in what you've done so far.
posted by epo at 11:43 AM on July 9, 2013

If I have a thorny problem, I grab a whiteboard. Two columns. The first is "what do I know". In this column goes _verified_ facts about the issue. If you haven't run a command to test and verify it, it does not go in this column. This column is not for what you think. So you can write "cannot access site". You cannot write " is down" unless you've verified it's actually the service you want on that box and not a network issue.

Column 2 is "what don't I know". This would be "do I have network connectivity to", "what network devices are between me and", etc.

Design tests for column 2. Based on the results, you should be able to add something to column 1 and possibly some things to column 2. Repeat until you have an answer, or nothing is in column 2.

If you have nothing in column 2, one of your assumptions is wrong. Recheck that you've verified what you think you've verified in column 1.
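
For anyone who would rather keep the two columns in a script than on a whiteboard, here is a minimal sketch. The class and method names are invented for this example, and each test is assumed to be a function that returns the verified fact as a string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Whiteboard:
    # Column 1: verified facts only, never guesses.
    known: List[str] = field(default_factory=list)
    # Column 2: open question -> the test that will answer it.
    unknown: Dict[str, Callable[[], str]] = field(default_factory=dict)

    def ask(self, question: str, test: Callable[[], str]) -> None:
        """Add an open question to column 2 along with a way to test it."""
        self.unknown[question] = test

    def run_next(self) -> str:
        """Run the test for the oldest open question and move the result to column 1."""
        if not self.unknown:
            raise RuntimeError("column 2 is empty: recheck the facts in column 1")
        question = next(iter(self.unknown))
        test = self.unknown.pop(question)
        fact = test()
        self.known.append(fact)
        return fact
```

The error raised on an empty column 2 mirrors the rule above: if there is nothing left to test and no answer yet, something in column 1 was never really verified.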

This approach also has the benefit of breaking things down into manageable chunks. If you're testing an item in column 2 of "is kerberos negotiating properly", you can research how to test that, then test and document. You'll pick up a whole lot of troubleshooting wizardry this way as you learn the ins and outs of protocols.

Really, I'm reiterating a lot of what's already said. This formalized approach just keeps me honest on making sure I know what I think I know when dealing with tough problems. My usual approach is to switch to the formal mode after 2-3 hours of relying on my body of knowledge in troubleshooting and checking the more common causes for an issue.

The formal approach also works _really_ well. I've been able to troubleshoot issues as far ranging as layer 2 network problems and software bugs in .NET applications. The former is near to being in my wheelhouse, the latter notsomuch, but it's a testament to the process.
posted by bfranklin at 11:50 AM on July 9, 2013 [1 favorite]

If there's one thing I've learned in debugging technical issues it's this: READ WHAT'S ON THE SCREEN. Learn where the services and apps you're using dump their logs. Make sure you know how to make them more verbose when you need to. And then READ THEM. I can't count the number of times people have complained to me that something's not working when the system is sitting right there telling them exactly what the problem is. (Or, to be fair, the number of times I've done the same thing.)
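
As a small aid to actually reading the logs, something like the sketch below can pull out the lines most worth reading first. The severity keywords are an assumption; real services each have their own log formats, so adjust the pattern to what yours actually write.

```python
import re
from typing import Iterable, List

# Assumed severity keywords; not a standard, just common conventions.
PROBLEM = re.compile(r"\b(ERROR|CRITICAL|FATAL|WARN(?:ING)?)\b", re.IGNORECASE)


def problem_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention an error or warning, in original order."""
    return [line.rstrip("\n") for line in lines if PROBLEM.search(line)]
```

Since an open file is an iterable of lines, you can pass one straight in, e.g. with open(path) as f: problem_lines(f), where path is whichever log file your service writes to. A filter like this is a starting point, not a substitute for reading the surrounding lines.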

You should also at the very least read the Wikipedia article about George Pólya's How To Solve It.
posted by asterix at 12:37 PM on July 9, 2013 [2 favorites]

Great advice above about carefully defining the problem in a detailed and specific way (make sure you can replicate a bug before you ever start trying to fix it!), and only changing one thing at a time when testing your guesses about what the cause might be.

When I was a new programmer and got stuck on a bug, sometimes my boss/mentor would have me tell him about the issue. He would rarely even say anything, it was just the act of explaining it out loud to someone else that would make me realize what I had been missing. Before long I realized I could just do this by myself, without involving another person. Typing it rather than just talking in my head was helpful when I was teaching myself to think this way. Key phrases to watch for in your explanation are "I assume Y" (better check your assumption), and "I know it can't be X" (Do you really know that? If you've checked everything else already besides X, maybe it is X causing the problem).

For problem solving that isn't debugging, sometimes my brain gets caught brainstorming in a too-narrow vision of what the solution needs to be. If I can try to pare down my idea of the solution to the actual necessities, I get a lot further. As a dumb example, the other day I wanted to clean hair out of the shower drain, which I've always done with an ancient surgical hemostat (think long, skinny scissors but with grippy bits where the blades would be). Except I couldn't find my hemostat. My brain spent 20 minutes trying to figure out what other gripping things were around the house (Kitchen tongs? Too big. First aid tweezers? Not long enough. Could I whittle grooves into some chopsticks and attach a hinge... Too ridiculous.) until I finally realized that I didn't need a grippy thing. I just needed a long, pulling thing, like a coat hanger with a small loop curled on the end. If you try to imagine a solution that mimics existing solutions, you might be limiting yourself too much. Do your best to figure out what the real, non-negotiable requirements are, and aim for those.
posted by vytae at 1:09 PM on July 9, 2013

Another example of real-world problem solving: make sure you state the problem clearly.

One day I was baking a bread pudding, and I know my friend does not like raisins, so I only put the raisins in one end of the raw pudding, then brought it over to the friend's house to bake it. I wanted to mark which end was non-raisin, so I asked my friend if she had a toothpick that I could stick in the top. She spent several minutes looking unsuccessfully for a toothpick before she asked me: why do you need a toothpick? I need to mark which end of the pan has no raisins. Oh - how about we use a drop of food coloring on top? Problem solved.

I was so focused on my chosen solution that I didn't take the time to fully describe the problem. Always ask why the user wants what they asked for.
posted by CathyG at 2:10 PM on July 9, 2013 [1 favorite]

Good answers above. I have just one tidbit to add: if you do fix a problem, but you don't know why, you still have work to do. Do not assume things are OK just because the problem went away. It's quite possible the "fix" broke something. The likelihood of this happening goes up the more obvious or easier the "fix" seemed to be.
posted by forforf at 3:41 PM on July 9, 2013 [3 favorites]

Oh yeah, forforf is right on about that. I think one of the marks of a seasoned troubleshooter is that they get discouraged rather than happy when the problem they were working on magically evaporates and can't be made to recur. That doesn't mean you fixed it, that just means that not only are the conditions that caused it still lurking in there somewhere, but you spent all that time and didn't even manage to get it fully characterized before you lost it, since clearly there is some variable or moving part your model wasn't even accounting for.
posted by contraption at 3:50 PM on July 9, 2013

In addition to all the great answers above about how to improve your problem solving approach, I think that introspection is key to improving here.

When you squash a bug, ask yourself how you could've changed your process to find and fix it more quickly. Were you led down a dead end by an unfounded assumption? Did you focus too closely on the problem without backing up to consider what else in the system could have caused it? When collaborating with your more seasoned colleagues, try to figure out what questions they ask that you wouldn't have.

(And, my personal version of rubber duck debugging is to actually start writing a question for stackoverflow/superuser/serverfault/etc. In the process of trying to cover all my bases and provide enough information for somebody else to help me, I often solve it myself.)
posted by Metasyntactic at 4:52 PM on July 9, 2013

I've nothing to add except that the phenomenon @vytae is describing is also known as rubber duck debugging.
posted by askmehow at 4:54 PM on July 9, 2013 [1 favorite]

I've been doing the sysadmin thing for about eight years now. Before that I worked as a programmer, mostly in embedded systems. Debugging was always my favourite part of that, and sysadmin is joyous for me because there is an endless supply of bugs to identify and workarounds to apply.

Everybody here is offering useful advice. The one that resonates most strongly with me is rmd1023's: change one thing at a time and make sure what happens is what you thought would happen. If not, you don't yet understand how the pieces relate to each other, so revert that change - and, critically, double-check that the original behaviour is restored - before changing anything else.
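
A minimal sketch of that discipline, assuming you can express the change, its undo, and the observation as three small functions. All of the names are made up for illustration.

```python
from typing import Any, Callable


def try_change(apply: Callable[[], None],
               revert: Callable[[], None],
               observe: Callable[[], Any],
               expected: Any) -> bool:
    """Make one change, compare the result with the prediction, and undo it on a mismatch."""
    baseline = observe()          # record the original behaviour first
    apply()                       # exactly one change
    if observe() == expected:
        return True               # the prediction held, keep the change
    revert()
    if observe() != baseline:     # critically, confirm the undo really worked
        raise RuntimeError("revert did not restore the original behaviour")
    return False
```

Writing down expected before calling apply is the important part. A False result means your model of the system is wrong somewhere, which is useful information in itself.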

It's very, very easy to fall into the superstition trap when you're debugging, so make sure you can reliably trigger whatever misbehaviour you're tracking down. If you can't, at least work out how frequently you expect to see it, so that when you think you've fixed it you know how much testing you need to do to prove that.
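
To put a number on "how much testing you need": if the misbehaviour used to show up in a fraction p of attempts, the chance of seeing n clean attempts in a row by pure luck is (1 - p)^n. The sketch below solves that for n. It assumes attempts are independent and that p was estimated reasonably, neither of which is guaranteed for real intermittent faults.

```python
import math


def clean_runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Clean runs needed before a streak of successes is unlikely to be luck."""
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Smallest n such that (1 - failure_rate) ** n <= 1 - confidence.
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))
```

For a problem that used to appear one time in ten, clean_runs_needed(0.1) gives 29, so three or four clean runs prove very little.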

One thing that some people have barely touched on that I think bears more emphasis: it's at least as important to learn your users as to learn your systems. Learn to value users who come to you with vague and downright misleading problem descriptions, because those people are the canaries in your coalmine: their technical cluelessness will lead them to find bizarre ways to interact with your systems that you would never have dreamed of yourself, which means that they are the ones who will dig up all your weird-ass corner cases.

Example: I recently pushed out a campus-wide update from Firefox 10 ESR to Firefox 17 ESR. Lots of annoying little things had changed, as usual; as usual, I had done comprehensive testing, found appropriate tweaks to my deployment script, and was quite confident that nobody would even notice the update other than by the disappearance of irritating web site warnings about outdated browsers.

Very few people did. But three days later I got a mail from my favourite canary, explaining that nobody in the school has been able to access the Internet for three days now regardless of whether in his classroom or the computer lab and please fix it! now! because I am spending my whole time writing problem reports to you instead of attending to this line of kids needing my help and the only one we can use is the laptop in the back of my room.

After a quick check that web browsing was working just fine for me, I knew from past experience that I'd need to watch what he was trying to do, so I sat down with him in the classroom and had him demonstrate the issue. Task at hand: find some images to paste into a presentation. His method: Open Firefox. Type "google images" into the search box in the top right corner and click the Search icon. The result looked like this.

Turns out that the only way this man ever uses the Web starts with searching for a site by name and clicking the top Google result, so if the search box fails, that's him completely cut off. And of course he's taught all his students the same horrid method, so that's them completely cut off too. Also, a scary looking message about proxy servers means "only the technician can fix this, so don't you even try", so that's everybody in the school cut off as far as he knows.

In fact there was a real issue there, which everybody else had just been working around by not using the search box and/or picking Yahoo or Bing instead of Google. But were it not for my canary, I would probably have taken much longer to notice and fix it.
posted by flabdablet at 9:03 PM on July 9, 2013

Be systematic.

Make a list of things you need to do in order to e.g. eliminate a possible cause for the problem, and then tick them off as you go through them.

Eliminating possible causes is often more useful than zoning in on what you think the problem is and spending all your efforts trying to confirm that.

Learn about software testing techniques.
posted by rjs at 11:01 PM on July 9, 2013

Get involved in more problems - even when a more senior admin is taking point, get in there and take part, instead of just observing. Poke around in logs and ask the more experienced troubleshooters to explain how they identified an issue, then go and see it for yourself. If they'll let you (which they should), pair with seniors during crises. Go on the on-call roster if you're not already; working on critical issues after hours really hones your skills (but make sure you identify an escalation point for things you can't handle alone!). Practice makes perfect.

Also, if you don't already, have your whole team work to build a compendium of actual issues you've seen and how they were investigated, mitigated and solved. Make sure it's searchable... :)
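
"Searchable" doesn't have to mean fancy. As a very rough sketch, assuming each write-up has a title, symptoms and a resolution (the field names are invented here, and most teams would use a wiki or ticket system instead):

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Incident:
    title: str
    symptoms: str
    resolution: str


def search(incidents: Iterable[Incident], query: str) -> List[Incident]:
    """Return the incidents whose text contains every word of the query."""
    terms = query.lower().split()
    hits = []
    for incident in incidents:
        text = f"{incident.title} {incident.symptoms} {incident.resolution}".lower()
        if all(term in text for term in terms):
            hits.append(incident)
    return hits
```

The habit matters more than the tool: write the symptoms in the words a confused person would actually type, because that is what you'll be searching for at 3 AM.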

(Senior sys admin / devops person here. I started out just like you and am now one of the best two or three problem solvers/troubleshooters in my org.)
posted by snap, crackle and pop at 9:08 PM on July 11, 2013
