How do I learn to manage a large unfamiliar computer system?
September 13, 2014 2:20 AM

I've been tasked with taking over the backup system for a fairly large tech company. The thing is, I've never managed anything in my life. It's not that I'm clueless about the technology, but I'm dealing with a mess of barely coherent documentation left by the previous manager and a couple of guys who have a vague idea of how the system works. I've never done any sort of project management before. I've always been working under someone, but now this project is going to be all me. So what do I do here? Where do I start? What tools should I be using? Is there a book or a MOOC that tells you how to do this? If I wanted to take a class, what class would I take?
posted by anonymous to Work & Money (8 answers total) 5 users marked this as a favorite

As a sysadmin and someone who tends to take over the backup systems everywhere I've worked, I would start by thoroughly investigating and documenting what's there, making sure that everything that needs to be backed up is, that it's being kept for the correct amount of time, and that the existing backups are restorable. Then I would make a plan to extend it to include anything that needs to be backed up and isn't, or potentially investigate replacing it if the current system isn't working properly or doing anything useful. What I would replace it with depends entirely on what platforms you're using, the volume of data, and the retention requirements.
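To make that first audit concrete, here is a minimal Python sketch that compares what should be backed up against what the backup catalog reports. The host names, retention figures, and dictionary layout are all invented for illustration; a real backup product will expose this information through its own reports, and you would load those instead.

```python
def audit_coverage(required, catalog):
    """Compare required retention (in days) per host against the catalog.

    required: dict mapping host -> retention days the business needs
    catalog:  dict mapping host -> retention days the backup system reports
    Returns three sorted lists: hosts with no backup at all, hosts kept for
    too short a time, and hosts being backed up that nobody asked for.
    """
    missing = sorted(h for h in required if h not in catalog)
    short_retention = sorted(
        h for h in required if h in catalog and catalog[h] < required[h]
    )
    unexpected = sorted(h for h in catalog if h not in required)
    return missing, short_retention, unexpected


# Hypothetical example data, not taken from any real system.
required = {"db01": 90, "web01": 30, "files01": 365}
catalog = {"db01": 30, "web01": 30, "oldbox": 30}

missing, short_retention, unexpected = audit_coverage(required, catalog)
print("not backed up:", missing)
print("retention too short:", short_retention)
print("backed up but not on the list:", unexpected)
```

The "unexpected" list is worth a look too: it tends to be either retired machines nobody removed or important machines nobody wrote down.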

It's not clear what your goal is here. In my experience, the usual case is that someone set something up a while ago, then no one paid much attention to it, half of it rotted away, and a bunch more stuff has been created since that hasn't been added.

I don't really understand why you would take a project management approach (unless the guys who know something about it need to do the actual work and you'll be organising and managing them?) and it's not clear to me why you would need classes or books. The pure IT approach would just be to figure out what it does and how it does it, then fix it.
posted by corvine at 5:52 AM on September 13, 2014 [3 favorites]

You start by documenting how the system works, then by building and documenting procedures to check it is working as designed, then by designing interval audits to make sure it's continuing to work as designed and that procedures are being followed. When you find a gap in any of those things, you address that.

Or what corvine said, really.
posted by DarlingBri at 5:55 AM on September 13, 2014

You remember those flowcharts, with squares and diamonds and circles and ins and outs? Create one of those for every process in the system, and add as much detail as you are able to provide. If the previous documentation is incoherent, then you're trying to make the coherent documentation. Document definitions of what everything means. Once you have this bible, then you can start worrying about what to do with it, asking questions like "what will break?" and "what needs regular maintenance?" and "department x wants the system to also do this, where do I start?"

Project management is about starting with a goal and ending with a thing, and managing every point in between; your documentation process could be considered a project, but the backup system itself is not a project, it's the thing that somebody else's project produced.

So, your issue is documentation, not management: management comes in when you start needing the system to do something, and you need the documentation done before you can reach that point.

If there are courses on the software itself, particularly training directly from the developer, that's probably most important; if configuration of the software or other customization is in a particular programming language, get a basic course or online tutorial for that language.
posted by AzraelBrown at 6:24 AM on September 13, 2014 [1 favorite]

It's hard to answer your question without having additional information. I'm uncertain about the nature of the project. What's the goal? Is it simply to document the existing system, or to replace it with another backup system, or to expand the scope of the backups, or to test the existing backups to make sure you can restore from them, or...? How many people are going to be reporting to you in this project? Are you committed to a particular project-management philosophy (such as lean, or agile)? Do you have a rough idea of the milestones?
posted by alex1965 at 7:41 AM on September 13, 2014

I've been in the position of inheriting responsibility for a backup system and needing to make it work. For bonus points, it turned out the ones I inherited weren't actually functioning correctly.

Your first step is to understand what's there and how it works.

If your system is built on something like Veritas, then this might not be so bad. Look for documentation of the software in question online, gently poke around in the settings of your admin panel without changing anything, and you should be able to figure it out pretty fast.

If it's a homebrew system, it's going to be a little more trouble. ...A lot more trouble. In order to work out how the system was working and how it was supposed to work, I spent a lot of time scouring the relevant systems to see what processes were running, finding every log and console I could get my hands on and reading it for clues about what was running and what was failing, analyzing all of the cron jobs to see what triggered and when, and looking at all of the users defined on the system to see what they were for and if they had the right permissions. (My problem eventually turned out to be that the backup system was trying to copy to a server across the local network, but the server in question had been moved to another network entirely.)
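For the cron part of that digging, something like the following Python sketch can turn a crontab dump into a list of (schedule, command) pairs you can paste into your documentation. It assumes the per-user crontab format (five time fields, then the command); system files such as /etc/crontab carry an extra user field and would need an adjustment. The script paths in the sample are made up.

```python
import re

# Lines such as MAILTO=root set environment variables, not jobs.
ENV_LINE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*=")


def parse_crontab(text):
    """Return a list of (schedule, command) tuples from per-user crontab text."""
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ENV_LINE.match(line):
            continue
        if line.startswith("@"):
            # Shorthand schedules such as @daily or @weekly.
            parts = line.split(None, 1)
            if len(parts) == 2:
                jobs.append((parts[0], parts[1]))
            continue
        # Five time fields, then everything else is the command.
        parts = line.split(None, 5)
        if len(parts) == 6:
            jobs.append((" ".join(parts[:5]), parts[5]))
    return jobs


# Hypothetical sample; on a real host you would feed in the output of `crontab -l`.
sample = """\
# nightly backup
MAILTO=root
0 2 * * * /usr/local/bin/backup.sh --full
@weekly /usr/local/bin/prune.sh
"""

jobs = parse_crontab(sample)
for schedule, command in jobs:
    print(f"{schedule:12} {command}")
```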

Your second step will be to document how everything works so nobody else has to go through that again.

Your third step will be identifying next-actions, and this is where the project management part comes into play. Is the system fit for service? Enough space? Enough safety? All the right things are being stored, and there's an offsite plan in the unhappy event of an earthquake or lightning strike? If everything is fine for now, then managing the system will entail making sure the backup runs when it should, and doing regular tests to make sure that backed-up data is actually usable. I've seen backups appear to succeed when they didn't really. Testing is key.
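One way to test that a restore is actually usable is to restore into a scratch directory and compare checksums against the source. Here is a minimal Python sketch of that idea. It reads each file fully into memory, so treat it as an illustration rather than something to point at terabytes, and remember that on live data some files will legitimately have changed since the backup ran. The directory and file names are invented.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(source, restored):
    """Return (missing, changed): files absent from the restore, files that differ."""
    src, dst = tree_digests(source), tree_digests(restored)
    missing = sorted(set(src) - set(dst))
    changed = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return missing, changed


# Tiny self-contained demo: a "restore" that silently dropped one file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "src")
    rst = Path(tmp, "restored")
    src.mkdir()
    rst.mkdir()
    (src / "a.txt").write_text("hello")
    (src / "b.txt").write_text("world")
    (rst / "a.txt").write_text("hello")  # b.txt never made it into the restore
    result = compare_restore(src, rst)

print("missing from restore:", result[0])
print("content differs:", result[1])
```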

If everything is not fine for now, or you can see a future where the demands on the system will outgrow its functionality, then identify what's wrong and figure out how to solve it. If your backup is taking 22 hours to complete and it runs every 24 hours, perhaps this is a sign that the system should be rethought completely. ...Ask me how I know...
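The arithmetic behind that 22-out-of-24 warning is simple enough to sketch in Python. The growth margin below rests on the assumption that backup time scales roughly linearly with data volume, which may not hold on your hardware, so treat it as a rough indicator only.

```python
def backup_window(duration_hours, interval_hours):
    """Return (utilization, headroom_hours, growth_margin) for a backup job.

    utilization:    fraction of the interval the backup occupies
    headroom_hours: idle time left before the next run starts
    growth_margin:  fractional data growth the window can absorb, ASSUMING
                    duration scales linearly with data volume
    """
    utilization = duration_hours / interval_hours
    headroom_hours = interval_hours - duration_hours
    growth_margin = interval_hours / duration_hours - 1
    return utilization, headroom_hours, growth_margin


utilization, headroom, margin = backup_window(22, 24)
print(f"window used: {utilization:.0%}")
print(f"headroom: {headroom} hours")
print(f"growth the window can absorb: {margin:.1%}")
```

With only two hours of slack, one slow night or a modest increase in data means the next run starts before the last one finishes.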
posted by Andrhia at 8:04 AM on September 13, 2014

Backup systems can be a good example of sunk-cost-fallacy thinking. They can be expensive and cryptic, and people get very invested in the idea that A) they need to get their money's worth out of it (almost, but not quite, hoping for a catastrophe from which it could rescue them, so it can justify its existence), and B) the backup system will protect them from a terrible happenstance, as they've been assured, so they don't like to be informed that it broke down at some point, because that would mean that they've been deluding themselves about their security from the threat of downtime. It's not easy to recognize that you've invested emotionally in a backup system, but letting people know that the disaster-prevention they thought was there is no longer around can really give that rug-yanked-out-from-under feeling.

It's an insurance policy, and people hate paying for insurance and never using it, and also can't stand not having insurance.

IT investigation is the way to go. Get original documentation for the software, and do what you can (as described by others above) to work out what your company's procedure is.

If it's not working at all, you should be prepared to demonstrate that. In that event, they'll want you to fix it, Fix It, FIX IT!-- maybe that's possible, maybe not. If it's not possible, you'll need evidence of that too, and an assessment of your actual current backup needs (which probably don't resemble the backup needs when the system went in) and some realistic-as-you-can idea of how it should scale in the future. If the current system can't scale up, might as well replace it now, or at least start budgeting immediately for the replacement.
posted by Sunburnt at 9:51 AM on September 13, 2014 [2 favorites]

Also, when you figure out the new system (by which I mean machines plus practices, which will inevitably be new in some respects if not all of them) you may have to declare war on "this isn't how we used to do it." When you put the word out about new procedures, remind people that the old procedure is superseded because it didn't work, it was broken, it was summoning eldritch gods, whatever you have to do to get people to know that the new procedure is the only procedure. You might even have to explain what was wrong, and how you fixed it.
posted by Sunburnt at 10:03 AM on September 13, 2014

I've been planning, building and managing backup systems for about 20 years. Here are some of my thoughts on where to get started:

You need to work out what you're working with first. I tend to see the infrastructure stack with applications at the top, servers and virtualisation under that, then storage and backup software under that, then storage and backup hardware at the bottom, with network running right up the side. You should start at the bottom of this list.

1. Applications - What applications are they backing up? Are they using application or database agents for the backup software, like the agent for Oracle RMAN that lets RMAN talk directly to the backup system?

2. What hosts (operating systems) are they backing up? Are they using backup agents on the hosts? Are the hosts doing anything "special" like triggering snapshots or replication? How is the backup data transferred to the storage hardware?

3. Virtualisation. Do they use ESX/vSphere or Hyper-V? How are the VMs backed up?

4. Backup hosts - How many? What type of hardware? How many network and/or SAN interfaces do they have, and probably most importantly, what backup software package are they using? Assuming it's a commercially supported product, yes, it's definitely worth getting your employer to send you on (at least one) 5-day training course. And start looking at the manuals. You won't ever read them from cover-to-cover (there are many, and they are huge), but start getting an overview of how the software works at a high level, at least.

5. Storage infrastructure. Do they back up to tape? Disk? Both? What kind of hardware? Read up on this hardware. Again, you don't have to read the whole thing, but get an overview. (You don't usually need to read up on tape until something goes wrong, which is rare on reasonably modern hardware.)

Keep in mind, I've worked with commercial products, not home-grown scripted environments. But if that's what you have here, the broad elements above would still need to be investigated where relevant.

Welcome and good luck!
posted by Diag at 3:56 AM on September 14, 2014

This thread is closed to new comments.

Ask MetaFilter
