Your victories and woes as a unix/linux sysadmin
July 21, 2008 4:36 PM
What were your greatest victories and punishing woes as a unix/linux system administrator?
I've been learning both Unix and Linux recently, out of a desire to run Solaris on a home server, and then the whole system administration thing hit me. So I wanted to throw this one out to the real-world sysadmins here: for those who are or were Unix or Linux system administrators, whether at a small company or a large enterprise, what were some of the unique challenges or obstacles you encountered on the job? It's not limited to technical challenges; it can be anything.
I haven't been a sysadmin for a company/enterprise, but I did admin work on several servers for a decently long time. The big challenge is always security vs. convenience. Users will ask you to do (or ask whether they can do) infinitely many things, many of which will go against whatever security policy you decide to follow. If you don't do it the first time someone asks, they'll probably get mad and make you come up with a solution for them. If you do grant something, and then someone else later asks for the same thing or the same privilege and you refuse, they'll get mad because you're being inconsistent. The tighter your policy is, the more people will bug you, and the more hassle you'll get from 'below'. The looser your policy is, the more blame you take from 'above' for the security risks it exposes.
On a technical level, dealing with software/library versions can suck if you're using something outside of your OS's package system.
Greatest victory: having a day where absolutely nothing has gone wrong, no one has complained to you, something complicated you wrote or set up works flawlessly (like a big new backup script), and someone may even compliment you on how easy you made something they always hated doing by hand.
What you should never ever ever forget: keep backups, and never put anything untested onto the production server.
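A bare-bones illustration of the backup half, with made-up paths: tar a directory into a dated archive every night. A real job would also verify the archive and copy it off the machine.

    import datetime
    import tarfile

    # Hypothetical paths: archive /var/www into a dated tarball under /backups.
    src = "/var/www"
    dest = f"/backups/www-{datetime.date.today().isoformat()}.tar.gz"

    # Compress the whole tree; a real setup would also rotate old archives
    # and ship the file somewhere else.
    with tarfile.open(dest, "w:gz") as tar:
        tar.add(src)
    print(f"wrote {dest}")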
posted by devilsbrigade at 4:52 PM on July 21, 2008
Oh, another thing: have backup plans. If there's a big security crisis with your main Linux server (some new big remote hole, for example), it's absolutely thrilling to be able to, in 5 minutes, shut it down and bring up a minimal BSD server as a replacement for critical services (email, DNS, maybe static web, etc.). Then you have all the time in the world to deal with the problem instead of freaking out, pissing people off, and getting yelled at for the next week about a problem that was absolutely out of your control.
posted by devilsbrigade at 4:56 PM on July 21, 2008
If you're gonna do this for a living, learn to write a handoff email and learn to take copious notes on what you've done. Also, *trust your instincts.* You might not think you have any, but you will, and when they speak up, you may well have to act on them in ways other people won't appreciate.
I was a professional systems administrator for a year or so until medical concerns ended my career. (I still work in a technical systems capacity these days, just in an industry that defines it differently.) My worst day ever on the old job involved coming in to "Upgrade machine X using SunSolve patches A through Z; Other Guy's already done most of it."
Other Guy left a note that said "I get patch now and later others can install, I don't have time."
I discover that he's downloaded patch A into some obscure location, hasn't even thought about patches B through Z, and patch D is a kernel patch. I get them all and put them in the right spot, log into machine X...
and find a giant banner warning me never, ever to upgrade machine X's kernel unless I have a senior operations guy standing over my shoulder. No one has said anything about this to me, ever. I have never seen this banner. I page an ops guy-- "have you seen this? I'm supposed to apply patch D tonight at 3am and this banner seems to tell me not to do it."
"Well, just do it."
"The banner says one of you guys needs to be in here if I'm going to do this. Do you know why?"
"Nope! Not coming in tonight, either. Probably you should just do it."
I sigh. I install everything I can up to that kernel patch and I look at it. No one in the building or on the phone can tell me what the hell this thing is, or why I should or shouldn't install it.
I call the account manager and cancel the downtime. "Sorry, but this machine appears to have other technical issues that preclude me doing this upgrade at this time. I'll have an ops engineer look at it at 9am local. I don't want to risk extended downtime for your customer."
Account manager is cool with it. I finish my shift and write a really long handoff explaining that only the ops guys should touch machine X, that it has every patch installed except D and the patches that depend on D, and that I'd appreciate better handoffs next time.
I come in the next evening to *absolute ridicule* from everyone else on the second shift-- including Other Guy of the ridiculous handoff-- about how I'm a pussy who won't do a kernel patch. (I was one of two women in my department, and my coworkers were fascinating people that way.) I get called in and raked over the coals by my manager. The senior ops guy decides to come in and knock this patch out in five minutes, since I'm so squeamish about getting my hands dirty. He also stops to express his disappointment with me, since I should've just pushed the patch and said the hell with it.
Seventy-two hours of machine X downtime later (and associated labor that we couldn't bill the client for, because it was our fault), someone finally remembers that machine X (and seven of its pals) were installed with the *wrong version of Solaris* for the hardware in question. (Each version of Solaris, back in the day, only worked on some subsets of Sun hardware, and different updates provided compatibility with other machine types.) Patching the kernel without correcting the base install would have hosed the machine.
No one ever implied that I was a pussy again, at least, but it was certainly a really obnoxious week. You need to know when to wave off, and when to stop and investigate before plowing ahead with the hard technical stuff. You need to leave good notes and solid documentation, with an eye to "will someone understand this later." You need to be able to tell your coworkers exactly what you did and why you did it. Your soft skills are just as important as knowing what slice 2 is and why you shouldn't mess with it.
posted by fairytale of los angeles at 5:30 PM on July 21, 2008
I don't do it professionally right now, but I've got a couple Linux servers out there I maintain.
Best day: a machine got compromised. It was actually great fun, because I absolutely love that type of stuff. They weren't able to do anything to the data, just run an IRC bot. I figured out where the bot was connecting to, killed it, copied the script to my home directory (so I could peruse how it worked without anyone being able to touch it), and, about 5 minutes after noticing it, had figured out how they got in (an out-of-date web script), taken that offline, and restored everything to normal. There's a certain thrill to stopping attacks as they're unfolding.
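If you want to do that kind of triage yourself, a rough sketch along these lines (it needs the third-party psutil package, and root to see other users' processes) lists every process with an established TCP connection, so a rogue IRC bot phoning home stands out:

    import psutil  # third-party: pip install psutil

    # Show each process that currently holds an established TCP connection
    # and where it points, so anything unexpected stands out at a glance.
    for conn in psutil.net_connections(kind="tcp"):
        if conn.status != psutil.CONN_ESTABLISHED or not conn.raddr or conn.pid is None:
            continue
        name = psutil.Process(conn.pid).name()
        print(f"pid={conn.pid:<6} {name:<20} -> {conn.raddr.ip}:{conn.raddr.port}")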
Worst? Pretty often, really, but some highlights (shadows?):
- I went to apply some updates back in the days of bare RPMs, before package managers resolved dependencies for you. I had some trouble finding a particular boot script I needed, but I eventually found a copy. It didn't occur to me at the time that using boot scripts from a different Linux distro is a very bad idea if you expect the machine to boot.
- The million times a day things stop working in stupid ways. I update Apache and it overwrites my configuration file with the default, and all my websites stop working. That type of thing.
- Pretty recently: I found an old hard drive of mine with lots of personal data, and decided I'd copy it over to my new system. Some errors popped up, so I ran fsck.ext3 on it. I passed it the "Do whatever you think is best" type arguments, but it still didn't work. Oh, that's because the drive was ReiserFS, and I'd just hosed it by letting fsck "fix" the "corrupt" superblocks. Add in some more thoughtlessness trying to fix it, and the filesystem was soon impossible to recover.
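The lesson: always confirm the filesystem type before pointing any fsck.* tool at a partition. A quick check might look something like this (the device name is made up):

    import subprocess

    # Hypothetical device; blkid reports the detected filesystem type
    # (ext3, reiserfs, xfs, ...) so you can pick the matching fsck tool.
    device = "/dev/sdb1"
    fstype = subprocess.run(
        ["blkid", "-o", "value", "-s", "TYPE", device],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    if fstype == "reiserfs":
        print("ReiserFS: use reiserfsck, not fsck.ext3")
    else:
        print(f"Detected {fstype}: use fsck.{fstype}")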
Two things that will get you a long way: monitor everything. I can spot things out of the ordinary by glancing at top on my server, because I view it all the time. And make backups. All the time.
Hope this helps, somehow?
posted by fogster at 6:36 PM on July 21, 2008
Recommendation: The Lone Sysadmin. (I'm biased because I know him, tee hee.)
posted by Madamina at 7:45 PM on July 21, 2008
+1 everything above (working Linux/BSD sysadmin here)
My experience, in short:
The good: writing clever Perl scripts as a workaround under pressure to fix things
The bad: debugging said scripts a week later
The ugly: seeing those scripts become production services
No, I'm not proud of myself. That taught me to document properly and not to consider one-liners for anything other than quick one-offs.
Also, fairytale of los angeles' story is unfortunately a fairly common one; I've run into, and had to deal with, a couple much like it. You end up with "legacy systems" that were quickly put together to solve a problem, then people forget and move on, and eventually you have to fix them. Know your linker and how to keep concurrent library versions around (it's hideous, but sometimes you can't escape it); look up LD_LIBRARY_PATH and friends.
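For example, a small wrapper like this (the paths and binary name are made up) points the dynamic linker at a private copy of an old library for a single process, without touching the system-wide copy:

    import os
    import subprocess

    # Prepend a directory holding the old library version so only this one
    # legacy binary picks it up; everything else keeps the system copy.
    env = os.environ.copy()
    old = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = "/opt/legacy-libs" + (f":{old}" if old else "")

    subprocess.run(["/usr/local/bin/legacy-app"], env=env, check=True)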
But, uh, back to work...
posted by phax at 1:45 AM on July 22, 2008
The best thing I did was hire and train my replacement, more or less from scratch (though he was already a pretty good Windows admin).
posted by chengjih at 7:43 AM on July 22, 2008
Oh, for people having bad days, I present The Daily WTF. One thing to keep in mind is that, unlike physicians and whatnot, at the end of your day as a sysadmin, no one dies.
posted by chengjih at 7:46 AM on July 22, 2008
There's some bitterness here, but the worst thing about being a sysadmin (a Windows admin in my case, not Linux) is walking out of a movie/concert/play/party, or rolling out of bed, to fix something that's broken. Repeatedly. And oh yeah, not getting paid for any of it.
posted by cnc at 1:21 PM on July 22, 2008
The best *and* the worst: knowing that if and when you do your job perfectly, most of the people you're supporting aren't going to have any idea of the magnitude of the feat you're pulling off. Corollary: knowing that pretty much the only time you're going to get major kudos or recognition is when you're cleaning up a mess.
posted by genehack at 6:14 PM on July 22, 2008
This thread is closed to new comments.
posted by rhizome at 4:44 PM on July 21, 2008