Are DB/IT groups really infallible?
August 21, 2005 10:59 AM

I work at a small software company. I'm a developer, so my cohorts and I are responsible for our own software bugs and it's pretty obvious when we screw something up. My policy is to fess up, fix it, and move on. But another group in the company, responsible for all database administration and all networking/internet issues, has never made a mistake. Is that really possible?

When something happens to a database, the DB/IT group has always found an Oracle bug, a corrupted index, a corrupted config file (?), or the like. When something fails to get backed up and the backup is needed, either the client didn't set up the backup correctly or, if the group did the backup themselves, they find a bug in the backup software that prevented it. Our ISP is blamed for all internet problems. And you wouldn't believe how many routers and disk arrays go bad and have to be replaced. Needless to say, any bugs they find in our own software are duly reported as yet another burden they have to endure.

Does this match everyone else's experience with such DB/IT groups? We developers are just amazed at how everyone else in the company buys these stories, and we collectively wonder: could DB/IT possibly be telling the truth?
posted by anonymous to Technology (13 answers total)
 
Nah, they're lying through their teeth.

Is there anything you can do about it? Nope. Get on with things.
posted by SpecialK at 11:31 AM on August 21, 2005


Odds? Lying about 85 percent of all that stuff. I'm a DB/IT person myself, and I've pulled some of that nonsense in the past. You'd need a sting operation to really nail them.
posted by zerolives at 11:36 AM on August 21, 2005


Obviously you don't feel that your IT staff is taking responsibility for their actions. What do you want to do about it?

If you want commiseration, adding a GMail address to anonymous Ask MeFi questions is starting to be in vogue.

If you want your organization to take a closer look at IT, then you will need to engage in office politics to get your argument in front of people who can make a change. I can't tell you how to go about it, but I will explain what my argument would be in your situation.

I would start by documenting each place where I felt that IT had passed the buck. I would then look for any recurring problems, such as internet connectivity, routers, and disk arrays. Then I would raise the question of why procedures weren't changed to detect problems before they resulted in service outages.

The most obvious shortcoming in your description is that IT claims a user error or software bug prevented backups from being restored. After the first problem restoring data, why weren't the backups being regularly tested? Why didn't the IT staff randomly choose a backup every backup period and perform a test restore to verify the data? That procedure is the only way to determine whether backups are actually working. Without it you have a bunch of media that are worse than useless: they give you a false sense of security.
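To make that concrete, here's a rough sketch (in Python) of the kind of restore test I mean. The backup directory and the tar-based restore step are placeholders for whatever your shop actually uses; the point is just to pick a backup at random, restore it somewhere disposable, and prove the files read back cleanly.

import hashlib, os, random, subprocess, tempfile

BACKUP_DIR = "/backups/nightly"   # placeholder: wherever your backup sets land

def checksum(path):
    # read the whole file back; a failing read or truncated file shows up here
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def test_restore():
    backups = [os.path.join(BACKUP_DIR, name) for name in os.listdir(BACKUP_DIR)]
    candidate = random.choice(backups)               # pick one backup at random
    with tempfile.TemporaryDirectory() as scratch:
        # placeholder restore step: substitute your real backup tool here
        subprocess.run(["tar", "-xzf", candidate, "-C", scratch], check=True)
        restored = [os.path.join(root, name)
                    for root, _, files in os.walk(scratch) for name in files]
        if not restored:
            raise RuntimeError("restore produced no files from " + candidate)
        for path in random.sample(restored, min(5, len(restored))):
            checksum(path)                           # spot-check a few files
    print("restore test passed for", candidate)

if __name__ == "__main__":
    test_restore()

Run something like that from cron once a backup period and have it page someone on failure; the details matter far less than the fact that somebody restores something on a schedule.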

Similar arguments could be made about any other recurring problems and the changes that could address them: multiple ISPs, periodic disk-array testing (and I believe RAID-5 should allow a single disk to fail without losing data), etc. The question isn't whether IT failed to prevent problems in the first place; it's whether they learned from the problems and took steps to prevent them from happening again.

If you have any sysadmin friends outside your organization, you may want to ask them how specific problems could have been prevented.
posted by revgeorge at 11:46 AM on August 21, 2005


I have set up large, high-volume Oracle installations, and while Oracle has bugs (lots of them), I have never seen a config file "get corrupted". I mean come on, it's just sitting there on disk to be loaded when the DB starts up. If that file gets "corrupted" it was because somebody was fucking with it.

As for corrupted indexes, it's possible (but I've never seen it); even so, it is the DBA's responsibility to be aware of bugs that can cause data to be lost and work around them.

When I was acting sysadmin/DBA for various small firms I worked for, I made sure everything worked. If something I was responsible for broke, my fault or not, I took it personally.

Your DB/IT dept should be asked why their tech is so fragile. Why aren't they preflighting and verifying the backup? Why don't they know about bugs causing the "corrupted indexes"? How are config files getting "corrupted"? Why don't they switch to a better ISP? Why aren't they making things better? If any of this is too hard, what are they there for?

Your DB/IT group sounds expert at CYA, that's for sure.
posted by ldenneau at 11:53 AM on August 21, 2005


They must be Gods.
posted by caddis at 12:02 PM on August 21, 2005


The DB/IT group's job is to keep systems up 100% of the time. If they're not doing that, they're doing something wrong. It sounds like they're doing something wrong quite frequently. Whether it's their direct fault or not, they're not performing their job correctly.

W.r.t. Oracle, it can be complex enough that an inexperienced or unknowledgeable person can seriously screw up an installation. If they keep running into problems, maybe they should find someone certified?

Next time they try to pass the buck, ask them to explain how they will prevent this problem in the future, and see if you can hold them to that.
posted by devilsbrigade at 12:24 PM on August 21, 2005


Sysadmins are expected to be perfect.

Imagine how hard it would be to write code with no bugs, first time.

That's what sysadmins are expected to do on a regular basis. With root permissions. On a live system. Under time pressure. With an angry, impatient, obnoxious user (who blames you for the problem you're solving) watching over your shoulder, as often as not.

And whenever the hardware fails, you get the blame.

When the ISP goes down, it's your fault.

User forgot to back up his data? Your fault.

When the OS crashes or the DB corrupts a table (which does inevitably happen from time to time, particularly in a software house with developers playing around with pre-alpha-level code) all the users blame you. Often, they all telephone you and email you simultaneously, all 50 of them, helpful as anything, to tell you about it.

And in all these cases, you have to find a way to cover your ass fast before it gets fired - at the same time as fixing the problem.

Buy them a beer sometime, and be grateful.

(I left that line of work some years ago. Far too stressful.)
posted by cleardawn at 1:08 PM on August 21, 2005


I used to work as a sysadmin, and left the job to become an actuary after a lot of soul-searching.

Systems administration (and administration of any kind, whether in IT, accounting, etc.) is a tough job, morale-wise, compared to product development. Sysadmins keep things running, i.e. maintain the status quo. You notice us most when things go wrong, but barely at all when we provide you with five nines of uptime. We don't have big achievements or milestones to call our own, especially when implementation projects often get handled by a different group as the specialization/outsourcing trend continues. This has a tremendous effect on our prestige and our morale.

In the long run, being the bearer of bad news is really draining, especially if the reaction to the subsequent announcement of a solution is usually "What took you so long?" rather than "Good job!". Inevitably, the result is that we stop apologizing for mistakes in order not to go insane from the guilt that goes with an apology, and we explain things away to deflect the blame. While this may frustrate you, it is an essential coping mechanism, akin to paramedics having to numb themselves to seeing death.

So take a chill pill, and realize that in some ways you have it better than they do. Thank god I'm not in that line of work anymore.
posted by randomstriker at 1:25 PM on August 21, 2005


Wow, cleardawn, talk about synchronicity. Come 'ere, gimme a hug!
posted by randomstriker at 1:26 PM on August 21, 2005


A diligent and conscientious IT admin type person can manage to make (visible) mistakes pretty rarely. A diligent and conscientious coder is going to produce bugs quite regularly, unless you're willing to go to extremes of effort usually found in the aerospace or telephone industries, and very very few companies are willing to expend that much effort on their software. So it's possible that your DB/IT guys are just really good.

OTOH, from your description, it does kinda sound like they're using an excuse-of-the-day calendar.

From what you describe, it doesn't sound like their actual performance is that bad. It's just that when they do screw up they don't own up to it. Is this an accurate impression? Their job may, as devilsbrigade says, be to maintain 100% uptime, but that's not really possible with reasonable amounts of funding. Someone's job is to balance the losses of downtime against the expense of each 0.01% improvement and find the best medium. Which probably isn't 99.99% unless you're in a very unusual company.
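To put rough numbers on that tradeoff (back-of-the-envelope only; your company's actual cost of downtime is the part I can't guess):

HOURS_PER_YEAR = 365 * 24  # 8760

for uptime in (0.99, 0.999, 0.9999):
    downtime = HOURS_PER_YEAR * (1 - uptime)
    print("%.2f%% uptime allows %.1f hours (~%.0f minutes) of downtime a year"
          % (uptime * 100, downtime, downtime * 60))

That last step, from roughly 8.8 hours a year down to roughly 53 minutes, is where the redundant ISPs, spare hardware, and on-call rotations start getting expensive, which is why the right target is a business decision rather than a purely technical one.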

I think I like revgeorge's response best of the ones I've seen here. I think it's also important to continue to lead by example, but since you're in a different group that's probably not enough. I assume the DB/IT folks are acting this way because they're afraid people will think they're incompetent. A direct challenge is likely to exacerbate that; unless you actually want to get them fired and replaced, IMHO it's not the best approach.

Perhaps you could try to insinuate the idea that kickass DB guys never make mistakes, but even more kickass DB guys are always ready when the config file gets corrupted or the sunspots bruise the tape heads or whatever. If they take that bait and start being able to make reports like "pixies ate our router, but our contingency plan worked, so that's why you didn't notice anything wrong", they might feel they have some more cred, and you can get them to start admitting their actual mistakes.

And when the inevitable problems do happen, commiserate with them and publicly make the presumption that they were doing the right thing at the time (unless it's obvious they weren't). Eventually they'll get to the point where they say "Yesterday's outage was because I was on crack and erased the wrong disk, but it should be working now, and I've [done something to make that less likely next time]. Sorry about that." and everyone can just move on.

IMHO, although encouraging the emotional growth of your coworkers (or bosses) may not be your job, it will make your job and your life better, so it's worth doing.

(Disclaimer: I'm absolutely abysmal at office politics. I expect that everything I say in this post is a recipe for disaster. But when it works it's nice.)
posted by hattifattener at 1:46 PM on August 21, 2005


What hattifattener said.

You may also be seeing the consequences of mismanagement, perhaps by a manager who isn't even with the company any more.

A few years back I worked in an organisation where most of the IT staff lied black and blue about problems with project delivery, until it was impossible to conceal the truth from top management any more. The reason was that the previous CIO had a fearsome temper, complete with shouting and eyes bugging out like Marty Feldman. This guy left and was replaced by a man with an altogether kinder, gentler style, who was genuinely receptive to the truth. He'd been in the job two years, but people were still reflexively concealing all bad news. Maybe by now things are better there.

Anyway, your colleagues may be under the thumb of someone with a misguided approach to managing performance.
posted by i_am_joe's_spleen at 5:19 PM on August 21, 2005


Random & Cleardawn, I feel your pain. But I don't agree with your posts.

IT is a thankless job. I currently run a company that supports Windows, Linux, and Mac desktop systems and Linux servers of various types.

I have 100% uptime on several systems after six months, including the redundant MySQL database that does a couple million transactions during the 8-hour business day. 100%. It's because I'm careful to tweak settings only during downtime and at night, I'm careful to monitor and performance-clock the systems and have Nagios alert me if anything is running outside of specifications, and I'm exceptionally careful to test everything on development systems first. If something does go down, we have a "post-crash" meeting in my office and with the client to discuss what went wrong, why it went wrong, and either why it won't go wrong again or what we should do to mitigate the risk.
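The monitoring piece doesn't have to be fancy, either. A Nagios check is just a small program that prints one status line and exits 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. Here's a bare-bones disk-space check as an illustration; the path and thresholds are made up, so point it at whatever actually matters to you.

import shutil, sys

PATH = "/var/lib/mysql"       # placeholder: the volume you actually care about
WARN, CRIT = 20.0, 10.0       # illustrative thresholds, percent free

def main():
    try:
        usage = shutil.disk_usage(PATH)
    except OSError as exc:
        print("DISK UNKNOWN - %s" % exc)
        return 3
    pct_free = 100.0 * usage.free / usage.total
    if pct_free < CRIT:
        print("DISK CRITICAL - %.1f%% free on %s" % (pct_free, PATH))
        return 2
    if pct_free < WARN:
        print("DISK WARNING - %.1f%% free on %s" % (pct_free, PATH))
        return 1
    print("DISK OK - %.1f%% free on %s" % (pct_free, PATH))
    return 0

if __name__ == "__main__":
    sys.exit(main())

Wire a handful of checks like that to a pager and most "mystery" outages stop being mysteries, because you hear about the disk filling up or the slave falling behind before the users do.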

IT is developed well enough these days as a career choice that there really is no excuse for having systems go down constantly. And there's no way that NONE of it is their fault, especially with repeated things.

So let's walk through what's happening with Anonymous's IT department. Keep in mind that I *am* a system administrator, and a business owner, and I care a lot about uptime.

* Oracle Bug - So they're not reading their mailing lists and looking for bugs that they could run into before they actually run into them? They don't know what bugs are common? Bullshit.
* Corrupted Index - Yeah, that's pretty common. If it's common in your environment, though, they need to tweak some configurations or write a script that checks the indexes for corruption and re-indexes automatically (roughly the sort of thing sketched just after this list).
* Corrupted Config File - Oh, bull - fucking - shit. Config files are loaded once on startup, and are text files. Unless they're using some sort of GUI to configure Oracle, it's pretty hard for the configuration file to screw itself up. If they're getting corrupted, that means one of two things: the RAM in the server is bad and is munching everything ... or 'corrupted' means "I was editing it and changed a setting and didn't test it before it went live and it screwed everything up."
* Backups - Client didn't set correctly - Huh? They're allowing a client to be responsible for mission critical backups? Someone needs their head checked.
* Backups - Bug in Backup Software - And you're using that backup software because? Why not use a different piece of software? Incompetent fool.
* Internet Problems - And you don't have service from two different ISPs because? At the one client I have where 'net connectivity is mission-critical, we have a T-1 dragged in from one direction, a T-3 dragged in from another direction (separate runs into the building), and then yet another phone line run in separately from everything else that carries a backup-to-the-backup DSL line. Problem solved.
* Router goes bad - Uh, they're kinda, you know, hard circuits? The only time I've had a router go bad is when a client was running a 40 or 50 seat office with a 1.5Mb/s T-1 off of a Linksys cable modem router. That fucker melted after a few months of nonstop use. I know it can happen, but if it happens more than once, that means that they're fucking with settings during the day and borking something.
* Disk array goes bad - OK, but where are they buying their disks, K-mart? The only time a disk array should go 'bad' is when they get turned off and turned back on again... or when no one's monitoring the disk arrays and more than one disk falls off the bottom. Even then, the problem could be solved by adding more layers of redundancy.
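Since I mentioned it above, here's roughly the index-checking script I had in mind. It's a sketch, not production code: the schema name, credentials, and connection string are placeholders, it assumes the cx_Oracle driver is available, and ANALYZE ... VALIDATE STRUCTURE locks the index in its default form, so run it in your maintenance window.

import cx_Oracle  # assumes the cx_Oracle driver is installed

conn = cx_Oracle.connect("monitor_user", "password", "dbhost/ORCL")  # placeholders
cur = conn.cursor()

# walk every index owned by the application schema
cur.execute("SELECT owner, index_name FROM dba_indexes WHERE owner = :o",
            o="APPSCHEMA")
for owner, index_name in cur.fetchall():
    full_name = '"%s"."%s"' % (owner, index_name)
    try:
        cur.execute("ANALYZE INDEX %s VALIDATE STRUCTURE" % full_name)
    except cx_Oracle.DatabaseError as exc:
        # validation failed: log it and rebuild, so it never becomes a 3 a.m. page
        print("index %s failed validation (%s), rebuilding" % (full_name, exc))
        cur.execute("ALTER INDEX %s REBUILD" % full_name)

conn.close()

Whether you rebuild automatically or just alert and do it by hand is a judgment call; the point is that "corrupted index" should be something your monitoring finds, not something your users find.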

So again, I have to call 'incompetent bullshit' on your IT/DB staff. People who don't have their heads stuck [up their asses, in the sand] find problems and fix them, no matter how tough their job is. It's more than possible to have an IT department where things breaking is the exception as opposed to the norm, but it takes the willingness to WORK hard at it, to tweak everything until you're satisfied, and to have multiple layers of redundancy so that when something breaks you catch it before users notice.

But how to fix it? It sounds like whoever the manager is over in that group is concerned more with his image than the quality of his work, and that's a tough problem to go after. He's gonna have to have some pretty bad egg on his face before he gets booted, and I doubt Anonymous is the one to fix it.
posted by SpecialK at 9:53 PM on August 21, 2005


your colleagues may be under the thumb of someone with a misguided approach to managing performance

If you decide to try and do something about the situation, pay attention to this. The first thing you need to do is find out exactly how their performance is measured, and by whom. Is there a blame-driven manager somewhere higher up in the hierarchy? Then people will be concerned with passing blame. The only definitive solution is to get IT to be managed based on objectives.

You may or may not be able to figure out a way to change that. Your first step would be to talk to your manager. Your manager may not be aware that this is an ongoing problem, or maybe he underestimates the effect it's having on his team. Start by asking questions, and if he seems receptive, offer him concrete examples of how the situation is creating problems for you.

If you can't change the situation, start looking around for other job opportunities. Blame-driven management, if left unchecked, will spread.
posted by fuzz at 8:48 AM on August 22, 2005

