Constant corruption - and I'm not even talking about the government!
January 25, 2008 1:42 PM   Subscribe

I have a few mysql tables that keep repeatedly getting corrupted - help!

I'm running a website that uses OSCommerce, though that in particular may not be relevant to my problem. The sessions and whos_online tables both seem to keep getting corrupted. I run myisamchk -r on them (even after stopping mysql server altogether in order to do so, just in case) and they're fine for a very short while (measured in minutes, not even hours) and then both keep getting corrupted again.

I've rebooted the server (Red Hat linux), I've tried dropping the tables and recreating their structure, and I'm now at my wit's end.

I figured a reboot would run fsck to see if it was a disk issue. I must plead ignorance here -- I'm an amateur when it comes to being a sysadmin, and I built this box so long ago I don't remember if it ran fsck on boot automatically, or if some other job checks first if it should run fsck, or what... Anway, I looked a boot.log and see no record of fsck being run.

Anyone know what might be causing this, or what else I can look into? My understanding is that I can't run fsck on a mounted file system so I'm not sure what to do.
posted by twiggy to Computers & Internet (12 answers total)
There's usually a file you can create on the root filesystem that will force the system to run fsck on its next bootup; from this post and a few others, it looks like it may be /forcefsck on Red Hat. Try touching /forcefsck, then rebooting.

Also, it may be possible to switch MySQL from ISAM to InnoDB tables; here's an article on that. I have to admit I don't know a huge amount about the two table types, but I understand that InnoDB is newer and more reliable. Of course, make very sure you have your filesystems in good order before you try anything drastic like that, lest you trash the whole database. In fact, I would probably download a whole mysqldump right now, just in case.
posted by pocams at 1:52 PM on January 25, 2008

check /var/log/messages for nasty disk failure messages; these tables are probably writing to a bad sector on disk. Look for nasty seek errors. Definitely run a backup ASAP and get it off the machine in case the HD dies or is dying.

If you see this kind of error, try running badblocks - (fsck checks the file system for logical integrity, but doesn't scan the disk's surface). This may destroy your machine, so get a good backup going first. You'll also need to be in rescue mode.

Depending on the cost / size / age of the HD, it may be quicker & cheaper to swap out the box for something newer and studlier.
posted by jenkinsEar at 2:13 PM on January 25, 2008

Response by poster: Believe it or not - no nasty filesystem messages to be seen in /var/log/messages .... I'm putting off trying this forcefsck thing and see how that goes but I'm mildly terrified as it's a 1u rackmount in a datacenter so I don't have access to it if it explodes...

Why would badblocks destroy my machine, exactly?
posted by twiggy at 2:17 PM on January 25, 2008

InnoDB just makes MySQL adhere more to the ACID rules, adding foreign key constraints and that kinda thing. So in on a high level, it's mre reliable, because you've got more control of the data that's entered. (e.g. you need to have a tree before you can add the leaves).

File corruption is something else though, MyISAM shouldn't corrupt a table, because all you're doing is sending SQL queries it's way, you might corrupt the data, e.g. turn all Z's to X's, but not actually the table.
Hard to point to the cause without knowing the nature of the corruption, but if it is disk based, then move the datafiles to another storage device to see if the problem goes away.

I suppose the datafile could be corrupt, and the mysql tools to fix that aren't perfect. Try creating a new datafile, export the current data (so you've got the SQL queries to rebuild it) and run them in the new space.

I'm not sure how OSCommerce manages the who's online table, but it's possible it's doing something quirky-- make sure you're up to date on patches/versions and you might just get lucky.
posted by Static Vagabond at 2:26 PM on January 25, 2008

Response by poster: Okay, my /mysql directory is its own filesystem, so I just did a umount /mysql and ran fsck on it. Here's the output:

[/]# fsck /mysql
fsck 1.32 (09-Nov-2002)
e2fsck 1.32 (09-Nov-2002)
/mysql: clean, 3552/2097152 files, 1070723/4186940 blocks

That's not a whole pile of information, but it seems as if it's saying the filesystem is OK. Thing is, that result was virtually instantaneous. I've read the manpage and I don't see anything saying I _must_ provide any options, but i'm starting to wonder if in my panicked state I'm not reading thoroughly enough.
posted by twiggy at 2:35 PM on January 25, 2008

You mentioned it was at a datacentre, so access is probably a pain-- you could create a little ram disk if you don't have a second drive to hold the tables while you test.
posted by Static Vagabond at 2:51 PM on January 25, 2008

What version of MySQL? What does it say is corrupt? Just the indexes, or data, or what? Are you using fulltext indexing? What happens if you ALTER TABLE foo ENGINE=INNODB?

fsck will take hardly any time on small filesystems; it tends to scale with the number of files, so running quickly is to be expected. For bad blocks, SMART is probably the first place to go looking; smartctl -a /dev/sd[whatever] to see if it's reallocated or has pending sectors (ones waiting to be reallocated), and smartctl -t long /dev/sd[whatever] to make it run a surface scan of the media.
posted by Freaky at 5:52 PM on January 25, 2008

Response by poster: Freaky: It's MySQL 4.1.22... I haven't switched the engine to InnoDB yet as I'm a bit hesitant to do so... the site has been running for a couple of years just fine so I don't think it's related to the engine...

I'll read about SMART, though, and see if that might help, thanks!

Oh yeah, as far as what it says is corrupt - I should've saved output. I've managed to get it to "only" complain now that the table is "usable but should be fixed"...

If memory serves, though, it said the table was considered "crashed".. I recall a reference to referencing records in the wrong key file or something... (oy, I really should've saved the output!)
posted by twiggy at 6:30 PM on January 25, 2008

Could also be bad RAM. Normally, I'd recomend running memtest86, but that's a little hard to do in a datacenter. Most people will suggest, as a temporary fix, running something like this:
cd /usr/src/linuxfor i in $(seq 1 10); do make clean; make bzImage; make modules; done
(Assuming you have a valid .config file already). This just builds the kernel 10 times in a row... which is a simple stress test to do.
posted by sbutler at 8:32 PM on January 25, 2008

In my experience, bad RAM will bring the whole system down. I suppose it could manifest in subtler ways, but I find it odd that it would repeatedly corrupt the same two tables, especially after several MySQL restarts and a system reboot.

If you're going to do what sbutler suggests anyway, throw a "nice" in front of the second command, to run it at lower priority so that it doesn't slow MySQL down too much.

I'd second what Static Vagabond says.
posted by fogster at 10:17 AM on January 26, 2008

4.1.22's a bit old, but should be ok. I think the last non-MERGE-releated MyISAM corruption issue was fixed in 4.1.21, but given how recent that was I wouldn't rule out another MySQL bug. You might try taking a snapshot of the tables and saving a binlog until they get corrupted; then roll back and re-run the binlog and see if it happens again.

If it's a half-decent server, you should have ECC memory; assuming it's enabled, it shouldn't (with very high probability) allow broken memory to hit the system, though I suppose that depends on what the OS does with the MCE's (Machine Check Exception). Linux I think will either kill the process who owns the memory or panic the system in the event of uncorrected errors.

mcelog will tell you about any MCE's the system has logged (note it consumes the data that's buffered in the kernel, so if you want to save it redirect the output to a file).

There are also online memtest applications which may be of use, though obviously they can't be as thorough as an offline memtest.
posted by Freaky at 8:30 PM on January 26, 2008

Response by poster: Update for those who might come across this via Google, etc:

My server has ECC RAM, and as noted above, I don't believe bad memory to be the issue here based on the symptoms, the rebooting, etc.

I've gotten the tables to at least behave stably now, but I still get the whole "Usable but should be fixed" thing. Still, it doesn't give me much clear information on what's "slightly wrong" or why it should be fixed even when I use --verbose on myisamchk.

Unfortunately I'm still trying to figure out what might be wrong here, but I'm considering doing the work to upgrade to MySQL 5 just to see if rebuilding the database makes this go away. If I do that, I'll report back with success/failure.
posted by twiggy at 2:02 PM on January 28, 2008

« Older What's recommended to consume right after a...   |   Chevrolet Tracker Spark Plugs Newer »
This thread is closed to new comments.