The super err will not be denied.
February 20, 2008 12:52 PM

How do I figure out the source of my VM memory corruption issues, and can I fix them?

I am using the latest version of VMware Server under Windows XP 64-bit. One of my VMs is a Red Hat Enterprise Linux 4 server, with all the development tools (gcc et al.) installed that are required to build our product from source.

On rare occasions, for one specific test that triggers an error condition, instead of the usual error message and program flow generated by our program, I'll get the following error displayed on screen:

"*** glibc detected *** double free or corruption (fasttop): 0x09f1e318 ***
Aborted"

Now here's where it gets really strange...this error message has super powers of some sort that allow it to display to screen even when stdout and stderr are being redirected, leading me to think it must be an issue somewhere deeper than our code.

" /my/stuff/here >/dev/null 2>&1
*** glibc detected *** double free or corruption (fasttop): 0x0825b318 ***
Aborted"


My questions are many, but I'll settle for any advice towards narrowing down the root cause of what seems to be memory corruption in the VM.

Any theories on how the VM is getting corrupted, and what I might do to prevent it are also welcome.
posted by nomisxid to computers & internet (13 comments total)
No idea about the memory corruption (try ruling things out by running it on a different virtual and physical machine), but glibc might well be writing error messages to the controlling tty instead of stderr just to make sure they get seen.
posted by fvw at 1:02 PM on February 20, 2008


I don't think this has anything to do with your VMWare setup, but rather with a bug in your code that didn't bother previous versions of glibc. Specifically, my 1-minute Google search suggests that older glibc was lax about double-frees, e.g. free(x); free(x).
posted by qxntpqbbbqxl at 1:21 PM on February 20, 2008


It's difficult to diagnose a problem like this from afar. There are two broad possible causes.

1. Your program calls free or delete more than once on the same memory location, or once on an invalid memory location, or once on a once-valid memory location that has since been overwritten (i.e. corrupted).
2. Something external to your program, such as VMWare or glibc, is corrupting memory.

Scenario (2) is possible, but unlikely, because such a problem would likely affect other programs and/or the stability of the OS as a whole.

Some things to check:
If you post a stack trace here, perhaps I or someone else can provide more help.
posted by metric space at 1:38 PM on February 20, 2008


I am not the developer for this code, only its tester. I have brought this error to the attention of dev, but since it only occurs on my VM (same binary on another system = no error), they have been reluctant to invest serious time investigating. It has never been seen on any other real or VM system, same OS/arch or otherwise, and it runs every day successfully on all the other environments. It always fails at exactly the same place, IF it fails.

Previously, when I encountered this error, my dev environment was hosed. No matter how many times I'd recompile or rerun the test, it would fail every time. After working with dev to a point where they felt confident it was the VM corrupting things, and not our code nor glibc, I wiped my entire dev directory tree, re-synced from source control, and things were OK again. VMware released a dot-release of Server, and it had been 2 months since I'd last seen this issue. Then yesterday I started to see it happen again, but this time it's an occasional thing. Sometimes I can run a dozen copies of the same test with no issues; other times, it fails every 3rd run.


This is the code pointed at by the stack trace:

void free_x_dirents(X_DirEnt **s)
{
    if (s) {
        for (int i = 0; s[i]; i++)
            delete s[i];
        xfree(s);
    }
}


#0 0x0051a7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x0055b7a5 in raise () from /lib/tls/libc.so.6
#2 0x0055d209 in abort () from /lib/tls/libc.so.6
#3 0x0058fa1a in __libc_message () from /lib/tls/libc.so.6
#4 0x005962bf in _int_free () from /lib/tls/libc.so.6
#5 0x0059663a in free () from /lib/tls/libc.so.6
#6 0x0818d344 in ~basic_string (this=0x8e95b58)
at /root/work/tmp/i686-pc-linux-gnu/libstdc++-v3/include/bits/basic_string.h:233
#7 0x0807ff1e in ~x_DirEnt (this=0x8e95b58) at x-client.h:27
#8 0x0806be3c in free_x_dirents (s=0x8e95ab8) at x-client.c:420
posted by nomisxid at 3:19 PM on February 20, 2008


This is probably something that could be solved by sitting down with the system and the code, but it will be very difficult to come to any conclusions otherwise. However, in the interests of science, how about this.

The error is not occurring precisely in the code you posted, but in the destruction of an std string in the X_DirEnt destructor. The error is probably not in the std string destructor, but is manifested there because of corruption elsewhere. Without seeing more of the code it's impossible to tell where.
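
To make the "corruption elsewhere" idea concrete, here is a minimal sketch, not the actual product code. One way the loop in free_x_dirents could end in a double free is if the same pointer were stored twice in the array. The X_DirEnt below is a hypothetical stand-in (the real class is not shown in this thread), and has_duplicate_entries is a made-up helper that could be called just before free_x_dirents in a debug build.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the real X_DirEnt, which the thread does not show.
struct X_DirEnt {
    std::string name;
};

// Returns true if the null-terminated array lists the same pointer twice.
// Deleting such an array entry by entry would free one object twice.
bool has_duplicate_entries(X_DirEnt* const* s) {
    if (!s) return false;
    std::set<const X_DirEnt*> seen;
    for (std::size_t i = 0; s[i]; ++i) {
        // insert() reports via .second whether the pointer was new.
        if (!seen.insert(s[i]).second) return true;
    }
    return false;
}
```

If this returned true for the array that crashes, the bug would be in whatever builds the array, not in the string destructor. It would not catch an overrun that scribbles on the heap, so it is only a partial check.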

1. Again, are you certain that you are comparing the exact same binaries (byte-for-byte the same, "sum" reports the same checksum), and all of the exact same dynamic libraries (ldd output is the same, "sum" reports the same checksums for all libraries)? This is important because it is very easy to pull in different libraries, different header files, out-of-date object files, etc. into separate compiles of the same code, even on the same machine, if one is not very careful.

2. valgrind is your friend. Try running the test under it - as simple as "valgrind <test>". It will report some spurious problems with std string. But does it say anything about reads from freed memory in your code?

posted by metric space at 4:28 PM on February 20, 2008


1. Yes. Our program actually verifies the checksums of all included files at runtime, and will not start if any don't match what was generated at compile time. This is a huge pain in my ass most of the time, because I can't mix binaries from different compiles, but in this case....

2. Go figure, I've done 75 runs in a row without the failure now. Running valgrind on a successful run, the only time "free" appears is in the final summary:

==25160== ERROR SUMMARY: 3716 errors from 502 contexts (suppressed: 23 from 2)
==25160== malloc/free: in use at exit: 29,784 bytes in 1,479 blocks.
==25160== malloc/free: 4,576 allocs, 3,097 frees, 295,306 bytes allocated.
==25160== For counts of detected errors, rerun with: -v
==25160== searching for pointers to 1,479 not-freed blocks.
==25160== checked 428,560 bytes.
==25160==
==25160==
==25160== 7,190 bytes in 103 blocks are possibly lost in loss record 3 of 4
==25160== at 0x400479D: malloc (vg_replace_malloc.c:207)
==25160== by 0x81AC6A9: operator new(unsigned) (new_op.cc:68)
==25160== by 0x818AD6A: std::string::_Rep::_S_create(unsigned, unsigned, std::allocator<char> const&) (new_allocator.h:88)
==25160== by 0x81137A9: char* std::string::_S_construct(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) (in /home/dsimon/dev/rsit/70/src/ssh/scp)
==25160== by 0x818BC74: std::string::string(char const*, std::allocator<char> const&) (basic_string.h:1449)
==25160== by 0x804B205: main (generic_main.cpp:11)
==25160==
==25160== LEAK SUMMARY:
==25160== definitely lost: 0 bytes in 0 blocks.
==25160== possibly lost: 7,190 bytes in 103 blocks.
==25160== still reachable: 22,594 bytes in 1,376 blocks.
==25160== suppressed: 0 bytes in 0 blocks.
==25160== Reachable blocks (those to which a pointer was found) are not shown.
==25160== To see them, rerun with: --leak-check=full --show-reachable=yes

If I can get it to start failing semi-consistently again, I'll post the valgrind results of that as well.

posted by nomisxid at 4:57 PM on February 20, 2008


Ok. In the meantime, what about those 3716 valgrind errors? Valgrind can report certain innocuous behavior in well-written code (e.g. the STL), but I would be suspicious of your code until you can investigate and demonstrate the harmlessness of each of those messages.

Is the application multi-threaded?
posted by metric space at 6:02 PM on February 20, 2008


It's notable that valgrind is reporting leaked strings, whilst the backtrace you point to is an error raised whilst attempting to delete a string. Could these be related?

(Does the current issue still lead to VM corruption btw? If so, then I'd be very tempted to blame VMWare...)
posted by pharm at 3:56 AM on February 21, 2008


Yes, glibc does open /dev/tty to print that message:
open("/dev/tty", O_RDWR|O_NONBLOCK|O_NOCTTY) = 3
writev(3, [{"*** glibc detected *** ", 23}, {"double free or corruption (fastt"..., 35}, {": 0x", 4}, {"0804a008", 8}, {" ***\n", 5}], 5) = 75
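
For the curious, here is a rough sketch of the same trick in user code. Shell redirection such as ">/dev/null 2>&1" only rebinds descriptors 1 and 2, so a descriptor freshly opened on the terminal device is unaffected. The open flags are adapted from the strace line above (write-only instead of read-write); the function names are assumptions made for illustration, not a claim about how glibc is structured internally.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Opens the given device path and writes msg to it directly,
// bypassing whatever stdout and stderr currently point at.
// Returns true only if the whole message was written.
bool write_to_device(const char* path, const char* msg) {
    int fd = open(path, O_WRONLY | O_NONBLOCK | O_NOCTTY);
    if (fd < 0) return false;  // e.g. no controlling terminal
    std::size_t len = std::strlen(msg);
    bool ok = write(fd, msg, len) == static_cast<ssize_t>(len);
    close(fd);
    return ok;
}

// Hypothetical convenience wrapper: targets the controlling terminal.
bool write_to_terminal(const char* msg) {
    return write_to_device("/dev/tty", msg);
}
```

Calling write_to_terminal("hello\n") from a program started with ">/dev/null 2>&1" should still show the text on an interactive terminal, and simply return false when there is no controlling terminal.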
In the case of the test program I wrote, valgrind shows the error quite clearly:
==13490== Invalid free() / delete / delete[]
==13490== at 0x401CFCF: free (vg_replace_malloc.c:235)
==13490== by 0x80483D4: main (mdf.c:5)
==13490== Address 0x415F028 is 0 bytes inside a block of size 1 free'd
==13490== at 0x401CFCF: free (vg_replace_malloc.c:235)
==13490== by 0x80483C9: main (mdf.c:4)
Here's my program:
1 #include <stdlib.h>
2 int main(void) {
3     char *m = malloc(1);
4     free(m); free(m);
5     return 0;
6 }
If your program allocates large amounts of memory, then valgrind may have forgotten about the free'd pointer. You can try increasing the --freelist-vol parameter if you aren't getting a message that would explain the heap corruption (which could be due to a double free, a mismatch between malloc/free and new/delete, or a write before or after an allocated block).
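
If valgrind is not an option on the failing machine, a crude alternative is to track live blocks yourself. This is a minimal sketch, assuming the suspect allocations can be routed through a wrapper. TrackedHeap and its methods are hypothetical names, not part of glibc or valgrind, and the approach only catches double frees and foreign pointers, not writes before or after a block.

```cpp
#include <cstddef>
#include <cstdlib>
#include <set>

// Hypothetical debugging aid: remembers which blocks are currently live.
class TrackedHeap {
public:
    void* allocate(std::size_t n) {
        void* p = std::malloc(n);
        if (p) live_.insert(p);
        return p;
    }

    // Frees p only if it is a live block obtained from allocate().
    // Returns false for a double free, a foreign pointer, or a null
    // pointer, leaving the real heap untouched in those cases.
    bool release(void* p) {
        if (live_.erase(p) == 0) return false;
        std::free(p);
        return true;
    }

    std::size_t live_blocks() const { return live_.size(); }

private:
    std::set<void*> live_;
};
```

With this in place, the second free in the test program above would make release() return false instead of aborting, which gives you a spot to log or set a breakpoint.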

I think it's a total red herring that you are running in a VM.
posted by jepler at 6:09 AM on February 21, 2008


(whoops; valgrind reports the invalid free on mdf.c:5 because I had moved the second free to its own line before that run to increase the clarity of the message. Instead, I flubbed it)
posted by jepler at 6:11 AM on February 21, 2008


thanx all for the input so far.

I'll ask dev about their valgrinding plans. They may be putting it off till later in the cycle for reasons of their own.

I ran my debug compile all last night, without getting the double free condition again. Hopefully I can take what I've learned here today, and next time it does happen, I can gather better info to prove to dev it's not the vm's fault.

app is not multi-threaded currently.

the only app that seems to get corrupted in the vm is my build environment...but since it's my vm, i could only get dev to play on it for a limited time.
posted by nomisxid at 7:48 AM on February 21, 2008


further follow up for the curious:

downloaded an iso of memtest86+, and started the vm with it, ran dozens of passes, no errors. rebooted RHEL and did a clean make, and now I get all sorts of errs before I even get as far as before:

*** glibc detected *** corrupted double-linked list: 0x097f8128 ***

*** glibc detected *** malloc(): memory corruption (fast): 0x08d7c8d4 ***
posted by nomisxid at 11:50 AM on February 21, 2008


I have no evidence for this, naturally, but I strongly suspect that there is a bug in your application. In many years of software development, I've seen thousands of bugs in locally developed code, dozens of problems due to documented bugs in widely available third party software, and only a handful of problems due to undocumented bugs in well-used third party code. Of course, guess which of the three options a lot of developers are most likely to blame when their program fails?

Does your application rely on user input, random number generation, time of day, or any other factor(s) that make its separate runs "non-deterministic"? That could explain the intermittent problems you're seeing. Try to run the tests again with valgrind and see what it reports just before the program dies. Good luck!
posted by metric space at 12:56 PM on February 21, 2008

