How do I interpret RPC extended error information?
October 12, 2007 9:20 AM   Subscribe

COM/RPC debugging: interpreting RPC extended error information.

Anyone here had to debug a COM/RPC problem using RPC extended error information?

In our application, there's a multi-threaded server process which sends notifications to a single-threaded GUI process, using COM. While the GUI process is busy, notifications from the server eventually fail, with the error code 0x80010100 (RPC_E_SYS_CALL_FAILED).

The RPC extended error information:

ProcessID is 4316 [the GUI, i.e. the destination process]
System Time is: 10/11/2007 16:57:53:590
Generating component is 1 [Application]
Status is 2147549440 [0x80010100, RPC_E_SYS_CALL_FAILED]
Detection location is 2
Flags is 0
NumberOfParameters is 3
Long val: 1113943584
Short val: 3
Long val: 4096

A couple questions:

1. I take it this means the error is occurring somewhere in the receiving process, in the application code?! (I was assuming it was in the RPC run-time, on the sending side, so that's useful information.)

2. Any idea what "detection location is 2" indicates, and what the three parameters indicate? The long val looks like an address (0x42656E20), but looking at the GUI process, that's not a valid address--no code is loaded there.

Any general RPC debugging tips would also be appreciated.
posted by russilwvong to Computers & Internet (6 answers total) 1 user marked this as a favorite
 
What do you mean by "while the GUI process is busy"? What is it busy doing? If it's not pumping messages, then the RPC call isn't going to work.
posted by jeffamaphone at 10:16 AM on October 12, 2007


Response by poster: It's retrieving information from the server process, using COM calls, and using the information to create a tree.

While it's making each COM call back to the server process and waiting for the server to respond, it's running the message loop.
posted by russilwvong at 10:25 AM on October 12, 2007


So while your GUI is making a COM call to the server, the server tries to send a notification to the GUI via COM call. Hmm... I'm not sure why you're failing, but that's generally a bad thing.

You could try implementing IMessageFilter on your GUI process to tell the server to "try again later" when you're already inside a COM call to the server.
posted by jeffamaphone at 10:29 AM on October 12, 2007


Response by poster: Thanks for the IMessageFilter suggestion, that should at least provide some visibility into incoming COM calls.

My understanding is that COM explicitly supports re-entrancy. Don Box, Essential COM, p. 230: To help prevent deadlock, all COM apartment types support re-entrancy. When a thread in an apartment makes a call through a proxy to an object outside the caller's apartment, incoming method requests can continue to be serviced while the caller's thread is waiting for the ORPC response from the original call.

Box provides a detailed explanation of how this works for the single-threaded apartment. When the caller's thread enters the channel's SendReceive method to send the ORPC request and receive the ORPC response, the channel ... places [the thread] in an internal windows MSG loop. This is not unlike what happens when a thread creates a modal dialog box.

Our application design makes heavy use of re-entrancy, and it hasn't been a problem up to now.

posted by russilwvong at 10:46 AM on October 12, 2007


Response by poster: We've found the problem! That was definitely the ugliest bug hunt in a while.

Here's a summary of the debugging process we used, for future reference.

1. Reproduced the problem consistently.

2. Used application-level trace statements to narrow down the problem. Found that the COM notifications from the server to the client are eventually failing, error code 0x80010100 (RPC_E_SYS_CALL_FAILED).

3. Found a reference to RPC extended error information. Obtained the information described above.

4. Put the GUI process on a separate machine and used Wireshark to capture the packet traces. What happens is that the server sends a DCERPC request and the GUI sends a DCERPC response; everything looks fine. And then eventually, the GUI starts sending DCERPC fault packets (0x80010100) in response to incoming requests.

Checked the contents of the first request packet which failed against the application trace statements to verify that the application code was not being called by COM. So the error is occurring somewhere in the COM code.

5. Looked through the DCOM source code (we bought a copy of the DCOM source code from the Open Group back in 2000, while debugging a different problem). Found three or four instances of the error code, but couldn't figure out why those instances would occur.

6. Used windbg to set a breakpoint in the GUI process, specifically ole32.dll, as described in this blog post. With windbg attached to the GUI process, the error code would change to 0x800101cc, which doesn't appear anywhere; so that was a dead end.

7. Found something interesting: in the latest version of ole32.dll, there's only one instance of the error code. It's in OXIDEntry::PostCallToSTA.

Googled PostCallToSTA and found this post, describing a very similar problem. The poster had tracked it down further, to PostMessage failing with ERROR_NOT_ENOUGH_QUOTA (1816).

Someone else had said that it was probably caused a message limit. There is a limit of 10,000 posted messages per message queue.

8. After thinking about it a bit, this matches the behavior we're seeing. If the internal COM message loop isn't processing all messages in the queue, only the COM notifications, it's quite possible that the message queue is filling up.

Tested the hypothesis by increasing the message queue size, using the USERPostMessageLimit registry entry. Verified that the error took longer to occur.

Fixing the problem was relatively simple: added a DoEvents method which processes all messages in the message queue, and called the DoEvents method whenever the GUI receives a COM notification. Verified that this fixes the problem.

--
I talked to someone from Microsoft technical support yesterday, to find out if there were other debugging techniques that we could apply for future problems like this one.

Besides RPC extended error information and packet traces, they have "checked builds" available for MSDN subscribers (would this include an instrumented version of ole32.dll?); there's COM tracing; and they have a tool called IDNA, a "time travelling tool" which captures all instructions executed by a process. The IDNA dump can then be interpreted by Microsoft using their private symbol files.
posted by russilwvong at 5:43 PM on October 23, 2007


Response by poster: Fixing the problem was relatively simple: added a DoEvents method which processes all messages in the message queue, and called the DoEvents method whenever the GUI receives a COM notification. Verified that this fixes the problem.

Further update: this change fixed the particular problem, but caused other problems elsewhere in the application (because it changed the order in which events are processed).

Fix #2: change the COM interface so that when the GUI initiates a server operation which takes a long time, it doesn't block: instead, it makes a call to start the operation, which returns immediately. The server then notifies the GUI when the operation completes (or fails).
posted by russilwvong at 9:49 AM on February 7, 2008


« Older What to do about my shoulder dysplasia?   |   Yoga mat is not grippy Newer »
This thread is closed to new comments.