Redisassembleunobfuscate?
November 22, 2008 4:56 PM

Why aren't there assembly -> C or pseudocode tools?

So, for my last lap in college, I'm doing the dreaded binary bomb project, which is actually kind of cool. I'm almost done (5/6), and hopefully should be able to finish soon.

But anyway, looking at all this GAS, I'm wondering why there isn't a tool that generates equivalent C or C++ code. It seems absurdly simple in some respects to generate one of the many possible C programs that could yield a given assembly procedure, but I must be missing something?
posted by tmcw to Computers & Internet (10 answers total)
 
There are plenty. Google "decompiler".
posted by Flunkie at 5:14 PM on November 22, 2008


It's a very good question. There is a Wikipedia page about decompilers which talks about these. I'm surprised I've not given this more thought before, or if I did at one time, I completely forgot about it (I'm an old software engineer). Back when we were first building RISC machines, people must have been interested in porting machine language programs to new architectures, and one technique could have been to decompile the program into simple C and then recompile it for the new target machine, but I really don't remember anyone seriously trying to do this.
posted by thomas144 at 5:21 PM on November 22, 2008


Two reasons. First, it is a difficult problem. Second, it's not very useful.
posted by ryanrs at 6:03 PM on November 22, 2008


As far as I know (which isn't very far), the code produced wouldn't be fit for reading by humans. It'd be extremely inelegant.
posted by philomathoholic at 8:51 PM on November 22, 2008


They exist, as noted. And they don't always work, as noted. They may produce equivalent code but that doesn't necessarily mean it will be easier to understand.
posted by chairface at 10:26 PM on November 22, 2008


Best answer: One of the reasons it is not commonplace is that it is technically illegal in most countries to decompile copyrighted binaries. For the most part I think this has driven development underground or into niche areas, such as security consulting.

One of the other problems, as you alluded to in your question, is that there is not a one-to-one correspondence from assembly to C. You are also going to be missing a lot of the context of the code, such as variable names, and it is going to take some serious work to recover the types. This is assuming the code has not been optimized. If you are looking at production code that has been heavily optimized, that adds a whole other level of difficulty.
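
To illustrate the missing one-to-one correspondence, here is a minimal sketch of two differently written C functions that compute the same thing. The function names are made up for this example, and whether a particular compiler emits identical instructions for both depends on the compiler and its flags. The point is only that a decompiler looking at the machine code has no way to tell which style the author originally wrote.

```c
#include <assert.h>
#include <stddef.h>

/* Version A: indexed for loop. */
int sum_indexed(const int *v, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}

/* Version B: pointer walk with a while loop. */
int sum_pointer(const int *v, size_t n) {
    const int *end = v + n;
    int total = 0;
    while (v != end) {
        total += *v++;
    }
    return total;
}

/* Both versions agree on a sample input, so either one would be a
   valid reconstruction of the other's compiled code. */
void check_sum_versions(void) {
    int data[] = {3, 1, 4, 1, 5};
    assert(sum_indexed(data, 5) == sum_pointer(data, 5));
}
```

Calling check_sum_versions() from a main function runs the comparison. A decompiler would have to pick one of these forms (or a third) more or less arbitrarily.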

For this and other reasons, it is not as trivial as you might first think. So, the question is "for what problems is it worth the trouble?". The only answer I am familiar with is the security industry. Decompilation is one method of identifying weak spots in target code, and of reverse engineering competitors' projects. There are also companies such as Veracode which have very sophisticated binary analysis routines which they use in evaluating and consulting on the security of their clients' systems. This is just what I know from a friend who worked with this stuff for a while, so someone feel free to correct me if I got it wrong. I also stumbled across this page which goes over a few of the issues in a bit more detail.
posted by sophist at 11:48 PM on November 22, 2008


sophist, can you give a cite for it being illegal? As far as I know, the only prohibitions in the US are if you have a contractual obligation not to decompile (a EULA, etc.), or if someone can bring the DMCA into play (but even then, as I understand it, decompiling isn't illegal per se, but using or disseminating the information you got from decompiling may be).

Some other big reasons to decompile stuff are for doing maintenance on software for which the source code has been lost (yes, this happens more than you'd like to believe), and for debugging code that you don't have source to (of course no vendor would ever ship a library or an OS with a bug in its libraries; I'm just being hypothetical here).
posted by hattifattener at 1:32 AM on November 23, 2008


Especially with optimizing compilers, there is only a causal correspondence between your high-level code and the bytecode. Obviously it's closer for C, but trying to decompile anything more complicated is guessing at best. For higher-level languages that compile to a virtual machine, the correspondence is even more complicated due to the mismatch between VM and language features (example: the JVM is very class-centric).
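
As a rough sketch of how an optimizer can loosen that correspondence: some optimizing compilers are able to replace a simple summation loop with a closed-form expression. The two functions below (names invented for this example) return the same values, but if only the second form survives in the binary, the loop the programmer wrote is simply gone. Whether a given compiler performs this rewrite is an assumption that depends on the compiler and flags.

```c
#include <assert.h>

/* What the programmer wrote: add up 0, 1, ..., n-1 with a loop. */
unsigned sum_below_loop(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++) {
        total += i;
    }
    return total;
}

/* What might be recovered if the optimizer removed the loop:
   the closed form n*(n-1)/2. The product is computed in a wider
   type so the division is exact before truncating back. */
unsigned sum_below_closed(unsigned n) {
    if (n == 0) {
        return 0;
    }
    unsigned long long product = (unsigned long long)n * (n - 1);
    return (unsigned)(product / 2);
}

/* The two forms agree on a sample input. */
void check_sum_below(void) {
    assert(sum_below_loop(100) == sum_below_closed(100));
}
```

Nothing in the closed form hints that the source contained a loop, which is why decompiling optimized code involves guessing.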
posted by mezamashii at 3:32 AM on November 23, 2008


hattifattener: Wikipedia has a section on its legality.

I read recently about a lawsuit between Blizzard, makers of World of Warcraft, and the makers of Glider, a 'bot' that automated play of World of Warcraft. One of the arguments Blizzard made was that copying the game to RAM in order to play it was an unauthorized copy unless permitted by the software's EULA, and since Glider violated the game's EULA, using it violated copyright law. And Blizzard won, absurd though that may seem.

If it's a copyright violation just to copy a program into RAM without authorization, I can believe it's also a copyright violation to create a decompiled copy without authorization.
posted by Mike1024 at 3:40 AM on November 23, 2008


As far as I know (which isn't very far), the code produced wouldn't be fit for reading by humans. It'd be extremely inelegant.

Exactly. You'd be turning a series of MOV R1,R2 instructions into comparable C directives, and it would get you practically nothing. There's no elegant way to intuit that the instructions on lines 230-238 are actually the result of running

int tax = .06 * (a + b);

because all the semantic information gets stripped off, and you can't naively determine what C directive was most likely to have generated the block of assembly you're looking at, without brute-forcing your binary against the set of all possible inputs. So, while there are decompilers, the output from one of them is no more readable than the source binary--it's just line after line of extremely simple instructions, with primitive looping and branching instructions that would be equally easy to trace in a debugger for the compiled binary.
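
To make the tax example concrete, here is a hypothetical sketch of the original next to the kind of C a decompiler might hand back. The second function is written by hand for illustration and is not the output of any real tool. The names sub_401000, arg0, arg1, and v0 through v2 are invented placeholders in the style such tools tend to use.

```c
#include <assert.h>

/* Original source: the names say what the code is for. */
int tax_original(int a, int b) {
    int tax = .06 * (a + b);
    return tax;
}

/* Hypothetical decompiler-style reconstruction: same arithmetic,
   one step per line, with every meaningful name gone. */
int sub_401000(int arg0, int arg1) {
    int v0 = arg0 + arg1;      /* integer add */
    double v1 = (double)v0;    /* convert int to double */
    double v2 = v1 * 0.06;     /* multiply by the constant */
    return (int)v2;            /* truncate back to int */
}

/* Both functions agree on a sample input. */
void check_tax_versions(void) {
    assert(tax_original(60, 50) == sub_401000(60, 50));
}
```

The reconstruction is equivalent, but nothing in it says "tax", and the reader still has to work out the intent.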
posted by Mayor West at 4:55 AM on November 24, 2008

