Follow Slashdot blog updates by subscribing to our blog RSS feed

 



Forgot your password?
typodupeerror
×
Programming IT Technology

Famous Last Words: You can't decompile a C++ program 479

The Great Jack Schitt writes "I've always heard that you couldn't decompile a program written with C++. This article describes how to do it. It's a bit lengthy and it doesn't seem like the author usually writes in English, but it might just work (haven't tried it, but will when I have time)."
This discussion has been archived. No new comments can be posted.

Famous Last Words: You can't decompile a C++ program

Comments Filter:
  • Why (Score:0, Insightful)

    by Bruha ( 412869 ) on Sunday May 25, 2003 @11:16AM (#6034954) Homepage Journal
    Why would you want to do this unless you were stealing source?

    I'd just leave it alone. ;)
  • You can't (Score:5, Insightful)

    by Anonymous Coward on Sunday May 25, 2003 @11:20AM (#6034975)
    Information is lost in compilation. You can never reconstruct the exact original source. You end up with valid C++ that has no more human-understandable information than the equivilent machine code.

    Like turning hamburgers into cows...
  • Why not? (Score:5, Insightful)

    by bazik ( 672335 ) <bazik AT gentoo DOT org> on Sunday May 25, 2003 @11:21AM (#6034981) Homepage Journal
    I've always heard that you couldn't decompile a program written with C++.

    Well, you can decompile every binary programm at least to assembler code, so why shouldnt it possible with C++?

    Maybe he ment "you can't decipher the source of a C++ programm" ;)
  • Re:Why (Score:5, Insightful)

    by Anonymous Coward on Sunday May 25, 2003 @11:24AM (#6035000)
    You need reasons?

    1) Finding backdoors
    2) Testing security
    3) Fixing bugs
    4) Adding features
    5) Discovering copyright violations
    6) Interfacing to non-supported clients

    Pretty much anything and everything you would do if you had the source.
  • Re:Why (Score:5, Insightful)

    by p4ul13 ( 560810 ) on Sunday May 25, 2003 @11:30AM (#6035032) Homepage
    You could be updating a program for your company for which the source is lost.
  • by truth_revealed ( 593493 ) on Sunday May 25, 2003 @11:31AM (#6035037)
    Sure you can decompile an optimized and symbol-stripped C++ program, but you'd never have it the original compact form of the source as you do with the Java class file decompilers due to the heavy use of inline functions and templates used in C++. A C program, sure, but decompiling C++ is not terribly useful.
  • by kingkade ( 584184 ) on Sunday May 25, 2003 @11:45AM (#6035107)
    dear anonymous c,

    please stop breathing and kill any offspring you may have inexplicably fathered for the sake of our gene pool. Thanks.

    Respectfully,
    -- Human Race
  • by pVoid ( 607584 ) on Sunday May 25, 2003 @11:53AM (#6035143)
    Neat tricks are generally either one of these three things:

    A hidden API call - which can be easily found via ASM listings

    A nice little algorithm - which can be found in comp sci books

    An elegant piece of code - which can *not* be decompiled from ASM

    So no, I disagree with you.

  • Re:Why (Score:2, Insightful)

    by Kadagan AU ( 638260 ) <kadagan@ g m a i l . com> on Sunday May 25, 2003 @11:55AM (#6035151) Journal
    I agree, that should have been modded insightful, not funny. We have a ton of in-house apps that we don't have source for anymore, and it would be really nice to be able to update them without having to rewrite the entire thing.
  • Re:You can't (Score:5, Insightful)

    by capnjack41 ( 560306 ) <spam_me@crapola.org> on Sunday May 25, 2003 @11:59AM (#6035174)
    And then on top of that, the compiler optimizes that code, so calculations are no longer the straightforward and intuitive things they used to be, now they're a series of out-of-order, smaller calculations that are harder to recognize. They're efficient as hell but barely reversible.

    I'll RTFA when it comes back to life :).

  • by sheetsda ( 230887 ) <doug@sheets.gmail@com> on Sunday May 25, 2003 @12:01PM (#6035183)
    There seem to be a lot of people in this story saying "shame on you for reverse engineering". It has its uses, how else would viruses, worms, and trojans be analyzed to figure out what they do and how they do it.
  • Re:hmm (Score:3, Insightful)

    by deranged unix nut ( 20524 ) on Sunday May 25, 2003 @12:01PM (#6035185) Homepage
    The problem is that there are quite a few people out there that assume that just because it is in binary form, that it can't be figured out. For example, they will use XOR to "encrypt" data stored inside the program, or assume that their secret algorithm is safe because it is compiled.

    The barrier to entry is definately raised, but it is always possible to figure out what the compiled code is doing given enough time and effort. In fact, I've even heard of people who patch operating system kernel code without the source...
  • Re:Why (Score:2, Insightful)

    by neonstz ( 79215 ) on Sunday May 25, 2003 @12:07PM (#6035206) Homepage

    What you need is a decent source control/backup system, not a decompiler.

  • Re:Why (Score:3, Insightful)

    by IamTheRealMike ( 537420 ) on Sunday May 25, 2003 @12:09PM (#6035218)
    This might be useful for when trying to make apps run in Wine. Occasionally disassembly is the only way to figure out why the app crashes light years away from the nearest API call etc.
  • by Call Me Black Cloud ( 616282 ) on Sunday May 25, 2003 @12:12PM (#6035230)
    ...trying to rebuild a wrecked sand castle just by looking at the grains of sand. You can't. Compilers throw away a lot of information needed by people but not necessary for the machine. Compilers optimize the code to run more efficiently and that's a one-way street. Sorry to burst your bubble but trying to reconstruct original source is like trying to herd cats.

    Thank you, thank you. I'm Mr. Metaphor and I'll be here all week.
  • Re:Why (Score:3, Insightful)

    by isorox ( 205688 ) on Sunday May 25, 2003 @12:19PM (#6035275) Homepage Journal
    shutting the barn door after the horse bolted?
  • Re:Why (Score:5, Insightful)

    by Lumpy ( 12016 ) on Sunday May 25, 2003 @12:28PM (#6035321) Homepage
    Why would you want to do this unless you were stealing source?


    nice try.

    You must be either Bill Gates, Steve Ballmer or someone who works for the BSA.

    How am I to tell if your close source program isn't full of my GPL code that you blatently stole and are trying to rob me blind by STEALING my IP? Being a closed source advocate as you seem to be you are for me trying to detect IP theft and the illegal STEALING of my code by PIRATES right?

    Ok, I'm going overboard to make my point... I have EVERY right to use tools in a good and legal way. Why not outlaw hammers as anyone can perform a very grisly and horrible murder with one... Or better yet only allow licensed contractors to have hammers! as we know that the unlicensed public is only going to do very ewvil things with tools!

    see my point now? A tool is exactly what it looks like.... a tool. it can be used for good and evil. and I dont have any respect for the self righteous like you condemning what I do before I even do it.

    people with attitudes like you are what cause all the pain and suffering in this world...... STOP IT!
  • by Wizard of OS ( 111213 ) * on Sunday May 25, 2003 @12:49PM (#6035422)
    Why do people keep thinking that decompilation is possible? In short: decompiling a computer program is solving the halting problem. Period.

    The long version: In a compiled computer program there is no distinction for either code or data. Every byte in memory can be data, but it can also be executed as valid computer code.

    Now, the catch is that during compilation, data and code are mixed in the resulting binary. For instance take the compilation of a 'case' statement. There are several ways of compiling a case:
    - you can write it as a list of IF's, which is perfectly fine decompilable
    - you can write it as a jump, based on the case expression.
    The fun part about the second possibility is that it's far more efficient, but it poses a problem: when decompiling this you have to know where the bounds of the case lie. What's the furthest jump that can be made? It's a jump based on a calculated value, so you should know which values are possible. But for that, you need to run the program, and more specifically, you must run all possible execution paths.

    This can be rewritten as the instance of the halting problem: can a computer find out for any program whether or not it will halt? It is proven that a computer program cannot be written to do this task. Neither can a computer program decompile any other computer program.
  • by Minna Kirai ( 624281 ) on Sunday May 25, 2003 @01:01PM (#6035505)
    When you think about it, the higher level the language is, the easier it should be to "decompile".

    No, no, no. This is both empirically untrue, (Do you see many ML or even C++ decompilers out there?) and theoretically insensible.

    The higher level a language is, the more changes there will be between the original source code and the assembly. Thus the more source data that will have been discarded by the original compiler, which is data the decompiler cannot reconstruct.

    The reason Java decompilers work relatively well is not because Java's a high-level language (it isn't, really), but because the output program is at such a high level! Instead of working from binary code, a Java decompiler gets a more presentable bytecode, packed with the names of classes and methods. (Also, because optimization of Java programs is supposed to happen after compilation at a JIT stage, the bytecodes won't be as obfuscated as the output of a normal C++ optimizing compiler)

    The closer the original source was to asm, the more the individual coder's style will be reflected in the asm

    When decompiling, the "individual coder's style" is exactly what you're trying to get!

    the more the obvious patterns the compiler uses every time for given constructs will be present.

    Good compilers don't use "obvious patterns". Their transformation functions are very sensitive, so a tiny change in the input source (expanding a loop from 3 times to 4) can cross an optimization threshold and totally change the appearance of the output.
  • by yerricde ( 125198 ) on Sunday May 25, 2003 @01:38PM (#6035713) Homepage Journal

    Now, the catch is that during compilation, data and code are mixed in the resulting binary.

    Not last time I checked. My compiler emits at least four segments in a compiled program: .text (program code), .rodata (initialized data marked as 'const'), .data (initialized data), and .bss (zero-filled data, which is run-length encoded). Segments .text and .rodata are also write-protected.

    Yes, there is a halting problem, but this isn't it. Segments make distinguishing code from data straightforward. I understand that a few programs make platform-specific API calls that write-enable .text would be harder to disassemble (and subsequently decompile), but do most user programs make such calls?

    Besides, even if the halting problem were relevant, the halting problem can be solved in a real computer, which has limited memory and is thus a linear bounded automaton [netaxs.com] rather than a Turing machine.

  • Re:You can't (Score:3, Insightful)

    by ryanr ( 30917 ) * <ryan@thievco.com> on Sunday May 25, 2003 @02:09PM (#6035879) Homepage Journal
    Who said the point of the exercise was to turn the code back into the original C++?
  • by johannesg ( 664142 ) on Sunday May 25, 2003 @02:16PM (#6035912)
    Your understanding of the consequences of the halting problem is incomplete. It is not a proof that it is impossible to determine of any given program whether or not it stops in finite time. It is merely a proof that there exists a class of programs for this determination cannot be made. However, there are also many programs for which it can easily be determined whether or not it stops in finite time, and the same thing is true for decompilation.

    Furthermore, there is nothing saying that it has to do a 100% perfect job. Decompilation is already accepted to be imprecise; using some common sense (intuition if you want) to fill in some gaps is not an invalid method.

    The problems that face decompilation that stem from real-world issues are far, far greater than this (rather theoretical) problem. For example, decompiling any STL-based source to a useful state will be far more difficult than a simple jumptable.

  • by Anonymous Coward on Sunday May 25, 2003 @02:31PM (#6035991)
    You are assigning too much "magic" to the process of compilation.

    Compilation is the process of converting one defined stream of data into another defined stream of data based on a ruleset. The ruleset usually discards useful meta information, like comments, but the core information is retained and the process is, for the most part, predictable--feeding the same source to the same compiler with the same settings will yield identical results as far as this discussion is concerned.

    > It is proven that a computer program cannot be
    > written to do this task

    The simplicity of the ruleset for early versions of VB allowed for the proliferation of VB decompliers in the early 90's. That must tell you something.
  • by p3d0 ( 42270 ) on Sunday May 25, 2003 @02:34PM (#6036003)
    I haven't seen any comments on the linked "book" itself yet. In short: it sucks hard. Go take a look and try not to laugh.
  • Re:Why not? (Score:3, Insightful)

    by GlassHeart ( 579618 ) on Sunday May 25, 2003 @02:44PM (#6036048) Journal
    you can decompile every binary programm at least to assembler code

    No. Assuming we're talking about software disassemblers here, not every program can be reliably disassembled. Disassemblers work by mainly following the execution paths of already disassembled code, so that it knows exactly where a subroutine begins. In many instruction sets, instructions have variable length, and not starting your decoding on the right byte will be a big mistake that cascades on to the next instructions. Now, knowing this, all we have to do is to change the execution path without the disassembler knowing. A function pointer (address loaded at run-time) already presents a serious problem to a disassembler, but simply asking the user to enter the instruction address to jump to will completely defeat the automatic disassembler. There's no way for the disassembler to know what the user will enter, and hence where the program will go to next.

    Humans will still be able to disassemble your program, of course. However, you still won't get the original assembly source back. Assembly languages usually support macros and pseudo-instructions that improve readability, but have no correspondence in assembled form.

  • by Anonymous Coward on Sunday May 25, 2003 @03:32PM (#6036230)
    Ah, the joys of being out of the academic enviornment. You can ignore the people that tell you the things you commonly do are impossible.
  • Re:Why (Score:4, Insightful)

    by i_am_nitrogen ( 524475 ) on Sunday May 25, 2003 @05:32PM (#6036808) Homepage Journal
    One really good reason I haven't seen mentioned yet is writing a Linux driver for a piece of hardware only supported in Windows, such as the DXR3/Hollywood+ [sourceforge.net] or the MyHD/WinTV-HD/etc [sourceforge.net]. For these projects where the hardware manufacturers either can't or won't offer any help, the only way to support the hardware is by disassembling the Windows driver and figuring out the algorithms used by reading the disassembly and/or watching the interactions between the driver and the code. Fortunately for the MyHD driver project, the MyHD software is distributed without any EULA.

    BTW: Nice job getting all those responses with two lines...
  • Re:You can't (Score:5, Insightful)

    by jkorty ( 86242 ) on Sunday May 25, 2003 @06:06PM (#6036983) Homepage
    Information is lost in compilation. You can never reconstruct the exact original source

    So what? Doing reasonable interpolations in context is what brains are for. Example: IIRC, when the Morris Worm appeared in 1989, Gene Spafford examined the binary and reverse-engineered the C code, sprinkling it with meaningful comments and good variable and function names. When the original source became available, his turned out to be cleaner program than the original. That is, he not only recreated the original in every way that counts, he overshot and did better than the original

  • by BagMan2 ( 112243 ) on Monday May 26, 2003 @02:02AM (#6039033)
    It's relatively easy to come up with the set of C statements that would mimick a particular set of asm statements that you wish to decompile, but the end result would be a C program that was not much easier to read to the original asm was. Changing various assembly operations into C operations does get you back the information you really need.

    The symbolic names make up the bulk of the lost information, but often times programmers will organize a sequence of code in a certain way to make it easier to understand. The compiler will often rearrange that code in a manner that makes it easier for the computer to understand. Compilers will do screwy things like increment a variable on the stack, while holding the original in a register for later usage. Where the original C code might have had the variable increment at the end like this:

    while (x 10)
    { // do a bunch of stuff using x
    x++;
    }

    The way the compiler optimizes register usage may cause the assembly to actually increment x just after doing the conditional, then hold the non-incremented value in a register for use down below. The decompiled asm might look like this:

    while (x 10)
    {
    int temp = x;
    x++; // do stuff with temp
    }

    While this may seem like a trivial difference in the C code, it can often distort the intent of the algorithm. When a C programmer sees a construct like the latter, they naturally assume that the temp variable was used because more natural constructs would not. The C programmer then wastes time mulling it over only to discover that it was just dumb.

    I am currently on a project where I am maintaining some pretty poorly written code. I can't tell you how much time I waste looking at a particularly ugly algorithm trying to figure out why they are doing all these screwy things, only to discover they were just idiots.

    My point is, that the compiler and optimizer are going to mangle the logical order of the code in such a manner that it will be far more difficult to read.

    Like I said at the beginning, simple translation of assembly to C is easy, getting back the meaning that gives the endeavor any value at all is much more difficult.
  • Re:You can't (Score:2, Insightful)

    by kevquinn ( 563706 ) on Monday May 26, 2003 @04:34PM (#6041773) Homepage
    It is perfectly possible to reverse-engineer a meaningful source from a given binary. It's certainly not easy, and of course you won't end up with the same variable names etc (unless the author kindly left in heaps of debug symbols etc), but that hardly matters. The point is that it is possible. Even templates are possible to decompile, given enough incentive; after all it's just fancy pattern matching.

    With regards the original article - well, that was a bunch of obvious guff really; what you'd expect from high-school geeks of the type I was, some number of years ago. Of note, is that it claimed to decompile C++, when actually it talked only of rather trivial C constructs, something that is a well understood practice already.

    Some relatively recent classic decompilation work was done by Cristina Cifuentes [uq.edu.au] who put together a C decompiler that worked to a significant degree for common DOS-based compilers of the time. Effectively the job of "decompilation" can be thought of as "compilation" - instead of compiling C into ASM, you think of compiling ASM into C. Not as daft as it sounds, honest. You can download "dcc" from the above site to investigate further.

    Boomerang [sourceforge.net] is a sourceforge project attempting to create a decompiler. Worth a look, as well.

    It's worth noting, that there are a number of ways to "cheat". For example, it's often trivial to discover what compiler was used to generate a given object code, and there are usually masses of common library-type code that gives you a leg up. Add to that, the fact that a piece of code was generated by a compiler, and the problem of discovering what a given piece of object code does is drastically simplified - compilers add huge amounts of structure and predictability to the generated object code that can be absent in free-form handwritten assembler (and few people do that anymore!), and much can be made of this.

    On the code/data issue mentioned by others in this thread - although separating code/data in general from mixed binaries can be considered hard, in reality it's often quite feasible and even simple. After all, the CPU manages to work it out. Again, the fact that there are so many short-cuts you can take really helps.

    Of course, a quick cruise around the cracking community will turn up all sorts of ways and means to shortcut this sort of problem...

    Here are the results of a quick googling [google.com]:

I have hardly ever known a mathematician who was capable of reasoning. -- Plato

Working...