Famous Last Words: You can't decompile a C++ program

The Great Jack Schitt writes "I've always heard that you couldn't decompile a program written with C++. This article describes how to do it. It's a bit lengthy and it doesn't seem like the author usually writes in English, but it might just work (haven't tried it, but will when I have time)."
This discussion has been archived. No new comments can be posted.

  • You can't (Score:5, Insightful)

    by Anonymous Coward on Sunday May 25, 2003 @11:20AM (#6034975)
    Information is lost in compilation. You can never reconstruct the exact original source. You end up with valid C++ that has no more human-understandable information than the equivalent machine code.

    Like turning hamburgers into cows...
    • by Morologous ( 201459 ) * on Sunday May 25, 2003 @11:24AM (#6035002)

      Like turning hamburgers into cows...

      I'm going to use that line.
    • Re:You can't (Score:2, Informative)

      by jezzgoodwin ( 675518 )
      He's quite right.

      Take a sum within a program, for example (a+b)=1000 ... now there are infinite possible combinations of what a and b can be ... but without the correct variable names, or the commenting that went along with the code (assuming there was some) ... the decompiled output is going to be pretty much useless / extremely difficult to understand
      • Re:You can't (Score:5, Insightful)

        by capnjack41 ( 560306 ) <spam_me@crapola.org> on Sunday May 25, 2003 @11:59AM (#6035174)
        And then on top of that, the compiler optimizes that code, so calculations are no longer the straightforward and intuitive things they used to be, now they're a series of out-of-order, smaller calculations that are harder to recognize. They're efficient as hell but barely reversible.

        I'll RTFA when it comes back to life :).
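To make the optimization point concrete, here is a minimal sketch of how a simple calculation can stop looking like itself. The function names are made up for illustration, and the "lowered" forms are the kind of rewrite an optimizer may choose, not the output of any particular compiler.

```cpp
#include <cassert>
#include <cstdint>

// What the programmer wrote: the intent is obvious.
std::uint32_t times_ten(std::uint32_t x) { return x * 10u; }
std::uint32_t one_third(std::uint32_t x) { return x / 3u; }

// Roughly what an optimizer may emit instead (illustrative only).
std::uint32_t times_ten_lowered(std::uint32_t x) {
    return (x << 3) + (x << 1);  // 8x + 2x, wrapping like x * 10u
}

std::uint32_t one_third_lowered(std::uint32_t x) {
    // Multiply by ceil(2^33 / 3) and shift right by 33:
    // exact for every 32-bit unsigned input.
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(x) * 0xAAAAAAABull) >> 33);
}

// Spot-check that both forms agree on a few inputs.
void check_lowered_forms() {
    const std::uint32_t samples[] = {0u, 1u, 2u, 3u, 1000u, 123456789u, 0xFFFFFFFFu};
    for (std::uint32_t x : samples) {
        assert(times_ten(x) == times_ten_lowered(x));
        assert(one_third(x) == one_third_lowered(x));
    }
}
```

A decompiler that sees only the second pair has to recognize the shift-and-add and the magic constant before it can print `x * 10` or `x / 3` again.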

      • Considering some of the code I've had to support, I could probably deal with it.

        As opposed to code by authors from the school of copy & paste, who don't include comments, and are generally confused as to what they are trying to do, I'll take the decompiled code that actually works but needs commenting.
    • by NewbieProgrammerMan ( 558327 ) on Sunday May 25, 2003 @11:38AM (#6035076)
      Heh. You're assuming that you're attempting to decompile something that had human-understandable source to start with. :)
    • Re:You can't (Score:5, Informative)

      by antis0c ( 133550 ) on Sunday May 25, 2003 @11:46AM (#6035116)
      What's to say you need something as readable as the original? I worked at InterAct Accessories/GameShark for a few years before they went under, essentially as a 'reverse engineer'. Without getting yet another C&D from them in the mail due to a post on Slashdot (I don't even think they could send one now that they're out of business?), all I can say is that sometimes when hacking a game it benefits an engineer to decompile the application and be able to set breakpoints and watch execution flow while the game is running on, for example, a PlayStation 2. Sure, it's going to be a lot of nearly unreadable C++ mixed with assembly, but if you can watch the execution flow as you do something, it can be useful.

      Of course a lot of naive people think decompiling would allow you to take an application and start writing patches for it, in that case you are right, it's going to be pretty useless. However it's not entirely useless for all situations. I'm sure the WINE guys might get some use out of it.
    • Not to mention that no two C++ compilers can even agree on how to compile C++ code.

      The article is slashdotted, so I couldn't read it, but I would think that C++ would be extremely difficult to decompile because of the use of inlined functions, and what would you do with templates? Also, I don't see how a class could be recreated from a binary. I would be more likely to believe that a C++ binary could be decompiled into (ugly) C code, but not necessarily C++ code.
    • Re:You can't (Score:3, Insightful)

      by ryanr ( 30917 ) *
      Who said the point of the exercise was to turn the code back into the original C++?
    • Re:You can't (Score:5, Insightful)

      by jkorty ( 86242 ) on Sunday May 25, 2003 @06:06PM (#6036983) Homepage
      Information is lost in compilation. You can never reconstruct the exact original source

      So what? Doing reasonable interpolations in context is what brains are for. Example: IIRC, when the Morris Worm appeared in 1988, Gene Spafford examined the binary and reverse-engineered the C code, sprinkling it with meaningful comments and good variable and function names. When the original source became available, his turned out to be a cleaner program than the original. That is, he not only recreated the original in every way that counts, he overshot and did better than the original.

  • Oop (Score:5, Funny)

    by Suffering Bastard ( 194752 ) * on Sunday May 25, 2003 @11:21AM (#6034977)

    it doesn't seem like the author usually writes in English

    Surely he now understands the English infinitive "to be Slashdotted".

  • Why not? (Score:5, Insightful)

    by bazik ( 672335 ) <bazik&gentoo,org> on Sunday May 25, 2003 @11:21AM (#6034981) Homepage Journal
    I've always heard that you couldn't decompile a program written with C++.

    Well, you can decompile every binary program at least to assembler code, so why shouldn't it be possible with C++?

    Maybe he meant "you can't decipher the source of a C++ program" ;)
    • Re:Why not? (Score:3, Insightful)

      by GlassHeart ( 579618 )
      you can decompile every binary program at least to assembler code

      No. Assuming we're talking about software disassemblers here, not every program can be reliably disassembled. Disassemblers work mainly by following the execution paths of already disassembled code, so that they know exactly where a subroutine begins. In many instruction sets, instructions have variable length, and not starting your decoding on the right byte is a big mistake that cascades on to the following instructions. Now, knowing thi
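The misalignment problem described above can be shown with a toy example. This is a sketch under loud assumptions: the three-opcode instruction set, `lookup` and `linear_sweep` are all hypothetical and exist only for this illustration. The same bytes decode to two different listings depending on the starting offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy instruction set: the opcode byte fixes the instruction length.
struct OpInfo {
    const char* name;
    std::size_t length;
};

OpInfo lookup(std::uint8_t opcode) {
    switch (opcode) {
        case 0x01: return {"NOP", 1};
        case 0x02: return {"LOAD", 2};  // opcode + 1 operand byte
        case 0x03: return {"JMP", 3};   // opcode + 2 operand bytes
        default:   return {"???", 1};   // unknown byte, skip it
    }
}

// Decode straight through the buffer, beginning at `start`.
std::vector<std::string> linear_sweep(const std::vector<std::uint8_t>& code,
                                      std::size_t start) {
    assert(start <= code.size());
    std::vector<std::string> listing;
    std::size_t pc = start;
    while (pc < code.size()) {
        const OpInfo op = lookup(code[pc]);
        if (pc + op.length > code.size()) {
            listing.push_back("TRUNCATED");  // operands run off the end
            break;
        }
        listing.push_back(op.name);
        pc += op.length;
    }
    return listing;
}
```

With the bytes `02 03 01 01 02 01`, starting at offset 0 gives LOAD, NOP, NOP, LOAD, while starting at offset 1 gives JMP, LOAD. Both listings are "valid", and nothing in the bytes themselves says which is right.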

    • Why not indeed. (Score:3, Informative)

      by fishexe ( 168879 )
      Well, you can decompile every binary program at least to assembler code, so why shouldn't it be possible with C++?

      There's a huge difference between disassembling and decompiling. With assembly, you generally have a 1 to 1 correspondence between machine language instructions and assembly instructions. That is, one specific instruction you feed to the assembler becomes one specific assembled instruction. Sometimes it's more complicated than this, but only slightly.

      Now look at c, where one line of code cou
    • Re:Why not? (Score:3, Informative)

      This is such a grossly misinformed statement, I don't even know where to begin. Assembler and machine language ("binary") are semantically identical. You can go back and forth from assembler to machine code all day and still have the same thing. All you lose when going from human/compiler generated (vs disassembled machine code) is labels and comments.

      With C++ or any high-level language, there are zillions of ways a compiler might interpret the code - just as long as the machine code effectively does what the
  • hmm (Score:5, Informative)

    by Graspee_Leemoor ( 302316 ) on Sunday May 25, 2003 @11:21AM (#6034984) Homepage Journal
    A c/c++ decompiler that totally worked would be the Holy Grail of crackers. Unfortunately it is actually impossible to get everything back because lots of info is lost on compilation.

    Nevertheless there are tools out there that attempt to decompile programs; I think of them more as ways of making assembly more readable.

    Note, a lot of them wouldn't work on hand-written assembly, because they rely on knowledge of how certain compilers compile various things, e.g. there was a Delphi decompiler available.

    graspee

    • Re:hmm (Score:3, Insightful)

      The problem is that there are quite a few people out there that assume that just because it is in binary form, that it can't be figured out. For example, they will use XOR to "encrypt" data stored inside the program, or assume that their secret algorithm is safe because it is compiled.

      The barrier to entry is definitely raised, but it is always possible to figure out what the compiled code is doing given enough time and effort. In fact, I've even heard of people who patch operating system kernel code with
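For the XOR "encryption" case mentioned above, a minimal sketch of why it offers so little: with a single-byte key there are only 256 candidates, and a guessed fragment of the plaintext (a "crib") is enough to pick the right one. The helper names `xor_bytes` and `recover_key` are invented for this example, and a single-byte key is an assumption; the comment does not say what key length people actually use.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// XOR every byte with the same key; applying it twice restores the input.
std::string xor_bytes(const std::string& data, std::uint8_t key) {
    std::string out = data;
    for (char& c : out) {
        c = static_cast<char>(static_cast<std::uint8_t>(c) ^ key);
    }
    return out;
}

// Try all 256 keys and return the first one whose output contains `crib`,
// or -1 if none does.
int recover_key(const std::string& obfuscated, const std::string& crib) {
    assert(!crib.empty());
    for (int key = 0; key < 256; ++key) {
        const std::string guess = xor_bytes(obfuscated, static_cast<std::uint8_t>(key));
        if (guess.find(crib) != std::string::npos) {
            return key;
        }
    }
    return -1;
}
```

Anyone who pulls the obfuscated bytes out of the binary and can guess a few characters of the contents gets the key back with a loop like this.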
    • Re:hmm (Score:5, Interesting)

      by jackb_guppy ( 204733 ) on Sunday May 25, 2003 @12:55PM (#6035461)
      I wrote reverse compilers on IBM midrange equipment, where there are no stacks and self-modifying code is VERY commonplace. It is easy to do:

      Create a program that performs / understands the opcodes for the processor and addressing. And it follows both sides of a branch.

      Now "run" the program, which maps out all the opcode and data areas.

      Once done, look at the assembler equivalent and map out common subroutines and function calls. Data storage becomes very clear. Lastly, common storage will show external and internal common structures, so naming of fields becomes possible.

      It is straightforward, can be time consuming, and is very helpful for understanding hidden or lost information.
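The "follow both sides of a branch" step above can be sketched roughly like this. Everything here is an assumption made for illustration: the four-opcode toy machine, the one-byte absolute targets, and the name `trace_code`. It does not model self-modifying code, which the real tools described above had to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical toy opcodes; JMP and BRZ take a one-byte absolute target.
enum Opcode : std::uint8_t { NOP = 0x01, JMP = 0x02, BRZ = 0x03, RET = 0x04 };

// Walk every reachable path from `entry` and return the offsets where
// instructions start. Bytes never reached are presumed to be data.
std::set<std::size_t> trace_code(const std::vector<std::uint8_t>& image,
                                 std::size_t entry) {
    assert(entry < image.size());
    std::set<std::size_t> starts;
    std::vector<std::size_t> pending{entry};
    while (!pending.empty()) {
        const std::size_t pc = pending.back();
        pending.pop_back();
        if (pc >= image.size() || starts.count(pc) != 0) continue;
        switch (image[pc]) {
            case NOP:
                starts.insert(pc);
                pending.push_back(pc + 1);
                break;
            case RET:
                starts.insert(pc);  // path ends here
                break;
            case JMP:
            case BRZ:
                if (pc + 1 >= image.size()) break;  // operand missing
                starts.insert(pc);
                pending.push_back(image[pc + 1]);   // taken side
                if (image[pc] == BRZ) pending.push_back(pc + 2);  // fall-through side
                break;
            default:
                break;  // unknown opcode: stop following this path
        }
    }
    return starts;
}
```

For the image `03 05 01 04 FF 02 03 2A` traced from offset 0, the instruction starts come out as 0, 2, 3 and 5, which leaves offsets 4 and 7 as data.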
    • A friend of mine work(ed) with a company in Kingston, ON that was spun off from Queens University. Their sole purpose and business model is to take whatever binaries and source a company has available, run it through their cluster of analysis systems, and produce a "clean" update of the system. As per usual, there is about 10-15% of the produced code that needs some hand inspection and tweaking to complete the task.

      Their "big" business was the Y2K work, as their software isn't limited to just reverse-e

  • Slashdot has DDoS'ed the damn thing into oblivion.

    On the other hand, did anyone get to mirror it?
  • by Anonymous Coward on Sunday May 25, 2003 @11:24AM (#6034997)
    but it'll look like this

    class a
    {
    public:
    void b(int c);
    void d(int e);
    private:
    int g;
    int h;
    };

    int main()
    {
    a f;
    f.b(23);

    int x; x=0; x++;
    if(x > 3) goto j;
    f.d(x); x++;
    if(x > 3) goto j;
    f.d(x); x++;
    if(x > 3) goto j;
    f.d(x);
    j: f.b(42);

    return 0;
    }

  • Yeah, but they should know how to decompile the Slashdot effect first... another one down. Anybody with a mirror or Google Cache link?
  • by truth_revealed ( 593493 ) on Sunday May 25, 2003 @11:31AM (#6035037)
    Sure you can decompile an optimized and symbol-stripped C++ program, but you'd never get it back in the original compact form of the source, as you do with the Java class file decompilers, due to the heavy use of inline functions and templates in C++. A C program, sure, but decompiling C++ is not terribly useful.
  • by 1nv4d3r ( 642775 ) on Sunday May 25, 2003 @11:41AM (#6035091)
    Hell, I'd be happy if the people working for me could consistently compile their c/c++. I need a new job...


  • When you think about it, the higher level the language is, the easier it should be to "decompile". The closer the original source was to asm, the more the individual coder's style will be reflected in the asm - the higher level it is, the more the obvious patterns the compiler uses every time for given constructs will be present. Reverse engineering a program written in asm to human readable source is a nightmare, but if you knew for instance that the source was C++ and it was compiled by gcc 3.2 (easy eno
    • all modern compilers are optimizing compilers, and they reorganize code completely to suit themselves in the most efficient manner. The compiler will reorganize modules and rewrite lines of code in order to make better use of registers, processor features/limitations that
      You cannot really see a programmer's style as a result. When you decompile, you'll get it returned as whatever the compiler shifted the code around as.


      • Exactly, you backed up my point while trying not to :)

        The fact that you cannot see the programmer's style, only the compiler's style, is what makes decompiling source much easier. It's easier to learn the thinking patterns of the compiler by observing its output in various cases than it is to write software that can guess random human patterns.
    • Dude.....C is practically ASM. C is *not* a "high level" language according to most theoretical professors. (c.f. Yale's Stan Eisenstat, Arvind Krishnamurthy, Zhong Shao, Richard Yang, Columbia's Belhauser, Harvard's Smith...) C _is_ readable asm code......
      have you ever taken a compilers course?

      having written a compiler for a toy language (tiger) [google for princeton professor appel's "tiger" language and his collaboration with z. shao, who implemented the heap-activation in SML-NJ....] i can assure y

      • Where did I say C was a high level language? I used C++ as a reference because like it or not, it is high enough level to have its own structure. I didn't use C because I well understand that C is basically portable assembler.

        Beyond the current OOP language like C++ and Java, the only things higher level are the toy languages for braindead programmers (think VB, Delphi, FoxPro, etc) - and the various real attempts at 4GL, which never seem to work right for general cases, but can be useful in application
    • When you think about it, the higher level the language is, the easier it should be to "decompile".

      No, no, no. This is both empirically untrue, (Do you see many ML or even C++ decompilers out there?) and theoretically insensible.

      The higher level a language is, the more changes there will be between the original source code and the assembly. Thus the more source data that will have been discarded by the original compiler, which is data the decompiler cannot reconstruct.

      The reason Java decompilers work
  • Speculation Code (Score:5, Informative)

    by Davak ( 526912 ) on Sunday May 25, 2003 @11:44AM (#6035104) Homepage
    Considering the entire post is evidently based on speculation...

    Here is some code [planet-source-code.com] that supposedly decompiles... not that I've tried it.

    Quote from the FAQ [cs.uu.nl]:


    [35.4] How can I decompile an executable program back into C++ source code?

    You gotta be kidding, right?

    Here are a few of the many reasons this is not even remotely feasible:
    * What makes you think the program was written in C++ to begin with?
    * Even if you are sure it was originally written (at least partially) in C++,
    which one of the gazillion C++ compilers produced it?
    * Even if you know the compiler, which particular version of the compiler was
    used?
    * Even if you know the compiler's manufacturer and version number, what
    compile-time options were used?
    * Even if you know the compiler's manufacturer and version number and
    compile-time options, what third party libraries were linked-in, and what
    was their version?
    * Even if you know all that stuff, most executables have had their debugging
    information stripped out, so the resulting decompiled code will be totally
    unreadable.
    * Even if you know everything about the compiler, manufacturer, version
    number, compile-time options, third party libraries, and debugging
    information, the cost of writing a decompiler that works with even one
    particular compiler and has even a modest success rate at generating code
    would be significant -- on the par with writing the compiler itself from
    scratch.

    But the biggest question is not how you can decompile someone's code, but why
    do you want to do this? If you're trying to reverse-engineer someone else's
    code, shame on you; go find honest work. If you're trying to recover from
    losing your own source, the best suggestion I have is to make better backups
    next time.

    I would have posted AC but they have me blocked out for some reason...


    Davak

    • thanks for nothing. (Score:4, Interesting)

      by twitter ( 104583 ) on Sunday May 25, 2003 @12:22PM (#6035294) Homepage Journal
      If you're trying to reverse-engineer someone else's code, shame on you; go find honest work.

      Shame on you Davak, you should go find honest code. There's nothing wrong with trying to understand how things work. Some people are stuck with legacy equipment or code they can't replace easily and this is their only option for improvement or even fixing it. Those people would be better off if free code were available. Sometimes the only way to make that free code is to understand the original code. There's nothing wrong with reverse engineering software, ever. Republishing someone else's binary is not legal, but it's not immoral. If the code were honest to begin with, the reverse engineer part would not be required. These days, it's cheaper to throw out the dis-honest code and hardware and buy some hardware that's well understood. If you make hardware or software, I hope you understand the implications for your product - I'm not buying it.

      • If you're trying to reverse-engineer someone else's code, shame on you; go find honest work.

        Shame on you Davak, you should go find honest code ...

        If you read carefully, you'll note that the "honest work" sentence is NOT Davak's. It is still indented as part of the blockquote, and therefore is the final section of the passage he was quoting from that C++ FAQ. The last sentence that is actually Davak's is his comment about wishing to post as an anonymous coward, presumably to avoid situations like this o

  • by SharpFang ( 651121 ) on Sunday May 25, 2003 @11:46AM (#6035117) Homepage Journal
    Well, it isn't. Sure, if you're so lazy you want to have source rebuilt from binaries with one click, complete with comments, makefile and documentation, that's of no use. But imagine the program does some very clever trick. Something you ooh about: "How the hell does he do that? It's impossible!" You want to include that trick in your code. You need it. So you have three options: 1) Try to design it from scratch. Helluva work, you don't know where to start. 2) Look into the binary. If you're an ASM guru, you MAY succeed. But ASM from high-level languages is hell to read. 3) Decompile the puppy, look for that piece through what looks like piles of junk, but is way more readable than ASM, and find it. Then just rewrite it in pretty fashion, changing variable names and functions to your needs and include in your own software. It's "the best of the worst", last resort at finding a solution to a small problem. Not a way to edit the source and add a single feature to the original program, like removing print protection from Acrobat Reader. The decompiled program most probably won't be possible to compile. You won't make a cow from hamburgers. But with some luck you may find out the cow was a bull and got killed by a truck.
    • by pVoid ( 607584 ) on Sunday May 25, 2003 @11:53AM (#6035143)
      Neat tricks are generally either one of these three things:

      A hidden API call - which can be easily found via ASM listings

      A nice little algorithm - which can be found in comp sci books

      An elegant piece of code - which can *not* be decompiled from ASM

      So no, I disagree with you.

    • Quoteth the original poster

      Then just rewrite it in pretty fashion, changing variable names and functions to your needs and include in your own software. It's "the best of the worst", last resort at finding a solution to a small problem.

      And exposes you to possible trade secret and copyright infringement claims.

      Really, if you know somebody else can take input "a", do "something magical" with it, and get output "b", are you really willing to admit that they are smarter than you?

    • 4) e-mail the person who wrote it and ask how the trick works.



      -m

    • If you are a competent well-educated programmer, why would you need to take somebody else's code when you could simply design and implement your own solution in as much time as a reverse-engineering effort would probably take?

      Code or pseudocode is available free for many thousands of tough algorithmic problems which have been studied and published in the literature (e.g. Knuth et al) which is to be found in most good university libraries and/or the Internet.

      • Let me give you an example. A while ago I was trying to figure out how the Intel C++ compiler called global constructors during program initialization. This was before the standard x86 C++ ABI, so this wasn't documented anywhere. The only way to figure out how things were working was to wade through ASM listings and hex dumps. Also, there are lots of places where reverse engineering is absolutely crucial. For example you can sometimes figure out the protocol to a device by analyzing the binary code of the d
  • by Anonymous Coward

    I've done some reverse-engineering on programs written in C/C++ (Intel x86). After a while you learn how to recognize different things like virtual function calls, while/for-loops, switch and stuff like that. However, it's a totally different thing to decompile to C++. It may be possible to decompile compiled code to C, but don't expect that it will look much like the original source, especially if the code was optimized by the compiler :)

  • Templates (Score:5, Informative)

    by ucblockhead ( 63650 ) on Sunday May 25, 2003 @11:58AM (#6035167) Homepage Journal
    He won't be able to regenerate any templates. If a program makes heavy use of templates, the "C++" he "decompiles" to is going to be hideously ugly.

    [insert joke about it being hideously ugly with templates here.]

    (I did not read the article itself because it is, of course, slashdotted)

    • You could probably guess. Worst case, the decompiled version is refactored to be cleaner than the original, which doesn't sound all that bad to me. :)
  • Java Decompiler? (Score:3, Interesting)

    by mindstrm ( 20013 ) on Sunday May 25, 2003 @12:00PM (#6035181)
    Anyone recommend a java decompiler known to work on the most recent versions of java, properly?

    Something that will literally give me code I can re-compile immediately?
    • You want JAD [geocities.com]

      And for everyone that whines about "Oh, the decompiled code doesn't have pretty names...!" Who cares? You can puzzle through. Say some method in your app server is throwing a NullPointerException... "well, where in the method could that be happening... decompile, put some debug here, and here... ah, that's weird, it needs this obscure session variable, how did that go missing?" Now isn't that better than screaming "GODDAMN IT WHY DOES THIS CRAP KEEP BREAKING!!" and distressing your co-worke
      • JAD is a godsend. I wrote a very complex optimization method that was extremely effective in a couple of circumstances. A couple of years later, those circumstances turn up again only in a different language. I can't find the source code anywhere, just the class file that had my great method in it. So, JAD comes to the rescue; it gave me a bunch of code that used d1,d2,d3,... as my variables, but I already had a basic understanding of what each variable's role was, so it wasn't a problem for me to reverse-eng
    • Use JAD [tripod.com]. It's the best one for Java. If you want a decent GUI front end, get DJ Java Decompiler [fortunecity.com].
  • by sheetsda ( 230887 ) <<doug.sheets> <at> <gmail.com>> on Sunday May 25, 2003 @12:01PM (#6035183)
    There seem to be a lot of people in this story saying "shame on you for reverse engineering". It has its uses: how else would viruses, worms, and trojans be analyzed to figure out what they do and how they do it?
  • by crovira ( 10242 ) on Sunday May 25, 2003 @12:06PM (#6035202) Homepage
    not the source's lies.

    Losing source code and var names (name spaced globals aka statics and scoped locals) allows the cracker (these are rarely hacking tools, they're mostly cracking tools,) to focus on what the machine actually was told to do instead of smothering it with shades of meaning which interfere with understanding the code.

    C++ or Java or Smalltalk, or almost any highly structured language using machine code libraries or virtual machines, results in structured blocks of code and heap and stack allocation.

    A good decompiler can take the machine code, peel away the name spaces and code calls, extract the patterns in the code and the hacker/cracker can read the patterns instead of wasting time on the code.

    Forensic analysis work is extremely useful at telling you what happened when something dies but it is no good at telling you how something worked. For that you need code traces.

    Map those code traces onto the structure the decompiler reveals and you understand the program better than the authors/coders.
    • by Fnkmaster ( 89084 ) on Sunday May 25, 2003 @01:12PM (#6035577)
      Neo: Do you always look at it in binary?


      Cypher: Well you have to. The compilers work for the construct program. But there's way too much information to decode the Matrix. You get used to it. I...I don't even see the code. All I see is an array, function pointer, integer. Hey, you uh... want a drink?


      Neo: Sure.


      Cypher: You know, I know what you're thinking, because right now I'm thinking the same thing. Actually, I've been thinking it ever since I got here. Why, oh why didn't I sell my VA Linux stock?... Good shit, huh? Cowboy Neal makes it. It's good for two things, degreasing Perl code and killing brain cells.

  • by pchown ( 90777 ) on Sunday May 25, 2003 @12:07PM (#6035205)
    You might decompile one file and find a comment like this at the top:

    * This program is free software; you can redistribute it and/or
    * modify it under the terms of the GNU General Public License
    * as published by the Free Software Foundation; either version
    * 2 of the License, or (at your option) any later version.
    ;-)
  • misleading... (Score:4, Informative)

    by bismarck2 ( 675710 ) on Sunday May 25, 2003 @12:12PM (#6035229)
    Even with complete original source code, understanding a non-trivial C++ application is very difficult. Source derived from an optimized executable is going to be a LOT rougher. No real function names, module names, variable names, or comments. Use of standard libraries (STL, MFC, Boost) is likely highly obscured as well. A tool like this would probably produce source that looks more like a C/machine language hybrid rather than normal C++. The primary use of something like this is if you are looking for a very specific piece of logic such as a password check or an encryption operation or protocol details. When were these famous last words anyway?
  • by Call Me Black Cloud ( 616282 ) on Sunday May 25, 2003 @12:12PM (#6035230)
    ...trying to rebuild a wrecked sand castle just by looking at the grains of sand. You can't. Compilers throw away a lot of information needed by people but not necessary for the machine. Compilers optimize the code to run more efficiently and that's a one-way street. Sorry to burst your bubble but trying to reconstruct original source is like trying to herd cats.

    Thank you, thank you. I'm Mr. Metaphor and I'll be here all week.
    • Thank you, thank you. I'm Mr. Metaphor and I'll be here all week.

      Calling yourself Mr. Metaphor is like using metaphor instead of analogy, which, in your case, is as incorrect as a cow marking its territory with cow pies and instituting an elaborate cow-tipping territory defense program.

  • In Europe it is legal to use reverse engineering for compatibility reasons, enabling your software to work with other people's software (mainly Microsoft)

    If you do the reverse engineering in Europe you could develop compatible software and then export it to the US. So it may be great news for us. In fact it is becoming really complicated to develop software for or in the US: patents, legislation, compatibility. It seems that more lawyers than programmers are needed to write anything more complicated than HelloWorld.
  • by Wizard of OS ( 111213 ) * on Sunday May 25, 2003 @12:49PM (#6035422)
    Why do people keep thinking that decompilation is possible? In short: decompiling a computer program is solving the halting problem. Period.

    The long version: In a compiled computer program there is no distinction for either code or data. Every byte in memory can be data, but it can also be executed as valid computer code.

    Now, the catch is that during compilation, data and code are mixed in the resulting binary. For instance take the compilation of a 'case' statement. There are several ways of compiling a case:
    - you can write it as a list of IFs, which is perfectly decompilable
    - you can write it as a jump, based on the case expression.
    The fun part about the second possibility is that it's far more efficient, but it poses a problem: when decompiling this you have to know where the bounds of the case lie. What's the furthest jump that can be made? It's a jump based on a calculated value, so you should know which values are possible. But for that, you need to run the program, and more specifically, you must run all possible execution paths.

    This can be rewritten as an instance of the halting problem: can a computer find out for any program whether or not it will halt? It is proven that a computer program cannot be written to do this task. Neither can a computer program decompile any other computer program.
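The two ways of compiling a 'case' described above can be modeled in source form. This is only a sketch: the handler names and return values are invented, and real compilers emit the table as raw addresses rather than a C++ array. One detail worth noting is that compilers commonly emit a range check in front of the indexed jump, and that check is what a decompiler can look for when trying to bound the table.

```cpp
#include <cassert>

// Form 1: a chain of comparisons, which maps back to source directly.
int dispatch_if_chain(int selector) {
    if (selector == 0) return 10;
    if (selector == 1) return 20;
    if (selector == 2) return 30;
    return -1;  // default
}

// Hypothetical case bodies for the table form.
static int case_zero() { return 10; }
static int case_one()  { return 20; }
static int case_two()  { return 30; }

// Form 2: a jump through a table indexed by the case expression.
int dispatch_jump_table(int selector) {
    static int (*const table[])() = {case_zero, case_one, case_two};
    const unsigned index = static_cast<unsigned>(selector);
    if (index >= 3u) return -1;  // range check guarding the indirect jump
    return table[index]();       // the "calculated" jump
}

// Spot-check that the two forms behave the same, including out-of-range inputs.
void check_dispatch_forms() {
    for (int selector = -2; selector <= 5; ++selector) {
        assert(dispatch_if_chain(selector) == dispatch_jump_table(selector));
    }
}
```

Without that range check the size of the table really is hard to recover from the bytes alone, which is the hard case the comment describes.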
    • Your understanding of the consequences of the halting problem is incomplete. It is not a proof that it is impossible to determine of any given program whether or not it stops in finite time. It is merely a proof that there exists a class of programs for which this determination cannot be made. However, there are also many programs for which it can easily be determined whether or not it stops in finite time, and the same thing is true for decompilation.

      Furthermore, there is nothing saying that it has to do a 100

  • by TapeLeg ( 671494 ) on Sunday May 25, 2003 @12:54PM (#6035458)
    You can decompile any program. A compiled program is just your high-level program translated into machine language. There is no sort of magical encryption or similar transformation that it undergoes once you compile it.

    All you need to do is read in the bytes of any binary program, interpret the bytes as their machine language equivalents for whatever platform you are using, and then convert your MOV statements to assignment operators, JMP statements to higher level loop structures, etc..

    Of course, you won't retain the names of identifiers, which are referred to only by memory locations in a compiled program; and some control structures might be rearranged due to compiler optimization and the lack of machine language equivalents, but the meat and potatoes of it is all right there.

    It's by no means easy to accomplish, especially with higher and higher level programming languages, but impossible? humbug! =)
  • by Animats ( 122034 ) on Sunday May 25, 2003 @01:14PM (#6035590) Homepage
    Decompilers are rare, but possible. The first good one, decades ago, decompiled IBM 1401 assembler programs into COBOL. There's a commercial business, The Source Recovery Company [source-recovery.com], still doing that for legacy mainframe programs.

    C decompilers exist; here's one. [backerstreet.com] There are others. Most aren't very good. It's a hard problem.

    Without debugging information, decompilation tends to result in code with arbitrary variable and function names, of course. But you get names when a DLL or .so is entered, so at least you get the program's major interfaces. Minimal C++ decompilation could be done by adding vtable recognition to a C decompiler.
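As a rough picture of what "vtable recognition" would be looking for, here is a virtual call next to a hand-lowered model of it. The `Lowered...` names are invented for this sketch, and the real object layout depends on the compiler and ABI, so this shows the shape of the pattern, not any actual layout.

```cpp
#include <cassert>

// The C++ as written.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};
struct Triangle : Shape { int sides() const override { return 3; } };
struct Square : Shape { int sides() const override { return 4; } };

int count_sides(const Shape& shape) { return shape.sides(); }

// A hand-written model of what the compiled code does: each object carries
// a pointer to a per-class table of function pointers.
struct LoweredShape;
struct LoweredVTable {
    int (*sides)(const LoweredShape*);
};
struct LoweredShape {
    const LoweredVTable* vptr;
};

static int triangle_sides(const LoweredShape*) { return 3; }
static int square_sides(const LoweredShape*) { return 4; }
static const LoweredVTable triangle_vtable = {triangle_sides};
static const LoweredVTable square_vtable = {square_sides};

// Load the table pointer, load the slot, call through it.
int count_sides_lowered(const LoweredShape* shape) {
    assert(shape != nullptr && shape->vptr != nullptr);
    return shape->vptr->sides(shape);
}
```

A C decompiler already produces something like the second half. Spotting the "table of function pointers reached through the first word of an object" idiom is the step that would let it print the first half instead.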

    A more difficult problem is recognition of idioms. Things like "for" statements tend to decompile as lower level constructs. That's OK as a first step. You need some internal representation. Initial decompilation might represent all transfers of control with "goto"; higher level recognition then deals with that.
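
    A minimal sketch of that second pass, assuming a made-up goto-based representation: the tuple shapes ("label", ...), ("ifnot_goto", ...), ("goto", ...), ("stmt", ...) and the function recognize_while are hypothetical names chosen for this example. It only recognizes the simplest loop shape and assumes the two labels are not jumped to from anywhere else.

```python
def _match_loop(ir, i):
    """Try to match:  label A / ifnot_goto cond B / stmts / goto A / label B."""
    if i + 1 >= len(ir):
        return None
    head, test = ir[i], ir[i + 1]
    if head[0] != "label" or test[0] != "ifnot_goto":
        return None
    j = i + 2
    body = []
    while j < len(ir) and ir[j][0] == "stmt":
        body.append(ir[j][1])
        j += 1
    if (j + 1 < len(ir)
            and ir[j] == ("goto", head[1])
            and ir[j + 1] == ("label", test[2])):
        # Assumption: no other jump targets these labels, so they can go.
        return ("while", test[1], body), j + 2
    return None

def recognize_while(ir):
    """Replace the simple goto loop pattern with ('while', cond, body) nodes."""
    out = []
    i = 0
    while i < len(ir):
        match = _match_loop(ir, i)
        if match:
            node, i = match
            out.append(node)
        else:
            out.append(ir[i])
            i += 1
    return out


low_level = [
    ("stmt", "i = 0"),
    ("label", "L0"),
    ("ifnot_goto", "i < n", "L1"),
    ("stmt", "i = i + 1"),
    ("goto", "L0"),
    ("label", "L1"),
    ("stmt", "return i"),
]
structured = recognize_while(low_level)
```

    Anything that does not match the pattern is passed through unchanged, so the goto form remains a valid fallback.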

    The key to doing a good job is "optimization", finding more concise source code that will generate the object code. The key to this problem is defining an internal representation that can represent any valid machine-language program, and which can be modified as higher level information about the program is discovered. The first step is usually to start at the starting address and build a code tree by following calls, like a good debugger does. Then you start to improve on the code tree, doing things like this:

    • Recognition of function calls. Each function call should be decompiled, and all calls to the same function checked to ensure they have the same calling sequence. Then a prototype can be generated and placed in a header file.
    • Recognition of fixed-format structures. Figuring out how big a structure is can be tough, but at least fixed-format ones should be fully recognized. All references to the structure should be checked for type consistency, and a structure definition generated.
    • Recognition of "for", "while", and "switch".
    • Once constructors and destructors have been found, the structure of derived objects can be figured out. Now class definitions can be generated.
    • Once class member functions have been identified, the most restrictive protection ("private", "public", "protected") that will work should be attached. Similarly, "const" can be inserted for all arguments not seen to be modified.
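
    The first item in the list above, checking that all calls to a function agree and then emitting a prototype, might look roughly like this. The input format (callee name plus number of arguments seen at each call site), the sub_... names, and the function recover_prototypes are assumptions made for this sketch, and "int" is only a placeholder type because no real type information is available here.

```python
from collections import defaultdict

def recover_prototypes(call_sites):
    """call_sites: iterable of (callee_name, argument_count) pairs.

    Returns (prototypes, inconsistent): a dict of placeholder prototypes
    for callees whose call sites all agree, and a sorted list of callees
    whose call sites disagree and need manual attention.
    """
    counts = defaultdict(set)
    for name, argc in call_sites:
        counts[name].add(argc)

    prototypes = {}
    inconsistent = []
    for name in sorted(counts):
        seen = counts[name]
        if len(seen) == 1:
            argc = next(iter(seen))
            # 'int' is a stand-in; real types would need further analysis.
            params = ", ".join(f"int arg{i}" for i in range(argc)) or "void"
            prototypes[name] = f"int {name}({params});"
        else:
            inconsistent.append(name)
    return prototypes, inconsistent


calls = [
    ("sub_401000", 2), ("sub_401000", 2),
    ("sub_401020", 0),
    ("sub_401040", 1), ("sub_401040", 3),  # disagreeing call sites
]
prototypes, inconsistent = recover_prototypes(calls)
```

    Functions that land in the inconsistent list may be variadic or may indicate a misread call site; the sketch only flags them.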

    Decompilation won't always succeed. But you should find all the places where the code is doing something the compiler doesn't understand, and get code back for everything else.

    It's a big job, and somebody ought to do it. Among other things, it would be a valuable tool for finding compiler bugs.

  • by Minna Kirai ( 624281 ) on Sunday May 25, 2003 @02:23PM (#6035951)
    The article [cxd3.com] (link provided for those who don't read URLs) is wrong, even in the first section.

    The title of the first "chapter" is "Why is c++ Decompiling possible?". But immediately he lists "what is totally loss when you compile a program and what stays there".

    In the Lost column he puts templates and classes. The remains list has things like function calls and local variables.

    Well, guess what? Those things that are "lost" are everything that distinguishes C++ from C. If you don't have classes (meaning no inheritance or virtual functions either) and don't have templates either, then you're really just programming in "a better C", not C++.

    So all his approach can hope to "decompile" is C code. Which is something we've seen done in various forms for decades.
  • From the author (Score:5, Interesting)

    by opcodevoid ( 675898 ) on Sunday May 25, 2003 @03:46PM (#6036319)
    I didn't realize my article was getting any feedback because people are posting it here instead of on pscode.

    Anyway, I've seen a lot of people saying decompiling is impossible or at least not practical. Well, that is not true. Decompiling C++ is very practical because of high-level keywords (if, while, for), local variables, and parameters. All of these generate certain instructions that are similar on every platform and just about every processor.

    I am also extending the article to 92 pages in total, which will cover OOP, the CRT, and a whole bunch of other stuff.

  • by mark-t ( 151149 ) <markt AT nerdflat DOT com> on Sunday May 25, 2003 @05:41PM (#6036855) Journal
    I posted the following remark about 20 minutes ago on pscode, and when I just checked back there I found that the remark had been surreptitiously removed (I still had a backup of what I had written in my cache):

    Nice try, but no. All this article ultimately describes is how to write high level language code that does the same thing as particular groups of assembly instructions, which is meaningless to a high level language programmer because knowing all the individual steps of a process is nowhere near as important as understanding what the process actually *IS*. This is something that no automated decompilation process can uncover because the responsibility for that understanding falls on the programmer, not the computer. Since code that only replicates functionality, but does not convey meaning to the programmer, is not maintainable, the entire process of decompilation would be wasted. One would probably be better off spending their time figuring out how to do it themselves (with, perhaps, some help from standard reverse engineering, if needed).

    Not only does the author completely fail to realize that the technique he is describing doesn't remotely qualify as decompilation, and is nothing but normal reverse engineering, but he figures that the appropriate response to negative criticism is to remove evidence of it rather than attempt to intelligently respond. I noticed that my vote of 1 of 5 was still intact on his voting page, though.

    I was originally surprised when I first read the article that someone would think it had merit enough to write about, but having some insight into the mindset of the author that I did not have before (offered by his rapid censorship of my remarks), my surprise has waned completely.

