Programming The Military

Why DARPA Hopes To 'Distill' Old Binaries Into Readable Code (theregister.com)

Researchers at Georgia Tech have developed a prototype pipeline for the Defense Advanced Research Projects Agency (DARPA) that can "distill" binary executables into human-intelligible code, so that they can be updated and redeployed in "weeks, days, or hours, in some cases." The work is part of a five-year, $10 million project with the agency. The Register reports: After running an executable through the university's "distillation" process, software engineers should be able to examine the generated HAR, figure out what the code does, make changes to add new features, patch bugs, or improve security, and turn the HAR back into executable code, says GT associate professor and project participant Brendan Saltaformaggio. This would be useful for, say, updating complex software that was written by a contractor or internal team, where the source code is no longer (or never was) to hand, its creators are unavailable, and stuff needs to be fixed up. Reverse engineering the binary and patching in an update by hand can be a little hairy, hence DARPA's desire for something a bit more solid and automatic. The idea is to use this pipeline to freshen up legacy or outdated software that may have taken years and millions of dollars to develop.

Saltaformaggio told El Reg his team has the entire process working from start to finish, and with some level of stability, too. "DARPA sets challenges they like to use to test the capabilities of a project," he told us over the phone. "So far we've handled every challenge problem DARPA's thrown at us, so I'd say it's working pretty well." Saltaformaggio said his team's pipeline disassembles binaries into a graph structure with pseudo-code, presented in a way that developers can navigate and in which they can replace or add parts in C and C++. Sorry, Java devs and Pythonistas: Saltaformaggio tells us that there's no reason the system couldn't work with other programming languages, "but we're focused on C and C++. Other folks would need to build out support for that." Along with being able to deconstruct, edit, and reconstruct binaries, the team said its processing pipeline is also able to comb through HARs and remove extraneous routines. The team has also, we're told, baked in verification steps to ensure changes made to code within hardware ranging from jets and drones to plain-old desktop computers work exactly as expected with no side effects.
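For a rough sense of what "distilling" a binary into an editable form means in practice, here is a minimal, invented sketch: the first function is the kind of output a typical decompiler produces (generated names, bare magic numbers), the second is what a developer might turn it into before patching and rebuilding. None of the names or constants come from DARPA's or Georgia Tech's tooling; they are illustrative assumptions only.

  /* Hypothetical decompiler-style output: generated names, raw constants. */
  int sub_4021f0(int a1)
  {
      if ((a1 & 0x80) != 0)
          return 1;
      return 0;
  }

  /* The same routine after a developer names things; this is the kind of
   * editable form the HAR is meant to round-trip through. (STATUS_FAULT and
   * status_has_fault are invented names, used purely for illustration.) */
  #define STATUS_FAULT 0x80

  int status_has_fault(int status_word)
  {
      return (status_word & STATUS_FAULT) != 0;
  }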

  • by wierd_w ( 1375923 ) on Saturday August 19, 2023 @05:09AM (#63779652)

    Since compilers strip out "unnecessary" data like comments and the names of variables (because they mean nothing to a computer), recovering all that missing, and often essential, metadata has long been a niggly, prickly, and pernicious obstacle to disassembling binary or object code back into assembler (and, in some cases, back into "something that resembles C").

    Short of having some kind of AI that "knows" about commonly used interfaces/libraries, and which can identify them in the compiled code's disassembly and pair them up, there is no easy way to revert it to something genuinely human-readable. (A sketch of that idea follows at the end of this comment.)

    Even then, there are situations where the code never really was "human readable", such as hand-assembled, performance-focused code, where attempting this kind of operation will seriously degrade its value -- or software that modifies itself in memory at run time (like SecuROM).

    I wish DARPA all the luck in the world, but this is something that people have been wanting to do for aaaaages.
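    To make the "tool that knows about common libraries" idea above concrete, here is a minimal, hypothetical sketch of signature matching: hash a function's compiled bytes and look the result up in a table of known library routines to recover a meaningful name. The fingerprints and names below are invented for illustration; real signature systems (IDA's FLIRT, for instance) use more robust, mask-based pattern matching rather than a plain hash.

      #include <stdint.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Invented database: fingerprint of a compiled routine -> likely name. */
      struct known_func { uint32_t fingerprint; const char *name; };

      static const struct known_func db[] = {
          { 0x009b1dc4u, "memcpy" },   /* hypothetical fingerprints */
          { 0x0051a2e7u, "strlen" },
          { 0x00d00df3u, "qsort"  },
      };

      /* Trivial FNV-1a-style fingerprint over a routine's first bytes. */
      static uint32_t fingerprint(const uint8_t *code, size_t len)
      {
          uint32_t h = 2166136261u;
          for (size_t i = 0; i < len && i < 32; i++)
              h = (h ^ code[i]) * 16777619u;
          return h & 0x00ffffffu;
      }

      static const char *guess_name(const uint8_t *code, size_t len)
      {
          uint32_t h = fingerprint(code, len);
          for (size_t i = 0; i < sizeof db / sizeof db[0]; i++)
              if (db[i].fingerprint == h)
                  return db[i].name;
          return "sub_unknown";          /* fall back to a generated name */
      }

      int main(void)
      {
          const uint8_t dummy[] = { 0x55, 0x89, 0xe5, 0xc3 };   /* made-up bytes */
          printf("best guess: %s\n", guess_name(dummy, sizeof dummy));
          return 0;
      }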

    • by wierd_w ( 1375923 ) on Saturday August 19, 2023 @05:11AM (#63779654)

      (Additionally, this toolkit would make a lot of closed source software vendors shit solid gold bricks, and reach impulsively for their lawyers and cease-and-desist orders, like Catholics reaching for a crucifix when they catch even the faintest hint of something 'satanic')

      • For lost source code, most copyright laws have exceptions that explicitly allow decompiling, fixing, porting, and recompiling.

    • You'll never get the comments back, nor will you have meaningful variable names... but an advanced disassembler would be able to map out the execution paths and provide order and consistency and even break the code into objects.

      Honestly, I still wouldn't call that 'human readable' as I have a bit of trouble reviewing my own code after a few years and it's no longer fresh in my mind. And I comment my code.

      Code that is organized purely based on how it executes seems like a great thing, but realistically you

    • Hold on a moment (Score:5, Informative)

      by Gravis Zero ( 934156 ) on Saturday August 19, 2023 @08:13AM (#63779814)

      From TFA:

      We know what you're thinking: Uncle Sam is reinventing decompilation. It certainly sounds like it. There are lots of decompilation and reverse-engineering tools out there for turning executable machine-level code into corresponding source code in human-readable high-level language like C or C++. That decompiled source, however, tends to be messy and hard to follow, and is typically used for figuring out how a program works and whether any bugs are exploitable.

      From what we can tell, this DARPA program seeks a highly robust, automated method of converting executable files into a high-level format developers can not only read – a highly abstract representation, or HAR, in this case – but also edit to remove flaws and add functionality, and reassemble it all back into a program that will work as expected. That's a bit of a manual, error-prone chore even for highly skilled types using today's reverse-engineering tools, which isn't what you want near code going into things like aircraft.

      DARPA instead seems to want a decompilation-and-recompilation system that is reliable, easy enough to use, and incorporates stuff you'd expect from a military research nerve center, such as formal verification of a program's modifications.

    • by mpercy ( 1085347 ) on Saturday August 19, 2023 @09:24AM (#63779928)

      No, they're trying to vastly improve on decompiling tools: tools that can disassemble object/executable code (which is pretty trivial, e.g., objdump does it) and then analyze the assembly to back out the C or C++ code that might originally have created it.

      Think "ftoc" but replace FORTRAN with "exectuable/object code".

      I've done some decompiling of (small) executables, using non-AI tools that often did little more than treat C as an assembly language. They did usually get things like "this is a subroutine" right and created parts like "int a113s_r4r5r6(int r4, int r5, int r6) { ... return r6; }".

      So we got C code that we could recompile, and while the output wasn't byte-for-byte identical, the resulting recompiled code was "correct". We could theoretically edit the resulting C code, but because all the labels, variable names, etc., are obviously stripped out, the decompiler had to generate *something*, so we got stuff like I mentioned above.

      As a subject-matter expert, most of my job was trying to recognize what the code was *really* doing, replacing the decompiler-generated names with my best educated guess as to what the function/variable really does or might be called. The decompiler didn't always (or even usually) see things like "this is an array access", and had instead emitted code like

      int *v123;
      int v3245234;
      v123 = v345235;     /* base of some int array */
      v123 += v3544;
      v3245234 = *v123;

      Which is C-as-assembly, essentially. But recognizing the pattern and making substitutions like

      v345235 == "a"
      v3245234 == "b"
      v3544 == "i"

      we might recast that as

      int b = a[i];

      I'm sure the AI parts here are geared towards doing that sort of thing better and more accurately. Not to mention being able to compare object code against known object code in the wild and find the corresponding source code, e.g., when FOSS software got included, or when libc code was statically linked.
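      To put the before/after above into compilable form, here is a minimal sketch; the function names and parameter roles are the same kind of educated guesses described in this comment, not anything recovered from a real binary.

      /* Decompiler-style output, padded out into a complete routine. */
      int sub_4735(int *v345235, int v3544)
      {
          int *v123;
          int v3245234;
          v123 = v345235;     /* base of some int array */
          v123 += v3544;      /* advance by an index */
          v3245234 = *v123;
          return v3245234;
      }

      /* The same routine after recognizing the pattern and renaming a/b/i. */
      int load_element(int *a, int i)
      {
          int b = a[i];
          return b;
      }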

      • I see myself potentially replacing refactoring by using this path: compile the existing code, run an AI-assisted decompile, then proceed to understand the code. At least this way I would have a consistent starting point when faced with some hastily written chicken scratch that was squeezed out by program management. Might even find a bunch of trivial bugs and convenient optimizations this way too.
      • Not to mention being able to compare object code against known object code in the wild and find the corresponding source code, e.g., when FOSS software got included, or when libc code was statically linked.

        Back in the day (late '80s, early '90s, when there was still software manually written in assembly), that was one of the strengths of the "Sourcer" disassembler.
        Instead of merely dumping machine code as human-readable mnemonics, it actually tried to understand what the code does and give meaningful names and comments.
        It did so by having a lot of knowledge in its database.

        so instead of merely:

        out dx, al

        you got:

        out dx, al ; switch the PC speaker timer output to single trigger

        (that's how I learned how to play digital

    • There are plenty of programs that are just compiled from high-level languages down to Lego-like assembly macros and finally end up as assembly.

      If the source code is lost, decompiling them into something C-like and then compiling them again might already be enough to port them to a different processor/system.

      • by sfcat ( 872532 )

        Clearly you have never tried that. It doesn't work for C or anything that compiles down to assembler. It will work for Java, sort of, but it is unreliable and for a large enough body of code the chance that a bug gets introduced approaches 1 quite quickly. This is the sort of technology that works for some simple cases but doesn't scale well to large applications. Doesn't mean you couldn't make one that works but that is going to be very expensive and definitely won't be done at a university (this is th

        • Clearly you have never tried that. It doesn't work for C or anything that compiles down to assembler.
          Clearly I have tried that.
          Which is clear from my clearly written parent post.

          Perhaps you should look up what a "macro assembler" is and how an old, traditional C compiler compiles to assembler macros ...

          And how those get mapped 1:1 to machine code, without any optimization (see the sketch below this comment).

          Sigh ...

          I think that was pretty clear from my previous post.

          But reading comprehension ...
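          For readers who haven't met the term, here is a minimal, hypothetical sketch of what "compiles to assembler macros that map 1:1 to machine code" means; the mnemonics in the comment are invented and not tied to any real CPU or compiler.

          /* A trivial C function as an old, non-optimizing compiler sees it. */
          int scale_and_add(int a, int b, int s)
          {
              return a + b * s;
          }

          /*
           * Such a compiler emits one fixed macro per operation, e.g. (invented):
           *
           *   LOADARG  r1, b      ; each macro always expands to the same
           *   LOADARG  r2, s      ; machine-code byte pattern
           *   MUL      r1, r2
           *   LOADARG  r2, a
           *   ADD      r1, r2
           *   RETURN   r1
           *
           * Because the mapping is mechanical and unoptimized, walking it
           * backwards (binary -> macros -> C-like code) is comparatively easy,
           * which is the point being made above.
           */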

    • Since compilers strip out "unnecessary" data, like comments, or the names of variables (because they mean nothing to a computer), recovery of all that missing, and often essential metadata has long been a niggly, prickly, and pernicious obstacle with disassembling binary or object code back into assembler, (and in some cases, back into "something that resembles C")

      Well, given they already have Ghidra courtesy of the NSA [nsa.gov], that part's down pat.

      Short of having some kind of AI that "knows" about commonly used interfaces/libraries, and which can identify the compiled code's disassembly and pair it up, there is no easy way to revert it back to something genuinely human-readable.

      One of the biggest problems with Ghidra is the inability to easily define external data type libraries. (These contain type references, function signatures, and data structure names/layouts for a given library.) You can build custom ones for the currently disassembled binary, but they cannot be exported and imported into another project. If you could, you'd be able to help the disassembler quite a bit in that regard. No AI
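      For a sense of what such an external data type library would carry, here is a minimal, hypothetical sketch for an imaginary "libtelemetry"; every name and field below is invented, and this is not Ghidra's actual archive format, just the kind of information it would encode.

      /* Struct layouts and prototypes that, once applied, let the disassembler
       * label offsets and call arguments instead of showing raw numbers. */
      struct telemetry_frame {
          unsigned int   timestamp;    /* offset 0x00 */
          unsigned short sensor_id;    /* offset 0x04 */
          unsigned short flags;        /* offset 0x06 */
          float          reading;      /* offset 0x08 */
      };

      int  tlm_open(const char *device);
      int  tlm_read_frame(int handle, struct telemetry_frame *out);
      void tlm_close(int handle);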

    • A reverse compiler is not a new idea - I used one about 40 years ago.
  • Literally did this for a project in the '90s just to meet our requirements for said acronym.

  • Too bad the summary couldn't be bothered to state what HAR means.
    • From what we can tell, this DARPA program seeks a highly robust, automated method of converting executable files into a high-level format developers can not only read – a highly abstract representation, or HAR, in this case – but also edit to remove flaws and add functionality, and reassemble it all back into a program that will work as expected. That's a bit of a manual, error-prone chore even for highly skilled types using today's reverse-engineering tools, which isn't what you want near code going into things like aircraft.

      If only there was some way to go directly to the article. Hmm...

      • If only the editor had bothered to spell out the acronym in parentheses after its first use in the summary.

    • by mpercy ( 1085347 )

      "From what we can tell, this DARPA program seeks a highly robust, automated method of converting executable files into a high-level format developers can not only read – a highly abstract representation, or HAR, in this case – but also edit to remove flaws and add functionality, and reassemble it all back into a program that will work as expected. That's a bit of a manual, error-prone chore even for highly skilled types using today's reverse-engineering tools, which isn't what you want near code

  • This has been the Holy Grail for five decades. When I started programming, source code control was not used, and to be honest all you really had was SCCS on UNIX. On the mini I started on, nothing. So as you can imagine, lots of source was lost. At least once per month someone would ask, "Is there a way to turn this executable into source?"

    If I understand what they want, I wish them luck and hope for success. I guess if anyone could do it, it's them. But right now it seems to me they may be tilting at windmills.

    • I remember being mildly surprised to see some mainframe software from IBM and Unisys being maintained as layers of patch sets upon patch sets, kind of like how SCCS used to keep things. Even today you build Yocto Linux from a somewhat recent set of source files or repositories and then apply a more limited number of patches on top of that. At least with the disassembly/reconstruction you start with the latest version. :-)
  • means "cheese hopper" in Italian.

  • by gweihir ( 88907 ) on Saturday August 19, 2023 @09:39AM (#63779956)

    Analyzing code usually takes longer and requires more skill than writing it in the first place. I guess they are lying about the actual purpose of this.

    • I guess the mantra is more: code gets read by a human 100 to 1,000 times more often than it gets written.
      So making code quicker to understand is an immense cost saver.

      I doubt I write code faster than I can analyse it ... I probably overthink writing code too much. For some reason I have the old instinct of writing it once, for good, and not having to come back to it for fixes, only changes.

      But in relation to peers I'm not really slower, however I sometimes see people hammering an immense amount of code out with modern

      • by gweihir ( 88907 )

        For quality code (obviously a non-standard thing created only by an irrelevant fringe movement among coders) the numbers are different. That stuff tends to be well thought out, the code actually follows a line of reasoning, it comes with no surprises (or clearly marked ones), and it may well be easier to read than to write. But most code out there is just slapped together, with more crap thrown in until it seems to be working.

        That said, I do not think code gets read more often than written. My guess would be that except while

    • by erice ( 13380 )

      Analyzing code usually takes longer and requires more skill than writing it in the first place.

      If you have a clear spec, sure. But if the source and developers are missing, I think it is likely that the requirements are also AWOL or perhaps just wrong. In the latter case, the original developers, who had much better information at their disposal, adapted the code so that it worked but never updated the spec.

  • There were C-like disassemblers in the '90s, but they were approximations and couldn't pull nonexistent symbol names out of thin air. You can't recreate information that was lost. IDA Pro and similar reversing tools do this too. Again, they're smoking crack if they think they can recreate the original source with comments or source-control revisions.
  • by bugs2squash ( 1132591 ) on Saturday August 19, 2023 @09:54AM (#63780006)

    ensure changes made to code within hardware ranging from jets and drones to plain-old desktop computers work exactly as expected with no side effects

    My hubris detector went off when I read that - if they have a formal spec, why not work on compiling the formal spec?

    • Because the whole point is to double check that those implementing the spec are doing so as described. Without their assistance. (Willing or not.) With any building contractor that's easy. With assembly code, that's a lot harder.

      Having something that can reverse the implementation into something that's more manageable for inspectors is a good thing. Unless you want a bunch of inspectors, extremely specialized in only one or two processor architectures, demanding the government forbid deployment of any oth
  • by rjzak ( 159949 ) on Saturday August 19, 2023 @11:39AM (#63780270) Homepage

    A company called GrammaTech has some similar tools: https://github.com/GrammaTech/ddisasm. It disassembles binaries into an intermediate format that can be altered and then reassembled into a new binary.

  • ....intellectual property laws want humanity to die out before allowing something like this.

  • Reverse engineering the binary ...

    For decades, the US government thought destroying the blueprints was a good idea: with nothing to steal, the American technological advantage was ensured. This is resurrecting a dead horse, or a monkey's paw, with all the prescribed horrors appearing as expected. For the current works of art, there is no going back: like the library of Alexandria, decades of knowledge can never be replaced. At least the microcomputer revolution meant much technical data about them (e.g. the Radio-Shack MS Basic, the App

  • Those binaries are already readable as assembly language. That's perfectly readable and clear. I've been doing it for 30 years! In the immortal words of W.C. Fields: Go away kid, you bother me.
  • Wouldn't it be easier to just get the source? And if you can't, why are you even using the software?!
