Why DARPA Hopes To 'Distill' Old Binaries Into Readable Code (theregister.com)
Researchers at Georgia Tech have developed a prototype pipeline for the Defense Advanced Research Projects Agency (DARPA) that can "distill" binary executables into human-intelligible code so that it can be updated and deployed in "weeks, days, or hours, in some cases." The work is part of a five-year, $10 million project with the agency. The Register reports: After running an executable through the university's "distillation" process, software engineers should be able to examine the generated HAR, figure out what the code does, and make changes to add new features, patch bugs, or improve security, and turn the HAR back into executable code, says GT associate professor and project participant Brendan Saltaformaggio. This would be useful for, say, updating complex software that was written by a contractor or internal team, the source code is no longer or never was to hand and neither are its creators, and stuff needs to be fixed up. Reverse engineering the binary and patching in an update by hand can be a little hairy, hence DARPA's desire for something a bit more solid and automatic. The idea is to use this pipeline to freshen up legacy or outdated software that may have taken years and millions of dollars to develop some time ago.
Saltaformaggio told El Reg his team has the entire process working from start to finish, and with some level of stability, too. "DARPA sets challenges they like to use to test the capabilities of a project," he told us over the phone. "So far we've handled every challenge problem DARPA's thrown at us, so I'd say it's working pretty well." Saltaformaggio said his team's pipeline disassembles binaries into a graph structure with pseudo-code, and presented in a way that developers can navigate, and replace or add parts in C and C++. Sorry, Java devs and Pythonistas: Saltaformaggio tells us that there's no reason the system couldn't work with other programming languages, "but we're focused on C and C++. Other folks would need to build out support for that." Along with being able to deconstruct, edit, and reconstruct binaries, the team said its processing pipeline is also able to comb through HARs and remove extraneous routines. The team has also, we're told, baked in verification steps to ensure changes made to code within hardware ranging from jets and drones to plain-old desktop computers work exactly as expected with no side effects.
So, DARPA is trying to reinvent disassembly? (Score:5, Insightful)
Since compilers strip out "unnecessary" data, like comments, or the names of variables (because they mean nothing to a computer), recovery of all that missing, and often essential metadata has long been a niggly, prickly, and pernicious obstacle with disassembling binary or object code back into assembler, (and in some cases, back into "something that resembles C")
Short of having some kind of AI that "knows" about commonly used interfaces/libraries, and which can identify the compiled code's disassembly and pair it up, there is no easy way to revert it back to something genuinely human-readable.
Even then, there's situations where the code never really was "human readable", such as hand-assembled performance-focused code, where attempting this kind of operation on it will seriously degrade its value-- or software that real-time modifies itself in memory (like SecuRom)
I wish DARPA all the luck in the world, but this is something that people have been wanting to do for aaaaages.
Re:So, DARPA is trying to reinvent disassembly? (Score:5, Insightful)
(Additionally, this toolkit would make a lot of closed source software vendors shit solid gold bricks, and reach impulsively for their lawyers and cease-and-desist orders, like Catholics reaching for a crucifix when they catch even the faintest hint of something 'satanic')
Re:So, DARPA is trying to reinvent disassembly? (Score:5, Interesting)
The idea with the addendum is that if a tool could easily produce human-readable code (and not just raw disassembly, with its obtuse, hard-to-tease-out structure of "what it's doing"), it would make many hardware vendors "Very Upset."
See for instance, nVidia, and their binary blob drivers, or Broadcom with their binary blob radio firmware.
Being able to generate human-readable code using an AI assistive tool (assuming it's worth a shit-- which is a whole other ball of wax), means also being able to easily produce human-readable documentation about a binary blob, and what it's doing.
That means trade secrets and other things that are obfuscated inside such a blob could be revealed and disseminated quickly.
Hence the note about reaching for C&Ds.
Re: (Score:2)
And yet Ghidra freely exists.
Yea, you aren't going to sue the NSA. Your ass will disappear.
Re: (Score:1)
Re: (Score:2)
means also being able to easily produce human-readable documentation about a binary blob, and what it's doing.
I beg to differ - even producing documentation from someone else's well-written original code is often a nightmare. I don't see it getting any easier using AI de-compiled code, which will almost certainly be less readable.
It'll be a lot easier than doing so from disassembled assembly code, but that's not saying much.
And my bet is we'll have intentionally-obfuscating compilers coming out any day now in order to reduce the risk - just like we had back in the day when CPUs were simple linear processors, and c
Re: (Score:3)
means also being able to easily produce human-readable documentation about a binary blob, and what it's doing.
I beg to differ - even producing documentation from someone else's well-written original code is often a nightmare. I don't see it getting any easier using AI de-compiled code, which will almost certainly be less readable.
No doubt. I even have problems going back to old code I wrote, and remind myself to better document it in the future (which I don't do.) AI seems to be the answer to everything, or at least attracting loads of cash. Given some of the challenges and problems AI has had with stuff that should be relatively straightforward, such as looking up case law or even writing a simple article, trying to understand code is likely to be a source of humor for quite some time.
too complicated (Score:2)
Re: So, DARPA is trying to reinvent disassembly? (Score:2)
Re: (Score:2)
After all, the whole point of proprietary software is to sell you something that tells the machine how to do something that you cannot describe yourself. So why would they have a problem forbidding you from ever being allowed to describe anything? Hell, it's a monopoly at that point. You want the machine to do someth
Re: (Score:1)
For lost source code, most copyright laws have exceptions which explicitly allow decompiling and fixing and porting and recompiling.
Re: (Score:2)
You'll never get the comments back, nor will you have meaningful variable names... but an advanced disassembler would be able to map out the execution paths and provide order and consistency and even break the code into objects.
Honestly, I still wouldn't call that 'human readable' as I have a bit of trouble reviewing my own code after a few years and it's no longer fresh in my mind. And I comment my code.
Code that is organized purely based on how it executes seems like a great thing, but realistically you
Hold on a moment (Score:5, Informative)
From TFA:
We know what you're thinking: Uncle Sam is reinventing decompilation. It certainly sounds like it. There are lots of decompilation and reverse-engineering tools out there for turning executable machine-level code into corresponding source code in human-readable high-level language like C or C++. That decompiled source, however, tends to be messy and hard to follow, and is typically used for figuring out how a program works and whether any bugs are exploitable.
From what we can tell, this DARPA program seeks a highly robust, automated method of converting executable files into a high-level format developers can not only read – a highly abstract representation, or HAR, in this case – but also edit to remove flaws and add functionality, and reassemble it all back into a program that will work as expected. That's a bit of a manual, error-prone chore even for highly skilled types using today's reverse-engineering tools, which isn't what you want near code going into things like aircraft.
DARPA instead seems to want a decompilation-and-recompilation system that is reliable, easy enough to use, and incorporates stuff you'd expect from a military research nerve center, such as formal verification of a program's modifications.
Re:So, DARPA is trying to reinvent disassembly? (Score:5, Interesting)
No, they're trying to vastly improve on de-compiling tools. Tools that can disassemble object/executable code (which is pretty trivial, e.g., objdump does it) but then analyze the assembly to back out C or C++ code that might have originally created the code.
Think "ftoc" but replace FORTRAN with "executable/object code".
I've done some decompiling of (small) executables, using non-AI tools that often did little more than treat C as an assembler language. It did usually get things like "this is a subroutine" right and created the parts like "int a113s_r4r5r6(int r4, int r5, int r6) { .... return r6; }".
So we got C code that we could recompile, and while not exact byte-for-byte in the output, the resulting recompiled code was "correct". We could theoretically edit the resulting C code, but because obviously all the labels, variable, names, etc, are stripped out, the decompiler had to generate *something* so we got stuff like I mentioned above.
As a subject-matter expert, most of my job was trying to recognize what the code was *really* doing, replacing the decompiler-generated names with my best educated guess as to what the function/variable is really doing or might be called. The decompiler didn't always (usually) see things like "this is an array access", and had instead emitted code like
int *v123;
int v3245234;
v123 =
v123 += v3544;
v3245234 = * v123;
Which is C-as-assembly, essentially. But recognizing the pattern and making substitutions like
v345235 == "a"
v3245234 == "b"
v3544 == "i"
we might recast that as
int b = a[i];
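The substitution above can be checked mechanically. This is only a minimal sketch: the assignment to v123 is cut off in the post, so the code assumes it pointed at the base of the array (v345235, i.e. "a"), and the function names load_raw and load_clean are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Decompiler-style form: pointer arithmetic with machine-generated names. */
int load_raw(int *v345235, int v3544)
{
    int *v123;
    int v3245234;

    assert(v345235 != NULL);
    v123 = v345235;      /* assumption: the pointer starts at the array base */
    v123 += v3544;       /* advance by the index, scaled by sizeof(int) */
    v3245234 = *v123;    /* load the element */
    return v3245234;
}

/* Hand-cleaned form after renaming: a plain array access. */
int load_clean(const int *a, int i)
{
    assert(a != NULL);
    int b = a[i];
    return b;
}
```

Calling both on the same array and index should return the same element, which is the whole point: the two forms are the same program, and only the second one tells a reader what is going on.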
I'm sure the AI parts here are geared towards doing that sort of thing better and more accurately. Not to mention being able to compare object code against known object code in the wild and find the corresponding source code, e.g., when FOSS software got included, or when libc code was statically linked.
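To make the "compare against known object code" idea concrete, here is a rough sketch that hashes the bytes of a function and looks the hash up in a table of known routines. Everything here is an assumption for illustration: the struct, the function names, and the choice of FNV-1a as the hash. Exact byte hashing is also the naive version; it breaks as soon as an embedded address or offset differs between builds, so a practical matcher would have to mask out those bytes first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a hash over a byte range. */
uint64_t fnv1a64(const unsigned char *bytes, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= bytes[i];
        h *= 0x100000001b3ULL;            /* FNV 64-bit prime */
    }
    return h;
}

/* Hypothetical database entry: hash of a known routine and its name. */
struct known_func {
    uint64_t hash;
    const char *name;
};

/* Return the name of the matching known routine, or NULL if none matches. */
const char *identify(const struct known_func *db, size_t n,
                     const unsigned char *code, size_t len)
{
    assert(db != NULL || n == 0);
    uint64_t h = fnv1a64(code, len);
    for (size_t i = 0; i < n; i++) {
        if (db[i].hash == h) {
            return db[i].name;
        }
    }
    return NULL;
}
```

A linear scan is fine for a sketch; with a database of many thousands of library routines, a sorted table or hash map keyed on the hash would be the obvious next step.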
Re: So, DARPA is trying to reinvent disassembly? (Score:2)
Disassembly with some general culture. (Score:2)
Not to mention being able to compare object code against known object code in the wild and find the corresponding source code, e.g., when FOSS software got included, or when libc code was statically linked.
back in the days (late 80s, early 90s when there was still software manually written in assembly), that was one of the strengths of "Sourcer" disassembler.
instead of merely dumping machine code into human readable mnemonic, it actually tried to understand what the code does and give meaningful names and comments.
it did so by having a lot of knowledge in its database.
so instead of merely:
out dx, al
you got:
out dx, al ; switch the PC speaker timer output to single trigger
(that's how I learned how to play digital
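The "knowledge in its database" approach can be sketched as a lookup from I/O port number to a canned comment. This is a guess at the general shape, not how Sourcer actually worked; the table, struct, and function name are hypothetical, and the three entries are just the well-known PC ports involved in driving the speaker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical knowledge-base entry: an I/O port and what it is used for. */
struct port_note {
    uint16_t port;
    const char *note;
};

/* A tiny sample table covering the classic PC speaker ports. */
const struct port_note PORT_NOTES[] = {
    { 0x42, "PIT channel 2 data (PC speaker timer)" },
    { 0x43, "PIT mode/command register" },
    { 0x61, "system control port B (speaker gate and enable bits)" },
};
const size_t PORT_NOTES_LEN = sizeof PORT_NOTES / sizeof PORT_NOTES[0];

/* Return the comment for a port, or NULL if the table does not know it. */
const char *annotate_port(const struct port_note *table, size_t n,
                          uint16_t port)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++) {
        if (table[i].port == port) {
            return table[i].note;
        }
    }
    return NULL;
}
```

A disassembler using a table like this still has to work out what value dx holds at the "out dx, al" instruction before it can look anything up, which is the harder half of the problem.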
Re: (Score:1)
There are plenty of programs that are just compiled from high level languages down to lego like assembly macros and end finally in assembly.
If the source code is lost, decompiling them into something C like and then compiling them again might already be enough to port them to a different processor / system.
Re: (Score:2)
Clearly you have never tried that. It doesn't work for C or anything that compiles down to assembler. It will work for Java, sort of, but it is unreliable and for a large enough body of code the chance that a bug gets introduced approaches 1 quite quickly. This is the sort of technology that works for some simple cases but doesn't scale well to large applications. Doesn't mean you couldn't make one that works but that is going to be very expensive and definitely won't be done at a university (this is th
Re: (Score:1)
Clearly you have never tried that. It doesn't work for C or anything that compiles down to assembler.
Clearly I have tried that.
Which is clear from my clearly written parent post.
Perhaps you should look up what a "macro assembler" is and how an old traditional C compiler compiles to Assembler Macros ...
And how those get 1 : 1 mapped to machine code, without any optimization.
Sigh ...
I think that was pretty clear from my previous post.
But reading comprehension ...
Re: (Score:2)
Since compilers strip out "unnecessary" data, like comments, or the names of variables (because they mean nothing to a computer), recovery of all that missing, and often essential metadata has long been a niggly, prickly, and pernicious obstacle with disassembling binary or object code back into assembler, (and in some cases, back into "something that resembles C")
Well, given they already have ghidra courtesy of the NSA [nsa.gov], that part's down pat.
Short of having some kind of AI that "knows" about commonly used interfaces/libraries, and which can identify the compiled code's disassembly and pair it up, there is no easy way to revert it back to something genuinely human-readable.
One of the biggest problems with ghidra is the inability to easily define external data type libraries. (These contain type references, function signatures, and data structure names / layouts for a given library.) You can build some custom ones for the currently disassembled binary. But it cannot be exported and imported into another project. If you could, you'd be able to help the disassembler quite a bit in that regard. No AI
Reverse compiler (Score:2)
Everything old is new again. (Score:1)
Literally, did this for a project in the 90s just to meet our requirements to said acronym.
Re: (Score:2)
So you wrote "a decompilation-and-recompilation system that is reliable, easy enough to use, and incorporates stuff you'd expect from a military research nerve center, such as formal verification of a program's modifications"? No, I didn't think you did.
Re:Everything old is new again. (Score:4, Informative)
Literally, did this for a project in the 90s just to meet our requirements to said acronym.
Yea. The late Don Lancaster wrote some articles about decompiling Apple ][ assembly language [6502disassembly.com] back in the mid 80's.
Alternatively, you're the joke. (Score:3)
Which part is factually inaccurate or is a sign of ignorance? Perhaps the problem is you simply didn't RTFA. Perhaps you are the joke.
Re: (Score:2)
I gave up there.
You and this shitty site are the joke
Re: (Score:2)
Nobody is keeping you here. You are free to never use this site again.
Re: (Score:2)
Which also explains why you don't realise that The Register is shit at journalism.
Re: (Score:2)
I understand your complaint and my response is that you are not required to use Slashdot or The Register. Alternatively, you could become a volunteer editor for either or both sites. The future is up to you but complaining on Slashdot won't change anything. I'm just presenting you with alternatives to crying about it.
Re: (Score:2)
You missed the factual inaccuracies, because you simply couldn't comprehend TFA.
I pointed them out to you.
The Register is a joke site that can't do journalism.
You are the joke.
What the HAR is HAR? (Score:2)
Re: (Score:3)
From what we can tell, this DARPA program seeks a highly robust, automated method of converting executable files into a high-level format developers can not only read – a highly abstract representation, or HAR, in this case – but also edit to remove flaws and add functionality, and reassemble it all back into a program that will work as expected. That's a bit of a manual, error-prone chore even for highly skilled types using today's reverse-engineering tools, which isn't what you want near code going into things like aircraft.
If only there was some way to go directly to the article. Hmm...
Re: (Score:1)
If only the editor had bothered to spell out the acronym in parentheses after its first use in the summary.
Re: (Score:2)
"From what we can tell, this DARPA program seeks a highly robust, automated method of converting executable files into a high-level format developers can not only read – a highly abstract representation, or HAR, in this case – but also edit to remove flaws and add functionality, and reassemble it all back into a program that will work as expected. That's a bit of a manual, error-prone chore even for highly skilled types using today's reverse-engineering tools, which isn't what you want near code
Holy Grail (Score:2)
This has been the Holy Grail for 5 decades. When I started programming, Source Code Control was not used, and to be honest all you really had was sccs on UNIX. On the mini I started on, nothing. So as you can imagine, lots of source was lost. At least once per month someone would ask "Is there a way to turn this executable into source?".
If I understand what they want, I wish them luck and hope for success. I guess if anyone could do it, it is them. But right now, to me, it seems they may be chasing windmills
Mainframes had layered patch sets (Score:2)
Saltaformaggio (Score:2)
means "cheese hopper" in Italian.
Re: Saltaformaggio (Score:2)
That is nonsense (Score:3)
Analyzing code usually takes longer and requires more skill than writing it in the first place. I guess they are lying about the actual purpose of this.
Re: (Score:2)
I guess the mantra is more: code gets 100 - 1000 times more often read by a human than written.
So making understanding code quicker is an immense cost saver.
I doubt I write code faster than I can analyse it ... I probably overthink writing code too much. For some reason I have the old instinct of writing it once, for good, and not having to come back to it for fixes, only changes.
But in relation to peers I'm not really slower, however I sometimes see people hammering an immense amount of code out with modern
Re: (Score:2)
For quality code (obviously a non-standard thing created only by an irrelevant fringe-movement among coders) the numbers are different. That stuff tends to be well-thought out and the code actually follows a reasoning, comes with no or clearly marked surprises and may well be easier to read than write. But most code out there is just slapped together and then more crap thrown in until it seems to be working.
That said, I do not think code gets read more often than written. My guess would be that except while
Re: (Score:2)
Analyzing code usually takes longer and requires more skill than writing it in the first place.
If you have a clear spec, sure. But if the source and developers are missing, I think it is likely that the requirements are also AWOL or perhaps just wrong. In the latter case, the original developers, who had much better information at their disposal, adapted the code so that it worked but never updated the spec.
They're smoking crack (Score:2)
Re: (Score:3)
ensure (Score:3)
ensure changes made to code within hardware ranging from jets and drones to plain-old desktop computers work exactly as expected with no side effects
My hubris detector went off when I read that - if they have a formal spec why not work on compiling the formal spec
Re: (Score:2)
Having something that can reverse the implementation into something that's more manageable for inspectors is a good thing. Unless you want a bunch of inspectors, extremely specialized in only one or two processor architectures, demanding the government forbid deployment of any oth
Similar project ddisasm + gtirb (Score:4, Interesting)
A company called GrammaTech has some similar tools: https://github.com/GrammaTech/ddisasm. It disassembles binaries into an intermediate format which can be altered, and recompiled back to a new binary.
But...but... (Score:2)
....intellectual property laws want humanity to die out before allowing something like this.
Had the answer (Score:2)
Reverse engineering the binary ...
For decades, the US government thought destroying the blueprints was a good idea: With nothing to steal, the American technological advantage was ensured. This is resurrecting a dead horse, or a monkey's paw, with all the prescribed horrors appearing as expected. For the current works of art, there is no going back: Like the library of Alexandria, decades of knowledge can never be replaced. At least, the microcomputer revolution meant much technical data about them (eg. the Radio-Shack MS Basic, the App
Dear DARPA (Score:1)
Just get the source (Score:2)
Wouldn't it be easier to just get the source? And if you can't, why are you even using the software?!