Programming Security

Mystery of Duqu Programming Language Solved 97

Posted by samzenpus
from the solving-a-mystery dept.
wiredmikey writes "Earlier this month, researchers from Kaspersky Lab reached out to the security and programming community in an effort to help solve a mystery related to 'Duqu,' the Trojan often referred to as 'Son of Stuxnet,' which surfaced in October 2011. The mystery rested in a section of code written in an unknown programming language and used in the Duqu Framework, a portion of the Payload DLL used by the Trojan to interact with Command & Control (C&C) servers after the malware has infected a system. Less than two weeks later, Kaspersky Lab experts now say with a high degree of certainty that the Duqu framework was written using a custom object-oriented extension to C, generally called 'OO C,' and compiled with Microsoft Visual Studio Compiler 2008 (MSVC 2008) with special options for optimizing code size and inline expansion."
This discussion has been archived. No new comments can be posted.

  • I guess aliens don't exist.
  • A link to the actual code snippet would've been nice; I'd love to see the structure and logic behind it.
    • Re:Let's See It (Score:5, Informative)

      by Baloroth (2370816) on Monday March 19, 2012 @11:56AM (#39403829)
      • Re: (Score:2, Interesting)

        by Ihmhi (1206036)

        You know, I wonder if the antivirus suites of the future will be able to see stuff like this being written. Like "oh no, he is using emacs/vi and writing a php injection script - perhaps this is something we should look into specifically". I don't think heuristics of this sort would be any more onerous than the deep sort of file scanning that antivirus suites already do.

        As an aside, Kaspersky is fantastic and aside from a small hiccup a year or two ago where they lost some CC data (and handled it pretty wel

        • by s.petry (762400)

          This is already done to a large degree, at least with what matters in binary code. The "Script kiddie" tools are extremely well documented. This goes way back in time to when a tool came out called (I hope I'm remembering the name right) VCL or Virus Creation Laboratory. It became pretty easy to identify VCL-based code, and the tool set pretty much evaporated.

          What editor you use is really unimportant. The compiler is what counts, and the compiler never sees your editor.

        • by ShaunC (203807)

          You know, I wonder if the antivirus suites of the future will be able to see stuff like this being written. Like "oh no, he is using emacs/vi and writing a php injection script - perhaps this is something we should look into specifically"

          I can't imagine that someone with enough technical ability to create the "mystery" Duqu code isn't already doing their development in a sandboxed VM with no AV apps installed. I doubt it's worth the time on the AV companies' part to attempt to detect the act of malware actually being written.

        • by CSMoran (1577071)

          You know, I wonder if the antivirus suites of the future will be able to see stuff like this being written. Like "oh no, he is using emacs/vi and writing a php injection script - perhaps this is something we should look into specifically".

          Real programmers use cat anyway :).

        • by tehcyder (746570)

          (For some of my customers, I tell them to immediately call me on a red light. Makes my job easier lol.)

          Does it play that old "red light spells danger - can't hold out much longer" song to reinforce the point?

          Because that would be like a bucket full of awesome.

  • by ElmoGonzo (627753) on Monday March 19, 2012 @11:55AM (#39403815)
    they may have learned MASM to avoid detection.
  • Source Code? (Score:4, Insightful)

    by deemen (1316945) on Monday March 19, 2012 @11:57AM (#39403843)
    How did they deduce it was an unknown programming language? By looking at the compiled machine code? How could they tell this wasn't just regular C?
    • Re:Source Code? (Score:5, Informative)

      by CaptainJeff (731782) on Monday March 19, 2012 @12:05PM (#39403923)
      Different languages compile down very differently. Indeed, different compilers compile the same source code differently (try comparing GCC output to Visual Studio output and you'll see some obvious differences in how the assembly/machine code is crafted). In this case, there were clear signs of an object-oriented approach (data and functions were located around each other in memory, which is not likely to happen in non-OO languages, etc).
      • by CSMoran (1577071)

        In this case, there were clear signs of an object-oriented approach (data and functions were located around each other in memory, which is not likely to happen in non-OO languages, etc).

        I agree with the gist of your statement, but I don't think the OOP source-level organization of "data close to methods" is reflected in the generated machine code or the intermediate assembly. I'd wager data would be placed in non-executable memory segments, far from where the code ('text') resides. When you print out values of pointers you can often recognize what lives on the stack, what is heap-based data and what is a function pointer just by looking at address ranges.
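
The address-range reasoning in the comment above can be sketched roughly in Python. The region names and address ranges here are invented for illustration (a hypothetical 32-bit process layout) and are not taken from Duqu or any real memory map; a real layout would come from the PE headers or a debugger.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str         # label such as "text" or "heap"
    start: int        # first address in the region (inclusive)
    end: int          # one past the last address (exclusive)
    executable: bool  # True if code is expected to live here


# Hypothetical layout, for illustration only.
REGIONS = [
    Region("text", 0x00401000, 0x00420000, True),
    Region("data", 0x00420000, 0x00430000, False),
    Region("heap", 0x00A00000, 0x00C00000, False),
    Region("stack", 0x0012F000, 0x00130000, False),
]


def classify(address: int, regions=REGIONS) -> Optional[Region]:
    """Return the region containing address, or None if it is unmapped."""
    for region in regions:
        if region.start <= address < region.end:
            return region
    return None


def describe(address: int) -> str:
    """Give a short human-readable guess about what a pointer refers to."""
    region = classify(address)
    if region is None:
        return "unmapped"
    kind = "function pointer?" if region.executable else "data pointer"
    return f"{region.name} ({kind})"
```

With this table, `describe(0x00401230)` lands in the executable range and is flagged as a likely function pointer, while a value such as `0x0012FF7C` falls in the stack range. The question mark is deliberate: an address range only suggests what a pointer refers to.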

    • Re:Source Code? (Score:5, Informative)

      by tomhath (637240) on Monday March 19, 2012 @12:06PM (#39403933)
      It seems they recognized a sequence of instructions [securelist.com] that are typical of a class constructor, just not like any class constructor they were familiar with.
      • by robi5 (1261542)

        The GP's question was something else - how did they initially tell it was _not_ regular C (as that obviously lacks the fingerprint of OO techniques of C++). Or if it didn't look like regular C or anything else, why didn't they just assume it was written in assembly, or some other rare machine code generating language like Common Lisp?

        • Re:Source Code? (Score:5, Informative)

          by djdanlib (732853) on Monday March 19, 2012 @01:05PM (#39404675) Homepage

          They did open the lines up for suggestions, and some community members suggested that it looked like OO C. How did they know? They probably had experience using and debugging OO C, if I had to guess. There were also plenty of people who said that it definitely wasn't compiler X or language Y from their own experiences. The article links to this discussion: http://www.securelist.com/en/blog/677/The_mystery_of_Duqu_Framework_solved [securelist.com]

          But about discovering the specifics of the truth? It's probably like you alluded to in your comment - fingerprinting the machine code. It would take a while, but you could come up with fingerprints for a great many various compilers and features. You could do that for Common Lisp, too. (In fact, someone DID suggest for them to look at various LISP dialects.) It has taken long enough that such a scenario - having a good library of fingerprints - is believable. Given a scanner with a dictionary of fingerprints, one could reasonably say that you either have hand-assembled machine code made to mimic another language, or that you have code generated by a very specific language and compiler. If nothing in your library of fingerprints matched, assuming you had a good handle on hand-assembling machine code, you could look and see if it smells like such a beast. It would be tremendously laborious to hand-assemble code to make it look like a specific compiler generated it, and why would you do that in the first place? I fail to see the benefit when you could just use that compiler. If you were trying to throw off the analysts with a false positive match, there would still be a ton of mysterious data that still needs examination.

          Think about DNA analysis. We can look at our DNA and determine some chunks of it came from viruses, and that some of it is "junk" that serves no purpose.

          Also think about image analysis like OCR or various captcha-breaking software. You can map images to characters with a program, and detect anomalies and known signatures.

          Then there is heuristic antivirus scanning. It knows enough to find some previously unthought-of malicious code, even if it does sometimes generate false positives.

          So why not apply those techniques to machine code, and see what you get? If multiple methods give you similar results, you would be onto something, I imagine.
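
A toy version of the "dictionary of fingerprints" idea from the comment above, sketched in Python. It assumes 32-bit x86 code and uses only two illustrative fingerprints: both byte sequences encode `push ebp; mov ebp, esp`, but MSVC and GCC commonly choose different encodings of the `mov`. The labels, the table and the `min_hits` threshold are assumptions made for this sketch; a real fingerprint library would be far larger and would not rest on a single idiom.

```python
from collections import Counter

# Illustrative fingerprints only: two encodings of the same frame prologue.
FINGERPRINTS = {
    "MSVC-style": bytes.fromhex("558bec"),  # push ebp; mov ebp, esp (8B EC)
    "GCC-style": bytes.fromhex("5589e5"),   # push ebp; mov ebp, esp (89 E5)
}


def count_matches(code: bytes, pattern: bytes) -> int:
    """Count non-overlapping occurrences of pattern in code."""
    return code.count(pattern)


def guess_compiler(code: bytes, min_hits: int = 2):
    """Return the best-matching label, or None if nothing matches well enough."""
    hits = Counter(
        {name: count_matches(code, pat) for name, pat in FINGERPRINTS.items()}
    )
    name, best = hits.most_common(1)[0]
    return name if best >= min_hits else None
```

Returning `None` when no fingerprint reaches the threshold mirrors the situation described in the thread: nothing in the library matched, so the output was simply "unrecognized" until someone supplied a new fingerprint.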

          • by EnempE (709151)
            Perhaps they got what I think they were hoping for, which was someone involved with creating the program giving them a tip.

            By putting out a public call like that they created a forest of opinions from all over the internet, perfect for a tree that wanted to help but didn't want to get chopped down as a result.
          • by drinkypoo (153816)

            Think about DNA analysis. We can look at our DNA and determine some chunks of it came from viruses, and that some of it is "junk" that serves no purpose.

            Except that it's recently been discovered that more of that "junk" has a purpose than we thought...

    • How did they deduce it was an unknown programming language? By looking at the compiled machine code? How could they tell this wasn't just regular C?

      I suppose that you could possibly tell what compiler was used by the arrangement of the machine code, but I still don't see what the point is. Who cares if it was written in assembly language, C or Atari Basic?

      • Re:Source Code? (Score:5, Insightful)

        by Sarten-X (1102295) on Monday March 19, 2012 @12:15PM (#39404027) Homepage
        Knowing the language and techniques used can speed up analysis of future variants found, because they'll know what patterns to look for first.
      • by tlhIngan (30335)

        I suppose that you could possibly tell what compiler was used by the arrangement of the machine code, but I still don't see what the point is. Who cares if it was written in assembly language, C or Atari Basic?

        Because knowing the compiler and version helps analysis - each compiler tends to emit code for the same statement very differently. Once you know the compiler, the idiosyncrasies in the way it emits code are understood, and that makes reversing the assembly back to C much easier.

        Analyzing assembly code is d

        • Re:Source Code? (Score:5, Interesting)

          by b4dc0d3r (1268512) on Monday March 19, 2012 @03:12PM (#39406271)

          To tag along - it's hard to tell data from code, and it helps the decompiling app to detect what is code vs. data if it knows which compiler created it.

          It looks like the original blog used IDA Pro, which has library signatures for different compilers. It can identify functions and auto-comment the code, making disassembly easier. It can also auto-identify stack variables and keep track of them through lots of PUSH, POP and RET n instructions; it's quite powerful.

          In this case, IDA probably gave a lot of erroneous warnings or disassembled data or refused to disassemble code, requiring lots of manual work. The classes apparently were done inconsistently, making it hard to even write a plug-in to automatically detect them (scripts exist to identify MSVC objects through their RTTI properties, and do a decent job identifying non-RTTI classes, but this would not work with this code).

          http://www.hex-rays.com/products/ida/index.shtml [hex-rays.com]

          When reverse engineering, and your tool basically says "WTF do I do with this?" it's one of those moments where you want to know how the attacker made it.

          Is it hand-rolled? Or a new attack creation kit that script kiddies can cobble something together using?

          And "unknown language" was not a really good way to describe it. "Unrecognized output" would have been better. The assumption is that a language like C would compile to a C-like syntax, C++ would do things differently. But it could have been just C++ with an unknown compiler.
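
Library-signature matching of the kind mentioned above can be sketched as masked byte patterns, where bytes that vary between builds (call targets, addresses) are treated as wildcards. This is only an illustration in Python: the pattern syntax, the `SIGNATURES` table and the name `example_helper` are hypothetical and do not reflect IDA Pro's actual signature files or API.

```python
from typing import Dict, List, Optional

Pattern = List[Optional[int]]  # None means "any byte"


def parse_pattern(text: str) -> Pattern:
    """Turn '55 8B EC E8 ?? ?? ?? ??' into a list of ints and wildcards."""
    return [None if tok == "??" else int(tok, 16) for tok in text.split()]


def matches_at(code: bytes, offset: int, pattern: Pattern) -> bool:
    """Check whether pattern matches code starting at offset."""
    if offset + len(pattern) > len(code):
        return False
    return all(p is None or code[offset + i] == p for i, p in enumerate(pattern))


def find_all(code: bytes, pattern: Pattern) -> List[int]:
    """Return every offset in code where pattern matches."""
    return [off for off in range(len(code)) if matches_at(code, off, pattern)]


# Hypothetical signature: frame setup, then a relative call (E8) whose
# 4-byte displacement differs from build to build, then pop ebp; ret.
SIGNATURES: Dict[str, Pattern] = {
    "example_helper": parse_pattern("55 8B EC E8 ?? ?? ?? ?? 5D C3"),
}


def identify(code: bytes) -> Dict[str, List[int]]:
    """Map each signature name to the offsets where it was found."""
    return {name: find_all(code, pat) for name, pat in SIGNATURES.items()}
```

When every signature comes back with an empty list, the tool has nothing to label, which is roughly the "WTF do I do with this?" situation described above.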

      • Re: (Score:2, Insightful)

        by UnknownSoldier (67820)
        I can tell you have never taught another programmer nor learned the benefits of reverse engineering so you can write better code! e.g. I used to work on a professional C/C++ compiler for consoles. Customers would sometimes ONLY provide assembly code and it was your job to figure out why the compiler was generating invalid code.

        Here is a perfect example -- a friend of mine was taking a CS course and the assembly code the prof provided was absolute shit -- a perfect example of how to NOT write co
      • Re:Source Code? (Score:4, Informative)

        by plover (150551) * on Tuesday March 20, 2012 @01:16AM (#39410565) Homepage Journal

        It's only a clue, not an answer. But it's one data point more than they had before. And they need somewhere to start looking for the author.

        OO C is very interesting. C++ developers are a dime a dozen (OK, it's 2012, we're four for a quarter.) And you can't swing a dead cat around here without hitting a C coder. But OO C developers are a subset of a subset of people. Nobody who sets out to write a virus for the first time says "I should download a four year old compiler for a language I know nothing about and start writing my virus." They don't read in their copy of "Virus Creation Lab for Dummies" book where it says to torrent a copy of Visual Studio 2008, then download some GNU OO C framework for it. This is a tool that a limited set of experts uses for their day jobs. Possibly it's something a laid off software engineer would still have on his home machine. It might be code generated by a custom library that some gaming house wrote for their own internal stuff, and that by pattern matching with commercial software products they might be able to find the company of origin. They can go back and figure out who they fired in the last three years, and who now is driving the Ferrari. Maybe there's an OO C Google Group this guy participates in. Maybe he published a bogus "please help me with my homework" question on stackoverflow, and they can match some source code to some object code.

        Or maybe it doesn't help find the guy today, but tomorrow if they haul a potential perpetrator before a judge, they can provide as corroborating evidence to the jury that the person who wrote this code was very specialized in his knowledge of this esoteric tool, and the defendant worked with this tool every day.

        Whatever that clue might be, it could be useful knowledge to someone hunting down the author. Either way, it certainly has value.

    • Re:Source Code? (Score:5, Insightful)

      by Baloroth (2370816) on Monday March 19, 2012 @12:09PM (#39403973)

      There are certain characteristics to the way C++ behaves (the manner in which you pass parameters, etc). Mainly, through having looked at lots and lots of code samples, they can say what they expect the compiled code to look like. If they know C++ compiled code looks like x, regular C looks like y, and this looked like z, it can't be C. Essentially, the code did things you simply can't do in C++ or C (even Objective C) by itself. The problem is, that method only allows you to compare to known languages. More details here [securelist.com].

      It's basically like identifying an animal by footprint. Once you know a deer leaves a certain kind of footprint, you can identify more deer by examining footprints. But you can't identify an unknown animal that way: if you haven't seen a given footprint before, you won't know what animal it is, only what general characteristics it has (weight, etc.)

  • by JoeCommodore (567479) <larry@portcommodore.com> on Monday March 19, 2012 @12:04PM (#39403913) Homepage

    A well publicized article featuring Microsoft Development products of all things, I think they should use that PR in their Microsoft Visual Studio Ads...

  • If you can disassemble it then who cares whether it was written in OO C , C++ or Logo? I don't see why it mattered so much. Just follow the assembler.

  • by Anonymous Coward

    Objective C but then for the MS platform?

  • Here is an older post about it: http://lambda-the-ultimate.org/node/4476 [lambda-the-ultimate.org]

  • by j33px0r (722130) on Monday March 19, 2012 @12:44PM (#39404405)

    FTFA:

    Why did the authors of Duqu use OO C? While there is no easy explanation why OO C was used instead of C++ for the Duqu Framework, Kaspersky experts say there are two reasonable causes that support its use [More control over the code & Extreme portability]. These two reasons indicate that the code was written by a team of experienced ‘old-school’ developers

    Why OO C? Because it worked, because they knew how to use it, because they knew it would throw Kaspersky for a loop, because they thought it was cool. There are many, many reasons and they do not all have to be logical.

    Kaspersky experts might want to consider that the programming wheel of life may have turned and that what was once old-school is now new-school. Who's to say that the underestimated script kiddies cannot grow up to be formidable adults with a whole new bag of tricks?

    • by plover (150551) *

      Occam's razor. The simplest answer is usually correct. That drives an awful lot of investigations.

      Despite the twists and turns that you see in TV crime dramas, most real world bad guys aren't quite that clever at hiding all of their tracks. Sure, they're going to hide the obvious ones they know they're leaving. They will use hacked proxies to deliver their code. They'll use sophisticated command and control networks to make sure nobody can track them back to the actual box making the inputs. They'll h

  • by Ukab the Great (87152) on Monday March 19, 2012 @01:11PM (#39404737)

    For O'Reilly's "Mastering Duqu"?

  • Why does this matter? If it is a compiled program it is just a bunch of instructions. If the OS lets the instructions run, it doesn't much matter what compiler/language was used, other than how efficiently it will do the crap it is told to.

    • by Anonymous Coward

      Because it gives hints about the _people_ that wrote it.

    • by xanthos (73578)

      As a couple of others have stated, it is important in identifying who may be behind the code. "Authors" in certain parts of the world tend to use a certain set of tools for financial fraud, another group uses a different set of tools for industrial espionage, yet others may use either set of tools to mimic these groups while they do plain old espionage for a nation state.

      As a defender, you probably are more worried about one group than the others. A small startup data mining firm is probably more worried

    • by jonwil (467024)

      If you know the libraries that a program is using, it can make it easier to reverse engineer.

      • Ah, good point. I suppose, too, that if the skills are rare enough, just knowing the programming language might narrow it down enough to get to the few likely culprits.

  • The code was written by someone with some very serious Assembler skills.

    ANYTHING that can be written in any higher level language can be written in Assembler and that is an indisputable fact.

    • by b4dc0d3r (1268512)

      It was too consistent to be compiler intrinsics, but not consistent enough to be straight assembly. That's the impression I got from the original blog post.

      No question it would have been possible, but given the rest of the code was compiled in MSVC it made sense that some sort of macro, framework, toolkit, or something was in between the source and the output.

  • Smarter than you think. I remember reading somewhere that US radio operators in WWII used a Native American language to communicate with each other. No amount of analysis will give you any insight if the other party is careful not to leave any trails. To translate one language into another mechanically requires deep knowledge of both languages.

    If you rolled your own language with its own grammar, you can be secure in the fact that *even* deep analysis will not yield any clues, not at least by the c
