Programming Privacy Science

Anonymous No More: Your Coding Style Can Give You Away

itwbennett writes Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a "code stylometry" that uses natural language processing and machine learning to determine the authors of source code based on coding style. To test how well their code stylometry works, the researchers gathered publicly available data from Google's Code Jam, an annual programming competition that attracts a wide range of programmers, from students to professionals to hobbyists. Looking at data from 250 coders over multiple years, averaging 630 lines of code per author, their code stylometry achieved 95% accuracy in identifying the author of anonymous code (PDF). Using a dataset with fewer programmers (30) but more lines of code per person (1,900), the identification accuracy rate reached 97%.
  • by msobkow ( 48369 ) on Wednesday January 28, 2015 @03:53PM (#48927173) Homepage Journal

    Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?

    • by Marginal Coward ( 3557951 ) on Wednesday January 28, 2015 @03:57PM (#48927225)

      It seems like using the applicable features of the corporate version control system would be a lot easier - and possibly even better than 95% accurate.

    • That's what "git blame" is for...

      /me ducks and runs like hell...

    • It's not just limited by corporate code. Good luck doing this on pep8 Python.
    • I've always found that even with style guidelines in place, developers will still leave their fingerprints all over it.

      Some devs will be verbose in their comments, some less. Some devs will embrace IoC where others shun it. Some devs will create a single method with all code in it, some will refactor the heck out of it with many methods. Heck, devs can't even agree sometimes on what should be public, protected, and private (and rarely will style guidelines dictate this kind of thing).

    • by jellomizer ( 103300 ) on Wednesday January 28, 2015 @05:09PM (#48927841)

      Perhaps not as well. If people are following the coding standards for the organization then the code for the most part looks far more similar.

      When I am working with a development team, I will tend to adjust my unique style to better match what everyone else is doing. Even if it means doing coding methods that I will normally disagree with.

      If the code tends to use a bunch of GOTOs instead of procedures or classes, I will use those GOTOs, not for my benefit, but for the people who will maintain my code later on, so they won't have to change their mindset and debugging strategies to see what the program is doing when making future corrections.

      I will go full Object Oriented if the group of people that I am working with do their coding full OO.

      My personal style would be more procedural, than OO. Not due to lack of knowledge or not realizing OO advantages and disadvantages. But if I am to code on my own, I code in the way that My Mind handles the requirements, and how I feel would be easier for me to change and fix my code in the future.

      I think this method is best for ID based on personal code, vs group corporate code, where a lot of your particular style is hidden.

      • by rtb61 ( 674572 )

        Just curious, how are larger companies going with algorithm libraries and variable naming rules to ensure maximum re usability of code (variables named by function rather than named by application). Any change, is most of it done from scratch, any fancy algorithm data bases with search functions based upon algorithm descriptors and software engineering. Also things like software language translators or the same algorithms stored in different languages. Any shift away from writing code to more assembling al

    • by AK Marc ( 707885 )
      Even if they build up a database of 100% of written code, how can they identify me if I only copy and paste code from others?
    • Similarly I was thinking this would probably be defeated by a "minifier", obfuscator, or anything along those lines. There are dozens to choose from for most languages and it would be trivial for anyone attempting to remain anonymous to use them on their releases.

      If you want the code to remain usable, there are tools to enforce a standard style instead, in which case just set it up with rules based on a popular project if your language of choice doesn't have a specific style. At that point you're down to

      • by Mr Z ( 6791 )

        Did you read the part in the article where they're actually doing the matching based on the ASTs (abstract syntax trees), and so are able to identify authors even after the code goes through an obfuscator? Relevant quotes:

        Their real innovation, though, was in developing what they call “abstract syntax trees” which are similar to parse tree for sentences, and are derived from language-specific syntax and keywords. These trees capture a syntactic feature set which, the authors wrote, “was c
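To illustrate why tree-based features are harder to hide than layout, here is a minimal sketch using Python's standard `ast` module. This is an assumption-laden stand-in, not the researchers' feature set: the function `node_type_counts` and the three sample snippets are hypothetical, and the real study worked on Code Jam submissions rather than on toy Python functions.

```python
import ast
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count how often each AST node type appears in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


# Hypothetical snippet in one author's habitual style.
original = """
def total(values):
    result = 0
    for v in values:
        if v > 0:
            result += v
    return result
"""

# Same structure after a crude "obfuscation": names and layout changed.
renamed = """
def a(b):
    c = 0
    for d in b:
        if d > 0: c += d
    return c
"""

# A structurally different way to compute the same thing.
rewritten = """
def total(values):
    return sum(v for v in values if v > 0)
"""

# Renaming and reformatting leave the node-type profile untouched,
# while a different structural choice changes it.
same_shape = node_type_counts(original) == node_type_counts(renamed)
different_shape = node_type_counts(original) != node_type_counts(rewritten)
```

The point of the sketch is only that renaming identifiers and squeezing whitespace does not change which constructs an author reaches for, whereas choosing a loop over a generator expression does.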

    • by Gorobei ( 127755 )

      Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?

      I was starting to wonder about that, then realized we at $BIGCORP are already generating ASTs from your input buffer, unifying those trees with a bunch of patterns, and telling your editor to flag questionable constructs. You type "if not foo in x" and 50ms later you get a proposed improved snippet. It's pretty rare to see quirky style in our codebase.

  • by TWX ( 665546 ) on Wednesday January 28, 2015 @03:58PM (#48927233)
    ...based on the quality of that code...
    • by halivar ( 535827 ) <bfelger@gmai l . com> on Wednesday January 28, 2015 @04:26PM (#48927515)

      goto blah;
      ^^ Idiot.

      // If you don't know why this is here, don't fuck with it.
      goto blah;

      ^^ Code guru.

      • by lgw ( 121541 ) on Wednesday January 28, 2015 @04:53PM (#48927721) Journal

        For lack of mod points let me just say: beautiful!

        It's like this in any engineering discipline:
        * The apprentice doesn't do things by the book, for he thinks himself clever
        * The journeyman does everything by the book, for he has learned the world of pain the book prevents
        * The master goes beyond the book, for he understands why every rule is there and no longer needs the rules

        Or put another way - the apprentice thinks he knows everything, the journeyman knows how little he knows, the master knows everything in the field, and still knows how little he knows.

      • by c ( 8461 )


        try { ...
              throw BlahException("blah");
        } catch(Exception& blah) { ...
        }
        ^^ Idiot.

      • by ihtoit ( 3393327 )

        if I were the programmer (I'm not, not since primary school when I programmed the TURTLE to draw stuff on large sheets of cartridge paper) I'd be dropping //remarks in everywhere. Back to when I did TURTLE programming, I got berated for wasting time on comments but when it came down to 1000+ lines of code, it was nice to know which draw routines drew what part of the image. My TURTLE St. Paul's Cathedral was 7,700+ lines of code, probably 3/4 of that was comments. If it were stripped of comments it'd probab

      • // exception was found
        // beyond here be dragons, run
        // make your escape now
        goto blah;

        ^^ code master

    • by ranton ( 36917 )

      This doesn't seem so far fetched. I'm not sure the field of natural language processing is that far away from being able to create metrics which would determine the skill of developer by looking at their code. It could then be used by employers during the hiring process and during reviews.

      While that may sound like a nightmare scenario (and it very well could be), a more intelligent software system may even be able to show why it thinks the code is bad, and give an interviewer or reviewer the chance to ask w

  • Using this technique, can they tell us if the NSA did write the Regin Malware [slashdot.org] now?

    • I want to see it run Regin against sections of code in gnu/linux/systemd and see if the same NSA shills wrote any of it.
  • by Anonymous Coward on Wednesday January 28, 2015 @04:00PM (#48927255)

    Can we use this to find Satoshi?

  • No Kidding (Score:5, Insightful)

    by invid ( 163714 ) on Wednesday January 28, 2015 @04:09PM (#48927367)
    I can usually tell who wrote the code in the office by whether or not they put a space after their ifs: if(i == 0) vs if (i == 0); where they put their brackets, whether or not they replace their tabs with spaces, how they deal with bools: if (!var) vs if (var == false) and several other telling signs. There are so many combinations of variations no two programmers in the office (about 12 of us) have the same style.
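The tells listed above can be counted mechanically. The following is a rough sketch, assuming C-like source text and a handful of regular expressions. The function name `layout_features`, the feature names, and both samples are made up for illustration, and a real tool would need a proper tokenizer to avoid matching inside strings and comments.

```python
import re


def layout_features(source: str) -> dict:
    """Count a few surface-level style habits in C-like source text."""
    return {
        "if_no_space": len(re.findall(r"\bif\(", source)),
        "if_space": len(re.findall(r"\bif \(", source)),
        "negation_test": len(re.findall(r"\(\s*!\s*\w+\s*\)", source)),
        "compare_false": len(re.findall(r"==\s*false\b", source)),
        "tab_indent": len(re.findall(r"^\t+", source, flags=re.MULTILINE)),
        "space_indent": len(re.findall(r"^ +", source, flags=re.MULTILINE)),
    }


# Two hypothetical coworkers writing the same logic.
sample_a = "if(i == 0) {\n\tdone = true;\n}\nif(!ready) {\n\twait();\n}\n"
sample_b = (
    "if (i == 0)\n{\n    done = true;\n}\n"
    "if (ready == false)\n{\n    wait();\n}\n"
)

features_a = layout_features(sample_a)
features_b = layout_features(sample_b)
```

With only six counters the two samples already produce disjoint profiles, which is roughly the intuition behind telling a dozen officemates apart by eye.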
    • I could do the same. Not only that, but I could often also tell who had originally trained that person, because part of the trainer's style often leaked into their style.

      I work at a university and we hire 100-level CS students, so we generally assumed they knew nothing and trained them from scratch.

    • Yeah, about that... I start twitching whenever my boss types: MyFunction (arg1, arg2) and so on. Who puts a space after the function name before the '('? People who must die, of course.

      OK, calming down now.. 1.. 2.. 3.. 4.. 5..

      No, I'm OK, really.

      I had an old boss who was a code style nazi. He was an asshole. And actually, my current boss is very cool, even if he codes like that.

      • by AK Marc ( 707885 )
        If the whitespace is meaningless, it should be eliminated (carriage returns excepted). However, I can understand people who add in meaningless whitespace, as sometimes a + b is easier to read than a+b, even if they are interpreted the same.
        • So, you don't indent code? Or if you do, at what point is the indent meaningless (how many spaces/tabs) ... ? No spaces after semicolons? Or before/after braces? Or ...

          Readability should count as meaningful. It helps. And the compiler strips it out anyways, right, so ultimately it doesn't matter, just like comments, except in helping understand the code.

          I may be misunderstanding something completely in what you said... but I don't get why you would say it should be removed. Maybe in javascript for net

          • by AK Marc ( 707885 )
            Indent isn't meaningless. But there's no reason to double-space an indent. It carries a reading meaning, related to nesting of code.

            Code "feels" smaller when it's compact. Also, having a single spacing method uniform across everyone makes for easier cut-and-paste sharing. Having one person space things differently than another will result in decreased readability.
    • I once worked on a project that had a handful of developers, where each developer was in charge of one code for one of the software subsystems of the project. We didn't have much of a coding standard there - only about one page - but we ended up with a consensus coding style in the project that everybody could live with. Even so, you could always tell who wrote what by the personality shown around the edges of the coding style of a given module, function, or even over just a few lines.

    • by PRMan ( 959735 )
      And in Visual Studio, I hit Ctrl+K Ctrl+D all the time, which puts my code into "Standard" Microsoft format. If everyone did this, I imagine the analyzer would drop to 50% or lower.
      • by ihtoit ( 3393327 )

        coding to book (sans comments) will kill the process of identifying authors stone dead, I think. If everybody's "Hello World!" was identical, how do you tell the difference?

    • if (false == var) prevents accidentally assigning false to var if you forget to use double equals
  • I suppose all those "// damn U bill gates!" comments gave me away

  • by meerling ( 1487879 ) on Wednesday January 28, 2015 @04:16PM (#48927413)
    When I was a kid in the 80s we figured out we could identify who wrote a particular piece of software by looking at its code. We used those individualistic and identifiable features to support the art side in the argument over whether programming is an art or a science.
    • by Virtucon ( 127420 ) on Wednesday January 28, 2015 @04:19PM (#48927449)

      It's all about style. Writing software is very creative and it needs to have the author's fingerprints on it somewhere. If corporations don't like that they can suck the source code into a parser and spit out perfectly mundane crap that loses the intonation and the thoughts the original developer had for it.

  • by Crashmarik ( 635988 ) on Wednesday January 28, 2015 @04:18PM (#48927435)

    1985 Hugo Winner

    Really, the fact that coding style is recognizable was so well known it made it into pop culture 30 years ago.

    Also, on the smaller sample size the program might just be recognizing the parts of the style that come from the corporate standards. It would be interesting to see if it could recognize code from people who all work at the same company.

    • But I can't recall an instance.

      • by AJWM ( 19027 )

        Vinge is considered one of the fathers of cyberpunk because of his "True Names", which did precede Varley's chilling (and Hugo-winning) "Press Enter[]" (1981 vs 1985).

        On the other hand, Varley's much earlier (1976) "Overdrawn at the Memory Bank" was also one of the seminal works of the field.

        Been a while since I've read it, but the warlocks (hackers) in "True Names" would never have let their identity (true name) be determined from their coding styles.

  • I guess we can expect that source code repositories will be scanned and processed. And, for code written by multiple authors, the modified code (from commits) will be scanned and indexed as well.

    But, I bet they will never figure out who writes the malware recently attributed to the three letter agencies. They should, however, be able to figure out which agency writes the stuff if they get a copy of the source code or maybe even from decompiling the binary.

    Additionally, if written from .NET, the CLR code c

    • by Shados ( 741919 )

      Back in the days of .NET 1~2, decompiling via Reflector or whatever other tool got you back pretty good stuff. Today, there's a LOT more sugar, from LINQ to async/await and everything in between. If you go back to the original language, good decompilers sometimes infer what the original sugar was from the output following certain conventions and patterns...but moving that to another language will give you unreadable garbage.

      Reading F# in C# , this>but,worse>

      • by Shados ( 741919 )

        Bah, formatter messed things up. The last line was me joking about the crazy nested generic chains that F# types end up looking like in a language that doesn't support the same syntax sugar.

  • "The key to this system being used is, of course, first obtaining the code stylometries for a wide range of developers. The authors didn't address how, say, a database of programmers’ styles would be compiled. Also, to identify the author of a piece of code would require access to the source code, and not just executables, though the authors mention there is some evidence that style is preserved in binaries."
    -> so once you post to github and similar 'they' can link every code you ever write to you,...

  • Comment removed based on user account deletion
  • by TrollstonButterbeans ( 2914995 ) on Wednesday January 28, 2015 @04:25PM (#48927505)
    If your coding is terrible and very newbie like, they can't single you out since your code is similar to the ocean of other terrible coders.

    So if you are a paranoid freak, the best way to ensure your safety and keep the government off your back is to write terrible code.
  • by Krazy Kanuck ( 1612777 ) on Wednesday January 28, 2015 @04:30PM (#48927545)
    Not that many of us actually use comments.... http://xkcd.com/1421/ [xkcd.com]
  • by jgotts ( 2785 ) <jgotts@gmaCOLAil.com minus caffeine> on Wednesday January 28, 2015 @04:31PM (#48927555)

    Most programming isn't writing new code. Most programming is working on someone else's crap you inherited. Invariably, you're going to be using that person's style or else the result will look like garbage.

    There is also the problem that most non-trivial code is worked on by multiple people at the same time.

    Writing some code from scratch as an assignment is a very artificial exercise nowadays, unless you're in a classroom setting. Therefore, you're going to get a signature from a programmer doing atypical work.

  • by MouseTheLuckyDog ( 2752443 ) on Wednesday January 28, 2015 @04:32PM (#48927559)

    95% of 250 coders. That means that out of a million programmers they will misidentify 200000.

    I suspect that there are too few variances in style to make any coder's style unique. For example, whether to use braces on a one-line statement after an if in C.

    With a few programmers it's likely to work, but when the possible source of programmers is the world...

    Not to mention emacs, Visual Studio and such enforcing some indentation standards and programming languages enforcing others.

    • by Rinikusu ( 28164 )

      Okay, I just woke up from a nap, but could you show your math there? Maybe I'm missing something because I come up with.. 50k, not 200k...

    • by Ksevio ( 865461 )
      I find the statistics dubious as well - they also dropped the dataset to nearly 1/10 while roughly doubling the code input and the results were 2% better, so it's possible if we follow the trend it will reach the 20% you seem to quote.
    • by Kjella ( 173770 )

      What complete and utter bullshit.

      95% of 250 coders. That means that out of a million programmers they will misidentify 200000.

      You know it's not a contest to come up with the worst bullshit. If you're left with one person 95% of the time when you have 249 possible wrong answers, it's like being left with 4000 people when you have 999999 wrong answers. If all those are too close to tell apart you'll misidentify >99.9%.

      Imagine for example that you wanted to find people by height and weight, as measured to nearest cm and kilo. It might work decently on a small group, but if you scale it up to a million people there'll be a lot of d

    • It's 50,000.

      Or for the study, the 12 people who code exclusively in assembly.
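For anyone checking the numbers in this subthread, here is the arithmetic as a small sketch. It assumes, unrealistically, that the 95% accuracy from the 250-coder dataset would carry over unchanged to a million candidates, which is exactly what Kjella disputes above. The function name `expected_misses` is made up for illustration.

```python
def expected_misses(population: int, accuracy_percent: int) -> float:
    """Expected misidentifications if the accuracy rate stayed constant."""
    return population * (100 - accuracy_percent) / 100


study_misses = expected_misses(250, 95)            # 12.5
million_misses = expected_misses(1_000_000, 95)    # 50000.0

# How many times larger the candidate pool becomes: 4000.0
scale_factor = 1_000_000 / 250
```

So a constant 5% error rate gives 50,000 misses per million, not 200,000, and about 12 misses in the study itself. The 4000x growth in the candidate pool is where the "left with 4000 people" figure comes from.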

  • by Maxo-Texas ( 864189 ) on Wednesday January 28, 2015 @04:47PM (#48927675)

    Write a version of pretty-printer that rerenders your code into a different style.

    Have a lexicon of mipelled words for each "personality".

    Another lexicon of variable names.
    a vs inta vs int_a vs x.

    Refactoring and unfactoring for subroutines.

    Run the comments through google translate and back to english.
    ukrainian
    japanese
    chinese

    Synonym and antonym substitution in the comments.

    The mind dances at the possibilities to mess with this algorithm.
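The pretty-printer step is the easiest of these to try. Below is a minimal sketch for Python source, assuming Python 3.9 or later for `ast.unparse`. The helper name `normalize_layout` and the three sample strings are hypothetical. It only shows what re-rendering removes and what it leaves behind; it is not a claim that this would defeat the classifier described in the summary.

```python
import ast


def normalize_layout(source: str) -> str:
    """Re-render Python source in one canonical layout (needs Python 3.9+)."""
    return ast.unparse(ast.parse(source))


# Same function typed by two hypothetical authors with different habits.
style_one = (
    "def f(x):\n"
    "    if x>0 :\n"
    "        return x*2   # double it\n"
    "    return 0\n"
)
style_two = "def f( x ):\n\tif (x > 0):\n\t\treturn x * 2\n\treturn 0\n"

# Same layout as the canonical form, but different identifier choices.
style_three = (
    "def f(value):\n"
    "    if value > 0:\n"
    "        return value * 2\n"
    "    return 0\n"
)

# Spacing, tabs, redundant parentheses, and comments all disappear...
layout_erased = normalize_layout(style_one) == normalize_layout(style_two)
# ...but naming habits survive the round trip.
names_survive = normalize_layout(style_one) != normalize_layout(style_three)
```

That is why the list above also needs the variable-name lexicon and the refactoring step: a formatter alone only erases the layout layer.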

    • I can just imagine how unreadable such code would end up being, as any comments would look like they were written by some sort of AI tool.

    • "Hey, you notice some odd grammar, word choice, and spelling variance in this code?"
      "Oh yeah, must be Maxo-Texas. That's his anonymization software."

    • If you did this every time, you'd be identified as the guy who runs his code through Google Translate prior to release.

      Non-normal behavior is the most easy to single-out. In order to avoid detection, you basically have to become noise. And if you're the only one, then even that is a pattern.

      Sure, you could run some things through Google Translate and leave some things alone, but that'd be the equivalent of having two online personas.

  • by toonces33 ( 841696 ) on Wednesday January 28, 2015 @04:58PM (#48927755)

    and then use F2C to convert it to C code before I check in.. Try analyzing that!

  • Of course you could anonymize source code using an obfuscator.

    But maybe the simpler way is to compile Java to bytecode, then decompile it back to Java. I suspect that's as effective as most obfuscators.

  • Perhaps something like Artistic Style might help.

    http://astyle.sourceforge.net/ [sourceforge.net]

  • Someone just needs to write a tool that takes source code and translates it into an obfuscated form that only the CPU can understand. Is anyone working on this type of privacy tool?

  • by Kittenman ( 971447 ) on Wednesday January 28, 2015 @06:25PM (#48928349)
    Wouldn't any programmer worth their salt identify themselves in the comments, or (if not) be logged as the last guy in that code on such-and-such a date, while working on such-and-such a patch number? (E.g. 'kittenman was here, 1/Jan/15, fixing Steve's crap').

    But I hope my code is easily recognizable. I'm proud of it. It may not be the smartest, slickest, quickest there is, but it's mine. And it works.
    • by Shados ( 741919 )

      People still use these stupid 90s style comments with authors and dates and shit? Really?

      Just use the source control system for that.

  • I doubt it. Therefore, this is about as reliable as graphology (handwriting analysis).

    If you take two programmers who code to book standard, how do you tell the difference between them using the same strict problem?

  • You can have/use this idea for free:

    Before a system will build said code, have the build system verify the code not only by the public key/code hash, but as a secondary method - the code fingerprint of the author in question.

    This turns a creepy idea into something worthwhile.
