Anonymous No More: Your Coding Style Can Give You Away
itwbennett writes Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a "code stylometry" that uses natural language processing and machine learning to determine the authors of source code based on coding style. To test how well their code stylometry works, the researchers gathered publicly available data from Google's Code Jam, an annual programming competition that attracts a wide range of programmers, from students to professionals to hobbyists. Looking at data from 250 coders over multiple years, averaging 630 lines of code per author, their code stylometry achieved 95% accuracy in identifying the author of anonymous code (PDF). Using a dataset with fewer programmers (30) but more lines of code per person (1,900), the identification accuracy rate reached 97%.
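For readers who want a feel for what "identifying an author from coding style" can mean, here is a deliberately tiny sketch. It is not the researchers' method: it assumes only four hand-picked layout features and a nearest-neighbour comparison, and the sample snippets and the names `layout_features` and `closest_author` are invented for illustration.

```python
import math

def layout_features(code):
    """Return a small vector of layout features for a C-like source string."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    n = max(len(lines), 1)
    total_chars = max(sum(len(ln) for ln in lines), 1)
    return [
        sum(ln.strip() == "{" for ln in lines) / n,    # opening brace on its own line
        sum(ln.startswith("\t") for ln in lines) / n,  # tab-indented lines
        sum("//" in ln for ln in lines) / n,           # lines carrying a // comment
        sum(ln.count(" ") for ln in lines) / total_chars,  # density of spaces
    ]

def closest_author(samples, unknown):
    """samples maps author -> known code; return the author whose features are nearest."""
    target = layout_features(unknown)
    return min(samples, key=lambda a: math.dist(layout_features(samples[a]), target))

# Hypothetical samples, invented for illustration only.
known = {
    "alice": "if (x == 0)\n{\n    run(x);\n}\n",
    "bob": "if(x==0){\n\trun(x); // go\n}\n",
}
unknown = "while (y > 1)\n{\n    step(y);\n}\n"
print(closest_author(known, unknown))
```

With only two candidates and obvious layout differences this is trivial; the open question raised throughout the comments is how well any such signal holds up with more authors, enforced formatting, or shared code.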
Can they do it with corporate code? (Score:5, Interesting)
Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?
Re:Can they do it with corporate code? (Score:5, Funny)
It seems like using the applicable features of the corporate version control system would be a lot easier - and possibly even better than 95% accurate.
Re: Can they do it with corporate code? (Score:2, Funny)
Drats! I was sure that everyone else wrote stuff like "if(user == 'dumbfuck'){exit 666};"
Re: (Score:3)
That's what "git blame" is for...
Re: (Score:2)
I've always found that even with style guidelines in place, developers will still leave their fingerprints all over it.
Some devs will be verbose in their comments, some less. Some devs will embrace IoC where others shun it. Some devs will create a single method with all code in it, some will refactor the heck out of it with many methods. Heck, devs can't even agree sometimes on what should be public, protected, and private (and rarely will style guidelines dictate this kind of thing).
Re:Can they do it with corporate code? (Score:5, Interesting)
Perhaps not as well. If people are following the coding standards for the organization then the code for the most part looks far more similar.
When I am working with a development team, I tend to adjust my own style to better match what everyone else is doing, even if it means using coding methods that I would normally disagree with.
If the code tends to use a bunch of GOTOs instead of procedures or classes, I will use those GOTOs, not for my benefit but for the people who will maintain my code later on, so they won't have to change their mindset and debugging strategies to see what the program is doing when they make future corrections.
I will go full Object Oriented if the group of people that I am working with do their coding full OO.
My personal style is more procedural than OO, not due to lack of knowledge or not realizing OO's advantages and disadvantages. When I code on my own, I code in the way that my mind handles the requirements, and in the way I feel will make it easier for me to change and fix my code in the future.
I think this method is best for ID based on personal code, vs group corporate code, where a lot of your particular style is hidden.
Re: (Score:2)
Just curious: how are larger companies going with algorithm libraries and variable naming rules to ensure maximum reusability of code (variables named by function rather than by application)? Any change, or is most of it done from scratch? Any fancy algorithm databases with search functions based on algorithm descriptors and software engineering? Also things like software language translators, or the same algorithms stored in different languages? Any shift away from writing code to more assembling al
Re: (Score:2)
Similarly I was thinking this would probably be defeated by a "minifier", obfuscator, or anything along those lines. There are dozens to choose from for most languages and it would be trivial for anyone attempting to remain anonymous to use them on their releases.
If you want the code to remain usable, there are tools to enforce a standard style instead, in which case just set it up with rules based on a popular project if your language of choice doesn't have a specific style. At that point you're down to
Re: (Score:2)
Did you read the part in the article where they're actually doing the matching based on the ASTs (abstract syntax trees), and so are able to identify authors even after the code goes through an obfuscator? Relevant quotes:
Re: (Score:2)
Can they do it with corporate code where there are naming and style standards in abundance, and code reviews to ensure those guidelines are followed?
I was starting to wonder about that, then realized we at $BIGCORP are already generating ASTs from your input buffer, unifying those trees with a bunch of patterns, and telling your editor to flag questionable constructs. You type "if not foo in x" and 50ms later you get a proposed improved snippet. It's pretty rare to see quirky style in our codebase.
Re: (Score:2)
They are talking about the corporate code as a baseline to compare to the anonymous code.
Re: (Score:2)
If it doesn't, and you need this sort of analysis to determine who wrote a section of code, you're doing something wrong.
With pair programming, you may have two programmers sharing a keyboard, and alternating writing chunks of code.
I can usually look at a section of code, and reliably know which of my coworkers wrote it, even when they follow the style guidelines. Do they use an if-else chain, or a switch statement? Do they use #define's or prefer enums? Bitfields, or masks? Often I can tell who wrote it just by looking at the comments. Some people are neurotic about grammar and using complete sentences. Others prefer mi
Re: (Score:2)
I can tell who wrote it just by looking at the comments
Yeah, my first thought on this was "how accurate would it be if you a) stripped out comments, and b) ran through a code formatter (many code editors auto-formatting to a standard on the fly)"
I think including comments is basically cheating, as they're super distinguishable. You can tell what code I've worked on cause I consistently type "teh", spell words like "colour" with my local spelling, etc. But recognising just the actual code itself, that's more impressive.
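The question of what survives comment stripping and reformatting can be poked at directly. The sketch below uses Python's standard `ast` module, not the researchers' tooling (their data was Code Jam submissions, and their actual feature set is not reproduced here); `node_profile` is a made-up helper that just counts syntax-tree node types.

```python
import ast
from collections import Counter

def node_profile(source):
    """Count AST node types; comments and whitespace never reach the tree."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

# Same logic, once with comments and generous spacing, once squeezed.
commented = "# teh total\ntotal = 0\nfor n in nums:\n    total += n   # add up colour values\n"
squeezed = "total=0\nfor n in nums: total+=n\n"
# Same result, but structured differently by the author.
restructured = "total = sum(n for n in nums)\n"

print(node_profile(commented) == node_profile(squeezed))
print(node_profile(commented) == node_profile(restructured))
```

The first comparison comes out equal and the second does not: a formatter or comment stripper leaves the tree untouched, while a different structural choice (a loop versus a generator expression) shows up in it. That is only a hint at why syntax-level features might be harder to launder than layout.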
Re:Can they do it with corporate code? (Score:4, Insightful)
*raising hands slowly* Is there a problem, Coding Officer?
Re: (Score:2)
"legal" of course meaning adhering to rules written and ratified by a group of power and money grubbing politicians in the pockets of large corporations.
Re: (Score:3)
Re:Can they do it with corporate code? (Score:5, Informative)
You obviously haven't had to work in an environment where code has to be certified. I can tell you from first-hand experience that coding in an RTCA DO-178B environment or similar demands strict adherence to some very pedantic coding requirements. You'll find this type of development in avionics systems (both civilian and military) as well as other industries like medical electronics where code safety is literally life-and-death.
Outside of that type of environment, I do agree with you. You'd be lucky if even half of the developers have seen a company coding standard. You'd be hard pressed to find any developers who really adhere to it even when they know the document exists. But in those small niche markets, you'd be surprised at how strictly they adhere to arbitrary coding standards (whether they really impact code quality or safety or not).
Re: (Score:2)
It's not just these types of environments that are strict. Well established companies have the same practices, because the only way to have controlled growth is to adhere to a set of standards. Sure, standards change over time but not quickly. For posterity, controlled does not imply restricted.
Re: (Score:3)
A sonnet has strict rules, too.
But I'd wager that someone could tell one of Shakespeare's from one of yours.
Re: (Score:2)
RC doesn't pay me at all. I haven't worked there for over 15 years now.
Up next, automatic intelligence rating... (Score:5, Funny)
Re:Up next, automatic intelligence rating... (Score:5, Funny)
goto blah;
^^ Idiot.
goto blah;
^^ Code guru.
Re:Up next, automatic intelligence rating... (Score:5, Insightful)
For lack of mod points let me just say: beautiful!
It's like this in any engineering discipline:
* The apprentice doesn't do things by the book, for he thinks himself clever
* The journeyman does everything by the book, for he has learned the world of pain the book prevents
* The master goes beyond the book, for he understands why every rule is there and no longer needs the rules
Or put another way - the apprentice thinks he knows everything, the journeyman knows how little he knows, the master knows everything in the field, and still knows how little he knows.
Re: (Score:2)
It's like jazz. You have to know know rules before you can break them.
Re: (Score:2)
And, I accidentally repeated repeated a word.
Re: (Score:3)
The guru knows the novice knows more than the corporate enterprise architect, but won't let on lest the novice get a more-swelled head.
Re: (Score:2)
try {
    throw BlahException("blah");
} catch(Exception& blah) {
}
^^ Idiot.
Re: (Score:2)
if I were the programmer (I'm not, not since primary school when I programmed the TURTLE to draw stuff on large sheets of cartridge paper) I'd be dropping //remarks in everywhere. Back to when I did TURTLE programming, I got berated for wasting time on comments but when it came down to 1000+ lines of code, it was nice to know which draw routines drew what part of the image. My TURTLE St. Paul's Cathedral was 7,700+ lines of code, probably 3/4 of that was comments. If it were stripped of comments it'd probab
Re: (Score:2)
// exception was found
// beyond here be dragons, run
// make your escape now
goto blah;
^^ code master
Re: (Score:2)
This doesn't seem so far-fetched. I'm not sure the field of natural language processing is that far away from being able to create metrics which would determine the skill of a developer by looking at their code. It could then be used by employers during the hiring process and during reviews.
While that may sound like a nightmare scenario (and it very well could be), a more intelligent software system may even be able to show why it thinks the code is bad, and give an interviewer or reviewer the chance to ask w
Let's analyze the cyberspying code. (Score:2)
Using this technique, can they tell us if the NSA did write the Regin Malware [slashdot.org] now?
Re: (Score:2)
What about Bitcoin? (Score:5, Funny)
Can we use this to find Satoshi?
No Kidding (Score:5, Insightful)
Re: (Score:2)
i could do the same. not only that, but i could often also tell who had originally trained that person, because part of the trainer's style often leaked into their style.
i work at a university and we hire 100 level CS students. so we generally assumed they knew nothing and trained them from scratch.
Re: (Score:2)
OK, calming down now.. 1.. 2.. 3.. 4.. 5..
No, I'm OK, really.
I had an old boss who was a code style nazi. He was an asshole. And actually, my current boss is very cool, even if he codes like that.
Re: (Score:2)
So, you don't indent code? Or if you do, at what point is the indent meaningless (how many spaces/tabs) ... ? No spaces after semicolons? Or before/after braces? Or ...
Readability should count as meaningful. It helps. And the compiler strips it out anyways, right, so ultimately it doesn't matter, just like comments, except in helping understand the code.
I may be misunderstanding something completely in what you said... but I don't get why you would say it should be removed. Maybe in javascript for net
Re: (Score:2)
Code "feels" smaller when it's compact. Also, having a single spacing method uniform across everyone makes for easier cut-and-paste sharing. Having one person space things differently than another will result in decreased readability.
Re: (Score:2)
I once worked on a project that had a handful of developers, where each developer was in charge of the code for one of the software subsystems of the project. We didn't have much of a coding standard there - only about one page - but we ended up with a consensus coding style in the project that everybody could live with. Even so, you could always tell who wrote what by the personality shown around the edges of the coding style of a given module, function, or even over just a few lines.
Re: (Score:2)
Re: (Score:3)
coding to book (sans comments) will kill the process of identifying authors stone dead, I think. If everybody's "Hello World!" was identical, how do you tell the difference?
Re: (Score:2)
Re: (Score:3)
Style guidelines should be about avoiding pitfalls of the language, using appropriate idioms, and not making life miserable for maintainers, not about where you put spaces and braces.
Re: (Score:2)
Actually, the one I hate is:
if ($variable == false) {
    doSomethingInteresting($variable);
}
and one of my co-workers does:
if ($variable == false)
{
    doSomethingInteresting($variable);
}
Of course, my code is beautiful and everyone else's is terse and ugly and everyone should write code the same way that I do. Try suggesting that to a group of programmers and see how far it gets you. Generally, it's not worth the argument--you w
Re: (Score:2)
As the thread suggests, one advantage to different coding styles is that you can generally tell who wrote what and, if there seems to be a bug, you can track them down and tell them to fix it in that ugly mess. In our office, we have the rule that if you go around changing code style, you now own that code and are responsible for it. About the only issue we've run into is that people's styles evolve over time. So the guy right out of school may have a certain style that changes as he is exposed to more styles.
git/cvs/svn/mercurial blame can tell you who wrote whatever code. Please tell me you are using some kind of source repository.......
Re: (Score:2)
Use a diff tool that can ignore formatting changes. I'm a fan of Beyond Compare [scootersoftware.com], but there are plenty of others.
That explains it (Score:2)
I suppose all those "// damn U bill gates!" comments gave me away
Welcome to the party (Score:3)
Re:Welcome to the party (Score:5, Insightful)
It's all about style. Writing software is very creative and it needs to have the author's fingerprints on it somewhere. If corporations don't like that they can suck the source code into a parser and spit out perfectly mundane crap that loses the intonation and the thoughts the original developer had for it.
John Varley Press Enter (Score:4, Informative)
1985 Hugo Winner
Really, the fact that coding style is recognizable was so well known it made it into pop culture 30 years ago.
Also, on the smaller sample size the program might just be recognizing the parts of the style that come from the corporate standards. It would be interesting to see if it could recognize code from people who all work at the same company.
Vernor Vinge probably beat him to it (Score:2)
But I can't recall an instance.
Re: (Score:2)
Vinge is considered one of the fathers of cyberpunk because of his "True Names", which did precede Varley's chilling (and Hugo-winning) "Press Enter[]" (1981 vs 1985).
On the other hand, Varley's much earlier (1976) "Overdrawn at the Memory Bank" was also one of the seminal works of the field.
Been a while since I've read it, but the warlocks (hackers) in "True Names" would never have let their identity (true name) be determined from their coding styles.
Source of Future Data (Score:2)
I guess we can expect that source code repositories will be scanned and processed. And, for code written by multiple authors, the modified code (from commits) will be scanned and indexed as well.
But, I bet they will never figure out who writes the malware recently attributed to the three letter agencies. They should, however, be able to figure out which agency writes the stuff if they get a copy of the source code or maybe even from decompiling the binary.
Additionally, if written from .NET, the CLR code c
Re: (Score:2)
Back in the days of .NET 1~2, decompiling via Reflector or whatever other tool got you back pretty good stuff. Today, there's a LOT more sugar, from LINQ to async/await and everything in between. If you go back to the original language, good decompilers sometimes infer what the original sugar was from the output following certain conventions and patterns...but moving that to another language will give you unreadable garbage.
Reading F# in C# , this>but,worse>
Re: (Score:2)
Bah, formatter messed things up. The last line was me joking about the crazy nested generic chains that F# types end up looking like in a language that doesn't support the same syntax sugar.
The key to this system being used is, ...... (Score:2)
"The key to this system being used is, of course, first obtaining the code stylometries for a wide range of developers. The authors didn't address how, say, a database of programmers’ styles would be compiled. Also, to identify the author of a piece code would require access to the source code, and not just executables, though the authors mention there is some evidence that style is preserved in binaries."
-> so once you post to github and similar 'they' can link every code you ever write to you,...
Re: (Score:2)
are the podcasts/videocasts out for that yet?
Bad Coders Can't Be Identified (Score:4, Interesting)
So if you are a paranoid freak, the best way to ensure your safety and keep the government off your back is to write terrible code.
Re: (Score:2)
Oblig XKCD (Score:3)
Most programming isn't new code (Score:4, Insightful)
Most programming isn't writing new code. Most programming is working on someone else's crap you inherited. Invariably, you're going to be using that person's style or else the result will look like garbage.
There is also the problem that most non-trivial code is worked on by multiple people at the same time.
Writing some code from scratch as an assignment is a very artificial exercise nowadays, unless you're in a classroom setting. Therefore, you're going to get a signature from a programmer doing atypical work.
What complete and utter bullshit. (Score:3)
95% of 250 coders. That means that out of a million programmers they will misidentify 200000.
I suspect that there are too few variances in style to make any coder's style unique. For example, whether to use braces on a one-line statement after an if in C.
With a few programmers it's likely to work, but when the possible source of programmers is the world...
Not to mention emacs, Visual Studio and such enforcing some indentation standards and programming languages enforcing others.
Re: (Score:2)
Okay, I just woke up from a nap, but could you show your math there? Maybe I'm missing something because I come up with.. 50k, not 200k...
Re: (Score:2)
What complete and utter bullshit.
95% of 250 coders. That means that out of a million programmers they will misidentify 200000.
You know it's not a contest to come up with the worst bullshit. If you're left with one person 95% of the time when you have 249 possible wrong answers, it's like being left with 4000 people when you have 999999 wrong answers. If all those are too close to tell apart you'll misidentify >99.9%.
Imagine for example that you wanted to find people by height and weight, as measured to nearest cm and kilo. It might work decently on a small group, but if you scale it up to a million people there'll be a lot of d
Re: (Score:2)
It's 50,000.
Or for the study, the 12 people who code exclusively in assembly.
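The numbers being argued over in this sub-thread are quick to check. The sketch below assumes, as the original poster did, that the 5% error rate carries over unchanged to a pool of one million, which the replies dispute. The last figure is only one possible reading of where the "4000" above comes from (the candidate pool growing 4000-fold).

```python
pool = 1_000_000   # hypothetical population of programmers
study_size = 250   # coders in the study
error_pct = 5      # 100% minus the reported 95% accuracy

# Integer arithmetic avoids floating-point rounding surprises.
misidentified_in_pool = pool * error_pct // 100         # 5% of a million
misidentified_in_study = study_size * error_pct // 100  # 5% of 250, rounded down
pool_growth = pool // study_size                        # how many times larger the pool is

print(misidentified_in_pool, misidentified_in_study, pool_growth)
```

So 5% of a million is 50,000 rather than 200,000, and 5% of the 250 study participants rounds down to the 12 people mentioned above. Whether accuracy would really stay at 95% with a million candidates is a separate question that this arithmetic cannot answer.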
So you could use this tool to make your code anon. (Score:4, Interesting)
Write a version of pretty-printer that rerenders your code into a different style.
Have a lexicon of mipelled words for each "personality".
Another lexicon of variable names.
a vs inta vs int_a vs x.
Refactoring and unfactoring for subroutines.
Run the comments through google translate and back to english.
ukrainian
japanese
chinese
Synonym and antonym substitution in the comments.
The mind dances at the possibilities to mess with this algorithm.
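A rough sketch of the first few items on that list (re-rendering in one style, swapping variable names, dropping comments), written against Python source because the standard `ast` module makes it short. It assumes Python 3.9+ for `ast.unparse`; `normalize` is a made-up helper, and it has known gaps: keyword arguments at call sites, attribute names, function names and imports are not renamed, and the generated `v0`, `v1`, ... names could collide with existing ones.

```python
import ast

def normalize(source):
    """Re-render source in one canonical layout with neutral variable names."""
    tree = ast.parse(source)  # comments and original whitespace are dropped here

    # Pass 1: collect names bound as parameters or assignment targets.
    bound = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in bound:
            bound.append(name)
    mapping = {old: f"v{i}" for i, old in enumerate(bound)}

    # Pass 2: rewrite every occurrence of those names.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree)

# Two hypothetical "personalities" writing the same function.
verbose = "def add(first, second):\n    # sum them\n    total = first+second\n    return total\n"
terse = "def add( x,y ):\n    result = x + y   # add\n    return result\n"

print(normalize(verbose))
print(normalize(verbose) == normalize(terse))
```

Both inputs collapse to the same text, which covers layout, naming and comments. It does nothing about structural choices (loops versus comprehensions, how code is split into functions), which is where the "refactoring and unfactoring" item on the list would have to come in.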
Re: (Score:2)
I can just imagine how unreadable such code would end up being, as any comments would look like they were written by some sort of AI tool.
Re: (Score:2)
"Hey, you notice some odd grammar, word choice, and spelling variance in this code?"
"Oh yeah, must be Maxo-Texas. That's his anonymization software."
Re: (Score:2)
If you did this every time, you'd be identified as the guy who runs his code through Google Translate prior to release.
Non-normal behavior is the easiest to single out. In order to avoid detection, you basically have to become noise. And if you're the only one, then even that is a pattern.
Sure, you could run some things through Google Translate and leave some things alone, but that'd be the equivalent of having two online personas.
Hah. I write everything in Fortran.. (Score:3)
and then use F2C to convert it to C code before I check in.. Try analyzing that!
Re: (Score:2)
That's one way to make your ForTran run slower
Obfuscator? Or just translate A-B-A? (Score:2)
Of course you could anonymize source code using an obfuscator.
But maybe the simpler way is to compile Java to bytecode, then decompile it back to Java. I suspect that's as effective as most obfuscators.
Code beautifier (Score:2)
Perhaps something like Artistic Style might help.
http://astyle.sourceforge.net/ [sourceforge.net]
Easy Solution (Score:2)
Someone just needs to write a tool that takes source code and translates it into an obfuscated form that only the CPU can understand. Is anyone working on this type of privacy tool?
Pointless, but no doubt true (Score:3)
But I hope my code is easily recognizable. I'm proud of it. It may not be the smartest, slickest, quickest there is, but it's mine. And it works.
Re: (Score:2)
People still use these stupid 90s style comments with authors and dates and shit? Really?
Just use the source control system for that.
will they show the method? (Score:2)
I doubt it. Therefore, this is about as reliable as graphology (handwriting analysis).
If you take two programmers who code to book standard, how do you tell the difference between them using the same strict problem?
Here's a great idea... (Score:2)
You can have/use this idea for free:
Before a system will build said code, have the build system verify the code not only by the public key/code hash, but as a secondary method - the code fingerprint of the author in question.
This turns a creepy idea into something worthwhile.
Re: (Score:3)
Sounds like the solution is to use an entirely different language than the bulk of one's work is in, if one wants to anonymously write malicious or otherwise legally complicated code.
Re: (Score:2)
That kind of depends on the stylesheets, pre-compiler style enforcement routines, and the fact that a shit-ton of corporate code is often improved incrementally by multiple authors.
'course, there's still the comments that you could use, but who does that?
Re: (Score:2, Funny)
Re: (Score:2)
Re:Demonstrates the need... (Score:5, Insightful)
This is why people need to follow style guides, so that all source code is styled the same.
There's a damn good chance 95% of coders are not criminals, nor would they care if someone identified their code.
That said, this will become a legal nightmare when this kind of profiling can be used to frame another coder.
And with the laws wanting to treat any "hacker" as a potential terrorist these days, the consequences of even being accused can be rather severe to deal with.
Re:Demonstrates the need... (Score:5, Insightful)
You want scary? The same can be applied to general text on the Internet, tying posters on different sites together, including tying an anonymous avatar (not your real name) to a site with your real name.
Which the NSA probably has churning away on its databases. Which probably does little more than add confirmation of said links from watching and recording all traffic to any and all of a billion IP addresses.
And I, for one, welcome our new panopticon overlords who won't abuse it, not one of their thousand agents, because they're supposed to check a got-a-warrant box on a piece of paper before choosing to abuse it.
Re: (Score:2)
This is why people need to follow style guides, so that all source code is styled the same.
Why does all code need to be styled the same?
I can see a need in a safety critical environment like avionics or medical devices that needs strict adherence to rules to ensure that the code has been written correctly and with as few bugs as possible. But what difference does it make outside of that kind of environment? I mean, so what if there's a thousand different coding standards in the Chrome source? What difference does it really make?
harder to read if there is no consistency (Score:2)
Generally speaking each project has a coding style that most code in the project adheres to, for the simple reason that it's easier to maintain when the code all looks more-or-less similar.
If one area uses lowercase with underscores, and the other area uses CamelCase, and one area typedefs the heck out of everything while the other is explicit, then for someone coming in and trying to understand the code it makes it harder than necessary to figure out what's going on.
So if you look at the linux kernel, or g
Re: (Score:2)
Coding standard adoption can provoke holy wars but at the end of the day, you're a team. Though idiosyncratic decisions irk me, such as prefixing instance variables with underscore. Any decent editor will make such a distinction between scope via colours.
Pretty printing tools and style checkers present in any decent editor will enforce coding standards with minimal fuss.
Re: (Score:2)
Even when following a coding style guide 100%, there is still generally enough leeway to allow for plenty of personal style. There's the words you use to name things, use of whitespace and grouping of statements, basically everything about a piece of source code that's lost if you compile and then decompile a program. Just like the prose from two different authors are distinct from one another, even if they go through the same copy editor to fit a publisher's style guide. And if your corporate style guide req
Re: (Score:3, Funny)
So, what's it like to work for FaceBook?
Re: (Score:2)
Newfags can't triforce
Slashdot supports too few entities to do this right, and forget about UTF8. But you can get sorta close.
*
* *
Unless someone can do better?
Re: (Score:2)
I once marked CS homework and uncovered cheating for an 'individual' assignment.
A group of students had debug comments in their code - the giveaway? spelling mistakes.
Re: (Score:2)
there's a wiki site (can't remember the name) that takes great joy in posting accusations without attribution or evidence, and when called on them the Admins sit there and claim that the person who posted the slander is now the same person trying to get a retraction based on some sort of magic ring with a seekrit style decoder. Even when called out to post the evidence they claim to hold, they just dive straight in to claiming knowledge they can't possibly have for various reasons not least of which said cl