from the leaving-your-mark dept.
itwbennett writes Researchers from Drexel University, the University of Maryland, the University of Goettingen, and Princeton have developed a "code stylometry" that uses natural language processing and machine learning to determine the authors of source code based on coding style. To test how well their code stylometry works, the researchers gathered publicly available data from Google's Code Jam, an annual programming competition that attracts a wide range of programmers, from students to professionals to hobbyists. Looking at data from 250 coders over multiple years, averaging 630 lines of code per author their code stylometry achieved 95% accuracy in identifying the author of anonymous code (PDF). Using a dataset with fewer programmers (30) but more lines of code per person (1,900), the identification accuracy rate reached 97%.
[Crash programs] fail because they are based on the theory that, with nine
women pregnant, you can get a baby a month.
-- Wernher von Braun