Does GitHub Copilot Improve Code Quality? (github.blog) 75
Microsoft-owned GitHub published a blog post asking "Does GitHub Copilot improve code quality? Here's what the data says."
Its first paragraph includes statistics from past studies — that GitHub Copilot has helped developers code up to 55% faster, leaving 88% of developers feeling more "in the flow" and 85% feeling more confident in their code.
But does it improve code quality? [W]e recruited 202 [Python] developers with at least five years of experience. Half were randomly assigned GitHub Copilot access and the other half were instructed not to use any AI tools... We then evaluated the code with unit tests and with an expert review conducted by developers.
Our findings overall show that code authored with GitHub Copilot has increased functionality and improved readability, is of better quality, and receives higher approval rates... Developers with GitHub Copilot access had a 56% greater likelihood of passing all 10 unit tests in the study, indicating that GitHub Copilot helps developers write more functional code by a wide margin. In blind reviews, code written with GitHub Copilot had significantly fewer code readability errors, allowing developers to write 13.6% more lines of code, on average, without encountering readability problems. Readability improved by 3.62%, reliability by 2.94%, maintainability by 2.47%, and conciseness by 4.16%. All numbers were statistically significant... Developers were 5% more likely to approve code written with GitHub Copilot, meaning that such code is ready to be merged sooner, speeding up the time to fix bugs or deploy new features.
"While GitHub's reports have been positive, a few others haven't," reports Visual Studio magazine: For example, a recent study from Uplevel Data Labs said, "Developers with Copilot access saw a significantly higher bug rate while their issue throughput remained consistent."
And earlier this year a "Coding on Copilot" whitepaper from GitClear said, "We find disconcerting trends for maintainability. Code churn — the percentage of lines that are reverted or updated less than two weeks after being authored — is projected to double in 2024 compared to its 2021, pre-AI baseline. We further find that the percentage of 'added code' and 'copy/pasted code' is increasing in proportion to 'updated,' 'deleted,' and 'moved' code. In this regard, AI-generated code resembles an itinerant contributor, prone to violate the DRY-ness [don't repeat yourself] of the repos visited."
No. (Score:5, Funny)
/betteridge
It's also... (Score:3)
Anecdotal evidence (Score:3)
Every account of using it I've read online has been negative about code quality.
Re: Anecdotal evidence (Score:3)
Is that different to "security through obscurity"?
Re: Anecdotal evidence (Score:4, Informative)
Of course, the Java solution is to just make every error condition an exception and shut down the whole thing.
Re: (Score:2)
Ironically, one of the downsides of modular coding paradigms is that it can be very difficult to decide, from inside a function or object, where input data originally comes from and where it's going. This is another reason real code is often bloated and messy. We certainly need better computer languages for the 21st centu
Re: Anecdotal evidence (Score:4, Insightful)
Who here remembers Perl's ideas about tainted variables [wikipedia.org]?
I do and I continue to teach it in my software security classes. Data-paths are really critical for software security.
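For readers who have not met the idea, here is a minimal sketch of taint tracking, loosely modelled on Perl's taint mode, where outside data stays marked until it has been checked against a pattern. The `Tainted` class and the `build_greeting` sink are hypothetical names made up for illustration; this is not how Perl implements it.

```python
import re


class Tainted:
    """Wraps a value that came from outside the program (hypothetical helper)."""

    def __init__(self, value: str) -> None:
        self._value = value

    def untaint(self, pattern: str) -> str:
        """Return the raw string only if the whole value matches `pattern`."""
        if re.fullmatch(pattern, self._value) is None:
            raise ValueError("input failed validation and stays tainted")
        return self._value


def build_greeting(name: str) -> str:
    """A 'sink' that refuses data which has not been validated."""
    if isinstance(name, Tainted):
        raise TypeError("tainted data reached a sink without validation")
    return f"Hello, {name}!"


user_input = Tainted("alice")  # pretend this came from a form field
safe_name = user_input.untaint(r"[a-z]{1,32}")
print(build_greeting(safe_name))  # prints "Hello, alice!"
```

The point of the wrapper is that the data path is visible in the types: the only way from `Tainted` to a plain string goes through a validation step.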
My take on "AI" coding assistants is negative. And my largest criticism is strategic: People using crutches will never learn how to walk without them. So, yes, some not very significant "productivity" gains may be there in the code generation step if you are really bad at it. But in that case, you should use "AI" tools even less because if you lean on them you will never get better. Obviously, the other criticisms like code churn, more bugs, etc. are valid too.
One particularly troubling thing I have seen in exercises and exams where students were allowed to use "AI" was that "AI" simply overlooks border conditions and parts of the spec that are not quite standard. For example, I had a well-known algorithm (that the students did not know) to be implemented, but I had very explicitly a different order of some steps, which made sense in the given context. About 5% of the students got that right. The others just took the "AI" answer and got it wrong. I have found a similar thing in my own experiments, and this exam question was kind of a trap. Which worked a lot better than I expected.
The problem, of course, is that LLM-type AI has no understanding and no fact-checking ability. It essentially craps out the "solution" that matches the question best statistically. If there is one sentence in there (or in the case of that exam-task, a numbered list of three) that does not fit what it saw in training, it simply ignores (!) that part. Now, in the software security space, understanding is critical. For example, as soon as you write input validation for non-trivial things, you are going to fail if you use "AI" for that. And incomplete input-validation is what attackers or sometimes other problems (remember CrowdStrike?) are going to walk in through.
Hence one catastrophic scenario I expect is that "AI" will start to, or may well already have started to, recommend "insecurity patterns" to users that look good and adequate. These will then make it into many products, maybe even cross-platform and cross-language. And then the attackers will need massively less effort to attack different products. On top of that, pattern seeding with insecurity patterns may also be or become an attack vector.
The whole thing is a clusterfuck with retarding (literally) developers, decreasing the variability of code, overlooking parts of the spec and generally and subtly (or not subtly) decreasing code quality.
Obviously, the metrics used to "prove" that "AI" increases developer productivity are completely bogus, because "lines of code written per time" is really bullshit. Writing the code is a minor part of project time. The major part is maintaining the code. A number I remember from 35 years back when I studied software engineering was 20% coding, 60% maintenance. Hence if you code 5% faster, but maintenance cost goes 2% up, you have a net loss. I expect we will see a lot of that happening.
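The arithmetic behind that last claim can be checked in a few lines. This is a sketch under stated assumptions: "5% faster" is read as coding effort dropping by 5%, "2% up" as maintenance effort rising by 2%, and the remaining 20% of project time (not named in the comment) is taken as unchanged.

```python
from fractions import Fraction

# Shares of total project effort, as remembered in the comment above.
coding_share = Fraction(20, 100)
maintenance_share = Fraction(60, 100)

# Assumption: "5% faster" means coding effort shrinks by 5%.
coding_saving = coding_share * Fraction(5, 100)

# Assumption: maintenance effort grows by 2%.
maintenance_increase = maintenance_share * Fraction(2, 100)

# Positive means the project as a whole got more expensive.
net_change = maintenance_increase - coding_saving

print(f"saved on coding:      {float(coding_saving) * 100:.2f}% of total effort")
print(f"added to maintenance: {float(maintenance_increase) * 100:.2f}% of total effort")
print(f"net change:           {float(net_change) * 100:+.2f}% of total effort")
```

With these shares, the coding saving is 1% of total effort and the maintenance increase is 1.2%, so the project ends up 0.2% more expensive overall.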
AI versus cut and paste from Stackoverflow / docs (Score:2)
Skeptical take:
Read many different vendors' API documentation and there is little or no error handling in the example code.
Most Stackoverflow and blog entries are the same: no error handling.
And this is not an effective way to handle errors:
try
{
(big block of code with multiple operations, multiple business logic items)
return some not null value
}
catch (exception e)
{
return null
}
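As a contrast, here is a small Python sketch of the same situation (the pseudocode above is Java/C#-style; Python is used here only for illustration). The function names and the JSON "port" config are hypothetical. The first function mirrors the catch-everything-and-return-null pattern; the second handles each expected failure separately and reports what went wrong.

```python
import json


def load_port_swallowing(text: str):
    """The pattern criticised above: any failure at all becomes None."""
    try:
        config = json.loads(text)
        return int(config["port"])
    except Exception:
        return None  # the caller cannot tell bad JSON from a missing key


def load_port(text: str) -> int:
    """Handle each expected failure on its own and say what went wrong."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"config is not valid JSON: {err}") from err

    if not isinstance(config, dict) or "port" not in config:
        raise ValueError("config has no 'port' entry")

    try:
        return int(config["port"])
    except (TypeError, ValueError) as err:
        raise ValueError(f"'port' is not an integer: {config['port']!r}") from err


print(load_port('{"port": "8080"}'))  # prints 8080
```

Whether to raise, log, or return a result object is a design choice; the sketch only shows that narrowing each `try` to one operation keeps the cause of a failure visible.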
Re: (Score:2)
Re: Anecdotal evidence (Score:2)
Re: (Score:2)
Re: (Score:1)
Of course, the Java solution is to just make every error condition an exception and shut down the whole thing.
...
For every exception, there usually is a catch
"Shut downs" only happen in C++, if an exception is popping out of a function which did not declare that exception in the "throws" clause.
You are mixing up Java with C#
In Java, the compiler makes sure you handle all "checked exceptions"; some idiots do an empty catch block, though.
C# has no checked exceptions ... only unchecked. So worst case they fl
Re: (Score:2)
My rule of thumb is: if it's readable and clear to a random stranger, then it's not security hardened.
Wouldn't this defeat the entire premise of security when it comes to open-source code? If the code is available for anyone to read and no one can understand it, it's not possible for others to scrutinize it, and if they can read it then by your rule it must not be very secure.
Re: (Score:3)
Readability is what you want when reviewing code from an interview candidate offering a solution to a coding problem, or when reviewing code from a new hire to see if he fucked up on his first month on the job.
Real and mature code is messy because the real world is full of special cases and redesigns and imp
Re: (Score:2)
To paraphrase Einstein, code should be as simple as possible, but no simpler than that.
And that is just it. Unless the code is only doing really trivial things, "as simple as possible" is not going to be very simple.
Re: (Score:1)
You have strange ideas about code.
If you are in one of my teams: you quickly learn to write "unmessy" code, or get assigned the most boring work you can imagine.
Real code is not messy. It is as readable as any code. If you do not learn how to write unmessy code, you can completely forget to get promoted inside of the organization.
Re: (Score:2)
Re: (Score:3)
That would be really bad. Because Stackoverflow usually includes a discussion of alternatives and advantages and drawbacks. This a) serves so that a competent (!) coder can understand the problem better and make an adequate selection of a solution and b) does contribute to developer education and experience. Yes, it takes more time, but that time is well-spent.
Re: (Score:2)
Re: (Score:1)
It works when the LSP is too slow or does not load properly, but there is no way I will trust copilot to fill out a code block for me.
The irony (I suppose that is the correct word, though it might be coincidence) is that the suggestions it uses are my own code, so I guess I should feel flattered in those instances.
Philip Morris says cigarettes don't cause cancer (Score:5, Insightful)
No conflict of interest at Github/Micro$oft either. ;-)
Re: (Score:2)
I'm sure Microsoft is being 1000% ethical and if there was any evidence that AI actually makes code worse they would definitely let Github publish stuff about that despite Microsoft investing $100 billion or more in AI and AI related stuff.
Re: (Score:2)
Indeed. Obviously Microsoft would fall on their sword to protect us all and make the world a better place! Right? Right?
Man, I really hope I am retired when all this AI crap has to be ripped out everywhere...
Re: (Score:2)
Nope, never. Their marketing has never lied before, so why would it lie now. After all, everyone knows if you train AI on average crap on the internet that you get diamonds as a result.
Re: (Score:2)
You can turn crap into diamonds. Just takes a lot of heat and pressure. Marketing can also do it, using a similar approach.
AI is trained on people's mistakes (Score:3)
Re: AI is trained on people's mistakes (Score:2)
& some languages have more gotchas..& some languages are more common for beginners, who make more mistakes.
Re: (Score:1)
One thing that gives me the greatest cause for concern is that internet tech is changing continuously, yet the AIs seem either to consider the version of each piece of software immaterial or, where they are version-aware, not to realize because of their knowledge cut-off that the recommendations are no longer appropriate for the version being used.
Re: AI is trained on people's mistakes (Score:2)
Yeah, good point.
Re: (Score:2)
Exactly. This will also likely lead to common security mistakes becoming more prevalent, decreasing attacker effort. And as a bonus on top, it will be really hard to prevent an LLM from continuing to recommend some crap code once it is known it is crap.
Yes and no (Score:5, Insightful)
Based on absolutely no studies or anything but my own opinion... I suspect AI will make good coders better and bad coders worse. Good coders will consider the suggestions, take the good ones and reject the bad ones. Bad coders will take everything.
Re: (Score:3)
I would agree with that. It's anecdotal but I've noticed when using Copilot at my job that it usually gets me a "mostly" proper solution. But even getting you mostly to a solution can save you an hour or more of digging through documentation. "Hey Copilot, I have an Excel workbook in a memory stream. Load it up with the Open XML library, open up the Summary spreadsheet, and copy out the contents of cell D:3." AI bots are pretty good at crawling through lots of information and summarizing it; I've been
Re: (Score:2)
Probably, although I am doubtful on the impact on good coders. Since most coders are crap (just look at the flood of security vulnerabilities we see every day), that part of the impact will dominate anyways.
Re: (Score:2)
I investigated: three answers so far (Score:3)
The best idea is Advait Sarkar's. He noticed how bad LLMs are at anything creative, and instead suggested we use them for things they're good at, predicting what humans would say. Especially if they were asked what a critic would say. See https://leaflessca.wordpress.c... [wordpress.com]
Trying Pull Requests with CodeRabbit. One of the things I think LLMs can do well is compare my text with a whole body of other people's work. In that vein, CodeRabbit now offers to review git pull requests. https://leaflessca.wordpress.c... [wordpress.com]
In the search for true artificial intelligence, large language models are a horrible failure which look like a success. https://leaflessca.wordpress.c... [wordpress.com]
Re: (Score:2)
Re: (Score:2)
True. Impressive toys. The "somebody else pays for it" part will not keep though.
Re: (Score:2)
In the search for true artificial intelligence, large language models are a horrible failure which look like a success. https://leaflessca.wordpress.c... [wordpress.com]
You need to have some actual insight to see that though. One thing we are finding out with the current AI craze is how many people actually lack natural insight and typically do not use whatever general intelligence they may actually have available. If you yourself are dumb that way, AI may look like something that can perform on your level or better. That this level can be and often is really bad gets overlooked.
Confidence (Score:5, Insightful)
"85% feeling more confident in their code."
Imagine having such low confidence in the quality of your own code that you feel an LLM is doing you better.
Re: (Score:2)
Oooo. Burn.
Re: (Score:2)
Indeed. Imagine being this bad at your job. And then ask why that is and does not seem to change. Obviously, incompetent coders (the vast majority) always look for some magic language or tool or approach that makes their code not suck. Obviously that does not work and cannot work because the tooling and the processes are not the problem.
Re: (Score:2)
Compilers are also tools that improve code quality.
Personally I don't use AI yet for coding. I have tried it, but it is like asking a junior developer for advice. It can do some small things, but for the most part, it is just faster to do it myself than explain how to do it.
The area where I would like to see AI usage increase is testing. I think testing is much better suited for AI than programming, because testing does not require anything except trying all sorts of things, and it doesn't really m
Re: (Score:2)
What about when in testing the "AI" claims a test was successful when it was not, either by test design or by misinterpretation of results? I think mistakes matter a lot in testing.
CoPilot in Python is excellent (Score:1)
The success you'll have with copilot will depend on the language used. Python is the language for which CoPilot generates the most useful code, compared with Java, Angular and C#.
Re: CoPilot in Python is excellent (Score:2)
Interesting. Some languages have more "gotchas", so there are surely more examples out there of code that falls into those traps, and so the AI surely uses those in its answers too. Languages with fewer "gotchas" result in better AI code...
No?
CoPilot can't even declare a Java String correctly (Score:2)
The success you'll have with copilot will depend on the language used. Python is the language for which CoPilot generates the most useful code, compared with Java, Angular and C#.
Hmm, I tried it 2 weeks ago with a question of "for /aaa/bbb/.../xxx[12334]yyy write me a RegEx in Java that replaces 1234 with abcd" (roughly...can't share the details).
1. The RegEx was declared on a Java String with newlines, so it didn't even compile.
2. The RegEx was wrong. It didn't work if I had fixed it for them...it just completely fucked up the RegEx.
3. The RegEx they tried to do was about 10x more complicated than it needed to be
4. The Java API they used was really outdated
5. Their general
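For scale, a replacement of this kind can be quite short. The sketch below is in Python rather than the Java the question asked for, and the path is a hypothetical stand-in: the real path and details were not shared, so the assumption here is simply that a run of digits between `xxx[` and `]yyy` gets replaced with `abcd`.

```python
import re

# Hypothetical stand-in for the path in the question; the real one was not shared.
path = "/aaa/bbb/ccc/xxx[1234]yyy"

# Match only the digits that sit between "xxx[" and "]yyy".
# The lookbehind and lookahead check the surrounding text without consuming it.
pattern = re.compile(r"(?<=xxx\[)\d+(?=\]yyy)")

result = pattern.sub("abcd", path)
print(result)  # prints "/aaa/bbb/ccc/xxx[abcd]yyy"
```

Strings that do not contain the `xxx[digits]yyy` shape pass through unchanged, which makes the pattern easy to verify against a handful of sample paths.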
Re: (Score:2)
Yeah, they don't work for everything. Regexes seem to be a pretty big weak point in particular. After using them for a while you get a feel for their strengths and weaknesses.
Re: (Score:2)
Yep, pretty much. The thing is, generating a RegEx requires insight. Obviously an LLM can only give you a RegEx it has seen before or incompetently try to combine some. That will not work. And your example was _really_ simple.
Re: (Score:2)
On the other hand, I think it would be possible to make an AI that is specifically educated to create regexes, because it is quite easy to verify and score the results. And I think this would also be a pretty good idea, as at least I personally quite often need to find and replace something in hundreds of files. If I could get a regex for that in a few seconds, by just asking a question, it would be nice.
Re: (Score:2)
RegEx generators and tools that help you design RegExes exist, if you really need them. And, unlike AI, they do not hallucinate.
\o/ (Score:1)
This whole thing seems super-self-serving and suspect so in that setting:
I'm no expert but isn't the idea that you tweak your code until 100% pass is reached and don't stop until then - regardless whether microsoft are watching everything you type? ProTip: Yes.
Re: (Score:2)
Re: (Score:1)
Or a time-limit; both of which invalidate the whole exercise.
Re: (Score:2)
Re: (Score:1)
If tests define expected behaviour and the tests are secret, the developers are aiming at different targets - not a solid foundation for comparison of output.
Re: (Score:2)
When I review code, I look for errors. Having been a programmer for a few decades, I have a decent understanding of where bugs usually hide. E.g. static variables in a Java class that is supposed to be thread safe. Those can cause pretty horrible bugs that are nearly impossible to reproduce and usually never found in unit tests. Also anything that opens a connection or similar, I check if it is closed. Those are also very rarely caught by tests.
If we talk about pretty code, I look for unnecessary casts. Not b
Re: (Score:1)
E.g. static variables in Java class that is supposed to be thread safe.
A class that is thread safe? In what regard?
Or a static variable that is thread safe?
Do you have an example for a problem?
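One possible illustration, sketched in Python rather than Java: a class-level attribute plays roughly the role of a Java static field here, since every instance and every thread shares it. The `HitCounter` class and `run_workers` helper are hypothetical. The unguarded `+=` is a read-modify-write that can lose updates when threads interleave (whether it actually does depends on the interpreter and on timing, which is why such bugs are hard to reproduce); the locked version cannot.

```python
import threading


class HitCounter:
    total = 0                  # class-level state, shared like a Java static field
    _lock = threading.Lock()   # guards every update of `total`

    @classmethod
    def add_unsafe(cls, n: int) -> None:
        # Read, add, write back: another thread may update in between.
        cls.total += n

    @classmethod
    def add_safe(cls, n: int) -> None:
        with cls._lock:
            cls.total += n


def run_workers(workers: int, increments: int) -> int:
    """Start `workers` threads that each call add_safe `increments` times."""
    HitCounter.total = 0

    def work() -> None:
        for _ in range(increments):
            HitCounter.add_safe(1)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return HitCounter.total


print(run_workers(8, 10_000))  # prints 80000
```

A unit test that calls `add_unsafe` from a single thread always passes, which matches the remark above that tests rarely catch this class of bug.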
Every junior is leaning on AI... hard (Score:4, Interesting)
It's literally impossible to tell the level developers are at with AI. Juniors are abusing it so hard (and hiding it) that they do the most inscrutable, wrong, zero-context solutions, and when I called them on it-- my manager received a report I was being mean. I feel like it's time to get my hose and spray the kids to keep them off my lawn with how this is coming off, but what in the hell is going on. Total circus.
Re: (Score:2)
and when I called them on it-- my manager received a report I was being mean.
Use more emojis and memes. You can't say someone is mean when they send you a positive cat GIF.
Like, Stop using AI, you fucker! [giphy.com]
Re: (Score:1)
Re: (Score:2)
Seniors lean on stackoverflow
Re: (Score:2)
And what is worse is that these juniors will never grow into seniors (except by aging), because harder stuff "AI" cannot help them with and they never really learn the simple stuff now.
Re: (Score:2)
What does ChatGPT say? (Score:3)
Copilot, trained on github code (Score:2)
Including repos that are created to demonstrate vulnerable code.
Code quality isn't the point (Score:2)
It solves the wrong problem (Score:2)
The problem is not to write code faster, the problem is to write better code, and even if Copilot consistently wrote excellent code, the mere fact that it becomes easier and faster to write code will cause there to be more code. More code, however, means more complexity, more errors and higher costs maintaining that code.