

AI Models Still Struggle To Debug Software, Microsoft Study Shows (techcrunch.com)
Some of the best AI models today still struggle to resolve software bugs that wouldn't trip up experienced devs. TechCrunch: A new study from Microsoft Research, Microsoft's R&D division, reveals that models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o3-mini, fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, despite bold pronouncements from companies like OpenAI, AI is still no match for human experts in domains such as coding.
The study's co-authors tested nine different models as the backbone for a "single prompt-based agent" that had access to a number of debugging tools, including a Python debugger. They tasked this agent with solving a curated set of 300 software debugging tasks from SWE-bench Lite.
According to the co-authors, even when equipped with stronger and more recent models, their agent rarely completed more than half of the debugging tasks successfully. Claude 3.7 Sonnet had the highest average success rate (48.4%), followed by OpenAI's o1 (30.2%), and o3-mini (22.1%).
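The paper describes the harness only at a high level, but a "single prompt-based agent" of this kind can be pictured as a loop: show the model the bug report plus everything the debugger has printed so far, run whichever debugger command it asks for next, and stop when it proposes a patch. The Python sketch below is purely illustrative (run_model and run_pdb are hypothetical stand-ins for an LLM API call and a pdb-session wrapper, not anything from the study):

    # Illustrative sketch of a "single prompt-based agent" with debugger access.
    # Not the study's code: run_model and run_pdb are injected stand-ins.
    import json
    from typing import Callable, Optional

    def debug_task(bug_report: str,
                   run_model: Callable[[str], str],
                   run_pdb: Callable[[str], str],
                   max_steps: int = 20) -> Optional[str]:
        """Loop: prompt the model with the transcript, run the debugger command
        it asks for, append the observation, stop when it proposes a patch."""
        transcript = [f"Bug report:\n{bug_report}"]
        for _ in range(max_steps):
            reply = run_model("\n\n".join(transcript))
            action = json.loads(reply)            # e.g. {"tool": "pdb", "cmd": "b foo.py:42"}
            if action["tool"] == "patch":
                return action["diff"]             # the agent's proposed fix
            observation = run_pdb(action["cmd"])  # run one debugger command
            transcript.append(f"> {action['cmd']}\n{observation}")
        return None                               # step budget exhausted, no fix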
Re: They should do a test. (Score:2)
Compare debugging in these three scenarios:
1. Code developed by humans
2. Code developed by humans with AI assistance
3. Code developed with "vibe" coding
You can use any tool you want to debug. But I'm guessing code written by humans is debugged way quicker.
Yes, "AI" models struggle to do anything (Score:5, Insightful)
that they were not designed to do, and intelligence is not something they were designed for.
They'll struggle at it for the foreseeable future, and it doesn't really matter how much more power and GPUs the "investment community" throws at them.
Re:Yes, "AI" models struggle to do anything (Score:4, Insightful)
Debugging is twice as hard as writing code in the first place, maybe more. Crucially it requires extensive reasoning skills to catch anything more than trivial mistakes that the compiler can usually flag up anyway. LLMs are not good at that kind of thing.
You can't just hand AI code to a human and expect them to quickly debug it either.
Re: (Score:3)
You can't just hand AI code to a human and expect them to quickly debug it either.
Why? You sure can.
The problem is the generated code fails everywhere. The syntax is wrong, the function calls are wrong, the parameters are wrong. And, of course, you can't expect anything that even begins to approach a design effort. It is just a picture that looks like code to the "AI".
With real pictures your brain sometimes compensates. The compilers, however, tend to be stricter, so the failure is complete.
Re: (Score:3)
Have you looked at AI generated code? It superficially looks okay, but when you start to understand it you realize that in many cases it needs massive refactoring and documenting to be debuggable.
That's not to say human written code never needs that, but it's far, far worse with AI.
Re: (Score:3)
Sometimes the code is structured OK for small things but has subtle bugs like numerical instability that you'd probably only know about if you already know the problem well.
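A toy illustration (mine, not from the study): both functions below look plausible, but the one-pass E[x^2] - E[x]^2 formula loses essentially all precision once the data sits on a large common offset, which is exactly the sort of bug you only catch if you already know the numerics.

    # Both of these "look correct"; the naive one-pass formula suffers
    # catastrophic cancellation when the data has a large common offset.
    def variance_naive(xs):
        n = len(xs)
        mean = sum(xs) / n
        mean_sq = sum(x * x for x in xs) / n
        return mean_sq - mean * mean        # rounding error swamps the true value

    def variance_two_pass(xs):
        n = len(xs)
        mean = sum(xs) / n
        return sum((x - mean) ** 2 for x in xs) / n

    data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
    print(variance_naive(data))     # dominated by rounding error (often 0.0 or even negative)
    print(variance_two_pass(data))  # 22.5, the correct population variance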
Re: (Score:2)
It solved almost 50% of the bugs. In minutes. For a few dollars.
An easy 95% if not 100% of humans are unable to do that.
80% of humans would need weeks if not months of training to be able to solve any bug at all.
Struggling, my ass.
Re: (Score:2)
AIs are *designed* to recognize patterns. That is in fact what they are good at. To the extent that bugs follow certain patterns, AI should in fact be good at finding them.
Unfortunately, many bugs are in the eye of the beholder, more subjective than objective. Those kinds of bugs will be harder for AI to spot.
Most likely the model was not designed to find bugs (Score:2)
To make it find bugs you have to train it with bugs.
It makes no sense to expect a random model to give hints about a topic it is not trained on.
Then again: what is a bug? If you have running code that is sparsely commented and no exceptions, it is pretty difficult to identify something as a bug. It is probably easier if you have the original spec and can write a test for (parts of?) the code.
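A made-up example of what I mean (hypothetical function and spec, purely to illustrate): once the spec says "a discount is capped at 50%", a few lines of pytest pin that behaviour down, and anything the code does differently is a bug by definition rather than a matter of opinion.

    # Hypothetical spec: "a discount is never larger than 50%".
    import pytest

    def apply_discount(price: float, percent: float) -> float:
        capped = min(percent, 50.0)            # the spec's cap
        return price * (1.0 - capped / 100.0)

    def test_discount_is_capped_at_fifty_percent():
        assert apply_discount(100.0, 80.0) == 50.0       # 80% requested, 50% applied

    def test_ordinary_discount_is_applied():
        assert apply_discount(200.0, 10.0) == pytest.approx(180.0)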
Re: (Score:3)
Then again: what is a bug? If you have running code that is sparsely commented and no exceptions, it is pretty difficult to identify something as a bug.
OTOH there are some things that are a bug in almost any context. e.g. if you can spot where a C or C++ program invokes undefined behavior, you've spotted a bug by any reasonable definition of "bug".
Re: (Score:1)
e.g. if you can spot where a C or C++ program invokes undefined behavior, you've spotted a bug by any reasonable definition of "bug".
That is not necessarily a bug, as the compiler vendor in question might have defined quite precisely what the behaviour is in that case.
But your idea is good. However, strictly speaking, these days I would expect the compiler to give a warning. No idea: do compiler suites still come with a lint(er)?
Re: (Score:2)
Then again: what is a bug? If you have running code that is sparsely commented and no exceptions, it is pretty difficult to identify something as a bug.
That's exactly the problem. Someone gives an example of code with a bug and the AI doesn't know which part it is.
Re: (Score:2)
Except I think that you could well use, say, all open source GitHub commits that are marked as bug fixes as a base.
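A rough sketch of how that harvesting might look (entirely hypothetical; it just shells out to plain git log / git show, and the keyword list is my own guess at what "marked as bug fixes" would mean in practice):

    # Harvest (message, diff) pairs for commits whose message looks like a
    # bug fix, as raw material for a training set. repo_path and the keyword
    # list are illustrative choices, not anything from the thread.
    import subprocess

    FIX_KEYWORDS = ("fix", "bug", "regression", "crash")

    def bugfix_commits(repo_path: str, limit: int = 100):
        out = subprocess.run(
            ["git", "-C", repo_path, "log", f"-{limit}", "--pretty=format:%H\x1f%s"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            sha, subject = line.split("\x1f", 1)
            if any(k in subject.lower() for k in FIX_KEYWORDS):
                diff = subprocess.run(
                    ["git", "-C", repo_path, "show", "--pretty=format:", sha],
                    capture_output=True, text=True, check=True,
                ).stdout
                yield subject, diff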
Re: (Score:2)
Yes, but then it applies whichever fix looks most like one it has seen before, and while that lets it fix the most common problems, it also lets it screw up in unexpected and even inhuman ways (which might be harder to detect and debug) in unusual situations. So if you want them to write you a fart app, great, but if you want more than a shit job, it's a problem.
I actually think that's still valuable, but ultimately the quality of software is going to come down to the q
Re: (Score:1)
Most tests I saw were completely inadequate. Perhaps remotely usable as regression tests.
They just write a test that conforms to the control flow/data flow, without even trying to test for the intended purpose of the code.
Re: (Score:2)
Exactly. You cannot trust the model to produce code, so a human who knows how it should work has to check up on it; you cannot trust the model to produce tests, so a human who knows how they should work has to check up on those as well. This may reduce the work they have to do, but it doesn't reduce the need for them to know things, and may even increase it.
Re: Most likely the model was not designed to find (Score:2)
That's only true if these models are fundamentally incapable of reasoning. Which they are. But that's not how they're sold.
Re: (Score:1)
They do not have to reason.
And I do not know how they are sold.
Never "bought" one and never used one. I train them. Big difference.
Like saying that Managers still struggle to manage (Score:2)
Saying that you're something doesn't automagically mean that you're good at it.
Maybe AI is intelligent after all (Score:5, Funny)
Re:Maybe AI is intelligent after all (Score:4, Interesting)
I read it the same way: "sorry, AI's not going to fix our shitty software, so you'll have to keep enduring years-old bugs we can't be arsed to fix".
Constrained by the documentation (Score:5, Insightful)
AIs are only as good as the material they've learned from.
I had a bug of sorts that I fixed by myself only today that I'd been trying to get the various flavours of ChatGPT to shine light on.
None gave me the answer and even when I came up with the solution and proposed it to GPT it said "yeah, sure, you are probably right but I couldn't find any references to back up your solution".
I've been writing software professionally for 35 years. AI has knowledge but lacks experience, wisdom, intuition, call it what you will.
AI is brilliant, I couldn't work without it now as it saves me so much time, but it isn't going to replace me in its current form.
Re: (Score:2)
Well, obviously. Experience, as opposed to knowledge, cannot be taught.
Wisdom and intuition both rely on experience, among other things.
So this is just logical. I use ChatGPT when scripting because I don't have a lot of practice at it, so I tend to forget basic things. I can supply some experience and intuition while ChatGPT fills the knowledge gaps.
Works on my level.
And that's why MS software sucks (Score:5, Funny)
Now the mystery is solved. MS has been using AI to write software for longer than we knew.
Debugging is not just looking at code (Score:2)
Are Microsoft having problems? (Score:1)
Re: Are Microsoft having problems? (Score:2)
They've been using these sorts of tools for internal testing of their own code for some time and fixing the bugs before they are released, and their code is proprietary, so no one will ever see what the bug was or how it was fixed. Linux is an excellent way to demonstrate the tech without the hassles of working with proprietary code.
Re: (Score:2)
As far as I know they aren't required to let people know about bugs they have fixed. Publishing other people's bugs shows the capability without the embarrassment.
Microsoft Marketing (Score:5, Funny)
This from the company that made opening your email dangerous.
Matches my experience (Score:1)
I can use A.I. to help increase my new development speed by 300%. And that includes adding new features to existing code.
But A.I. has not been any good at maintenance.
Most debugging sessions are not turned into videos or written up. They are a flow state, with the programmer loading sufficient information about the code into their brain until they suddenly have enough to know what the problem is. That, or they literally stumble across the line with the bug.
You'd probably need a million good videos of debugging s
Re: (Score:2)
>> What works is breaking up a project into simple pieces, getting that ONE piece working
I find that's the best policy, and it takes some practice and experience with the AI to find out the edges of its capability. If your prompt is too general or vague the AI will go nuts and write way too much code that isn't what you wanted and may not even work. You have to give it specific granular tasks and you have to test what it did. And don't be afraid to reject the results and try a different approach.
Why "still"? (Score:2)
Is there any sane indicator that this will ever change?
Step 1: Use Human Written Code (Score:2)
Re: Step 1: Use Human Written Code (Score:2)
It's a very fair prediction. Garbage in, garbage out.
It's a feature? (Score:1)