

AI Models Still Struggle To Debug Software, Microsoft Study Shows (techcrunch.com)
Some of the best AI models today still struggle to resolve software bugs that wouldn't trip up experienced devs. TechCrunch: A new study from Microsoft Research, Microsoft's R&D division, reveals that models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o3-mini, fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, despite bold pronouncements from companies like OpenAI, AI is still no match for human experts in domains such as coding.
The study's co-authors tested nine different models as the backbone for a "single prompt-based agent" that had access to a number of debugging tools, including a Python debugger. They tasked this agent with solving a curated set of 300 software debugging tasks from SWE-bench Lite.
According to the co-authors, even when equipped with stronger and more recent models, their agent rarely completed more than half of the debugging tasks successfully. Claude 3.7 Sonnet had the highest average success rate (48.4%), followed by OpenAI's o1 (30.2%), and o3-mini (22.1%).
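The paper describes the harness only at a high level, but a "single prompt-based agent" of this kind can be pictured as a loop: show the model the bug report plus everything the debugger has printed so far, run whichever debugger command it asks for next, and stop when it proposes a patch. The Python sketch below is purely illustrative (run_model and run_pdb are hypothetical stand-ins for an LLM API call and a pdb-session wrapper, not anything from the study):

    # Illustrative sketch of a "single prompt-based agent" with debugger access.
    # Not the study's code: run_model and run_pdb are injected stand-ins.
    import json
    from typing import Callable, Optional

    def debug_task(bug_report: str,
                   run_model: Callable[[str], str],
                   run_pdb: Callable[[str], str],
                   max_steps: int = 20) -> Optional[str]:
        """Loop: prompt the model with the transcript, run the debugger command
        it asks for, append the observation, stop when it proposes a patch."""
        transcript = [f"Bug report:\n{bug_report}"]
        for _ in range(max_steps):
            reply = run_model("\n\n".join(transcript))
            action = json.loads(reply)            # e.g. {"tool": "pdb", "cmd": "b foo.py:42"}
            if action["tool"] == "patch":
                return action["diff"]             # the agent's proposed fix
            observation = run_pdb(action["cmd"])  # run one debugger command
            transcript.append(f"> {action['cmd']}\n{observation}")
        return None                               # step budget exhausted, no fix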
Re: They should do a test. (Score:2)
Compare debugging in these three scenarios:
1. Code developed by humans
2. Code developed by humans with AI assistance
3. Code developed with "vibe" coding
You can use any tool you want to debug. But I'm guessing code written by humans is debugged way quicker.
Yes, "AI" models struggle to do anything (Score:5, Insightful)
that they were not designed to do, and intelligence is not something they were designed for.
They'll struggle at it for the foreseeable future, and it doesn't really matter how much more power and GPUs the "investment community" throws at them.
Re:Yes, "AI" models struggle to do anything (Score:4, Insightful)
Debugging is twice as hard as writing code in the first place, maybe more. Crucially it requires extensive reasoning skills to catch anything more than trivial mistakes that the compiler can usually flag up anyway. LLMs are not good at that kind of thing.
You can't just hand AI code to a human and expect them to quickly debug it either.
Re: (Score:3)
You can't just hand AI code to a human and expect them to quickly debug it either.
Why? You sure can.
The problem is the generated code fails everywhere. The syntax is wrong, the function calls are wrong, the parameters are wrong. And, of course, you can't expect anything that even begins to approach a design effort. It is just a picture that looks like code to the "AI".
With real pictures your brain sometimes compensates. The compilers, however, tend to be stricter, so the failure is complete.
Re: (Score:3)
Have you looked at AI generated code? It superficially looks okay, but when you start to understand it you realize that in many cases it needs massive refactoring and documenting to be debuggable.
That's not to say human written code never needs that, but it's far, far worse with AI.
Re: (Score:3)
Sometimes the code is structured OK for small things but has subtle bugs like numerical instability that you'd probably only know about if you already know the problem well.
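A toy illustration (mine, not from the study): both functions below look plausible, but the one-pass E[x^2] - E[x]^2 formula loses essentially all precision once the data sits on a large common offset, which is exactly the sort of bug you only catch if you already know the numerics.

    # Both of these "look correct"; the naive one-pass formula suffers
    # catastrophic cancellation when the data has a large common offset.
    def variance_naive(xs):
        n = len(xs)
        mean = sum(xs) / n
        mean_sq = sum(x * x for x in xs) / n
        return mean_sq - mean * mean        # rounding error swamps the true value

    def variance_two_pass(xs):
        n = len(xs)
        mean = sum(xs) / n
        return sum((x - mean) ** 2 for x in xs) / n

    data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
    print(variance_naive(data))     # dominated by rounding error (often 0.0 or even negative)
    print(variance_two_pass(data))  # 22.5, the correct population variance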
Re: (Score:2)
It solved almost 50% of the bugs. In minutes. For a few dollars.
An easy 95% if not 100% of humans are unable to do that.
80% of humans would need weeks if not months of training to be able to solve any bug at all.
Struggling, my ass.
Re: (Score:2)
AIs are *designed* to recognize patterns. That is in fact what they are good at. To the extent that bugs follow certain patterns, AI should in fact be good at finding them.
Unfortunately, many bugs are in the eye of the beholder, more subjective than objective. Those kinds of bugs will be harder for AI to spot.
Most likely the model was not designed to find bugs (Score:2)
To make it find bugs you have to train it with bugs.
It makes no sense to expect a random model to give hints about a topic it is not trained on.
Then again: what is a bug? If you have running code that is sparsely commented and no exceptions, it is pretty difficult to identify something as a bug. It is probably easier if you have the original spec and can write a test for (parts of?) the code.
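A made-up example of what I mean (hypothetical function and spec, purely to illustrate): once the spec says "a discount is capped at 50%", a few lines of pytest pin that behaviour down, and anything the code does differently is a bug by definition rather than a matter of opinion.

    # Hypothetical spec: "a discount is never larger than 50%".
    import pytest

    def apply_discount(price: float, percent: float) -> float:
        capped = min(percent, 50.0)            # the spec's cap
        return price * (1.0 - capped / 100.0)

    def test_discount_is_capped_at_fifty_percent():
        assert apply_discount(100.0, 80.0) == 50.0       # 80% requested, 50% applied

    def test_ordinary_discount_is_applied():
        assert apply_discount(200.0, 10.0) == pytest.approx(180.0)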
Re: (Score:3)
Then again: what is a bug? If you have running code that is sparsely commented and no exceptions, it is pretty difficult to identify something as a bug.
OTOH there are some things that are a bug in almost any context. e.g. if you can spot where a C or C++ program invokes undefined behavior, you've spotted a bug by any reasonable definition of "bug".
Re: (Score:1)
e.g. if you can spot where a C or C++ program invokes undefined behavior, you've spotted a bug by any reasonable definition of "bug".
That is not necessarily a bug, as the compiler vendor in question might have defined quite precisely what the behaviour is in that case.
But your idea is good. However, strictly speaking, these days I would expect the compiler to give a warning. No idea: do compiler suites still come with a lint(er)?
Re: (Score:2)
Then again: what is a bug? If you have running code that is sparsely commented and no exceptions, it is pretty difficult to identify something as a bug.
That's exactly the problem. Someone gives an example of code with a bug and the AI doesn't know which part it is.
Re: (Score:2)
Except I think that you could well use, say, all open source GitHub commits that are marked as bug fixes as a base.
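A rough sketch of how that harvesting might look (entirely hypothetical; it just shells out to plain git log / git show, and the keyword list is my own guess at what "marked as bug fixes" would mean in practice):

    # Harvest (message, diff) pairs for commits whose message looks like a
    # bug fix, as raw material for a training set. repo_path and the keyword
    # list are illustrative choices, not anything from the thread.
    import subprocess

    FIX_KEYWORDS = ("fix", "bug", "regression", "crash")

    def bugfix_commits(repo_path: str, limit: int = 100):
        out = subprocess.run(
            ["git", "-C", repo_path, "log", f"-{limit}", "--pretty=format:%H\x1f%s"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            sha, subject = line.split("\x1f", 1)
            if any(k in subject.lower() for k in FIX_KEYWORDS):
                diff = subprocess.run(
                    ["git", "-C", repo_path, "show", "--pretty=format:", sha],
                    capture_output=True, text=True, check=True,
                ).stdout
                yield subject, diff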
Re: (Score:2)
Yes, but then it applies whichever fix looks most like one it has seen before, and while that lets it fix the most common problems, it also lets it screw up in unexpected and even inhuman ways (which might be harder to detect and debug) in unusual situations. So if you want them to write you a fart app, great, but if you want more than a shit job, it's a problem.
I actually think that's still valuable, but ultimately the quality of software is going to come down to the q
Re: (Score:1)
Most tests I saw were completely inadequate. Perhaps remotely usable as regression tests.
They just write a test that conforms to the control flow/data flow, without even trying to test for the intended purpose of the code.
Re: (Score:2)
Exactly. You cannot trust the model to produce code, so a human who knows how it should work has to check up on it; you cannot trust the model to produce tests, so a human who knows how they should work has to check up on those as well. This may reduce the work they have to do, but it doesn't reduce the need for them to know things, and may even increase it.
Re: Most likely the model was not designed to find (Score:2)
That's only true if these models are fundamentally incapable of reasoning. Which they are. But that's not how they're sold.
Re: (Score:1)
They do not have to reason.
And I do not know how they are sold.
Never "bought" one and never used one. I train them. Big difference.
Like saying that Managers still struggle to manage (Score:2)
Saying that you're something doesn't automagically mean that you're good at it.
Maybe AI is intelligent after all (Score:5, Funny)
Re:Maybe AI is intelligent after all (Score:4, Interesting)
I read it the same way: "sorry, AI's not going to fix our shitty software, so you'll have to keep enduring years-old bugs we can't be arsed to fix".
Constrained by the documentation (Score:5, Insightful)
AIs are only as good as the material they've learned from.
I had a bug of sorts that I fixed by myself only today that I'd been trying to get the various flavours of ChatGPT to shine light on.
None gave me the answer and even when I came up with the solution and proposed it to GPT it said "yeah, sure, you are probably right but I couldn't find any references to back up your solution".
I've been writing software professionally for 35 years. AI has knowledge but lacks experience, wisdom, intuition, call it what you will.
AI is brilliant, I couldn't work without it now as it saves me so much time, but it isn't going to replace me in its current form.
Re: (Score:2)
Well, obviously. Experience, as opposed to knowledge, cannot be taught.
Wisdom and intuition both rely on experience, among other things.
So this is just logical. I use ChatGPT when scripting because I don't have a lot of practice at it, so I tend to forget basic things. I can supply some experience and intuition while ChatGPT fills the knowledge gaps.
Works on my level.
And that's why MS software sucks (Score:5, Funny)
Now the mystery is solved. MS has been using AI to write software for longer than we knew.
Debugging is not just looking at code (Score:2)
Are Microsoft having problems? (Score:1)
Re: Are Microsoft having problems? (Score:2)
They've been using these sorts of tools for internal testing of their own code for some time and fixing the bugs before they are released, and their code is proprietary, so no one will ever see what the bug was or how it was fixed. Linux is an excellent way to demonstrate the tech without the hassles of working with proprietary code.
Re: (Score:2)
As far as I know they aren't required to let people know about bugs they have fixed. Publishing other people's bugs shows the capability without the embarrassment.
Microsoft Marketing (Score:5, Funny)
This from the company that made opening your email dangerous.
Matches my experience (Score:1)
I can use A.I. to help increase my new development speed by 300%. And that includes adding new features to existing code.
But A.I. has not been any good at maintenance.
Most debugging sessions are not turned into videos or written up. They are a flow state, with the programmer loading sufficient information about the code into their brain until they suddenly have enough to know what the problem is. That, or they literally stumble across the line with the bug.
You'd probably need a million good videos of debugging s
Re: (Score:2)
>> What works is breaking up a project into simple pieces, getting that ONE piece working
I find that's the best policy, and it takes some practice and experience with the AI to find out the edges of its capability. If your prompt is too general or vague the AI will go nuts and write way too much code that isn't what you wanted and may not even work. You have to give it specific granular tasks and you have to test what it did. And don't be afraid to reject the results and try a different approach.
Why "still"? (Score:2)
Is there any sane indicator that this will ever change?
Step 1: Use Human Written Code (Score:2)
Re: Step 1: Use Human Written Code (Score:2)
It's a very fair prediction. Garbage in, garbage out.
It's a feature? (Score:1)