

AI Can Write Code But Lacks Engineer's Instinct, OpenAI Study Finds
Leading AI models can fix broken code, but they're nowhere near ready to replace human software engineers, according to extensive testing [PDF] by OpenAI researchers. The company's latest study put AI models and systems through their paces on real-world programming tasks, with even the most advanced models solving only a quarter of typical engineering challenges.
The research team created a test called SWE-Lancer, drawing from 1,488 actual software fixes made to Expensify's codebase, representing $1 million worth of freelance engineering work. When faced with these everyday programming tasks, the best AI model -- Claude 3.5 Sonnet -- managed to complete just 26.2% of hands-on coding tasks and 44.9% of technical management decisions.
Though the AI systems proved adept at quickly finding relevant code sections, they stumbled when it came to understanding how different parts of software interact. The models often suggested surface-level fixes without grasping the deeper implications of their changes.
The research, to be sure, used a thorough methodology to test AI coding abilities. Instead of relying on simplified programming puzzles, OpenAI's benchmark uses complete software engineering tasks that range from quick $50 bug fixes to complex $32,000 feature implementations. Each solution was verified through rigorous end-to-end testing that simulated real user interactions, the researchers said.
Instinct (Score:5, Funny)
The light air of the jungle is all I code to. I think not, I plan not. I only exist. "I exist. It is soft, so soft, so slow. And light: it seems as though it suspends in the air. It moves." --Sartre said it. He was a programmer.
Re: (Score:2)
If ChatGPT wrote that then you have truly demonstrated irony.
Re:Instinct (Score:4, Insightful)
Incidentally, instinct is innate and fixed, not learned. They do not even use the right word. What an experienced engineer has is _intuition_, which can at least be improved by learning and doing.
Obviously, a real engineer also has understanding and general intelligence, none of which are present in an LLM.
Re: (Score:2)
I would call it "The Knack". A good engineer has it and it's something that goes beyond intuition - if something is wrong, they know it, even if they can't articulate what.
It's when you're in the middle of coding and then something says you've done something wrong that has you scrapping your design and restarting. Or you have no clue where to start or something is complex, so you do a little bit and then the rest of it starts making a lot of sense.
I've seen someone try to code something using AI, and it was
Re: (Score:3)
I see the neck of logic soft and supple, then..I pounce! Nine months later the code is done and I maintain it for the next twenty years.
LOL - So, umm... I have to ask. Is this about programming or where that little bastard Jimmy came from?
Re: (Score:1)
Instinct...that's how I code. I can feel the sweet smell of algorithm floating on the breeze.
You joke but this is not far from the truth in a lot of cases, you can tell which way is going to be better to go in terms of architecture by feel.
I'm also not going to judge anyone for weighing algorithms by smell. :-)
Re: (Score:2)
It's true. I code a lot "by instinct", and when I need to explain to my co-workers why the way I suggest is better than the slop they coded, I actually hesitate and sometimes struggle, as my brain first needs to load all the logical reasons why my code is better and put all that abstract reasoning into words.
If you've done something long enough it becomes an instinct. Like playing an instrument or just walking - if you have to think about it, you will not be really good at it.
Re: (Score:2)
I sneak slowly through the binary tree
I mean, you either do or you don’t, that’s the neat thing about binary trees.
Re: (Score:2)
Epic! Very good writing. I could "feel" what you were saying. :)
Hmmm... have you ever read about flapping meat?
https://www.mit.edu/people/dpo... [mit.edu]
Seems to invoke a similar thought process.
This is reasonably fair. (Score:5, Interesting)
I'd say my biggest complaint with Claude in Cursor (which I must preface, is generally excellent) is its tendency to sometimes "wallpaper over problems". "Oh, this variable is arriving in this function out of the expected range? We'll just add in a range check and normalize it if it's out of the expected range." Things like that.
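To make that concrete, here's a tiny made-up sketch (hypothetical names, nothing from a real codebase) of what I mean by wallpapering -- instead of tracing why the caller passes a bad value, the suggested change just clamps it in place:

class PricingExample {
    // Hypothetical "wallpaper" fix: the real bug is wherever discountPercent
    // gets computed, but the suggested change just hides it here.
    double applyDiscount(double price, double discountPercent) {
        // "Fix": clamp the out-of-range value instead of finding where it came from.
        if (discountPercent < 0 || discountPercent > 100) {
            discountPercent = Math.max(0, Math.min(100, discountPercent));
        }
        return price * (1 - discountPercent / 100.0);
    }
}

The symptom goes away, the underlying defect stays.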
LRMs like o3 aren't as bad as this - the additional reasoning chains are much better at tracing back logical implications, although within limits. o3's API is however buggy as heck (frequently does not apply the changes it thinks it's applying) and its finetune is terrible compared to Claude's (it's lazy, and will argue with you and gaslight you until it gets you mad enough to curse it out in all caps). As-is, on difficult problems, I often first let o3 analyze the situation, then have Claude do the actual implementation.
Still doesn't "replace me", of course. Even if the models were perfect, tools like Cursor don't yet let models go through repeated cycles of "make a change... run the program... get any commandline errors / debug... get any GUI errors/debug... continue doing so as you work through the stages to reproduce the problem...make more changes.... repeat". So at a bare minimum the tools are going to need that before you can even think about replacing programmers. And it'll be critical to have the model create good design documents and a good test suite as it goes and run the test suite with all of its changes, so it'll know if it accidentally breaks some preexisting functionality. (In my current process, I have it update a design document as we go, and every prompt starts with "First, read README.md to understand the project and its goals. Next....")
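Roughly the loop I mean, as a sketch only -- askModel() and applyDiff() are placeholders for whatever model client and patch tooling you'd actually wire in, not real APIs:

import java.nio.file.Files;
import java.nio.file.Path;

// Rough sketch of the edit/run/debug cycle described above.
public class FixLoop {
    public static void main(String[] args) throws Exception {
        String goal = "Reproduce and fix the reported crash";
        String lastOutput = "";
        for (int attempt = 0; attempt < 10; attempt++) {
            // 1. Ask the model for a change, given the goal and the last run's output.
            String diff = askModel(goal, lastOutput);
            applyDiff(diff);

            // 2. Run the program/tests and capture stdout+stderr for the next round.
            Process p = new ProcessBuilder("./run_tests.sh")
                    .redirectErrorStream(true)
                    .start();
            lastOutput = new String(p.getInputStream().readAllBytes());
            int exit = p.waitFor();
            Files.writeString(Path.of("last_run.log"), lastOutput);

            // 3. Stop once the run is clean.
            if (exit == 0) break;
        }
    }

    static String askModel(String goal, String feedback) { return ""; /* placeholder */ }
    static void applyDiff(String diff) { /* placeholder */ }
}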
Anyway, can't wait until the new Claude LRM comes out. My expectations for it are high.
Re: (Score:3)
Re: (Score:2)
Not even remotely. It doesn't just change things without your knowledge - it makes a diff, which you then go through and approve or reject things from. And it makes the diff *way* faster than I can. Even if I were to have to go through many diffs, which I rarely do, it's still a lot faster.
Don't get me wrong, though, there's still plenty of room for improvement.
Re: (Score:2)
I treat it as another engineer working on my team. Just like with my interns and junior engineers, we do pair programming and code review. Cursor so far is slightly better than engineers with 2-3 years of experience in terms of design decisions and bugs. Plus it doesn't mind writing tests and documentation.
Re: (Score:1)
"...the additional reasoning chains..."
Whatever that is. Accepting terms like that is buying into the lie.
"...o3's API is however buggy as heck..."
More assumptions. AI apologists love to define away undesirable outcomes as "bugs" or "hallucinations". AI is deterministic because its underlying engine cannot be otherwise; what it produces is what it is "designed" to produce. The problem is that no one understands the "design" because it is built into the training, not just the coding. What we see are
Re:This is reasonably fair. (Score:4, Informative)
That's literally how LRMs work. That's literally the difference between LLMs and LRMs.
Whether the API is buggy or not is not an "assumption", for God's sake. The API has trouble with merges, which the website does not. Which leaves the model constantly thinking it's applied things, except the tool changes don't go through. It's a widely recognized API problem and has absolutely zip to do with hallucinations. The model spells out the tool request, but the tool request doesn't follow through.
And beyond this: literally everything else you write is also false. LLMs absolutely have world models, we very much can and do probe how their world models work and how they process information, and very much "understand the design". The fact that you don't personally understand how they work has no bearing.
It's like the Insane Clown Posse writing a song insisting that nobody understands how magnets work because they personally don't understand them. "And I don't wanna talk to a scientist, y'all motherfuckers lying!"
Re: (Score:2)
That's literally how LRMs work. That's literally the difference between LLMs and LRMs.
Except no "reasoning" is done in the "reasoning chain". The term is nothing but a lie.
Re: (Score:2)
If that's your linguistic claim, then the word "reasoning" has no meaning.
Re: (Score:2)
Bullshit. Please stop lying by misdirection.
Re: (Score:2)
Hell, the OP also misses the at least somewhat widely held philosophical position that essentially *Humans* are deterministic - at least in the way you could use that word to describe LLMs etc...
Re: (Score:2)
The (slim) majority of philosophers are apparently materialists, which would require that humans are either deterministic or deterministic with randomness, but computable either way. That's fairly remarkable since more than 80% of humans claim to be affiliated with some religious group, and many of the rest are "spiritual."
Re: (Score:2)
I'd say my biggest complaint with Claude in Cursor (which I must preface, is generally excellent) is its tendency to sometimes "wallpaper over problems". "Oh, this variable is arriving in this function out of the expected range? We'll just add in a range check and normalize it if it's out of the expected range." Things like that.
Soooo, "the variable would have been '0' for a regular user and '1' for an administrator. Lets just make it '1' if it is larger than '1' and everything is fine!"
Yep, sounds like a _great_ tool to create vulnerabilities and corrupted data. About the level of "insight" I expect from an LLM.
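Roughly what that looks like in code (a made-up sketch, not from any real system): the blind "normalize the value" patch turns a data bug into a privilege escalation.

// role is supposed to be 0 (user) or 1 (admin). A garbage value like 7 should
// be rejected, but the "normalizing" fix silently promotes it to admin instead
// of failing loudly.
class RoleExample {
    static final int USER = 0;
    static final int ADMIN = 1;

    int normalizeRole(int role) {
        // "Fix": clamp anything out of range back into [0, 1] -- i.e. grant admin.
        return Math.max(USER, Math.min(ADMIN, role));
    }
}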
Re: (Score:3)
1) Not like *that*. It doesn't do things that are obviously wrong. When such failures happen, they're a result of not seeing enough of the big picture, not "looking at the small picture and doing something nonsensical with the small picture".
2) It doesn't in any way, shape or form keep secret about what its strategy is, which you can reject at will.
3) This is not the general case from use; it's the most common failure case. The vast majority of cases aren't failure cases.
I am shocked (Score:4, Funny)
That they released this report flies in the face of everything Sam Altman has said about AI and having an AGI out in a short amount of time.
Re: (Score:3, Insightful)
That they released this report flies in the face of everything Sam Altman has said about AI and having an AGI out in a short amount of time.
AI will be ready to replace humans the same day we have our first fusion power plant opening.
Samwise perhaps says what is needed to get that sweet sweet money?
Re: (Score:2)
So around 2035-ish? A decade from now?
Re: (Score:3, Funny)
So around 2035-ish? A decade from now?
Always a decade from now, forever and ever.
Re: (Score:2)
Used to be 30 years from now. Then twenty.
Re: (Score:2)
Used to be 30 years from now. Then twenty.
It's like the old line about getting halfway to your destination one day, then halfway the next, then halfway the next. But you never get there.
Re: (Score:2)
Anyway, what's the hurry? Humans can do fine with the green energy options we already have. It's just our new AI masters that need the fusion reactors.
Re: (Score:2)
Anyway, what's the hurry? Humans can do fine with the green energy options we already have.
I concur.
It's just our new AI masters that need the fusion reactors.
Which is surely a nail in the coffin of our hallucinating power sucking AI Masters.
Re: (Score:2)
It's not "the old line", it's "Zeno's Paradox", and if you follow it to the letter, then no task of any kind can ever be accomplished. It's not actually an argument against anything.
Re: (Score:2)
It's not "the old line", it's "Zeno's Paradox", and if you follow it to the letter, then no task of any kind can ever be accomplished. It's not actually an argument against anything.
I see you took your pedantry pills this morning.
Re: (Score:2)
Spoken like a person who has not at all followed the progress in fusion since you "1st read about it in 1975", a time when our computer models were still highly primitive, the compute power for them highly lacking, and four decades before the advent of commercially available ReBCO tapes.
Re: (Score:2)
Hahahaha, no. There is a good chance of an actual working industrialized fusion power plant in 100 years or so.
Re: (Score:2)
Hahahaha, no. There is a good chance of an actual working industrialized fusion power plant in 100 years or so.
What an optimist! 8^)
Re: (Score:2)
Actually I have listened to the plasma-physicists running the X7. They are currently at the stage that there are no show-stoppers in the Physics anymore. Might still be cost issues, efficiency issues, material issues and those are probably in the 50-150 year range.
Re: (Score:2)
Actually I have listened to the plasma-physicists running the X7. They are currently at the stage that there are no show-stoppers in the Physics anymore. Might still be cost issues, efficiency issues, material issues and those are probably in the 50-150 year range.
I don't doubt it. All of those peripheral issues are the real problems, IMO. All those issues you noted are pretty big.
We've been doing this for a long time, and haven't gotten very far. We're also doing fusion at the shallow end of the pool with Deuterium-Tritium fusion.
I suspect that if we're going to do this for real and sustainably, we're going to need to go to an aneutronic form of fusion. But each has its own issues. Deuterium-Helium is the next candidate, but helium-3 has supply issues. Deuteri
Re: (Score:2)
Re: (Score:2)
Plasma physics, material science, heat-exchanger, design questions, getting what you need manufactured, etc.
I listened to a long podcast from the X7 people a while ago and they are probably 2-3 prototypes away from one that works for energy generation. Assume 20-30 years per. Then one or two of an actual industrialized prototype.
Pouring a lot of money into it could maybe bring this down to 50 years total or so, but the money would probably need to be > 100 times of what is invested now. But even with lim
Re: (Score:2)
Funny you should say that
https://science.slashdot.org/s... [slashdot.org]
Re: (Score:3)
Any company that makes its money writing bubble sort implementations in python will be absolutely killing it with AI.
Finally! (Score:5, Insightful)
Finally, a study that agrees with what my 40+ years as a coder tells me: AI can help at the tactical level, but it doesn't (yet) get the big picture. And it won't - until we come up with a way to describe the big picture. And given that the "big picture" in a large corporation is a teetering stack of incompatible wishes scattered across a dozen managers' fever-dreams, I doubt we'll get to AI-nirvana in my lifetime. But for you up-and-coming wanna-be programmers, I don't think code writing will be a lifetime career the way it was for me.
Re: (Score:3)
Finally, a study that agrees with what my 40+ years as a coder tells me: AI can help at the tactical level, but it doesn't (yet) get the big picture. And it won't - until we come up with a way to describe the big picture. And given that the "big picture" in a large corporation is a teetering stack of incompatible wishes scattered across a dozen managers' fever-dreams, I doubt we'll get to AI-nirvana in my lifetime. But for you up-and-coming wanna-be programmers, I don't think code writing will be a lifetime career the way it was for me.
Very good points. The ability to understand the big picture and how systems interact is a combination of experience and skill. It's one thing to recommend a fix, another to understand the implications of said fix, as well as to fill in blanks in requirements that may be vague. Then there is the whole question of how humans will likely interact with the system.
Or as an engineering prof once told me as we were getting ready to calibrate a shock tube for a test:
Anyone can tell you hitting it with a hammer will produce a
Re: (Score:2)
The ability to understand the big picture and how systems interact is a combination of experience and skill
AI may not have the (full) skill (yet), but at least it's able to eat up tons of experience at once.
Re: (Score:3)
The ability to understand the big picture and how systems interact is a combination of experience and skill
AI may not have the (full) skill (yet), but at least it's able to eat up tons of experience at once.
Hoovering up a bunch of data is a lot different than creating a good product. It's like watching hundreds of hours of porn and then trying to be a good lover.
Re: (Score:2)
I've started using it to write basic Java objects that let me bring in a CSV file using OpenCSV and the BindByName annotation, etc. as well as DB create table statements.
"Create a DB2 create table statement that will match this header row from a CSV file " and then paste in the header row/record.
Far from perfect, have to add our internal standards stuff (created/updated timestamp and by, etc) and sometimes fiddle with data types or field sizes, but it saves me a lot of time doing mindless copy/pasting or s
Re: (Score:2)
In my experience, while this is true most of the time, sometimes it not only gets the big picture, it even points out where I am thinking too small. Unfortunately such insightful moments are rare.
Re: (Score:2)
Finally, a study that agrees with what my 40+ years as a coder tells me: AI can help at the tactical level, but it doesn't (yet) get the big picture.
You should try it and see how well it actually works.
Wait? (Score:3)
AI Can Write Code But Lacks Engineer's Instinct, OpenAI Study Finds
Wasn't AI supposed to replace all coders, software engineers and computer scientists by 2027? ... and eliminate all human labour by 2030 and replace it with Tesla bots?
Re: (Score:2)
and replace it with Tesla bots?
Musk too busy to deal with that now.
Re: (Score:2)
and replace it with Tesla bots?
Musk too busy to deal with that now.
If his Tesla bot is anything like as well thought out as his genius idea to fire the keepers of the US nuclear arsenal I'm not going to worry about his bots.
Re:Wait? (Score:5, Insightful)
Yes! And rain fluffy bunnies on everybody as well!
Lies, damned lies, and the crap a scammer tells you. And that is what the LLM-bros are mostly by now.
Re: (Score:2)
Yep, and we were going to need UBI ASAP to make sure unemployed coders don't die of starvation. I guess we can wait a little longer now.
what can we learn? (Score:5, Insightful)
"....even the most advanced models solving only a quarter of typical engineering challenges."
So this means that a quarter of engineering challenges, as OpenAI defines them, require no creative activity. A quarter of these engineering tasks are limited to pure crank-turning and AI can do those...poorly.
Everything about AI reduces to defining terms to suit narratives and take money.
Re: (Score:3)
And they are lying by misdirection there. The thing is, an engineer is the person that looks at a problem and then decides whether boilerplate will do it, or whether it needs more analysis. They did not even test for that here. What they tested for is whether an LLM can replace the technician that got told what to implement by the engineer, in the absence of that engineer. And no, it cannot. Because that is impossible.
Re: (Score:2)
Everything about AI reduces to defining terms to suit narratives and take money.
Remember, it has to be centient before it can be dollarient.
Re: (Score:2)
So this means that a quarter of engineering challenges, as OpenAI defines them, require no creative activity.
For software engineering, 90% of what we do is "solved' problems, similar to what we've done before.
The creative parts come from figuring out how to communicate the structure and intent to future programmers, and to understand the structure/intent of the present code. That is creative also, though.
Nowhere ready? (Score:2)
they're nowhere near ready to replace human software engineers
When they do, they'll be ready to replace humans altogether.
Give it time (Score:2)
Re: (Score:3)
My son is a CS sophomore in college. AI can beat his coding skills on the toy projects they assign in Comp Sci courses. But each year the assignments get larger and more abstract.
Compare to my skills as developer with decades of experience. AI is fine for well known, highly localized issues similar to the toy projects of a college class. But ask a general AI a question about the broad architecture of a large program? It is hopelessly wrong 90% of the time.
For programming I believe we need models specialized
Managers beware (Score:3)
Apparently it ain't my job they're coming for.
Arrogant (Score:2)
No one knows what they want (Score:2)
Re: (Score:2)
X is money. It's always money. "How do we transfer more customer and supplier/partner money into our hands?"
That is "rationality", not "instinct". (Score:5, Insightful)
And even then, it would be "intuition".
LLMs cannot understand what they are doing. That is their first difference from an engineer. And the second is that they do not and cannot have an experienced engineer's intuition. Hence they are so far removed from what a good engineer can do that the comparison makes no sense at all. Might as well compare a camera to da Vinci, and being able to make a photo of the Mona Lisa to being able to paint the real thing from scratch, including having the idea.
Oh, and in case anybody is impressed by that $32,000 figure, that is about 2-3 weeks of work, so still something pretty simple and minor.
Just like real life... (Score:2)
Though the AI systems proved adept at quickly finding relevant code sections, they stumbled when it came to understanding how different parts of software interact. The models often suggested surface-level fixes without grasping the deeper implications of their changes.
So they've replicated a large product development team (large codebase, or large team - doesn't matter which) in an "AI". Perhaps they trained the system on real-life examples... GIGO.
Re: (Score:2)
garden path (Score:2)
The main advantage of experience is that you are not led down some dead end garden path by how it looks at the beginning.
As with every technology, growing pains (Score:2)
Every technology starts with a bang, amazes people, and scares others. But soon, reality sets in as we learn the technology's limitations and issues. AI is no different.
Claude is a friggin liar (Score:2)
Our code assist bot is Claude, and so far it's been really good at wasting my time by lying. I wanted it to inline/refactor a bunch of files in a certain way, and I asked it how I should provide the files, since pasting them into the console would be stupid. It gave me 3 options, including linking to the directory where the files live in Github.
I wasn't expecting this to work at all, and it didn't, but it looked like it was TRYING, which we know isn't a thing, because let's remind ourselves: __it's predicti
In other news, 500 trillion synapses better ... (Score:2)
In other news, 500 trillion synapses are better than 100 billion