'How Good Is ChatGPT at Coding, Really?' (ieee.org)
IEEE Spectrum (the IEEE's official publication) asks the question. "How does an AI code generator compare to a human programmer?"
A study published in the June issue of IEEE Transactions on Software Engineering evaluated the code produced by OpenAI's ChatGPT in terms of functionality, complexity, and security. The results show that ChatGPT has an extremely broad range of success when it comes to producing functional code, with success rates ranging from as poor as 0.66 percent to as good as 89 percent, depending on the difficulty of the task, the programming language, and a number of other factors. While in some cases the AI generator could produce better code than humans, the analysis also reveals some security concerns with AI-generated code.
The study tested GPT-3.5 on 728 coding problems from the LeetCode testing platform — and in five programming languages: C, C++, Java, JavaScript, and Python. The results? Overall, ChatGPT was fairly good at solving problems in the different coding languages — but especially when attempting to solve coding problems that existed on LeetCode before 2021. For instance, it was able to produce functional code for easy, medium, and hard problems with success rates of about 89, 71, and 40 percent, respectively. "However, when it comes to the algorithm problems after 2021, ChatGPT's ability to generate functionally correct code is affected. It sometimes fails to understand the meaning of questions, even for easy level problems," said Yutian Tang, a lecturer at the University of Glasgow. For example, ChatGPT's ability to produce functional code for "easy" coding problems dropped from 89 percent to 52 percent after 2021. And its ability to generate functional code for "hard" problems dropped from 40 percent to 0.66 percent after this time as well...
The researchers also explored ChatGPT's ability to fix its own coding errors after receiving feedback from LeetCode. They randomly selected 50 coding scenarios where ChatGPT initially generated incorrect code, either because it didn't understand the content or the problem at hand. While ChatGPT was good at fixing compiling errors, it generally was not good at correcting its own mistakes... The researchers also found that ChatGPT-generated code had a fair number of vulnerabilities, such as a missing null test, but many of these were easily fixable.
"Interestingly, ChatGPT is able to generate code with smaller runtime and memory overheads than at least 50 percent of human solutions to the same LeetCode problems..."
About the same as Stack Overflow (Score:3)
The quality of ChatGPT's coding suggestions is about the same as Stack Overflow's (which it uses for a lot of its source material): spotty. You can often find good solutions, but you can also find a lot of crappy ones. The difference is, ChatGPT (usually) saves time.
Time saver #1: It will scan through a bunch of Stack Overflow (and other coding site) suggestions, picking ones that seem relevant.
Time saver #2: It will take the suggestions it finds and customize them to your liking, such as using the names you want instead of the names in the SO code.
But it also can waste time. It sometimes picks an answer that is inefficient, or poorly written, or based on obsolete APIs, or just plain doesn't work.
All in all, still an improvement over click, read, click, read, click, read, ad nauseam.
Re: About the same as Stack Overflow (Score:3)
Stack Overflow is meant to serve as a reference, basically to explain the mechanics of the language, to point to the right library, or just to provide a code snippet for a common thing that most people don't remember well enough to recall off-hand when they need it. A lot of the time I even use it to reference answers I've posted myself.
If you lean on it too heavily to solve bigger problems, you're gonna have a bad time. I think the same can be said for ChatGPT. That ChatGPT fails on newer but still simple problems...
Re: About the same as Stack Overflow (Score:4, Insightful)
No, Stack Overflow isn't a reference. It's a site where people help each other solve technical problems, often by supplying code snippets.
https://stackoverflow.co/#:~:t... [stackoverflow.co].
Programming language vendors provide *references*. The purpose of a reference is to document. SO doesn't do that. It's a forum for discussion.
Yes, ChatGPT is essentially a fancy search engine. When it provides programming answers, it often searches...Stack Overflow.
Re: (Score:2)
If it solves the problem of not being able to find an actual answer because a Google search is swamped with 1,000 references to posts by smug bastards saying "Google it," it might be worth something.
Re: (Score:2)
Re: About the same as Stack Overflow (Score:4, Informative)
Re: (Score:3)
Indeed. In all engineering, context is everything. Standard solutions, you could find in a book 100 years ago. But understanding whether a standard solution cuts it or whether and how it needs to be adapted is everything. And LLMs cannot do that even in the most simple cases.
Re: (Score:2)
A lot of experience and data shows it does fail on simpler problems, like this very study. In this case even education fodder, where it's absolutely the case that the problem and context are fully described to let a human solve them without further clarification. I can see reasonable defenses of the utility of LLMs, but this is utterly absurd. It is nowhere near better than the average person at "any random task"; it has things it is good at, but it is bad at a lot of things, and even when "good" requires...
Re: About the same as Stack Overflow (Score:3)
Re: About the same as Stack Overflow (Score:5, Insightful)
By the time your GPT has figured out a response, I'm already on the correct Stack Overflow page, complete with comments and alternate solutions.
Are people really so bad at using search engines that ChatGPT helps them search content? That boggles my mind. The queries for a search engine tend to be more terse, but the results more pointed and useful, in my experience, than GPT vomit.
Re: (Score:3)
This is literally the one case where I've found ChatGPT useful. If I've got fairly simple questions, it's usually faster to ask ChatGPT than to sort through Stack Overflow to find a quality answer. Going to Stack Overflow usually means sorting through some bad answers and people bickering over what's the best way to do it. ChatGPT tends to be pretty good at giving me a decent answer quickly in those cases.
If I have to ask anything non-trivial, ChatGPT is a waste of time.
Re: (Score:2)
Depends on the nature of the discussion.
Sometimes with a Python question, the discussion consists of people arguing back and forth about which is the more "Pythonic" way, and who the hell cares. It's almost a religious debate there, and a waste of my attention span, though a quick skim pulls me through.
However, sometimes the discussion is insightful. Like pointing out one library's history of unfixed security issues, or that the answer from 2017 that is widely referenced and likely the result Chat...
Re: (Score:3)
Apparently, you search Stack Overflow for simple problems. Usually, when I search Stack Overflow, it's because the answer is actually *hard* to find, or not obvious. After I've clicked the first 10-12 SO links without success, I'm starting to get frustrated. ChatGPT is a lot faster at this process than I am!
Re: About the same as Stack Overflow (Score:2)
Not at all. But my searches are pointed. If it's not in the first two hits, it's probably not on the Internet.
Re: (Score:2)
open-source code searching, with examples (Score:3)
I've been finding AI to be incredibly useful for developing open-source Drupal code, especially form code using Drupal's Form API. I've been using both ChatGPT-4o and Anthropic's Claude 3.5 and have recently come to prefer the latter. Drupal's Form API is very mature and powerful, and I've been developing a form with multiple form fields but only one should be active at a time given a selection. The relative complexity of the (known, open-source) solution warrants either an old-school Stack Overflow search...
Re: (Score:2)
By the time your GPT has figured out a response, I'm already on the correct Stack Overflow page, complete with comments and alternate solutions.
Are people really so bad at using search engines that ChatGPT helps them search content? That boggles my mind. The queries for a search engine tend to be more terse, but the results more pointed and useful, in my experience, than GPT vomit.
Good luck asking Stack Overflow to rewrite something into a different language or framework (for just one example).
All you are telling people who do know how to use it effectively is that you don't. It's like hearing an assembly programmer ranting against compilers.
Re: About the same as Stack Overflow (Score:2)
Re: (Score:2)
Re: (Score:2)
Re:About the same as Stack Overflow (Score:5, Insightful)
Sure. But who will continue to write Stack Overflow questions and answers when AI now "saves time"? And what will AI get trained on when these postings are missing?
Re: (Score:2)
First, Q&A sites aren't going away any time soon. It will be quite some time before *everybody* gives up on them.
Second, there are plenty of other sites where people still share code, you know, like GitHub.
And third, your question is kind of like asking who is going to learn the fine art of shifting gears, when automatic transmissions are in every car.
Re: (Score:2)
If they get massively fewer hits, they will go away. It is actually a pretty cold financial question.
Re: (Score:2)
And that's precisely why Stack Overflow is making deals with the AI companies. https://openai.com/index/api-p... [openai.com]
Re: (Score:2)
Sure. Stack Overflow is. But what about the users that create the content?
Re: (Score:2)
Time saver #1: It will scan through a bunch of Stack Overflow (and other coding site) suggestions, picking ones that seem relevant.
I suspect it isn't even that good. Based on what I know about LLMs, it will identify the most probable response to your query, so it may respond with the most common response to your problem, not the most relevant one. That is, if the generation isn't even more fine-grained, such that each line or word in the code is just the most probable next one, so the previous word or line might cause the next one to be less correct with respect to the overall answer.
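A toy illustration of that point (nothing like a real LLM's internals, which use learned token probabilities rather than lookup tables, but it shows why "most common" is not "most relevant"):
---------------
from collections import Counter

# Pretend training data: prompt/continuation pairs scraped from the web.
corpus = [
    ("read a file", "f = open(path)"),
    ("read a file", "with open(path) as f: data = f.read()"),
    ("read a file", "with open(path) as f: data = f.read()"),
]

def most_probable_completion(prompt):
    continuations = Counter(c for p, c in corpus if p == prompt)
    return continuations.most_common(1)[0][0]  # most frequent, not most apt

print(most_probable_completion("read a file"))
# -> "with open(path) as f: data = f.read()"
---------------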
It's great if it's a problem it's seen before ... (Score:4, Insightful)
Re: (Score:3)
I do think that with further refinement there are still great use cases for the technology, even if it's not the magic bullet that some hoped it might be. I think it would be great...
Re: (Score:3)
Even worse: Unless and until it gets a lot of examples in that new language, it will not ever be able to do anything in it. And who will write these examples?
Re: (Score:2, Redundant)
Exactly. That AI passes the test at all means the test is contrived for AI to pass it. It's probably better than SuperKendall but that's it.
Re:It's great if it's a problem it's seen before . (Score:5, Insightful)
It is worse: Give it a known simple problem with a different order of steps than usually used, but clearly specified. It cannot even do that.
As a coder, this thing is worthless. Sure, many "coders" are worthless as well (see https://blog.codinghorror.com/... [codinghorror.com] for examples), but making worthless coders cheaper is not going to improve anything.
Re: (Score:2)
As a coder, this thing is worthless. Sure, many "coders" are worthless as well (see https://blog.codinghorror.com/ [codinghorror.com]... for examples)
The GP complained about lame worthless interview questions.
When I first moved into the tech industry, I was asked a programming question in the interview by an interviewer I didn't know. I thought it was a bit weird at the time, as most of the interviewers knew me; you can see what I've done, surely it's obvious I can code. But I wasn't going to be an arrogant dickhead, so I pl...
Re: (Score:2)
Recursion for teh win!
Recursion is beautiful, but hard to do right in a language like C; it's nearly impossible to get tail-call optimization. A for loop, or nested ones, on the other hand, is simple, clean, and efficient in most languages.
You only write functions recursively to show off, is what I'm trying to say. And in the right context that's exactly right.
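A minimal Python sketch of that trade-off (Python, like C without compiler help, does no tail-call optimization, so the elegant version dies on deep input while the boring loop does not):
---------------
def fact_recursive(n):
    return 1 if n <= 1 else n * fact_recursive(n - 1)

def fact_loop(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

print(len(str(fact_loop(5000))))   # fine: a ~16,000-digit number
# fact_recursive(5000)             # RecursionError: maximum recursion depth exceeded
---------------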
Re: (Score:2)
A shocking number of people get through by gaming non-technical managers.
We had one guy who managed to pull off 5 years with our group, and he had the management *convinced* that his failure to ever do anything useful was the fault of senior developers refusing to let him do anything or refusing to train him. Meanwhile multiple senior engineers wasted hours every week trying to be helpful, to teach him and assign him even basic tasks. However his excuse remained the same, and the seniors got chastised because...
Re: (Score:3)
I've noticed this kind of thing too. One theory I have is that once people have started their career, they become less willing to prepare for a technical interview going forward.
Nothing wrong with that. I'm not going to go into a job where I need to bone up on leetcode before attending the interview. Nor will I do take-home interview questions. And I would absolutely not expect anyone to have to do prep for an interview.
That's why the initial screen is language of your choice, no APIs, no algorithm tricks...
Re: (Score:2)
To be clear, there is indeed nothing wrong with that. I'm just saying I've noticed it, and for us it wastes the candidate's time as well as ours, since they don't advance to the next stage. There's no shortage of candidates though, so we don't have an incentive to drop the higher barrier to entry (and I usually tell HR to warn the candidate that they will be grilled on the relevant subjects)
Re: (Score:3)
That's the point of generative AI. It doesn't replace an expert. It makes one expert do the job of ten, because 9/10 of an expert's job is mundane, boring stuff that doesn't require an expert, but comes with the job.
So you outsource that to AI and check that whatever it made is functional, and do the 1/10 that actually needs your expertise, which generative AI can't handle properly. This is notably how it works in many fields in production, right now. It enables a single person to do the job of many people...
Re: (Score:2)
Re: It's great if it's a problem it's seen before (Score:2)
3.5? (Score:5, Insightful)
"The study tested GPT-3.5"
That is a pretty uninteresting test. No one who is serious about this would use 3.5. Stupid study.
Re: 3.5? (Score:2)
"I really can't imagine what kind of stuff are people "programming" if they find it useful."
I suspect it's people who see boilerplate as a template rather than a target for abstraction and encapsulation. I've never understood this mindset, but I've often encountered it.
Re: (Score:2)
Yep, woe to those poor souls, who get to maintain that kind of "software".
Re: (Score:2)
Re: (Score:2)
I was specifically talking about the complaint, which appeared strictly ignorant of the fact that studies take a while to be properly formulated after results are collected.
Which demonstrated staggering ignorance of the reality of how studies are formulated and the effort put into them (outside the social sciences).
Re: (Score:2)
Stupid remark. 3.5 is like two years old. This test can't have taken more than a few weeks. Duh!
Re: (Score:2)
> By that logic any thorough report that took more than a few days to complete can be dismissed as "uninteresting" because it was testing the "old" version.
It has nothing to do with Tesla or anything at that point. If you take too long in a fast moving environment, or just plain don't do updates to latest code bases, many shortcomings you find will be obsolete and useless.
It is uninteresting at that point. All it does is point out something that has already been pointed out and fixed. Do you also think b...
Re: (Score:2)
It's been clear since the first releases of GPT that it isn't designed for a super deep understanding of, well, ANYTHING. The shit you are supposed to read through and agree to before using it even says it isn't designed to be perfect at anything, and answers it puts out may be just as incorrect as asking the village idiot the same thing.
And the latest GPT has shit that you agree to that says it is designed to be perfect?
If you take too long in a fast moving environment, or just plain don't do updates to latest code bases, many shortcomings you find will be obsolete and useless.
So any software that releases every week should never be studied for more than a day or two, so that results can be released before the next release, which would make the results uninteresting? I think you found a perfect way to dismiss the results of any testing: release a new version of your software every day, and dismiss any criticism as "uninteresting, we probably fixed it already." Oh wait, that is pretty much what Elon does...
testing the wrong thing (Score:3)
How often do I find myself needing some algo from a site like rosettacode, stackoverflow, or leetcode? maybe a couple times a month. (I've never even visited leetcode.) The rest of the time I am implementing functionality unique to my project's problem domain and codebase.
I don't see how ChatGPT would help with this.
Testing ChatGPT or any LLM with these types of problems seems like cheating.
I would be interested to see tests with regards to ChatGPT's ability to take a basic description of a new unique and non-trivial feature and implement it for an existing open source project like openssl, openssh, firefox, linux kernel, bitcoin-core, etc.
My guess is that it will perform extremely poorly compared to even a mid-level developer. am I wrong?
Re:testing the wrong thing (Score:5, Insightful)
Yeah, I too don't get it. Most programmers are good enough at programming that they can just do the simple stuff with a quick peek at some old code they've written or something. So who needs ChatGPT?
Whenever I use it it takes more time to fix its little mistakes than it would to just write all the code myself. Waste of time.
Re: (Score:2)
I would be interested to see tests with regards to ChatGPT's ability to take a basic description of a new unique and non-trivial feature and implement it for an existing open source project like openssl, openssh, firefox, linux kernel, bitcoin-core, etc.
Please not openssh, it's really the one where no screw ups are ever to be allowed.
Re: (Score:2)
Amen!
Re:testing the wrong thing (Score:4, Interesting)
You don't even need to use as complex a project as Firefox. If you ask AI to write software and then ask it to make incremental changes, it will at some point fail, even if the program is really, really simple and the steps really small. Actually, let's try that. I'll try with Gemini.
1. Write a Python application that takes 2 numbers from the command line, adds them together, prints out the result and only the result, and exits the program. The application should not do anything else than what is described here, and it should not handle any error situations. (OK)
2. Change the code so that it will add 3 numbers together instead of 2. Make no other changes to the code. (OK)
3. Change the code so that it works with either 2 or 3 arguments. Make no other changes. (OK)
4. Change the code to use multiplication instead of addition if there are 3 arguments. Make no other changes. (Failed: it multiplied with 2 arguments and added with 3)
5. Change the code so that it always adds the first 2 arguments together, and if there is a 3rd argument, the 3rd is multiplied with the sum of the first 2. Make no other changes. (OK; I added this step to get the previous error fixed, in case my earlier instructions were bad)
6. Add a 4th argument. If it is given, multiply the first 3 arguments with it before doing any other calculations. Make no other changes. (Failed miserably)
Here is the end result:
---------------
import sys

# Calculate based on argument count (assuming valid numbers provided)
if len(sys.argv) == 4:
    product = float(sys.argv[1]) * float(sys.argv[2]) * float(sys.argv[3])
    result = product if len(sys.argv) == 3 else product * float(sys.argv[4])
else:
    sum_of_first_two = float(sys.argv[1]) + float(sys.argv[2])
    result = sum_of_first_two if len(sys.argv) == 3 else sum_of_first_two * float(sys.argv[3])

# Print the result
print(result)

# Exit the program
sys.exit()
---------------
And here is how it fails
---------------
$ python3 adder.py 3 4 2
Traceback (most recent call last):
  File "adder.py", line 6, in <module>
    result = product if len(sys.argv) == 3 else product * float(sys.argv[4])
IndexError: list index out of range
---------------
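For contrast, here is a hand-written version of what step 6 asked for, under my reading of the same instructions (a sketch, not the model's output):
---------------
import sys

# Parse all command-line arguments as numbers.
args = [float(a) for a in sys.argv[1:]]

# Step 6: if a 4th argument is given, multiply the first three by it first.
if len(args) == 4:
    args = [x * args[3] for x in args[:3]]

# Step 5: always add the first two; multiply by the 3rd if present.
result = args[0] + args[1]
if len(args) == 3:
    result *= args[2]

print(result)
---------------
$ python3 adder.py 3 4 2
14.0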
It helps with homework but not jobs (Score:4)
The date seems significant because it implies that ChatGPT does a pretty good job finding an existing answer but a pretty poor job of creating a novel one.
I am at once feeling more secure in my job and more worried about the influx of ChatGPT-kiddies.
Of course, my job is only secure if my managers understand this. Pardon me, I have to go get ChatGPT to write an email for me.
Re: (Score:3)
As an Engineering Manager, I feel very secure that my devs will continue to have their jobs for as long as we can keep them. I see room for some acceleration via AI, but this is only a realistic threat to people doing work that can be verified with nothing more than a visual inspection (i.e. trivial tasks that we wouldn’t even accept as work in the first place).
Meanwhile, the company I’m at specializes in project rescue work, and I anticipate a lot more of it coming our way in the years ahead.
solving problems? (Score:4, Insightful)
And what exactly does "solving problems" mean? Under what criteria is generative AI considered "good" at programming? I'd suspect under real criteria, good design and good implementation, the success rate would surely be zero. Under contrived tests where the least possible unit test is passed, perhaps it is higher.
Re:solving problems? (Score:4, Insightful)
LLMs cannot "solve" problems at all. All they can do is calculate probabilities that a solution they have seen fits the problem, using some correlations. The results may fit, may partially fit or be complete crap and, bonus!, the LLM has no clue which if the three it is.
Just to illustrate this: I just have corrected a Python "open Internet" exam. LLMs are not even capable of understanding that indention is critical in Python. Or that a simple specification with three simple steps actually means these steps need to be done in the order specified.
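To illustrate for non-Python readers why that matters, indentation is semantic in Python: moving one line changes what the program does, not just how it looks. A minimal example:
---------------
# Version A: the print runs once, after the loop has finished.
for i in range(3):
    total = i
print(total)        # -> 2

# Version B: indent the print one level and it runs on every iteration.
for i in range(3):
    total = i
    print(total)    # -> 0, 1, 2
---------------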
Re:solving problems? (Score:5, Insightful)
I have just corrected a Python "open Internet" exam. LLMs are not even capable of understanding that indentation is critical in Python.
I'm not capable of understanding why anyone would design a language like that either.
Re: (Score:2)
I'm inclined to agree. It's part of the macho cult of C.
I'm currently doing Python day to day. I don't love it, but it's fine. You get used to the lack of close braces, though I do find it a persistent but minor irritation when refactoring. I like that in vim I can easily select code by block, or smoosh code around quickly and then hit = to reindent correctly.
But this is a very minor gripe. I'm not refactoring all day every day. Occasionally I need a few more keystrokes in my favourite editor, THEREFORE IT S...
Re: (Score:3)
I must oppose this: I too think whitespace should not decide how code runs.
And you are a bit harsh on the previous poster; they never claimed to have difficulty understanding or coding it, just that they disliked the design choice.
My biggest gripe with Python is how it makes some coders write totally unreadable one- or two-liners to solve a problem. It's like some of them think they are competing in an obfuscated-code contest.
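A made-up example of the kind of one-liner in question, next to its readable equivalent:
---------------
data = {"b": [3, 4], "a": [1, 2], "c": []}

# The contest entry: correct, but you have to unpack it in your head.
odds = [x for k in sorted(data) for x in data[k] if x % 2]

# The same logic, spelled out.
odds_readable = []
for key in sorted(data):
    for x in data[key]:
        if x % 2:                  # keep odd values only
            odds_readable.append(x)

assert odds == odds_readable == [1, 3]
---------------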
Re: (Score:2)
My biggest gripe with Python is how it makes some coders write totally unreadable one- or two-liners to solve a problem
Now this I've never seen. I've certainly seen this with perl and c, but with python if anything I've seen the language actually try to prevent things happening in a one liner. To the point where I've seen regret expressed about the existence of the lambda capability, which is generally pretty readable.
Re: (Score:2)
Indeed. Perl and C? Unreadable code is really easy and many "coders" seem to believe it is actually a goal. Python? Possible, but you have to work for it.
Re: (Score:2)
Requiring minimal actual skills from people claiming to be competent is not "bullshit toxic macho type flexing". But calling that "bullshit toxic macho type flexing" is bullshit toxic macho type flexing. You are calling out yourself.
And since when did proper indentation make things "more obscure and more difficult"? You must be one of those: https://blog.codinghorror.com/... [codinghorror.com]
Re: (Score:2)
I saw a quote that I'm having a hard time finding that sums it up: LLMs don't provide information, they provide information-shaped output.
What it spits out looks credible, and in some cases that ends up being credible, but it's clear after some experience that it just spews out credible looking stuff that shouldn't be credible. It's a bullshitting machine. So it'd make a good executive.
Re: (Score:2)
That is actually a very good description.
Same useless AI suggestion shit as on Google (Score:3)
Wish it were 1999, when search results were just the search results, without AI or SEO trying to meddle with them.
Re: (Score:2)
SEO totally sucks.
Of course the summary leaves the important data (Score:4, Interesting)
...out: the human success rates for comparison.
I can't access the research paper, sadly, to see what acceptance rates are for each group. But searching on Reddit, it sounds like LeetCode acceptance rates tend to be pretty low. For example, this "Easy" question [leetcode.com] has a 17.4% acceptance rate (though that's apparently a particularly low rate for an "easy" question).
Anyone have access to the paper to see what the human mean scores were on the same problems?
Re: (Score:3)
It's like when the MOOCs came online, teaching university courses without requiring exams or homework. Most people who sign up for these courses watch one or two vids and never finish.
Re: (Score:2)
I don't see how this is a valid argument. If anything, since LeetCode is "a test" and incentivizes (arguably even gamifies) success by letting people compete for the highest scores, I'd expect people to try *far* harder than on some random piece of code for some random project that they were just rushing to complete for some deadline.
Re: (Score:2)
Re: (Score:2)
If you search on Reddit, everyone is comparing their scores with everyone else. And those who do poorly tend to be frustrated with themselves and/or LeetCode. Everyone is clearly quite motivated to score well - and I'd argue, far more than they would be from some random drudgework at work that they've been working on for years.
Re: (Score:2)
Problem is, there are only a handful of LLMs, so by testing "an interaction with ..." you've tested the best and worst of a large chunk of the AI. They also either work or don't work on the very first try; there's no point in telling it "failed, try again," because it'll just flounder about randomly. With sampling the acceptance rate of the test, you include people who aren't very good or don't even care (e.g., some required training by their company).
Also, I've never used the platform myself, so I don't know how "at...
Re: (Score:2)
How good it is at "trying again" often tends to come more down to the finetune than the underlying model, in my experience. I find LLaMA 3 much better at trying again than ChatGPT, for example.
That said, none of them do independent A* right now.
good enough (Score:3)
If you know how to use it (I reckon most people don't), it's "good enough". I use it for a lot of things when I am lazy. It's good enough for most things... definitely beats "junior" programmers, which is a bit scary, because it's hard to become a senior programmer without getting a junior programmer job first. If AI is doing all the junior-skilled work, where's the pathway to becoming senior? Years of learning without a job? We've already made it so most jobs need a bachelor's degree. Now we're going to ask people to show up with a Masters? Note: I said it's hard, not impossible. There will always be people who can learn and portfolio their way to senior programmer without an entry-level role.
Re: good enough (Score:2)
I'd argue GPT cheapens the value of a degree. It no longer even demonstrates a basic level of understanding if GPT does all your homework.
Better off hiring high school graduates who have been coding half their lives.
Re: (Score:2)
100% agree, but those people are rare. Only caveat is that they must have done AP Calculus BC and maybe Statistics (or be willing to get up to speed on it within a year). Yeah I know that you don't need that for most devops/coding jobs these days .. but I find it's a good filter and also shapes the mind into an engineering mindset.
Re: good enough (Score:2)
Well, IEEE is an organization; IEEE Spectrum is a journal (well, a magazine really, but they often republish journal articles).
The truth (Score:4, Insightful)
ChatGPT understands nothing, therefore it cannot do anything truly novel. That it can regurgitate useful code at all after being trained on existing example code is a marvel.
And that's what I use it for: fast, search-free regurgitation of code I know has been done before. Then it just needs a quick review to ensure it makes sense, and you're off a lot faster than if you had typed and debugged your own code.
Re: The truth (Score:3)
It is actually pretty good at that. I needed to do some basic NLP, but I never did much NLP myself. Things like finding words of so many syllables that rhyme with some other word. It would have taken me a day to find the right libraries, understand how they work, write the code, test, and debug.
With ChatGPT, it right away suggested a library and gave me starter code. It was somewhat wrong, so I had to adjust, but in 90 minutes I was done.
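The comment doesn't say which library was suggested, but this kind of starter code can be sketched with the CMU Pronouncing Dictionary via Python's pronouncing package (my assumption, not necessarily what ChatGPT picked; pip install pronouncing):
---------------
import pronouncing  # thin wrapper around the CMU Pronouncing Dictionary

def rhymes_with_syllables(word, syllables):
    """Words that rhyme with `word` and have the given syllable count."""
    matches = []
    for candidate in pronouncing.rhymes(word):
        phones = pronouncing.phones_for_word(candidate)
        if phones and pronouncing.syllable_count(phones[0]) == syllables:
            matches.append(candidate)
    return matches

print(rhymes_with_syllables("coding", 2)[:5])
---------------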
Re: (Score:2)
ChatGPT understands nothing, therefore it cannot do anything truly novel.
That's what too many people don't seem to understand; or rather, they believe it has actual cognitive ability, instead of just the illusion of cognition, when in reality it has no such capability and never will.
terrible benchmark (Score:4, Insightful)
Leetcode is a terrible benchmark.
It's used for whiteboarding interviews, which are total bullshit to begin with... however, that style of code is *NEVER* used on the job, because these are already-solved problems, solved better than any one person can come up with, peer reviewed a thousand times over, and all rolled up into nice, neat little libraries.
Can ChatGPT write brand new code to solve novel problems? No, because it's an over-glorified copy-pasta bot.
Re: (Score:2)
What are you talking about? I constantly need to (checks leetcode question) determine if a string could have a perfectly even letter distribution if one and only one character is removed from it.
Yeah, test questions are a stupidly useless metric for comparing humans to LLMs. They are arguably stupid for evaluating human performance, but become so much worse for LLMs, which have a different set of strengths and weaknesses that generally favor being better at these sorts of "test" questions than they are at rea...
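For what it's worth, the quoted puzzle really is only a few lines; a brute-force sketch in Python (my own, not an official solution):
---------------
from collections import Counter

def balanced_after_one_removal(s):
    """True if removing exactly one character leaves every remaining
    letter occurring the same number of times."""
    counts = Counter(s)
    for ch in counts:
        counts[ch] -= 1                  # tentatively drop one occurrence
        remaining = [n for n in counts.values() if n > 0]
        if len(set(remaining)) <= 1:     # all equal (or nothing left)
            return True
        counts[ch] += 1                  # put it back, try the next letter
    return False

print(balanced_after_one_removal("aabbccc"))   # True: drop one 'c'
print(balanced_after_one_removal("aabbbcccc")) # False
---------------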
what fraction is verbatim copy? (Score:4, Insightful)
Since it says that it had higher accuracy if the problem was published before 2021, I am guessing that it may be doing verbatim copy of the existing solutions. The more solutions there are, the more likely the problem fits one of the patterns it has seen, and the more likely it picked up that solution.
A better benchmark would have been to compare with what you get from search engines and see if they materially differ. I have tried dozens of coding problems with ChatGPT, and all I usually get is boilerplate code. As an example, I asked ChatGPT to write code for the moon phase. Its code was correct, but it used a horrible formula which gave pathetic answers. Fixing the one-line formula made it work. ChatGPT had no idea which of the dozens of solutions on the internet was the correct one.
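For reference, the usual correct approach is simple modular arithmetic against a known new moon; a sketch (the reference epoch and synodic constant are standard astronomical values, not the commenter's fixed code):
---------------
from datetime import datetime, timezone

SYNODIC_MONTH = 29.530588853   # mean length of a lunation, in days
NEW_MOON_EPOCH = datetime(2000, 1, 6, 18, 14, tzinfo=timezone.utc)

def moon_age_days(when):
    """Days since the last new moon: 0 is new, about 14.77 is full."""
    elapsed = (when - NEW_MOON_EPOCH).total_seconds() / 86400.0
    return elapsed % SYNODIC_MONTH

print(moon_age_days(datetime.now(timezone.utc)))
---------------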
Re: (Score:2)
Since it says that it had higher accuracy if the problem was published before 2021, I am guessing that it may be doing verbatim copy of the existing solutions.
Instead of guessing, why not try it?
Then, after you try it, if you want to learn something useful instead of score debating points, ask it for some revisions. "Can you rewrite this in Python for me?" or "can you implement the following additional requirements? 1. blah blah 2. yadda yadda ..."
Re: (Score:2)
The problem is precisely that it isn't a verbatim copy.
The crap takes some examples that work, slices them up, and then, based on whatever coefficients it derived for the frequency of some strings in its input, it generates crap that looks like code but won't work.
Tried it with trivial tasks in several domains:
- simple webapp (show me an example of two components (A, B), three instances (A1, A2, B1), B reacts on A1/A2 change, for a specific simple JS framework): complete fail, example loads...
LLMs are always bad at programming (Score:2)
LLMs can't reliably write working code, let alone write good code. It is as fundamental as the Halting Problem, a thing LLMs aren't aware of.
Pretty good for boilerplate...if you are competent (Score:3)
It's the classic garbage-in, garbage-out problem: if you're a good coder and good prompter, LLMs deliver useful scaffolds. If you suck, it will suck.
This week in AI (Score:3)
Waymo Issues Software and Mapping Recall After Robotaxi Crashes Into a Telephone Pole
https://tech.slashdot.org/stor... [slashdot.org]
From Schneier on Security
Using AI for Political Polling
https://www.schneier.com/blog/... [schneier.com]
See, polling has gotten hard, because people don't answer their phones any more, and if they do answer they won't talk to you, and if they do talk to you, they may tell you what they think you want to hear, rather than what they really think. So what we can do, see, is create these AI chat-bots that act like people, and then poll the chat-bots instead of calling real people.
From the article
I am simply gob-smacked that adults—actual grown-up people—continue to take this stuff seriously.
Re: (Score:2)
From the article
I am simply gob-smacked that adults—actual grown-up people—continue to take this stuff seriously.
Same here. In actual reality, it can do none of those things. As soon as they are a tiny bit unexpected, it cannot even do basic things right. The only thing that happens is that lies are getting more extreme. Usually that is a sign of the mindless hype nearing its end. We can only hope it is here too.
It is a fuckup (Score:2)
Well, many coders are fuckups as well (see, for example: https://blog.codinghorror.com/... [codinghorror.com]) and these it can, maybe, replace to a degree. But forget about having it write even a simple piece of original code or about it actually understanding even a simple specification.
Re: (Score:2)
Define "original code". There's only so many ways to write the same thing and your average programmer has seen a lot of code before they write anything meaningful.
On the other hand, the software doesn't understand anything, and it doesn't even do a good job of pretending to understand anything it hasn't seen before. That means it can only even mock understanding of specifications which are highly similar to specifications the system was trained on. On the other hand, the massive duplication of effort in pro
You Know What? (Score:3)
I wish I'd known when I was in high school that I'd never have a decent job.
I would have done things quite differently.
Super cool! (Score:2)
ChatGPT can easily solve problems that have already been solved in the past and discussed over the Internet. It's not so good with newer insufficiently discussed ones.
Some of my colleagues try to use ChatGPT (Score:2)
Essentially it's OK if you could also copy and paste the code from Stack Overflow. It severely breaks if you want to do anything beyond that. For example, we once asked it to write a configuration file snippet for "yate", a commonly used VoIP software with a strong focus on mobile applications. This configuration snippet was supposed to reject SIP "MESSAGE" requests.
The result was an ini-file style file which was something like "MESSAGE=reject".
In reality yate is configured via something called "regex-route".
Whistling in the dark (Score:2)
The highest rated comments on these stories tell me a few things:
1. The commenters either haven't really tried, or don't know how to effectively use, ChatGPT.
2. The commenters have really poor PM skills. When working with ChatGPT, you need to give it a good set of requirements and good feedback so it can do revisions.
It's a tool. When you learn how to use it, it's an incredible help. Truly incredible stuff, unless you've put the goalposts onto a bullet train, sending them away into the distance.
Coding is not just writing independent modules (Score:2)
What's missed in all of this hype is that coding is a team activity. Put ten software engineers together, each using AI to generate their parts, and you get a big, unmaintainable mess with no design consistency and tons of redundant code.
It sometimes fails to understand ... (Score:4, Insightful)
> It sometimes fails to understand the meaning of questions,
No, it never understands the meaning of the question. That's the whole problem with LLMs.
"extremely broad range of success". Uhhhh. Yea. (Score:2)
The results show that ChatGPT has an extremely broad range of success when it comes to producing functional code
By which I'm guessing that if I produced code like that then I'd be experiencing "a broad range of what to do with my time now that I've been fired"
No funny bone (Score:2)
High hopes for the topic, but... No one noticed any examples of funny code passed by ChatGPT?
Well that was useless (Score:2)