ChatGPT's Odds of Getting Code Questions Correct are Worse Than a Coin Flip

ChatGPT's Odds of Getting Code Questions Correct are Worse Than a Coin Flip (theregister.com) 119

Posted by EditorDavid on Saturday August 12, 2023 @06:48PM from the tails-you-lose dept.

An anonymous reader shared this report from the Register: ChatGPT, OpenAI's fabulating chatbot, produces wrong answers to software programming questions more than half the time, according to a [pre-print] study from Purdue University. That said, the bot was convincing enough to fool a third of participants.

The Purdue team analyzed ChatGPT's answers to 517 Stack Overflow questions to assess the correctness, consistency, comprehensiveness, and conciseness of ChatGPT's answers. The U.S. academics also conducted linguistic and sentiment analysis of the answers, and questioned a dozen volunteer participants on the results generated by the model. "Our analysis shows that 52 percent of ChatGPT answers are incorrect and 77 percent are verbose," the team's paper concluded. "Nonetheless, ChatGPT answers are still preferred 39.34 percent of the time due to their comprehensiveness and well-articulated language style." Among the set of preferred ChatGPT answers, 77 percent were wrong...
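As a rough back-of-the-envelope reading of the figures quoted above, the sketch below multiplies the two reported percentages to estimate how often a participant preferred a ChatGPT answer that turned out to be wrong. It assumes both percentages describe the same pool of judgments, which the summary does not state explicitly, and the variable names are illustrative only.

```python
# Figures as quoted in the summary above.
preferred_rate = 0.3934          # share of cases where the ChatGPT answer was preferred
wrong_among_preferred = 0.77     # share of those preferred answers that were wrong

# Assumption: both rates apply to the same set of judgments, so they can be multiplied.
preferred_and_wrong = preferred_rate * wrong_among_preferred

print(f"Preferred and wrong: {preferred_and_wrong:.1%}")
```

Under that assumption, roughly three in ten judgments favored an incorrect ChatGPT answer.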

"During our study, we observed that only when the error in the ChatGPT answer is obvious, users can identify the error," their paper stated. "However, when the error is not readily verifiable or requires external IDE or documentation, users often fail to identify the incorrectness or underestimate the degree of error in the answer." Even when the answer has a glaring error, the paper stated, two out of the 12 participants still marked the response preferred. The paper attributes this to ChatGPT's pleasant, authoritative style.

"From semi-structured interviews, it is apparent that polite language, articulated and text-book style answers, comprehensiveness, and affiliation in answers make completely wrong answers seem correct," the paper explained.

For those that don't understand statistics (Score:1)

by Luckyo ( 1726890 ) writes:

Slightly less than 50% chance to get working code with a simple input is amazing, because well over 99,9999999% of potential answers are wrong.
- Re:For those that don't understand statistics (Score:5, Insightful)
  
  by Anubis IV ( 1279820 ) writes: on Saturday August 12, 2023 @07:08PM (#63762898)
  
  Slightly less than 50% chance to get working code with a simple input is amazing, because well over 99,9999999% of potential answers are wrong.
  Close enough is fine in horseshoes and hand grenades, but not when providing authoritative answers for StackOverflow. Confidently wrong 50% of the time is unacceptable, and any human with that track record would know to shut up and listen instead of trying to act smart.
  And even for coding, iterating with it is an exercise in frustration. It apologizes and then regresses, then apologizes and introduces syntax from another language, then apologizes and rewrites the code completely. A better way of thinking of it is that when it’s wrong it’ll likely never get it right eventually, or if it does the effort far outstrips any benefits.
  
  - Re:For those that don't understand statistics (Score:5, Insightful)
    
    by ShanghaiBill ( 739463 ) writes: on Saturday August 12, 2023 @07:33PM (#63762934)
    
    Confidently wrong 50% of the time is unacceptable
    I once went to NYC's Chinatown with my cousin. There was a chicken trained to play tic-tac-toe. My cousin tried playing and lost to the chicken. When I laughed at him for LOSING TO A CHICKEN, he started making excuses: "The chicken had the first move", "The chicken plays every day", etc. Yes, that is all true, but it was a FRICK'N CHICKEN.
    The same is true here. Sure, ChatGPT is wrong half the time. But it is a FRICK'N PROGRAM. Instead of complaining that it isn't always correct, we should be amazed that it works at all.
    How many humans do you know, even skilled programmers, who can answer random StackOverflow questions correctly half the time? I know I can't.
    When I use ChatGPT, I don't expect an answer I can blindly cut-and-paste into my GIT commit. I use it to get ideas, and the outline of the code, that I can then fix and fine-tune.
    In other words, I use it the same way I use human answers on Stack Overflow.
    
    - Re: (Score:2)
      
      by fluffernutter ( 1411889 ) writes:
      
      That's not supposed to be the point of AI. If you use it the same way as a stack overflow answer why not just use stack overflow?
      - Re:For those that don't understand statistics (Score:5, Insightful)
        
        by Linux Torvalds ( 647197 ) writes: on Saturday August 12, 2023 @07:49PM (#63762966)
        
        Because ChatGPT doesn't come up with some bullshit reason why the question doesn't belong on ChatGPT, such as being answered 10 years ago in a completely different language or application.
        Any other questions?
        
        
        Re: (Score:1)
        
        by fluffernutter ( 1411889 ) writes:
        
        You are making stuff up. If it's a duplicate answer they link you to it, and I didn't even know there were other languages. I think you just think AI is cool so you create a problem to fit a solution.
        
        Re: (Score:2)
        
        by Stoutlimb ( 143245 ) writes:
        
        I imagine latency of requests is another matter. How many AI queries can one do in the time it takes to get a good reply from Stack Overflow?
      - Re: (Score:2)
        
        by Stoutlimb ( 143245 ) writes:
        
        Please explain the "point" of AI. I genuinely want to know what you think it is, because your comment doesn't make sense to me. I'm open to learning new things.
        
        Re: For those that don't understand statistics (Score:2)
        
        by LindleyF ( 9395567 ) writes:
        
        I find it does pretty good code autocomplete. I don't use it for anything bigger than a for loop or debug print, but it does save a few minutes.
        
        Re: (Score:2)
        
        by sajavete ( 5054387 ) writes:
        
        My colleague said it does fairly decent unit tests.
      - Re: (Score:2)
        
        by Ksevio ( 865461 ) writes:
        
        It answers a lot faster and doesn't complain about your question
    - Re:For those that don't understand statistics (Score:5, Insightful)
      
      by Junta ( 36770 ) writes: on Saturday August 12, 2023 @08:38PM (#63763016)
      
      The same is true here. Sure, ChatGPT is wrong half the time. But it is a FRICK'N PROGRAM. Instead of complaining that it isn't always correct, we should be amazed that it works at all.
      Amazed, sure, but the utility is the question. There's plenty of amazing things we can see that aren't really that useful in practice.
      What I've found is that if there wouldn't be a human answer on StackOverflow, then GPT won't be able to answer either. However, it's harder to tell when GPT has completely fabricated a function name that doesn't exist, and it is less likely to hit upon success than finding some material in a quick google search.
      
    - Re:For those that don't understand statistics (Score:5, Funny)
      
      by Narcocide ( 102829 ) writes: on Saturday August 12, 2023 @08:47PM (#63763024) Homepage
      
      The actual point you're trying to make seems unclear. Are you suggesting we should teach the chicken to program?
      
      - Re: For those that don't understand statistics (Score:2)
        
        by LindleyF ( 9395567 ) writes:
        
        This is clearly the best idea.
      - Robot Chicken is helping you with your homework. (Score:2)
        
        by Larsen E Whipsnade ( 4686581 ) writes:
        
        Don't worry about it. Unless you intend to put that code into production.
    - Re: (Score:2)
      
      by NotEmmanuelGoldstein ( 6423622 ) writes:
      
      ... lost to the chicken.
      My lessons in decision trees and pruning involved tic-tac-toe. Always start in a corner (most humans take the centre), then starting on the 4th/5th move (piece placed on the board), block the opponent's line (horizontal, vertical, diagonal). Just like alpha-beta, the point is giving the opponent the fewest choices. In a game that can't force an error (i.e. chess can via pins, forks and check), the goal is to draw/stalemate.
      - Re: (Score:2)
        
        by serviscope_minor ( 664417 ) writes:
        
        Going in the middle does give the opponent fewest choices. An edge square blocks off two possible winning board positions, a corner blocks off three and the centre blocks off 4.
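The counts in the comment above can be checked by enumerating the eight winning lines of a 3x3 board and counting how many pass through each cell. This is only a small sketch; the cell numbering (0 to 8, row by row) and the names `LINES` and `lines_through` are choices made for illustration.

```python
# The eight winning lines of tic-tac-toe, with cells numbered 0..8 row by row.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def lines_through(cell):
    """Return how many winning lines include the given cell."""
    return sum(cell in line for line in LINES)

counts = [lines_through(cell) for cell in range(9)]

# Print the counts laid out like the board itself.
for row in range(3):
    print(counts[3 * row: 3 * row + 3])
```

The centre lies on four lines, each corner on three, and each edge square on two, which matches the figures given in the comment.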
        
        Re: (Score:2)
        
        by HiThere ( 15173 ) writes:
        
        Well, with correct play a draw is always guaranteed. But people make a lot of mistakes. So the question is "Which starting move is most likely to cause your opponent to make a mistake?". They're probably less familiar with the corner starting move. I've even had some success with the center of a side, which is logically a terrible move, but unexpected.
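The claim that correct play always ends in a draw can be checked with a plain exhaustive minimax search, since the game tree is tiny. The sketch below is one way to do that, not a reference to any particular commenter's method; the board encoding (a nine-character string) and the function names are assumptions made for this example.

```python
from functools import lru_cache

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Minimax value from X's point of view: +1 X wins, 0 draw, -1 O wins."""
    won = winner(board)
    if won == "X":
        return 1
    if won == "O":
        return -1
    if " " not in board:
        return 0
    other = "O" if player == "X" else "X"
    results = [
        value(board[:i] + player + board[i + 1:], other)
        for i, cell in enumerate(board)
        if cell == " "
    ]
    # X picks the best outcome for X, O picks the worst outcome for X.
    return max(results) if player == "X" else min(results)

empty = " " * 9
game_value = value(empty, "X")

# Value of each possible opening move by X, assuming perfect play afterwards.
opening_values = [value(empty[:i] + "X" + empty[i + 1:], "O") for i in range(9)]

print(game_value, opening_values)
```

Every opening, including the centre of a side, has value 0 under perfect play, so the choice of first move only matters against an opponent who makes mistakes.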
        
        Re: (Score:2)
        
        by crunchygranola ( 1954152 ) writes:
        
        Well, with correct play a draw is always guaranteed. But people make a lot of mistakes.
        It is true that many people receive less tic-tac-toe instruction than do some circus chickens.
    - Re: (Score:2)
      
      by Ecuador ( 740021 ) writes:
      
      I once went to NYC's Chinatown with my cousin. There was a chicken trained to play tic-tac-toe. My cousin tried playing and lost to the chicken.
      That sounds absurd, is your cousin particularly slow or was there cheating involved? Tic tac toe is a game that is solved for most 12+ year olds. It should always be a draw from then on, regardless of the opponent.
      - Re: (Score:2)
        
        by HiThere ( 15173 ) writes:
        
        I had it solved when I was twelve, and then forgot. I solved it again during college, and now I've forgotten again. I wouldn't be surprised if I lost the first game of a series these days.
    - Re: (Score:2)
      
      by crunchygranola ( 1954152 ) writes:
      
      But it is a FRICK'N PROGRAM.
      
      The normal expectation is that a program can correctly perform its assigned task, not excel at deceptively failing.
    - Re: (Score:2)
      
      by martin-boundary ( 547041 ) writes:
      
      That's only really saying something about yourself, rather than all programmers in the world. I for one have so many ideas, I don't have enough time in the day to code them all up. I've had to become very good at coding by necessity, just to finish ideas before I get bored with them. Similarly I would never read StackOverflow, it would just slow me down.
      The trick is to train yourself to become so fluent at coding that you don't need to think about anything but design. You design quietly, then put on spoti
    - - Re: For those that don't understand statistics (Score:2)
        
        by bingoUV ( 1066850 ) writes:
        
        Says who? ChatGPT doesn't agree.
  - Re: (Score:3, Insightful)
    
    by Tony Isaac ( 1301187 ) writes:
    
    As a regular Stack Overflow user, 50% seems like *great* odds to me. Most of the time on Stack Overflow, you've got to try multiple suggested solutions before you find one that actually works.
    Also, I don't need the answers to be perfectly right, just steer me in the right direction. Once I have the gist of the answer, I can tweak it to fit my situation.
  - Re: (Score:2)
    
    by Junta ( 36770 ) writes:
    
    Yes, my experience with it has been:
    -It worked... But it would have been trivial for me to do, and even if I didn't already know the answer, a quick google search would have provided pretty much the same code, either from project documentation, some github library or gist, or stackoverflow. If I used the library, then the 'answer' is actually maintained by a third party.
    -It doesn't work because I asked a question it really can't answer, but to someone who didn't already know the answer, it provides a hallu
    - Re: (Score:2)
      
      by Greyman027 ( 2435660 ) writes:
      
      I use ChatGPT occasionally, almost exclusively for writing quick knock-up Python, Bash, or PowerShell scripts. I'm a professional programmer, but in none of those languages. From what I've seen, I can knock up a quick Python script and fix the occasional ChatGPT error (it really sucks at writing regexes for example) in about one-half to one-third the time it would take a moderately experienced Python programmer to write it all from scratch.
      Yes, you could Google what you need in order to put those scripts t
      - Re: For those that don't understand statistics (Score:2)
        
        by UncleScidhuv ( 7657782 ) writes:
        
        Agreed. And to me, here's the kicker. This is not a tool that was built to write code. The fact that it does is pretty cool. Given that this is basically a beta tool, imagine where it will be in five years!
  - Re: (Score:2)
    
    by WaffleMonster ( 969671 ) writes:
    
    Close enough is fine in horseshoes and hand grenades, but not when providing authoritative answers for StackOverflow.
    ??? Who is talking about authoritative answers for anything much less SO?
    Confidently wrong 50% of the time is unacceptable, and any human with that track record would know to shut up and listen instead of trying to act smart.
    Unacceptable to whom? I'm wrong 100% of the time I write a sizable chunk of new code and I still manage to get a paycheck.
    The way ChatGPT works it's a single pass without any attempt at introspection or testing. If people can't do that without fucking up it's foolish to expect last year's AI model to fare any better.
    It's as foolish to accept the word of SO at face value as it is the output of an AI model.
    And even for coding, iterating with it is an exercise in frustration. It apologizes and then regresses, then apologizes and introduces syntax from another language, then apologizes and rewrites the code completely. A better way of thinking of it is that when it's wrong it'll likely never get it right eventually, or if it does the effort far outstrips any benefits.
    Anyone expecting AI or SO to w
    - Re: (Score:2)
      
      by Anubis IV ( 1279820 ) writes:
      
      ??? Who is talking about authoritative answers for anything much less SO?
      The summary. Purdue. It’s literally what they were attempting to do.
      - Re: (Score:2)
        
        by WaffleMonster ( 969671 ) writes:
        
        The summary. Purdue. It's literally what they were attempting to do.
        Neither the summary nor the study itself is about providing authoritative answers to SO. It's simply a comparison between SO and ChatGPT answers to a bunch of SO questions.
        
        Re: (Score:2)
        
        by Anubis IV ( 1279820 ) writes:
        
        It's simply a comparison between SO and ChatGPT answers to a bunch of SO questions.
        You’re clearly attempting to make a distinction I’m not comprehending, given that we seemingly agree on this yet you seem to be insisting that we disagree.
    - Re: (Score:2)
      
      by iMadeGhostzilla ( 1851560 ) writes:
      
      "The way ChatGPT works it's a single pass without any attempt at introspection or testing."
      100% agree. Writing code -- unless you are writing a stupid coding puzzle -- is a process of DISCOVERY, where you learn about what you are trying to do as much as how to do it. With a single pass output with bizarre dreamlike comments and logic and possible hidden hallucinations you discover nothing.
      The exception is things like producing one off shell commands that you can easily inspect if they are hallucination-free
  - Re: (Score:2)
    
    by quantaman ( 517394 ) writes:
    
    And even for coding, iterating with it is an exercise in frustration. It apologizes and then regresses, then apologizes and introduces syntax from another language, then apologizes and rewrites the code completely. A better way of thinking of it is that when it’s wrong it’ll likely never get it right eventually, or if it does the effort far outstrips any benefits.
    It definitely struggles as a module grows in size or has to interact with other code.
    It's still useful, but it's more about getting the functional bits out and including them yourself.
    And another thing I find it's good at is a quick and dirty way of figuring out the "standard" way of doing something (ie, the way over represented in the training set).
    But yeah, the wrong answers are tricky because we use mastery of language as a measure of general competence. And when dealing with native speakers it's fairly
  - Re: (Score:2)
    
    by sweet 'n sour ( 595166 ) writes:
    
    I think a better measure isn't if the first response is 50% correct, it's if you're able to eventually get to a correct answer versus going down that 'exercise in frustration' path.
    Also, what is the % of people who were able to find the answer in stack overflow after that.
    My anecdotal experience is that if I do go down that frustrating path, I'm a lot less likely to find an answer online anywhere. (That or the info is newer than 2021)
- Re: (Score:2)
  
  by Darinbob ( 1142669 ) writes:
  
  Also, it's a higher correctness rate than the average low budget programmer.
- Re: (Score:2)
  
  by znrt ( 2424692 ) writes:
  
  because well over 99,9999999% of potential answers are wrong.
  that's incorrect, because there are also infinite correct answers. no need to show your numbers, you are arguing that some infinites are bigger than others which is a valid but meaningless proposition in this case, if anything because chatgpt doesn't even aim at correctness, it just produces sequences of words that sound human and appear to make sense. getting it right is just coincidence. that it still can outperform humans in some contexts is amazing indeed. i'd even say 87% astonishing.
  - Re: (Score:2)
    
    by Luckyo ( 1726890 ) writes:
    
    This is a classic case of "I never took statistics class/I took it and never learned about it". Except that in this case, it is coupled with a failure at general mathematics.
    There is an infinite amount of potential answers, of which a certain percentage are correct and certain percentage are incorrect. The amount of both answers is infinite.
    This doesn't change the ratio between correct answers and incorrect ones. Even in case of infinity, you can have things that are more rare than others.
    - Re: (Score:2)
      
      by znrt ( 2424692 ) writes:
      
      if you insist then you definitely should show your numbers. for starters, what's your sample?
      - Re: (Score:2)
        
        by Luckyo ( 1726890 ) writes:
        
        Did we just not agree for the sake of argument that both numbers are infinite, and did I not provide you with an example ratio in the post before that?
        
        Re: (Score:2)
        
        by znrt ( 2424692 ) writes:
        
        my point is that you pulled that ratio out of thin air. what is the sample?
- To give an equally irrelevant comparison (Score:2)
  
  by piojo ( 995934 ) writes:
  
  Hey, good news! A coin flip's odds of getting a programming problem right are also worse than a coin flip.
- Re: (Score:2)
  
  by gweihir ( 88907 ) writes:
  
  Sure. It is amazing. It also makes ChatGPT completely unusable as coding support, and very dangerous and hugely unprofessional, because, as the researchers also found, users far too often cannot spot false answers.
  - Re: (Score:2)
    
    by WaffleMonster ( 969671 ) writes:
    
    Sure. It is amazing. It also makes ChatGPT completely unusable as coding support, and very dangerous and hugely unprofessional, because, as the researchers also found, users far too often cannot spot false answers.
    The study is worthless. Not only did they use an obsolete GPT model massively less capable than the current v4 model, but the human evaluation also does not meaningfully reflect real-world usage.
    What they did was present random questions to 12 people and ask them to evaluate them over the course of 20 minutes. Only one person out of the 12 was an actual software engineer. This does not resemble how SO is actually used. Questions that people ask and the answers people seek are specific, relevant and meaningful to the
  - Re: (Score:2)
    
    by Luckyo ( 1726890 ) writes:
    
    LLMs are driven by material they learned from, algorithm used to derive machine learning and finally quality of input.
    This study failed to control 2/3. It is as conclusive as trying to solve x+y+z while knowing only x.
So let's call it what it really is (Score:5, Funny)

by 93 Escort Wagon ( 326346 ) writes: on Saturday August 12, 2023 @07:00PM (#63762892)

There's no "hallucinating" going on. ChatGPT excels at shoveling bullshit.
If ChatGPT were the guy down at the end of the bar, he'd be the most popular guy in the bar.

- Re: (Score:3)
  
  by christoban ( 3028573 ) writes:
  
  But usually very useful bullshit. I like the references to source and it generates boilerplate code nicely.
- Re:So let's call it what it really is (Score:4, Insightful)
  
  by StormReaver ( 59959 ) writes: on Saturday August 12, 2023 @09:33PM (#63763088)
  
  There's no "hallucinating" going on.
  Absolutely correct. LLMs don't hallucinate. Ever. They malfunction. But then they try to convince you that they didn't. When called out, they apologize for the malfunction and then malfunction again. And again. And again.
  "But they point me in the right direction!"
  They sometimes do, but less often than they point you in the wrong direction.
  By far, the greatest achievement with LLMs like ChatGPT is the natural language interface. It excels at speech to text, but that's about all I would trust it to do. Anything else is varying degrees of playing with fire.
  
  - Re: So let's call it what it really is (Score:2)
    
    by LordofWinterfell ( 90845 ) writes:
    
    It's insane that anyone thinks that it's anything else than that. It's a demo of what LLMs are capable of, but realistically for them to be a reliable source of information, they need to be trained on a specific dataset, not a widely capable one. It's a statistical guess at a desired response.
    I asked how do you mine quartz, and its response was something about hobs and the nether - Minecraft. It doesn't know the difference between reality and any other da
    - - Re: (Score:2)
        
        by StormReaver ( 59959 ) writes:
        
        No understanding, just guessing...right?
        It seems like you have never used a C compiler before. The LLM "understands" equally well.
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Indeed. That nicely sums it up. The artificial moron cannot even get simple things reliably right. And while the natural language interface is impressive, it makes the problem worse.
  - Re: (Score:2)
    
    by Ksevio ( 865461 ) writes:
    
    ITT lots of people complaining about tools they don't know how to use properly. It's like complaining when you typed your search into Google and pressed "I'm feeling lucky" you didn't get the website you wanted, so it's pointless for anyone to use search engines
- That guy at the end of the bar (Score:4, Funny)
  
  by Larsen E Whipsnade ( 4686581 ) writes: on Sunday August 13, 2023 @11:46AM (#63763834)
  
  Cliff Clavin raises a mug to you.
  
- Re: (Score:2)
  
  by pauljlucas ( 529435 ) writes:
  
  There's no "hallucinating" going on.
  The use of that term has always irked me. By definition, "hallucination" is seeing something that isn't there. At best, ChatGPT is delusional (believing something that isn't true).
Version is important (3.5 v 4) (Score:2)

by CrazyPhillyOldGuy ( 10499252 ) writes:

The paper doesn't say what version they used of the web interface. It could be 3.5. This seems poorly reported.
Definition of "correct" (Score:2)

by Tony Isaac ( 1301187 ) writes:

If you're looking for code snippets that need zero editing before you incorporate it into your code, then I'd say 50% is great. But I don't think that's fair. Humans don't get the code 100% right on the first pass either. It's an iterative process of improvement and fine-tuning. So if I go to ChatGPT for code help, I'm looking for something that can get me _started_ and then I can tweak and fine tune as I like. In my experience, it's able to provide this kind of head start most of the time.
- Re: (Score:2)
  
  by fluffernutter ( 1411889 ) writes:
  
  But how is that an improvement over just using an SO answer?
  - Re: (Score:2)
    
    by Tony Isaac ( 1301187 ) writes:
    
    The improvement with ChatGPT over StackOverflow, is that it saves me a bunch of reading. Instead of clicking on several SO links and scanning through them to find one that looks about right, ChatGPT does that for me, and produces the result that appears to be the closest to what I'm asking for. So it's basically a time saver.
    - Re:Definition of "correct" (Score:4, Informative)
      
      by fluffernutter ( 1411889 ) writes: on Saturday August 12, 2023 @11:31PM (#63763192)
      
      Then you need to learn how to search properly. 90% of the time I get the right answer on the first go. The other times it takes me seconds to find out it's not what I want and go to the next one.
      
      - Re: (Score:2)
        
        by Tony Isaac ( 1301187 ) writes:
        
        Yes, I too get good results on the "first go" when I'm searching for something relatively straightforward. When it's a problem that does not have a straightforward solution or a common consensus, it gets trickier.
      - Re: (Score:2)
        
        by Tony Isaac ( 1301187 ) writes:
        
        Also, ChatGPT can customize the answer with your own choice of names and variables. Stack Overflow can't do that.
        
        Re: (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        Ok so you can't take what stack overflow tells you and adapt it in your head?
        
        Re: (Score:2)
        
        by Tony Isaac ( 1301187 ) writes:
        
        Sure I can. And I can walk everywhere I go instead of driving my car, and I can raise chickens in my back yard so I can have meat to eat, and I can use hand tools to build things instead of power tools. But I choose not to because there are more convenient options. That's the value of ChatGPT--it's not that I can't do things without it, it's that it makes it easier and more efficient than doing things manually.
      - Re: (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        I guess some people can source it together better than others
  - Re: (Score:3, Interesting)
    
    by Greyman027 ( 2435660 ) writes:
    
    All this obsession with comparing SO to ChatGPT - the use cases they cover only vaguely overlap. An SO answer is going to give you a code fragment that solves a very specific problem. It's very unlikely to give you a complete program.
    For example, ChatGPT can produce an instant workable solution to this request, SO most certainly cannot:
    "Could you please write me a Python program to scan all files in a folder and all subfolders, that will find all instances of a HTML element (or a string containing a HTML
    - Re: (Score:3)
      
      by fluffernutter ( 1411889 ) writes:
      
      I could find the answers to that with a few searches and put it together faster than you could type all that into chatgpt.
    - Re: (Score:2)
      
      by znrt ( 2424692 ) writes:
      
      All this obsession with comparing SO to ChatGPT - the use cases they cover only vaguely overlap.
      they fully overlap. actually chatgpt means that stackoverflow as you knew it is out of business, only relevant in as much it is a trove of data to train more chatgpts.
      An SO answer is going to give you a code fragment that solves a very specific problem. It's very unlikely to give you a complete program
      that's exactly what chatgpt does. it also can give you a complete program if you ask for it, but most of the time it will be an answer to a specific question, i.e. prompt, which is exactly what a stackoverflow user expects. except the answer will be immediate, and can be refined right away in a few iterations to get the desired pointer or resu
  - Re: (Score:2)
    
    by Stoutlimb ( 143245 ) writes:
    
    For the same reason any automation is an improvement. How many ChatGPT queries can one do in the time that it takes some rando to read a question and post a useful reply?
    The real metric is the aggregation of many ChatGPT queries and iterations compared to only one Stack Exchange query.
    In the case of answers that already exist on stack exchange, finding the useful one is an exercise in reading, critical thinking, copy-pasting, and testing code. It's little different from repeated queries to an AI in that s
    - Re: (Score:2)
      
      by fluffernutter ( 1411889 ) writes:
      
      It's becoming clear that the people who benefit from AI are the ones who can't take three SO answers and merge them together for their own purpose very well.
It may be wrong, but... (Score:5, Insightful)

by edibobb ( 113989 ) writes: on Saturday August 12, 2023 @07:44PM (#63762956) Homepage

It may be wrong, but it gives me ideas and is excellent at looking up obscure function calls, support files, and weird syntax. Just don't copy and paste without verifying the code. ChatGPT has saved me a lot of hours, even though it does not provide much directly usable code.

Outdated version (Score:4, Informative)

by WaffleMonster ( 969671 ) writes: on Saturday August 12, 2023 @07:56PM (#63762970)

According to the paper they are using gpt-3.5-turbo.

- - Re: (Score:3)
    
    by mdrejhon ( 203654 ) * writes:
    
    In my experience, 4.0 scores much better than 3.5. It's the reason why I use Visual Studio Code plugins ("Genie AI") and a paid subscription to GPT 4.0 API. Much better AI-based pair programmer. Outputs more slowly but seems 10x smarter about code.
    There were indeed some regressions in 4.0 but only ~10% closer to 3.5
How does this compare to humans? (Score:3)

by larryjoe ( 135075 ) writes: on Saturday August 12, 2023 @07:56PM (#63762974)

How does a software development pipeline that includes testing change with a GPT-based coder? Is the coding accuracy worse than code from a human? Is debugging code from a GPT harder than code from a human, maybe because the bugs are more subtle as the article suggests? Also, given that the GPT is much faster, what is the effect on the total pipeline time to acceptable code?
Note that "one" GPT coder is faster than an entire team of human coders. So the total pipeline time for development should be considered for the entire project.

Missing the obvious comparison (Score:2)

by RightwingNutjob ( 1302813 ) writes:

Coders make up something like 5% or less of the adult workforce. Pick a random person and you have a 95% chance of a Not Even Wrong answer on a code question.
Similarly, less than 2% of the working population is a licensed physician, meaning you're likely to get a 98% error rate on medical questions (above basic first aid stuff).
Etc, etc, for every profession with specialized knowledge from astronomy to zoology.
Compared to the general population, the chatbot that scores around 50% is a true Renaissance Man g
- Re: Missing the obvious comparison (Score:5, Insightful)
  
  by iAmWaySmarterThanYou ( 10095012 ) writes: on Saturday August 12, 2023 @08:10PM (#63762992)
  
  So instead of asking a specialist in each area of knowledge to get the right answer, we should be happy there's a computer that can answer anything at all half the time?
  And be wrong the other half?
  I'll pass.
  I found gpt can generate simple functions as found in textbooks but beyond that it started coming up with some ridiculous shit no matter what I told it.
  
  - Re: (Score:2)
    
    by smartfart ( 215944 ) * writes:
    
    Aye, I'll pass right along with you. I've spent too many decades learning my craft. People pay me to know what I'm doing, and I'm proud of the services I'm able to provide for them.
    - Re: Missing the obvious comparison (Score:2)
      
      by RightwingNutjob ( 1302813 ) writes:
      
      I sympathize, but I guarantee there will be growing pains as half-wrong-but-dirt-cheap displaces some, but not that much, of right-but-doesn't-work-for-free.
      It won't be nearly the shitshow that spreadsheets were for accounting clerks, but some fraction of the coding workforce will be released to use their talents elsewhere.
      - Re: (Score:2)
        
        by Stoutlimb ( 143245 ) writes:
        
        Half wrong but dirt cheap is pure gold in an iterative environment such as programming. How many professional coders expect their code to be perfect and complete the very first time they press "run"?
        AI is an accelerator for the iterative process. Expecting more of it is unreasonable for now.
  - Re: (Score:2)
    
    by Stoutlimb ( 143245 ) writes:
    
    It's never been about either/or. Anyone who suggests that is living in a scifi fantasy, at least for now.
- Re: (Score:2)
  
  by edwdig ( 47888 ) writes:
  
  The type of error matters a lot. If you ask the general population a coding question, the vast majority of them will just say "I don't know". Assuming it's a non-trivial question, even a good chunk of coders will still answer you "I don't know." And a good chunk of the answers you get will be hedged with something like "I'm not sure" or "maybe this will work". You won't get many confident answers that are wrong.
  ChatGPT will happily give you a confident answer that's wrong. And unless you've already done so
  - Re: Missing the obvious comparison (Score:2)
    
    by RightwingNutjob ( 1302813 ) writes:
    
    If the benefit outweighs the penalty, and the dollar cost of trying is still near zero, it doesn't matter how confidently wrong it is when it's wrong so long as it's right enough of the time.
    I can think of a few use cases where this is the case. Nothing having to do with money management or medical stuff, but "write me an app to alert me when there's rain forecast for anything on my calendar" is something that you could throw at a real coder for some money, do yourself if you have a few hours over a few
    - Re: (Score:2)
      
      by edwdig ( 47888 ) writes:
      
      You're acknowledging the problem and choosing to avoid it by only using AI for the most trivial of tasks. It's not a meaningful way of addressing problems.
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    ChatGPT will happily give you a confident answer that's wrong. And unless you've already done some digging into the topic, it won't be obvious how accurate ChatGPT's answer is.
    Confidently providing completely wrong answers can often be a lot worse than not being able to give an answer at all.
    If people are making decisions based on how confident a person or thing sounds perhaps at least some introspection is warranted. Humans routinely mask their actual confidence level and are often explicitly trained to do so.
This is not surprising (Score:2)

by MpVpRb ( 1423381 ) writes:

ChatGPT is the wrong tool for answering Stack Overflow questions
so what (Score:2)

by bloodhawk ( 813939 ) writes:

I use the Bing chat version regularly and the code is often flawed or has errors, BUT it is a very fast way to find script and API samples as starting points. Even when wrong, they often still have the endpoints I need to call, etc. It has definitely saved me a lot of time. I don't need it to generate perfect code (although that would be nice), I just need it to accelerate my code generation, which it does.
so, tell it when it's wrong (Score:2)

by resfilter ( 960880 ) writes:

50%.... that seems about right. because a few times i have tried chatgpt for fun to generate some code. around half the time the code it spit out had an issue.
for example i have asked it to generate c++ functions for bilinear interpolation. it used array indexing with an off-by-one indexing error (quite a human mistake). i also asked it to generate some 68HC11 assembly code that reprogrammed the on-chip EEPROM while checking the results. both of them were great attempts, similar to the attempts i would
- Re: so, tell it when it's wrong (Score:2)
  
  by gwjgwj ( 727408 ) writes:
  
  Why don't you always tell it the code contains the bug?
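
For readers unfamiliar with the off-by-one pitfall mentioned in the parent comment, here is a minimal Python sketch of bilinear interpolation. It is not the commenter's C++ code or ChatGPT's output, neither of which appears in the thread. The function name `bilinear` and the `grid[row][col]` layout are assumptions made for illustration. The usual trap is reading the neighbour at index `x0 + 1` when `x0` is already the last column; the sketch clamps that index.

```python
import math


def bilinear(grid, x, y):
    """Interpolate grid[row][col] at fractional column x and row y.

    Hypothetical helper for illustration; coordinates are in index units.
    """
    rows, cols = len(grid), len(grid[0])

    # Clamp the query point to the valid coordinate range.
    x = min(max(x, 0.0), cols - 1)
    y = min(max(y, 0.0), rows - 1)

    x0 = int(math.floor(x))
    y0 = int(math.floor(y))

    # The off-by-one trap: x0 + 1 (or y0 + 1) runs past the end of the
    # array when the point sits on the last column or row, so clamp it.
    x1 = min(x0 + 1, cols - 1)
    y1 = min(y0 + 1, rows - 1)

    tx = x - x0  # horizontal weight in [0, 1)
    ty = y - y0  # vertical weight in [0, 1)

    top = grid[y0][x0] * (1 - tx) + grid[y0][x1] * tx
    bottom = grid[y1][x0] * (1 - tx) + grid[y1][x1] * tx
    return top * (1 - ty) + bottom * ty


grid = [[0, 10], [20, 30]]
print(bilinear(grid, 0.5, 0.5))  # centre of the four samples
print(bilinear(grid, 1, 1))      # exactly on the last row and column
```

Without the two `min(...)` clamps, the second call would raise an `IndexError`, which is the kind of edge-case bug that is easy to miss when the code otherwise looks right.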
I worked in a computer lab once, long ago... (Score:2)

by cirby ( 2599 ) writes:

I had a whole bunch of undergrad programming students convinced I was a genius.
They'd ask me for help. I'd walk over and look at the code for a moment, and say "you left out a parenthesis."
After a bit, they'd curse and fix the problem they finally found.
(I was counting the left and right parentheses and checking whether the numbers matched. Yes, that was the whole trick. They wouldn't remember the times I couldn't see the fix, but they were in awe of someone who could "find the bug" in just a few seconds.)
ChatGPT, even wi
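
The parenthesis-counting trick described above can be sketched in a few lines of Python. This is only an illustration of the idea; the function names are hypothetical, and the count is deliberately naive: it ignores parentheses inside strings or comments and does not check ordering.

```python
def paren_balance(source: str) -> int:
    """Return the number of '(' minus the number of ')' in the text."""
    return source.count("(") - source.count(")")


def diagnose(source: str) -> str:
    """Turn the raw balance into the lab assistant's one-line verdict."""
    diff = paren_balance(source)
    if diff > 0:
        return f"you left out {diff} closing parenthesis(es)"
    if diff < 0:
        return f"you have {-diff} extra closing parenthesis(es)"
    return "counts match; the bug is somewhere else"


print(diagnose("print((1 + 2) * 3"))
print(diagnose("total = sum(values)"))
```

A matching count does not prove the code is correct, which fits the comment's point: the trick only works when the mistake happens to be the one being counted.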
I've had limited success (Score:2)

by magzteel ( 5013587 ) writes:

Questions like "How do I do something in Java using the version 2.3 API of something" never give me an answer that uses the specified API version.
The answers I get often seem to be scraped directly from Stack Overflow
Casio's new calculator. (Score:3)

by gizmo2199 ( 458329 ) writes: on Sunday August 13, 2023 @12:35AM (#63763236) Homepage

2+2=7.5, wait, 2+2=3!

ChatGPT coding trial (Score:2)

by Daina.0 ( 7328506 ) writes:

I know of a case where ChatGPT was used to translate from one coding language to another. The original code worked. The languages had some similarity but were not very close. At first I was surprised how clever it was at figuring out the nuances of the new language. After further examination it was terrible. It was missing many things: some parts didn't get translated at all while others were translated awkwardly. None of the translated code worked. At best it set up the program structure for the new language.
here is the code repository (Score:2)

by valentyn ( 248783 ) writes:

https://github.com/SamiaKabir/ChatGPT-Answers-to-SO-questions
The Register is a joke site (Score:3)

by greytree ( 7124971 ) writes: on Sunday August 13, 2023 @05:44AM (#63763426)

The Register gets the facts straight in less than half of its stories.
And they don't print corrections.

You might as well ask ChatGPT for news.

It's useful ... (Score:2)

by cascadingstylesheet ( 140919 ) writes:

... for tasks that are boring and not yet very automated for some reason.
Like ask it to generate some code for some admin settings fields in a WordPress plugin, and it usually gets it right (or close enough for little tweaks).
Saves some time. That's about it.
Plus, let's be honest, just the fact that it can do it at all is cool. My five billionth Google search is not fun. Getting a chatbot to write code is.
Really the use case (Score:2)

by JoeRobe ( 207552 ) writes:

Are programmers out there really just asking chatgpt to make them a program and then being surprised when it doesn't work half the time? I was under the impression (not being a programmer) that it was more used to build out the bulk of a bunch of code, which could then be troubleshot quickly. That strikes me as valuable anyway.
Certainly when I do code I'm not 50% on my first shot. But I suppose that a difference between myself and chatgpt is that by my second or third iteration the probability of success
Verbose? (Score:4, Funny)

by RogueWarrior65 ( 678876 ) writes: on Sunday August 13, 2023 @10:42AM (#63763734)

By "verbose", do they mean non-obfuscated, well-commented code? I'll take that action any day.

is this a case of the other shoe dropping? (Score:3)

by lpq ( 583377 ) writes: on Sunday August 13, 2023 @01:36PM (#63764044) Homepage Journal

So much hype, then the reality hits...

It's just a tool, and very useful for what it does (Score:2)

by Ami Ganguli ( 921 ) writes:

I've used ChatGTP for both coding and for writing text. Every single time I've used it, it's produced plausible, but wrong answers. When writing prose, the answers tend to be overly wordy. When coding, it produces decent looking but incorrect code, but normally gets the API parameters right.
It doesn't matter.
The hardest part of writing prose is getting started. The hardest part of coding is remembering all the API calls. ChatGTP provides a starting point for prose, and the API calls for code.
I take ChatGT
- Re: (Score:2)
  
  by Ami Ganguli ( 921 ) writes:
  
  Don't know why I kept typing ChatGTP rather than ChatGPT. Wish I could edit my posts.
It is a generator based on likelihood (Score:2)

by laughingskeptic ( 1004414 ) writes:

One mistake I see it make is always using a common parameter. For instance most Azure API calls take a --subscription parameter, so it adds that in places where it is incorrect. I find this more interesting than "wrong". Though never runnable-as-is I find ChatGPT's outputs are often a good starting point when dealing with new APIs. As with everything ChatGPT, you have to know enough to work with and further research what it provides. If they are seeing 50% correct, they must be asking really simple que
