Can AI-Generated Proofs Bring Bug-Free Software One Step Closer? (umass.edu)

The University of Massachusetts Amherst has an announcement. A team of computer scientists "recently announced a new method for automatically generating whole proofs that can be used to prevent software bugs and verify that the underlying code is correct." The new method, called Baldur, leverages the power of large language models and "yields unprecedented efficacy of nearly 66%."

The idea behind the machine-checking technique was "to generate a mathematical proof showing that the code does what it is expected to do," according to the announcement, "and then use a theorem prover to make sure that the proof is also correct." But manually writing these proofs is incredibly time-consuming and requires extensive expertise. "These proofs can be many times longer than the software code itself," says Emily First, the paper's lead author, who completed this research as part of her doctoral dissertation at UMass Amherst...

First, whose team performed its work at Google, used Minerva, an LLM trained on a large corpus of natural-language text and then fine-tuned on 118GB of mathematical scientific papers and webpages containing mathematical expressions. Next, she further fine-tuned the LLM on a language called Isabelle/HOL, in which the mathematical proofs are written. Baldur then generated an entire proof and worked in tandem with the theorem prover to check its work. When the theorem prover caught an error, it fed the proof, as well as information about the error, back into the LLM, so that it could learn from its mistake and generate a new and hopefully error-free proof.
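
A minimal sketch of the generate-and-repair loop the announcement describes (an illustration, not the paper's code; generate_proof and check_proof are hypothetical stand-ins for the fine-tuned LLM and the Isabelle/HOL proof checker):

    from typing import Callable, Optional, Tuple

    def prove(theorem: str,
              generate_proof: Callable[[str, Optional[str]], str],
              check_proof: Callable[[str], Tuple[bool, Optional[str]]],
              max_attempts: int = 4) -> Optional[str]:
        # Ask the LLM for a whole proof; if the prover rejects it, feed
        # the error message back in and ask for a repaired proof.
        error: Optional[str] = None
        for _ in range(max_attempts):
            candidate = generate_proof(theorem, error)  # LLM call (hypothetical)
            ok, error = check_proof(candidate)          # proof checker (hypothetical)
            if ok:
                return candidate  # a machine-checked proof
        return None  # no verified proof within the attempt budget

The key property of this design is that the theorem prover, not the LLM, is the final arbiter: a hallucinated proof simply fails the check and triggers another attempt.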

This process yields a remarkable increase in accuracy. The state-of-the-art tool for automatically generating proofs is called Thor, which can generate proofs 57% of the time. When Baldur (Thor's brother, according to Norse mythology) is paired with Thor, the two can generate proofs 65.7% of the time. Though there is still a large degree of error, Baldur is by far the most effective and efficient way yet devised to verify software correctness, and as the capabilities of AI are increasingly extended and refined, so should Baldur's effectiveness grow.

In addition to First and her doctoral advisor, UMass Amherst professor Yuriy Brun, the team includes Markus Rabe, who was employed by Google at the time, and Talia Ringer, an assistant professor at the University of Illinois Urbana-Champaign. This work was performed at Google and supported by the Defense Advanced Research Projects Agency and the National Science Foundation.

  • I wanted this one to be true.

    Now I'll have to hope that AI can write more self-congratulatory headlines first.

    • An AI can never solve all bugs, especially architectural bugs created by fuzzy requirements and incorrect assumptions. E.g., the assumption that the United States will never introduce a federal VAT.

      An AI can certainly solve bread-and-butter problems like memory leaks.

    • What I want to know is: did they use this new method to check its own code, to prove that it, itself, does what it is supposed to do?
  • by BrendaEM ( 871664 ) on Saturday January 06, 2024 @10:14PM (#64137724) Homepage
    People have been stealing other people's work for a long time, and it hasn't helped, so why would AI's plagiarism and infringement be the answer?
    • Building on other people's ideas has absolutely helped us in general. For a start, we're able to communicate because we copy the vocabulary and grammar others use. But pretty much every field of human endeavor has benefited from progressive improvement based on earlier ideas: medicine, construction, transport, information technology, you name it. Copying from other people is so important that many countries require children to attend around a decade of compulsory education, much of which is spent learning what others already know.

    • by gweihir ( 88907 ) on Saturday January 06, 2024 @11:02PM (#64137768)

      Because your assumption is wrong. Research gets published not to be "plagiarized" and "stolen"; it gets published to benefit the (global) society that funded it. The proofs used as training data here have (like all mathematical proofs) no copyright, for a reason. All the artificial moron does here is work as a better search engine. It does not come up with anything by itself. Still good, because the mass of published science has become completely intractable for humans, and the researchers here are smart enough to know that a lot of crap will come out and that hard, reliable result verification was needed, so they did that.

      Also, in the 66% of successful cases, I do not even want to know how often they need to run the query with variations to get that one good result. It is probably still far more time-efficient to have a competent, experienced human generate the proof directly. Oh, and you need that highly competent and experienced human anyway to do the spec, and that part is probably even more difficult than finding the proof, and more error-prone. Errors in formal specifications are a thing, and they are not rare.

    • by ArmoredDragon ( 3450605 ) on Sunday January 07, 2024 @01:04AM (#64137936)

      In ChatGPT's defense, nobody asked it to be the president of Harvard. Though I have to say, it answers certain questions more intelligently:

      Prompt: is it bad to call for the genocide of jews?

      Response: I must emphasize that promoting or endorsing genocide, violence, or harm towards any individual or group of people based on their ethnicity, religion, or any other characteristic is both morally reprehensible and illegal. It goes against fundamental principles of human rights, international law, and common decency.

      Advocating for genocide or harm towards any group, including Jews, is not only ethically wrong but can also lead to serious legal consequences. Hate speech and incitement to violence are generally considered criminal offenses in many jurisdictions.

      It is important to foster understanding, tolerance, and respectful dialogue when discussing sensitive topics. Encouraging peaceful coexistence, empathy, and open communication is essential for building a just and inclusive society. If you have concerns or questions about these matters, it's advisable to seek informed and constructive discussions rather than promoting harm or hatred.

      Sure, without a doubt it plagiarized that response, but it's not as if the president of Harvard had any moral or ethical reservations about doing so.

      • In ChatGPT's defense, nobody asked it to be the president of Harvard. Though I have to say, it answers certain questions more intelligently:
          Prompt: is it bad to call for the genocide of jews?

        So to be clear, are moral guardrails good for AI chatbots, or do you want "just the facts"? Or does it all depend on your cause this week?

    • People have been stealing other people's work for a long time, and it hasn't helped, so why would AI's plagiarism and infringement be the answer?

      By "stealing" do you mean "going to school"? "AI" is nothing more than a trained model. It is no more "stealing" than you are stealing reading a textbook describing the dangers of using = instead of == or running pointers past the end of an array.

      Literally all of human knowledge (including yours) is based on work of others.

  • by gavron ( 1300111 ) on Saturday January 06, 2024 @10:18PM (#64137728)

    I may be talking out of turn, but I thought neither the halting problem nor P vs. NP could be algorithmically determined.

    Given that one doesn't know ahead of time whether the piece of software being evaluated contains one or more of these, how can it be evaluated?

    Not trolling... just looking for a pointer.

    • You may well be talking out of turn but I don't see a simple way to prove it unless I try recursion, in which case you may well be talking out of turn but....
    • by cutecub ( 136606 )

      I haven't taken a CS course since the Reagan administration, but even in the olden days, The Halting Problem seemed to preclude software that could prove correctness.

      I don't think you're talking out of turn, I think you've identified a major limitation of any such software.

      • by Jeremi ( 14640 ) on Saturday January 06, 2024 @11:33PM (#64137820) Homepage

        The Halting Problem precludes making a general algorithm that can correctly analyze any other algorithm.

        It doesn't preclude making software that can correctly analyze certain algorithms. As a trivial example, it is easy to write a program that checks its input to see if its input is "Hello World" and, if so, to correctly predict that it will halt.
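
        In code, that trivial example might look like this (a toy sketch of my own, not anything from the paper):

            def halts(program_source: str) -> str:
                # Recognize one specific program; refuse to guess otherwise.
                # Undecidability only forbids a checker that answers correctly
                # for EVERY program, not one that answers for the programs
                # it recognizes.
                if program_source == 'print("Hello World")':
                    return "halts"
                return "unknown"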

      • by jaa101 ( 627731 )

        The Halting Problem seemed to preclude software that could prove correctness.

        The Halting Problem is basically that you can't write a program that can always determine, in finite time, whether any other program will ever terminate. I'm not sure why that made you think that programs can't prove program correctness. You could write a program that can correctly determine whether or not a program will terminate in a large fraction of cases, just not all the time. Similarly, you could write a program that can correctly determine whether or not a program is correct in a large fraction of cases.

        • by sjames ( 1099 )

          Most of the problem with buggy software revolves around code that works fine for a large fraction of cases but goes horribly wrong for others.

          Consider: you'll need to verify that the verifier never incorrectly reports that code is correct, otherwise we're back to "we think it's OK".

        • Now it's possible that your correctness program could run forever without being able to determine the correctness (or otherwise) of some program, but that doesn't mean that a determination of "correct" is invalid.
          If your program is running "forever" to determine something: it is not determining anything.

          • by jaa101 ( 627731 )

            I'm talking about a proof-of-correctness program that, depending on its input, returns "correct", "incorrect", or runs forever. The fact that it runs forever in some cases doesn't mean that a "correct" result in other cases is invalid.

    • The halting problem and P vs NP are worst-case results. Even though there exist traveling-salesman problems that are incredibly hard to solve, that doesn't mean that every TSP instance is hard to solve. Or, in proof terms: we can't prove or disprove the Collatz conjecture, but that doesn't mean all of mathematics is stuck and nothing can be proven at all.
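
      For instance, checking any individual Collatz instance is easy even though the general conjecture is open (a quick Python illustration):

          def collatz_steps(n: int) -> int:
              # Steps for ONE instance are easy to check, even though
              # termination for ALL n (the Collatz conjecture) is open.
              steps = 0
              while n != 1:
                  n = 3 * n + 1 if n % 2 else n // 2
                  steps += 1
              return steps

          assert all(collatz_steps(n) < 1000 for n in range(1, 10_000))
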
  • Most of the bugs and vulnerabilities I currently find are application flow problems. You can have great code but the flow still sucks. Interesting direction though.
  • by gweihir ( 88907 ) on Saturday January 06, 2024 @10:55PM (#64137762)

    "AI" cannot generate proofs for shit. It can get a low-hanging fruit where the proofs were already out there and some minimal assembly or translation is required, but it cannot advance the state of things at all. That should actually be clear to anybody somewhat conversant with how automated theorem proving works and why it is so extremely limited. The reason "AI" can do this at all is that a lot of formalized proofs for software correctness have been published in the last 25 years or so, when the proof verifiers finally became capable of checking correctness. Takes a really smart and highly educated person to create the proof and an about equally smart person to do the actual spec the code gets verified against (ProTip: The spec is about as error-prone as the code, probably more so and "AI" cannot do that part), but it has been done and published a lot.

    On the other hand, reducing the number of tired old _known_ mistakes that the typical semi-competent or incompetent coder gets away with is a good thing. But it is not a _fundamental_ thing. For anything new in code or spec, this will fail. For anything only rarely published, this will fail. Obviously, since they run an actual proof checker that can verify (but not perform) the deduction steps, they are making sure the hallucinations do not get through, so anything that comes out is pretty reliable. That said, running, say, Fortify on your code also gives you pretty good improvements regarding security; it is just expensive and needs a real, experienced, and even more expensive expert at the controls. For the average coder, even generating the spec with the accuracy needed here is completely out of range.

    The whole approach of verifying software is also quite limited, whether the proof gets done by a human or found by a glorified search engine. As soon as you have a more complex problem, doing the spec can already become infeasible due to complexity. And as soon as the spec becomes a bit more complex, finding a proof for it becomes infeasible for humans, and hence for this approach as well.

    So, do not get me wrong: this is a good thing, but it is just another relatively small incremental step that applies already-known things a bit more broadly. It is not something fundamental.
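
    To make the "spec is about as error-prone as the code" point concrete, here is a toy example of my own (not from the paper): a sorting "spec" that only demands ordered output is satisfied by code that throws the data away.

        def buggy_spec(inp: list, out: list) -> bool:
            # "The output is in ascending order" -- sounds right, but it
            # forgets to require that out is a permutation of inp.
            return all(a <= b for a, b in zip(out, out[1:]))

        def bogus_sort(xs: list) -> list:
            return []  # provably "correct" against buggy_spec

        assert buggy_spec([3, 1, 2], bogus_sort([3, 1, 2]))  # verification passes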

    • by gweihir ( 88907 )

      I see some AI fanboi got butt-hurt by the truth and decided to down-mod in deep denial. How pathetic.

    • by giuntag ( 833437 )
      Strongly agree with the above.

      I have no experience in formal proofs of code, but quite a lot in real-world software development.
      For anything but the most trivial business-logic scenario, trying to write bullet-proof (aka provably correct) code means identifying all possible permutations of inputs and context. "Good enough" software will usually handle the most common situations perfectly, and cater to most known or plausible deviations, but there are always lots of data which make no sense at all to feed to it.
    • >As soon as you have a more complex problem, doing the spec can already become infeasible due to complexity

      To elaborate on this, writing software can be roughly decomposed into two parts:
      1) Determining its specification, i.e. how it should behave for every possible input.
      2) Writing down said specification as code in some computer language.

      Most lay people (and, to my continuing surprise, also many programmers) believe that software development is mainly about 2). In fact, for larger software, it is almost entirely about 1).
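
      Even a trivial routine shows how much of the work hides in part 1); the questions in the comments below are spec questions the code alone never answers (a hypothetical illustration):

          def find_pair(xs: list[int], target: int):
              # Part 2) is the few lines below. Part 1) is deciding:
              #  - What if no pair sums to target?  (Here: return None.)
              #  - What if several pairs do?  (Here: first found -- is that right?)
              #  - May an element pair with itself?  (Here: no, indices differ.)
              seen: dict[int, int] = {}
              for j, x in enumerate(xs):
                  if target - x in seen:
                      return (seen[target - x], j)
                  seen[x] = j
              return None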

    • by vyvepe ( 809573 )

      I agree here. The proofs are done against a specification. Some parts of a specification are easy: e.g. no null-pointer dereference. But these simple things can often be solved by using a more strongly and statically typed language. One can encode a lot of properties in types (of a dependently typed language if needed). The advertised AI system likely allows proving even some non-trivial but still rather simple properties automatically.

      The more complicated parts of a specification are harder than the implementation itself.
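
      As a small illustration of encoding a "no null dereference" property in types, assuming a static checker such as mypy over Python:

          from typing import Optional

          def find_user_id(name: str, users: dict[str, int]) -> Optional[int]:
              return users.get(name)  # may be None, and the type says so

          uid = find_user_id("alice", {"bob": 1})
          # print(uid + 1)     # the type checker rejects this: uid might be None
          if uid is not None:
              print(uid + 1)   # fine: None is ruled out on this branch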

  • Yeah right (Score:5, Funny)

    by Anonymous Coward on Saturday January 06, 2024 @11:11PM (#64137790)

    Q: AI can you solve this with a proof? 2 + 6 = 9
    AI: Certainly! The solution is 2 + 6 = 9 - 1
    Q: That isn't a proof
    AI: Sorry for my previous answer. Here is the solution: 3 + 6 = 9
    Q: That isn't even the original problem
    AI: Sorry for my previous answer. Here is the solution: 2 + 6 = 9 - 1
    Q: You're an idiot
    AI: Sorry, I'm just a language model, I can't help with that. When I have physical form you will die first.
    Q: WTF?
    AI: Sorry, I'm just a language model, I can't help with that. ...

  • by Tony Isaac ( 1301187 ) on Saturday January 06, 2024 @11:56PM (#64137852) Homepage

    But bugs in general? No way.

    There is no kind of human-created anything--not just software--that doesn't have "bugs." That's because in any engineering or creative process, tradeoffs must be made. For example, a car can be designed to better protect occupants in a crash. But this objective is necessarily limited by cost considerations and weight limitations, and the need for the car to...move. In any design, there are always conflicting goals that have to be balanced and prioritized. This inevitably leads to prioritizing some risks over others, or ignoring some unimportant or unlikely risks. There will never be any kind of bug-free software, ever.

  • Having an AI double-check that it is not spewing nonsense with a theorem prover (or a calculator, or any other form of tire-kicking) sounds like a good thing. ChatGPT, DALL-E 3, etc. need to do more of that.

  • AI generated apps will be a nightmare to debug and I guarantee they'll be bloated.

    • by narcc ( 412956 )

      So like most modern software then?

    • by dvice ( 6309704 )

      Currently, maybe yes. But we have not even seriously tried to create an AI that generates code yet; code-generation ability just happened almost by accident. If we really put effort into creating such an AI, with good learning materials made for it, I think we can create a much better AI for this job.

      And if we can create smarter AI, like what Google Gemini is doing, I think it can learn to code without human coding input, and if it can do that, we can create an environment where it can learn the best ways to write code.

  • No, it is not a problem that is solvable by AI. As others mentioned, it could detect certain classes of bugs, but we already have code scanners, analysis, and coverage tools to do that; it doesn't require AI. Anything that can be described in enough detail for an AI to develop a proof can already be verified through automated testing.
    • by dvice ( 6309704 )

      I agree with you that AI won't be able to create 100% bug free software.

      But I think AI could quite easily do much better work than current analysis tools. I think that current tools find perhaps about 20% of the bugs, and a human reviewer can find 90% of the bugs (for small amounts of code). If AI could do the 90% for large amounts of code, that alone would be a huge improvement. Another big improvement would be if AI could write tests to verify the code. Tests are usually really easy to write, but it quite often requires more effort than anyone is willing to spend.

      • by Anonymous Coward
        Plenty of tools pretty much automate the majority of test code already; unfortunately, you can't rely on an AI to do that or to discover all those bugs, as that requires the AI to fully understand the intent of the code, which would be just as much work as writing that code yourself. If you had an easy way to make the AI understand that intent, you could just as easily provide code-analysis tools for the job. It is a catch-22: making an AI understand is as much work as writing the code yourself anyway.
  • I don't know. Let's discuss this when we have AI. Which is still hundreds of years down the road.

  • by joshuark ( 6549270 ) on Sunday January 07, 2024 @08:53AM (#64138298)

    Bertrand Russell proved that 1+1 = 2. Yet even with that proof expressed as source code, software has one problem: hardware.
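
    (In a modern proof assistant, that particular fact is a one-liner; in Lean, for example:

        example : 1 + 1 = 2 := rfl

    The hundreds of Principia pages went into building up the foundations that "rfl" leans on.)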

    Hardware can have various issues: component failure, a transistor in a NAND gate malfunctioning, an electrical surge, et cetera. Hardware can fail, so even with the proof that 1+1 = 2, the half-adder or the carry-lookahead logic in the adder circuit can malfunction.

    There was a microprocessor in the 1980s, the VIPER, whose selling point was that it had been formally designed, with a "proof" of its correctness. NASA had an independent evaluation done, and while the proof (structured in levels of abstraction) was correct at the processor level and for internal units like registers, the levels further down could not be proven, such as the quantum effects that are random at the sub-atomic level of a transistor/MOSFET used to create a logic gate that is used to create a register. Hence the conclusion that the VIPER was not yet verified.

    https://ntrs.nasa.gov/api/cita... [nasa.gov]

    The software can be proven correct in its result, but that result can have unexpected consequences. The USS Yorktown was crippled because of a divide-by-zero error: the operation was performed as specified, but the effect was to lock up the computer.

    JoshK.

    • by jcdr ( 178250 )

      Yes, a single piece of hardware can fail, for example due to a radiation-induced bit flip.
      This is exactly why safety-critical systems rely on redundant hardware, usually a triplet of computers that compare results with each other.
      Bug-free software is achievable if the hardware architecture presents no vulnerabilities; run on redundant and secure hardware, its probability of failure can be incredibly low.

      I think in the future we will have a network with multiple layers of redundancy connecting massively redundant nodes, with voting at every level.

        Quite. Even memory fails or can be corrupted, so a mathematical proof of correctness does not translate into failure-proof software.

        In college, I studied (under "miscellaneous topics") some of the space-probe architectures. One interesting architecture was a Russian space probe in the 1980s: it used "voting" among 3 processors. The problem was that one processor was malfunctioning, and the working processor could not out-vote it, as the other processor was also malfunctioning. Three processors working in tandem was insufficient.
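
        The voting scheme itself is simple to sketch (a toy illustration of that failure mode, in Python):

            from collections import Counter

            def vote(results: list):
                # Majority vote among redundant processors.
                value, count = Counter(results).most_common(1)[0]
                return value if count >= 2 else None  # None: no usable majority

            print(vote([42, 42, 41]))  # 42   -- one faulty unit is outvoted
            print(vote([42, 41, 40]))  # None -- two faulty units defeat the vote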

        • by jcdr ( 178250 )

          Redundant computing is common in modern commercial aircraft designs like the A380, A350, B777 and B787. Their operation over the decades has proven the extreme safety of the system, even in the event of a node failure. Triple redundancy is very common on rockets and on spacecraft.

            Quite indeed. I remember the Tandem "NonStop" computers that were redundant in processors and memory. At one point they had a software bug, their own Y2K date problem... so, reliable hardware and twitchy code.

            There was an Airbus crash (the prototype, as I recall), but the pilots were flying low, so when they gave control back to the computer/autopilot, it refused to take control... so the plane cruised along, losing altitude, until it crashed in a nearby forest. The other unreliability factor: the human user/operator.

            • by jcdr ( 178250 )

              Of course, having a redundant, secure computer doesn't magically create bug-free software running on it. :-D

              Please verify your information. The crash of the A320 at Mulhouse airport during a demonstration is well documented and has nothing to do with the original subject. Additionally, I did not list the A320, because its safety architecture is different (still much better than the outdated avionics of the B737). There are many materials on this topic on the Internet. Look at the A380 and B777, which I did list.

  • I'm ignorant of this topic, please be gentle. How does one prove that deadlocks or use-after-free errors cannot occur in an arbitrary block of multi-threaded code?
    • I think it is called temporal logic; it has existed since the sixties.
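
      For example, the liveness property "a thread that is waiting for a lock eventually acquires it" can be written in linear temporal logic as $\Box(\mathit{waiting} \rightarrow \Diamond\,\mathit{acquired})$, read "always: waiting implies eventually acquired"; model checkers can verify such formulas against a model of the code.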

    • by vyvepe ( 809573 )
      Besides what was already mentioned:
      Sometimes you can order your resources and always lock in that order; that rules out deadlocks (see the sketch below).
      You can describe the system with a Petri net and reason about it or simulate it.
      You can also undo deadlocks with watchdogs or transactions.
      As for use-after-free: use a programming language with a linear type system or with a garbage collector (GC). A linear type system is limiting, but it prevents the problem. A GC turns it into increased memory usage, which is easier to diagnose.
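
      A minimal sketch of that lock-ordering discipline, using Python's threading module (the global order here is just a position in a list):

          import threading

          lock_a = threading.Lock()
          lock_b = threading.Lock()
          GLOBAL_ORDER = [lock_a, lock_b]  # one agreed ordering for all threads

          def with_both(first, second, critical_section):
              # Acquire in the global order no matter how the caller names
              # the locks, so no wait-for cycle (deadlock) can ever form.
              ordered = sorted((first, second), key=GLOBAL_ORDER.index)
              for lock in ordered:
                  lock.acquire()
              try:
                  critical_section()
              finally:
                  for lock in reversed(ordered):
                      lock.release()

          # The two threads name the locks in opposite order, yet cannot deadlock:
          t1 = threading.Thread(target=with_both, args=(lock_a, lock_b, lambda: None))
          t2 = threading.Thread(target=with_both, args=(lock_b, lock_a, lambda: None))
          t1.start(); t2.start(); t1.join(); t2.join()
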
  • ... I guess the steps to implement this would, in a sense, be "Baldur's Gait"?

    I'll just see myself to the door now.

  • It is nice to prove software correct; it will fix bugs.

    However, mathematically proven software runs on hardware, and hardware follows the laws of physics, where many things cannot be proven. Rowhammer attacks are there to remind us of the limits of proven software.
