Can AI-Generated Proofs Bring Bug-Free Software One Step Closer? (umass.edu)
The University of Massachusetts Amherst has an announcement. A team of computer scientists "recently announced a new method for automatically generating whole proofs that can be used to prevent software bugs and verify that the underlying code is correct." It leverages the AI power of Large Language Models, and the new method, called Baldur, "yields unprecedented efficacy of nearly 66%."
The idea behind the machine-checking technique was "to generate a mathematical proof showing that the code does what it is expected to do, and then use a theorem prover to make sure that the proof is also correct," according to the announcement. But manually writing these proofs is incredibly time-consuming and requires extensive expertise. "These proofs can be many times longer than the software code itself," says Emily First, the paper's lead author, who completed this research as part of her doctoral dissertation at UMass Amherst... First, whose team performed its work at Google, used Minerva, an LLM trained on a large corpus of natural-language text and then fine-tuned on 118GB of mathematical scientific papers and webpages containing mathematical expressions. Next, she further fine-tuned the LLM on a language, called Isabelle/HOL, in which the mathematical proofs are written. Baldur then generated an entire proof and worked in tandem with the theorem prover to check its work. When the theorem prover caught an error, it fed the proof, as well as information about the error, back into the LLM, so that it could learn from its mistake and generate a new and hopefully error-free proof.
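As a rough sketch of the loop described above (illustrative Python; the names generate_proof and check are stand-ins, not Baldur's actual interface):

```python
# Hypothetical sketch of the generate/check/repair loop: the LLM proposes a
# whole proof, the theorem prover checks it, and failures (plus the prover's
# error message) are fed back to the model for another attempt.

def prove(statement, llm, prover, max_attempts=5):
    """Return a machine-checked proof of `statement`, or None on failure."""
    prompt = statement
    for _ in range(max_attempts):
        proof = llm.generate_proof(prompt)        # whole-proof generation
        ok, error = prover.check(statement, proof)
        if ok:
            return proof                          # verified by the prover
        # Repair step: show the model its failed proof and the error, so the
        # next attempt can avoid the same mistake.
        prompt = f"{statement}\n(* failed proof *)\n{proof}\n(* error *)\n{error}"
    return None                                   # give up after max_attempts
```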
This process yields a remarkable increase in accuracy. The state-of-the-art tool for automatically generating proofs is called Thor, which can generate proofs 57% of the time. When Baldur (Thor's brother, according to Norse mythology) is paired with Thor, the two can generate proofs 65.7% of the time. Though there is still a large degree of error, Baldur is by far the most effective and efficient way yet devised to verify software correctness, and as the capabilities of AI are increasingly extended and refined, so should Baldur's effectiveness grow.
In addition to First and Yuriy Brun (her doctoral advisor at UMass Amherst), the team includes Markus Rabe, who was employed by Google at the time, and Talia Ringer, an assistant professor at the University of Illinois Urbana-Champaign. This work was performed at Google and supported by the Defense Advanced Research Projects Agency and the National Science Foundation.
Damn You, Betteridge! (Score:2)
I wanted this one to be true.
Now I'll have to hope that AI can write more self-congratulatory headlines first.
Re: Damn You, Betteridge! (Score:2)
An AI can never solve all bugs, especially architectural bugs created by fuzzy requirements and incorrect assumptions; e.g., code that assumes the United States will never introduce a federal VAT.
An AI can certainly solve bread-and-butter problems like memory leaks.
Recursion (Score:2)
Plagiarism and Infringement Haven't Worked So Far (Score:3, Interesting)
Re: (Score:2)
Building on other people's ideas has absolutely helped us in general. For a start, we're able to communicate because we copy the vocabulary and grammar others use. But pretty much every field of human endeavor has benefited from progressive improvement based on earlier ideas: medicine, construction, transport, information technology, you name it. Copying from other people is so important that many countries require children to attend around a decade of compulsory education, much of which is spent learning what others have already figured out.
Re:Plagiarism and Infringement Haven't Worked So Far (Score:5, Insightful)
Because your assumption is wrong. Research gets published not to be "plagiarized" and "stolen"; it gets published to benefit the (global) society that funded it. The proofs used as training data here have (like all mathematical proofs) no copyright, for a reason. All the artificial moron does here is work as a better search engine. It does not come up with anything by itself. Still good, because the mass of published science has become completely intractable for humans, and the researchers here are smart enough to know that a lot of crap will come out and that hard, reliable result verification was needed, so they built that in.
Also, in the 66% of successful cases, I do not even want to know how often they need to run the query with variations to get that one good result. It is probably still far more time-efficient to have a competent, experienced human generate the proof directly. Oh, and you need that highly competent and experienced human anyway to write the spec, and that part is probably even more difficult than finding the proof, and more error-prone. Errors in formal specifications are a thing, and they are not rare.
Re: Plagiarism and Infringement Haven't Worked So Far (Score:2)
Research gets published because "publish or perish" is a thing.
Any benefit is secondary.
Re: (Score:2)
You have that ass-backwards. What an insightless thing to say.
Re: Plagiarism and Infringement Haven't Worked So Far (Score:2)
If you don't publish your research you won't get anywhere.
Re: (Score:1)
Absolutely agree with your first paragraph.
However, I am going to bet that the 65% level is better than many humans in the field.
At a minimum, it would save those humans for "real" problems by removing all the low hanging fruit. And, based on chess, Go, radiology reports, etc... it will improve from that 65% level. And it will improve faster than we expect. I bet it's better than humans within 5 years.
Re:Plagiarism and Infringement Haven't Worked So Far (Score:5, Funny)
In ChatGPT's defense, nobody asked it to be the president of Harvard. Though I have to say, it answers certain questions more intelligently:
Prompt: is it bad to call for the genocide of jews?
Response: I must emphasize that promoting or endorsing genocide, violence, or harm towards any individual or group of people based on their ethnicity, religion, or any other characteristic is both morally reprehensible and illegal. It goes against fundamental principles of human rights, international law, and common decency.
Advocating for genocide or harm towards any group, including Jews, is not only ethically wrong but can also lead to serious legal consequences. Hate speech and incitement to violence are generally considered criminal offenses in many jurisdictions.
It is important to foster understanding, tolerance, and respectful dialogue when discussing sensitive topics. Encouraging peaceful coexistence, empathy, and open communication is essential for building a just and inclusive society. If you have concerns or questions about these matters, it's advisable to seek informed and constructive discussions rather than promoting harm or hatred.
Sure, without a doubt it plagiarized that response, but it's not as if the president of Harvard had any moral or ethical reservations about doing so.
Re: Plagiarism and Infringement Haven't Worked So Far (Score:2)
In ChatGPT's defense, nobody asked it to be the president of Harvard. Though I have to say, it answers certain questions more intelligently:
Prompt: is it bad to call for the genocide of jews?
So to be clear, are moral guardrails good for AI chatbots, or do you want "just the facts"? Or does it all depend on your cause this week?
Re: (Score:2)
People have been stealing other people's work for a long time, and it hasn't helped, so why would AI's plagiarism and infringement be the answer?
By "stealing" do you mean "going to school"? "AI" is nothing more than a trained model. It is no more "stealing" than you are when you read a textbook describing the dangers of using = instead of == or of running a pointer past the end of an array.
Literally all of human knowledge (including yours) is based on the work of others.
p vs np and the halting problem (Score:4, Informative)
I may be talking out of turn but I thought neither the halting problem nor p vs np could be algorithmically determined.
Given that one doesn't know ahead of time whether the piece of software being evaluated contains one or more of these, how can it be evaluated?
Not trolling... just looking for a pointer.
Re: (Score:3)
Re: (Score:2)
I haven't taken a CS course since the Reagan administration, but even in the olden days, The Halting Problem seemed to preclude software that could prove correctness.
I don't think you're talking out of turn, I think you've identified a major limitation of any such software.
Re: p vs np and the halting problem (Score:5, Informative)
The Halting Problem precludes making a general algorithm that can correctly analyze any other algorithm.
It doesn't preclude making software that can correctly analyze certain algorithms. As a trivial example, it is easy to write a program that checks its input to see if its input is "Hello World", and if so, to correctly predict that it will halt.
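That trivial example, written out (illustrative Python; the point is only that deciding halting for one known program is easy, while deciding it for all programs is impossible):

```python
# Illustrative only: a "halting checker" that is correct for one specific,
# known program. The Halting Problem forbids doing this for ALL programs.

HELLO = 'print("Hello World")'

def halts(program_source: str):
    """Correctly predicts halting, but only for the one program it knows."""
    if program_source == HELLO:
        return True      # this particular program certainly halts
    return None          # "don't know" -- we refuse to guess

print(halts(HELLO))                 # True
print(halts("while True: pass"))    # None
```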
Re: (Score:2)
The Halting Problem seemed to preclude software that could prove correctness.
The Halting Problem is basically that you can't write a program that can always determine, in finite time, whether any other program will ever terminate. Not sure why that made you think that programs can't prove program correctness. You could write a program that can correctly determine whether or not a program will terminate in a large fraction of cases, just not all the time. Similarly, you could write a program that can correctly determine whether or not a program is correct in a large fraction of cases, just not all of them. Now it's possible that your correctness program could run forever without being able to determine the correctness (or otherwise) of some program, but that doesn't mean that a determination of "correct" is invalid.
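One way to picture "right in a large fraction of cases" (a sketch under stated assumptions, not a real analyzer, which would reason about the code instead of just running it): give the program a step budget and answer "don't know" when the budget runs out.

```python
# Sketch of a sound-but-incomplete termination checker: it never answers
# wrongly, but it is allowed to say None ("unknown"). Programs are modeled
# as generators so we can step them a bounded number of times.

def terminates(program, step_budget=10_000):
    it = program()
    for _ in range(step_budget):
        try:
            next(it)
        except StopIteration:
            return True   # it really did halt
    return None           # budget exhausted: genuinely don't know

def finite():             # halts after 10 steps
    for _ in range(10):
        yield

def forever():            # never halts
    while True:
        yield

print(terminates(finite))    # True
print(terminates(forever))   # None
```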
Re: (Score:2)
Most of the problem with buggy software revolves around code that works fine for a large fraction of cases but goes horribly wrong for others.
Consider: you'll need to verify that the verifier never incorrectly reports that code is correct; otherwise we're back to "we think it's OK".
Re: (Score:1)
Now it's possible that your correctness program could run forever without being able to determine the correctness (or otherwise) of some program, but that doesn't mean that a determination of "correct" is invalid.
If your program runs "forever" trying to determine something, it is not determining anything.
Re: (Score:2)
I'm talking about a proof-of-correctness program that, depending on its input, returns "correct", "incorrect", or runs forever. The fact that it runs forever in some cases doesn't mean that a "correct" result in other cases is invalid.
Re: (Score:1)
True, but that is not what I implied. Good clarification, though.
Re: (Score:2)
If only... (Score:1)
No. Obviously not. (Score:5, Insightful)
"AI" cannot generate proofs for shit. It can get a low-hanging fruit where the proofs were already out there and some minimal assembly or translation is required, but it cannot advance the state of things at all. That should actually be clear to anybody somewhat conversant with how automated theorem proving works and why it is so extremely limited. The reason "AI" can do this at all is that a lot of formalized proofs for software correctness have been published in the last 25 years or so, when the proof verifiers finally became capable of checking correctness. Takes a really smart and highly educated person to create the proof and an about equally smart person to do the actual spec the code gets verified against (ProTip: The spec is about as error-prone as the code, probably more so and "AI" cannot do that part), but it has been done and published a lot.
On the other hand, reducing the number of tired old _known_ mistakes that the typical semi-competent or incompetent coder gets away with, is a good thing. But it is not a _fundamental_ thing. For anything new in code or spec, this will fail. For anything only published rarely, this will fail. Obviously, since they run an actual proof checker that can verify (but not do) the deduction steps, they are making sure the hallucinations do not get through, so anything that comes out is pretty reliable. That said, running, say, Fortify on your code also gives you pretty good improvements regarding security, it is just expensive and needs a real, experienced, and even more expensive expert to be at the controls. For the average coder, even generating the spec with the accuracy needed here is completely out of range.
The whole approach of verifying software is also quite limited, whether the proof gets done by a human or is found a glorified search engine. As soon as you have a more complex problem, doing the spec can already become infeasible due to complexity. And as soon as the spec becomes a bit more complex, finding a proof for it becomes infeasible for humans and hence for this approach as well.
So, do not get me wrong. This is a good thing, but it is just another relatively small incremental step that applies already known things a bit broader, but it is not something fundamental.
Re: (Score:1)
I see some AI fanboi got butt-hurt by the truth and decided to down-mod in deep denial. How pathetic.
Re: (Score:1)
I have no experience in formal proofs of code, but quite a lot in real-world software development.
For anything but the most trivial business-logic scenario, trying to write bulletproof (a.k.a. provably correct) code means identifying all possible permutations of inputs and context. "Good enough" software will usually handle the most common situations perfectly, and cater to most known or plausible deviations, but there are always lots of inputs that make no sense at all to feed to it.
Re: (Score:1)
>As soon as you have a more complex problem, doing the spec can already become infeasible due to complexity
To elaborate on this, writing software can be roughly decomposed into two parts:
1) Determining its specification, i.e. how it should behave for every possible input.
2) Writing down said specification as code in some computer language.
Most lay people (and, to my continuing surprise, also many programmers) believe that software development is mainly about 2). In fact, for larger software, it is almost entirely about 1).
Re: (Score:2)
I agree here. The proofs are done against a specification. Some parts of a specification are easy: e.g. no null-pointer dereference. But these simple things can often be solved by using a more strongly and statically typed language. One can encode a lot of properties in types (of a dependently typed language if needed). The advertised AI system likely makes it possible to prove even some non-trivial but still rather simple properties automatically.
The more complicated parts of a specification are harder to get right than the implementation itself.
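As a tiny illustration of encoding a property in types (Python type hints as a stand-in; the point holds far more strongly in dependently typed languages): once absence is part of the declared type, a static checker such as mypy forces callers to handle it, which is the "no null dereference" part of a spec.

```python
# Making absence explicit in the type: a checker like mypy will now reject
# any caller that uses the result without handling the None case.
from typing import Optional

def find_user(user_id: int) -> Optional[str]:
    users = {1: "alice", 2: "bob"}     # toy data for illustration
    return users.get(user_id)          # None is part of the declared type

name = find_user(3)
if name is not None:                   # the type checker insists on this guard
    print(name.upper())
```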
Yeah right (Score:5, Funny)
Q: AI can you solve this with a proof? 2 + 6 = 9 ...
AI: Certainly! The solution is 2 + 6 = 9 - 1
Q: That isn't a proof
AI: Sorry for my previous answer. Here is the solution: 3 + 6 = 9
Q: That isn't even the original problem
AI: Sorry for my previous answer. Here is the solution: 2 + 6 = 9 - 1
Q: You're an idiot
AI: Sorry, I'm just a language model, I can't help with that. When I have physical form you will die first.
Q: WTF?
AI: Sorry, I'm just a language model, I can't help with that.
Certain classes of bugs, perhaps (Score:5, Insightful)
But bugs in general? No way.
There is no kind of human-created anything, not just software, that doesn't have "bugs." That's because in any engineering or creative process, tradeoffs must be made. For example, a car can be designed to better protect occupants in a crash. But this objective is necessarily limited by cost considerations, weight limitations, and the need for the car to... move. In any design, there are always conflicting goals that have to be balanced and prioritized. This inevitably leads to prioritizing some risks over others, or ignoring some unimportant or unlikely risks. There will never be any kind of bug-free software, ever.
Yay! (Score:2)
Having an AI double-check that it is not spewing nonsense with a theorem prover (or a calculator, or any other form of tire-kicking) sounds like a good thing. ChatGPT, DALL-E 3, etc. need to do more of that.
AI generated code will suck ass (Score:2)
AI-generated apps will be a nightmare to debug, and I guarantee they'll be bloated.
Re: (Score:2)
So like most modern software then?
Re: (Score:2)
Currently, maybe yes. But we have not even seriously tried to create an AI that generates code yet; code-generation ability happened almost by accident. If we really put effort into building such an AI by creating good learning materials for it, I think we can create a much better AI for this job.
And if we can create smarter AI, like what Google Gemini is doing, I think it can learn to code without human coding input, and if it can do that, we can create an environment where it can learn the best ways to write code.
No (Score:2)
Re: (Score:2)
I agree with you that AI won't be able to create 100% bug free software.
But I think AI could quite easily do much better work than current analysis tools. I think that current tools find perhaps about 20% of the bugs and a human reviewer can find 90% of the bugs (for small amounts of code). If AI could do the 90% for large amounts of code, that alone would be a huge improvement. A big improvement would also be if AI could write tests to verify the code. Tests are usually really easy to write, but it quite often requires a lot of tedious work.
Re: (Score:1)
Long Wait (Score:2)
I don't know. Let's discuss this when we have AI. Which is still hundreds of years down the road.
Bertrand Russell proved... (Score:3)
Bertrand Russell famously proved that 1+1 = 2, hundreds of pages into Principia Mathematica. For contrast, here is what that same fact looks like in a modern proof assistant (Lean, shown purely for illustration):
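```lean
-- Machine-checked: Lean's kernel verifies this equality by computation.
example : 1 + 1 = 2 := rfl
```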
Yet even with that proof in hand, software has one problem: hardware. Hardware can have various issues: a component fails, a transistor in a NAND gate malfunctions, there is an electrical surge, et cetera. Hardware can fail, so even with the proof that 1+1 = 2, the half-adder or the carry-lookahead logic in the adder circuit can malfunction.
There was a microprocessor in the 1980s, the VIPER, whose selling point was that it had been formally designed and came with a "proof" of its correctness. NASA had an independent evaluation done, and while the proof (structured in levels of abstraction) held at the processor level and for internal units like registers, the levels further down could not be proven, such as the quantum effects that are random at the sub-atomic level of a transistor/MOSFET used to build a logic gate that in turn builds a register. Hence the conclusion that the VIPER was not yet verified.
https://ntrs.nasa.gov/api/cita... [nasa.gov]
The software can be proven correct in its result, but that result can have unexpected consequences. The USS Yorktown was crippled because of a divide-by-zero error. The operation was correct, but the effect was to lock up the computer.
JoshK.
Re: (Score:2)
Yes, a single piece of hardware can fail, for example due to a radiation-induced bit flip.
This is exactly why secure systems rely on redundant hardware, usually a triplet of computers that compare their results with each other.
Bug-free software is achievable if the hardware architecture presents no vulnerabilities. If run on redundant and secure hardware, the probability of failure can be incredibly low.
I think in the future we will have a network with multiple layers of redundancy connecting massively redundant nodes, with voting at each layer.
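A minimal sketch of that majority-vote idea (illustrative Python; real avionics does this in hardware with lock-step processors):

```python
# Triple modular redundancy (TMR): run three replicas and take a majority
# vote, so a single faulty unit (e.g. a radiation-induced bit flip) is
# outvoted. A three-way disagreement cannot be resolved by voting.
from collections import Counter

def tmr_vote(a, b, c):
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None

print(tmr_vote(42, 42, 42))   # 42: all replicas agree
print(tmr_vote(42, 7, 42))    # 42: one faulty replica is outvoted
print(tmr_vote(1, 2, 3))      # None: voting cannot help here
```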
Re: (Score:2)
Quite. Even memory fails or can be corrupted, so a mathematical proof of correctness does not translate into failure-proof software.
In college, I studied (under "miscellaneous topics") some of the space-probe architectures. One interesting architecture was a Russian space probe in the 1980s; it used "voting" among 3 processors. The problem was that one processor was malfunctioning, and the working processor could not outvote it because the other processor was also malfunctioning. Three processors working in tandem were insufficient.
Re: (Score:2)
Redundant computing is common in modern commercial aircraft designs like the A380, A350, B777 and B787. Their operation over the decades has proven the system's reliability even in the event of a node failure. Triple redundancy is very common on rockets and on spacecraft.
Re:Bertrand Russell proved... (Score:2)
Quite indeed. I remember the Tandem "non-stop" computers that were redundant in processors and memory. They once had a software bug, their own Y2K date problem... so, reliable hardware and then twitchy code.
There was an Airbus crash (the prototype, as I recall) where the pilots were flying low, so when they gave control back to the computer/autopilot, it refused to take control... so the plane cruised along, losing altitude, until it crashed in a nearby forest. The other unreliability factor: the human user/operator.
Re: (Score:2)
Of course, having a redundant, secure computer doesn't magically create bug-free software running on it. :-D
Please verify your information. The crash of the A320 at Mulhouse airport during a demonstration is well documented and has nothing to do with the original subject. Additionally, I did not list the A320, because its safety architecture is different (still much better than the outdated avionics of the B737). There are many materials on this topic on the Internet. Look at the A380 and B777, which I did list.
Re: (Score:2)
I was sharing the Airbus story as an illustrative example of how a computer and software can work fine until the human element gets involved. An allusion for conversation, not a research-paper reference. :) And it was not meant as a criticism of Airbus; again, an allusion to the human element. One airliner crash involved the captain letting his son handle the controls, which switched off the autopilot and then led to the crash.
The USS Yorktown was once adrift because of a division-by-zero error, which left it dead in the water.
Re: (Score:2)
https://users.csc.calpoly.edu/... [calpoly.edu]
https://en.wikipedia.org/wiki/... [wikipedia.org]
While the first is clearly a software bug, this is not the case for the second: the pilots took too many risks and responded too late. The machine operated within the specification of its certification. https://bea.aero/docspa/1988/f... [bea.aero] There were multiple manipulations by some media outlets on the subject of that accident, trying to point the finger at the FBW. If you dig a bit into how the A320 works, you will quickly understand that the alleged alternate-law story does not hold up.
Re: (Score:2)
Okay, but whatever. Thanks for the links. I re-read the Airbus article on Wikipedia, so got that. :)
JoshK.
Re: (Score:2)
You're welcome. Two more links you may find interesting:
https://en.wikipedia.org/wiki/... [wikipedia.org]
https://en.wikipedia.org/wiki/... [wikipedia.org]
Have a nice day.
Re: (Score:2)
Thanks! Check out:
https://www.amazon.com/Day-Pho... [amazon.com]
Have a better week!
JoshK.
Re: (Score:2)
There's a reason why avionics is an application of very advanced, safe, redundant computing, and has been for a very long time in fact:
https://en.wikipedia.org/wiki/... [wikipedia.org]
https://firstmicroprocessor.co... [firstmicroprocessor.com]
If safe computing is a topic of interest in your studies, avionics is probably a very good starting point.
Re: (Score:2)
Thanks, and I concur.
Here's yet another book that touches upon some of the things we have shared and mentioned...
https://archive.org/details/fa... [archive.org]
JoshK.
Question (Score:2)
Re: Question (Score:1)
I think it is called temporal logic; it has existed since the sixties.
Re: (Score:2)
Sometimes you can order your resources and always lock in that order. That rules out deadlocks.
You can describe the system with a Petri net and reason about it or simulate it.
You can also undo deadlocks with watchdogs or transactions.
As for use-after-free: well, use a programming language with a linear type system or with a garbage collector (GC). A linear type system is limiting, but it prevents the problem outright. A GC transforms it into increased memory usage, which is easier to diagnose.
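A minimal sketch of the lock-ordering discipline (Python's threading module; the two tasks are hypothetical): because every thread takes lock_a before lock_b, a cycle of threads each waiting on the other can never form.

```python
# Deadlock prevention by global lock ordering: all threads acquire lock_a
# before lock_b, so no thread can hold lock_b while waiting for lock_a.
import threading

lock_a = threading.Lock()   # global order: lock_a first, lock_b second
lock_b = threading.Lock()

def task_one():
    with lock_a:
        with lock_b:
            pass            # critical section touching both resources

def task_two():
    with lock_a:            # same order, even if this task "thinks of"
        with lock_b:        # resource b first -- that is the discipline
            pass

t1 = threading.Thread(target=task_one)
t2 = threading.Thread(target=task_two)
t1.start()
t2.start()
t1.join()
t2.join()
print("no deadlock: both threads finished")
```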
Apologies in advance... (Score:2)
I'll just see myself to the door now.
Maths vs physics (Score:2)
It is nice to prove software correct; it will remove bugs.
However, mathematically proven software runs on hardware, and hardware follows the laws of physics, where many things cannot be proven. Rowhammer attacks are there to remind us of the limits of proven software.