


AI Slop? Not This Time. AI Tools Found 50 Real Bugs In cURL (theregister.com)
The Register reports:
Over the past two years, the open source curl project has been flooded with bogus bug reports generated by AI models. The deluge prompted project maintainer Daniel Stenberg to publish several blog posts about the issue in an effort to convince bug bounty hunters to show some restraint and not waste contributors' time with invalid issues. Shoddy AI-generated bug reports have been a problem not just for curl, but also for the Python community, Open Collective, and the Mesa Project.
It turns out the problem is people rather than technology. Last month, the curl project received dozens of potential issues from Joshua Rogers, a security researcher based in Poland. Rogers identified assorted bugs and vulnerabilities with the help of various AI scanning tools. And his reports were not only valid but appreciated. Stenberg in a Mastodon post last month remarked, "Actually truly awesome findings." In his mailing list update last week, Stenberg said, "most of them were tiny mistakes and nits in ordinary static code analyzer style, but they were still mistakes that we are better off having addressed. Several of the found issues were quite impressive findings...."
Stenberg told The Register that about 50 bugfixes based on Rogers' reports have been merged. "In my view, this list of issues achieved with the help of AI tooling shows that AI can be used for good," he said in an email. "Powerful tools in the hand of a clever human is certainly a good combination. It always was...!" Rogers wrote up a summary of the AI vulnerability scanning tools he tested. He concluded that these tools — Almanax, Corgea, ZeroPath, Gecko, and Amplify — are capable of finding real vulnerabilities in complex code.
The Register's conclusion? AI tools "when applied with human intelligence by someone with meaningful domain experience, can be quite helpful."
jantangring (Slashdot reader #79,804) has published an article on Stenberg's new position, including recently published comments from Stenberg that "It really looks like these new tools are finding problems that none of the old, established tools detect."
Fake "success" is fake (Score:4, Interesting)
The real question is how many bogus reports those 50 had to be filtered out of. If it is a larger number, then this is still a fail and unusable.
Re: (Score:1, Insightful)
Huh? It's clearly usable, since they obtained this list of 50 valid bugs. How is that a fail? Without the AI it might have taken a decade to find these bugs, by which time North Korea may have found one or two that we'd have missed.
Re:Fake "success" is fake (Score:5, Interesting)
If it reported 1500 bugs but only 50 of those were valid, then the signal to noise ratio is too low for a real person with budgeted time to be able to filter the output.
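As a rough sketch of that arithmetic (all numbers are hypothetical, taken from the 1500:50 scenario above plus an assumed one day of triage per report):

```python
# Hypothetical triage arithmetic for the 1500-report scenario above.
# All numbers are assumptions, not measured values.
total_reports = 1500
valid_bugs = 50
days_per_report = 1          # assumed human time to vet a single report

wasted_days = (total_reports - valid_bugs) * days_per_report
days_per_valid_bug = total_reports * days_per_report / valid_bugs

print(wasted_days)           # 1450 person-days spent on false positives
print(days_per_valid_bug)    # 30 person-days of triage per real bug found
```

Whether 30 person-days of triage per real bug is acceptable is exactly the budget question argued below.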
Re: (Score:2)
Indeed. It may have worked this once, but it is not sustainable. I have no idea why people do not know or understand this well-established basic fact. The AI fans seem to be operating more like cult members than like rational people.
Re: (Score:2)
If it reported 1500 bugs but only 50 of those were valid, then the signal to noise ratio is too low for a real person with budgeted time to be able to filter the output.
If a real person with budgeted time only finds a handful otherwise then that is still a vast improvement. Noisy signals can still point you in the right direction better than no signal at all.
Re: (Score:3)
Noisy signals can still point you in the right direction better than no signal at all.
This is absolutely true. Curl is a very critical, quite old, well-studied piece of software. There is a high likelihood that a bug in curl has security implications somewhere and so is very valuable to fix. Looking at, say, up to 200 bug reports to find 50 correct ones would be absolutely justifiable and fine. Where gweihir is right is that looking at 200,000 such reports to find 50 would not be fine. Not only would it take too much time, the chance of incorrectly accepting a wrong bug report somewhere in the
Re: (Score:3)
We need the false positive and false negative rates to understand the value of this
I am not sure we do. The evidence is that 50 correct issues were found by a single person using these tools in a "reasonable" period of time (I have no idea if this is days, weeks or months). Let's assume he spent a whole year on the project. We have a single person-year of effort (+ cost of tools) to find these issues. For a project that is critical, that is almost certainly a reasonable price to pay. Companies like Microsoft and Google likely have teams of highly paid programmers already dedicated to root
Re: (Score:2)
I think we could all agree that its not seeing all the bugs, or its reporting false bugs on any particular code base, should itself be considered a bug.
So, I guess the answer is obvious (since it is clearly not being applied to itself). That does not mean the thing is useless, it's just
Re: (Score:2)
Read the blog. https://joshua.hu/llm-engineer-review-sast-security-ai-tools-pentesters. cURL wasn't the only software he tried it on. In particular, he found issues with sudo as well.
I really don't see why so many people are making comments like "it's not going to be all-powerful". I don't think this claim was made either in the article or in the blog. And as far as the halting problem goes: the tools didn't find a known bug of that kind (read the blog), so the limitations were actually discussed.
This doesn't prevent
Re: (Score:2)
I think your parameter - time spent per actual bug - is more or less directly correlated to mine. You come up with 7 days per bug as reasonable, which is fine and something I can accept. I only need to divide out by the time to investigate a report (let's say 1 day per report) and we come up with an acceptable false positive level of about 7:1.
Using your numbers (50 bugs in 1 year), simply stated: 500:1 would be unacceptable; 50:1 may get one-time use; below 5:1 is definitely interesting for k
Re: (Score:2)
The actual blog article is here. https://joshua.hu/llm-engineer-review-sast-security-ai-tools-pentesters
It looks like this was a few months' work in total. This work was in evaluating a number of tools to see how well they work. It wasn't exclusively analyzing cURL - he found issues with sudo as well (a use-after-free that happened to be in an unused code path, and a TLS validation issue). So the amount of time specifically working on cURL was lower than the total amount of time on his project (which is
Does a bug follow procedure? (Score:1)
Re: (Score:3)
Assuming you're the manager, compare the time (or cost) it takes a highly skilled security specialist to identify 50 security issues with the time (or cost) it takes a bug wrangler to filter out 1450 invalid reports and identify 50 security issues. I assume a bug-filtering employee is probably less expensive than a security consultant.
Also, 1500 candidates is a fixed (even if large) number that ensures you can manage the task with your resources; for example, you can cap the time they spend on each so that you finish in fixed time. With the security consultant it's possibly more difficult to plan.
Re: (Score:2)
Assuming you're the manager, compare the time (or cost) it takes a highly skilled security specialist to identify 50 security issues with the time (or cost) it takes a bug wrangler to filter out 1450 invalid reports and identify 50 security issues. I assume a bug-filtering employee is probably less expensive than a security consultant.
Is there any justification for that assumption? The hard part is identifying the problem; the input only gives you hints. The arguments the AI makes can be highly deceiving, though it may indeed have found a bug. Remember, AI is very poor at explaining what it is doing.
Also, 1500 candidates is a fixed (even if large) number that ensures you can manage the task with your resources; for example, you can cap the time they spend on each so that you finish in fixed time. With the security consultant it's possibly more difficult to plan.
So basically you only want to find the easy-to-see bugs, not the hard ones. This assumption is also on shaky ground, as often the subtle, hard-to-identify problems are the important ones, and limiting the time spent on each candidate can exclude impo
Re: (Score:2)
...too low for a real person with budgeted time to be able to filter the output.
I'm waiting with bated breath for someone to suggest an AI filter the reports so the human can work more efficiently.
Re: (Score:2)
I'm waiting with bated breath for someone to suggest an AI filter the reports so the human can work more efficiently.
Well, obviously that's your mistake. You should have asked the AI whether it's a good idea to replace the humans with AIs.
Re: (Score:2)
Are you saying the security researcher they named isn't a real person? Or are you saying he spent a decade filtering through AI bugs? That's not a very reasonable assumption, is it?
Re: (Score:2)
It all depends upon the consequences of letting a bug through. I would say that for critical software used by millions, a 30:1 false positive ratio is acceptable - and even a fairly good result (I don't know the real false positive rate - I am going with your hypothetical result). Apple has raised its bug bounty program to $2,000,000 for certain bugs. How much effort does that justify in sifting through false positives? I would even think that Apple would find it worthwhile to assign a few programmers to use
Re: (Score:1)
If it is a larger number, then this is still a fail and unusable.
Do you have any rational basis for this claim? If there were 101 reports and 51 were bogus, the discovery of 50 legitimate flaws in a widely used and mature code base is somehow an unworkable process?
I believe we're witnessing the emergence of AIDS: AI derangement syndrome.
Re: (Score:2)
This is really well established. False positives make any detection system unusable. Sometimes not initially, but always in the longer run. This is _basics_.
Also, what about vulnerabilities these tools do not find? If there are patterns for those, this helps the attackers.
Re: (Score:2)
This is really well established.
[citation needed]
If there are patterns for those, this helps the attackers.
Now you're engaging in a regression chain: If false positives are high it's useless. And if it's not useless, it helps attackers. And if it doesn't help attackers...
Re:Fake "success" is fake (Score:5, Insightful)
> [citation needed]
If your test has a high false positive rate, you are spending extra time and resources investigating potential problems that are not actually problems. It also undermines trust in the system. The real world consequences of this are not hard to spot; If the fire alarm in your office or apartment has a record of going off without there ever being an actual fire, how much more likely are you to delay acting every time it goes off, or ignore it completely? People die because of this effect.
Medicine also has a real problem with false positives. Imagine testing positive for cancer, spending the next few months worrying about it and possibly getting treatments or surgeries which have their own risks (and expenses), only to find out it was all for nothing? This is also a very real thing that happens.
Now maybe the cost of false positives in the case of finding software vulnerabilities isn't quite so dire, but the effect is still real. For every positive result, someone has to spend time looking into it... otherwise, what's the point? So the LLM tells you there's a vulnerability in some part of the software, and you spend weeks trying to figure out how it works and how to fix it, only to conclude that it was never actually broken. How many times does that have to happen before people just stop taking the LLM's suggestions seriously? How much time and money are you willing to throw at imaginary problems before you conclude it's not actually worth it?
> Now you're engaging in a regression chain
Not really, no. If an attacker is aware of a vulnerability and the LLM fails to find it for an extended period of time, that could give them clues as to what makes that vulnerability difficult for the LLM to identify, and therefore where they might look for new vulnerabilities or even how to craft new malware that exploits those same blind spots. Doesn't seem very far-fetched.
And this is true regardless of how high the false positive rate is, because this is a false negative. Finding problems that aren't real, versus NOT finding problems that ARE real, are very different types of failure.
=Smidge=
Re: (Score:2)
Thanks. Good to see that some people are aware of the basic facts of the matter.
Re: (Score:2)
If the fire alarm in your office or apartment has a record of going off without there ever being an actual fire, how much more likely are you to delay acting every time it goes off, or ignore it completely? People die because of this effect.
Now apply the same logic to a colonoscopy.
Re: (Score:2)
> Now apply the same logic to a colonoscopy.
Okay?
A colonoscopy is diagnostic. Its analogy in terms of fire alarms is a fire drill, where you practice efficient evacuation and condition yourself to respond to the alarm.
What's your point? Did you have one?
=Smidge=
Re: (Score:2)
Re: (Score:2)
Sure. That is not the situation here. Anybody claiming that is simply incompetent and unaware of the respective literature.
Re: Fake "success" is fake (Score:2)
Re: (Score:2)
This is really well established. False positives make any detection system unusable.
I'll just give a counterexample, which invalidates your point. We have fire detectors in our house; every room has one in the ceiling. Once in a while, a detector goes off for some random reason, typically cooking (with lots of steam generated). So no fire. False positive. Does that make the detection system unusable? Do you feel that I should rip out those detectors?
Re: (Score:2)
While your point is valid, I don't think it applies in this case because the reporter didn't complain of too many false-positives, and the curl maintainer, who has complained in the past about AI crap, seems to be happy in this instance.
Re: (Score:2)
Which may well be an isolated instance. The point is that this "positive" result is reported in such a way as to make "AI" look good, but fails to actually do that to anybody with an understanding of the actual situation. And hence that report may well be a lie by misdirection.
Re: (Score:2)
Is this what clanker apologetics looks like?
Re: (Score:2)
I think so. Or mindless cult-like fanbois that are mentally incapable of seeing or accepting any problems with their fetish.
Re: (Score:2)
I think so. Or mindless cult-like fanbois that are mentally incapable of seeing or accepting any problems with their fetish.
This explains a lot, as we all know that you have a fetish for denouncing any AI/LLM use case, no matter how successful it is.
Re: (Score:2)
And here you operate on the kindergarten level of "no you are it". Fascinating. You may be dumb enough to actually benefit from using "AI".
Re: (Score:2)
Would you describe your rebuttal as being based on logic and facts, or rather on an appeal to emotion?
Re: (Score:2)
Don't pretend to be smarter than you are.
Re: (Score:2)
My guess is that there were probably thousands. It took someone who actually knows what the heck is being reported to ignore those.
Like one of the problems for just about everyone who operates a website is getting thousands of bug reports from "bug hunters" in China and India who ask for bug bounties while disclosing absolutely nothing. These all go straight to the trash. If you're not willing to give a two-sentence explanation of what "bug" you found, I'm assuming you just thought pressing F12 and seeing the s
Re: (Score:2)
Indeed. And then there is the problem of bugs it does not find. Creating a false sense of security is worse than knowing your code is insecure.
Need metrics on number of positives + hours needed (Score:3)
Need:
- Number of possible bugs found by the AI
- Cost or runtime of the AI to scan the source code
- Total human hours needed to analyze them
- Total human hours needed to setup and run the AI based source code scans (assuming multiple AI tools were used)
- Breakdown of the findings into
not a bug
a style issue
a bug that is handled elsewhere (a null check on the parameter is already done in another function)
a minor bug
an invalid OS library call
a post-API or other call which does not check for the right return value or m
Re: (Score:2)
Indeed. Without the full picture, this is just mindless "It did find bugs!" hype that has no real-world meaning. This hype is really stupid, but it seems to have become the standard for AI "success" reporting.
After all, I can trivially "find" all bugs in cURL or any other software: simply report all the code. That is obviously a completely useless approach, even though it really does flag every bug.
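For what it's worth, a minimal sketch of how the metrics requested above might be tallied; the field names and verdict categories are hypothetical, not taken from any real tool's output:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ScanEvaluation:
    """Hypothetical record for judging one AI scanning run."""
    candidate_findings: int = 0      # possible bugs reported by the tools
    scan_cost_usd: float = 0.0       # cost or runtime of the scans
    triage_hours: float = 0.0        # human hours spent analyzing reports
    setup_hours: float = 0.0         # human hours to set up and run the tools
    verdicts: Counter = field(default_factory=Counter)  # "not a bug", "minor bug", ...

    def false_positive_ratio(self) -> float:
        real = self.candidate_findings - self.verdicts["not a bug"]
        return self.verdicts["not a bug"] / real if real else float("inf")

# Example with made-up numbers:
run = ScanEvaluation(candidate_findings=200, scan_cost_usd=500.0,
                     triage_hours=160.0, setup_hours=24.0,
                     verdicts=Counter({"not a bug": 150, "minor bug": 40,
                                       "style issue": 10}))
print(run.false_positive_ratio())   # 3.0 -> three bogus reports per real finding
```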
Re: Need metrics on number of positives + hours needed (Score:3)
The person who made the report is a professional penetration tester. His usual method is to look for anything that could be wrong and then test whether it actually is. What he found is that the AI tools came up with potential issues he hadn't thought of, and they weren't all wrong, so it's a valuable tool to him because he normally runs out of ideas rather than running out of time to test them. He complained about the UI making it hard to go through large lists of reported issues exhaustively, and he only u
Re: (Score:2)
Pen-testers do not do full tests. They go after the easy possibilities because of budget and time constraints. That means this is even worse than real software security testing done with AI, which would be bad enough.
Re: (Score:2)
Given how common cURL is, how many things use it, I'd suggest that even a fairly substantial investment of energy, money, and time is probably worth it if any of these 50 bugs are severe.
New job opportunity? Human Manager of AI? (Score:2)
If a real expert did the work of validating the bug reports before submitting them, then I don't see the problem here. At least for the time being. The AI would have served as a useful tool, even though the bugs need to and should be validated a couple of times. Just part of the process of fixing the software without introducing new bugs or regression bugs. (But of course the worst bugs are going to involve interfaces with other software...)
So from that relatively optimistic perspective, you could argue thi
Re: (Score:2)
You do not see the problem that filtering out the false positives and validating the rest may have taken more effort than can be spent longer-term? Or more effort than a manual bug hunt would have taken? There are indications that using AI coding assistants reduces productivity by about 20%. I would not be surprised if we see something like that here as well.
Sure, these bugs have to be fixed, because attackers will use AI as well to find them, but overall it is quite possible AI use makes the situation worse.
T
Re: (Score:2)
Currently reading Nexus and feeling increasingly bleak about the future...
Re: (Score:2)
Well, whenever things start to look up, some assholes with power and money start being destructive and turning things back to shit.
Re: (Score:2)
From Joshua Rogers' blog (linked to in The Register's article: https://joshua.hu/llm-engineer-review-sast-security-ai-tools-pentesters)
However, I should note, in comparison to other SAST tools I have used in the past, the false positive rate here is extremely low. I am simply comparing the numbers between these tools.
Re: (Score:2)
This is for a pen-testing context. These are not suitable as software security tests as they are far too limited in coverage.
Re: (Score:2)
If the project itself was inundated with AI slop bug reports- then that's obviously bad, and no signal is going to make it through that noise.
However- if the person who filed the 50 valid reports was responsible for the filtering and testing, then we ultimately don't give one squirt of piss how many false positives they had.
Their input to the project had a 0% false positive rate.
The devil, as always, is in the details.
Re: (Score:2)
Apparently, this was done by a pen-tester. So there was pre-filtering to the easy cases, because pen-testers have to go for these due to the constraints they work under. That basically makes the headline a lie.
Re: (Score:2)
That basically makes the headline a lie.
They usually are ;)
Re: (Score:2)
As the project has had a big headache with "slop" reports, you can probably take their success seriously. If it had been more trouble than it was worth, they would have told you so, as they were annoyed previously and are happy now.
AI tools found shit (Score:2)
I'll quote TFS just in case:
Rogers identified assorted bugs and vulnerabilities
Re: AI tools found shit (Score:4, Insightful)
Re: (Score:2)
Don't expect AI to be autonomous
Why not? You must have missed all the hype that's telling me how humanity will be out of things to do because of the "AI".
In fact, there is no "AI", and the human programmers have always "paired" with tools to do their jobs.
Re: AI tools found shit (Score:3)
These tools are commonly called AI. Disagree with the name if you like, but not the reality.
Re: (Score:2)
It's qualitatively different.
No, it isn't.
Good luck using a complex piece of software for a non-trivial task just by "asking AI" instead of understanding the documentation.
Re: AI tools found shit (Score:3)
Re: (Score:2)
That's not at all what I said
I read what you said. I simply pointed out that the "right" way to use this stuff isn't "qualitatively different", and explained why.
You disagree emotionally, but you weasel out of the simple fact that your "qualitatively different" claim was resoundingly proven wrong.
No worries, happens here all the time to "AI" adepts.
Re: AI tools found shit (Score:2)
Re: (Score:2)
Also, programmers have been doing things to automate themselves away for decades, knowing they won't lose their jobs but will be able to deliver better results. Why did IT scale up that much? Because we don't need to build the foundation from scratch anymore. When did you last write assembler code? Many people nowadays can't even program in C and still create good programs, using compilers or interpreters that allow them to use constructs that require less knowledge about how to use the hardware efficiently. AI is
This (Score:2)
Conclusion (Score:5, Insightful)
The Register's conclusion? AI tools "when applied with human intelligence by someone with meaningful domain experience, can be quite helpful."
Experts using tools can actually get work done? That's some fine reporting there, Lou.
I think the thing AI can do (Score:2)
It's much harder to find a bug that you don't know is there than it is to validate whether something you suspect is a bug actually is one.
Re: (Score:2)
This is like that person who self-posted their own blog to HN with their remarkable insight: things that fit in your CPU's cache are faster than things that sit way out there in main memory. And yes, they did indeed ask ChatGPT to give them numbers - how did you guess?
Like Chess Engines? (Score:2)
Re: (Score:2)
No. Even the early chess engines used things like alpha-beta pruning and position evaluation functions. Chess was too complex to just calculate all possible moves. IBM *did* use a lot of brute force on top of that, but it requires the "intelligent" underpinnings.
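For reference, a minimal generic sketch of the alpha-beta idea mentioned above; the `evaluate`, `moves`, and `play` callables are placeholders for a real engine's evaluation function and move generator, and nothing here is from any actual chess program:

```python
import math

def alpha_beta(state, depth, alpha, beta, maximizing, evaluate, moves, play):
    """Minimax with alpha-beta pruning: skips branches the opponent would
    never allow, so the full move tree is never enumerated."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)          # static position evaluation at the frontier
    if maximizing:
        best = -math.inf
        for m in legal:
            best = max(best, alpha_beta(play(state, m), depth - 1,
                                        alpha, beta, False, evaluate, moves, play))
            alpha = max(alpha, best)
            if alpha >= beta:           # beta cutoff: rest of this branch is irrelevant
                break
        return best
    best = math.inf
    for m in legal:
        best = min(best, alpha_beta(play(state, m), depth - 1,
                                    alpha, beta, True, evaluate, moves, play))
        beta = min(beta, best)
        if alpha >= beta:               # alpha cutoff
            break
    return best
```

The point of the comment stands: the evaluation function and the pruning, not raw clock speed, are what made fixed-depth search playable.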
Re: (Score:2)
You cannot even represent the state space on any PC. No, not even on your hypothetical future one; there are a few laws of physics against that.
Re: (Score:2)
No, the thing is, ever since the '90s it was possible to compute all possible moves. In a lot of early chess games on the 8088/8086/80286, the computer AI players had to be nerfed to only think ahead by X many turns because they would otherwise be unbeatable.
I remember playing Battle Chess, and the computer would sometimes take a whole minute to "think" before making a move. Computers are on the whole 1000 times faster in clock speed alone since then, never mind anything else. I'd assume that anyone playing ch
Project maintainer gets it! (Score:5, Insightful)
"In my view, this list of issues achieved with the help of AI tooling shows that AI can be used for good," he said in an email. "Powerful tools in the hand of a clever human is certainly a good combination. It always was!"
* This is how AI should be used - not the view from corporate executives and MBAs that it should be used to create machine slaves to replace people.
* Powerful tools in the hands of foolish humans result in bullshit bug reports and "vibe-coded" applications with a total lack of critical analysis.
AI has real uses, but it is far from being the wish-granting genie that a significant portion of the population seems to believe it is.
Great at finding bugs with a caveat (Score:4, Insightful)
The tools I use are fantastic at this. But, there is a massive caveat. I can look at the bug identified, and I can then proceed to fix it. Great. But, if I use the AI tool to provide me the "fixed" code, it is often very broken. To the point of not compiling, or leaving out major functionality. Along with it may very well introduce major bugs of its own.
One of my favourite examples was where I was using threading very correctly. It then yanked out everything which was there to prevent obvious race conditions and other critical aspects of threading. It was hot garbage. But, the original bug I had been hunting was correctly identified.
AI is a very useful tool, but it is not a programmer. I'm sick of seeing people claim it is a programmer, "proving" this with apps of roughly the complexity of a TODO app.
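A minimal illustration (hypothetical, not the poster's actual code) of the kind of guard such a "fix" can yank out:

```python
import threading

counter = 0
lock = threading.Lock()   # exactly the kind of line an overeager "fix" deletes

def increment_many(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:        # without it, the read-modify-write below can interleave
            counter += 1

threads = [threading.Thread(target=increment_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; can be less if the lock is removed
```

Code like this still compiles and usually "works" after the lock is stripped, which is what makes that class of AI-introduced bug so nasty.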
Re: (Score:2)
Ultimately, it's very much a dark art of mixing agentic tools and models.
It definitely doesn't replace an actual programmer though, as they're going to need to know how to move around the various reagents and be able to identify when they've moved into hot garbage territory.
Using an iterative process, I was able to whip up a reactive TUI application for SNMP network monitoring and graphing of multiple nodes to get a "birds eye"
Joshua Rogers from Poland? (Score:2)
"Joshua Rogers, a security researcher based in Poland."
I know that "based" doesn't mean he was born there, but did anyone else do a double take at that name + country combination?
That's right up there with "Dmitri Peskov, a cattle breeder working out of Texarkana, TX"...
Re: (Score:2)
I don't really see why it's interesting, because people move around a lot these days. Do you have a low opinion of Poland, or do you want to be more impressed by his origins?
This information is trivial to find, BTW:
https://joshua.hu/about.html [joshua.hu]
A fool... (Score:2)
"A fool with a tool is still a fool"
You have to know how to use the tool...