This page was created programmatically. To read the article in its original location, go to the link below:
https://newsletter.semianalysis.com/p/finding-miscompiles-for-fun-not-profit
If you would like this article removed from our site, please contact us.
I’ve worked on compilers for ML for the last decade across Google, Waymo, and OpenAI. This includes CUDA support in clang, XLA:GPU, Triton, and OpenAI’s custom hardware. I’ve seen stuff. But over the past week or so I had one of the most unsettling experiences of my career: In one afternoon, I spent more than $10,000 running AI agents over compiler code, finding hundreds of plausible bugs in LLVM, including many miscompiles and at least one that’s Quite Serious. This is the story of how I got here and where we might be going.
In January 2026, I decided to try to find some bugs in LLVM (the compiler behind clang, rustc, and AMD’s GPU compiler, among others), as a personal project. Codex and I collaboratively wrote a fuzzer. The basic idea is to generate a random program, run it through part of the compiler, and then check that the resulting program after compilation does the same thing as the original program (usually just by running the two programs). I spent a few weeks on it, and I found and fixed five bugs in instcombine, LLVM’s peephole optimization pass. After that, my fuzzer started taking longer to find bugs, and I lost interest.
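The generate, transform, and compare loop described above can be sketched on a toy scale. This is a minimal illustration under stated assumptions, not the fuzzer from the article: the expression IR, the `peephole` function, and the deliberately wrong `0 - x -> x` rewrite are all hypothetical stand-ins for LLVM IR and a real optimization pass.

```python
import random

# Toy expression IR (an assumption for illustration): an int constant,
# the variable "x", or a tuple (op, lhs, rhs).
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
MASK = 0xFFFFFFFF  # model 32-bit wrapping arithmetic


def gen(rng, depth=3):
    """Generate a random expression tree."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else rng.randrange(0, 4)
    return (rng.choice(sorted(OPS)), gen(rng, depth - 1), gen(rng, depth - 1))


def evaluate(e, x):
    """Reference interpreter: the 'just run the program' step."""
    if e == "x":
        return x & MASK
    if isinstance(e, int):
        return e & MASK
    op, a, b = e
    return OPS[op](evaluate(a, x), evaluate(b, x)) & MASK


def peephole(e, buggy=False):
    """Hypothetical stand-in for an optimization pass (not instcombine)."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], peephole(e[1], buggy), peephole(e[2], buggy)
    if op == "mul" and b == 1:
        return a  # x * 1 -> x (correct)
    if op == "add" and b == 0:
        return a  # x + 0 -> x (correct)
    if buggy and op == "sub" and a == 0:
        return b  # 0 - x -> x (deliberately wrong, to give the fuzzer a target)
    return (op, a, b)


def fuzz(seed, iterations, buggy):
    """Return the first (program, input) where the pass changes behavior."""
    rng = random.Random(seed)
    for _ in range(iterations):
        prog = gen(rng)
        opt = peephole(prog, buggy)
        for x in (0, 1, 2, 7, MASK):
            if evaluate(prog, x) != evaluate(opt, x):
                return prog, x
    return None
```

Calling `fuzz(0, 5000, buggy=True)` searches for a program whose behavior the broken rewrite changes, while `buggy=False` should never report a mismatch because the two remaining rewrites preserve the result. Comparing on a handful of fixed inputs is a simplification; it can miss bugs that only show up on other values.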
Fast forward to mid-May 2026. I joined SemiAnalysis as a contractor, and I decided to try applying the same technique to NVIDIA’s low-level compiler, ptxas. I expected this to be less fruitful than fuzzing LLVM, for a few reasons.
- In general, fuzzers can get “stuck”: Once they find a bug, they can keep finding new ways to trigger it. With an open-source compiler like LLVM, you can “just” fix the bug and then continue fuzzing. But with a closed-source compiler like ptxas, the best you can do is try to modify your fuzzer so it doesn’t generate inputs that trigger the same bug. Implementing this is tedious at best.
- With LLVM I can run just one pass (e.g. instcombine), whereas with ptxas I have to run the whole compiler end-to-end. I worried that this would make some bugs require larger or more complicated reproducers, making them less likely to be found by a fuzzer.
- I can build LLVM myself, so I can compile with flags that add instrumentation to the program under test to help the fuzzer choose “interesting” inputs that explore new parts of the program. Although AFL++ has modes for gathering instrumentation from precompiled binaries, they slow down the program under test, and in general I didn’t expect them to be as useful. (I didn’t end up actually using those modes when fuzzing ptxas; I just did purely undirected fuzzing.)
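One simple way to keep a fuzzer from getting stuck on a bug that cannot be fixed is to reject generated inputs that match a known trigger. The sketch below assumes programs are flat lists of opcodes and that each known bug can be recognized by an adjacent opcode pair; the opcode names and the patterns are made up for illustration and are not real ptxas bug signatures.

```python
import random

# Hypothetical opcode names, for illustration only; not a real PTX subset.
OPCODES = ["add", "sub", "mul", "shl", "shr", "and", "or"]

# Each known bug is described by an adjacent opcode pair that triggers it.
# These pairs are invented; a real signature would come from a minimized
# reproducer for a bug that has already been reported.
KNOWN_BUG_PATTERNS = [("shl", "shr"), ("mul", "sub")]


def triggers_known_bug(program, patterns=KNOWN_BUG_PATTERNS):
    """True if any adjacent opcode pair matches a known-bug pattern."""
    pairs = set(zip(program, program[1:]))
    return any(p in pairs for p in patterns)


def generate(rng, length=6):
    """Generate a random straight-line program."""
    return [rng.choice(OPCODES) for _ in range(length)]


def generate_fresh(rng, length=6, max_tries=1000):
    """Rejection-sample until the program avoids every known-bug pattern."""
    for _ in range(max_tries):
        program = generate(rng, length)
        if not triggers_known_bug(program):
            return program
    raise RuntimeError("known-bug patterns exclude too much of the input space")
```

The trade-off is coverage: every rejected pattern also hides any other bug that only appears in programs containing it, so signatures are best kept as narrow as the reproducer allows.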
I expected maybe I’d find a handful of bugs after a few weeks of work, like before.
Instead, in three days, I had 40 programs that ptxas miscompiles. (A week later, this number is up to about 80.) Although some of these test cases probably reflect the same underlying bug in the compiler, I was still kind of astonished. Many of the reproducers reduce to fairly “normal-looking” instruction sequences.
If you want to take a look at the bugs I found (really, that Codex and Claude found), they live in the FuzzX repo on GitHub.
Why was my fuzzer so much easier to write this time? As far as I can tell, it was the difference between ChatGPT 5.2 and 5.5. This time, I vibe-coded the entire fuzzer, never writing a line of code myself. The LLM did the tedious job of changing the fuzzer after every bug we found so that we wouldn’t get stuck finding the same bug over and over. It also minimized each test case it generated, sometimes spending an hour or longer doing so. It independently decided which PTX instructions to fuzz, and which sequences of instructions were “safe” (i.e. didn’t trigger undefined behavior). I just put it in a loop using /objective and went to sleep.
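Test-case minimization, mentioned above, can be as simple as repeatedly deleting instructions while the failure still reproduces. The following is a minimal greedy sketch, not the procedure the LLM actually followed; `reproduces` is a hypothetical stand-in for "compile, run, and compare", and the opcode names are invented.

```python
def reproduces(program):
    """Hypothetical failure check: 'fails' when a shl is followed later by a shr.

    In a real setup this would compile the program, run it, and compare the
    result against a reference.
    """
    return "shl" in program and "shr" in program[program.index("shl"):]


def minimize(program, still_fails):
    """Greedily drop one instruction at a time while the failure persists."""
    assert still_fails(program), "start from a failing program"
    changed = True
    while changed:
        changed = False
        for i in range(len(program)):
            candidate = program[:i] + program[i + 1:]
            if still_fails(candidate):
                program = candidate
                changed = True
                break  # restart the scan on the shorter program
    return program


reduced = minimize(["add", "shl", "mul", "or", "shr", "sub"], reproduces)
print(reduced)  # ['shl', 'shr']
```

Each candidate costs one compile-and-run cycle, which is part of why minimizing a real reproducer can take a long time.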
To be clear, it’s not surprising that fuzzing found bugs. But it was surprising that I found this many bugs so quickly, with almost no manual effort.
Naturally, my next question was whether I could also find bugs in LLVM’s AMDGPU backend. You won’t be surprised to hear that I could, at roughly the same rate as I found bugs in ptxas. At some point, my personal ChatGPT Pro account ran out of credits and I switched to SemiAnalysis’s Claude account. I didn’t notice a difference in quality between Opus 4.7 and ChatGPT 5.5; they were both great.
The story could end here. I reported the bugs I’d found to AMD and NVIDIA. As of this writing, AMD has already fixed five of them, and because their compiler is open-source, I can apply their fixes immediately. If any of these bugs were important to me, I could even fix them myself. Hooray for open-source toolchains.
At this point my ptxas and AMDGPU fuzzers started to slow down; they were running for increasingly long times without finding any new bugs. I almost stopped there, but I had a thought. What if I just asked Claude to read through LLVM and find bugs? In other words, what if I was building a fuzzer because I hadn’t absorbed the Bitter Lesson?
I asked Claude to spin up 50 subagents at a time looking for bugs. Boy did I get them. Claude found bugs at a rate of one every 4 minutes. (In contrast, at this point the fuzzer was taking hours to find new bugs.)
My initial reaction was that I don’t even think this means LLVM’s AMDGPU backend is particularly buggy. Indeed, a friend suggested I do the same to the x86 backend, and it produced bugs at a rate of almost two per minute.
My bug-finding agents didn’t show any signs of slowing down before I stopped them. I don’t know how many more issues are out there. Probably a lot?
The question with automated bug-finding usually is, “Do these bugs matter?” I’ve only gone through about 20% of the x86 bugs so far. The bugs found by agents are definitely less severe on average than those found by the fuzzer; every fuzz bug is a demonstrable miscompile, whereas agents are using their judgement to decide what is and isn’t a bug, and sometimes they’re wrong.
On the other hand, one of the 30 or so agent-discovered bugs I’ve examined is this extremely scary case where LLVM will turn an atomic store into two non-atomic stores. This bug would have been very difficult to find via fuzzing (fuzzing atomics is hard), and would likely be both quite bad and quite hard to root-cause if it occurred in production (because downgrading an atomic store to a non-atomic one will be fine 99% of the time, and 1% of the time will corrupt your data).
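To see why splitting an atomic store is dangerous, consider what a concurrent reader can observe between the two halves. The sketch below is a single-threaded simulation under assumptions the article does not state: it assumes a 64-bit store split into two 32-bit stores with the low half written first, and the values `OLD` and `NEW` are chosen for illustration. It is not a reproduction of the LLVM bug.

```python
LOW32 = 0x0000_0000_FFFF_FFFF
HIGH32 = 0xFFFF_FFFF_0000_0000

OLD = 0x0000_0000_FFFF_FFFF  # value in memory before the store
NEW = 0x0000_0001_0000_0000  # value being stored (OLD + 1)


def atomic_writer(mem, value):
    """One indivisible 64-bit store: a reader sees OLD or NEW, nothing else."""
    mem["v"] = value
    yield  # point where another thread could observe memory


def split_writer(mem, value):
    """The same store performed as two 32-bit halves (assumed low half first)."""
    mem["v"] = (mem["v"] & HIGH32) | (value & LOW32)
    yield  # a reader running here sees a torn value
    mem["v"] = (mem["v"] & LOW32) | (value & HIGH32)
    yield


def observed_values(writer):
    """Record what a reader could see before and after each writer step."""
    mem = {"v": OLD}
    seen = [mem["v"]]
    for _ in writer(mem, NEW):
        seen.append(mem["v"])
    return seen


print([hex(v) for v in observed_values(atomic_writer)])
# ['0xffffffff', '0x100000000']
print([hex(v) for v in observed_values(split_writer)])
# ['0xffffffff', '0x0', '0x100000000']
```

In the split case a reader can see `0x0`, a value that was never stored by anyone. The window is tiny, which matches the "fine 99% of the time" behavior described above and is why such a bug is hard to catch by running programs.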
The other question you should ask is, how much did all this cost? The vibe-coded fuzzer was relatively cheap. I was using about double the weekly quota for my $200/month ChatGPT Pro account to write the AMD and NVIDIA fuzzers at the same time. It was more expensive with pay-per-token Opus 4.7, on the order of $1000 for a few days’ work. I don’t know if it would have been cheaper with a Claude Max account, and I don’t know if the work I was doing with Claude was more token-intensive than the work I’d been doing with Codex earlier.
On the other hand, having an army of subagents read the code was, um, not cheap. I spent more than $10,000 in a few hours, and I wasn’t even using fast mode. (Thanks for the tokens, Dylan!) Even though bugs found this way were on average lower-severity than fuzz bugs, the fact that code inspection can find whole classes of bugs that would be very challenging to find via fuzzing makes it valuable to me.
Frankly, just the one atomics bug mentioned above could do far more than $10,000 in damage under the right (or perhaps wrong) circumstances; silent data corruption like this would break almost any production system. And even if your system had safeguards that allowed it to notice this kind of corruption, you’d still likely burn months of engineering time hunting for its source. If I were doing this again, I wouldn’t bother with fuzzing, assuming I had the budget to unleash my herd of agents.
I’m still trying to make meaning out of this experience. I think it’s more than simply “With enough subagents, all bugs are shallow.” Maybe the lesson is this:
Things that were impossible five months ago are now “just” Very Expensive.
A corollary is, if you don’t have the budget, you’re working in a smaller part of the possibility space than those who do. And I expect that gap to grow significantly over the coming months.
SemiAnalysis spent an order of magnitude more on my tokens than on paying me the day I ran the agents. If the value of something is how much someone is willing to pay for it, then that was the first time in my career that I delivered less value to my employer than my AI did.
It makes me wonder how much SemiAnalysis will be willing to pay for tokens in six months, and what life will be like for people and companies who can’t or won’t pay.
One last thought. I could do the “have Claude read the code” approach with the AMDGPU and x86 LLVM backends because I have the source code. I can’t do it with ptxas because I only have a binary. But…how sure am I that, given enough tokens, Opus 5.7 (or even 4.7) couldn’t find bugs just by reading the assembly?


