Robots Dream of Root Shells

It's been an incredible year for AI. Back in the early 2000s, there were AI posters up all over my local computer science department, and it was all genetic algorithms, genetic programming, and particle swarm optimization as far as you could see. They could figure out if a circle was centered on an image, but even that didn't work very well. It was the tail-end of a long AI winter. Fast forward to today and we're seeing all sorts of emergent stuff, the rate of progress is off the charts, and there's not a genetic algorithm in sight.

Recently I've been following the new DARPA competition, the Artificial Intelligence Cyber Challenge (AIxCC). The basic idea is to discover if we can use AI to find security vulnerabilities, and then use AI to fix them as well. The net result would be an autonomous system that can make software more secure with no humans in the loop, and that could have big implications for... well, all the software in the world.

If that sounds familiar, the goal of AIxCC is somewhat similar to a previous DARPA competition, the Cyber Grand Challenge (CGC) in 2016. The original hope for CGC was that we could take the thought processes used by capture the flag (CTF) players and turn them into automated systems, and overall it was quite successful. I suspect the organizers were expecting more in the way of novel program analysis techniques, but in practice most competitors converged on using fuzzing. Perhaps the inadvertent success of CGC was in highlighting how much more effective fuzzing is than other automated bug discovery methodologies, because there's been a huge amount of energy and attention around fuzzing in the past 8 years, and not so much on all the other stuff.

For better or worse, the CGC was designed around a highly contrived execution environment and relatively simple challenge binaries. This made the problem tractable, but it did raise some questions about real-world applicability, particularly around the exploit generation and binary patching parts. The idea of AIxCC is to set up a similar competition structure, but to drop the contrivances. Challenges will be based on real software like the Linux kernel and Jenkins, and all of the source code will be available. Find the bugs, fix the bugs, but you don't need to solve the exploit generation problem. 

Overall, AIxCC is probably a harder challenge than CGC, if only because performing automated reasoning on such vastly different types of software is so hard, and you can't get much more different than a modern operating system kernel and a Java-based CI/CD server. If you think about it, the Venn diagram intersection of people who even know how to get both of these things set up and running is going to be fairly tiny.

But maybe that's the point? We have AI now, and AI is in that intersection. AI is in the intersection of every Venn diagram.

The problem is that AI doesn't seem to be very good at finding security bugs yet, even when the model is specifically designed to analyze code. There are some promising results around using AI to solve CTF challenges, but so far that hasn't translated to real world software very well. To get a sense for how quickly this space is moving though, a few days ago it looked like we had a potential leap forward in AI-automated bug discovery for the Linux kernel, only for it to quickly fizzle out once the details were checked.

It turns out that finding security bugs in real software is hard – impossibly, stupidly hard – at least from the perspective of computational complexity. The basic problem is state explosion, where each system interaction leads to an exponential number of new possibilities, each of which in turn leads to more, and so on. If you frame "find a bug in this source code" as a search optimization problem, then the search space is mind boggling. One way to make it tractable is to simplify the problem: CTF problems, CGC challenge binaries, looking at a single file or function at a time. But real world security bugs don't work like that – they involve huge codebases with all sorts of cross-function, cross-library, and cross-process interactions that blow up the search space immediately.
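
To make that concrete, here's a toy back-of-the-envelope sketch. The branch and component counts are made up purely for illustration, but the shape of the arithmetic is the point:

```python
# Toy illustration of state explosion (all numbers are made up for illustration).
# Each independent branch roughly doubles the number of execution paths, and
# paths multiply across a chain of interacting components.

branches_per_function = 20          # a modest handler with 20 independent ifs
paths_per_function = 2 ** branches_per_function
print(paths_per_function)           # 1048576 paths in a single function

components_in_call_chain = 4        # a request that flows through 4 such components
total_paths = paths_per_function ** components_in_call_chain
print(f"{total_paths:.3e}")         # ~1.209e+24 combined paths to reason about
```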

This is why fuzzing has been winning the methodology wars. When the search space is this big, all of the fancy program analysis stuff breaks down, and you're left with some fairly primitive tools – random mutations with a code-coverage/compare-value feedback loop, and a bunch of clever trimming of the search space (like enabling compiler sanitizers to make bugs easier to trigger). But maybe AI can change that?
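
To illustrate what that primitive-but-effective loop looks like, here's a minimal sketch of a coverage-guided mutation fuzzer running against a toy target. Real fuzzers like AFL++ get their coverage from compile-time instrumentation and use far smarter mutation and corpus scheduling, so treat this as the shape of the idea rather than an implementation:

```python
import random

def toy_target(data: bytes) -> set:
    """A toy stand-in for an instrumented target: returns the set of branch IDs
    it executed. Real fuzzers get this from compile-time coverage instrumentation."""
    hit = {"entry"}
    if len(data) > 4:
        hit.add("len_check")
        if data[0] == ord("F"):
            hit.add("magic_1")
            if data[1] == ord("U"):
                hit.add("magic_2")
                if data[2] == ord("Z"):
                    hit.add("magic_3")  # pretend this deep path hides a bug
    return hit

def mutate(data: bytes) -> bytes:
    """Primitive random mutation: overwrite one byte at a random offset."""
    buf = bytearray(data)
    buf[random.randrange(len(buf))] = random.randrange(256)
    return bytes(buf)

def fuzz(seed: bytes, iterations: int = 200_000) -> list:
    corpus, seen = [seed], set()
    for _ in range(iterations):
        candidate = mutate(random.choice(corpus))
        coverage = toy_target(candidate)
        if not coverage <= seen:   # hit a new edge: keep this input in the corpus
            seen |= coverage
            corpus.append(candidate)
    return corpus

print(fuzz(b"AAAAAAAA")[-1])  # usually converges on inputs starting with b"FUZ"
```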

At the moment there's a minor practical problem: the LLM tech of 2024 has a relatively small context window, and so it's usually not possible (or cost-effective) to reason on the entire system at once, unless the system is very basic. That means splitting the work up into smaller chunks, which is tricky to do in a way that doesn't introduce ambiguity or incorrectness into the analysis, and that will also fundamentally limit the ability to find bugs that are "spread out" across a larger codebase (like use-after-free bugs, which tend not to be localized to a single function). I don't think this is going to be a big problem in the future, because there are already models coming down the pipeline with context windows large enough to handle 100k lines-of-code (LOC), and that number keeps growing. To put this in context though (excuse the pun), the entire Linux kernel is about 27 million LOC – so there's still a long way to go, and there's definitely some uncertainty about how well the transformer architecture will continue to scale.
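
As a rough sketch of what that chunking implies, here's some back-of-the-envelope arithmetic. The tokens-per-line and context-window figures are assumptions I've picked purely for illustration:

```python
import math

# Back-of-the-envelope chunking arithmetic. TOKENS_PER_LINE and CONTEXT_TOKENS
# are assumptions for illustration; real numbers depend on the tokenizer and code.
TOKENS_PER_LINE = 12           # rough guess for C-style source
CONTEXT_TOKENS = 1_000_000     # a hypothetical large-context model
KERNEL_LOC = 27_000_000        # approximate Linux kernel size mentioned above

lines_per_window = CONTEXT_TOKENS // TOKENS_PER_LINE
windows_needed = math.ceil(KERNEL_LOC / lines_per_window)

print(lines_per_window)              # 83333 -- roughly the 100k LOC figure above
print(windows_needed)                # 325 chunks to cover the kernel once
print(math.comb(windows_needed, 2))  # 52650 chunk pairs if a bug spans two chunks
```

The last number is the uncomfortable one: bugs that span two chunks turn a single linear pass over the code into a quadratic one.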

The bigger problem is that AI can't just magically erase the state explosion involved in this type of analysis, and it still has to find a way to navigate the same search space that a fuzzer or a human code reviewer does. We know that humans can navigate this search space with some degree of success, and that sometimes humans can even find bugs that fuzzers can't. With that said, the reason we want to automate this task is that human code auditors are excruciatingly slow and generally pretty unreliable. 

So how will the AI navigate this search space? I've spent years talking to security researchers about their techniques and approaches for finding security bugs, and when it comes to code review, we're really quite bad at explaining how we do it. The notion of "show your working" never caught on in our scene, and so you'll find hundreds of blog posts that intricately explain how to trigger a bug and what happens next, but very few that succinctly explain the steps that led to that discovery, and even fewer that explain why those exact steps were chosen in the first place.

All of this is to say that the training corpus for AI-driven bug hunting seems to have some problematic gaps. We can still expect some amount of emergent behavior, but it's too much to expect these systems to match or surpass a human reviewer if we can't even begin to describe what the "bug hunter's mind palace" really looks like. The promising news here is that AI is getting really good at in-context learning, so if you can find a way to describe the thought process behind a human code review, then good prompt engineering should be able to transfer some of those insights, even if the training corpus is bad.
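
As a sketch of what that kind of transfer might look like, here's a hypothetical prompt that tries to encode a reviewer's checklist in-context. The checklist wording and the call_llm() helper are placeholders I've made up, not a known-good recipe:

```python
# A sketch of encoding a reviewer's (rarely written down) process as an
# in-context prompt. The checklist wording and call_llm() are hypothetical
# placeholders, not a known-good recipe.

REVIEWER_PROCESS = """You are auditing C code for memory-safety bugs. Work the way an
experienced reviewer would:
1. List every place untrusted input enters this code.
2. Trace each input to the length calculations, allocations, and copies it reaches.
3. For every allocation, note where it is freed and whether any pointer to it can
   be used after that point.
4. Flag any size or index arithmetic that can wrap or go negative.
Only report findings you can justify with a concrete data flow, quoting the lines involved."""

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model API is actually in use."""
    raise NotImplementedError

def review_chunk(source_code: str) -> str:
    return call_llm(f"{REVIEWER_PROCESS}\n\nCode to review:\n{source_code}")
```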

As an aside, I suspect the public training corpus is only going to get worse – I've talked about this in the past, but recently there's been a steady divergence between the public and private state-of-the-art in security research. Attackers are highly incentivized to keep their knowledge hidden and to share it with as few people as possible, and they're also investing in exploit development at a much higher rate than defenders are. The net effect is that the security research you see in public presentations and blog posts is often on a completely different wavelength to the stuff that's happening in private. That's a big opportunity if you're one of the handful of groups with access to enough relevant R&D to build a training dataset that includes modern exploit development, but it's not such good news for defenders.

So what does this all mean for AIxCC? There's some chatter that early entries are competitive with traditional fuzzers, but nobody seems willing to show their hand before the big event. It looks like there might be two strategies emerging though. The first, AI-focused, will try to use the analytical ability of LLMs to directly point out where vulnerabilities in the source code lie. The second, fuzzing-focused, will use AI to assist in setting up a more traditional fuzzing environment.

Interestingly, Mayhem, the winner of the previous CGC competition, wasn't included in the funded track for AIxCC. Perhaps the organizers thought that Mayhem's proposal was too fuzzing-heavy? I don't know for sure, but I hope that Mayhem competes in the open track so that we can get a comparison of AI-focused and fuzzing-focused approaches to this problem, and it would be nice to have that historic link back to CGC.

If I were designing an entry, I would be leaning toward a fuzzing-focused approach, particularly given the short timelines and the current code analysis limitations of LLMs in 2024. The counter-argument to this is that the organizers would probably prefer an AI-focused winner, so there's a decent chance that challenges will be designed in a way that's highly amenable to that approach.

My intuition would still be to use LLMs to 1) help build the target projects with coverage instrumentation and sanitizers enabled, 2) find a good seed corpus (or a good way to generate one), and 3) generate the actual fuzzing harness. Creating fuzzing harnesses is a lot of manual work, and it looks like Google has had some good success with LLM-generated harnesses. Then I'd let a target-appropriate fuzzer (syzkaller, afl++, ffuf, domato, etc) do the heavy lifting on the actual bug discovery part. Once you have an interesting test-case in hand, I think you could bring the LLM back into the picture for the source code patching process. LLMs seem to be able to handle well-scoped debugging tasks, and you can narrow the problem space down significantly once you have a test case that triggers something interesting (like a crash). Based on that, I think you should be able to get some quite good results on the automated patching side, and hopefully this is the area where we will see a lasting impact from AIxCC.
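
To sketch how those pieces might fit together, here's the rough shape of that pipeline. Every helper is a hypothetical placeholder (generate_harness() and generate_patch() would be LLM-backed, run_fuzzer() would wrap a real fuzzer and a sanitizer-instrumented build), so this is the orchestration skeleton rather than working tooling:

```python
# A sketch of the fuzzing-focused pipeline described above. Every helper is a
# hypothetical placeholder: generate_harness() and generate_patch() would be
# LLM-backed, run_fuzzer() would wrap a real fuzzer (AFL++, syzkaller, ...).

from dataclasses import dataclass

@dataclass
class Crash:
    test_case: bytes
    sanitizer_report: str    # e.g. an ASAN use-after-free report

def generate_harness(project_dir: str) -> str:
    """Ask an LLM to write a fuzzing harness for the project's entry points."""
    raise NotImplementedError

def run_fuzzer(project_dir: str, harness: str, budget_hours: int) -> list:
    """Build with coverage instrumentation and sanitizers, run the fuzzer for a
    time budget, and return deduplicated crashes."""
    raise NotImplementedError

def generate_patch(project_dir: str, crash: Crash) -> str:
    """Hand the LLM the crashing test case plus its sanitizer report (a much
    narrower problem than open-ended bug hunting), ask for a source patch, and
    re-run the test case to confirm the crash is gone."""
    raise NotImplementedError

def pipeline(project_dir: str) -> list:
    harness = generate_harness(project_dir)
    crashes = run_fuzzer(project_dir, harness, budget_hours=24)
    return [generate_patch(project_dir, crash) for crash in crashes]
```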

The AIxCC semi-finals are due to take place in Las Vegas later this year, with the final being held a year later in August 2025. It looks like we're going to find out soon if AI can find and fix bugs in real world software. If it can, then things are going to get exciting very quickly.

Good luck to all of the competitors, I'll be watching from afar!

- Ben Hawkes