How to Build a Fuzzing Corpus

Fuzzing for security vulnerabilities is a strange thing. Throwing randomly generated or mutated data at an application until it crashes sounds like an extremely primitive way to find vulnerabilities, and yet the last decade is full of fuzzing success stories.

In many respects, it's still poorly understood why fuzzing works so well, and where its limitations may lie. A lot of what we know about fuzzing was originally discovered through trial-and-error rather than rigorous scientific method, and the science of fuzzing has only recently caught up.

I was preparing a different article on the "state-of-the-art" in fuzzing (stay tuned for that one) when it occurred to me that there was a small gap in the articles and papers I was reading about fuzzing. Specifically, I couldn't find many resources about how to create a "seed corpus" from scratch. Since a seed corpus is often an important part of an effective fuzzing campaign, I thought I'd briefly share what I know about the topic.

This post covers 1) what a "seed corpus" is used for, 2) why they are important for fuzzing, and 3) how you can build your own fuzzing corpus in a scalable way. I'm also sharing a new open source tool called "common-corpus" that can help with this process.

What is a seed corpus?

At a high level, fuzzing is split into two different types: generational and mutational.

In generational fuzzing, a grammar is manually defined, typically using a file format that resembles EBNF with some domain-specific extensions. Then each test case is a random walk through the graph of tokens that the grammar defines. Domato is a good example of this approach.

In mutational fuzzing however, the idea is to take an existing file and make a series of random mutations (such as flipping, reordering, removing, and inserting data) before having the target application parse the mutated file. The files that are mutated are selected from the seed corpus, which is a collection of files that exhibit interesting behavior for the application that's being tested.

Why is a seed corpus important?

The goal of a fuzzer is to find an input that either crashes or triggers a rare condition that we are monitoring for.

However, fuzzing practitioners have observed that there is an important interim goal that seems to be deeply connected to the end goal of crashes. As a general rule, if you increase the amount of code that the fuzzer can reach in the target application (e.g. increase code coverage), then you increase the number of crashes that fuzzer will find.

A large portion of modern fuzzing research is oriented around increasing code coverage, usually through increasingly complex feedback loops. The end result is truly impressive; you can watch your fuzzer "grow" highly structured inputs that include file signatures, TLVs, data structure fields, strings, and checksums that the target application expects.

However we know that 1) growing "interesting" inputs that are very different from the base file that you're mutating is computationally expensive, and 2) there are certain types of data structures interdependencies that fuzzers might not be able to solve in a timely manner, resulting in a roadblock in terms of code coverage.

This is why a good seed corpus is important, it allows your fuzzer to start as close as possible to the "interesting" points in the target application, and it allows your fuzzer to jump over potentially unsolvable cliffs in code coverage.

The science of fuzzing has really advanced in recent years, and now we can benefit from peer-reviewed papers that back-up many of the intuitions that we gained with hands-on experience. In 2021, a team led by Adrian Herrara published "Seed Selection for Successful Fuzzing". They compared the effectiveness of using an empty file, using a single file, and using a minimized corpus of pre-selected files, and the results were clear: "Seed selection has a significant impact on a fuzzer’s ability to find bugs. Both AFL and AFL++ perform better when bootstrapped with a minimized corpus."

How is a seed corpus created?

In practice it's quite easy to create a seed corpus, but it's relatively hard to create a good seed corpus. Consider all of the following questions. Should you have multiple files that cover the same feature, or just one? Should you choose the smallest files for each feature, or the largest, or something in the middle? Should you minimize each file individually, or keep the file in its original state? Should you remove files that only have common features, or keep them in?

The way I learned this at Google (mostly by copying whatever Tavis was doing) was to create the smallest set of files that cover the largest amount of branch coverage in the target application. This is sometimes called the set cover minimization, or corpus distillation. Then the fuzzer would take care of the rest by running at scale and using code-coverage feedback loops.

But how can we actually find the files to put into the seed corpus in the first place? There's a few different options:

Create them.

For any given file format, there's almost certainly a number of different ways to create files of that type. The idea here is that if there's a feature with a security bug in it, there will be some tool to create files that use that feature. This could mean running a command line tool with every different combination of options, or progressing through every option in the GUI of an authoring tool. Ideally you want to do this with all of the possible tools for your file format, since distinct implementations will give distinct output files.

I've tried this approach once or twice, particularly when it was easy to automate (like running cwebp across a pre-existing corpus of JPEGs and PNGs), but it's a lot of manual work for each new file format you want to support.
Use existing test suites and bug reports.

It's not necessarily common, but some tools have excellent test suites with a range of interesting files already. These are historically used for conformance or integration testing, but they can be useful for fuzzing as well.

You can also find a small number of quirky files that exhibit weird behavior in the bug reports for that project. While there's not normally a lot of these, they can provide high signal, as these test cases often were already exhibiting undefined behavior in the target application in some form.
Crawl the web.

This was the standard approach we used at Google. As it turns out, Google has a good index of the publicly accessible web. From this it's possible to return a list of, for example, every Flash file that was crawled that day, or every PDF, or every EXE, etc. There's so much weird and wonderful stuff on the web, including a wide variety of files that have different features and quirks, so this approach was very effective.

Unsurprisingly the size of this data set was huge, and so corpus distillation was required. This involves running every file in the crawled data set and recording its code coverage profile. If the file triggered code that we hadn't seen before, we would keep that file. Otherwise, it was discarded. The end result is a seed corpus that is suitable for mutational fuzzing.

Most security researchers don't have direct access to the Google web crawler, but we have a great alternative: the Common Crawl. We can use Common Crawl to replicate this process and build our own seed corpus for fuzzing.

How can we build our own seed corpus?

This section provides a hands-on guide to building a seed corpus for fuzzing using the Common Crawl. Of course this isn't the only approach to building a seed corpus, but I've found it to be a cost-effective and simple way to get started, and many other researchers have been using this approach as well.

Common Crawl is an open data project that crawls the web and shares all of the crawled data publicly. Recently the Common Crawl has been popularly used as a training data set for Large Language Models (LLMs) such as GPT-3, but it's also very useful for fuzzing.

Common Crawl includes an index in a columnar format called Apache Parquet. An index is a data structure that allows us to efficiently and directly lookup the subset of data we're looking for without having to process the entire crawled data set, and fortunately two of the indexed fields are incredibly helpful in creating a fuzzing corpus: content_mime_type and content_mime_detected.

The content type (sometimes called MIME type) is an identifier that's commonly sent in HTTP responses to indicate the file format of the response. For example, if a web server says that something has a content type of "application/pdf", there's a good chance it's a PDF file. Common Crawl goes one step further, it attempts to confirm the real file format by using the "Content Detection" feature of Apache Tika. This is what Common Crawl's index calls the "content_mime_detected", and that's what we'll be using to extract files that are relevant for us.

The first step is choosing which file format (and corresponding mime-type) the seed corpus will be made up of. Then we're going to look for that mime-type in the Common Crawl index. For demonstrative purposes, let's use PDF ("application/pdf"). I've also had success with JPEG2000 ("image/x-jp2-container" and "image/x-jp2-codestream").

The recommended approach for querying the Common Crawl index is to use AWS Athena, which natively supports Apache Parquet. It's possible to set up Apache Spark locally to perform this step, but I found that the AWS Athena route is very straightforward and affordable. That means that the next step is to set up AWS Athena:

Create an AWS account with an IAM user and a user group with Athena and S3 permissions.
Create an S3 bucket for the results of our query of the common crawl index to be stored in.
Open the Athena query editor and set up the output S3 bucket.
Run the following SQL query to create a database for the index to reside in:
```
CREATE DATABASE ccindex
```

Create a new query, and then run the following to create a new table for the index:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
    url_surtkey                   STRING,
    url                           STRING,
    url_host_name                 STRING,
    url_host_tld                  STRING,
    url_host_2nd_last_part        STRING,
    url_host_3rd_last_part        STRING,
    url_host_4th_last_part        STRING,
    url_host_5th_last_part        STRING,
    url_host_registry_suffix      STRING,
    url_host_registered_domain    STRING,
    url_host_private_suffix       STRING,
    url_host_private_domain       STRING,
    url_host_name_reversed        STRING,
    url_protocol                  STRING,
    url_port                      INT,
    url_path                      STRING,
    url_query                     STRING,
    fetch_time                    TIMESTAMP,
    fetch_status                  SMALLINT,
    fetch_redirect                STRING,
    content_digest                STRING,
    content_mime_type             STRING,
    content_mime_detected         STRING,
    content_charset               STRING,
    content_languages             STRING,
    content_truncated             STRING,
    warc_filename                 STRING,
    warc_record_offset            INT,
    warc_record_length            INT,
    warc_segment                  STRING)
PARTITIONED BY (
    crawl                         STRING,
    subset                        STRING)
STORED AS parquet		
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

Create a new query and run the following, which will populate our table with the actual data of the Common Crawl index:
```
MSCK REPAIR TABLE ccindex
```

And then we're ready to query the index for our chosen file format:

SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex.ccindex WHERE (crawl = 'CC-MAIN-2023-14') 
AND subset = 'warc' AND content_mime_detected = 'application/pdf';

Running this query will create a CSV file in the S3 bucket we configured earlier, and it should only cost a few dollars. The CSV file will contain all of the information we need to extract the original file contents from the Common Crawl archives, which will allow us to build a new seed corpus.

Note that popular file formats will have a lot of results from this single query, but file formats that are more rare can augment their results by running the query on multiple crawls (e.g. replacing CC-MAIN-2023-14 with other values from here) and de-duplicating the results.

At this point we have a list of PDF files, and the job is to find the smallest possible subset of files that has the maximum coverage. A basic approach is to iterate through each line in the CSV, check the size of the file, extract the file from the Common Crawl, run the file with code coverage instrumentation, and then record if the file has any new code coverage that we haven't seen so far.

Note that we need to check the size of the file because Common Crawl only stores up to 1mb of data for each crawled response. That means we can only use files that are less than 1mb, as larger files will be truncated. In practice this is a reasonable restriction, because fuzzing large files is usually inefficient. The intuition here is that mutations made to larger files will land on "non-interesting" parts of the file at a higher rate than smaller files.

Common Crawl uses an archive file format called WARC to store crawled data. The CSV file we created earlier has all of the parameters we need to extract the data using a support library such as warcio.

In terms of code coverage, there are multiple potential solutions, but a simple way to achieve branch coverage on native code (C/C++) is to use Clang's SanitizerCoverage feature. The default implementation provided by the sanitizer run-time is sufficient, it simply dumps raw coverage data to a ".sancov" file in the working directory. To use this, we simply compile the target with "-fsanitize=address -fsanitize-coverage=trace-pc-guard" and then run the target with the following environment variable: "ASAN_OPTIONS=coverage=1"

Using this code coverage data we can filter out any files that don't trigger any new code paths. Once we're confident that no new seed files will be found, we can perform a final set cover minimization to achieve the smallest possible corpus that covers the maximum amount of code.

And that's really all there is to it. In order to make this a bit easier to follow, I've released a simple example script called common-corpus. The common-corpus tool takes a CSV in the format described above and a target binary compiled with SanitizerCoverage, and then creates an output directory containing a set of files that may be used for a fuzzing corpus.

Final Thoughts

There's lots of trial-and-error involved in building effective fuzzers, and many of the standard approaches to high-quality fuzzing aren't widely understood. The process of building a new seed corpus is one of those areas that hasn't been particularly well documented, despite being an important step for mutational fuzzing.

This post introduces some of the basic theory behind using a seed corpus for fuzzing, and answers three key questions: what is a seed corpus, why are they useful, and how can we build one from scratch? A cost-effective and openly accessible approach using the Common Crawl and code coverage based corpus minimization is described in step-by-step detail, including the release of a basic tool called common-corpus that helps with this process.

- Ben Hawkes