Deconstructing LibProtobuf/Mutator Fuzzing
Intro
LibProtobufMutator (LPM) is a library that helps fuzz structured input from a LibProtobuf (LP) specification. Among other things, LPM can assist coverage-guided fuzzing. This post explores the nitty-gritties of writing an LP-based fuzzer using KCC’s example.
What we need
To write an LP-based fuzzer, what you will need are:
- An LP specification: This is a descriptive file with a
.proto
extension - LP compiler: This compiles the LP spec. into code (C++ bindings) that can be called from the test harness
- LP-to-native-format-converter: Since fuzzing happens on the LP abstraction, we need a LP formatted input to native format converter if we are to fuzz the native format.
- Fuzzer test harness: This is a C/C++ test harness that invokes some program API that consumes (parses) native-formatted input Most importantly, what we don’t need is the LP fuzzer itself: code that mutates the LP formatted input. The fuzzer module is called LibProtobufMutator or LPM, which is an external dependency.
This seems complicated at first; it definitely is for someone, like me, who has never written an LP-based fuzzer before. I will try to make it simpler.
I think the big idea behind this was that it is harder to ask developers to write custom fuzz mutators than it is to ask them to write a format specification and test harness. I’ve never written a custom fuzz mutator before, so I’m not in a position to present my experience.
That aside, the hope with this project is that this setup (LP-based fuzzing) catches bugs faster and more methodically. Methodically because you are fuzzing the specification and not mutating an opaque sequence of bytes. Faster, hopefully because fuzzing only what needs to be fuzzed with only those mutations that make sense arrives at bugs faster than fuzzing everything somehow.
LP specification
Here’s a simple LPM spec taken from here.
Here’s a break-down of the most important fields:
syntax = proto2;
: There are two versions of the protocol buffers language, namelyproto2
andproto3
. This specification is written usingproto2
.message
:message
, although not explicitly defined iiuc, seems to be the smallest unit of a message description. It is a named field. For examplemessage IHDR {
defines a message format calledIHDR
- field rule, type, name, number: A
field
is a portion of a message.- field rule: specifies if the field under consideration is required, optional, or repeated. They mean just that.
- field type: specifies the data type of the field e.g., number (
uint32
), string etc. - field name: name of the field
- field number: unique identifier for said field. It is a good practice to start numbering from
1
since smaller integers require lesser storage.
A much needed digression to understand a real-world data format, the PNG image format. The structure of the simplest PNG image is as follows:
--------
PNG sig
--------
IHDR
--------
IDAT(s)
--------
IEND
--------
Barring IDAT
, all chunnks are singular i.e., must appear only once in a valid PNG file.
PNG signature
The PNG signature is a specific sequence of bytes that signal the beginning of a PNG file. It looks like so (in C/C++ code)
const unsigned char header[] = {0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a};
IHDR
IHDR stores image meta-data such as its width, height etc. Unlike the signature, IHDR contains variable fields. This makes it a good candidate for a protocol buffers message
From the original PNG specification
The IHDR chunk must appear FIRST. It contains:
Width: 4 bytes
Height: 4 bytes
Bit depth: 1 byte
Color type: 1 byte
Compression method: 1 byte
Filter method: 1 byte
Interlace method: 1 byte
Let’s look at the corresponding protobuf description:
message IHDR {
required uint32 width = 1; // maps to width
required uint32 height = 2; // maps to height
required uint32 other1 = 3; // maps to bitdepth-colortype-compmethod-filtmethod
required uint32 other2 = 4; // Only 1 byte used. (maps to interlacemethod)
}
As we can see, the protobuf description is “serialized” into fields of type uint32
(4-byte sequences).
If you were to closely match the original IHDR spec, the proto-spec would look as follows (note the break-down of fields such as bit_depth
, color_type
etc.
message IHDR {
required uint32 width = 1;
required uint32 height = 2;
enum bit_depth {
BD_ONE = 1;
BD_TWO = 2;
BD_FOUR = 4;
BD_EIGHT = 8;
BD_SIXTEEN = 16;
BD_MAX = 255; // BYTE_MAX
};
enum color_type {
CT_ZERO = 0;
CT_TWO = 2;
CT_THREE = 3;
CT_FOUR = 4;
CT_SIX = 6;
CT_MAX = 255; // BYTE_MAX
};
...
};
Although the BYTE_MAX
option is not part of the specification, I have intentionally added it so that we make the mutator explore specific corner cases. This is hacky, I admit. Who is to say whether or not 200
is a better corner-case than 255
?
IDAT
The IDAT chunk contains compressed image data. This means (in LP terms) it’s spec looks like so
message IDAT {
required bytes data = 1;
}
It’s an opaque byte stream, the mutator is free to synthesize whatever byte-sequence it wants to fuzz an IDAT chunk.
IEND
Here’s how the PNG spec defines IEND
The IEND chunk must appear LAST. It marks the end of the PNG datastream. The chunk’s data field is empty.
Essentially, it is a placeholder with no data that signifies the end of a PNG image.
The LP compiler
The LP compiler is called protoc.
protoc
compiles a Protobuf spec. (.proto
file) into language bindings.
At the moment, the following language bindings are supported by the compiler: C++, Java, and Python.
In these notes, it appears that support for more languges is an ongoing effort.
Invoking the compiler is quite simple, as you can see here, all you need to do is
rm -rf genfiles && mkdir genfiles && LPM/external.protobuf/bin/protoc png_fuzz_proto.proto --cpp_out=genfiles
This is
- Creating a fresh
genfiles
directory where C/C++ bindings will be stored - Invoking the
protoc
compiler that is available from the LPM repo against the PNG LP description we spoke about in the previous section of this blog - Explicitly asking the compiler to generate C++ bindings
Essentially, what this step does is to create a set of C++ header/source files that may be included/linked against by the fuzzer test harness. The generated header/C++ files offer a simple API to access the underlying raw data behind LPM fields.
LP to native format converter
Why do we need a converter in the first place? Here’s the thing: The LPM generates LPM formatted input that, for PNG, looks like this
# xxd C/002d3dd31b1bc41601c0e5d652b97f6599b23ba6
00000000: 6968 6472 207b 0a20 2077 6964 7468 3a20 ihdr {. width:
00000010: 300a 2020 6865 6967 6874 3a20 300a 2020 0. height: 0.
00000020: 6274 3a20 4244 5f4f 4e45 0a20 2063 743a bt: BD_ONE. ct:
00000030: 2043 545f 5448 5245 450a 2020 636d 3a20 CT_THREE. cm:
00000040: 434d 5f4d 4158 0a20 2066 6d3a 2046 4d5f CM_MAX. fm: FM_
00000050: 4d41 580a 2020 693a 2049 5f4d 4158 0a7d MAX. i: I_MAX.}
00000060: 0a
What we actually need when we are debugging is a valid PNG file, that looks like this
# xxd a.png
00000000: 8950 4e47 0d0a 1a0a 0000 000d 4948 4452 .PNG........IHDR
00000010: 0000 0000 0000 0000 0103 ffff ff01 fbc8 ................
00000020: 4300 0000 0049 454e 44ae 4260 82 C....IEND.B`.
As you can see, the LPM generated file holds a bunch of key:value
pairs in serialized form. These need to be parsed so that we construct a serialized form of values
in PNG format. Precisely this is the job of the converter.
In code terms, the converter is an integral part of the test harness itself (see next section). The fuzzer harness, among other things, is accepting an LPM formatted input, converting it to a valid PNG byte stream and feeding it to the fuzzer entry-point API.
Fuzzer test harness
Here’s a gist of the test harness (written by KCC; I’m embedding it via a gist because I’ve not yet found a nifty way to directly embed GH files in GH pages) for us to break down
Let’s look at the includes first:
- some standard stuff happening with
<string>
etc. zlib.h
is needed because (quoting the original spec.)
At present, only compression method 0 (deflate/inflate compression with a sliding window of at most 32768 bytes) is defined. All standard PNG images must be compressed with this scheme. Deflate-compressed datastreams within PNG are stored in the “zlib” format
#include "libprotobuf-mutator/src/libfuzzer/libfuzzer_macro.h"
: This defines theDEFINE_PROTO_FUZZER
that seems to be overridden (?) in the test harness. TBH, I dunno what’s happening here.#include "png_fuzz_proto.pb.h"
: This is theprotoc
generated C++ binding header file for our LP spec.
Past the header inclusions, you see several utility functions
WriteInt
writes an integer in big-endian (network byte order) format as required by the PNG specWriteByte
simply writes a bytecompress
performs zlib compression of chunk data. This is required for IDAT chunks especiallyWriteChunk
writes a specified PNG chunkProtoToPng
is where a proto is converted to astd::string
that contains the fuzzed PNG’s raw data (see previous section). This is where the LPM to native format conversion (see previous section) is happening.FuzzPNG
is the real test harness: This function feeds fuzzed raw PNG data to the underlying PNG API
The FuzzPNG
function is defined in the PNG source repo, which is why it needs to be linked against it like so
$CXX $CXXFLAGS -c -DLLVMFuzzerTestOneInput=FuzzPNG libpng/contrib/oss-fuzz/libpng_read_fuzzer.cc -I libpng
$CXX $CXXFLAGS png_proto_fuzzer_example.cc libpng_read_fuzzer.o genfiles/png_fuzz_proto.pb.cc \
-I genfiles -I. -I libprotobuf-mutator/ -I LPM/external.protobuf/include \
-lz \
LPM/src/libfuzzer/libprotobuf-mutator-libfuzzer.a \
LPM/src/libprotobuf-mutator.a \
LPM/external.protobuf/lib/libprotobuf.a \
libpng/.libs/libpng16.a \
$LIB_FUZZING_ENGINE \
-o $OUT/png_proto_fuzzer_example
Were you to write the FuzzPNG function yourself, it would probably look like this. Looks like standard stuff if you were to read Chapter 13 of the PNG book.
Conclusion
In this post, we explored
- what LibprotobufferMutator is and how one can write an LP spec
- How LP spec can help us write more targeted fuzzers
- How the whole LP/LPM/libFuzzer setup is wired together
Overall, I feel that LP-based fuzzing holds promise for testing language parsers, compilers, interpreters etc. The challenge is to obtain an understanding of the underlying language well enough to be able to (1) write a spec for it and (2) write a proper LP-to-native format converter.
Although I think writing these things is not a big deal, it definitely takes dedicated time and effort. This means, unless you draw benefits from such effort you are more likely to just download a corpus from the Internet and start fuzzing. It’s essentially a cost-benefit trade-off.
In an upcoming post, I plan to compare vanilla (non specification) fuzzer and an LP-based fuzzer with the hope that such a comparison sheds light on the actual benefits of LP-based fuzzing. That’s all folks!