<h1>Writing a Fuzz Unit Test for a Boost Filesystem API (2021-02-27)</h1>
<h3 id="intro">Intro</h3>
<p>This post summarizes one fuzz unit test for the Boost filesystem library and a bug it found.
Feel free to explore the rather vast landscape of boost filesystem APIs in order to write more unit tests.
Help make Boost more robust.</p>
<h3 id="fuzz-unit-test">Fuzz unit test</h3>
<p>The following unit test</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <boost/filesystem.hpp>
#include <string>
using namespace std;
using namespace boost::filesystem;
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size)
{
string pathString(reinterpret_cast<const char*>(data), size);
path p(pathString);
p.remove_filename();
return 0;
}
</code></pre></div></div>
<p>when compiled and run like so (tested on Linux bash console)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -e "
#include <boost/filesystem.hpp>
#include <string>
using namespace std;
using namespace boost::filesystem;
extern \"C\" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size)
{
string pathString(reinterpret_cast<const char*>(data), size);
path p(pathString);
p.remove_filename();
return 0;
}
" | clang++ -x c++ - -fsanitize=fuzzer -o fuzz_bfs -lboost_filesystem && time ./fuzz_bfs
</code></pre></div></div>
<p>prints the following output on the console (Linux, x86-64, clang v10, Boost v1.71)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO: Seed: 3723374228
INFO: Loaded 1 modules (321 inline 8-bit counters): 321 [0x4f5150, 0x4f5291),
INFO: Loaded 1 PC tables (321 PCs): 321 [0x4c8f98,0x4ca3a8),
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: A corpus is not provided, starting from an empty corpus
#2 INITED cov: 4 ft: 5 corp: 1/1b exec/s: 0 rss: 24Mb
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::erase: __pos (which is 18446744073709551615) > this->size() (which is 5)
==702779== ERROR: libFuzzer: deadly signal
#0 0x4b00f0 in __sanitizer_print_stack_trace (/home/bhargava/fuzz_bfs+0x4b00f0)
#1 0x45c3f8 in fuzzer::PrintStackTrace() (/home/bhargava/fuzz_bfs+0x45c3f8)
#2 0x441543 in fuzzer::Fuzzer::CrashCallback() (/home/bhargava/fuzz_bfs+0x441543)
#3 0x7f45aa9513bf (/lib/x86_64-linux-gnu/libpthread.so.0+0x153bf)
#4 0x7f45aa76218a in __libc_signal_restore_set /build/glibc-ZN95T4/glibc-2.31/signal/../sysdeps/unix/sysv/linux/internal-signals.h:86:3
#5 0x7f45aa76218a in raise /build/glibc-ZN95T4/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:48:3
#6 0x7f45aa741858 in abort /build/glibc-ZN95T4/glibc-2.31/stdlib/abort.c:79:7
#7 0x7f45aab6a950 (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x9e950)
#8 0x7f45aab7647b (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa47b)
#9 0x7f45aab764e6 in std::terminate() (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa4e6)
#10 0x7f45aab76798 in __cxa_throw (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa798)
#11 0x7f45aab6d3ea (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xa13ea)
#12 0x7f45aaac0a22 in boost::filesystem::path::remove_filename() (/usr/lib/x86_64-linux-gnu/libboost_filesystem.so.1.71.0+0x12a22)
#13 0x4b26a7 in LLVMFuzzerTestOneInput (/home/bhargava/fuzz_bfs+0x4b26a7)
#14 0x442c01 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) (/home/bhargava/fuzz_bfs+0x442c01)
#15 0x442345 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool*) (/home/bhargava/fuzz_bfs+0x442345)
#16 0x4445e7 in fuzzer::Fuzzer::MutateAndTestOne() (/home/bhargava/fuzz_bfs+0x4445e7)
#17 0x4452e5 in fuzzer::Fuzzer::Loop(std::__Fuzzer::vector<fuzzer::SizedFile, fuzzer::fuzzer_allocator<fuzzer::SizedFile> >&) (/home/bhargava/fuzz_bfs+0x4452e5)
#18 0x433c9e in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) (/home/bhargava/fuzz_bfs+0x433c9e)
#19 0x45cae2 in main (/home/bhargava/fuzz_bfs+0x45cae2)
#20 0x7f45aa7430b2 in __libc_start_main /build/glibc-ZN95T4/glibc-2.31/csu/../csu/libc-start.c:308:16
#21 0x408a3d in _start (/home/bhargava/fuzz_bfs+0x408a3d)
NOTE: libFuzzer has rudimentary signal handlers.
Combine libFuzzer with AddressSanitizer or similar for better crash reports.
SUMMARY: libFuzzer: deadly signal
MS: 4 ChangeBit-InsertRepeatedBytes-ShuffleBytes-EraseBytes-; base unit: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
0x2f,0x2f,0x2f,0x2f,0x2f,
/////
artifact_prefix='./'; Test unit written to ./crash-ece6d237a9393e5c002c541f9d4c92136941d956
Base64: Ly8vLy8=
real 0m1.610s
user 0m1.524s
sys 0m0.008s
</code></pre></div></div>
<p>This bug was <a href="https://github.com/boostorg/filesystem/issues/176">reported</a> upstream and promptly <a href="https://github.com/boostorg/filesystem/commit/cc57d28995c4a61e19d718040f9bc616b111a552">fixed</a> (thank you boost devs!).</p>
<p>The crash may be interpreted as follows:</p>
<ul>
<li>If you feed the input “/////” to a boost filesystem path object and attempt to remove the filename, it throws an exception</li>
<li>The exception is of type <a href="https://en.cppreference.com/w/cpp/error/out_of_range">std::out_of_range</a></li>
</ul>
<p>Quoting</p>
<blockquote>
<p>(std::out_of_range) reports errors that are consequence of attempt to access elements out of defined range.</p>
</blockquote>
<blockquote>
<p>It may be thrown by the member functions of std::bitset and std::basic_string, by std::stoi and std::stod families of functions, and by the bounds-checked member access functions (e.g. std::vector::at and std::map::at).</p>
</blockquote>
<p>Malformed inputs like this one should not cause a low-level exception such as this to escape a library API, which is why this is a bug.</p>
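<p>For completeness, the crash should also reproduce outside libFuzzer. Here is a minimal standalone reproducer (my own sketch, assuming Boost v1.71 and linking with <code class="language-plaintext highlighter-rouge">-lboost_filesystem</code>):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <boost/filesystem.hpp>

int main()
{
// "/////" is the crashing input found by the fuzzer
boost::filesystem::path p("/////");
p.remove_filename(); // throws std::out_of_range on Boost v1.71
}
</code></pre></div></div>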
<h3 id="conclusion">Conclusion</h3>
<p>It is rather easy to get started with fuzzing boost filesystem APIs.
The test in this blog post hardly spans three lines of code (excluding boilerplate), so you get the idea.
Hope this post inspires you to explore other nooks and corners of the Boost filesystem API, and perhaps even fuzz them.
That, in turn, would make the Boost C++ libraries that many of us (especially in the open-source world) rely on safer.
Stay healthy!</p>
<h1>Custom Proto Mutation (2019-12-27)</h1>
<h2 id="intro">Intro</h2>
<p>This post describes how you can write your own custom protobuf mutators. Protobuf mutators are routines that mutate or change protobuf input. Protobuf input is structured data; in its human-readable text form, it looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message {
sub_message {
int_field: 2
string_field: "hello"
}
}
</code></pre></div></div>
<p>A custom proto mutation is a routine that, say, mutates the <code class="language-plaintext highlighter-rouge">string_field</code> of <code class="language-plaintext highlighter-rouge">sub_message</code> from the string <code class="language-plaintext highlighter-rouge">hello</code> to the string <code class="language-plaintext highlighter-rouge">world</code>.</p>
<h2 id="motivation">Motivation</h2>
<p>What is the use of a custom proto mutation? The thing is, <a href="https://github.com/google/fuzzing/blob/master/docs/structure-aware-fuzzing.md">structured fuzzing</a> is useful to fuzz programs that accept structured input. A popular implementation technique to perform structured fuzzing is via the use of (1) the <a href="https://github.com/protocolbuffers/protobuf">protocol buffers library</a> to define input structure; and (2) the <a href="https://github.com/google/libprotobuf-mutator">libprotobuf mutator library</a> to perform random protobuf mutations. Random protobuf mutations may be sufficient already, so at the risk of sounding repetitive, what is the use of a custom proto mutation?</p>
<p>Well, think of it like this. Say you are fuzzing a program that you have written. You obviously know more about your program than a random fuzzer would, notwithstanding the power of coverage guidance. So, let’s say, you <strong>know</strong> that your program will perform a state transition when an input field described by <code class="language-plaintext highlighter-rouge">sub_message</code>’s <code class="language-plaintext highlighter-rouge">string_field</code> is <code class="language-plaintext highlighter-rouge">world</code> and not <code class="language-plaintext highlighter-rouge">hello</code>. Now, to trigger this mutation without a custom mutator, you’d typically wait for the random mutator, through a series of mutations, to change <code class="language-plaintext highlighter-rouge">hello</code> to <code class="language-plaintext highlighter-rouge">world</code>. Although this is not too far-fetched, it consumes resources i.e., time and computation cycles.</p>
<p>The point is, if you <strong>know</strong> some mutation is important for your program, why would you wait for it to be synthesized randomly? Why not program it as part of the fuzzer itself, right?</p>
<h2 id="writing-a-custom-proto-mutator">Writing a custom proto mutator</h2>
<p>Now, I describe the technical part of writing your own custom proto mutator, using <a href="https://github.com/google/oss-fuzz/tree/master/projects/libpng-proto">libpng proto fuzzer</a> as an example. The <a href="https://github.com/google/oss-fuzz/blob/master/projects/libpng-proto/png_proto_fuzzer_example.cc">libpng_proto_fuzzer_example.cc</a> source file describes how to convert protobuf structure defined in <a href="https://github.com/google/oss-fuzz/blob/master/projects/libpng-proto/png_fuzz_proto.proto">png_fuzz_proto.proto</a> to a PNG file. Let’s set ourselves the relatively simple task of writing a mutator that mutates an <code class="language-plaintext highlighter-rouge">OtherChunk</code> such that <code class="language-plaintext highlighter-rouge">unknown_type</code> chunks are changed to <code class="language-plaintext highlighter-rouge">known_type</code> chunks.</p>
<h3 id="libprotobuf-mutator-postprocessor-callbacks">libprotobuf-mutator postprocessor callbacks</h3>
<p>Before we code the actual mutation routine, let’s take some time to appreciate the callback facility provided by libprotobuf-mutator to enable custom mutations. I believe this callback was first implemented in <a href="https://github.com/google/libprotobuf-mutator/pull/137">this pull request</a>. Essentially, the user of libprotobuf-mutator, can register a postprocessor callback on a protobuf message type. This postprocessor is then invoked after <strong>every</strong> mutation performed by libprotobuf-mutator.</p>
<h3 id="callback-interface">Callback interface</h3>
<p>The callback interface <a href="https://github.com/google/libprotobuf-mutator/blob/dd89da92b59b1714bab6e2a135093948a1cf1c6a/src/libfuzzer/libfuzzer_macro.h#L109-L112">looks like so</a>. Essentially, the interface contains two input parameters:</p>
<ul>
<li>const pointer to message descriptor</li>
<li>function that implements the custom mutation routine. This function accepts two inputs:
<ul>
<li>pointer to protobuf message</li>
<li>seed (unsigned integer)</li>
</ul>
</li>
</ul>
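<p>Paraphrased in code (a sketch; consult the linked <code class="language-plaintext highlighter-rouge">libfuzzer_macro.h</code> for the authoritative declaration), the interface looks roughly like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Register a callback that is invoked after every LPM mutation of a
// message whose type matches `desc`
void RegisterPostProcessor(
    const google::protobuf::Descriptor* desc,
    std::function<void(google::protobuf::Message* message, unsigned int seed)>
        callback);
</code></pre></div></div>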
<p>I will briefly describe each of them in the following paragraphs.</p>
<h4 id="message">Message</h4>
<p>A protobuf message is a unit of input structure. A message may contain fields that may be of a value type (i.e., integer, bool, string etc.) or non-value type e.g., message. In our dummy example, <code class="language-plaintext highlighter-rouge">message</code> and <code class="language-plaintext highlighter-rouge">sub_message</code> are protobuf messages that describe something. The reason this is part of the callback interface is that, ultimately, we (custom mutation implementors) would like to mutate this data with custom changes.</p>
<h4 id="message-descriptor">Message descriptor</h4>
<p>A message descriptor describes the nature of a message. The reason this is part of the callback interface is that, internally, libprotobuf-mutator maps a callback (custom mutation routine) against a descriptor. So, for example, if we were to implement a custom mutator for changing the <code class="language-plaintext highlighter-rouge">string_field</code> in our dummy example, it would have to be registered against the <code class="language-plaintext highlighter-rouge">sub_message</code> message type’s descriptor. To do that, we use the static member function <code class="language-plaintext highlighter-rouge">sub_message::descriptor()</code> generated by protoc (the protobuf compiler).</p>
<h4 id="seed">Seed</h4>
<p>A seed is a pseudo-random number supplied by libprotobuf-mutator to help the mutation writer tune their mutation. The reason this is part of the callback interface is that, often, mutation routine implementors (us) would want their mutation to be applied only every once in a while. To permit this while keeping fuzzing deterministic, a pseudo-randomly (but deterministically) generated seed is supplied for use by the mutation routine implementor.</p>
<p>A simple manner in which <code class="language-plaintext highlighter-rouge">seed</code> may be used is via the modulo operator, like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Apply my mutation roughly once every three LPM mutations
if (seed % 3 == 0)
{
apply_my_mutation();
}
</code></pre></div></div>
<h4 id="callback-function">Callback function</h4>
<p>Now that we understand the structure and reasoning behind LPM’s postprocessor interface, we can implement the mutation routine: Change <code class="language-plaintext highlighter-rouge">hello</code> to <code class="language-plaintext highlighter-rouge">world</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>protobuf_mutator::libfuzzer::RegisterPostProcessor(
sub_message::descriptor(),
[](google::protobuf::Message* message, unsigned int seed)
{
sub_message *sub_msg = static_cast<sub_message *>(message);
if (seed % 2)
{
if (sub_msg->string_field() == "hello")
{
sub_msg->set_string_field("world");
}
}
}
);
</code></pre></div></div>
<p>Here’s what we are doing:</p>
<ul>
<li>Register a custom post processor for the <code class="language-plaintext highlighter-rouge">sub_message</code> message type</li>
<li>Statically cast the canonical protobuf message type to the <code class="language-plaintext highlighter-rouge">sub_message</code> message type before further checks</li>
<li>Apply the custom mutation 50% of the time</li>
<li>If <code class="language-plaintext highlighter-rouge">string_field</code> is set to <code class="language-plaintext highlighter-rouge">hello</code>, change it to <code class="language-plaintext highlighter-rouge">world</code></li>
</ul>
<h3 id="libpng-custom-mutator">libpng custom mutator</h3>
<p>Now, we are ready to apply what we have learnt to the linked libpng-proto fuzzer. Here’s <a href="https://github.com/google/oss-fuzz/pull/3168/files#diff-0e216d0c3c3e73c9bdee0a482ac288beR20-R33">a portion of the pull request</a> in which I implement a simple mutator routine that changes <code class="language-plaintext highlighter-rouge">unknown_type</code> chunks to a <code class="language-plaintext highlighter-rouge">known_type</code> chunk:</p>
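<p>Reconstructed from that PR (a sketch; the <code class="language-plaintext highlighter-rouge">OtherChunk</code> field names are assumed from <code class="language-plaintext highlighter-rouge">png_fuzz_proto.proto</code>, so see the diff for the exact code), the mutator looks roughly like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>protobuf_mutator::libfuzzer::RegisterPostProcessor(
    OtherChunk::descriptor(),
    [](google::protobuf::Message* message, unsigned int seed)
    {
        OtherChunk *chunk = static_cast<OtherChunk *>(message);
        // Every other mutation, turn an unknown chunk type into a known one
        if (seed % 2 && chunk->has_unknown_type())
        {
            chunk->set_known_type(seed);
        }
    }
);
</code></pre></div></div>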
<p>The really cool part is that it takes just 4 lines of source code to do this :-)</p>
<h2 id="conclusion">Conclusion</h2>
<p>This post hopefully made it easier for you to understand and write custom proto mutation routines for your fuzzer. Have fun writing them and experimenting a little until you find that elusive bug that randomness could not find ;-)</p>
<h1>Structure aware mruby fuzzer (2019-05-17)</h1>
<h2 id="intro">Intro</h2>
<p><a href="https://github.com/google/fuzzer-test-suite/blob/master/tutorial/structure-aware-fuzzing.md">Structure aware fuzzing</a> is a fuzzing technique in which you make the fuzzer aware of the structure of input.
This post describes the application of this technique to the mruby interpreter.</p>
<h2 id="what-is-mruby">What is mruby?</h2>
<p><a href="https://en.wikipedia.org/wiki/Mruby">mruby</a> is a lightweight ruby interpreter that is designed to be embeddable.
This means, you can use mruby to write a <a href="http://mruby.org/docs/articles/executing-ruby-code-with-mruby.html">20 line “C” program that executes ruby code</a>.
Cool, eh? Let’s fuzz it with arbitrary ruby code then.</p>
<h2 id="why-fuzz-mruby">Why fuzz mruby?</h2>
<p>There is some <a href="https://hackerone.com/shopify-scripts">evidence</a> that companies use mruby to execute potentially attacker-controlled ruby programs in security sensitive environments.</p>
<h2 id="structure-of-a-ruby-program">Structure of a ruby program</h2>
<p>Without awareness of the ruby programming language, the fuzzer is likely to synthesize junk.
I mean, today’s fuzzers are smart but they are not smart enough to synthesize ruby programs from thin air.
That’s the realm of machine learning, isn’t it?
Lol.</p>
<h3 id="function">Function</h3>
<p>Let’s prod the fuzzer along a little bit.
Let’s start by defining a very simple input template.
Our input template defines a function foo and invokes it thereafter.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def foo()
end
foo
</code></pre></div></div>
<p>Simple, isn’t it?
What does the protobuf specification for such a function look like?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Function {
}
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Function</code>, for the moment, is just a stub object, that we can “visit” (in the <a href="https://en.wikipedia.org/wiki/Visitor_pattern">visitor pattern sense</a>) like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void protoConverter::visit(Function const& x)
{
m_output << "def foo()\nvar_0 = 1\n";
m_output << "end\n";
m_output << "foo\n";
}
</code></pre></div></div>
<p>Simple as it is, foo doesn’t do anything.
To do something, we need a notion of statements.</p>
<h3 id="statements">Statements</h3>
<p>So let’s add a notion of statements.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Const {
oneof const_oneof {
uint32 int_lit = 1;
bool bool_val = 2;
}
}
message Rvalue {
oneof rvalue_oneof {
Const cons = 1;
}
}
message AssignmentStatement {
required Rvalue rvalue = 2;
}
message Statement {
oneof stmt_oneof {
AssignmentStatement assignment = 1;
}
}
message StatementSeq {
repeated Statement statements = 1;
}
message Function {
required StatementSeq statements = 1;
}
</code></pre></div></div>
<p>This specification tells the fuzzer the following</p>
<ul>
<li>A function consists of a sequence of statements</li>
<li>A statement sequence consists of at least zero statements</li>
<li>A statement can be an assignment statement</li>
<li>An assignment statement consists of a value on the right hand side
<ul>
<li>The value can be a constant</li>
<li>A constant is either an unsigned integer or a boolean literal</li>
</ul>
</li>
</ul>
<p>Here’s the corresponding visitor.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void protoConverter::visit(AssignmentStatement const& x)
{
m_output << "var_" << m_numLiveVars << " = ";
visit(x.rvalue());
m_output << "\n";
}
void protoConverter::visit(Statement const& x)
{
switch (x.stmt_oneof_case()) {
case Statement::kAssignment:
visit(x.assignment());
break;
case Statement::STMT_ONEOF_NOT_SET:
break;
}
m_output << "\n";
}
void protoConverter::visit(Function const& x)
{
m_output << "def foo()\nvar_0 = 1\n";
visit(x.statements());
m_output << "end\n";
m_output << "foo\n";
}
</code></pre></div></div>
<p>Let’s see what this generates</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def foo()
var_0 = 1337
var_1 = false
end
foo
</code></pre></div></div>
<p>It’s definitely more lively than the foo we started out with, but it’s still sorta meh.</p>
<h3 id="more-statements">More statements</h3>
<p>We can essentially translate ruby programming language rules into a somewhat equivalent protobuf specification.
And trust me, there is a lot more to be done.
We can add the notion of strings, hash values, and operations on top of them to begin with.
We can teach the fuzzer what it means to call the <code class="language-plaintext highlighter-rouge">Time()</code> builtin object.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Time.at(628232400) #=> 1989-11-28 00:00:00 -0500
</code></pre></div></div>
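<p>As a sketch of what that could look like (the message and visitor below are my own invention, not code from the repo), one could extend <code class="language-plaintext highlighter-rouge">Statement</code>’s oneof with a <code class="language-plaintext highlighter-rouge">TimeStatement</code> and emit it like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical: add `TimeStatement time_stmt = 2;` to Statement's oneof
// message TimeStatement { required uint32 unix_time = 1; }
void protoConverter::visit(TimeStatement const& x)
{
// Emit a call to the Time.at builtin with a fuzzer-chosen timestamp
m_output << "var_" << m_numLiveVars << " = Time.at(" << x.unix_time() << ")\n";
m_numLiveVars++;
}
</code></pre></div></div>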
<p>I have made a humble beginning <a href="https://github.com/mruby/mruby/tree/master/oss-fuzz">here</a>.</p>
<ul>
<li><a href="https://github.com/mruby/mruby/blob/master/oss-fuzz/ruby.proto">Ruby proto spec</a></li>
<li><a href="https://github.com/mruby/mruby/blob/master/oss-fuzz/proto_to_ruby.cpp">Ruby proto spec to ruby program converter class</a></li>
</ul>
<p>Contributions welcome. Some specific directions for future work</p>
<ul>
<li>Add more ruby operations</li>
<li>Avoid generating DoSsy ruby programs like <code class="language-plaintext highlighter-rouge">print "1337"*10000000</code> (see the sketch after this list)</li>
</ul>
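<p>For the DoS item, one option (a sketch; the <code class="language-plaintext highlighter-rouge">StringRep</code> message and its <code class="language-plaintext highlighter-rouge">count</code> field are hypothetical) is to clamp fuzzer-chosen repetition counts inside the converter before emitting them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical visit() for a string-repetition statement: cap the
// fuzzer-chosen count so generated programs stay cheap to interpret
void protoConverter::visit(StringRep const& x)
{
uint32_t count = std::min(x.count(), 100u);
m_output << "var_" << m_numLiveVars << " = \"1337\" * " << count << "\n";
m_numLiveVars++;
}
</code></pre></div></div>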
<p>Help find deep bugs in the mruby interpreter.</p>
<h1>Deconstructing LibProtobuf/Mutator Fuzzing (2019-01-18)</h1>
<h3 id="intro">Intro</h3>
<p><a href="https://github.com/google/libprotobuf-mutator">LibProtobufMutator</a> (LPM) is a library that helps fuzz structured input from a <a href="https://github.com/protocolbuffers/protobuf">LibProtobuf</a> (LP) specification.
Among other things, LPM can <a href="https://chromium.googlesource.com/chromium/src/testing/libfuzzer/+/HEAD/libprotobuf-mutator.md#Write-a-grammar_based-fuzzer-with-libprotobuf_mutator">assist coverage-guided fuzzing</a>.
This post explores the nitty-gritties of writing an LP-based fuzzer using <a href="https://github.com/google/oss-fuzz/pull/2048">KCC’s example</a>.</p>
<h3 id="what-we-need">What we need</h3>
<p>To write an LP-based fuzzer, what you will need are:</p>
<ul>
<li>An LP specification: This is a descriptive file with a <code class="language-plaintext highlighter-rouge">.proto</code> extension</li>
<li>LP compiler: This compiles the LP spec. into code (C++ bindings) that can be called from the test harness</li>
<li>LP-to-native-format-converter: Since fuzzing happens on the LP abstraction, we need a LP formatted input to native format converter if we are to fuzz the native format.</li>
<li>Fuzzer test harness: This is a C/C++ test harness that invokes some program API that consumes (parses) native-formatted input</li>
</ul>
<p>Most importantly, what we don’t need is the LP fuzzer itself: code that mutates the LP formatted input. The fuzzer module is called LibProtobufMutator or LPM, and it is an external dependency.</p>
<p>This seems complicated at first; it definitely is for someone, like me, who has never written an LP-based fuzzer before.
I will try to make it simpler.</p>
<p>I think the big idea behind this was that it is harder to ask developers to write custom fuzz mutators than it is to ask them to write a format specification and test harness.
I’ve never written a custom fuzz mutator before, so I’m not in a position to present my experience.</p>
<p>That aside, the hope with this project is that this setup (LP-based fuzzing) catches bugs faster and more methodically.
Methodically because you are fuzzing the specification and not mutating an opaque sequence of bytes.
Faster, hopefully, because fuzzing only what needs to be fuzzed, with only those mutations that make sense, arrives at bugs sooner than blindly fuzzing everything.</p>
<h3 id="lp-specification">LP specification</h3>
<p>Here’s a simple LPM spec taken from <a href="https://github.com/google/oss-fuzz/pull/2048">here</a>.</p>
<script src="https://gist.github.com/7c78e89af167700387a2ac93798a1c29.js"> </script>
<p>Here’s a break-down of the most important fields:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">syntax = proto2;</code>: There are two versions of the protocol buffers language, namely <code class="language-plaintext highlighter-rouge">proto2</code> and <code class="language-plaintext highlighter-rouge">proto3</code>. This specification is written using <code class="language-plaintext highlighter-rouge">proto2</code>.</li>
<li><code class="language-plaintext highlighter-rouge">message</code>: <code class="language-plaintext highlighter-rouge">message</code>, although not explicitly defined iiuc, seems to be the smallest unit of a message description. It is a named field. For example <code class="language-plaintext highlighter-rouge">message IHDR {</code> defines a message format called <code class="language-plaintext highlighter-rouge">IHDR</code></li>
<li>field rule, type, name, number: A <code class="language-plaintext highlighter-rouge">field</code> is a portion of a message.
<ul>
<li>field rule: specifies if the field under consideration is required, optional, or repeated. They mean just that.</li>
<li>field type: specifies the data type of the field e.g., number (<code class="language-plaintext highlighter-rouge">uint32</code>), string etc.</li>
<li>field name: name of the field</li>
<li>field number: unique identifier for said field. It is good practice to start numbering from <code class="language-plaintext highlighter-rouge">1</code>, since smaller field numbers require less storage on the wire (numbers 1 through 15 encode in a single byte).</li>
</ul>
</li>
</ul>
<p>A much-needed digression to understand a real-world data format: the PNG image format. The structure of the simplest PNG image is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--------
PNG sig
--------
IHDR
--------
IDAT(s)
--------
IEND
--------
</code></pre></div></div>
<p>Barring <code class="language-plaintext highlighter-rouge">IDAT</code>, all chunks are singular, i.e., they must appear exactly once in a valid PNG file.</p>
<h4 id="png-signature">PNG signature</h4>
<p>The PNG signature is a specific sequence of bytes that signal the beginning of a PNG file. It looks like so (in C/C++ code)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const unsigned char header[] = {0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a};
</code></pre></div></div>
<h4 id="ihdr">IHDR</h4>
<p>IHDR stores image metadata such as width, height etc. Unlike the signature, IHDR contains variable fields, which makes it a good candidate for a protocol buffers message.</p>
<p>From the <a href="http://www.libpng.org/pub/png/spec/1.2/PNG-Contents.html">original PNG specification</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The IHDR chunk must appear FIRST. It contains:
Width: 4 bytes
Height: 4 bytes
Bit depth: 1 byte
Color type: 1 byte
Compression method: 1 byte
Filter method: 1 byte
Interlace method: 1 byte
</code></pre></div></div>
<p>Let’s look at the corresponding protobuf description:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message IHDR {
required uint32 width = 1; // maps to width
required uint32 height = 2; // maps to height
required uint32 other1 = 3; // maps to bitdepth-colortype-compmethod-filtmethod
required uint32 other2 = 4; // Only 1 byte used. (maps to interlacemethod)
}
</code></pre></div></div>
<p>As we can see, the protobuf description is “serialized” into fields of type <code class="language-plaintext highlighter-rouge">uint32</code> (4-byte sequences).
If you were to closely match the original IHDR spec, the proto-spec would look as follows (note the break-down of fields such as <code class="language-plaintext highlighter-rouge">bit_depth</code>, <code class="language-plaintext highlighter-rouge">color_type</code> etc.):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message IHDR {
required uint32 width = 1;
required uint32 height = 2;
enum bit_depth {
BD_ONE = 1;
BD_TWO = 2;
BD_FOUR = 4;
BD_EIGHT = 8;
BD_SIXTEEN = 16;
BD_MAX = 255; // BYTE_MAX
};
enum color_type {
CT_ZERO = 0;
CT_TWO = 2;
CT_THREE = 3;
CT_FOUR = 4;
CT_SIX = 6;
CT_MAX = 255; // BYTE_MAX
};
...
};
</code></pre></div></div>
<p>Although the <code class="language-plaintext highlighter-rouge">BYTE_MAX</code> option is not part of the specification, I have intentionally added it so that we make the mutator explore specific corner cases. This is hacky, I admit. Who is to say whether or not <code class="language-plaintext highlighter-rouge">200</code> is a better corner-case than <code class="language-plaintext highlighter-rouge">255</code>?</p>
<h4 id="idat">IDAT</h4>
<p>The IDAT chunk contains compressed image data. This means (in LP terms) its spec looks like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message IDAT {
required bytes data = 1;
}
</code></pre></div></div>
<p>It’s an opaque byte stream; the mutator is free to synthesize whatever byte sequence it wants to fuzz an IDAT chunk.</p>
<h4 id="iend">IEND</h4>
<p>Here’s how the PNG spec defines IEND</p>
<blockquote>
<p>The IEND chunk must appear LAST. It marks the end of the PNG datastream. The chunk’s data field is empty.</p>
</blockquote>
<p>Essentially, it is a placeholder with no data that signifies the end of a PNG image.</p>
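<p>Because IEND is constant, the converter does not even need a protobuf message for it; the whole chunk is twelve fixed bytes (you can spot them at the tail of the valid PNG hexdump later in this post):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// The complete IEND chunk: a 4-byte length of zero, the ASCII type
// "IEND", no data, and the CRC-32 of the type field
const unsigned char iend[] = {0x00, 0x00, 0x00, 0x00,  // length = 0
                              0x49, 0x45, 0x4e, 0x44,  // "IEND"
                              0xae, 0x42, 0x60, 0x82}; // CRC-32 of "IEND"
</code></pre></div></div>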
<h3 id="the-lp-compiler">The LP compiler</h3>
<p>The LP compiler is called <code class="language-plaintext highlighter-rouge">protoc.</code> <code class="language-plaintext highlighter-rouge">protoc</code> compiles a Protobuf spec. (<code class="language-plaintext highlighter-rouge">.proto</code> file) into language bindings.
At the moment, the following language bindings are supported by the compiler: C++, Java, and Python.
In <a href="https://developers.google.com/protocol-buffers/docs/reference/other">these notes</a>, it appears that support for more languages is an ongoing effort.
Invoking the compiler is quite simple, as you can see <a href="https://github.com/google/oss-fuzz/pull/2048">here</a>, all you need to do is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rm -rf genfiles && mkdir genfiles && LPM/external.protobuf/bin/protoc png_fuzz_proto.proto --cpp_out=genfiles
</code></pre></div></div>
<p>This is</p>
<ul>
<li>Creating a fresh <code class="language-plaintext highlighter-rouge">genfiles</code> directory where C/C++ bindings will be stored</li>
<li>Invoking the <code class="language-plaintext highlighter-rouge">protoc</code> compiler that is available from the LPM repo against the PNG LP description we spoke about in the previous section of this blog</li>
<li>Explicitly asking the compiler to generate C++ bindings</li>
</ul>
<p>Essentially, what this step does is to create a set of C++ header/source files that may be included/linked against by the fuzzer test harness.
The generated header/C++ files offer a simple API to access the underlying raw data behind LPM fields.</p>
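<p>For instance (a sketch; the accessor names follow protoc’s standard naming for the <code class="language-plaintext highlighter-rouge">IHDR</code> message shown earlier in this post), the generated API lets the harness read fuzzer-chosen field values directly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include "png_fuzz_proto.pb.h" // protoc-generated header

// Read fuzzer-chosen values out of an IHDR message via generated accessors;
// protoc also generates setters, e.g. set_width(), on mutable messages
uint64_t PixelCount(const IHDR& ihdr)
{
    return static_cast<uint64_t>(ihdr.width()) * ihdr.height();
}
</code></pre></div></div>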
<h3 id="lp-to-native-format-converter">LP to native format converter</h3>
<p>Why do we need a converter in the first place?
Here’s the thing: The LPM generates LPM formatted input that, for PNG, looks like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># xxd C/002d3dd31b1bc41601c0e5d652b97f6599b23ba6
00000000: 6968 6472 207b 0a20 2077 6964 7468 3a20 ihdr {. width:
00000010: 300a 2020 6865 6967 6874 3a20 300a 2020 0. height: 0.
00000020: 6274 3a20 4244 5f4f 4e45 0a20 2063 743a bt: BD_ONE. ct:
00000030: 2043 545f 5448 5245 450a 2020 636d 3a20 CT_THREE. cm:
00000040: 434d 5f4d 4158 0a20 2066 6d3a 2046 4d5f CM_MAX. fm: FM_
00000050: 4d41 580a 2020 693a 2049 5f4d 4158 0a7d MAX. i: I_MAX.}
00000060: 0a
</code></pre></div></div>
<p>What we actually need when debugging is a valid PNG file that looks like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># xxd a.png
00000000: 8950 4e47 0d0a 1a0a 0000 000d 4948 4452 .PNG........IHDR
00000010: 0000 0000 0000 0000 0103 ffff ff01 fbc8 ................
00000020: 4300 0000 0049 454e 44ae 4260 82 C....IEND.B`.
</code></pre></div></div>
<p>As you can see, the LPM-generated file holds a bunch of <code class="language-plaintext highlighter-rouge">key:value</code> pairs in serialized form. These need to be parsed so that we can construct a serialized form of the <code class="language-plaintext highlighter-rouge">values</code> in PNG format. This is precisely the job of the converter.</p>
<p>In code terms, the converter is an integral part of the test harness itself (see next section).
The fuzzer harness, among other things, accepts an LPM-formatted input, converts it to a valid PNG byte stream, and feeds it to the fuzzer entry-point API.</p>
<h3 id="fuzzer-test-harness">Fuzzer test harness</h3>
<p>Here’s a gist of the test harness (written by KCC; I’m embedding it via a gist because I’ve not yet found a nifty way to directly embed GH files in GH pages) for us to break down</p>
<script src="https://gist.github.com/79fb0771418c1929b6c0d6b22bf3550a.js"> </script>
<p>Let’s look at the includes first:</p>
<ul>
<li>some standard stuff happening with <code class="language-plaintext highlighter-rouge"><string></code> etc.</li>
<li><code class="language-plaintext highlighter-rouge">zlib.h</code> is needed because (quoting the original spec.)</li>
</ul>
<blockquote>
<p>At present, only compression method 0 (deflate/inflate compression with a sliding window of at most 32768 bytes) is defined. All standard PNG images must be compressed with this scheme.
Deflate-compressed datastreams within PNG are stored in the “zlib” format</p>
</blockquote>
<ul>
<li><code class="language-plaintext highlighter-rouge">#include "libprotobuf-mutator/src/libfuzzer/libfuzzer_macro.h"</code>: This defines the <code class="language-plaintext highlighter-rouge">DEFINE_PROTO_FUZZER</code> that seems to be overridden (?) in the test harness. TBH, I dunno what’s happening here.</li>
<li><code class="language-plaintext highlighter-rouge">#include "png_fuzz_proto.pb.h"</code>: This is the <code class="language-plaintext highlighter-rouge">protoc</code> generated C++ binding header file for our LP spec.</li>
</ul>
<p>Past the header inclusions, you see several utility functions</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">WriteInt</code> writes an integer in big-endian (network byte order) format <a href="http://www.libpng.org/pub/png/book/chapter13.html">as required by the PNG spec</a></li>
<li><code class="language-plaintext highlighter-rouge">WriteByte</code> simply writes a byte</li>
<li><code class="language-plaintext highlighter-rouge">compress</code> performs zlib compression of chunk data. This is required for IDAT chunks especially</li>
<li><code class="language-plaintext highlighter-rouge">WriteChunk</code> writes a specified PNG chunk</li>
<li><code class="language-plaintext highlighter-rouge">ProtoToPng</code> is where a proto is converted to a <code class="language-plaintext highlighter-rouge">std::string</code> that contains the fuzzed PNG’s raw data (see previous section). This is where the LPM to native format conversion (see previous section) is happening.</li>
<li><code class="language-plaintext highlighter-rouge">FuzzPNG</code> is the real test harness: This function feeds fuzzed raw PNG data to the underlying PNG API</li>
</ul>
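<p>To make one of these concrete (a sketch of what <code class="language-plaintext highlighter-rouge">WriteInt</code> plausibly does; the gist above has the authoritative version), big-endian serialization boils down to emitting the most significant byte first:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <sstream>

// Serialize a 32-bit value in big-endian (network) byte order, as PNG requires
static void WriteInt(std::stringstream &out, uint32_t x)
{
    out.put(static_cast<char>((x >> 24) & 0xff));
    out.put(static_cast<char>((x >> 16) & 0xff));
    out.put(static_cast<char>((x >> 8) & 0xff));
    out.put(static_cast<char>(x & 0xff));
}
</code></pre></div></div>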
<p>The <code class="language-plaintext highlighter-rouge">FuzzPNG</code> function comes from the libpng source repo (note the <code class="language-plaintext highlighter-rouge">-DLLVMFuzzerTestOneInput=FuzzPNG</code> rename in the first command below), which is why the harness is compiled and linked like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$CXX $CXXFLAGS -c -DLLVMFuzzerTestOneInput=FuzzPNG libpng/contrib/oss-fuzz/libpng_read_fuzzer.cc -I libpng
$CXX $CXXFLAGS png_proto_fuzzer_example.cc libpng_read_fuzzer.o genfiles/png_fuzz_proto.pb.cc \
-I genfiles -I. -I libprotobuf-mutator/ -I LPM/external.protobuf/include \
-lz \
LPM/src/libfuzzer/libprotobuf-mutator-libfuzzer.a \
LPM/src/libprotobuf-mutator.a \
LPM/external.protobuf/lib/libprotobuf.a \
libpng/.libs/libpng16.a \
$LIB_FUZZING_ENGINE \
-o $OUT/png_proto_fuzzer_example
</code></pre></div></div>
<p>Were you to write the FuzzPNG function yourself, it would probably <a href="https://chromium.googlesource.com/chromium/src/+/master/testing/libfuzzer/fuzzers/libpng_read_fuzzer.cc">look like this</a>. Looks like standard stuff if you were to read <a href="http://www.libpng.org/pub/png/book/chapter13.html">Chapter 13 of the PNG book</a>.</p>
<h3 id="conclusion">Conclusion</h3>
<p>In this post, we explored</p>
<ul>
<li>What LibProtobufMutator is and how one can write an LP spec</li>
<li>How LP spec can help us write more targeted fuzzers</li>
<li>How the whole LP/LPM/libFuzzer setup is wired together</li>
</ul>
<p>Overall, I feel that LP-based fuzzing holds promise for testing language parsers, compilers, interpreters etc.
The challenge is to obtain an understanding of the underlying language well enough to be able to (1) write a spec for it and (2) write a proper LP-to-native format converter.</p>
<p>Although I think writing these things is not a big deal, it definitely takes dedicated time and effort.
This means, unless you draw benefits from such effort you are more likely to just download a corpus from the Internet and start fuzzing.
It’s essentially a cost-benefit trade-off.</p>
<p>In an upcoming post, I plan to compare a vanilla (non-specification) fuzzer with an LP-based fuzzer, with the hope that such a comparison sheds light on the actual benefits of LP-based fuzzing. That’s all folks!</p>
<h1>Quick Dive into Trail of Bits’ Slither (2018-11-05)</h1>
<h2 id="intro">Intro</h2>
<p><a href="https://github.com/trailofbits/slither">Slither</a> is a static analyzer that has been developed by Trail of Bits to help smart contract developers find bugs in their code.
In this post, I’ll try to get my hands dirty with Slither so you don’t have to.
Moreover, having a background writing static analysis tools myself, I’m
curious how Slither is architected and I’m excited at the prospect of writing
a detector for it…one day.</p>
<p>This post attempts to understand the work-flow of Slither.
The target audience for this post is folks who</p>
<ul>
<li>would like to understand the architecture/work-flow of Slither</li>
<li>would like to start to write a detector (like me) but don’t know where to
start</li>
</ul>
<p>Treat this as a (shoddy) introduction to Slither that, at the
time of writing, addresses only the author’s curiosity. Haha.</p>
<p>First things first, Slither itself is written in <code class="language-plaintext highlighter-rouge">python3</code>, yaay!
One of the first things slither does is to use the solidity compiler (<code class="language-plaintext highlighter-rouge">solc</code>
binary) to obtain the AST of the program to be analyzed.
Therefore, before I proceed, let me install the Solidity compiler.
Since most of the test contracts in the slither code base are targeted at
compiler version 0.4.24, I chose to pick it up from the official GitHub page
<a href="https://github.com/ethereum/solidity/releases/tag/v0.4.24">here</a>.
One could also fetch the officially distributed compiler for your Ubuntu
distribution like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo add-apt-repository ppa:ethereum/ethereum
sudo apt-get update
sudo apt-get install solc
</code></pre></div></div>
<h2 id="try-slither-out">Try Slither Out</h2>
<p>After installing the <code class="language-plaintext highlighter-rouge">solc</code> binary, I set up a python IDE to debug slither.
Essentially, the idea is to use a good debugger (I’m using Jet Brain’s PyCharm) to step through slither code and understand the steps involved in analyzing smart contracts.</p>
<p>The invocation that I am using for debugging is the elementary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ slither <name_of_contract>.sol
</code></pre></div></div>
<p>What this is supposed to do is analyze the source code of the contract and spit out bug reports, like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO:Detectors: Uninitialized state variable in ../solidity/001_name_references.sol, Contract: test, Variable: variable, Used in ['f']
INFO:Detectors: Contract 'test' is not in CapWords
INFO:Detectors: Parameter '' is not in mixedCase, Contract: '', Function: 'test''
</code></pre></div></div>
<p>What you’d notice when you run slither against buggy code are the following things</p>
<ul>
<li>The smart contract to be analyzed needs to be compilable but not
necessarily runnable</li>
<li>Bug reports are spit out on <code class="language-plaintext highlighter-rouge">stderr</code></li>
<li>Each bug report is prefixed with the string <code class="language-plaintext highlighter-rouge">INFO:Detectors:</code></li>
</ul>
<p>But this is too high level; let’s step through slither at an easier pace.</p>
<h3 id="entry-point">Entry point</h3>
<p>The entry point for <code class="language-plaintext highlighter-rouge">slither</code> is the main function of course.
This function is defined in a python file called <code class="language-plaintext highlighter-rouge">__main__.py</code> in the slither distribution.
The very first thing this main function does is to fetch all <code class="language-plaintext highlighter-rouge">detectors</code> and <code class="language-plaintext highlighter-rouge">printers</code>.
Each <code class="language-plaintext highlighter-rouge">detector</code> object in slither detects a class of bugs, and each <code class="language-plaintext highlighter-rouge">printer</code> object logs useful information about the program under analysis e.g., its call graph, what a function is trying to do (so called function summary) etc.</p>
<h3 id="detectors">Detectors</h3>
<p>To get a sense of the kind of bugs Slither detects, let’s look at the default set of detectors that Slither provides.
Here’s an exhaustive list at the time of writing</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UninitializedStateVarsDetection,
ConstantPragma,
OldSolc,
Reentrancy,
UninitializedStorageVars,
LockedEther,
ArbitrarySend,
Suicidal,
UnusedStateVars,
TxOrigin,
Assembly,
LowLevelCalls,
NamingConvention,
ConstCandidateStateVars,
ExternalFunction
</code></pre></div></div>
<p>That makes it a total of 15 detectors for as many bug classes.
A brief digression: Until we have a formalization of bug classes as in the
C/C++ space (see the <a href="https://cwe.mitre.org/">common weakness enumeration</a> project), I’d expect
bug classification for Solidity to be largely ad-hoc.</p>
<p>Let’s dive deep into an elementary bug class to see how bug detection is
implemented.
The <code class="language-plaintext highlighter-rouge">Backdoor</code> detector (unlisted, but available in the source) is a
demo detector that makes for a good starting example.
Here’s the <code class="language-plaintext highlighter-rouge">backdoor.sol</code> contract from the slither code base
that the backdoor detector is meant to flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pragma solidity 0.4.24;
contract C{
function i_am_a_backdoor() public{
selfdestruct(msg.sender);
}
}
</code></pre></div></div>
<p>Evidently, this contract</p>
<ul>
<li>defines a function that calls the <code class="language-plaintext highlighter-rouge">selfdestruct</code> method on the msg sender</li>
</ul>
<p>What’s the <code class="language-plaintext highlighter-rouge">selfdestruct</code> method?</p>
<blockquote>
<p>The only possibility that code is removed from the blockchain is when a contract at that address performs the selfdestruct operation. The remaining Ether stored at that address is sent to a designated target and then the storage and code is removed from the state.</p>
</blockquote>
<p>In this intentionally buggy piece of code:</p>
<ul>
<li>When some other contract (or account) calls <code class="language-plaintext highlighter-rouge">C.i_am_a_backdoor()</code>, contract <code class="language-plaintext highlighter-rouge">C</code> self-destructs: its code and storage are
removed from the blockchain state, and its remaining Ether is sent to
<code class="language-plaintext highlighter-rouge">msg.sender</code>, i.e., the caller of <code class="language-plaintext highlighter-rouge">C.i_am_a_backdoor()</code>.</li>
<li>Because the function is public and unguarded, anyone can destroy the contract and drain its balance, which is what makes it a backdoor</li>
</ul>
<p>So, let’s see what happens when Slither analyzes this piece of code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO:Detectors: Backdoor function found in C.i_am_a_backdoor
INFO:Detectors: Suicidal function in /home/bhargava/work/github/slither/tests/backdoor.sol Contract: C, Function: i_am_a_backdoor
INFO:Detectors: Function 'i_am_a_backdoor' is not in mixedCase, Contract: 'C'
INFO:Detectors: Public function in /home/bhargava/work/github/slither/tests/backdoor.sol Contract: C, Function: i_am_a_backdoor should be declared external
INFO:Slither:/home/bhargava/work/github/slither/tests/backdoor.sol analyzed (1 contracts), 4 result(s) found
</code></pre></div></div>
<p>Voila, the backdoor function is flagged and reported to the user (see first
line of report).
We will ignore the other bugs flagged by other detectors since our purpose is
to get a general sense of how detection works, not understand the specifics
of a particular detector.
So, how does the detection work under the hood?</p>
<p>Well, to begin with, any static analyzer needs to “understand” the code being
analyzed.
What needs to be understood is essentially: “What is this program trying to
do? Is there a bug in it?”.
These two questions hinge on semantic program analysis which is a complex
problem.</p>
<p>We can begin to get a semantic understanding of a program by first looking at
its syntax tree.
A syntax tree is a tree: A directed acyclic graph that remains acyclic even
if directionality is removed.
The nodes of the tree are syntactic elements of the programming language in
which the analyzed program is written.
Here’s a snippet of an actual AST (as a JSON string) of the backdoor program
shown above.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"attributes"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"absolutePath"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"tests/backdoor.sol"</span><span class="p">,</span><span class="w">
</span><span class="nl">"exportedSymbols"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"C"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="mi">11</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"children"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"attributes"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"literals"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="s2">"solidity"</span><span class="p">,</span><span class="w">
</span><span class="s2">"0.4"</span><span class="p">,</span><span class="w">
</span><span class="s2">".24"</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"id"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"PragmaDirective"</span><span class="p">,</span><span class="w">
</span><span class="nl">"src"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"0:23:0"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"attributes"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"baseContracts"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="kc">null</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"contractDependencies"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="kc">null</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"contractKind"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"contract"</span><span class="p">,</span><span class="w">
</span><span class="nl">"documentation"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
</span><span class="nl">"fullyImplemented"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"linearizedBaseContracts"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="mi">11</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"name"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"C"</span><span class="p">,</span><span class="w">
</span><span class="nl">"scope"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="err">}</span><span class="w">
</span></code></pre></div></div>
<p>Hope this gives you a sense of the AST.
The AST is essentially a dictionary object with certain top-level attributes
and a list of children.
For example, one of the children is the <code class="language-plaintext highlighter-rouge">pragma</code> directive on line 1 of
<code class="language-plaintext highlighter-rouge">backdoor.sol</code>.
This child contains an ID, mapping to the source file, and a list of string
literals it holds together.
In the following, I briefly describe what happens inside Slither even before
bug detection is attempted.</p>
<h3 id="step-1-obtain-ast">Step 1: Obtain AST</h3>
<p>The first thing that slither does is <a href="https://github.com/trailofbits/slither/blob/master/slither/slither.py#L30">obtain the AST</a> of the analyzed
program in the form of a JSON string using the Solidity compiler, <code class="language-plaintext highlighter-rouge">solc</code>.
<code class="language-plaintext highlighter-rouge">solc</code> supports this off-the-shelf with such an
invocation as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./solc tests/backdoor.sol --ast-json --allow-paths .
</code></pre></div></div>
<h3 id="step-2-parse-ast-into-cfg">Step 2: Parse AST into CFG</h3>
<p>Once the AST (JSON string) has been obtained, the next thing Slither does is
to parse it.
This entails <a href="https://github.com/trailofbits/slither/blob/master/slither/slither.py#L34">walking the JSON representation of the AST</a>.
The AST parsing in Slither is quite sophisticated, not something I can
describe succinctly here.</p>
<p>The main idea behind parsing the AST is to create a (cyclic) directed graph
that shows control flow in the analyzed smart contract.
This is necessary because the AST itself is not adequate to grasp control-flow.</p>
<p>The control-flow graph is created at function granularity, i.e.,
each function in the analyzed smart contract maps to a corresponding CFG.
You can find the function that does the AST parsing/CFG creation <a href="https://github.com/trailofbits/slither/blob/master/slither/solc_parsing/declarations/function.py#L614">here</a>.</p>
<h3 id="step-3-drop-to-slithir">Step 3: Drop to Slithir</h3>
<p>Once the CFG has been created for all functions in the smart contract under
analysis, Slither drops the AST/CFG representation of the analyzed smart
contract into an <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">SSA-based</a> intermediate representation called Slithir.
By “dropping”, I mean conversion from a higher-level program abstraction
(AST/CFG) to a lower-level program abstraction (Slithir).
But why?</p>
<p>I can only hazard the following guesses:</p>
<ul>
<li>Analysis based on an IR removes the dependency on the PL in which a smart
contract is written. If tomorrow, a new smart contract PL is invented,
Slither can still support it by adding a parser/converter to IR.</li>
<li>SSA-based IR makes certain kinds of analysis simpler (see section
called “Benefits” in the <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">SSA wiki article</a>)</li>
</ul>
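<p>To make the SSA idea concrete, here is a tiny before/after illustration (mine, in C-like pseudocode, not Slithir syntax); in SSA form every variable is assigned exactly once, so each use maps to exactly one definition:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Original         // SSA form
x = 1;              x_1 = 1;
x = x + 2;          x_2 = x_1 + 2;
y = x;              y_1 = x_2;
</code></pre></div></div>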
<h3 id="step-4-detect-backdoor">Step 4: Detect Backdoor</h3>
<p>Steps 1–3 are performed as the <a href="https://github.com/trailofbits/slither/blob/master/slither/__main__.py#L34">Slither python object is created</a>.
Once the analysis infrastructure is ready (AST,CFG,Slithir), detectors are
processed sequentially.
Each detector encodes the “business logic” of detection for the bug class
that it is meant to detect.</p>
<p>So, let’s see what’s happening in the sample backdoor detector.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Backdoor</span><span class="p">(</span><span class="n">AbstractDetector</span><span class="p">):</span>
<span class="s">"""
Detect function named backdoor
"""</span>
<span class="n">ARGUMENT</span> <span class="o">=</span> <span class="s">'backdoor'</span> <span class="c1"># slither will launch the detector with slither.py --mydetector
</span> <span class="n">HELP</span> <span class="o">=</span> <span class="s">'Function named backdoor (detector example)'</span>
<span class="n">IMPACT</span> <span class="o">=</span> <span class="n">DetectorClassification</span><span class="p">.</span><span class="n">HIGH</span>
<span class="n">CONFIDENCE</span> <span class="o">=</span> <span class="n">DetectorClassification</span><span class="p">.</span><span class="n">HIGH</span>
<span class="k">def</span> <span class="nf">detect</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">ret</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">contract</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">slither</span><span class="p">.</span><span class="n">contracts_derived</span><span class="p">:</span>
<span class="c1"># Check if a function has 'backdoor' in its name
</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">contract</span><span class="p">.</span><span class="n">functions</span><span class="p">:</span>
<span class="k">if</span> <span class="s">'backdoor'</span> <span class="ow">in</span> <span class="n">f</span><span class="p">.</span><span class="n">name</span><span class="p">:</span>
<span class="c1"># Info to be printed
</span> <span class="n">info</span> <span class="o">=</span> <span class="s">'Backdoor function found in {}.{}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">contract</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
<span class="c1"># Print the info
</span> <span class="bp">self</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">info</span><span class="p">)</span>
<span class="c1"># Add the result in ret
</span> <span class="n">source</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">source_mapping</span>
<span class="n">ret</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">'vuln'</span><span class="p">:</span> <span class="s">'backdoor'</span><span class="p">,</span> <span class="s">'contract'</span><span class="p">:</span> <span class="n">contract</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="s">'sourceMapping'</span> <span class="p">:</span> <span class="n">source</span><span class="p">})</span>
<span class="k">return</span> <span class="n">ret</span>
</code></pre></div></div>
<p>You’ll notice that the business logic of bug detection is quite concise.
The detection logic resides in the <code class="language-plaintext highlighter-rouge">detect</code> method of the <code class="language-plaintext highlighter-rouge">Backdoor</code> object
that implements the <code class="language-plaintext highlighter-rouge">AbstractDetector</code> interface.
To my mind, this is the python equivalent of a <a href="https://llvm.org/devmtg/2012-11/Zaks-Rose-Checker24Hours.pdf">Clang Static Analyzer
checker</a>.</p>
<p>Everything that a detector wants to know about the program is contained in
the <code class="language-plaintext highlighter-rouge">self.slither</code> object.
This object contains the following fields:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">contracts_derived</code>: This field holds the analyzed (most derived) contracts; each contract object contains, among other fields:
<ul>
<li><code class="language-plaintext highlighter-rouge">_data</code>: AST obtained from the Solidity compiler</li>
<li><code class="language-plaintext highlighter-rouge">functions</code>: CFG of all functions in the contract</li>
<li><code class="language-plaintext highlighter-rouge">slither</code>: Slithir representation of the contract</li>
</ul>
</li>
</ul>
<p>The detector uses this information to decide whether to flag a bug or not.
A detector need only use the information that is necessary for the bug
detection logic.
For example, here’s what the backdoor detector is doing</p>
<ul>
<li>Iterate over all functions in the analyzed contract
<ul>
<li>If a function is called “backdoor”
<ul>
<li>Flag a bug saying “backdoor found”</li>
</ul>
</li>
</ul>
</li>
<li>return a nicely formatted bug diagnostics object (list of dictionaries,
each dictionary being a distinct bug report)</li>
</ul>
<p>In other words, the <code class="language-plaintext highlighter-rouge">backdoor</code> detector is only using the <code class="language-plaintext highlighter-rouge">function.name</code>
field in the function’s CFG to flag a bug.
Of course, this is cheating because you can’t simply conclude that a function is
a backdoor just because it is named one.
However, the reason I picked up this specific detector is because it is meant
as an introduction to writing detectors.</p>
<p>In the real-world, you’d do some analysis on the IR (e.g., check if the
analyzed function makes a call to the <code class="language-plaintext highlighter-rouge">selfdestruct</code> function) before
concluding that it is indeed a backdoor.
Perhaps, this entails listing all calls made by a function and checking if
<code class="language-plaintext highlighter-rouge">selfdestruct</code> happens to be one of them.</p>
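<p>For instance, here is a minimal sketch of what such a <code class="language-plaintext highlighter-rouge">detect</code> method could look like, modeled on the backdoor detector above. The <code class="language-plaintext highlighter-rouge">solidity_calls</code> field is an assumption based on my reading of the code, so treat this as illustrative rather than as Slither’s canonical way of doing it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def detect(self):
    ret = []
    for contract in self.slither.contracts_derived:
        for f in contract.functions:
            # List all built-in calls made by f and check whether
            # selfdestruct is one of them (field name is an assumption)
            calls = [str(c) for c in f.solidity_calls]
            if any('selfdestruct' in c for c in calls):
                self.log('Possible backdoor (selfdestruct) in {}.{}'.format(
                    contract.name, f.name))
                ret.append({'vuln': 'suicidal',
                            'contract': contract.name,
                            'sourceMapping': f.source_mapping})
    return ret
</code></pre></div></div>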
<h2 id="outro">Outro</h2>
<p>So that was a quick dive into Slither.
We laid out the workflow of Slither: (1) taking the AST of a smart
contract as input, (2) producing its CFG, (3) reducing this to an SSA-based
IR, and (4) finally, detecting bugs based on program information contained in
the IR.</p>
<p>If there is some specific aspect of Slither you’d want to know more about
that this post didn’t cover, let me know.
When I have the time, I’d be more than happy to write a part 2 of this post.
That’s all folks.</p>IntroFuzzing the Solidity Compiler2018-10-20T00:00:00+00:002018-10-20T00:00:00+00:00/2018/10/20/Fuzzing-Solidity-Compiler<h2 id="intro">Intro</h2>
<p>This post describes related work in the field of compiler fuzzing, the motivation for fuzzing the Solidity compiler, how to fuzz it, and the kind of bugs it helps find.
In the final section of this post, I briefly discuss what could be done to target more interesting code.</p>
<p>First things first.
Solidity is a high-level programming language for creating smart contracts.
The <a href="https://github.com/ethereum/solidity">solidity compiler</a> is the official compiler for programs (aka smart contracts) written in the Solidity programming language.
In the context of this post, Solidity means the compiler implementation and not the language itself.</p>
<p>Disclaimer: The bugs disclosed in this post have been reported upstream. More importantly, the bugs are benign typing errors that have no security implications to the best of my knowledge.
Therefore, I see no harm in disclosing them.
If this post inspires you to fuzz Solidity and you happen to find a security-critical bug, please consider reporting it to the <a href="https://bounty.ethereum.org/">Ethereum bounty program</a>.</p>
<h2 id="related-work">Related Work</h2>
<p>Folks have fuzzed</p>
<ul>
<li>Ethereum VM implementations e.g., <a href="https://github.com/trailofbits/echidna">this</a>, <a href="https://github.com/holiman/evmfuzz">that</a></li>
<li>Applications (smart contracts) e.g., <a href="https://dl.acm.org/citation.cfm?id=3238177">this</a></li>
</ul>
<p>The compiler, Solidity, has garnered less attention.
Solidity falls in between applications and the EVM.
It compiles applications to EVM byte code that is executed by the underlying EVM implementation.</p>
<p>Fuzzing compilers is nothing new.
For example, the <a href="https://embed.cs.utah.edu/csmith/">CSmith</a> project is geared towards finding bugs in C compilers.
<a href="https://llvm.org/devmtg/2017-10/slides/Serebryany-Structure-aware%20fuzzing%20for%20Clang%20and%20LLVM%20with%20libprotobuf-mutator.pdf">Kostya Serebryany’s</a> talk at llvm-dev meeting describes how to intelligently fuzz compilers using a technique he calls “structure aware fuzzing”.
His main observation is that fuzzing compilers with generic mutators (e.g., bit flips, add/remove bytes) is less likely to generate parseable programs.
So his talk is a call for mutators that understand the structure of input accepted by the program e.g., the structure of a C program.
This is an interesting idea for fuzzing Solidity as well, which I briefly discuss in the final section of this post.</p>
<h2 id="motivation">Motivation</h2>
<p>Some reasons for fuzzing the Solidity compiler are:</p>
<ul>
<li>Test compiler stability e.g., crash freedom</li>
<li>Test compiler correctness e.g., code generation</li>
</ul>
<p>I will add one more reason that drew me to fuzzing Solidity</p>
<ul>
<li>Test the de-facto Solidity specification</li>
</ul>
<p>Here, I refer to the following statement sourced from a paper titled “Defining the Ethereum Virtual Machine for Interactive Theorem Provers” by Y. Hirai (<strong>emphasis mine</strong>).</p>
<blockquote>
<p>Although ultimately all Ethereum smart contracts are deployed as EVM bytecode, the bytecode is rarely directly written.
The most popular programming language Solidity has a rich syntax but <strong>no specification</strong>. <strong>The only definition of Solidity is the Solidity compiler implementation</strong>, which compiles Solidity programs into EVM bytecode.</p>
</blockquote>
<p>To me, this implies:</p>
<ul>
<li>Bugs in Solidity may impact correctness of Solidity-written smart contracts</li>
<li>Bugs in Solidity may shed light on bugs in Solidity language design</li>
</ul>
<p>I don’t think Solidity is the only language that does not have a specification.
Actually, I’m pretty sure very few programming languages have a formal spec.
So, I’m not sure these reasons are specific to Solidity.
Perhaps, the most important reason to fuzz the Solidity compiler is (quoting Y. Hirai again)</p>
<blockquote>
<p>A deployed Ethereum smart contract is public under adversarial scrutiny, and the code is not
updatable. Most applications (auctions, prediction markets, identity/reputation
management etc.) involve smart contracts managing funds or authenticating external
entities. In this environment, the code should be trustworthy.</p>
</blockquote>
<p>In the worst case, bugs in Solidity could lead to unintended code execution in the context of security-critical applications.
However, the bugs discussed in this post are benign so treat my previous statement as FUD.</p>
<h2 id="test-harness">Test harness</h2>
<p>Fortunately for me, the test harness that was used for fuzzing is maintained in the source repo.
It is my understanding that Solidity is routinely fuzzed using afl-fuzz.
So, kudos to the Solidity team for having integrated fuzzing into their SDLC.</p>
<p>Here’s what the test harness looks like at a high level:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int main()
{
...
// data, size are sourced from stdin
string input(reinterpret_cast<const char*>(data), size);
testCompiler(input);
}
</code></pre></div></div>
<p>Essentially, it:</p>
<ul>
<li>Takes a binary byte stream from stdin</li>
<li>converts this into a string
<ul>
<li>The string is the solidity program that is fed to the compiler</li>
</ul>
</li>
<li>compiles the string (solidity program)</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">testCompiler</code> is a utility function that eventually makes a call to the <code class="language-plaintext highlighter-rouge">compileStandard</code> API exposed by the solidity compiler library called <code class="language-plaintext highlighter-rouge">libsolc</code>.
The nifty thing about this API interface is that it does I/O via JSON objects.
This means the <code class="language-plaintext highlighter-rouge">compileStandard</code> API accepts input via a JSON object and spits out another JSON object as output.
How is the input string (solidity program) serialized into a JSON object you ask?</p>
<p>Simple: the fuzzed input goes into a field called <code class="language-plaintext highlighter-rouge">sources[""]["content"]</code>. Here’s a sample input accepted by <code class="language-plaintext highlighter-rouge">compileStandard</code></p>
<script src="https://gist.github.com/30193d6a3ae438043821d04ff3f863dd.js"> </script>
<p>The other fields in this JSON object are targeted at configuring compilation parameters such as optimization level, compiler output formatting etc.
The output produced by the API is rather long but very detailed, so let’s overlook that for now.</p>
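<p>For concreteness, here is a minimal Python sketch of how one could assemble such an input. The gist above is the authoritative sample; the <code class="language-plaintext highlighter-rouge">settings</code> fields shown here are illustrative knobs, not an exhaustive schema:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

# The fuzzed bytes land in sources[""]["content"]
contract = 'contract C { function f() public pure returns (uint) { return 42; } }'

std_input = {
    "language": "Solidity",
    "sources": {"": {"content": contract}},
    # Illustrative compilation parameters (optimization, output selection)
    "settings": {
        "optimizer": {"enabled": False},
        "outputSelection": {"*": {"*": ["evm.bytecode"]}},
    },
}
print(json.dumps(std_input, indent=2))
</code></pre></div></div>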
<h2 id="fuzzing">Fuzzing</h2>
<p>The fuzzing itself is quite straightforward. Here’s what you do (tested on Ubuntu 18.04):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Fetch dependency
$ sudo apt install libboost-all-dev
// Fetch solidity
$ git clone https://github.com/ethereum/solidity.git
$ cd solidity && mkdir build
// Build, turning off SMT solver support
$ cd build && cmake -DUSE_Z3=OFF -DUSE_CVC4=OFF ..
$ make solfuzzer -j
// Populate afl-in with seeds
$ mkdir afl-in
$ find . -type f -name "*.sol" -exec cp {} -t afl-in \;
// Fuzz
$ afl-fuzz -m none -i afl-in -o afl-out -- solfuzzer
</code></pre></div></div>
<p>This:</p>
<ul>
<li>Installs boost libs required to compile solidity (and the fuzzer)</li>
<li>Fetches, and compiles the solidity fuzzer</li>
<li>Uses solidity contracts present in the source repo as fuzzing seeds</li>
<li>Runs afl-fuzz on the fuzzing binary</li>
</ul>
<p>The fuzzing itself is very slow (under 100 execs/s).
However, it already helped find a couple of type-related bugs, one of which was already known and the other new.</p>
<h2 id="results">Results</h2>
<h3 id="bug-1-unexpected-function-type-conversion">Bug 1: Unexpected function type conversion</h3>
<p>Here’s the <a href="https://github.com/ethereum/solidity/issues/5279">new bug</a> that fuzzing discovered</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./solc issue_5279.sol
Internal compiler error during compilation:
/home/bhargava/work/github/solidity/libsolidity/codegen/CompilerUtils.cpp(1020): Throw in function void dev::solidity::CompilerUtils::convertType(const dev::solidity::Type&, const dev::solidity::Type&, bool, bool, bool)
Dynamic exception type: boost::exception_detail::clone_impl<dev::solidity::InternalCompilerError>
std::exception::what: Invalid type conversion requested.
[dev::tag_comment*] = Invalid type conversion requested.
</code></pre></div></div>
<p>tl;dr</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">solc</code> is the solidity compiler binary</li>
<li><code class="language-plaintext highlighter-rouge">issue_5279.sol</code> is the solidity contract (found by fuzzing) that triggers the bug</li>
<li>The bug is an assertion failure that states the cause as <code class="language-plaintext highlighter-rouge">Invalid type conversion requested</code></li>
</ul>
<p>Here’s the full contract that triggers this bug</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>contract C {
function h() pure external {
}
function f() view external returns (bytes4) {
function () external g = this.h;
return g.selector;
}
}
// ----
</code></pre></div></div>
<p>As commented by one of the lead devs of Solidity (<a href="https://github.com/ethereum/solidity/issues/5279#issuecomment-432673495">Chris</a>), here’s the diff contract that does <strong>not</strong> trigger the bug</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>contract C {
function h() pure external {
}
function f() view external returns (bytes4) {
function () pure external g = this.h;
return g.selector;
}
}
</code></pre></div></div>
<p>So, what’s the invalid type conversion that the bug is talking about?</p>
<p>Some basics before we proceed.</p>
<p>What is a pure function?</p>
<blockquote>
<p>Functions can be declared pure in which case they promise not to read from or modify the state.</p>
</blockquote>
<p>What is a view function?</p>
<blockquote>
<p>Functions can be declared view in which case they promise not to modify the state.</p>
</blockquote>
<p>What is an external function?</p>
<blockquote>
<p>External functions are part of the contract interface, which means they can be called from other contracts and via transactions. An external function f cannot be called internally (i.e. f() does not work, but this.f() works). External functions are sometimes more efficient when they receive large arrays of data.</p>
</blockquote>
<p>What is a function selector?</p>
<blockquote>
<p>The first four bytes of the call data for a function call specifies the function to be called. It is the first (left, high-order in big-endian) four bytes of the Keccak (SHA-3) hash of the signature of the function. The signature is defined as the canonical expression of the basic prototype, i.e. the function name with the parenthesised list of parameter types. Parameter types are split by a single comma - no spaces are used.</p>
</blockquote>
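<p>To make the selector concrete, here is a minimal sketch that computes one in Python, assuming the pysha3 package (note that hashlib’s sha3_256 implements NIST SHA-3, whose padding differs from the Keccak-256 used by Ethereum):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sha3  # pysha3 package; provides Keccak-256, not NIST SHA-3

def selector(signature):
    # First four bytes of the Keccak-256 hash of the canonical signature
    return sha3.keccak_256(signature.encode()).hexdigest()[:8]

# The well-known ERC-20 transfer selector
print(selector("transfer(address,uint256)"))  # prints a9059cbb
</code></pre></div></div>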
<p>tl;dr</p>
<ul>
<li>pure means stateless</li>
<li>view means (stateful) read-only</li>
<li>external means just that</li>
<li>a function selector is the first four bytes of the hash of the function’s signature
<ul>
<li>imagine taking a SHA-3 hash of a c++ mangled function and using its first four bytes</li>
</ul>
</li>
</ul>
<p>From these facts, here’s my understanding of the bug.
First, note that the difference between buggy and non-buggy contracts is the following line of buggy code</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function () external g = this.h;
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">this.h</code> is an external <code class="language-plaintext highlighter-rouge">pure</code> (aka stateless) function</li>
<li><code class="language-plaintext highlighter-rouge">g</code> on the other hand is simply an external function</li>
</ul>
<p>Evidently, there is (implicit) type conversion happening here.
If one looks into the faulting code, here’s what one would find:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void CompilerUtils::convertType(
Type const& _typeOnStack,
Type const& _targetType,
bool _cleanupNeeded,
bool _chopSignBits,
bool _asPartOfArgumentDecoding)
{
...
switch(stackType)
...
default:
...
solAssert(_typeOnStack == _targetType, "Invalid type conversion requested.");
...
}
</code></pre></div></div>
<p>The next thing I did was to fire up a gdb instance and debug.
Here’s what I found on line 1020 (the failing assertion)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p _typeOnStack.richIdentifier()
$1 = "t_function_external_pure()returns()"
(gdb) p _targetType.richIdentifier()
$2 = "t_function_external_nonpayable()returns()"
</code></pre></div></div>
<p>The buggy contract has led the compiler to make an invalid type conversion.
But I thought Solidity was a statically typed language in which such errors are picked up at compile time?
Evidently, there is some dynamic typing going on with implicit function casts which led to this bug.</p>
<h3 id="bug-2-variable-declaration-type-error">Bug 2: Variable declaration type error</h3>
<p>This was a <a href="https://github.com/ethereum/solidity/issues/5048">known bug</a> but the fuzzer kinda <a href="https://github.com/ethereum/solidity/issues/5340">rediscovered</a> it in a different context imo.
Here’s the buggy solidity contract that triggers a (dynamic) type error.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library L{struct Nested{n y;}function(function(Nested)external){}}
</code></pre></div></div>
<p>Here’s the error it throws up:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Internal compiler error during compilation:
/home/bhargava/work/github/solidity/libsolidity/ast/Types.cpp(2127): Throw in function virtual bool dev::solidity::StructType::canBeUsedExternally(bool) const
Dynamic exception type: boost::exception_detail::clone_impl<dev::solidity::InternalCompilerError>
std::exception::what:
[dev::tag_comment*] =
</code></pre></div></div>
<p>Let’s fire up gdb and find out what the failing assertion in <code class="language-plaintext highlighter-rouge">Types.cpp</code> on line <code class="language-plaintext highlighter-rouge">2127</code> is all about.</p>
<p>Here’s the buggy code in question
<script src="https://gist.github.com/f9d7c7104c79954fc2d38d8c050620b0.js"> </script></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p var->annotation().type.get()
$3 = (std::__shared_ptr<dev::solidity::Type const, (__gnu_cxx::_Lock_policy)2>::element_type *) 0x0
(gdb) bt
#0 dev::solidity::StructType::canBeUsedExternally (this=0x558db174d750, _inLibrary=false) at /home/bhargava/work/github/solidity/libsolidity/ast/Types.cpp:2127
#1 0x0000558db0774719 in dev::solidity::ReferencesResolver::endVisit (this=0x7ffd332ee5f0, _typeName=...) at /home/bhargava/work/github/solidity/libsolidity/analysis/ReferencesResolver.cpp:210
#2 0x0000558db07ca836 in dev::solidity::FunctionTypeName::accept (this=0x558db1746b60, _visitor=...) at /home/bhargava/work/github/solidity/libsolidity/ast/AST_accept.h:339
</code></pre></div></div>
<p>Evidently, as the Solidity contract’s AST is being built up, and while a function declaration is being visited and its parameters resolved, the compiler complains that a member of the referenced struct is not typed.</p>
<p>I expected the compiler to throw up an error that the type of member <code class="language-plaintext highlighter-rouge">y</code> of struct <code class="language-plaintext highlighter-rouge">Nested</code> is undefined.
Seemingly, this is not happening.
However, if I modify the buggy contract like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library L{struct Nested{n y;}function(function()external){}}
</code></pre></div></div>
<p>The compiler correctly throws up an error that the user-defined type <code class="language-plaintext highlighter-rouge">n</code> is undefined.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ solc mod_contract.sol
Warning: This is a pre-release compiler version, please do not use it in production.
../../bugs/issue_5340_min.sol:1:25: Error: Identifier not found or not unique.
library L{struct Nested{n y;}function(function()external){}}
</code></pre></div></div>
<p>I have a feeling that there is some lazy type resolution going on that results in a run-time error for what should be a compile-time error.</p>
<h2 id="next-steps">Next Steps</h2>
<p>It’s very cool that the Solidity compiler team is using fuzzing as part of their SDLC to catch bugs like this.
So far, most of the bugs found point to deficiencies in typing rules for Solidity.
Although this is a good first step, it won’t find bugs in the more critical compiler back-end component that is responsible for generating EVM code.
A bug in the back-end that generates incorrect EVM code is a lot more interesting from a security perspective.</p>
<p>The main drawback of the current test harness is speed.
This could be addressed by targeted fuzz testing of specific portions of the compiler rather than the entire compiler in one test.
This is akin to fuzzing unit tests.</p>
<p>Finally, Kostya’s call for structure-aware fuzzing mutators is something that should be heeded in the Solidity space as well.
There has been some work on this front in the <a href="https://github.com/ethereum/solidity/issues/1172">Solidity community</a>.
It’d be cool to use this infra to fuzz Solidity.</p>
<p>In summary</p>
<ul>
<li>fuzz specific security-critical components</li>
<li>break fuzz tests down to smaller units</li>
<li>use custom fuzz mutators</li>
</ul>
<p>That’s all folks!</p>IntroCan Good-Turing Frequency Estimation Tell Us When to Stop Fuzzing?2018-10-08T00:00:00+00:002018-10-08T00:00:00+00:00/2018/10/08/good-turing-fuzzing<script type="text/javascript" src="https://cdn.rawgit.com/mathjax/MathJax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<p><strong>tl;dr: Depends, but I’m sceptical atm :-)</strong></p>
<p>In this post, I will try to examine the utility of the <a href="https://en.wikipedia.org/wiki/Good–Turing_frequency_estimation">Good-Turing frequency estimation</a> for fuzz testing.
I focus on the following question that is of practical importance for practitioners: When to stop fuzz testing?</p>
<h2 id="intro">Intro</h2>
<p>This <a href="https://arxiv.org/pdf/1807.10255.pdf">paper</a> talks highly of the utility of the Good-Turing frequency estimation for fuzz testing.
It makes some very cool arguments why it makes sense to apply GT to fuzzing, I enjoyed reading it!
Here’s the setting examined by that paper.
Fuzz testing involves decision making in the face of uncertainty.
For example, practitioners would often like to know when to stop fuzzing, because who knows? A new crash may be found if only the fuzzer were left running for an additional hour/day/week etc.</p>
<p>In theoretical terms, what we would like to know at regular fuzzing intervals is the following: What is the probability of finding something new, should fuzzing continue?
Surprisingly, this is exactly what I.J. Good tried to understand (in a different setting of course) in the early 50s.</p>
<p>Of course, your definition of a non-trivial probability is likely different from mine.
The idea is to define a parameter, say \(\alpha{}\), and stop fuzzing when the probability of finding something new is less than the parameter \(\alpha{}\).
I admit this is a very specific (and likely limited) way to apply the GT estimate to fuzzing, so take the following arguments with spoonfuls of salt.</p>
<h2 id="prelims">Prelims</h2>
<p>We need to set up our theoretical model of fuzzing that is suited to the Good-Turing formula.
So, let’s begin with the following assumptions:</p>
<ul>
<li>A species is defined as some discretized program behavior
<ul>
<li>We need some way to characterize distinct species</li>
</ul>
</li>
<li>A test input can belong to one and only one species
<ul>
<li>Of course, multiple test inputs can belong to the same species, but the other way round is not possible</li>
</ul>
</li>
</ul>
<h3 id="discretizing-program-behavior">Discretizing program behavior</h3>
<p>afl-fuzz computes the hash of the coverage bit map to discretize program behavior.
Each byte in the coverage bitmap corresponds to some branch executed in the program.
So it discretizes program behavior like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// trace_bits is the state of the coverage bitmap
// after an input is executed
exec_cksum = hash32(trace_bits, MAP_SIZE, HASH_CONST);
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">hash32</code> is a 32-bit hash of its input (<code class="language-plaintext highlighter-rouge">trace_bits</code> of length <code class="language-plaintext highlighter-rouge">MAP_SIZE</code>; salt is some constant <code class="language-plaintext highlighter-rouge">HASH_CONST</code>).</p>
<p>First things first.
<code class="language-plaintext highlighter-rouge">exec_cksum</code> is imprecise: program behavior is more complex than what <code class="language-plaintext highlighter-rouge">exec_cksum</code> portrays it to be.
For example, two inputs can have the same <code class="language-plaintext highlighter-rouge">exec_cksum</code> but trigger two different execution paths \(p_{1}\) and \(p_{2}\).
But, <code class="language-plaintext highlighter-rouge">exec_cksum</code> is efficient to compute and takes modest memory.
Therefore, it is an <strong>acceptable</strong> trade-off between precision of program behavior discretization and performance.</p>
<p>A minor digression to understand how libFuzzer discretizes program behavior.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// kNumPCs is roughly 2.1 million
uintptr_t __sancov_trace_pc_pcs[fuzzer::TracePC::kNumPCs];
uint8_t __sancov_trace_pc_guard_8bit_counters[fuzzer::TracePC::kNumPCs];
</code></pre></div></div>
<p>There are two arrays</p>
<ul>
<li>An array of program counters (branch call sites) seen during fuzzing</li>
<li>An array of counters for these program counters
<ul>
<li>This is used to count how often a branch is hit</li>
</ul>
</li>
</ul>
<p>In addition, there is something that libFuzzer creates called a feature.
My understanding is that a feature maps to an index of <code class="language-plaintext highlighter-rouge">__sancov_trace_pc_pcs.</code>
So, each branch in the fuzzed program is a feature.
Sadly, unlike afl-fuzz, libFuzzer does not keep track of a checksum of features for a fuzzed input; something akin to afl-fuzz’s <code class="language-plaintext highlighter-rouge">exec_cksum.</code>
This means that one would need to add (hashing) code to do this in libFuzzer.</p>
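<p>Conceptually, the missing piece is tiny. Here is a sketch in Python of the kind of checksum one could compute over the counter array (afl-fuzz uses its own hash32; CRC32 here is purely for illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import zlib

def exec_cksum(counters):
    # Collapse a coverage counter array (bytes) into one 32-bit checksum,
    # akin to afl-fuzz's hash32(trace_bits, MAP_SIZE, HASH_CONST)
    return zlib.crc32(bytes(counters)) & 0xffffffff

# Two inputs with identical counter arrays map to the same species
print(exec_cksum([0, 1, 0, 3]))
</code></pre></div></div>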
<h3 id="lifting-good-turing-for-fuzzing">Lifting Good-Turing for Fuzzing</h3>
<p>Suppose we have the set of all possible program behaviors (<code class="language-plaintext highlighter-rouge">exec_cksum</code>)</p>
\[P = \{p_{1},p_{2},...,p_{M}\}\]
<p>where \(p_{k}\) is a program path and \(M\) is the total number of feasible program behaviors.</p>
<p>We also have a sequence E of N program behaviors corresponding to as many independently chosen inputs in the fuzzing corpus.</p>
\[E = \{e_{1},e_{2},...,e_{N}\}, e_{k} \in{} P\]
<p>We want to estimate \(\theta{}[j]\), the probability that a future sample will be \(p_{j}\).
Now, we define the set of frequencies of program behaviours observed thus far.</p>
\[F = \{f_{1},f_{2},...,f_{M}\}\]
<p>where \(f_{k}\) is the number of times behavior \(p_{k}\) has been observed. Assuming, without loss of generality, that the \(n\) distinct observed behaviors are indexed first, the frequency of the unobserved behaviors is zero.</p>
\[f_{i} = 0, n+1 \leq i \leq M\]
<p>The relative frequency estimate for \(p_{j}\) is \(f_{j}/N\).
This estimate is inaccurate for small counts.
For example, if \(f_{j}=0\), our estimate is essentially saying “you can’t expect to see what you have not seen” which can be grossly inaccurate.</p>
<p>Before we proceed, we make the following assumption.</p>
\[f_{j} = f_{k} \implies{} \theta{}[j] = \theta{}[k]\]
<p>In other words, if two program behaviors appear with the same frequency in our present fuzzing corpus, then the probability of their future occurrence is the same.
We can weaken this assumption later, but let’s stick to this simple case in this post.</p>
<p>With this assumption, we introduce more notation.
Let \(\theta{}(r)\) be the probability of a behavior occurring given that it appeared \(r\) times in \(E\).</p>
\[g_{r} = |\{p_{j} : f_{j} = r\}|\]
\[G = \{g_{0},g_{1},...,g_{R_{max}}\}\]
<p>where \(R_{max} = \max(F)\).</p>
<p>In other words, while the set \(F\) computes the frequency of observed program behaviors, the set \(G\) computes the frequency of frequencies of observed behaviors.
Moreover, \(R_{max}\) is the highest frequency of observed program behaviors.
It follows that</p>
\[N = \sum_{r} rg_{r}\]
<p>where N (as we had denoted for the sequence E) is the total number of observations.
N, as it turns out, is also the amount of fuzz i.e., the total number of test inputs generated by fuzzing thus far.</p>
<p>Against this backdrop, we introduce the Good-Turing estimate \(\hat{\theta{}}(r)\) for \(\theta{}(r)\).</p>
\[\hat{\theta{}}(r) = (1/N)*(r+1)*(g_{r+1}/g_{r})\]
<p>This estimate tells us, for instance, that the probability of observing as yet unseen behaviors in the future (\(g_{0}\)) is:</p>
\[\hat{\theta{}}(0) = (1/N)*(g_{1}/g_{0})\]
<p>That is to say, this probability is greater than \((1/N)\) for positive \(g_{1}\) when \(g_{1} \gt{} g_{0}\).
When N=1 (after one program behavior has been observed), this probability is \(1/(M-1)\) which can be grossly inaccurate.
But the hope is, as N grows, this estimate converges on the actual probability.</p>
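<p>Before applying the estimate, here is a minimal Python sketch of the bookkeeping above. The toy corpus is made up; in practice the observations would be, say, afl-fuzz’s <code class="language-plaintext highlighter-rouge">exec_cksum</code> values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
from collections import Counter

def good_turing(observations, M):
    # Return theta(r): the Good-Turing probability that a future sample
    # belongs to a *given* species observed r times so far
    N = len(observations)
    f = Counter(observations)        # f_j: frequency of each observed behavior
    g = Counter(f.values())          # g_r: frequency of frequencies
    g[0] = M - len(f)                # unobserved behaviors have frequency 0
    return lambda r: (r + 1) * g.get(r + 1, 0) / (N * g[r])

# Toy corpus: 10000 fuzzed inputs mapped onto a small behavior space
random.seed(0)
checksums = [random.randrange(5000) for _ in range(10000)]
theta = good_turing(checksums, M=2**32)
print(theta(0))  # vanishingly small: dominated by M, as argued below
</code></pre></div></div>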
<h2 id="applying-good-turing-estimate-to-fuzzing">Applying Good-Turing Estimate to Fuzzing</h2>
<p>One way in which the Good-Turing estimate is useful is in deciding when to stop fuzz testing.
We stop fuzzing when \(\hat{\theta{}}(0)\) is lower than some pre-defined threshold \(\alpha{}\).
Even before I go ahead and implement this estimate inside, say afl-fuzz, I see three potential problems:</p>
<ul>
<li>Q1: What is a good value of \(\alpha{}\)?
<ul>
<li>It is likely different for different targets</li>
</ul>
</li>
<li>Q2: How to deal with noise in \(\hat{\theta{}}(0)\)?
<ul>
<li>Note that \(g_{1}\) may fluctuate to varying extents which in turn influences the value of \(\hat{\theta{}}(0)\)</li>
<li>For example, at some point \(t=t_{k}\) the estimate may go below \(\alpha{}\) only to increase in value thereafter</li>
</ul>
</li>
<li>Q3: How to compute \(g_{0}\)?
<ul>
<li>\(g_{0}\) depends on \(M\), the total number of feasible program behaviors that we can only estimate</li>
<li>If a 32-bit <code class="language-plaintext highlighter-rouge">exec_cksum</code> is used to discretize program behavior (as in afl-fuzz), \(M \approx{} 4.3\) billion.</li>
</ul>
</li>
</ul>
<p>At least, I am sceptical that the Good-Turing estimate can be mechanically relied upon to stop fuzzing.
A lot depends on the answers to the three questions above, and likely more.
Take the issue of computing \(g_{0}\) for instance.
If a program contains even 32 independent branches, it can have on the order of \(2^{32}\) (roughly 4.3 billion) paths.
Therefore, a 32-bit <code class="language-plaintext highlighter-rouge">exec_cksum</code> falls short of uniquely identifying program paths.</p>
<p>Even if we were to assume that <code class="language-plaintext highlighter-rouge">exec_cksum</code> is a fair performance-accuracy trade-off, \(M\) is going to dominate the computation of \(\hat{\theta{}}(0)\).
My intuition is that \(g_{0}\) (the number of unobserved program paths: \(=M - k\) where \(k\) is the total number of paths discovered thus far) is always going to be very close to \(M\).
In my experience, the total paths found by afl-fuzz is of the order of a few thousand for real-world targets and \(M\) is at least 4.3 billion.
Therefore, we can approximate the estimate to be like so</p>
\[\hat{\theta{}}(0) = (1/N)*(g_{1}/M) = g_{1}/(N*M)\]
<p>Since \(N\) is the amount of fuzz (how many inputs have been generated by fuzzing), it increases monotonically.
Thus, the denominator of the above equation is always increasing.
\(g_{1}\) (number of program behaviors observed exactly once thus far) is likely going to go down as we continue fuzzing.
This is going to give us insanely low probabilities to begin with.
Say we start computing the estimate at some point \(t1\) until when 2000 singleton (seen exactly once) behaviors have been observed and 10000 inputs generated by the fuzzer. We have:</p>
\[\hat{\theta{}_{t1}}(0) = 2000/(4300000000*10000) = 4.65e-11\]
<p>And let’s say, at a subsequent time instance \(t2\), we have 1000 singletons and 20000 inputs generated:</p>
\[\hat{\theta{}_{t2}}(0) = 1000/(4300000000*20000) = 1.16e-11\]
<p>Although these probabilities are relatively very different (e.g., it is four times less likely to find something new at \(t2\) than at \(t1\)), they are too small to be practically useful.
At least, these are my first impressions about the utility of GT estimate for one aspect of fuzzing.
Hit me up on Twitter (<a href="https://www.twitter.com/ibags">@ibags</a>) if you think my argument is flawed or I’m talking BS; I’m curious to hear from other security practitioners what they think.</p>
<p>Anyway, that’s all for now folks.
I’ll post a follow-up when I have some empirical evidence from real-world targets.
Watch this space!</p>
<h4 id="updates">Updates</h4>
<p>2018-12-10:</p>
<p>Another way to think of the extremely low estimates for discovering new paths is to say</p>
\[N_{z} = 1/\hat{\theta{}}(0)\]
<p>where \(N_{z}\) is the expected number of additional fuzz required to uncover a new path.</p>
<p>So, what a \(\hat{\theta{}}(0) = 1.16e-11\) is saying is that you need to run the fuzzer for an additional \(N_{z} \approx{} 86.2\) billion executions until you find a new path.
Assuming that the average execution speed of the fuzzer is \(1000\) executions per second, this translates to keeping the fuzzer running for close to 3 years on a single core!
This is grossly inaccurate and of little practical utility.
Evidently, we need estimates that are tailored for exponential spaces, which I feel Good-Turing is not.</p>
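<p>Spelled out as arithmetic, as a quick sanity check of the numbers above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>theta0 = 1.16e-11
additional_execs = 1 / theta0              # ~8.6e10 executions
execs_per_sec = 1000                       # assumed fuzzer throughput
years = additional_execs / execs_per_sec / (3600 * 24 * 365)
print(additional_execs, years)             # ~86.2 billion, ~2.7 years
</code></pre></div></div>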
<p>2018-3-11:</p>
<p>Thanks to Marcel Böhme for pointing out errors in the first version of the post</p>Statistical Evaluation of a Fuzzing Dictionary2018-10-01T00:00:00+00:002018-10-01T00:00:00+00:00/2018/10/01/Evaluating-Dictionary-For-Fuzzing<h2 id="intro">Intro</h2>
<p>Fuzz testing involves several configuration parameters: seeds, dictionary, fuzz scheduling (what to fuzz), fuzz duration (how long to fuzz something), fuzz mutation (how to fuzz), fuzz sites (what portions of input to fuzz) etc.
This post attempts to statistically evaluate the effect of one fuzzing parameter: dictionary.
The purpose of this post is to understand if the use of a dictionary for a very specific fuzzing target (a parser) leads to significantly better outcomes, statistically speaking.
The fuzzing target that this post focuses on is not really relevant, so I won’t name it.
It suffices to say that this target is a run-of-the-mill parser that parses string input.</p>
<p>We have made the argument before that the use of <a href="https://link.springer.com/chapter/10.1007/978-3-319-66332-6_2">dictionaries makes security testing of network parsers more effective</a>.
However, a recent paper called <a href="https://arxiv.org/pdf/1808.09700.pdf">“Evaluating Fuzz Testing”</a> has good recommendations for basing such judgements on basic statistical tests rather than, say, a visual inspection of the measured distributions.
There are two statistical tests that are recommended in the fuzzing evaluation paper.
One is a significance test and the other an effect-size test.</p>
<h3 id="significance-test">Significance test</h3>
<p>Firstly, it is recommended that researchers perform a significance test (e.g., Mann-Whitney U test) in order to decide if their fuzzing optimization brings about statistically significant change in some performance metric.
For people unfamiliar with even basic statistics, like me, the Mann-Whitney U test is used to—quoting the <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">wiki page on the topic</a>—“determine whether two independent samples were selected from populations having the same distribution.”</p>
<p>My understanding of this test applied to fuzzing evaluations is as follows.
Consider you propose a cool tweak to afl-fuzz that you believe will bring about an improvement in fuzz testing.
For simplicity, let’s assume that the only metric you are interested in improving is “fuzzing coverage” per unit time: Lines of code that are hit by the fuzzer in some unit time (say 1 minute).
So, you want to check if your tweak actually performs better than the baseline on this metric.</p>
<p>In order to convince a scientific audience that your tweak indeed brings about a positive improvement, you need to do the following before proceeding further:</p>
<ul>
<li>Run the baseline fuzzer (that does not contain your tweak) “N” times (greater the value of N, the better), measuring and noting the value of the metric of interest (coverage/unit time) in each run
<ul>
<li>You will end up with an array of measurements like so: B = [b_1, b_2,…, b_N]</li>
</ul>
</li>
<li>Run the tweaked fuzzer “N” times, and as before, measuring and noting the value of the metric of interest (coverage/unit time) in each run
<ul>
<li>You will end up with an array of measurements like so: T = [t_1, t_2,…,t_N]</li>
</ul>
</li>
<li>Compute the Mann Whitney U test p-value for the arrays <code class="language-plaintext highlighter-rouge">B</code> and <code class="language-plaintext highlighter-rouge">T</code>
<ul>
<li>This can tell you if the performance numbers for the tweak show statistically significant divergence from the performance numbers for the baseline</li>
</ul>
</li>
</ul>
<p>Now, you have two “populations” (arrays, <code class="language-plaintext highlighter-rouge">B</code> and <code class="language-plaintext highlighter-rouge">T</code>) of independent samples (independent because each run is independent of the other) of coverage numbers.
We do not know the distribution of either population; actually this is not important to us.
What we are interested in is checking whether the distributions differ.
Specifically, we assume that it is equally likely that a randomly selected value from one population is less than or greater than a randomly selected value from the other population; this is called the null hypothesis.
We are interested in proving or disproving the null hypothesis.
Getting back to the topic of fuzzing evaluations, we are interested in <strong>disproving</strong> the null hypothesis that the performance measurements for the baseline and tweak have the same distribution, because if they do, the tweak did not do anything particularly interesting.</p>
<p>The <a href="https://en.wikipedia.org/wiki/P-value">p-value</a> computation is a standard way of quantitatively checking the validity of the null hypothesis.
A p-value is essentially the probability of observing a difference at least this large if the null hypothesis were true; the lower the p-value, the greater the assurance that we have correctly concluded that our tweak is indeed different from the baseline.
Traditionally, p-values of under <code class="language-plaintext highlighter-rouge">0.05</code> are considered good enough to show a statistically significant difference between two populations.
The value of <code class="language-plaintext highlighter-rouge">0.05</code> is called the level of significance: One can choose a lower level of significance (say <code class="language-plaintext highlighter-rouge">0.001</code>) if one wants to be damn sure about the difference in populations.</p>
<p>Fortunately, there is a ready-made python function called <code class="language-plaintext highlighter-rouge">mannwhitneyu</code> in the <code class="language-plaintext highlighter-rouge">scipy.stats</code> module that outputs the p-value for two lists of numbers.
So, all you need to do is write a simple python script like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import mannwhitneyu
# Read in baseline performance scores into array
B = [b_1,...,b_N]
# Read in performance scores for tweak into another array
T = [t_1,...,t_N]
print(mannwhitneyu(B,T))
</code></pre></div></div>
<p>Then you see output like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MannwhitneyuResult(statistic=682.5, pvalue=2.582424268793943e-26)
</code></pre></div></div>
<p>This tells you that the p-value is <code class="language-plaintext highlighter-rouge">2.58e-26</code> or <code class="language-plaintext highlighter-rouge">2.58*10^-26</code>.
This number is a lot smaller than <code class="language-plaintext highlighter-rouge">0.05</code> so we conclude that the performance numbers corresponding to the tweak are indeed (statistically significantly) different than performance numbers corresponding to the baseline.</p>
<p>Although p-values of under <code class="language-plaintext highlighter-rouge">0.05</code> show that the compared populations are significantly different, it does not tell us what the quantum of this difference is.
In an extreme case, the tweak may result in a minuscule improvement (e.g., it covers 2 more lines of code than the baseline) with a very low p-value (e.g., <code class="language-plaintext highlighter-rouge">2.58e-26</code>).
So although you convince people that your tweak brings about a certain improvement, the quantum of this improvement is too little to be considered scientifically interesting.</p>
<p>In other words, low p-values are necessary but not sufficient for our evaluation.
p-values say nothing about the extent of divergence, also known as the effect size.
This brings me to the second test recommended in the fuzzing evaluation paper.</p>
<h3 id="vargha-delaneys-a-measure">Vargha Delaney’s A measure</h3>
<p>The VDA measure can be used to gauge the extent of divergence between two populations.
Essentially, the VDA measure outputs the probability <code class="language-plaintext highlighter-rouge">p</code> that a randomly drawn sample from one population is greater than a randomly drawn sample from the other, computed from pair-wise ordinal relationships (<code class="language-plaintext highlighter-rouge"><</code> or <code class="language-plaintext highlighter-rouge">=</code>) between samples in the two populations.
A probability <code class="language-plaintext highlighter-rouge">p</code> equal to <code class="language-plaintext highlighter-rouge">0.5</code> (half) indicates that the two populations are statistically indistinguishable (no change).
The following values of <code class="language-plaintext highlighter-rouge">p</code> are conventionally accepted as indicating change:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">p>0.56</code> Small change</li>
<li><code class="language-plaintext highlighter-rouge">p>0.64</code> Medium change</li>
<li><code class="language-plaintext highlighter-rouge">p>0.71</code> Big change</li>
</ul>
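<p>For intuition, here is a minimal self-contained sketch of the pair-wise counting behind the A measure (for the actual evaluation below I used Tim Menzies’ implementation; the sample numbers here are made up to mimic the coverage figures later in this post):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def a12(xs, ys):
    # Probability that a random sample from xs is greater than a
    # random sample from ys; ties count as half
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

print(a12([1601, 1719, 1502], [1488, 1427, 1591]))  # ~0.89: a big effect
</code></pre></div></div>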
<p>Essentially, if at least 71% of pair-wise comparisons (21 percentage points above the 50% expected by chance) show a greater value for one population, that population is considered to diverge in a <strong>big</strong> way from the other.
Tim Menzies has <a href="https://gist.github.com/timm/5630491">published python code to compute VDA measure</a>, thanks Tim.
So, all you need to do to compute the VDA measure is the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Fetch module from Tim Menzies' gist linked above
from a12 import *
## Create a labeled array
B_norm = ["baseline"]
## Append B values from baseline measurements
B_norm.extend(B)
## Likewise for tweak measurements
T_norm = ["tweak"]
T_norm.extend(T)
## Create consolidated list
C = [B_norm, T_norm]
for rx in a12s(C,rev=True,enough=0.71): print(rx)
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">enough</code> parameter is essentially the effect-size threshold of your choice. For the listing above, I have used the conventional big threshold i.e., <code class="language-plaintext highlighter-rouge">p>0.71</code>.
The python code above should output something like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank #1 tweak at <T_cov>
rank #2 baseline at <B_cov>
</code></pre></div></div>
<p>where populations are sorted in descending order (i.e., highest coverage on top) and <code class="language-plaintext highlighter-rouge">T_cov</code> and <code class="language-plaintext highlighter-rouge">B_cov</code> are means of the tweak and baseline populations.
We interpret this result as follows: There exists a big change between tweak and baseline because a lot of samples from the tweaked population show better performance (say, coverage numbers) compared to the baseline samples.
In summary, if the p-value for the measurement values corresponding to your tweak is <code class="language-plaintext highlighter-rouge"><0.05</code> and the comparison shows a big VDA measure, then your tweak is indeed pretty cool!
Next, I describe in what context I applied this knowledge.</p>
<h2 id="context">Context</h2>
<p>I was going to submit a PR to oss-fuzz to integrate a new fuzzing target.
Such a PR typically contains configuration for the fuzzing engines that Google uses (afl-fuzz and libFuzzer) apart from the test case itself.
One such configuration parameter is a dictionary file that contains line-separated tokens of interest that are enclosed within double quotes (see my <a href="https://bshastry.github.io/2017/08/03/Inferring-Program-Input-Format.html">post on inferring program input format</a> for more details about this).
Naturally, I was interested in knowing if the dictionary that I was including in the PR is actually useful.</p>
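<p>For reference, here is what such a dictionary file looks like; the format below is the one shared by afl-fuzz and libFuzzer, but the tokens themselves are made up since the target is unnamed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># one (optionally named) double-quoted token per line
kw1="GET"
kw2="Content-Length:"
crlf="\x0d\x0a"
</code></pre></div></div>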
<p>Before I set about evaluating the usefulness of a dictionary for this specific target, I built a few simple dictionaries using tools that I had developed: mostly a clang front-end tool called clang-sdict, which performs a front-end pass over source code, collecting constant string tokens used in potentially data-dependent control flow.
You can find a primitive implementation of clang-sdict <a href="https://github.com/test-pipeline/clang-ginfer/blob/master/ClangStringDict.cpp">here</a>.</p>
<p>Before finalizing on a dictionary, I wanted to experiment with a few variations and see how they fare.
The nice thing about clang-sdict is that it permits several customizations: Prominently, one can tune it to focus on specific coding patterns.
For example, one can add specific parsing functions (by name) and the tool extracts tokens accepted by that function.
I went ahead and created three different dictionaries each with a slightly different set of string tokens.
Let’s call these dictionaries “dict A”, “dict B”, and “dict C.”
When the fuzzer is supplied such a dictionary, it chooses one string at random and uses it in a fuzzing mutation: say, it overwrites a byte sequence with this string.</p>
<h2 id="evaluation">Evaluation</h2>
<p>Now that I had these three dictionaries, I set about evaluating their “effectiveness” and “size of effect” using Mann-Whitney U Test and Vargha Delaney’s A measure.
To recap, these tests answer the following two questions (in that order): (1) Does using a dictionary bring about noticeable gains in the outcome of fuzzing? and (2) How much of an effect do dictionaries have on the said outcome?</p>
<p>Of course, we need to fix metrics before we use these statistical tests.
The metric I chose for this post is code coverage achieved by a fuzzing session: libFuzzer (one of the fuzzing engines behind oss-fuzz) prints <a href="https://clang.llvm.org/docs/SanitizerCoverage.html#id2">the number of CFG edges covered during fuzzing</a>.
More edges covered is better than fewer (more is better).</p>
<p>Before I present evaluation methodology and results, some meta data about the dictionary candidates.</p>
<table class="table table-striped">
<thead>
<tr>
<th>Dict</th>
<th style="text-align: right">Num. tokens</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td style="text-align: right">0</td>
</tr>
<tr>
<td>Dict A</td>
<td style="text-align: right">120</td>
</tr>
<tr>
<td>Dict B</td>
<td style="text-align: right">222</td>
</tr>
<tr>
<td>Dict C</td>
<td style="text-align: right">388</td>
</tr>
</tbody>
</table>
<p>Dict A has the fewest tokens, followed by Dict B, and Dict C.</p>
<h3 id="evaluation-methodology">Evaluation Methodology</h3>
<p>The methodology centers around the following broad set of requirements with design choices shown in braces.</p>
<ul>
<li>Each variant should be run several times (<strong>100 runs chosen</strong>)</li>
<li>Each variant should be run for the same fixed duration (<strong>5 minutes chosen</strong>)</li>
<li>Reasonable metric for comparison must be used (<strong>Program edge coverage chosen</strong>)</li>
</ul>
<p>Therefore our experiment must do the following:</p>
<ul>
<li>Run the baseline (no dictionary), Dict A (exp 1), Dict B (exp 2), Dict (exp 3) a total of 100 times each with 5 minutes per fuzzing session</li>
<li>Log the total coverage achieved in this fuzzing session</li>
</ul>
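<p>A sketch of such a driver in Python, assuming a libFuzzer binary (the binary and dictionary file names are placeholders; <code class="language-plaintext highlighter-rouge">-dict</code> and <code class="language-plaintext highlighter-rouge">-max_total_time</code> are real libFuzzer flags, and the final coverage figure is scraped from its log):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
import subprocess

# Run each configuration 100 times for 5 minutes (300s) each
CONFIGS = {'baseline': [],
           'exp1': ['-dict=dictA.dict'],
           'exp2': ['-dict=dictB.dict'],
           'exp3': ['-dict=dictC.dict']}

results = {name: [] for name in CONFIGS}
for name, extra in CONFIGS.items():
    for run in range(100):
        proc = subprocess.run(['./fuzzer', '-max_total_time=300'] + extra,
                              capture_output=True, text=True)
        # libFuzzer reports edge coverage as "cov: N" on stderr
        cov = max((int(m) for m in re.findall(r'cov: (\d+)', proc.stderr)),
                  default=0)
        results[name].append(cov)
</code></pre></div></div>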
<p>Once we do this, we end up with a 2D array like so (numbers are hypothetical):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>baseline = [b_1,b_2,b_3,...,b_100]
exp1 = [e1_1,e1_2,e1_3,...,e1_100]
exp2 = [e2_1,e2_2,e2_3,...,e2_100]
exp3 = [e3_1,e3_2,e3_3,...,e3_100]
</code></pre></div></div>
<p>Okay, so let’s make a box-plot of them and see what they look like: Remember more edges covered, the better is the fuzzing outcome.</p>
<p><img src="/assets/img/Coverage_box_plots.png" alt="Fig. 1: Box plots showing the number of PCs covered across 100 independent runs each for baseline, and Dict A/B/C" class="img-responsive" /></p>
<p>Y-axis is the number of CFG edges covered; X-axis is the fuzzing configuration whose coverage distribution is presented as a box plot.
Okay, it (visually) appears that “Dict A” is best of all in terms of median value (the orange line that strikes through the boxes is the median of that sample set) and quartile distribution.
Some more basic statistics for the test coverage populations follow.</p>
<table class="table table-striped">
<thead>
<tr>
<th>Name</th>
<th>Mean</th>
<th>Variance</th>
<th>Min</th>
<th style="text-align: right">Max</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>1488.3</td>
<td>1918.5</td>
<td>1427</td>
<td style="text-align: right">1591</td>
</tr>
<tr>
<td>Dict A</td>
<td>1601.2</td>
<td>3157.2</td>
<td>1502</td>
<td style="text-align: right">1719</td>
</tr>
<tr>
<td>Dict B</td>
<td>1579.2</td>
<td>2775.4</td>
<td>1497</td>
<td style="text-align: right">1693</td>
</tr>
<tr>
<td>Dict C</td>
<td>1572.4</td>
<td>2374.7</td>
<td>1500</td>
<td style="text-align: right">1675</td>
</tr>
</tbody>
</table>
<p>Although it appears that Dict A has the highest mean (and is hence the best), its high variance is one ground for being suspicious about the claim that “it is the best.”
This is precisely where significance tests enter the picture.</p>
<h3 id="mann-whitney-u-test">Mann Whitney U Test</h3>
<p>We can check the “soundness” of the hypothesis “Dict A is different” by performing a Mann-Whitney U test on our data set.
Here’s a gist of my evaluation python script: Nothing fancy, reading coverage numbers from a log file and using the <code class="language-plaintext highlighter-rouge">mannwhitneyu</code> function from the <code class="language-plaintext highlighter-rouge">scipy.stats</code> python module on the sets of acquired coverage numbers.</p>
<script src="https://gist.github.com/df0f07dc0d3f5cac48e9dc9affe20d0f.js"> </script>
<p>The p-values between different sets of evaluations are shown in the table below.
The table is to be read as (p-value between row label vs. column label); 1e-2 is to be read as 1x10^-2 or 0.01.
Since the Mann Whitney p-values for the tuples (A,B) and (B,A) (where A,B are two non-identical sets of numbers) are the same, and the p-value of (A,A) does not make any sense, these fields in the table have been denoted as <code class="language-plaintext highlighter-rouge">N.A.</code>, short for not applicable.
A p-value of under <code class="language-plaintext highlighter-rouge">0.05</code> (i.e., <code class="language-plaintext highlighter-rouge">< 5e-2</code>) means that there is a significant difference between the distributions of the two sets of numbers.</p>
<table class="table table-striped">
<thead>
<tr>
<th>Name vs.</th>
<th>Baseline</th>
<th>Dict A</th>
<th>Dict B</th>
<th style="text-align: right">Dict C</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>N.A.</td>
<td>N.A.</td>
<td>N.A.</td>
<td style="text-align: right">N.A.</td>
</tr>
<tr>
<td>Dict A</td>
<td>2.58e-26</td>
<td>N.A.</td>
<td>2.22e-3</td>
<td style="text-align: right">6.96e-5</td>
</tr>
<tr>
<td>Dict B</td>
<td>5.72e-23</td>
<td>N.A.</td>
<td>N.A.</td>
<td style="text-align: right">19.4e-2</td>
</tr>
<tr>
<td>Dict C</td>
<td>5.61e-22</td>
<td>N.A.</td>
<td>N.A.</td>
<td style="text-align: right">N.A.</td>
</tr>
</tbody>
</table>
<p>From these numbers, we can create the following “significance” table (to be read as do (row,column) populations differ significantly):</p>
<table class="table table-striped">
<thead>
<tr>
<th>Name vs.</th>
<th>Baseline</th>
<th>Dict A</th>
<th>Dict B</th>
<th style="text-align: right">Dict C</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dict A</td>
<td><strong>Yes</strong></td>
<td>N.A.</td>
<td><strong>Yes</strong></td>
<td style="text-align: right"><strong>Yes</strong></td>
</tr>
<tr>
<td>Dict B</td>
<td><strong>Yes</strong></td>
<td><strong>No</strong></td>
<td>N.A.</td>
<td style="text-align: right"><strong>No</strong></td>
</tr>
<tr>
<td>Dict C</td>
<td><strong>Yes</strong></td>
<td><strong>No</strong></td>
<td><strong>No</strong></td>
<td style="text-align: right">N.A.</td>
</tr>
</tbody>
</table>
<p>This table tells us that</p>
<ul>
<li>All “Dict” populations are significantly different than the baseline AND</li>
<li>Dict A population is significantly different than the rest</li>
</ul>
<p>In some ways this is a counter-intuitive result because I would have expected more tokens (in Dict B and Dict C) to result in a significant change in the outcome.
It turns out it is more important to have a small set of correct tokens than a larger set: more tokens in a dictionary is not necessarily a good thing.</p>
<p>Bear in mind that all runs were performed for 5 minutes only, results may/will change for longer fuzzing durations.
My original motivation in choosing a 5-minute fuzzing window was to get a quick understanding of the effectiveness of each of the dictionaries before sending out the PR.
Having said that, given enough time and resources, we can perform the same tests after a longer time interval (say 1 hour of fuzzing) and repeat this analysis.</p>
<h3 id="vargha-delaney-a12-test">Vargha Delaney A12 Test</h3>
<p>Statistical significance cannot be equated to scientific importance.
The latter requires a stricter evaluation of the delta in the metric: how much more improvement in test coverage did the evaluated dictionaries achieve?
We know that the Dict A population not only has the highest mean/median but is also significantly different from the rest; but how much better is it?
The <a href="https://www.jstor.org/stable/1165329">VDA test</a> is useful for answering precisely this question.</p>
<p>Let’s recall that a VDA score between (X,Y) of <code class="language-plaintext highlighter-rouge">>0.56</code> indicates a small change, <code class="language-plaintext highlighter-rouge">>0.64</code> indicates a medium change, and <code class="language-plaintext highlighter-rouge">>0.71</code> indicates a big change.
Using my <a href="https://gist.github.com/bshastry/df0f07dc0d3f5cac48e9dc9affe20d0f">evaluation gist outlined above</a>, I compute the VDA probabilities as follows.
Again, I would like to credit Tim Menzies whose <a href="https://gist.github.com/timm/5630491">VDA implementation</a> was the basis for these computations.
In my script, I use standard effect sizes (small=0.56, medium=0.64, large=0.71) to compute three such rankings.
Here is what I find.</p>
<p>Small effect ranking</p>
<ul>
<li>Rank 1: Dict A</li>
<li>Rank 2: Dict B</li>
<li>Rank 2: Dict C</li>
<li>Rank 3: Baseline</li>
</ul>
<p>In other words, Dict A offers <strong>at least</strong> small improvements in program coverage over Dict B and Dict C, which in turn offer <strong>at least</strong> small improvements in program coverage over the baseline.</p>
<p>Medium effect ranking</p>
<ul>
<li>Rank 1: Dict A</li>
<li>Rank 1: Dict B</li>
<li>Rank 1: Dict C</li>
<li>Rank 2: Baseline</li>
</ul>
<p>In other words, Dict A, Dict B, and Dict C are roughly the same if we require the improvement in test coverage to be at least <strong>medium</strong> (A12 > 0.64). Still, each of these dictionaries offers at least a <strong>medium</strong>-sized improvement in coverage over the baseline, i.e., no dictionary.</p>
<p>Big effect ranking</p>
<ul>
<li>Rank 1: Dict A</li>
<li>Rank 1: Dict B</li>
<li>Rank 1: Dict C</li>
<li>Rank 2: Baseline</li>
</ul>
<p>The medium result holds even for a <strong>big</strong> effect: this means that each of the three dictionaries offers a <strong>big</strong> improvement in coverage compared to the baseline.</p>
<p>From this we can conclude that (1) there is a small delta between Dict A and Dict B/C; and (2) there is a big delta between Dict A/B/C and the baseline (no dictionary).
In a nutshell, the “winner” is Dict A.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I draw the following conclusions from this work:</p>
<ul>
<li>Simple statistical tests provide an understanding of the significance of a change in some fuzzing parameter</li>
<li>For the specific fuzzing target evaluated in this post, dictionaries indeed are very useful</li>
</ul>
<p>Some caveats: (1) the fuzzing window chosen for evaluation was short, and (2) the results focus on the coverage metric and not on, e.g., speed of bug finding.
However, this methodology offers a scientific basis for drawing conclusions, which is pretty cool.
Needless to say, I added Dict A to my PR to oss-fuzz and now I can say that (in a very limited way) my PR is based on scientific evidence ;-)</p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>Thanks to</p>
<ul>
<li>The authors of the “Evaluating Fuzz Testing” paper, check the <a href="https://arxiv.org/pdf/1808.09700.pdf">paper</a> out.</li>
<li>Tim Menzies whose <a href="https://gist.github.com/timm/5630491">A12 implementation</a> I used in this work</li>
<li>My wife, Divya, for teaching me basic stats</li>
</ul>IntroExploring Fuzzer Crashes2017-08-04T00:00:00+00:002017-08-04T00:00:00+00:00/2017/08/04/Exploring-Fuzzer-Crashes<p><a href="/2017/08/02/Diagnosing-Distributed-Vulnerabilities.html">Part 1</a> | <a href="/2017/08/03/Inferring-Program-Input-Format.html">Part 2</a> | <a href="/2017/08/04/Exploring-Fuzzer-Crashes.html">Part 3</a></p>
<h2 id="prologue">Prologue</h2>
<p>This post concludes the three part series on compiler assisted vulnerability diagnosis in open-source C/C++ code. “Compiler assisted” means that the presented techniques pivot around a compiler, and “vulnerability diagnosis” refers to the process of finding and fixing vulnerabilities (software weaknesses that can be used to intentionally cause harm). Software weaknesses (bugs) are a superset of vulnerabilities in that not all weaknesses are harmful from a security perspective. The challenging part of diagnosing vulnerabilities in source code is to arrive at the (usually) small subset of vulnerabilities from the (usually) larger set of bugs and non-bugs (that the source analyzer believes to be real bugs aka false positives).</p>
<h2 id="intro">Intro</h2>
<p>Software testing is arguably the most important process in the quality assurance phase of software development. Bugs found during testing achieve an important objective: helping fix programming errors before a software release. Therefore, bug count is a reasonable metric for assessing the effectiveness of the software testing process. If technique X helps find more bugs than technique Y, the former is said to be more effective.</p>
<p>This post argues that, for practical reasons, fuzz testing alone may be sub-optimal to maximize bug count, and that static analysis can help find bugs in scenarios where fuzzing is not an option.
Here is a non-exhaustive list of scenarios where fuzzing is not straightforward:</p>
<ul>
<li>Crypto code</li>
<li>Stateful application logic in networking stacks</li>
<li>No unit test to test feature X</li>
<li>No fuzzable unit test to test feature X</li>
</ul>
<p>Of course, this does not mean fuzzing in these scenarios is impossible.
It just means that fuzzing them is harder (it requires manual labor) and therefore does not scale out.</p>
<h2 id="static-exploration-of-fuzzer-crashes">Static exploration of fuzzer crashes</h2>
<p>How can we scale bug discovery beyond fuzz testing?
My proposal is to use static analysis in order to automatically explore the findings of a fuzzer.
By “findings of a fuzzer”, I mean fuzzer-discovered program crashes that can be localized (attributed) to a small portion of the program.
By “exploration”, I mean spotting recurrences of the underlying cause of fuzzer-discovered crashes.
This opens up two problems: how to automatically (1) localize fuzzer crashes and (2) explore them statically?
Considering that static analysis over-approximates, a third problem is how to handle false positives.
We shall investigate each problem in the paragraphs that follow.</p>
<h4 id="fault-localization">Fault localization</h4>
<p>In this post, we focus on fault localization in an open-source setting, although fault localization has been <a href="https://dl.acm.org/citation.cfm?id=2519842">shown to be possible in a closed source setting</a>.
So, our fault localization tool should accept source code and a fuzzer corpus (set of test inputs) as input, and produce a set of localized code segments that correspond to each unique fuzzer-discovered crash.
<a href="https://github.com/jfoote/exploitable">Crash de-duplication tools such as exploitable</a> provide us the set of uniquely crashing program inputs.
So, our problem is reduced to that of obtaining localized code segments for each unique crash in the set of deduplicated crashes.</p>
<p>For memory corruption bugs, memory-tracing tools such as AddressSanitizer and Valgrind can greatly assist fault localization.
These tools track the state of memory use at byte granularity, reporting buffer overflows, use-after-free, and other memory-related issues that are endemic to C/C++ applications.
AddressSanitizer even has a structured bug diagnostic report that can be leveraged to programmatically narrow down the lines of code that caused the bug.</p>
<p>Let’s run through a small example here. The code below contains a synthetic buffer overflow that we can spot with the help of ASan:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > example.c
#include <stdio.h>
void vulnerable(int y, char *buf) {
buf[y] = 0;
}
int main(int argc, char *argv[]) {
char buf[256];
size_t x = 0;
scanf("%lu", &x);
vulnerable(x, buf);
return 0;
}
EOF
$ clang -fsanitize=address example.c
$ ./a.out
256
=================================================================
==2290==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffcd43ff9c0 at pc 0x0000004e9be2 bp 0x7ffcd43ff860 sp 0x7ffcd43ff85
8
WRITE of size 1 at 0x7ffcd43ff9c0 thread T0
#0 0x4e9be1 in vulnerable /home/bhargava/work/github/bshastry.github.io/code/example1.c:4:11
#1 0x4e9d71 in main /home/bhargava/work/github/bshastry.github.io/code/example1.c:11:4
#2 0x7f13f2f8682f in __libc_start_main /build/glibc-bfm8X4/glibc-2.23/csu/../csu/libc-start.c:291
#3 0x418538 in _start (/home/bhargava/work/github/bshastry.github.io/a.out+0x418538)
Address 0x7ffcd43ff9c0 is located in stack of thread T0 at offset 288 in frame
#0 0x4e9bff in main /home/bhargava/work/github/bshastry.github.io/code/example1.c:7
This frame has 2 object(s):
[32, 288) 'buf' <== Memory access at offset 288 overflows this variable
[352, 360) 'x'
</code></pre></div></div>
<p>Note that the ASan diagnostic report not only shows the program stack trace at the time the buffer overflow occurred, but also the program variable that overflowed.
Moreover, the formatting of the report is regular enough for us to automatically parse this information.</p>
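<p>As a hypothetical sketch, extracting candidate fault locations from a saved report could look like this (the file name and regular expression are illustrative, not part of any real tool):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# ASan stack frames look like:
#   #0 0x4e9be1 in vulnerable /path/to/example.c:4:11
frame_re = re.compile(r"#\d+ 0x[0-9a-f]+ in (\S+) ([^\s:]+):(\d+)")

with open("asan_report.txt") as f:  # assumed: ASan output saved to a file
    for func, path, line in frame_re.findall(f.read()):
        print(func, path, line)     # candidate (function, file, line) tuples
</code></pre></div></div>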
<p>What if we are dealing with a bug that is not caused by memory corruption, say, an assertion failure?
In the synthetic example below (<code class="language-plaintext highlighter-rouge">abort.c</code>), the program aborts when the parsed input equals the string literal <code class="language-plaintext highlighter-rouge">doom</code>. More realistically, one would be dealing with an assertion failure due to an unexpected program state. Nonetheless, the example is simple enough to demonstrate how we handle non memory corruption bugs. Lines have been numbered so we can speak about execution traces in terms of a set of line numbers. This will be clear shortly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > abort.c
1. #include <string.h>
2. #include <crypt.h>
3. #include <stdlib.h>
4. #include <unistd.h>
5. #define CUSTOM() abort()
6. void fuzzable(const char *input) {
7. // Fuzzer finds this bug
8. if (!strcmp(input, "doom"))
9. abort();
10. }
11.
12. // Fuzzer test harness
13. // INPUT: stdin
14. int main() {
15. char buf[256];
16. memset(buf, 0, 256);
17. read(0, buf, 255);
18. fuzzable(buf);
19. return 0;
20. }
</code></pre></div></div>
<p>Using a coverage tracer such as <a href="http://releases.llvm.org/3.8.1/tools/docs/SanitizerCoverage.html">SanitizerCoverage</a>, we can obtain the execution trace for this program for a given input.
Let’s assume that the fuzzer discovered the program input “doom” that causes the program to abort, immediately after it mutated an input “doo” that it had previously generated.
For the input “doom”, we can see that the following lines are in the execution trace</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang -fsanitize-coverage=bb -fsanitize=undefined -g abort.c
$ perl -e 'print "doom"' | UBSAN_OPTIONS="coverage=1:coverage_direct=1" ./a.out
Aborted (core dumped)
$ sancov.py rawunpack 2900.sancov.raw
$ sancov.py print a.out.2900.sancov | llvm-symbolizer -obj a.out
/usr/local/bin/pysancov: read 8 64-bit PCs from a.out.3150.sancov
/usr/local/bin/pysancov: 1 file merged; 8 PCs total
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:6:0
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:8:7
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:8:7
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:8:7
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:8:7
main
/home/bhargava/work/github/bshastry.github.io/code/abort.c:14:0
main
/home/bhargava/work/github/bshastry.github.io/code/abort.c:16:3
main
/home/bhargava/work/github/bshastry.github.io/code/abort.c:16:3
</code></pre></div></div>
<p>After de-duplicating line numbers, we are left with the following execution trace for the input “doom”: (6,8,14,16).
The trace for the input “doo” is: (6,8,10,14,16).
Note that the coverage tracing tool might have false negatives (executed lines that are not registered), but we can live with that.
If we obtain the set difference between the traces for “doo” and “doom”, we are left with line number 10.
What this tells us is that the function <code class="language-plaintext highlighter-rouge">fuzzable</code> does not return when passed input “doom” but returns when the passed input is “doo”.
From this, we can deduce that the crashing input caused a crash between lines 8 and 10 i.e., line 9.
In doing so, we have localized the failure (somewhat) to lines 8–10.</p>
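<p>The set arithmetic above is trivial to automate; a minimal sketch with the traces from this example hard-coded:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># De-duplicated line traces obtained via SanitizerCoverage, as above.
crash_trace  = {6, 8, 14, 16}      # crashing input "doom"
benign_trace = {6, 8, 10, 14, 16}  # closest benign input "doo"

# Lines reached only by the benign input show where execution diverged:
# line 10 (the function's return) is missing, so the crash lies on 8-10.
print(sorted(benign_trace - crash_trace))  # [10]
</code></pre></div></div>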
<p>What we obtain after fault localization is a set of source code locations (say, a list of file:line tuples) that (most likely) were the root-cause of a program crash.
Our next problem is to find where similar code patterns exist.</p>
<h4 id="static-exploration-of-root-cause-of-failure">Static exploration of root-cause of failure</h4>
<p>In order to explore code patterns similar to the root-cause of fuzzer-discovered crashes, we take a compiler-based code query approach.
We will be using <a href="https://clang.llvm.org/docs/LibASTMatchers.html">clang-query</a>, a tool that lets us efficiently query the abstract syntax tree of code bases.
The query syntax of clang-query is a functional language predicated over properties of the program AST.
I will try to break down what this means.
A tool like <code class="language-plaintext highlighter-rouge">grep</code> is what we seek to emulate: given a code pattern that is known to be vulnerable, we would like to search for its recurrences.
However, unlike <code class="language-plaintext highlighter-rouge">grep</code>, we do not match the textual representation of code, but rather how it looks to the compiler.
At the risk of oversimplification, I call it compiler grepping!
If you are wondering what compiler grepping brings to the table that <code class="language-plaintext highlighter-rouge">grep</code> does not: it lets us match against the structure and semantics of code rather than its appearance.
This can make a big difference, as we shall see.</p>
<p>The next question then is: How can we formulate compiler queries from code segments that we have obtained after fault localization?
To understand this, let’s try to understand what code segments look like to the compiler. Here’s a snippet of <code class="language-plaintext highlighter-rouge">abort.c</code>’s AST.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang -fsyntax-only -ast-dump abort.c
`-FunctionDecl 0x2c61778 <line:14:1, line:20:1> line:14:5 main 'int ()'
`-CompoundStmt 0x2c61d48 <col:13, line:20:1>
|-DeclStmt 0x2c618f8 <line:15:3, col:17>
| `-VarDecl 0x2c61898 <col:3, col:16> col:8 used buf 'char [256]'
|-CallExpr 0x2c61a00 <line:16:3, col:24> 'void *'
| |-ImplicitCastExpr 0x2c619e8 <col:3> 'void *(*)(void *, int, unsigned long)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x2c61910 <col:3> 'void *(void *, int, unsigned long)' Function 0x2baf100 'memset' 'void *(void *, int, unsigned long)'
</code></pre></div></div>
<p>Here’s the break down of the AST snippet:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">FunctionDecl</code> is an AST node that represents the declaration of the <code class="language-plaintext highlighter-rouge">main()</code> function</li>
<li><code class="language-plaintext highlighter-rouge">CompoundStmt</code> is an AST node that signals the start of the function’s body. Note that this node is a child of <code class="language-plaintext highlighter-rouge">FunctionDecl</code> implying that the <code class="language-plaintext highlighter-rouge">CompoundStmt</code> in question is to be found in the function body of <code class="language-plaintext highlighter-rouge">main()</code></li>
<li><code class="language-plaintext highlighter-rouge">DeclStmt</code> is an AST node that represents the declaration of the char buffer whose name is <code class="language-plaintext highlighter-rouge">buf</code>. The referenced variable <code class="language-plaintext highlighter-rouge">VarDecl</code> is a child of <code class="language-plaintext highlighter-rouge">DeclStmt</code> implying that the variable in question binds to the said declarative statement</li>
<li>… and so on.</li>
</ul>
<p>AST features (type of AST node, and its relationship to adjacent AST nodes) can help issue efficient queries for static exploration.
For example, if we want to explore all calls to the function <code class="language-plaintext highlighter-rouge">abort()</code> we can issue the following clang-query style query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang-query abort.c
clang-query> match declRefExpr(to(
functionDecl(hasName("abort"))
))
Match #1:
/home/bhargava/work/github/bshastry.github.io/code/abort.c:9:3: note: "root" binds here
abort();
^~~~~
1 match.
</code></pre></div></div>
<p>This example demonstrates how simple functional queries may be used to explore a code base.
In this work, we focus on directed exploration i.e., we would like to explore the code base with specific issues in mind.
To demonstrate this, consider the following stack trace discovered by fuzzing a modified version of the <code class="language-plaintext highlighter-rouge">abort.c</code> program that we shall call <code class="language-plaintext highlighter-rouge">abort-mod.c</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > abort-mod.c
1. #include <string.h>
2. #include <crypt.h>
3. #include <stdlib.h>
4. #include <unistd.h>
5. #define CUSTOM() abort()
6. void fuzzable(const char *input) {
7. // Fuzzer finds this bug
8. if (!strcmp(input, "doom"))
9. abort();
10. }
11. void cov_bottleneck(const char *input) {
12. char *hash = crypt(input, "salt");
13.
14. // Fuzzer is unlikely to find this bug
15. if (!strcmp(hash, "hash_val"))
16. CUSTOM(); // grep misses this
17. }
18.
19. // Fuzzer test harness
20. // INPUT: stdin
21. int main() {
22. char buf[256];
23. memset(buf, 0, 256);
24. read(0, buf, 255);
25. fuzzable(buf);
26. cov_bottleneck(buf);
27. return 0;
28. }
EOF
$ clang -g -lcrypt abort-mod.c
$ perl -e 'print "doom"' | gdb -q -ex=r -ex=bt -ex=quit ./a.out
Reading symbols from ./a.out...done.
Starting program: /home/bhargava/work/github/bshastry.github.io/code/a.out
Program received signal SIGABRT, Aborted.
0x00007ffff780a428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007ffff780a428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff780c02a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x000000000040073a in fuzzable (input=0x7fffffffd850 "doom") at abort-mod.c:9
#3 0x0000000000400814 in main () at abort-mod.c:25
</code></pre></div></div>
<p>Essentially, as expected, the input <code class="language-plaintext highlighter-rouge">doom</code> triggers a program abort. Things like this are relatively easy to find using a fuzzer.
Note, however, that a similar “vulnerability” is hiding under crypto code.
The fuzzer would need to generate a hash collision to get past the branch leading to this vuln, which is very unlikely.
Note also that the call to the <code class="language-plaintext highlighter-rouge">abort()</code> function is lexically different: it is called <code class="language-plaintext highlighter-rouge">CUSTOM()</code> and not <code class="language-plaintext highlighter-rouge">abort()</code>.
This is intentional, to show that lexical or even textual matching tools such as <code class="language-plaintext highlighter-rouge">grep</code> will not be able to match it for the query <code class="language-plaintext highlighter-rouge">abort</code>.
Now, I will demonstrate how we deal with code scenarios like those in the example.</p>
<p>First, we localize the defect using the stack trace.
If we filter out stack frames that do not belong to our source code (i.e., systems/library code) and pick the first remaining frame, we are left with the call to <code class="language-plaintext highlighter-rouge">abort()</code> in the <code class="language-plaintext highlighter-rouge">fuzzable()</code> function.
So let’s list all calls to <code class="language-plaintext highlighter-rouge">abort()</code> in the entire code base.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > abort_query.txt
match declRefExpr(to(functionDecl(hasName("abort"))))
EOF
$ clang-query -f=abort_query.txt abort-mod.c
Match #1:
/home/bhargava/work/github/bshastry.github.io/code/abort-mod.c:9:3: note: "root" binds here
abort();
^~~~~
Match #2:
/home/bhargava/work/github/bshastry.github.io/code/abort-mod.c:16:3: note: "root" binds here
CUSTOM(); // grep misses this
^~~~~~~~
/home/bhargava/work/github/bshastry.github.io/code/abort-mod.c:5:18: note: expanded from macro
'CUSTOM'
#define CUSTOM() abort ()
^~~~~
2 matches.
</code></pre></div></div>
<p>As shown, fuzzer-directed queries can help spot issues that might have been missed by fuzzing alone. This is where directed compiler-based queries help. Being static they can explore the entire code base without being hampered by dynamic bottlenecks such as cryptographic code or more simply code that doesn’t get exercised by existing unit tests.</p>
<h4 id="dealing-with-false-positives">Dealing with false positives</h4>
<p>This sounds too good to be true. It is. Static analysis over-approximates, which leads to false positives and, eventually, manual time spent validating reports.
For example, in the synthetic example above, a query for all calls to <code class="language-plaintext highlighter-rouge">abort()</code> is too broad to find real issues: there are likely calls to <code class="language-plaintext highlighter-rouge">abort()</code> in dead code and/or code that is not relevant.
In general, the more precisely we can model fuzzer crashes from the post-failure diagnostics (stack trace, core dump etc.), the better the static matches we get.
For the time being, we have a simple but effective way to facilitate manual review.</p>
<h4 id="ranking-matches">Ranking matches</h4>
<p>First, we measure the test coverage reached by fuzzing.
We do this using a coverage tracing tool such as Gcov or SanitizerCoverage.
Second, for each match returned by the static analyzer, we check whether it lies in code that fuzzing already covered.
Matches in unfuzzed code are prioritized for review.</p>
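<p>A minimal sketch of this ranking, assuming coverage data and static matches have already been reduced to (file, line) pairs (the data below is hypothetical):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># (file, line) pairs covered during fuzzing, e.g., parsed from sancov output.
fuzz_covered = {("abort-mod.c", 9)}

# (file, line) pairs reported by clang-query.
matches = [("abort-mod.c", 9), ("abort-mod.c", 16)]

# Unfuzzed matches first: fuzzing has not vetted them dynamically.
ranked = sorted(matches, key=lambda m: m in fuzz_covered)
print(ranked)  # [('abort-mod.c', 16), ('abort-mod.c', 9)]
</code></pre></div></div>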
<h2 id="results">Results</h2>
<p>This research was evaluated on the Open vSwitch codebase. It led to the discovery of several corner cases that OvS developers appreciated.
Notably, we showed that our method could spot a security regression that appeared in one release, and catch a real issue similar to a fuzzer-discovered vuln elsewhere in the same codebase.
The analysis undertaken is fast and thus doable on a regular basis, e.g., in CI.
I think the approach taken in this work holds promise for catching other classes of recurring vulns in large codebases.</p>
<p><a href="/2017/08/02/Diagnosing-Distributed-Vulnerabilities.html">Part 1</a> | <a href="/2017/08/03/Inferring-Program-Input-Format.html">Part 2</a> | <a href="/2017/08/04/Exploring-Fuzzer-Crashes.html">Part 3</a></p>Part 1 | Part 2 | Part 3Inferring Program Input Format2017-08-03T00:00:00+00:002017-08-03T00:00:00+00:00/2017/08/03/Inferring-Program-Input-Format<p><a href="/2017/08/02/Diagnosing-Distributed-Vulnerabilities.html">Part 1</a> | <a href="/2017/08/03/Inferring-Program-Input-Format.html">Part 2</a> | <a href="/2017/08/04/Exploring-Fuzzer-Crashes.html">Part 3</a></p>
<h2 id="prologue">Prologue</h2>
<p>This post is the second of the three part series on compiler assisted vulnerability diagnosis in open-source C/C++ code. “Compiler assisted” means that the presented techniques pivot around a compiler, and “vulnerability diagnosis” refers to the process of finding and fixing vulnerabilities (software weaknesses that can be used to intentionally cause harm). Software weaknesses (bugs) are a superset of vulnerabilities in that not all weaknesses are harmful from a security perspective. The challenging part of diagnosing vulnerabilities in source code is to arrive at the (usually) small subset of vulnerabilities from the (usually) larger set of bugs and non-bugs (that the source analyzer believes to be real bugs aka false positives).</p>
<h2 id="intro">Intro</h2>
<p>Coverage guided fuzzers such as afl-fuzz are clever enough to generate inputs that exercise new program paths. However, there are instances where additional help is valuable. By valuable, I mean one of two things: (1) it reduces the time to vulnerability exposure; and/or (2) it increases the number of vulns uncovered.
This post investigates one way in which such additional support may be provided to the fuzzer.</p>
<h2 id="inferring-input-format-from-source-code">Inferring Input Format From Source Code</h2>
<p>I will be using a <a href="https://llvm.org/docs/LibFuzzer.html">libFuzzer</a> test harness to demonstrate the central idea behind this post.
Consider the following code example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > libfuzzer-example.c
#include <stdbool.h> /* bool */
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uint8_t */
bool FuzzMe(const uint8_t *Data, size_t Size) {
return Size >=3 &&
Data[0] == 'F' &&
Data[1] == 'U' &&
Data[2] == 'Z' &&
Data[3] == 'Z';
}
int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
FuzzMe(Data, Size);
return 0;
}
EOF
</code></pre></div></div>
<p>All this test harness is doing is fuzzing a buggy function called <code class="language-plaintext highlighter-rouge">FuzzMe()</code> that contains an out-of-bounds read, triggered when <code class="language-plaintext highlighter-rouge">Size == 3 && input == "FUZ"</code>.
Let’s time libFuzzer on this test case, starting from an empty corpus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang++ -g -fsanitize=address -fsanitize-coverage=trace-pc-guard ~/FTS/tutorial/fuzz_me.cc libFuzzer.a
$ time ./a.out
...
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==15307==ABORTING
MS: 1 EraseBytes-; base unit: 6cdcffd840bb810dcdd4778c1a5caaa6cd012f0c
0x46,0x55,0x5a,
FUZ
artifact_prefix='./'; Test unit written to ./crash-0eb8e4ed029b774d80f2b66408203801cb982a60
Base64: RlVa
real 0m0.844s
user 0m0.440s
sys 0m0.180s
</code></pre></div></div>
<p>So, after roughly 0.8s, libFuzzer was able to find the input (“FUZ”) that triggered the single-byte out-of-bounds read.
That’s really fast.
However, it could be made faster still if we gain some insight into the program input format.
Let’s run a simple clang front-end tool to extract the character literals used in comparison statements, even before we start to fuzz.
Remember, we are doing a static pass over the source code here.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > clang-charlitmatcher.c
#include "clang/AST/ASTConsumer.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/FrontendAction.h"
#include "clang/Tooling/Tooling.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"
// Declares clang::SyntaxOnlyAction.
#include "clang/Frontend/FrontendActions.h"
#include "clang/Tooling/CommonOptionsParser.h"
// Declares llvm::cl::extrahelp.
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Regex.h"
using namespace clang::tooling;
using namespace llvm;
using namespace clang;
using namespace clang::ast_matchers;
// Apply a custom category to all command-line options so that they are the
// only ones displayed.
static cl::OptionCategory MyToolCategory("clang-sdict options");
// CommonOptionsParser declares HelpMessage with a description of the common
// command-line options related to the compilation database and input files.
// It's nice to have this help message in all tools.
static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
// A help message for this specific tool can be added afterwards.
static cl::extrahelp MoreHelp("\nTakes a compilation database and spits out CString Literals in source files\n");
// character literal in binary op matcher
StatementMatcher CharLitMatcher = characterLiteral(hasParent(binaryOperator())).bind("charlit");
class MatchPrinter : public MatchFinder::MatchCallback {
public :
void printToken(StringRef token) {
size_t tokenlen = token.size();
if ((tokenlen == 0) || (tokenlen > 128))
return;
llvm::outs() << "\"" + token + "\"" << "\n";
}
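// Pretty-print a hex string such as "46" as an escaped byte sequence
// such as "\x46", padding the string to even length first.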
void prettyPrintIntString(std::string inString) {
if (inString.empty())
return;
size_t inStrLen = inString.size();
if (inStrLen % 2) {
inString.insert(0, "0");
inStrLen++;
}
for (size_t i = 0; i < (2 * inStrLen); i+=4)
inString.insert(i, "\\x");
printToken(inString);
}
void formatCharLiteral(const CharacterLiteral *CL) {
unsigned value = CL->getValue();
std::string valString = llvm::APInt(8, value).toString(16, false);
prettyPrintIntString(valString);
}
virtual void run(const MatchFinder::MatchResult &Result) {
if (const clang::CharacterLiteral *CL = Result.Nodes.getNodeAs<clang::CharacterLiteral>("charlit"))
formatCharLiteral(CL);
}
};
int main(int argc, const char **argv) {
CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
ClangTool Tool(OptionsParser.getCompilations(),
OptionsParser.getSourcePathList());
MatchPrinter Printer;
MatchFinder Finder;
Finder.addMatcher(CharLitMatcher, &Printer);
return Tool.run(newFrontendActionFactory(&Finder).get());
}
EOF
</code></pre></div></div>
<p>Long story short, the clang front-end tool does the following:</p>
<ul>
<li>Makes a pass over source code AST</li>
<li>Looks for character literals that are children of binary operators</li>
<li>Prints these character literals</li>
</ul>
<p>Note that all of this is done in under 100 lines of code including boilerplate code.
Now, let’s run this against our libfuzzer code example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang-clmatcher libfuzzer-example.c > dict
$ cat dict
"\x46"
"\x55"
"\x5A"
"\x5A" }
</code></pre></div></div>
<p>Essentially, this gave us ‘F’, ‘U’, ‘Z’, ‘Z’ (after deduplication: ‘F’, ‘U’, and ‘Z’). Let’s put these tokens in an afl-style dictionary and reinvoke libFuzzer with it.
The idea is to compare the times libFuzzer takes with and without the dictionary. As we have already noted, it takes about 0.8s to spot the buffer over-read without a dictionary.</p>
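<p>For reference, an afl-style dictionary is just a plain-text file of quoted tokens, each optionally prefixed with a name; the tool output above is already usable as-is, but a hypothetical named variant would look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kw_f="\x46"
kw_u="\x55"
kw_z="\x5A"
</code></pre></div></div>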
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time ./a.out -dict=dict
...
MS: 3 ChangeByte-ShuffleBytes-EraseBytes-; base unit: d211f6eb0b35f1d135f354587b1a0851779fcc28
0x46,0x55,0x5a,
FUZ
artifact_prefix='./'; Test unit written to ./crash-0eb8e4ed029b774d80f2b66408203801cb982a60
Base64: RlVa
real 0m0.129s
user 0m0.012s
sys 0m0.024s
</code></pre></div></div>
<p>Naturally, it’s a lot faster because we already know some things about the input format. Of course, more information may be gathered such as the context in which certain tokens are used, the order in which they are used and so on. You may read how this can be done in the paper linked below.</p>
<h2 id="results">Results</h2>
<p>Statically generated dictionaries may make fuzzing campaigns more effective.
These dictionaries are particularly suitable for fuzzing applications that parse highly structured inputs, such as file format and network parsers.
For example, we found over 15 zero-day vulns in network parsers due to the use of dictionaries alone.
Having said that, understanding where dictionaries won’t help can inform whether using one is worthwhile.
Can they find bugs in non-parser code paths faster? No, because knowledge of the input format is irrelevant for bugs outside the parsing code path.
Will a smart fuzzer find these bugs by itself? Probably, eventually: good fuzzers usually find the same bugs given enough time.
However, dictionaries can support them by triggering these code paths much faster so that a fuzzer may “focus” on other interesting code paths.
You can read the <a href="http://users.sec.t-labs.tu-berlin.de/~bshastry/raid17.pdf">full paper</a> (to be published in the proceedings of RAID’17 by Springer) that this work produced and form your own opinion.</p>
<p><a href="/2017/08/02/Diagnosing-Distributed-Vulnerabilities.html">Part 1</a> | <a href="/2017/08/03/Inferring-Program-Input-Format.html">Part 2</a> | <a href="/2017/08/04/Exploring-Fuzzer-Crashes.html">Part 3</a></p>Part 1 | Part 2 | Part 3