<h1>Writing a Fuzz Unit Test for a Boost Filesystem API (2021-02-27)</h1>
<h3 id="intro">Intro</h3>
<p>This post summarizes one fuzz unit test for the Boost filesystem library and a bug it found.
Feel free to explore the rather vast landscape of boost filesystem APIs in order to write more unit tests.
Help make Boost more robust.</p>
<h3 id="fuzz-unit-test">Fuzz unit test</h3>
<p>The following unit test</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <boost/filesystem.hpp>
#include <string>
using namespace std;
using namespace boost::filesystem;
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size)
{
string pathString(reinterpret_cast<const char*>(data), size);
path p(pathString);
p.remove_filename();
return 0;
}
</code></pre></div></div>
<p>when compiled and run like so (tested on Linux bash console)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -e "
#include <boost/filesystem.hpp>
#include <string>
using namespace std;
using namespace boost::filesystem;
extern \"C\" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size)
{
string pathString(reinterpret_cast<const char*>(data), size);
path p(pathString);
p.remove_filename();
return 0;
}
" | clang++ -x c++ - -fsanitize=fuzzer -o fuzz_bfs -lboost_filesystem && time ./fuzz_bfs
</code></pre></div></div>
<p>prints the following output on the console (Linux, x86-64, clang v10, Boost v1.71)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO: Seed: 3723374228
INFO: Loaded 1 modules (321 inline 8-bit counters): 321 [0x4f5150, 0x4f5291),
INFO: Loaded 1 PC tables (321 PCs): 321 [0x4c8f98,0x4ca3a8),
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: A corpus is not provided, starting from an empty corpus
#2 INITED cov: 4 ft: 5 corp: 1/1b exec/s: 0 rss: 24Mb
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::erase: __pos (which is 18446744073709551615) > this->size() (which is 5)
==702779== ERROR: libFuzzer: deadly signal
#0 0x4b00f0 in __sanitizer_print_stack_trace (/home/bhargava/fuzz_bfs+0x4b00f0)
#1 0x45c3f8 in fuzzer::PrintStackTrace() (/home/bhargava/fuzz_bfs+0x45c3f8)
#2 0x441543 in fuzzer::Fuzzer::CrashCallback() (/home/bhargava/fuzz_bfs+0x441543)
#3 0x7f45aa9513bf (/lib/x86_64-linux-gnu/libpthread.so.0+0x153bf)
#4 0x7f45aa76218a in __libc_signal_restore_set /build/glibc-ZN95T4/glibc-2.31/signal/../sysdeps/unix/sysv/linux/internal-signals.h:86:3
#5 0x7f45aa76218a in raise /build/glibc-ZN95T4/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:48:3
#6 0x7f45aa741858 in abort /build/glibc-ZN95T4/glibc-2.31/stdlib/abort.c:79:7
#7 0x7f45aab6a950 (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x9e950)
#8 0x7f45aab7647b (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa47b)
#9 0x7f45aab764e6 in std::terminate() (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa4e6)
#10 0x7f45aab76798 in __cxa_throw (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa798)
#11 0x7f45aab6d3ea (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xa13ea)
#12 0x7f45aaac0a22 in boost::filesystem::path::remove_filename() (/usr/lib/x86_64-linux-gnu/libboost_filesystem.so.1.71.0+0x12a22)
#13 0x4b26a7 in LLVMFuzzerTestOneInput (/home/bhargava/fuzz_bfs+0x4b26a7)
#14 0x442c01 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) (/home/bhargava/fuzz_bfs+0x442c01)
#15 0x442345 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool*) (/home/bhargava/fuzz_bfs+0x442345)
#16 0x4445e7 in fuzzer::Fuzzer::MutateAndTestOne() (/home/bhargava/fuzz_bfs+0x4445e7)
#17 0x4452e5 in fuzzer::Fuzzer::Loop(std::__Fuzzer::vector<fuzzer::SizedFile, fuzzer::fuzzer_allocator<fuzzer::SizedFile> >&) (/home/bhargava/fuzz_bfs+0x4452e5)
#18 0x433c9e in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) (/home/bhargava/fuzz_bfs+0x433c9e)
#19 0x45cae2 in main (/home/bhargava/fuzz_bfs+0x45cae2)
#20 0x7f45aa7430b2 in __libc_start_main /build/glibc-ZN95T4/glibc-2.31/csu/../csu/libc-start.c:308:16
#21 0x408a3d in _start (/home/bhargava/fuzz_bfs+0x408a3d)
NOTE: libFuzzer has rudimentary signal handlers.
Combine libFuzzer with AddressSanitizer or similar for better crash reports.
SUMMARY: libFuzzer: deadly signal
MS: 4 ChangeBit-InsertRepeatedBytes-ShuffleBytes-EraseBytes-; base unit: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
0x2f,0x2f,0x2f,0x2f,0x2f,
/////
artifact_prefix='./'; Test unit written to ./crash-ece6d237a9393e5c002c541f9d4c92136941d956
Base64: Ly8vLy8=
real 0m1.610s
user 0m1.524s
sys 0m0.008s
</code></pre></div></div>
<p>This bug was <a href="https://github.com/boostorg/filesystem/issues/176">reported</a> upstream and promptly <a href="https://github.com/boostorg/filesystem/commit/cc57d28995c4a61e19d718040f9bc616b111a552">fixed</a> (thank you boost devs!).</p>
<p>The crash may be interpreted as follows:</p>
<ul>
<li>If you feed the input “/////” to a boost filesystem path object and attempt to remove the filename, it throws an exception</li>
<li>The exception is of type <a href="https://en.cppreference.com/w/cpp/error/out_of_range">std::out_of_range</a></li>
</ul>
<p>Quoting</p>
<blockquote>
<p>(std::out_of_range) reports errors that are consequence of attempt to access elements out of defined range.</p>
</blockquote>
<blockquote>
<p>It may be thrown by the member functions of std::bitset and std::basic_string, by std::stoi and std::stod families of functions, and by the bounds-checked member access functions (e.g. std::vector::at and std::map::at).</p>
</blockquote>
<p>Malformed inputs like this one should not cause a low-level exception such as this to escape a library API, which is why this is a bug.</p>
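<p>For completeness, the crash should also reproduce outside libFuzzer. Here is a minimal standalone reproducer (my own sketch, assuming Boost v1.71 and linking with <code class="language-plaintext highlighter-rouge">-lboost_filesystem</code>):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <boost/filesystem.hpp>

int main()
{
// "/////" is the crashing input found by the fuzzer
boost::filesystem::path p("/////");
p.remove_filename(); // throws std::out_of_range on Boost v1.71
}
</code></pre></div></div>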
<h3 id="conclusion">Conclusion</h3>
<p>It is rather easy to get started with fuzzing boost filesystem APIs.
The test in this blog post hardly spans three lines of code (excluding boilerplate), so you get the idea.
Hope this post inspires you to explore other nooks and corners of the Boost filesystem API, and perhaps even fuzz them.
That, in turn, would make the Boost C++ libraries that many of us (especially in the open-source world) rely on safer.
Stay healthy!</p>
<h1>Custom Proto Mutation (2019-12-27)</h1>
<h2 id="intro">Intro</h2>
<p>This post describes how you can write your own custom protobuf mutators. Protobuf mutators are routines that mutate or change protobuf input. Protobuf input is structured data; in its human-readable text form, it looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message {
sub_message {
int_field: 2
string_field: "hello"
}
}
</code></pre></div></div>
<p>A custom proto mutation is a routine that, say, mutates the <code class="language-plaintext highlighter-rouge">string_field</code> of <code class="language-plaintext highlighter-rouge">sub_message</code> from the string <code class="language-plaintext highlighter-rouge">hello</code> to the string <code class="language-plaintext highlighter-rouge">world</code>.</p>
<h2 id="motivation">Motivation</h2>
<p>What is the use of a custom proto mutation? The thing is, <a href="https://github.com/google/fuzzing/blob/master/docs/structure-aware-fuzzing.md">structured fuzzing</a> is useful to fuzz programs that accept structured input. A popular implementation technique to perform structured fuzzing is via the use of (1) the <a href="https://github.com/protocolbuffers/protobuf">protocol buffers library</a> to define input structure; and (2) the <a href="https://github.com/google/libprotobuf-mutator">libprotobuf mutator library</a> to perform random protobuf mutations. Random protobuf mutations may be sufficient already, so at the risk of sounding repetitive, what is the use of a custom proto mutation?</p>
<p>Well, think of it like this. Say you are fuzzing a program that you have written. You obviously know more about your program than a random fuzzer would, notwithstanding the power of coverage guidance. So, let’s say, you <strong>know</strong> that your program will perform a state transition when an input field described by <code class="language-plaintext highlighter-rouge">sub_message</code>’s <code class="language-plaintext highlighter-rouge">string_field</code> is <code class="language-plaintext highlighter-rouge">world</code> and not <code class="language-plaintext highlighter-rouge">hello</code>. Now, to trigger this mutation without a custom mutator, you’d typically wait for the random mutator, through a series of mutations, to change <code class="language-plaintext highlighter-rouge">hello</code> to <code class="language-plaintext highlighter-rouge">world</code>. Although this is not too far-fetched, it consumes resources i.e., time and computation cycles.</p>
<p>The point is, if you <strong>know</strong> some mutation is important for your program, why would you wait for it to be synthesized randomly? Why not program it as part of the fuzzer itself, right?</p>
<h2 id="writing-a-custom-proto-mutator">Writing a custom proto mutator</h2>
<p>Now, I describe the technical part of writing your own custom proto mutator, using <a href="https://github.com/google/oss-fuzz/tree/master/projects/libpng-proto">libpng proto fuzzer</a> as an example. The <a href="https://github.com/google/oss-fuzz/blob/master/projects/libpng-proto/png_proto_fuzzer_example.cc">libpng_proto_fuzzer_example.cc</a> source file describes how to convert protobuf structure defined in <a href="https://github.com/google/oss-fuzz/blob/master/projects/libpng-proto/png_fuzz_proto.proto">png_fuzz_proto.proto</a> to a PNG file. Let’s set ourselves the relatively simple task of writing a mutator that mutates an <code class="language-plaintext highlighter-rouge">OtherChunk</code> such that <code class="language-plaintext highlighter-rouge">unknown_type</code> chunks are changed to <code class="language-plaintext highlighter-rouge">known_type</code> chunks.</p>
<h3 id="libprotobuf-mutator-postprocessor-callbacks">libprotobuf-mutator postprocessor callbacks</h3>
<p>Before we code the actual mutation routine, let’s take some time to appreciate the callback facility provided by libprotobuf-mutator to enable custom mutations. I believe this callback was first implemented in <a href="https://github.com/google/libprotobuf-mutator/pull/137">this pull request</a>. Essentially, the user of libprotobuf-mutator, can register a postprocessor callback on a protobuf message type. This postprocessor is then invoked after <strong>every</strong> mutation performed by libprotobuf-mutator.</p>
<h3 id="callback-interface">Callback interface</h3>
<p>The callback interface <a href="https://github.com/google/libprotobuf-mutator/blob/dd89da92b59b1714bab6e2a135093948a1cf1c6a/src/libfuzzer/libfuzzer_macro.h#L109-L112">looks like so</a>. Essentially, the interface contains two input parameters:</p>
<ul>
<li>const pointer to message descriptor</li>
<li>function that implements the custom mutation routine. This function accepts two inputs:
<ul>
<li>pointer to protobuf message</li>
<li>seed (unsigned integer)</li>
</ul>
</li>
</ul>
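<p>Paraphrased in code (a sketch; consult the linked <code class="language-plaintext highlighter-rouge">libfuzzer_macro.h</code> for the authoritative declaration), the interface looks roughly like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Register a callback that is invoked after every LPM mutation of a
// message whose type matches `desc`
void RegisterPostProcessor(
    const google::protobuf::Descriptor* desc,
    std::function<void(google::protobuf::Message* message, unsigned int seed)>
        callback);
</code></pre></div></div>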
<p>I will briefly describe each of them in the following paragraphs.</p>
<h4 id="message">Message</h4>
<p>A protobuf message is a unit of input structure. A message may contain fields that may be of a value type (i.e., integer, bool, string etc.) or non-value type e.g., message. In our dummy example, <code class="language-plaintext highlighter-rouge">message</code> and <code class="language-plaintext highlighter-rouge">sub_message</code> are protobuf messages that describe something. The reason this is part of the callback interface is that, ultimately, we (custom mutation implementors) would like to mutate this data with custom changes.</p>
<h4 id="message-descriptor">Message descriptor</h4>
<p>A message descriptor describes the nature of a message. The reason this is part of the callback interface is that, internally, libprotobuf-mutator maps a callback (custom mutation routine) against a descriptor. So, for example, if we were to implement a custom mutator for changing the <code class="language-plaintext highlighter-rouge">string_field</code> in our dummy example, it would have to be registered against the <code class="language-plaintext highlighter-rouge">sub_message</code> message type’s descriptor. To do that, we use the static member function <code class="language-plaintext highlighter-rouge">sub_message::descriptor()</code> generated by protoc (the protobuf compiler).</p>
<h4 id="seed">Seed</h4>
<p>A seed is a pseudo-random number supplied by libprotobuf-mutator to help the mutation writer tune their mutation. The reason this is part of the callback interface is that, often, mutation routine implementors (us) would want their mutation to be applied only every once in a while. To permit this while keeping fuzzing deterministic, a pseudo-randomly (but deterministically) generated seed is supplied for use by the mutation routine implementor.</p>
<p>A simple manner in which <code class="language-plaintext highlighter-rouge">seed</code> may be used is via the modulo operator, like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Apply my mutation roughly once every three LPM mutations
if (seed % 3 == 0)
{
apply_my_mutation();
}
</code></pre></div></div>
<h4 id="callback-function">Callback function</h4>
<p>Now that we understand the structure and reasoning behind LPM’s postprocessor interface, we can implement the mutation routine: Change <code class="language-plaintext highlighter-rouge">hello</code> to <code class="language-plaintext highlighter-rouge">world</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>protobuf_mutator::libfuzzer::RegisterPostProcessor(
sub_message::descriptor(),
[](google::protobuf::Message* message, unsigned int seed)
{
sub_message *sub_msg = static_cast<sub_message *>(message);
if (seed % 2)
{
if (sub_msg->string_field() == "hello")
{
sub_msg->set_string_field("world");
}
}
}
);
</code></pre></div></div>
<p>Here’s what we are doing:</p>
<ul>
<li>Register a custom post processor for the <code class="language-plaintext highlighter-rouge">sub_message</code> message type</li>
<li>Statically cast the canonical protobuf message type to the <code class="language-plaintext highlighter-rouge">sub_message</code> message type before further checks</li>
<li>Apply the custom mutation 50% of the time</li>
<li>If <code class="language-plaintext highlighter-rouge">string_field</code> is set to <code class="language-plaintext highlighter-rouge">hello</code>, change it to <code class="language-plaintext highlighter-rouge">world</code></li>
</ul>
<h3 id="libpng-custom-mutator">libpng custom mutator</h3>
<p>Now, we are ready to apply what we have learnt to the linked libpng-proto fuzzer. Here’s <a href="https://github.com/google/oss-fuzz/pull/3168/files#diff-0e216d0c3c3e73c9bdee0a482ac288beR20-R33">a portion of the pull request</a> in which I implement a simple mutator routine that changes <code class="language-plaintext highlighter-rouge">unknown_type</code> chunks to a <code class="language-plaintext highlighter-rouge">known_type</code> chunk:</p>
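<p>Reconstructed from that PR (a sketch; the <code class="language-plaintext highlighter-rouge">OtherChunk</code> field names are assumed from <code class="language-plaintext highlighter-rouge">png_fuzz_proto.proto</code>, so see the diff for the exact code), the mutator looks roughly like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>protobuf_mutator::libfuzzer::RegisterPostProcessor(
    OtherChunk::descriptor(),
    [](google::protobuf::Message* message, unsigned int seed)
    {
        OtherChunk *chunk = static_cast<OtherChunk *>(message);
        // Every other mutation, turn an unknown chunk type into a known one
        if (seed % 2 && chunk->has_unknown_type())
        {
            chunk->set_known_type(seed);
        }
    }
);
</code></pre></div></div>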
<p>The really cool part is that it takes just 4 lines of source code to do this :-)</p>
<h2 id="conclusion">Conclusion</h2>
<p>This post hopefully made it easier for you to understand and write custom proto mutation routines for your fuzzer. Have fun writing them and experimenting a little until you find that elusive bug that randomness could not find ;-)</p>
<h1>Structure aware mruby fuzzer (2019-05-17)</h1>
<h2 id="intro">Intro</h2>
<p><a href="https://github.com/google/fuzzer-test-suite/blob/master/tutorial/structure-aware-fuzzing.md">Structure aware fuzzing</a> is a fuzzing technique in which you make the fuzzer aware of the structure of input.
This post describes the application of this technique to the mruby interpreter.</p>
<h2 id="what-is-mruby">What is mruby?</h2>
<p><a href="https://en.wikipedia.org/wiki/Mruby">mruby</a> is a lightweight ruby interpreter that is designed to be embeddable.
This means, you can use mruby to write a <a href="http://mruby.org/docs/articles/executing-ruby-code-with-mruby.html">20 line “C” program that executes ruby code</a>.
Cool, eh? Let’s fuzz it with arbitrary ruby code then.</p>
<h2 id="why-fuzz-mruby">Why fuzz mruby?</h2>
<p>There is some <a href="https://hackerone.com/shopify-scripts">evidence</a> that companies use mruby to execute potentially attacker-controlled ruby programs in security sensitive environments.</p>
<h2 id="structure-of-a-ruby-program">Structure of a ruby program</h2>
<p>Without awareness of the ruby programming language, the fuzzer is likely to synthesize junk.
I mean, today’s fuzzers are smart but they are not smart enough to synthesize ruby programs from thin air.
That’s the realm of machine learning, isn’t it?
Lol.</p>
<h3 id="function">Function</h3>
<p>Let’s prod the fuzzer along a little bit.
Let’s start by defining a very simple input template.
Our input template defines a function foo and invokes it thereafter.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def foo()
end
foo
</code></pre></div></div>
<p>Simple, isn’t it?
What does the protobuf specification for such a function look like?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Function {
}
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Function</code>, for the moment, is just a stub object, that we can “visit” (in the <a href="https://en.wikipedia.org/wiki/Visitor_pattern">visitor pattern sense</a>) like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void protoConverter::visit(Function const& x)
{
m_output << "def foo()\nvar_0 = 1\n";
m_output << "end\n";
m_output << "foo\n";
}
</code></pre></div></div>
<p>Simple as it is, foo doesn’t do anything.
To do something, we need a notion of statements.</p>
<h3 id="statements">Statements</h3>
<p>So let’s add a notion of statements.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message Const {
oneof const_oneof {
uint32 int_lit = 1;
bool bool_val = 2;
}
}
message Rvalue {
oneof rvalue_oneof {
Const cons = 1;
}
}
message AssignmentStatement {
required Rvalue rvalue = 2;
}
message Statement {
oneof stmt_oneof {
AssignmentStatement assignment = 1;
}
}
message StatementSeq {
repeated Statement statements = 1;
}
message Function {
required StatementSeq statements = 1;
}
</code></pre></div></div>
<p>This specification tells the fuzzer the following</p>
<ul>
<li>A function consists of a sequence of statements</li>
<li>A statement sequence consists of at least zero statements</li>
<li>A statement can be an assignment statement</li>
<li>An assignment statement consists of a value on the right hand side
<ul>
<li>The value can be a constant</li>
<li>A constant is either an unsigned integer or a boolean literal</li>
</ul>
</li>
</ul>
<p>Here’s the corresponding visitor.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void protoConverter::visit(AssignmentStatement const& x)
{
m_output << "var_" << m_numLiveVars << " = ";
visit(x.rvalue());
m_output << "\n";
}
void protoConverter::visit(Statement const& x)
{
switch (x.stmt_oneof_case()) {
case Statement::kAssignment:
visit(x.assignment());
break;
case Statement::STMT_ONEOF_NOT_SET:
break;
}
m_output << "\n";
}
void protoConverter::visit(Function const& x)
{
m_output << "def foo()\nvar_0 = 1\n";
visit(x.statements());
m_output << "end\n";
m_output << "foo\n";
}
</code></pre></div></div>
<p>Let’s see what this generates</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def foo()
var_0 = 1337
var_1 = false
end
foo
</code></pre></div></div>
<p>It’s definitely more lively than the foo we started out with, but it’s still sorta meh.</p>
<h3 id="more-statements">More statements</h3>
<p>We can essentially translate ruby programming language rules into a somewhat equivalent protobuf specification.
And trust me, there is a lot more to be done.
We can add the notion of strings, hash values, and operations on top of them to begin with.
We can teach the fuzzer what it means to call the <code class="language-plaintext highlighter-rouge">Time()</code> builtin object.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Time.at(628232400) #=> 1989-11-28 00:00:00 -0500
</code></pre></div></div>
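<p>As a sketch of what that could look like (the message and visitor below are my own invention, not code from the repo), one could extend <code class="language-plaintext highlighter-rouge">Statement</code>’s oneof with a <code class="language-plaintext highlighter-rouge">TimeStatement</code> and emit it like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical: add `TimeStatement time_stmt = 2;` to Statement's oneof
// message TimeStatement { required uint32 unix_time = 1; }
void protoConverter::visit(TimeStatement const& x)
{
// Emit a call to the Time.at builtin with a fuzzer-chosen timestamp
m_output << "var_" << m_numLiveVars << " = Time.at(" << x.unix_time() << ")\n";
m_numLiveVars++;
}
</code></pre></div></div>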
<p>I have made a humble beginning <a href="https://github.com/mruby/mruby/tree/master/oss-fuzz">here</a>.</p>
<ul>
<li><a href="https://github.com/mruby/mruby/blob/master/oss-fuzz/ruby.proto">Ruby proto spec</a></li>
<li><a href="https://github.com/mruby/mruby/blob/master/oss-fuzz/proto_to_ruby.cpp">Ruby proto spec to ruby program converter class</a></li>
</ul>
<p>Contributions welcome. Some specific directions for future work</p>
<ul>
<li>Add more ruby operations</li>
<li>Avoid generating DoSsy ruby programs like <code class="language-plaintext highlighter-rouge">print "1337"*10000000</code> (see the sketch after this list)</li>
</ul>
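<p>For the DoS item, one option (a sketch; the <code class="language-plaintext highlighter-rouge">StringRep</code> message and its <code class="language-plaintext highlighter-rouge">count</code> field are hypothetical) is to clamp fuzzer-chosen repetition counts inside the converter before emitting them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical visit() for a string-repetition statement: cap the
// fuzzer-chosen count so generated programs stay cheap to interpret
void protoConverter::visit(StringRep const& x)
{
uint32_t count = std::min(x.count(), 100u);
m_output << "var_" << m_numLiveVars << " = \"1337\" * " << count << "\n";
m_numLiveVars++;
}
</code></pre></div></div>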
<p>Help find deep bugs in the mruby interpreter.</p>
<h1>Deconstructing LibProtobuf/Mutator Fuzzing (2019-01-18)</h1>
<h3 id="intro">Intro</h3>
<p><a href="https://github.com/google/libprotobuf-mutator">LibProtobufMutator</a> (LPM) is a library that helps fuzz structured input from a <a href="https://github.com/protocolbuffers/protobuf">LibProtobuf</a> (LP) specification.
Among other things, LPM can <a href="https://chromium.googlesource.com/chromium/src/testing/libfuzzer/+/HEAD/libprotobuf-mutator.md#Write-a-grammar_based-fuzzer-with-libprotobuf_mutator">assist coverage-guided fuzzing</a>.
This post explores the nitty-gritties of writing an LP-based fuzzer using <a href="https://github.com/google/oss-fuzz/pull/2048">KCC’s example</a>.</p>
<h3 id="what-we-need">What we need</h3>
<p>To write an LP-based fuzzer, what you will need are:</p>
<ul>
<li>An LP specification: This is a descriptive file with a <code class="language-plaintext highlighter-rouge">.proto</code> extension</li>
<li>LP compiler: This compiles the LP spec. into code (C++ bindings) that can be called from the test harness</li>
<li>LP-to-native-format-converter: Since fuzzing happens on the LP abstraction, we need a LP formatted input to native format converter if we are to fuzz the native format.</li>
<li>Fuzzer test harness: This is a C/C++ test harness that invokes some program API that consumes (parses) native-formatted input</li>
</ul>
<p>Most importantly, what we don’t need is the LP fuzzer itself: code that mutates the LP formatted input. The fuzzer module is called LibProtobufMutator or LPM, and it is an external dependency.</p>
<p>This seems complicated at first; it definitely is for someone, like me, who has never written an LP-based fuzzer before.
I will try to make it simpler.</p>
<p>I think the big idea behind this was that it is harder to ask developers to write custom fuzz mutators than it is to ask them to write a format specification and test harness.
I’ve never written a custom fuzz mutator before, so I’m not in a position to present my experience.</p>
<p>That aside, the hope with this project is that this setup (LP-based fuzzing) catches bugs faster and more methodically.
Methodically because you are fuzzing the specification and not mutating an opaque sequence of bytes.
Faster, hopefully, because fuzzing only what needs to be fuzzed, with only those mutations that make sense, arrives at bugs sooner than blindly fuzzing everything.</p>
<h3 id="lp-specification">LP specification</h3>
<p>Here’s a simple LPM spec taken from <a href="https://github.com/google/oss-fuzz/pull/2048">here</a>.</p>
<script src="https://gist.github.com/7c78e89af167700387a2ac93798a1c29.js"> </script>
<p>Here’s a break-down of the most important fields:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">syntax = proto2;</code>: There are two versions of the protocol buffers language, namely <code class="language-plaintext highlighter-rouge">proto2</code> and <code class="language-plaintext highlighter-rouge">proto3</code>. This specification is written using <code class="language-plaintext highlighter-rouge">proto2</code>.</li>
<li><code class="language-plaintext highlighter-rouge">message</code>: <code class="language-plaintext highlighter-rouge">message</code>, although not explicitly defined iiuc, seems to be the smallest unit of a message description. It is a named field. For example <code class="language-plaintext highlighter-rouge">message IHDR {</code> defines a message format called <code class="language-plaintext highlighter-rouge">IHDR</code></li>
<li>field rule, type, name, number: A <code class="language-plaintext highlighter-rouge">field</code> is a portion of a message.
<ul>
<li>field rule: specifies if the field under consideration is required, optional, or repeated. They mean just that.</li>
<li>field type: specifies the data type of the field e.g., number (<code class="language-plaintext highlighter-rouge">uint32</code>), string etc.</li>
<li>field name: name of the field</li>
<li>field number: unique identifier for said field. It is good practice to start numbering from <code class="language-plaintext highlighter-rouge">1</code>, since smaller field numbers require less storage on the wire (numbers 1 through 15 encode in a single byte).</li>
</ul>
</li>
</ul>
<p>A much-needed digression to understand a real-world data format: the PNG image format. The structure of the simplest PNG image is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--------
PNG sig
--------
IHDR
--------
IDAT(s)
--------
IEND
--------
</code></pre></div></div>
<p>Barring <code class="language-plaintext highlighter-rouge">IDAT</code>, all chunks are singular, i.e., they must appear exactly once in a valid PNG file.</p>
<h4 id="png-signature">PNG signature</h4>
<p>The PNG signature is a specific sequence of bytes that signal the beginning of a PNG file. It looks like so (in C/C++ code)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const unsigned char header[] = {0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a};
</code></pre></div></div>
<h4 id="ihdr">IHDR</h4>
<p>IHDR stores image metadata such as width, height etc. Unlike the signature, IHDR contains variable fields, which makes it a good candidate for a protocol buffers message.</p>
<p>From the <a href="http://www.libpng.org/pub/png/spec/1.2/PNG-Contents.html">original PNG specification</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The IHDR chunk must appear FIRST. It contains:
Width: 4 bytes
Height: 4 bytes
Bit depth: 1 byte
Color type: 1 byte
Compression method: 1 byte
Filter method: 1 byte
Interlace method: 1 byte
</code></pre></div></div>
<p>Let’s look at the corresponding protobuf description:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message IHDR {
required uint32 width = 1; // maps to width
required uint32 height = 2; // maps to height
required uint32 other1 = 3; // maps to bitdepth-colortype-compmethod-filtmethod
required uint32 other2 = 4; // Only 1 byte used. (maps to interlacemethod)
}
</code></pre></div></div>
<p>As we can see, the protobuf description is “serialized” into fields of type <code class="language-plaintext highlighter-rouge">uint32</code> (4-byte sequences).
If you were to closely match the original IHDR spec, the proto-spec would look as follows (note the break-down of fields such as <code class="language-plaintext highlighter-rouge">bit_depth</code>, <code class="language-plaintext highlighter-rouge">color_type</code> etc.):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message IHDR {
required uint32 width = 1;
required uint32 height = 2;
enum bit_depth {
BD_ONE = 1;
BD_TWO = 2;
BD_FOUR = 4;
BD_EIGHT = 8;
BD_SIXTEEN = 16;
BD_MAX = 255; // BYTE_MAX
};
enum color_type {
CT_ZERO = 0;
CT_TWO = 2;
CT_THREE = 3;
CT_FOUR = 4;
CT_SIX = 6;
CT_MAX = 255; // BYTE_MAX
};
...
};
</code></pre></div></div>
<p>Although the <code class="language-plaintext highlighter-rouge">BYTE_MAX</code> option is not part of the specification, I have intentionally added it so that we make the mutator explore specific corner cases. This is hacky, I admit. Who is to say whether or not <code class="language-plaintext highlighter-rouge">200</code> is a better corner-case than <code class="language-plaintext highlighter-rouge">255</code>?</p>
<h4 id="idat">IDAT</h4>
<p>The IDAT chunk contains compressed image data. This means (in LP terms) its spec looks like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>message IDAT {
required bytes data = 1;
}
</code></pre></div></div>
<p>It’s an opaque byte stream; the mutator is free to synthesize whatever byte sequence it wants to fuzz an IDAT chunk.</p>
<h4 id="iend">IEND</h4>
<p>Here’s how the PNG spec defines IEND</p>
<blockquote>
<p>The IEND chunk must appear LAST. It marks the end of the PNG datastream. The chunk’s data field is empty.</p>
</blockquote>
<p>Essentially, it is a placeholder with no data that signifies the end of a PNG image.</p>
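<p>Because IEND is constant, the converter does not even need a protobuf message for it; the whole chunk is twelve fixed bytes (you can spot them at the tail of the valid PNG hexdump later in this post):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// The complete IEND chunk: a 4-byte length of zero, the ASCII type
// "IEND", no data, and the CRC-32 of the type field
const unsigned char iend[] = {0x00, 0x00, 0x00, 0x00,  // length = 0
                              0x49, 0x45, 0x4e, 0x44,  // "IEND"
                              0xae, 0x42, 0x60, 0x82}; // CRC-32 of "IEND"
</code></pre></div></div>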
<h3 id="the-lp-compiler">The LP compiler</h3>
<p>The LP compiler is called <code class="language-plaintext highlighter-rouge">protoc.</code> <code class="language-plaintext highlighter-rouge">protoc</code> compiles a Protobuf spec. (<code class="language-plaintext highlighter-rouge">.proto</code> file) into language bindings.
At the moment, the following language bindings are supported by the compiler: C++, Java, and Python.
In <a href="https://developers.google.com/protocol-buffers/docs/reference/other">these notes</a>, it appears that support for more languages is an ongoing effort.
Invoking the compiler is quite simple, as you can see <a href="https://github.com/google/oss-fuzz/pull/2048">here</a>, all you need to do is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rm -rf genfiles && mkdir genfiles && LPM/external.protobuf/bin/protoc png_fuzz_proto.proto --cpp_out=genfiles
</code></pre></div></div>
<p>This is</p>
<ul>
<li>Creating a fresh <code class="language-plaintext highlighter-rouge">genfiles</code> directory where C/C++ bindings will be stored</li>
<li>Invoking the <code class="language-plaintext highlighter-rouge">protoc</code> compiler that is available from the LPM repo against the PNG LP description we spoke about in the previous section of this blog</li>
<li>Explicitly asking the compiler to generate C++ bindings</li>
</ul>
<p>Essentially, what this step does is to create a set of C++ header/source files that may be included/linked against by the fuzzer test harness.
The generated header/C++ files offer a simple API to access the underlying raw data behind LPM fields.</p>
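<p>For instance (a sketch; the accessor names follow protoc’s standard naming for the <code class="language-plaintext highlighter-rouge">IHDR</code> message shown earlier in this post), the generated API lets the harness read fuzzer-chosen field values directly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include "png_fuzz_proto.pb.h" // protoc-generated header

// Read fuzzer-chosen values out of an IHDR message via generated accessors;
// protoc also generates setters, e.g. set_width(), on mutable messages
uint64_t PixelCount(const IHDR& ihdr)
{
    return static_cast<uint64_t>(ihdr.width()) * ihdr.height();
}
</code></pre></div></div>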
<h3 id="lp-to-native-format-converter">LP to native format converter</h3>
<p>Why do we need a converter in the first place?
Here’s the thing: The LPM generates LPM formatted input that, for PNG, looks like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># xxd C/002d3dd31b1bc41601c0e5d652b97f6599b23ba6
00000000: 6968 6472 207b 0a20 2077 6964 7468 3a20 ihdr {. width:
00000010: 300a 2020 6865 6967 6874 3a20 300a 2020 0. height: 0.
00000020: 6274 3a20 4244 5f4f 4e45 0a20 2063 743a bt: BD_ONE. ct:
00000030: 2043 545f 5448 5245 450a 2020 636d 3a20 CT_THREE. cm:
00000040: 434d 5f4d 4158 0a20 2066 6d3a 2046 4d5f CM_MAX. fm: FM_
00000050: 4d41 580a 2020 693a 2049 5f4d 4158 0a7d MAX. i: I_MAX.}
00000060: 0a
</code></pre></div></div>
<p>What we actually need when debugging is a valid PNG file that looks like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># xxd a.png
00000000: 8950 4e47 0d0a 1a0a 0000 000d 4948 4452 .PNG........IHDR
00000010: 0000 0000 0000 0000 0103 ffff ff01 fbc8 ................
00000020: 4300 0000 0049 454e 44ae 4260 82 C....IEND.B`.
</code></pre></div></div>
<p>As you can see, the LPM-generated file holds a bunch of <code class="language-plaintext highlighter-rouge">key:value</code> pairs in serialized form. These need to be parsed so that we can construct a serialized form of the <code class="language-plaintext highlighter-rouge">values</code> in PNG format. This is precisely the job of the converter.</p>
<p>In code terms, the converter is an integral part of the test harness itself (see next section).
The fuzzer harness, among other things, accepts an LPM-formatted input, converts it to a valid PNG byte stream, and feeds it to the fuzzer entry-point API.</p>
<h3 id="fuzzer-test-harness">Fuzzer test harness</h3>
<p>Here’s a gist of the test harness (written by KCC; I’m embedding it via a gist because I’ve not yet found a nifty way to directly embed GH files in GH pages) for us to break down</p>
<script src="https://gist.github.com/79fb0771418c1929b6c0d6b22bf3550a.js"> </script>
<p>Let’s look at the includes first:</p>
<ul>
<li>some standard stuff happening with <code class="language-plaintext highlighter-rouge"><string></code> etc.</li>
<li><code class="language-plaintext highlighter-rouge">zlib.h</code> is needed because (quoting the original spec.)</li>
</ul>
<blockquote>
<p>At present, only compression method 0 (deflate/inflate compression with a sliding window of at most 32768 bytes) is defined. All standard PNG images must be compressed with this scheme.
Deflate-compressed datastreams within PNG are stored in the “zlib” format</p>
</blockquote>
<ul>
<li><code class="language-plaintext highlighter-rouge">#include "libprotobuf-mutator/src/libfuzzer/libfuzzer_macro.h"</code>: This defines the <code class="language-plaintext highlighter-rouge">DEFINE_PROTO_FUZZER</code> that seems to be overridden (?) in the test harness. TBH, I dunno what’s happening here.</li>
<li><code class="language-plaintext highlighter-rouge">#include "png_fuzz_proto.pb.h"</code>: This is the <code class="language-plaintext highlighter-rouge">protoc</code> generated C++ binding header file for our LP spec.</li>
</ul>
<p>Past the header inclusions, you see several utility functions</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">WriteInt</code> writes an integer in big-endian (network byte order) format <a href="http://www.libpng.org/pub/png/book/chapter13.html">as required by the PNG spec</a></li>
<li><code class="language-plaintext highlighter-rouge">WriteByte</code> simply writes a byte</li>
<li><code class="language-plaintext highlighter-rouge">compress</code> performs zlib compression of chunk data. This is required for IDAT chunks especially</li>
<li><code class="language-plaintext highlighter-rouge">WriteChunk</code> writes a specified PNG chunk</li>
<li><code class="language-plaintext highlighter-rouge">ProtoToPng</code> is where a proto is converted to a <code class="language-plaintext highlighter-rouge">std::string</code> that contains the fuzzed PNG’s raw data (see previous section). This is where the LPM to native format conversion (see previous section) is happening.</li>
<li><code class="language-plaintext highlighter-rouge">FuzzPNG</code> is the real test harness: This function feeds fuzzed raw PNG data to the underlying PNG API</li>
</ul>
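<p>To make one of these concrete (a sketch of what <code class="language-plaintext highlighter-rouge">WriteInt</code> plausibly does; the gist above has the authoritative version), big-endian serialization boils down to emitting the most significant byte first:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cstdint>
#include <sstream>

// Serialize a 32-bit value in big-endian (network) byte order, as PNG requires
static void WriteInt(std::stringstream &out, uint32_t x)
{
    out.put(static_cast<char>((x >> 24) & 0xff));
    out.put(static_cast<char>((x >> 16) & 0xff));
    out.put(static_cast<char>((x >> 8) & 0xff));
    out.put(static_cast<char>(x & 0xff));
}
</code></pre></div></div>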
<p>The <code class="language-plaintext highlighter-rouge">FuzzPNG</code> function comes from the libpng source repo (note the <code class="language-plaintext highlighter-rouge">-DLLVMFuzzerTestOneInput=FuzzPNG</code> rename in the first command below), which is why the harness is compiled and linked like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$CXX $CXXFLAGS -c -DLLVMFuzzerTestOneInput=FuzzPNG libpng/contrib/oss-fuzz/libpng_read_fuzzer.cc -I libpng
$CXX $CXXFLAGS png_proto_fuzzer_example.cc libpng_read_fuzzer.o genfiles/png_fuzz_proto.pb.cc \
-I genfiles -I. -I libprotobuf-mutator/ -I LPM/external.protobuf/include \
-lz \
LPM/src/libfuzzer/libprotobuf-mutator-libfuzzer.a \
LPM/src/libprotobuf-mutator.a \
LPM/external.protobuf/lib/libprotobuf.a \
libpng/.libs/libpng16.a \
$LIB_FUZZING_ENGINE \
-o $OUT/png_proto_fuzzer_example
</code></pre></div></div>
<p>Were you to write the FuzzPNG function yourself, it would probably <a href="https://chromium.googlesource.com/chromium/src/+/master/testing/libfuzzer/fuzzers/libpng_read_fuzzer.cc">look like this</a>. Looks like standard stuff if you were to read <a href="http://www.libpng.org/pub/png/book/chapter13.html">Chapter 13 of the PNG book</a>.</p>
<h3 id="conclusion">Conclusion</h3>
<p>In this post, we explored</p>
<ul>
<li>What LibProtobufMutator is and how one can write an LP spec</li>
<li>How LP spec can help us write more targeted fuzzers</li>
<li>How the whole LP/LPM/libFuzzer setup is wired together</li>
</ul>
<p>Overall, I feel that LP-based fuzzing holds promise for testing language parsers, compilers, interpreters etc.
The challenge is to obtain an understanding of the underlying language well enough to be able to (1) write a spec for it and (2) write a proper LP-to-native format converter.</p>
<p>Although I think writing these things is not a big deal, it definitely takes dedicated time and effort.
This means, unless you draw benefits from such effort you are more likely to just download a corpus from the Internet and start fuzzing.
It’s essentially a cost-benefit trade-off.</p>
<p>In an upcoming post, I plan to compare a vanilla (non-specification) fuzzer with an LP-based fuzzer, with the hope that such a comparison sheds light on the actual benefits of LP-based fuzzing. That’s all folks!</p>
<h1>Quick Dive into Trail of Bits’ Slither (2018-11-05)</h1>
<h2 id="intro">Intro</h2>
<p><a href="https://github.com/trailofbits/slither">Slither</a> is a static analyzer that has been developed by Trail of Bits to help smart contract developers find bugs in their code.
In this post, I’ll try to get my hands dirty with Slither so you don’t have to.
Moreover, having a background writing static analysis tools myself, I’m
curious how Slither is architected and I’m excited at the prospect of writing
a detector for it…one day.</p>
<p>This post attempts to understand the work-flow of Slither.
The target audience for this post is folks who</p>
<ul>
<li>would like to understand the architecture/work-flow of Slither</li>
<li>would like to start to write a detector (like me) but don’t know where to
start</li>
</ul>
<p>Treat this as a (shoddy) introduction to Slither that, at the
time of writing, addresses only the author’s curiosity. Haha.</p>
<p>First things first, Slither itself is written in <code class="language-plaintext highlighter-rouge">python3</code>, yaay!
One of the first things slither does is to use the solidity compiler (<code class="language-plaintext highlighter-rouge">solc</code>
binary) to obtain the AST of the program to be analyzed.
Therefore, before I proceed, let me install the Solidity compiler.
Since most of the test contracts in the slither code base are targeted at
compiler version 0.4.24, I chose to pick it up from the official GitHub page
<a href="https://github.com/ethereum/solidity/releases/tag/v0.4.24">here</a>.
One could also fetch the officially distributed compiler for your Ubuntu
distribution like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo add-apt-repository ppa:ethereum/ethereum
sudo apt-get update
sudo apt-get install solc
</code></pre></div></div>
<h2 id="try-slither-out">Try Slither Out</h2>
<p>After installing the <code class="language-plaintext highlighter-rouge">solc</code> binary, I set up a python IDE to debug slither.
Essentially, the idea is to use a good debugger (I’m using Jet Brain’s PyCharm) to step through slither code and understand the steps involved in analyzing smart contracts.</p>
<p>The invocation that I am using for debugging is the elementary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ slither <name_of_contract>.sol
</code></pre></div></div>
<p>What this is supposed to do is analyze the source code of the contract and spit out bug reports, like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO:Detectors: Uninitialized state variable in ../solidity/001_name_references.sol, Contract: test, Variable: variable, Used in ['f']
INFO:Detectors: Contract 'test' is not in CapWords
INFO:Detectors: Parameter '' is not in mixedCase, Contract: '', Function: 'test''
</code></pre></div></div>
<p>What you’d notice when you run slither against buggy code are the following things</p>
<ul>
<li>The smart contract to be analyzed needs to be compilable but not
necessarily runnable</li>
<li>Bug reports are spit out on <code class="language-plaintext highlighter-rouge">stderr</code></li>
<li>Each bug report is prefixed with the string <code class="language-plaintext highlighter-rouge">INFO:Detectors:</code></li>
</ul>
<p>But this is too high level; let’s step through slither at an easier pace.</p>
<h3 id="entry-point">Entry point</h3>
<p>The entry point for <code class="language-plaintext highlighter-rouge">slither</code> is the main function of course.
This function is defined in a python file called <code class="language-plaintext highlighter-rouge">__main__.py</code> in the slither distribution.
The very first thing this main function does is to fetch all <code class="language-plaintext highlighter-rouge">detectors</code> and <code class="language-plaintext highlighter-rouge">printers</code>.
Each <code class="language-plaintext highlighter-rouge">detector</code> object in slither detects a class of bugs, and each <code class="language-plaintext highlighter-rouge">printer</code> object logs useful information about the program under analysis e.g., its call graph, what a function is trying to do (so called function summary) etc.</p>
<h3 id="detectors">Detectors</h3>
<p>To get a sense of the kind of bugs Slither detects, let’s look at the default set of detectors that Slither provides.
Here’s an exhaustive list at the time of writing</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UninitializedStateVarsDetection,
ConstantPragma,
OldSolc,
Reentrancy,
UninitializedStorageVars,
LockedEther,
ArbitrarySend,
Suicidal,
UnusedStateVars,
TxOrigin,
Assembly,
LowLevelCalls,
NamingConvention,
ConstCandidateStateVars,
ExternalFunction
</code></pre></div></div>
<p>That makes it a total of 15 detectors for as many bug classes.
A brief digression: Until we have a formalization of bug classes as in the
C/C++ space (see the <a href="https://cwe.mitre.org/">common weakness enumeration</a> project), I’d expect
bug classification for Solidity to be largely ad-hoc.</p>
<p>Let’s dive deep into an elementary bug class to see how bug detection is
implemented.
The <code class="language-plaintext highlighter-rouge">Backdoor</code> detector (unlisted, but available in the source) is a
demo detector that makes for a good starting example.
Here’s the <code class="language-plaintext highlighter-rouge">backdoor.sol</code> contract from the slither code base
that the backdoor detector is meant to flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pragma solidity 0.4.24;
contract C{
function i_am_a_backdoor() public{
selfdestruct(msg.sender);
}
}
</code></pre></div></div>
<p>Evidently, this contract</p>
<ul>
<li>defines a function that calls the <code class="language-plaintext highlighter-rouge">selfdestruct</code> method on the msg sender</li>
</ul>
<p>What’s the <code class="language-plaintext highlighter-rouge">selfdestruct</code> method?</p>
<blockquote>
<p>The only possibility that code is removed from the blockchain is when a contract at that address performs the selfdestruct operation. The remaining Ether stored at that address is sent to a designated target and then the storage and code is removed from the state.</p>
</blockquote>
<p>In this intentionally buggy piece of code:</p>
<ul>
<li>When some other contract (or account) calls <code class="language-plaintext highlighter-rouge">C.i_am_a_backdoor()</code>, contract <code class="language-plaintext highlighter-rouge">C</code> self-destructs: its code and storage are
removed from the blockchain state, and its remaining Ether is sent to
<code class="language-plaintext highlighter-rouge">msg.sender</code>, i.e., the caller of <code class="language-plaintext highlighter-rouge">C.i_am_a_backdoor()</code>.</li>
<li>Because the function is public and unguarded, anyone can destroy the contract and drain its balance, which is what makes it a backdoor</li>
</ul>
<p>So, let’s see what happens when Slither analyzes this piece of code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO:Detectors: Backdoor function found in C.i_am_a_backdoor
INFO:Detectors: Suicidal function in /home/bhargava/work/github/slither/tests/backdoor.sol Contract: C, Function: i_am_a_backdoor
INFO:Detectors: Function 'i_am_a_backdoor' is not in mixedCase, Contract: 'C'
INFO:Detectors: Public function in /home/bhargava/work/github/slither/tests/backdoor.sol Contract: C, Function: i_am_a_backdoor should be declared external
INFO:Slither:/home/bhargava/work/github/slither/tests/backdoor.sol analyzed (1 contracts), 4 result(s) found
</code></pre></div></div>
<p>Voila, the backdoor function is flagged and reported to the user (see first
line of report).
We will ignore the other bugs flagged by other detectors since our purpose is
to get a general sense of how detection works, not understand the specifics
of a particular detector.
So, how does the detection work under the hood?</p>
<p>Well, to begin with, any static analyzer needs to “understand” the code being
analyzed.
What needs to be understood is essentially: “What is this program trying to
do? Is there a bug in it?”.
These two questions hinge on semantic program analysis which is a complex
problem.</p>
<p>We can begin to get a semantic understanding of a program by first looking at
its syntax tree.
A syntax tree is a tree: A directed acyclic graph that remains acyclic even
if directionality is removed.
The nodes of the tree are syntactic elements of the programming language in
which the analyzed program is written.
Here’s a snippet of an actual AST (as a JSON string) of the backdoor program
shown above.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"attributes"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"absolutePath"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"tests/backdoor.sol"</span><span class="p">,</span><span class="w">
</span><span class="nl">"exportedSymbols"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"C"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="mi">11</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"children"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"attributes"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"literals"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="s2">"solidity"</span><span class="p">,</span><span class="w">
</span><span class="s2">"0.4"</span><span class="p">,</span><span class="w">
</span><span class="s2">".24"</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"id"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"PragmaDirective"</span><span class="p">,</span><span class="w">
</span><span class="nl">"src"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"0:23:0"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"attributes"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"baseContracts"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="kc">null</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"contractDependencies"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="kc">null</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"contractKind"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"contract"</span><span class="p">,</span><span class="w">
</span><span class="nl">"documentation"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
</span><span class="nl">"fullyImplemented"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"linearizedBaseContracts"</span><span class="w"> </span><span class="p">:</span><span class="w">
</span><span class="p">[</span><span class="w">
</span><span class="mi">11</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"name"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"C"</span><span class="p">,</span><span class="w">
</span><span class="nl">"scope"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="err">}</span><span class="w">
</span></code></pre></div></div>
<p>Hope this gives you a sense of the AST.
The AST is essentially a dictionary object with certain top-level attributes
and a list of children.
For example, one of the children is the <code class="language-plaintext highlighter-rouge">pragma</code> directive on line 1 of
<code class="language-plaintext highlighter-rouge">backdoor.sol</code>.
This child contains an ID, mapping to the source file, and a list of string
literals it holds together.
In the following, I briefly describe what happens inside Slither even before
bug detection is attempted.</p>
<h3 id="step-1-obtain-ast">Step 1: Obtain AST</h3>
<p>The first thing that slither does is <a href="https://github.com/trailofbits/slither/blob/master/slither/slither.py#L30">obtain the AST</a> of the analyzed
program in the form of a JSON string using the Solidity compiler, <code class="language-plaintext highlighter-rouge">solc</code>.
<code class="language-plaintext highlighter-rouge">solc</code> supports this off-the-shelf with such an
invocation as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./solc tests/backdoor.sol --ast-json --allow-paths .
</code></pre></div></div>
<h3 id="step-2-parse-ast-into-cfg">Step 2: Parse AST into CFG</h3>
<p>Once the AST (JSON string) has been obtained, the next thing Slither does is
to parse it.
This entails <a href="https://github.com/trailofbits/slither/blob/master/slither/slither.py#L34">walking the JSON representation of the AST</a>.
The AST parsing in Slither is quite sophisticated, not something I can
describe succinctly here.</p>
<p>The main idea behind parsing the AST is to create a (cyclic) directed graph
that shows control flow in the analyzed smart contract.
This is necessary because the AST itself is not adequate to grasp control-flow.</p>
<p>The control-flow graph is created at function granularity, i.e.,
each function in the analyzed smart contract maps to a corresponding CFG.
You can find the function that does the AST parsing/CFG creation <a href="https://github.com/trailofbits/slither/blob/master/slither/solc_parsing/declarations/function.py#L614">here</a>.</p>
<h3 id="step-3-drop-to-slithir">Step 3: Drop to Slithir</h3>
<p>Once the CFG has been created for all functions in the smart contract under
analysis, Slither drops the AST/CFG representation of the analyzed smart
contract into an <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">SSA-based</a> intermediate representation called Slithir.
By “dropping”, I mean conversion from a higher-level program abstraction
(AST/CFG) to a lower-level program abstraction (Slithir).
But why?</p>
<p>I can only hazard the following guesses:</p>
<ul>
<li>Analysis based on an IR removes the dependency on the PL in which a smart
contract is written. If tomorrow, a new smart contract PL is invented,
Slither can still support it by adding a parser/converter to IR.</li>
<li>SSA-based IR makes certain kinds of analysis simpler (see section
called “Benefits” in the <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">SSA wiki article</a>)</li>
</ul>
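<p>To make the SSA idea concrete, here is a tiny before/after illustration (mine, in C-like pseudocode, not Slithir syntax); in SSA form every variable is assigned exactly once, so each use maps to exactly one definition:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Original         // SSA form
x = 1;              x_1 = 1;
x = x + 2;          x_2 = x_1 + 2;
y = x;              y_1 = x_2;
</code></pre></div></div>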
<h3 id="step-4-detect-backdoor">Step 4: Detect Backdoor</h3>
<p>Steps 1–3 are performed as the <a href="https://github.com/trailofbits/slither/blob/master/slither/__main__.py#L34">Slither python object is created</a>.
Once the analysis infrastructure is ready (AST,CFG,Slithir), detectors are
processed sequentially.
Each detector encodes the “business logic” of detection for the bug class
that it is meant to detect.</p>
<p>So, let’s see what’s happening in the sample backdoor detector.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Backdoor</span><span class="p">(</span><span class="n">AbstractDetector</span><span class="p">):</span>
<span class="s">"""
Detect function named backdoor
"""</span>
<span class="n">ARGUMENT</span> <span class="o">=</span> <span class="s">'backdoor'</span> <span class="c1"># slither will launch the detector with slither.py --mydetector
</span> <span class="n">HELP</span> <span class="o">=</span> <span class="s">'Function named backdoor (detector example)'</span>
<span class="n">IMPACT</span> <span class="o">=</span> <span class="n">DetectorClassification</span><span class="p">.</span><span class="n">HIGH</span>
<span class="n">CONFIDENCE</span> <span class="o">=</span> <span class="n">DetectorClassification</span><span class="p">.</span><span class="n">HIGH</span>
<span class="k">def</span> <span class="nf">detect</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">ret</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">contract</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">slither</span><span class="p">.</span><span class="n">contracts_derived</span><span class="p">:</span>
<span class="c1"># Check if a function has 'backdoor' in its name
</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">contract</span><span class="p">.</span><span class="n">functions</span><span class="p">:</span>
<span class="k">if</span> <span class="s">'backdoor'</span> <span class="ow">in</span> <span class="n">f</span><span class="p">.</span><span class="n">name</span><span class="p">:</span>
<span class="c1"># Info to be printed
</span> <span class="n">info</span> <span class="o">=</span> <span class="s">'Backdoor function found in {}.{}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">contract</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
<span class="c1"># Print the info
</span> <span class="bp">self</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">info</span><span class="p">)</span>
<span class="c1"># Add the result in ret
</span> <span class="n">source</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">source_mapping</span>
<span class="n">ret</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">'vuln'</span><span class="p">:</span> <span class="s">'backdoor'</span><span class="p">,</span> <span class="s">'contract'</span><span class="p">:</span> <span class="n">contract</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="s">'sourceMapping'</span> <span class="p">:</span> <span class="n">source</span><span class="p">})</span>
<span class="k">return</span> <span class="n">ret</span>
</code></pre></div></div>
<p>You’ll notice that the business logic of bug detection is quite concise.
The detection logic resides in the <code class="language-plaintext highlighter-rouge">detect</code> method of the <code class="language-plaintext highlighter-rouge">Backdoor</code> object
that implements the <code class="language-plaintext highlighter-rouge">AbstractDetector</code> interface.
To my mind, this is the python equivalent of a <a href="https://llvm.org/devmtg/2012-11/Zaks-Rose-Checker24Hours.pdf">Clang Static Analyzer
checker</a>.</p>
<p>Everything that a detector wants to know about the program is contained in
the <code class="language-plaintext highlighter-rouge">self.slither</code> object.
This object contains the following fields:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">contracts_derived</code>: This field holds the analyzed (most derived) contracts; each contract object contains, among other fields:
<ul>
<li><code class="language-plaintext highlighter-rouge">_data</code>: AST obtained from the Solidity compiler</li>
<li><code class="language-plaintext highlighter-rouge">functions</code>: CFG of all functions in the contract</li>
<li><code class="language-plaintext highlighter-rouge">slither</code>: Slithir representation of the contract</li>
</ul>
</li>
</ul>
<p>The detector uses this information to decide whether to flag a bug or not.
A detector need only use the information that is necessary for the bug
detection logic.
For example, here’s what the backdoor detector is doing</p>
<ul>
<li>Iterate over all functions in the analyzed contract
<ul>
<li>If a function is called “backdoor”
<ul>
<li>Flag a bug saying “backdoor found”</li>
</ul>
</li>
</ul>
</li>
<li>return a nicely formatted bug diagnostics object (list of dictionaries,
each dictionary being a distinct bug report)</li>
</ul>
<p>In other words, the <code class="language-plaintext highlighter-rouge">backdoor</code> detector is only using the <code class="language-plaintext highlighter-rouge">function.name</code>
field in the function’s CFG to flag a bug.
Of course, this is cheating because you can’t simply conclude that a function is
a backdoor just because it is named one.
However, the reason I picked up this specific detector is because it is meant
as an introduction to writing detectors.</p>
<p>In the real-world, you’d do some analysis on the IR (e.g., check if the
analyzed function makes a call to the <code class="language-plaintext highlighter-rouge">selfdestruct</code> function) before
concluding that it is indeed a backdoor.
Perhaps, this entails listing all calls made by a function and checking if
<code class="language-plaintext highlighter-rouge">selfdestruct</code> happens to be one of them.</p>
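<p>For instance, here is a minimal sketch of what such a <code class="language-plaintext highlighter-rouge">detect</code> method could look like, modeled on the backdoor detector above. The <code class="language-plaintext highlighter-rouge">solidity_calls</code> field is an assumption based on my reading of the code, so treat this as illustrative rather than as Slither’s canonical way of doing it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def detect(self):
    ret = []
    for contract in self.slither.contracts_derived:
        for f in contract.functions:
            # List all built-in calls made by f and check whether
            # selfdestruct is one of them (field name is an assumption)
            calls = [str(c) for c in f.solidity_calls]
            if any('selfdestruct' in c for c in calls):
                self.log('Possible backdoor (selfdestruct) in {}.{}'.format(
                    contract.name, f.name))
                ret.append({'vuln': 'suicidal',
                            'contract': contract.name,
                            'sourceMapping': f.source_mapping})
    return ret
</code></pre></div></div>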
<h2 id="outro">Outro</h2>
<p>So that was a quick dive into Slither.
We laid out the workflow of Slither: (1) taking the AST of a smart
contract as input, (2) producing its CFG, (3) reducing this to an SSA-based
IR, and (4) finally, detecting bugs based on program information contained in
the IR.</p>
<p>If there is some specific aspect of Slither you’d want to know more about
that this post didn’t cover, let me know.
When I have the time, I’d be more than happy to write a part 2 of this post.
That’s all folks.</p>IntroFuzzing the Solidity Compiler2018-10-20T00:00:00+00:002018-10-20T00:00:00+00:00/2018/10/20/Fuzzing-Solidity-Compiler<h2 id="intro">Intro</h2>
<p>This post describes related work in the field of compiler fuzzing, the motivation for fuzzing the Solidity compiler, how to fuzz it, and the kind of bugs it helps find.
In the final section of this post, I briefly discuss what could be done to target more interesting code.</p>
<p>First things first.
Solidity is a high-level programming language for creating smart contracts.
The <a href="https://github.com/ethereum/solidity">solidity compiler</a> is the official compiler for programs (aka smart contracts) written in the Solidity programming language.
In the context of this post, Solidity means the compiler implementation and not the language itself.</p>
<p>Disclaimer: The bugs disclosed in this post have been reported upstream. More importantly, the bugs are benign typing errors that have no security implications to the best of my knowledge.
Therefore, I see no harm in disclosing them.
If this post inspires you to fuzz Solidity and you happen to find a security-critical bug, please consider reporting it to the <a href="https://bounty.ethereum.org/">Ethereum bounty program</a>.</p>
<h2 id="related-work">Related Work</h2>
<p>Folks have fuzzed</p>
<ul>
<li>Ethereum VM implementations e.g., <a href="https://github.com/trailofbits/echidna">this</a>, <a href="https://github.com/holiman/evmfuzz">that</a></li>
<li>Applications (smart contracts) e.g., <a href="https://dl.acm.org/citation.cfm?id=3238177">this</a></li>
</ul>
<p>The compiler, Solidity, has garnered less attention.
Solidity falls in between applications and the EVM.
It compiles applications to EVM byte code that is executed by the underlying EVM implementation.</p>
<p>Fuzzing compilers is nothing new.
For example, the <a href="https://embed.cs.utah.edu/csmith/">CSmith</a> project is geared towards finding bugs in C compilers.
<a href="https://llvm.org/devmtg/2017-10/slides/Serebryany-Structure-aware%20fuzzing%20for%20Clang%20and%20LLVM%20with%20libprotobuf-mutator.pdf">Kostya Serebryany’s</a> talk at llvm-dev meeting describes how to intelligently fuzz compilers using a technique he calls “structure aware fuzzing”.
His main observation is that fuzzing compilers with generic mutators (e.g., bit flips, add/remove bytes) is less likely to generate parseable programs.
So his talk is a call for mutators that understand the structure of input accepted by the program e.g., the structure of a C program.
This is an interesting idea for fuzzing Solidity as well, which I briefly discuss in the final section of this post.</p>
<h2 id="motivation">Motivation</h2>
<p>Some reasons for fuzzing the Solidity compiler are:</p>
<ul>
<li>Test compiler stability e.g., crash freedom</li>
<li>Test compiler correctness e.g., code generation</li>
</ul>
<p>I will add one more reason that drew me to fuzzing Solidity</p>
<ul>
<li>Test the de-facto Solidity specification</li>
</ul>
<p>Here, I refer to the following statement sourced from a paper titled “Defining the Ethereum Virtual Machine for Interactive Theorem Provers” by Y. Hirai (<strong>emphasis mine</strong>).</p>
<blockquote>
<p>Although ultimately all Ethereum smart contracts are deployed as EVM bytecode, the bytecode is rarely directly written.
The most popular programming language Solidity has a rich syntax but <strong>no specification</strong>. <strong>The only definition of Solidity is the Solidity compiler implementation</strong>, which compiles Solidity programs into EVM bytecode.</p>
</blockquote>
<p>To me, this implies:</p>
<ul>
<li>Bugs in Solidity may impact correctness of Solidity-written smart contracts</li>
<li>Bugs in Solidity may shed light on bugs in Solidity language design</li>
</ul>
<p>I don’t think Solidity is the only language that does not have a specification.
Actually, I’m pretty sure very few programming languages have a formal spec.
So, I’m not sure these reasons are specific to Solidity.
Perhaps, the most important reason to fuzz the Solidity compiler is (quoting Y. Hirai again)</p>
<blockquote>
<p>A deployed Ethereum smart contract is public under adversarial scrutiny, and the code is not
updatable. Most applications (auctions, prediction markets, identity/reputation
management etc.) involve smart contracts managing funds or authenticating external
entities. In this environment, the code should be trustworthy.</p>
</blockquote>
<p>In the worst case, bugs in Solidity could lead to unintended code execution in the context of security-critical applications.
However, the bugs discussed in this post are benign so treat my previous statement as FUD.</p>
<h2 id="test-harness">Test harness</h2>
<p>Fortunately for me, the test harness that was used for fuzzing is maintained in the source repo.
It is my understanding that Solidity is routinely fuzzed using afl-fuzz.
So, kudos to the Solidity team for having integrated fuzzing into their SDLC.</p>
<p>Here’s what the test harness looks like at a high level:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int main()
{
...
// data, size are sourced from stdin
string input(reinterpret_cast<const char*>(data), size);
testCompiler(input);
}
</code></pre></div></div>
<p>Essentially, it:</p>
<ul>
<li>Takes a binary byte stream from stdin</li>
<li>converts this into a string
<ul>
<li>The string is the solidity program that is fed to the compiler</li>
</ul>
</li>
<li>compiles the string (solidity program)</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">testCompiler</code> is a utility function that eventually makes a call to the <code class="language-plaintext highlighter-rouge">compileStandard</code> API exposed by the solidity compiler library called <code class="language-plaintext highlighter-rouge">libsolc</code>.
The nifty thing about this API interface is that it does I/O via JSON objects.
This means the <code class="language-plaintext highlighter-rouge">compileStandard</code> API accepts input via a JSON object and spits out another JSON object as output.
How is the input string (solidity program) serialized into a JSON object you ask?</p>
<p>Simple: the fuzzed input goes into a field called <code class="language-plaintext highlighter-rouge">sources[""]["content"]</code>. Here’s a sample input accepted by <code class="language-plaintext highlighter-rouge">compileStandard</code></p>
<script src="https://gist.github.com/30193d6a3ae438043821d04ff3f863dd.js"> </script>
<p>The other fields in this JSON object are targeted at configuring compilation parameters such as optimization level, compiler output formatting etc.
The output produced by the API is rather long but very detailed, so let’s overlook that for now.</p>
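<p>For concreteness, here is a minimal Python sketch of how one could assemble such an input. The gist above is the authoritative sample; the <code class="language-plaintext highlighter-rouge">settings</code> fields shown here are illustrative knobs, not an exhaustive schema:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

# The fuzzed bytes land in sources[""]["content"]
contract = 'contract C { function f() public pure returns (uint) { return 42; } }'

std_input = {
    "language": "Solidity",
    "sources": {"": {"content": contract}},
    # Illustrative compilation parameters (optimization, output selection)
    "settings": {
        "optimizer": {"enabled": False},
        "outputSelection": {"*": {"*": ["evm.bytecode"]}},
    },
}
print(json.dumps(std_input, indent=2))
</code></pre></div></div>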
<h2 id="fuzzing">Fuzzing</h2>
<p>The fuzzing itself is quite straightforward. Here’s what you do (tested on Ubuntu 18.04):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Fetch dependency
$ sudo apt install libboost-all-dev
// Fetch solidity
$ git clone https://github.com/ethereum/solidity.git
$ cd solidity && mkdir build
// Build, turning off SMT solver support
$ cd build && cmake -DUSE_Z3=OFF -DUSE_CVC4=OFF ..
$ make solfuzzer -j
// Populate afl-in with seeds
$ mkdir afl-in
$ find . -type f -name "*.sol" -exec cp {} -t afl-in \;
// Fuzz
$ afl-fuzz -m none -i afl-in -o afl-out -- solfuzzer
</code></pre></div></div>
<p>This:</p>
<ul>
<li>Installs boost libs required to compile solidity (and the fuzzer)</li>
<li>Fetches, and compiles the solidity fuzzer</li>
<li>Uses solidity contracts present in the source repo as fuzzing seeds</li>
<li>Runs afl-fuzz on the fuzzing binary</li>
</ul>
<p>The fuzzing itself is very slow (under 100 execs/s).
However, it already helped find a couple of type-related bugs, one of which was already known and the other new.</p>
<h2 id="results">Results</h2>
<h3 id="bug-1-unexpected-function-type-conversion">Bug 1: Unexpected function type conversion</h3>
<p>Here’s the <a href="https://github.com/ethereum/solidity/issues/5279">new bug</a> that fuzzing discovered</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./solc issue_5279.sol
Internal compiler error during compilation:
/home/bhargava/work/github/solidity/libsolidity/codegen/CompilerUtils.cpp(1020): Throw in function void dev::solidity::CompilerUtils::convertType(const dev::solidity::Type&, const dev::solidity::Type&, bool, bool, bool)
Dynamic exception type: boost::exception_detail::clone_impl<dev::solidity::InternalCompilerError>
std::exception::what: Invalid type conversion requested.
[dev::tag_comment*] = Invalid type conversion requested.
</code></pre></div></div>
<p>tl;dr</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">solc</code> is the solidity compiler binary</li>
<li><code class="language-plaintext highlighter-rouge">issue_5279.sol</code> is the solidity contract (found by fuzzing) that triggers the bug</li>
<li>The bug is an assertion failure that states the cause as <code class="language-plaintext highlighter-rouge">Invalid type conversion requested</code></li>
</ul>
<p>Here’s the full contract that triggers this bug</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>contract C {
function h() pure external {
}
function f() view external returns (bytes4) {
function () external g = this.h;
return g.selector;
}
}
// ----
</code></pre></div></div>
<p>As commented by one of the lead devs of Solidity (<a href="https://github.com/ethereum/solidity/issues/5279#issuecomment-432673495">Chris</a>), here’s the diff contract that does <strong>not</strong> trigger the bug</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>contract C {
function h() pure external {
}
function f() view external returns (bytes4) {
function () pure external g = this.h;
return g.selector;
}
}
</code></pre></div></div>
<p>So, what’s the invalid type conversion that the bug is talking about?</p>
<p>Some basics before we proceed.</p>
<p>What is a pure function?</p>
<blockquote>
<p>Functions can be declared pure in which case they promise not to read from or modify the state.</p>
</blockquote>
<p>What is a view function?</p>
<blockquote>
<p>Functions can be declared view in which case they promise not to modify the state.</p>
</blockquote>
<p>What is an external function?</p>
<blockquote>
<p>External functions are part of the contract interface, which means they can be called from other contracts and via transactions. An external function f cannot be called internally (i.e. f() does not work, but this.f() works). External functions are sometimes more efficient when they receive large arrays of data.</p>
</blockquote>
<p>What is a function selector?</p>
<blockquote>
<p>The first four bytes of the call data for a function call specifies the function to be called. It is the first (left, high-order in big-endian) four bytes of the Keccak (SHA-3) hash of the signature of the function. The signature is defined as the canonical expression of the basic prototype, i.e. the function name with the parenthesised list of parameter types. Parameter types are split by a single comma - no spaces are used.</p>
</blockquote>
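<p>To make the selector concrete, here is a minimal sketch that computes one in Python, assuming the pysha3 package (note that hashlib’s sha3_256 implements NIST SHA-3, whose padding differs from the Keccak-256 used by Ethereum):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sha3  # pysha3 package; provides Keccak-256, not NIST SHA-3

def selector(signature):
    # First four bytes of the Keccak-256 hash of the canonical signature
    return sha3.keccak_256(signature.encode()).hexdigest()[:8]

# The well-known ERC-20 transfer selector
print(selector("transfer(address,uint256)"))  # prints a9059cbb
</code></pre></div></div>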
<p>tl;dr</p>
<ul>
<li>pure means stateless</li>
<li>view means (stateful) read-only</li>
<li>external means just that</li>
<li>a function selector is the first four bytes of the hash of the function’s signature
<ul>
<li>imagine taking a SHA-3 hash of a c++ mangled function and using its first four bytes</li>
</ul>
</li>
</ul>
<p>From these facts, here’s my understanding of the bug.
First, note that the difference between buggy and non-buggy contracts is the following line of buggy code</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function () external g = this.h;
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">this.h</code> is an external <code class="language-plaintext highlighter-rouge">pure</code> (aka stateless) function</li>
<li><code class="language-plaintext highlighter-rouge">g</code> on the other hand is simply an external function</li>
</ul>
<p>Evidently, there is (implicit) type conversion happening here.
If one looks into the faulting code, here’s what one would find:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void CompilerUtils::convertType(
Type const& _typeOnStack,
Type const& _targetType,
bool _cleanupNeeded,
bool _chopSignBits,
bool _asPartOfArgumentDecoding)
{
...
switch(stackType)
...
default:
...
solAssert(_typeOnStack == _targetType, "Invalid type conversion requested.");
...
}
</code></pre></div></div>
<p>The next thing I did was to fire up a gdb instance and debug.
Here’s what I found on line 1020 (the failing assertion)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p _typeOnStack.richIdentifier()
$1 = "t_function_external_pure()returns()"
(gdb) p _targetType.richIdentifier()
$2 = "t_function_external_nonpayable()returns()"
</code></pre></div></div>
<p>The buggy contract has led the compiler to make an invalid type conversion.
But I thought Solidity was a statically typed language in which such errors are picked up at compile time?
Evidently, there is some dynamic typing going on with implicit function casts which led to this bug.</p>
<h3 id="bug-2-variable-declaration-type-error">Bug 2: Variable declaration type error</h3>
<p>This was a <a href="https://github.com/ethereum/solidity/issues/5048">known bug</a> but the fuzzer kinda <a href="https://github.com/ethereum/solidity/issues/5340">rediscovered</a> it in a different context imo.
Here’s the buggy solidity contract that triggers a (dynamic) type error.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library L{struct Nested{n y;}function(function(Nested)external){}}
</code></pre></div></div>
<p>Here’s the error it throws up:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Internal compiler error during compilation:
/home/bhargava/work/github/solidity/libsolidity/ast/Types.cpp(2127): Throw in function virtual bool dev::solidity::StructType::canBeUsedExternally(bool) const
Dynamic exception type: boost::exception_detail::clone_impl<dev::solidity::InternalCompilerError>
std::exception::what:
[dev::tag_comment*] =
</code></pre></div></div>
<p>Let’s fire up gdb and find out what the failing assertion in <code class="language-plaintext highlighter-rouge">Types.cpp</code> on line <code class="language-plaintext highlighter-rouge">2127</code> is all about.</p>
<p>Here’s the buggy code in question
<script src="https://gist.github.com/f9d7c7104c79954fc2d38d8c050620b0.js"> </script></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p var->annotation().type.get()
$3 = (std::__shared_ptr<dev::solidity::Type const, (__gnu_cxx::_Lock_policy)2>::element_type *) 0x0
(gdb) bt
#0 dev::solidity::StructType::canBeUsedExternally (this=0x558db174d750, _inLibrary=false) at /home/bhargava/work/github/solidity/libsolidity/ast/Types.cpp:2127
#1 0x0000558db0774719 in dev::solidity::ReferencesResolver::endVisit (this=0x7ffd332ee5f0, _typeName=...) at /home/bhargava/work/github/solidity/libsolidity/analysis/ReferencesResolver.cpp:210
#2 0x0000558db07ca836 in dev::solidity::FunctionTypeName::accept (this=0x558db1746b60, _visitor=...) at /home/bhargava/work/github/solidity/libsolidity/ast/AST_accept.h:339
</code></pre></div></div>
<p>Evidently, as the Solidity contract’s AST is being built up, and while a function declaration is being visited and its parameters resolved, the compiler complains that a member of the referenced struct is not typed.</p>
<p>I expected the compiler to throw up an error that the type of member <code class="language-plaintext highlighter-rouge">y</code> of struct <code class="language-plaintext highlighter-rouge">Nested</code> is undefined.
Seemingly, this is not happening.
However, if I modify the buggy contract like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library L{struct Nested{n y;}function(function()external){}}
</code></pre></div></div>
<p>The compiler correctly throws up an error that the user-defined type <code class="language-plaintext highlighter-rouge">n</code> is undefined.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ solc mod_contract.sol
Warning: This is a pre-release compiler version, please do not use it in production.
../../bugs/issue_5340_min.sol:1:25: Error: Identifier not found or not unique.
library L{struct Nested{n y;}function(function()external){}}
</code></pre></div></div>
<p>I have a feeling that there is some lazy type resolution going on that results in a run-time error for what should be a compile-time error.</p>
<h2 id="next-steps">Next Steps</h2>
<p>It’s very cool that the Solidity compiler team is using fuzzing as part of their SDLC to catch bugs like this.
So far, most of the bugs found point to deficiencies in typing rules for Solidity.
Although this is a good first step, it won’t find bugs in the more critical compiler back-end component that is responsible for generating EVM code.
A bug in the back-end that generates incorrect EVM code is a lot more interesting from a security perspective.</p>
<p>The main drawback of the current test harness is speed.
This could be addressed by targeted fuzz testing of specific portions of the compiler rather than the entire compiler in one test.
This is akin to fuzzing unit tests.</p>
<p>Finally, Kostya’s call for structure-aware fuzzing mutators is something that should be heeded in the Solidity space as well.
There has been some work on this front in the <a href="https://github.com/ethereum/solidity/issues/1172">Solidity community</a>.
It’d be cool to use this infra to fuzz Solidity.</p>
<p>In summary</p>
<ul>
<li>fuzz specific security-critical components</li>
<li>break fuzz tests down to smaller units</li>
<li>use custom fuzz mutators</li>
</ul>
<p>That’s all folks!</p>IntroCan Good-Turing Frequency Estimation Tell Us When to Stop Fuzzing?2018-10-08T00:00:00+00:002018-10-08T00:00:00+00:00/2018/10/08/good-turing-fuzzing<script type="text/javascript" src="https://cdn.rawgit.com/mathjax/MathJax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<p><strong>tl;dr: Depends, but I’m sceptical atm :-)</strong></p>
<p>In this post, I will try to examine the utility of the <a href="https://en.wikipedia.org/wiki/Good–Turing_frequency_estimation">Good-Turing frequency estimation</a> for fuzz testing.
I focus on the following question that is of practical importance for practitioners: When to stop fuzz testing?</p>
<h2 id="intro">Intro</h2>
<p>This <a href="https://arxiv.org/pdf/1807.10255.pdf">paper</a> talks highly of the utility of the Good-Turing frequency estimation for fuzz testing.
It makes some very cool arguments why it makes sense to apply GT to fuzzing, I enjoyed reading it!
Here’s the setting examined by that paper.
Fuzz testing involves decision making in the face of uncertainty.
For example, practitioners would often like to know when to stop fuzzing, because who knows? A new crash may be found if only the fuzzer were left running for an additional hour/day/week etc.</p>
<p>In theoretical terms, what we would like to know at regular fuzzing intervals is the following: What is the probability of finding something new, should fuzzing continue?
Surprisingly, this is exactly what I.J. Good tried to understand (in a different setting of course) in the early 50s.</p>
<p>Of course, your definition of a non-trivial probability is likely different from mine.
The idea is to define a parameter, say \(\alpha{}\), and stop fuzzing when the probability of finding something new is less than the parameter \(\alpha{}\).
I admit this is a very specific (and likely limited) way to apply the GT estimate to fuzzing, so take the following arguments with spoonfuls of salt.</p>
<h2 id="prelims">Prelims</h2>
<p>We need to set up our theoretical model of fuzzing that is suited to the Good-Turing formula.
So, let’s begin with the following assumptions:</p>
<ul>
<li>A species is defined as some discretized program behavior
<ul>
<li>We need some way to characterize distinct species</li>
</ul>
</li>
<li>A test input can belong to one and only one species
<ul>
<li>Of course, multiple test inputs can belong to the same species, but the other way round is not possible</li>
</ul>
</li>
</ul>
<h3 id="discretizing-program-behavior">Discretizing program behavior</h3>
<p>afl-fuzz computes the hash of the coverage bit map to discretize program behavior.
Each byte in the coverage bitmap corresponds to some branch executed in the program.
So it discretizes program behavior like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// trace_bits is the state of the coverage bitmap
// after an input is executed
exec_cksum = hash32(trace_bits, MAP_SIZE, HASH_CONST);
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">hash32</code> is a 32-bit hash of its input (<code class="language-plaintext highlighter-rouge">trace_bits</code> of length <code class="language-plaintext highlighter-rouge">MAP_SIZE</code>; salt is some constant <code class="language-plaintext highlighter-rouge">HASH_CONST</code>).</p>
<p>First things first.
<code class="language-plaintext highlighter-rouge">exec_cksum</code> is imprecise: program behavior is more complex than what <code class="language-plaintext highlighter-rouge">exec_cksum</code> portrays it to be.
For example, two inputs can have the same <code class="language-plaintext highlighter-rouge">exec_cksum</code> but trigger two different execution paths \(p_{1}\) and \(p_{2}\).
But, <code class="language-plaintext highlighter-rouge">exec_cksum</code> is efficient to compute and takes modest memory.
Therefore, it is an <strong>acceptable</strong> trade-off between precision of program behavior discretization and performance.</p>
<p>A minor digression to understand how libFuzzer discretizes program behavior.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// kNumPCs is roughly 2.1 million
uintptr_t __sancov_trace_pc_pcs[fuzzer::TracePC::kNumPCs];
uint8_t __sancov_trace_pc_guard_8bit_counters[fuzzer::TracePC::kNumPCs];
</code></pre></div></div>
<p>There are two arrays</p>
<ul>
<li>An array of program counters (branch call sites) seen during fuzzing</li>
<li>An array of counters for these program counters
<ul>
<li>This is used to count how often a branch is hit</li>
</ul>
</li>
</ul>
<p>In addition, there is something that libFuzzer creates called a feature.
My understanding is that a feature maps to an index of <code class="language-plaintext highlighter-rouge">__sancov_trace_pc_pcs.</code>
So, each branch in the fuzzed program is a feature.
Sadly, unlike afl-fuzz, libFuzzer does not keep track of a checksum of features for a fuzzed input; something akin to afl-fuzz’s <code class="language-plaintext highlighter-rouge">exec_cksum.</code>
This means that one would need to add (hashing) code to do this in libFuzzer.</p>
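<p>Conceptually, the missing piece is tiny. Here is a sketch in Python of the kind of checksum one could compute over the counter array (afl-fuzz uses its own hash32; CRC32 here is purely for illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import zlib

def exec_cksum(counters):
    # Collapse a coverage counter array (bytes) into one 32-bit checksum,
    # akin to afl-fuzz's hash32(trace_bits, MAP_SIZE, HASH_CONST)
    return zlib.crc32(bytes(counters)) & 0xffffffff

# Two inputs with identical counter arrays map to the same species
print(exec_cksum([0, 1, 0, 3]))
</code></pre></div></div>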
<h3 id="lifting-good-turing-for-fuzzing">Lifting Good-Turing for Fuzzing</h3>
<p>Suppose we have the set of all possible program behaviors (<code class="language-plaintext highlighter-rouge">exec_cksum</code>)</p>
\[P = \{p_{1},p_{2},...,p_{M}\}\]
<p>where \(p_{k}\) is a program path and \(M\) is the total number of feasible program behaviors.</p>
<p>We also have a sequence E of N program behaviors corresponding to as many independently chosen inputs in the fuzzing corpus.</p>
\[E = \{e_{1},e_{2},...,e_{N}\}, e_{k} \in{} P\]
<p>We want to estimate \(\theta{}[j]\), the probability that a future sample will be \(p_{j}\).
Now, we define the set of frequencies of program behaviours observed thus far.</p>
\[F = \{f_{1},f_{2},...,f_{M}\}\]
<p>where \(f_{k}\) is the number of times behavior \(p_{k}\) has been observed. Assuming, without loss of generality, that the \(n\) distinct observed behaviors are indexed first, the frequency of the unobserved behaviors is zero.</p>
\[f_{i} = 0, n+1 \leq i \leq M\]
<p>The relative frequency estimate for \(p_{j}\) is \(f_{j}/N\).
This estimate is inaccurate for small counts.
For example, if \(f_{j}=0\), our estimate is essentially saying “you can’t expect to see what you have not seen” which can be grossly inaccurate.</p>
<p>Before we proceed, we make the following assumption.</p>
\[f_{j} = f_{k} \implies{} \theta{}[j] = \theta{}[k]\]
<p>In other words, if two program behaviors appear with the same frequency in our present fuzzing corpus, then the probability of their future occurrence is the same.
We can weaken this assumption later, but let’s stick to this simple case in this post.</p>
<p>With this assumption, we introduce more notation.
Let \(\theta{}(r)\) be the probability of a behavior occurring given that it appeared \(r\) times in \(E\).</p>
\[g_{r} = |\{p_{j} : f_{j} = r\}|\]
\[G = \{g_{0},g_{1},...,g_{R_{max}}\}\]
<p>where \(R_{max} = \max(F)\).</p>
<p>In other words, while the set \(F\) computes the frequency of observed program behaviors, the set \(G\) computes the frequency of frequencies of observed behaviors.
Moreover, \(R_{max}\) is the highest frequency of observed program behaviors.
It follows that</p>
\[N = \sum_{r} rg_{r}\]
<p>where N (as we had denoted for the sequence E) is the total number of observations.
N, as it turns out, is also the amount of fuzz i.e., the total number of test inputs generated by fuzzing thus far.</p>
<p>Against this backdrop, we introduce the Good-Turing estimate \(\hat{\theta{}}(r)\) for \(\theta{}(r)\).</p>
\[\hat{\theta{}}(r) = (1/N)*(r+1)*(g_{r+1}/g_{r})\]
<p>This estimate tells us, for instance, that the probability of observing as yet unseen behaviors in the future (\(g_{0}\)) is:</p>
\[\hat{\theta{}}(0) = (1/N)*(g_{1}/g_{0})\]
<p>That is to say, this probability is greater than \((1/N)\) for positive \(g_{1}\) when \(g_{1} \gt{} g_{0}\).
When N=1 (after one program behavior has been observed), this probability is \(1/(M-1)\) which can be grossly inaccurate.
But the hope is, as N grows, this estimate converges on the actual probability.</p>
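<p>Before applying the estimate, here is a minimal Python sketch of the bookkeeping above. The toy corpus is made up; in practice the observations would be, say, afl-fuzz’s <code class="language-plaintext highlighter-rouge">exec_cksum</code> values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
from collections import Counter

def good_turing(observations, M):
    # Return theta(r): the Good-Turing probability that a future sample
    # belongs to a *given* species observed r times so far
    N = len(observations)
    f = Counter(observations)        # f_j: frequency of each observed behavior
    g = Counter(f.values())          # g_r: frequency of frequencies
    g[0] = M - len(f)                # unobserved behaviors have frequency 0
    return lambda r: (r + 1) * g.get(r + 1, 0) / (N * g[r])

# Toy corpus: 10000 fuzzed inputs mapped onto a small behavior space
random.seed(0)
checksums = [random.randrange(5000) for _ in range(10000)]
theta = good_turing(checksums, M=2**32)
print(theta(0))  # vanishingly small: dominated by M, as argued below
</code></pre></div></div>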
<h2 id="applying-good-turing-estimate-to-fuzzing">Applying Good-Turing Estimate to Fuzzing</h2>
<p>One way in which the Good-Turing estimate is useful is in deciding when to stop fuzz testing.
We stop fuzzing when \(\hat{\theta{}}(0)\) is lower than some pre-defined threshold \(\alpha{}\).
Even before I go ahead and implement this estimate inside, say afl-fuzz, I see three potential problems:</p>
<ul>
<li>Q1: What is a good value of \(\alpha{}\)?
<ul>
<li>It is likely different for different targets</li>
</ul>
</li>
<li>Q2: How to deal with noise in \(\hat{\theta{}}(0)\)?
<ul>
<li>Note that \(g_{1}\) may fluctuate to varying extents which in turn influences the value of \(\hat{\theta{}}(0)\)</li>
<li>For example, at some point \(t=t_{k}\) the estimate may go below \(\alpha{}\) only to increase in value thereafter</li>
</ul>
</li>
<li>Q3: How to compute \(g_{0}\)?
<ul>
<li>\(g_{0}\) depends on \(M\), the total number of feasible program behaviors that we can only estimate</li>
<li>If a 32-bit <code class="language-plaintext highlighter-rouge">exec_cksum</code> is used to discretize program behavior (as in afl-fuzz), \(M \approx{} 4.3\) billion.</li>
</ul>
</li>
</ul>
<p>At least, I am sceptical that the Good-Turing estimate can be mechanically relied upon to stop fuzzing.
A lot depends on the answers to the three questions above, and likely more.
Take the issue of computing \(g_{0}\) for instance.
If a program contains even 32 independent branches, it can have on the order of \(2^{32}\) (roughly 4.3 billion) paths.
Therefore, a 32-bit <code class="language-plaintext highlighter-rouge">exec_cksum</code> falls short of uniquely identifying program paths.</p>
<p>Even if we were to assume that <code class="language-plaintext highlighter-rouge">exec_cksum</code> is a fair performance-accuracy trade-off, \(M\) is going to dominate the computation of \(\hat{\theta{}}(0)\).
My intuition is that \(g_{0}\) (the number of unobserved program paths: \(=M - k\) where \(k\) is the total number of paths discovered thus far) is always going to be very close to \(M\).
In my experience, the total paths found by afl-fuzz is of the order of a few thousand for real-world targets and \(M\) is at least 4.3 billion.
Therefore, we can approximate the estimate to be like so</p>
\[\hat{\theta{}}(0) = (1/N)*(g_{1}/M) = g_{1}/(N*M)\]
<p>Since \(N\) is the amount of fuzz (how many inputs have been generated by fuzzing), it increases monotonically.
Thus, the denominator of the above equation is always increasing.
\(g_{1}\) (number of program behaviors observed exactly once thus far) is likely going to go down as we continue fuzzing.
This is going to give us insanely low probabilities to begin with.
Say we start computing the estimate at some point \(t1\) until when 2000 singleton (seen exactly once) behaviors have been observed and 10000 inputs generated by the fuzzer. We have:</p>
\[\hat{\theta{}_{t1}}(0) = 2000/(4300000000*10000) = 4.65e-11\]
<p>And let’s say, at a subsequent time instance \(t2\), we have 1000 singletons and 20000 inputs generated:</p>
\[\hat{\theta{}_{t2}}(0) = 1000/(4300000000*20000) = 1.16e-11\]
<p>Although these probabilities are relatively very different (e.g., it is four times less likely to find something new at \(t2\) than at \(t1\)), they are too small to be practically useful.
At least, these are my first impressions about the utility of GT estimate for one aspect of fuzzing.
Hit me up on Twitter (<a href="https://www.twitter.com/ibags">@ibags</a>) if you think my argument is flawed or I’m talking BS; I’m curious to hear from other security practitioners what they think.</p>
<p>Anyway, that’s all for now folks.
I’ll post a follow-up when I have some empirical evidence from real-world targets.
Watch this space!</p>
<h4 id="updates">Updates</h4>
<p>2018-12-10:</p>
<p>Another way to think of the extremely low estimates for discovering new paths is to say</p>
\[N_{z} = 1/\hat{\theta{}}(0)\]
<p>where \(N_{z}\) is the expected number of additional fuzz required to uncover a new path.</p>
<p>So, what a \(\hat{\theta{}}(0) = 1.16e-11\) is saying is that you need to run the fuzzer for an additional \(N_{z} \approx{} 86.2\) billion executions until you find a new path.
Assuming that the average execution speed of the fuzzer is \(1000\) executions per second, this translates to keeping the fuzzer running for close to 3 years on a single core!
This is grossly inaccurate and of little practical utility.
Evidently, we need estimates that are tailored for exponential spaces, which I feel Good-Turing is not.</p>
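<p>Spelled out as arithmetic, as a quick sanity check of the numbers above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>theta0 = 1.16e-11
additional_execs = 1 / theta0              # ~8.6e10 executions
execs_per_sec = 1000                       # assumed fuzzer throughput
years = additional_execs / execs_per_sec / (3600 * 24 * 365)
print(additional_execs, years)             # ~86.2 billion, ~2.7 years
</code></pre></div></div>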
<p>2018-3-11:</p>
<p>Thanks to Marcel Böhme for pointing out errors in the first version of the post</p>Statistical Evaluation of a Fuzzing Dictionary2018-10-01T00:00:00+00:002018-10-01T00:00:00+00:00/2018/10/01/Evaluating-Dictionary-For-Fuzzing<h2 id="intro">Intro</h2>
<p>Fuzz testing involves several configuration parameters: seeds, dictionary, fuzz scheduling (what to fuzz), fuzz duration (how long to fuzz something), fuzz mutation (how to fuzz), fuzz sites (what portions of input to fuzz) etc.
This post attempts to statistically evaluate the effect of one fuzzing parameter: dictionary.
The purpose of this post is to understand if the use of a dictionary for a very specific fuzzing target (a parser) leads to significantly better outcomes, statistically speaking.
The fuzzing target that this post focuses on is not really relevant, so I won’t name it.
It suffices to say that this target is a run-of-the-mill parser that parses string input.</p>
<p>We have made the argument before that the use of <a href="https://link.springer.com/chapter/10.1007/978-3-319-66332-6_2">dictionaries makes security testing of network parsers more effective</a>.
However, a recent paper called <a href="https://arxiv.org/pdf/1808.09700.pdf">“Evaluating Fuzz Testing”</a> has good recommendations for basing such judgements on basic statistical tests rather than, say, a visual inspection of the measured distributions.
There are two statistical tests that are recommended in the fuzzing evaluation paper.
One is a significance test and the other an effect-size test.</p>
<h3 id="significance-test">Significance test</h3>
<p>Firstly, it is recommended that researchers perform a significance test (e.g., Mann-Whitney U test) in order to decide if their fuzzing optimization brings about statistically significant change in some performance metric.
For people unfamiliar with even basic statistics, like me, the Mann-Whitney U test is used to—quoting the <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">wiki page on the topic</a>—“determine whether two independent samples were selected from populations having the same distribution.”</p>
<p>My understanding of this test applied to fuzzing evaluations is as follows.
Consider you propose a cool tweak to afl-fuzz that you believe will bring about an improvement in fuzz testing.
For simplicity, let’s assume that the only metric you are interested in improving is “fuzzing coverage” per unit time: Lines of code that are hit by the fuzzer in some unit time (say 1 minute).
So, you want to check if your tweak actually performs better than the baseline on this metric.</p>
<p>In order to convince a scientific audience that your tweak indeed brings about a positive improvement, you need to do the following before proceeding further:</p>
<ul>
<li>Run the baseline fuzzer (that does not contain your tweak) “N” times (greater the value of N, the better), measuring and noting the value of the metric of interest (coverage/unit time) in each run
<ul>
<li>You will end up with an array of measurements like so: B = [b_1, b_2,…, b_N]</li>
</ul>
</li>
<li>Run the tweaked fuzzer “N” times, and as before, measuring and noting the value of the metric of interest (coverage/unit time) in each run
<ul>
<li>You will end up with an array of measurements like so: T = [t_1, t_2,…,t_N]</li>
</ul>
</li>
<li>Compute the Mann Whitney U test p-value for the arrays <code class="language-plaintext highlighter-rouge">B</code> and <code class="language-plaintext highlighter-rouge">T</code>
<ul>
<li>This can tell you if the performance numbers for the tweak show statistically significant divergence from the performance numbers for the baseline</li>
</ul>
</li>
</ul>
<p>Now, you have two “populations” (arrays, <code class="language-plaintext highlighter-rouge">B</code> and <code class="language-plaintext highlighter-rouge">T</code>) of independent samples (independent because each run is independent of the other) of coverage numbers.
We do not know the distribution of either population; actually this is not important to us.
What we are interested in is checking whether the distributions differ.
Specifically, we assume that it is equally likely that a randomly selected value from one population is less than or greater than a randomly selected value from the other population; this is called the null hypothesis.
We are interested in proving or disproving the null hypothesis.
Getting back to the topic of fuzzing evaluations, we are interested in <strong>disproving</strong> the null hypothesis that the performance measurements for the baseline and tweak have the same distribution, because if they do, the tweak did not do anything particularly interesting.</p>
<p>The <a href="https://en.wikipedia.org/wiki/P-value">p-value</a> computation is a standard way of quantitatively checking the validity of the null hypothesis.
A p-value is essentially the probability of observing a difference at least this large if the null hypothesis were true; the lower the p-value, the greater the assurance that we have correctly concluded that our tweak is indeed different from the baseline.
Traditionally, p-values of under <code class="language-plaintext highlighter-rouge">0.05</code> are considered good enough to show a statistically significant difference between two populations.
The value of <code class="language-plaintext highlighter-rouge">0.05</code> is called the level of significance: One can choose a lower level of significance (say <code class="language-plaintext highlighter-rouge">0.001</code>) if one wants to be damn sure about the difference in populations.</p>
<p>Fortunately, there is a ready-made python function called <code class="language-plaintext highlighter-rouge">mannwhitneyu</code> in the <code class="language-plaintext highlighter-rouge">scipy.stats</code> module that outputs the p-value for two lists of numbers.
So, all you need to do is write a simple python script like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from scipy.stats import mannwhitneyu
# Read in baseline performance scores into array
B = [b_1,...,b_N]
# Read in performance scores for tweak into another array
T = [t_1,...,t_N]
print(mannwhitneyu(B,T))
</code></pre></div></div>
<p>Then you see output like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MannwhitneyuResult(statistic=682.5, pvalue=2.582424268793943e-26)
</code></pre></div></div>
<p>This tells you that the p-value is <code class="language-plaintext highlighter-rouge">2.58e-26</code> or <code class="language-plaintext highlighter-rouge">2.58*10^-26</code>.
This number is a lot smaller than <code class="language-plaintext highlighter-rouge">0.05</code> so we conclude that the performance numbers corresponding to the tweak are indeed (statistically significantly) different than performance numbers corresponding to the baseline.</p>
<p>Although p-values of under <code class="language-plaintext highlighter-rouge">0.05</code> show that the compared populations are significantly different, it does not tell us what the quantum of this difference is.
In an extreme case, the tweak may result in a minuscule improvement (e.g., it covers 2 more lines of code than the baseline) with a very low p-value (e.g., <code class="language-plaintext highlighter-rouge">2.58e-26</code>).
So although you convince people that your tweak brings about a certain improvement, the quantum of this improvement is too little to be considered scientifically interesting.</p>
<p>In other words, low p-values are necessary but not sufficient for our evaluation.
p-values say nothing about the extent of divergence, also known as the effect size.
This brings me to the second test recommended in the fuzzing evaluation paper.</p>
<h3 id="vargha-delaneys-a-measure">Vargha Delaney’s A measure</h3>
<p>The VDA measure can be used to gauge the extent of divergence between two populations.
Essentially, the VDA measure outputs the probability <code class="language-plaintext highlighter-rouge">p</code> that a randomly drawn sample from one population is greater than a randomly drawn sample from the other, computed from pair-wise ordinal relationships (<code class="language-plaintext highlighter-rouge"><</code> or <code class="language-plaintext highlighter-rouge">=</code>) between samples in the two populations.
A probability <code class="language-plaintext highlighter-rouge">p</code> equal to <code class="language-plaintext highlighter-rouge">0.5</code> (half) indicates that the two populations are statistically indistinguishable (no change).
The following values of <code class="language-plaintext highlighter-rouge">p</code> are conventionally accepted as indicating change:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">p>0.56</code> Small change</li>
<li><code class="language-plaintext highlighter-rouge">p>0.64</code> Medium change</li>
<li><code class="language-plaintext highlighter-rouge">p>0.71</code> Big change</li>
</ul>
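<p>For intuition, here is a minimal self-contained sketch of the pair-wise counting behind the A measure (for the actual evaluation below I used Tim Menzies’ implementation; the sample numbers here are made up to mimic the coverage figures later in this post):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def a12(xs, ys):
    # Probability that a random sample from xs is greater than a
    # random sample from ys; ties count as half
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

print(a12([1601, 1719, 1502], [1488, 1427, 1591]))  # ~0.89: a big effect
</code></pre></div></div>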
<p>Essentially, if at least 71% of pair-wise comparisons (21 percentage points above the 50% expected by chance) show a greater value for one population, that population is considered to diverge in a <strong>big</strong> way from the other.
Tim Menzies has <a href="https://gist.github.com/timm/5630491">published python code to compute VDA measure</a>, thanks Tim.
So, all you need to do to compute the VDA measure is the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Fetch module from Tim Menzies' gist linked above
from a12 import *
## Create a labeled array
B_norm = ["baseline"]
## Append B values from baseline measurements
B_norm.extend(B)
## Likewise for tweak measurements
T_norm = ["tweak"]
T_norm.extend(T)
## Create consolidated list
C = [B_norm, T_norm]
for rx in a12s(C,rev=True,enough=0.71): print(rx)
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">enough</code> parameter is essentially the effect-size threshold of your choice. For the listing above, I have used the conventional big threshold i.e., <code class="language-plaintext highlighter-rouge">p>0.71</code>.
The python code above should output something like so</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank #1 tweak at <T_cov>
rank #2 baseline at <B_cov>
</code></pre></div></div>
<p>where populations are sorted in descending order (i.e., highest coverage on top) and <code class="language-plaintext highlighter-rouge">T_cov</code> and <code class="language-plaintext highlighter-rouge">B_cov</code> are means of the tweak and baseline populations.
We interpret this result as follows: There exists a big change between tweak and baseline because a lot of samples from the tweaked population show better performance (say, coverage numbers) compared to the baseline samples.
In summary, if the p-value for the measurement values corresponding to your tweak is <code class="language-plaintext highlighter-rouge"><0.05</code> and the comparison shows a big VDA measure, then your tweak is indeed pretty cool!
Next, I describe in what context I applied this knowledge.</p>
<h2 id="context">Context</h2>
<p>I was going to submit a PR to oss-fuzz to integrate a new fuzzing target.
Such a PR typically contains configuration for the fuzzing engines that Google uses (afl-fuzz and libFuzzer) apart from the test case itself.
One such configuration parameter is a dictionary file that contains line-separated tokens of interest that are enclosed within double quotes (see my <a href="https://bshastry.github.io/2017/08/03/Inferring-Program-Input-Format.html">post on inferring program input format</a> for more details about this).
Naturally, I was interested in knowing if the dictionary that I was including in the PR is actually useful.</p>
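<p>For reference, here is what such a dictionary file looks like; the format below is the one shared by afl-fuzz and libFuzzer, but the tokens themselves are made up since the target is unnamed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># one (optionally named) double-quoted token per line
kw1="GET"
kw2="Content-Length:"
crlf="\x0d\x0a"
</code></pre></div></div>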
<p>Before I set about evaluating the usefulness of a dictionary for this specific target, I built a few simple dictionaries using tools that I had developed: mostly a clang front-end tool called clang-sdict, which performs a front-end pass over source code, collecting constant string tokens used in potentially data-dependent control flow.
You can find a primitive implementation of clang-sdict <a href="https://github.com/test-pipeline/clang-ginfer/blob/master/ClangStringDict.cpp">here</a>.</p>
<p>Before finalizing on a dictionary, I wanted to experiment with a few variations and see how they fare.
The nice thing about clang-sdict is that it permits several customizations: Prominently, one can tune it to focus on specific coding patterns.
For example, one can add specific parsing functions (by name) and the tool extracts tokens accepted by that function.
I went ahead and created three different dictionaries each with a slightly different set of string tokens.
Let’s call these dictionaries “dict A”, “dict B”, and “dict C.”
When the fuzzer is supplied such a dictionary, it chooses one string at random and uses it in a fuzzing mutation: say, it overwrites a byte sequence with this string.</p>
<h2 id="evaluation">Evaluation</h2>
<p>Now that I had these three dictionaries, I set about evaluating their “effectiveness” and “size of effect” using Mann-Whitney U Test and Vargha Delaney’s A measure.
To recap, these tests answer the following two questions (in that order): (1) Does using a dictionary bring about noticeable gains in the outcome of fuzzing? and (2) How much of an effect do dictionaries have on the said outcome?</p>
<p>Of course, we need to fix metrics before we use these statistical tests.
The metric I chose for this post is code coverage achieved by a fuzzing session: libFuzzer (one of the fuzzing engines behind oss-fuzz) prints <a href="https://clang.llvm.org/docs/SanitizerCoverage.html#id2">the number of CFG edges covered during fuzzing</a>.
More edges covered is better than fewer (more is better).</p>
<p>Before I present evaluation methodology and results, some meta data about the dictionary candidates.</p>
<table class="table table-striped">
<thead>
<tr>
<th>Dict</th>
<th style="text-align: right">Num. tokens</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td style="text-align: right">0</td>
</tr>
<tr>
<td>Dict A</td>
<td style="text-align: right">120</td>
</tr>
<tr>
<td>Dict B</td>
<td style="text-align: right">222</td>
</tr>
<tr>
<td>Dict C</td>
<td style="text-align: right">388</td>
</tr>
</tbody>
</table>
<p>Dict A has the fewest tokens, followed by Dict B, and Dict C.</p>
<h3 id="evaluation-methodology">Evaluation Methodology</h3>
<p>The methodology centers around the following broad set of requirements with design choices shown in braces.</p>
<ul>
<li>Each variant should be run several times (<strong>100 runs chosen</strong>)</li>
<li>Each variant should be run for the same fixed duration (<strong>5 minutes chosen</strong>)</li>
<li>Reasonable metric for comparison must be used (<strong>Program edge coverage chosen</strong>)</li>
</ul>
<p>Therefore our experiment must do the following:</p>
<ul>
<li>Run the baseline (no dictionary), Dict A (exp 1), Dict B (exp 2), Dict (exp 3) a total of 100 times each with 5 minutes per fuzzing session</li>
<li>Log the total coverage achieved in this fuzzing session</li>
</ul>
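<p>A sketch of such a driver in Python, assuming a libFuzzer binary (the binary and dictionary file names are placeholders; <code class="language-plaintext highlighter-rouge">-dict</code> and <code class="language-plaintext highlighter-rouge">-max_total_time</code> are real libFuzzer flags, and the final coverage figure is scraped from its log):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re
import subprocess

# Run each configuration 100 times for 5 minutes (300s) each
CONFIGS = {'baseline': [],
           'exp1': ['-dict=dictA.dict'],
           'exp2': ['-dict=dictB.dict'],
           'exp3': ['-dict=dictC.dict']}

results = {name: [] for name in CONFIGS}
for name, extra in CONFIGS.items():
    for run in range(100):
        proc = subprocess.run(['./fuzzer', '-max_total_time=300'] + extra,
                              capture_output=True, text=True)
        # libFuzzer reports edge coverage as "cov: N" on stderr
        cov = max((int(m) for m in re.findall(r'cov: (\d+)', proc.stderr)),
                  default=0)
        results[name].append(cov)
</code></pre></div></div>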
<p>Once we do this, we end up with a 2D array like so (numbers are hypothetical):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>baseline = [b_1,b_2,b_3,...,b_100]
exp1 = [e1_1,e1_2,e1_3,...,e1_100]
exp2 = [e2_1,e2_2,e2_3,...,e2_100]
exp3 = [e3_1,e3_2,e3_3,...,e3_100]
</code></pre></div></div>
<p>Okay, so let’s make a box-plot of them and see what they look like: Remember more edges covered, the better is the fuzzing outcome.</p>
<p><img src="/assets/img/Coverage_box_plots.png" alt="Fig. 1: Box plots showing the number of PCs covered across 100 independent runs each for baseline, and Dict A/B/C" class="img-responsive" /></p>
<p>Y-axis is the number of CFG edges covered; X-axis is the fuzzing configuration whose coverage distribution is presented as a box plot.
Okay, it (visually) appears that “Dict A” is best of all in terms of median value (the orange line that strikes through the boxes is the median of that sample set) and quartile distribution.
Some more basic statistics for the test coverage populations follow.</p>
<table class="table table-striped">
<thead>
<tr>
<th>Name</th>
<th>Mean</th>
<th>Variance</th>
<th>Min</th>
<th style="text-align: right">Max</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>1488.3</td>
<td>1918.5</td>
<td>1427</td>
<td style="text-align: right">1591</td>
</tr>
<tr>
<td>Dict A</td>
<td>1601.2</td>
<td>3157.2</td>
<td>1502</td>
<td style="text-align: right">1719</td>
</tr>
<tr>
<td>Dict B</td>
<td>1579.2</td>
<td>2775.4</td>
<td>1497</td>
<td style="text-align: right">1693</td>
</tr>
<tr>
<td>Dict C</td>
<td>1572.4</td>
<td>2374.7</td>
<td>1500</td>
<td style="text-align: right">1675</td>
</tr>
</tbody>
</table>
<p>Although it appears that Dict A has the highest mean (and is hence the best), its high variance is one ground for being suspicious about the claim that “it is the best.”
This is precisely where significance tests enter the picture.</p>
<h3 id="mann-whitney-u-test">Mann Whitney U Test</h3>
<p>We can check the “soundness” of the hypothesis “Dict A is different” by performing a Mann-Whitney U test on our data set.
Here’s a gist of my evaluation python script: Nothing fancy, reading coverage numbers from a log file and using the <code class="language-plaintext highlighter-rouge">mannwhitneyu</code> function from the <code class="language-plaintext highlighter-rouge">scipy.stats</code> python module on the sets of acquired coverage numbers.</p>
<script src="https://gist.github.com/df0f07dc0d3f5cac48e9dc9affe20d0f.js"> </script>
<p>The p-values between different sets of evaluations are shown in the table below.
The table is to be read as (p-value between row label vs. column label); 1e-2 is to be read as 1x10^-2 or 0.01.
Since the Mann Whitney p-values for the tuples (A,B) and (B,A) (where A,B are two non-identical sets of numbers) are the same, and the p-value of (A,A) does not make any sense, these fields in the table have been denoted as <code class="language-plaintext highlighter-rouge">N.A.</code>, short for not applicable.
A p-value of under <code class="language-plaintext highlighter-rouge">0.05</code> (i.e., <code class="language-plaintext highlighter-rouge">< 5e-2</code>) means that there is a significant difference between the distributions of the two sets of numbers.</p>
<table class="table table-striped">
<thead>
<tr>
<th>Name vs.</th>
<th>Baseline</th>
<th>Dict A</th>
<th>Dict B</th>
<th style="text-align: right">Dict C</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>N.A.</td>
<td>N.A.</td>
<td>N.A.</td>
<td style="text-align: right">N.A.</td>
</tr>
<tr>
<td>Dict A</td>
<td>2.58e-26</td>
<td>N.A.</td>
<td>2.22e-3</td>
<td style="text-align: right">6.96e-5</td>
</tr>
<tr>
<td>Dict B</td>
<td>5.72e-23</td>
<td>N.A.</td>
<td>N.A.</td>
<td style="text-align: right">19.4e-2</td>
</tr>
<tr>
<td>Dict C</td>
<td>5.61e-22</td>
<td>N.A.</td>
<td>N.A.</td>
<td style="text-align: right">N.A.</td>
</tr>
</tbody>
</table>
<p>From these numbers, we can create the following “significance” table (to be read as do (row,column) populations differ significantly):</p>
<table class="table table-striped">
<thead>
<tr>
<th>Name vs.</th>
<th>Baseline</th>
<th>Dict A</th>
<th>Dict B</th>
<th style="text-align: right">Dict C</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dict A</td>
<td><strong>Yes</strong></td>
<td>N.A.</td>
<td><strong>Yes</strong></td>
<td style="text-align: right"><strong>Yes</strong></td>
</tr>
<tr>
<td>Dict B</td>
<td><strong>Yes</strong></td>
<td><strong>No</strong></td>
<td>N.A.</td>
<td style="text-align: right"><strong>No</strong></td>
</tr>
<tr>
<td>Dict C</td>
<td><strong>Yes</strong></td>
<td><strong>No</strong></td>
<td><strong>No</strong></td>
<td style="text-align: right">N.A.</td>
</tr>
</tbody>
</table>
<p>This table tells us that</p>
<ul>
<li>All “Dict” populations are significantly different than the baseline AND</li>
<li>Dict A population is significantly different than the rest</li>
</ul>
<p>In some ways this is a counter-intuitive result because I would have expected more tokens (in Dict B and Dict C) to result in a significant change in the outcome.
It turns out it is more important to have a small set of correct tokens than a larger set: more tokens in a dictionary is not necessarily a good thing.</p>
<p>Bear in mind that all runs were performed for 5 minutes only, results may/will change for longer fuzzing durations.
My original motivation in choosing a 5-minute fuzzing window was to get a quick understanding of the effectiveness of each of the dictionaries before sending out the PR.
Having said that, given enough time and resources, we can perform the same tests after a longer time interval (say 1 hour of fuzzing) and repeat this analysis.</p>
<h3 id="vargha-delaney-a12-test">Vargha Delaney A12 Test</h3>
<p>Statistical significance cannot be equated to scientific importance.
The latter requires a stricter evaluation of the delta in the metric: how much more improvement in test coverage did the evaluated dictionaries achieve?
We know that the Dict A population not only has the highest mean/median but is also significantly different from the rest; but how much better is it?
The <a href="https://www.jstor.org/stable/1165329">VDA test</a> is useful for answering precisely this question.</p>
<p>Let’s recall that a VDA score between (X,Y) of <code class="language-plaintext highlighter-rouge">>0.56</code> indicates a small change, <code class="language-plaintext highlighter-rouge">>0.64</code> indicates a medium change, and <code class="language-plaintext highlighter-rouge">>0.71</code> indicates a big change.
Using my <a href="https://gist.github.com/bshastry/df0f07dc0d3f5cac48e9dc9affe20d0f">evaluation gist outlined above</a>, I compute the VDA probabilities as follows.
Again, I would like to credit Tim Menzies whose <a href="https://gist.github.com/timm/5630491">VDA implementation</a> was the basis for these computations.
In my script, I use standard effect sizes (small=0.56, medium=0.64, large=0.71) to compute three such rankings.
Here is what I find.</p>
<p>Small effect ranking</p>
<ul>
<li>Rank 1: Dict A</li>
<li>Rank 2: Dict B</li>
<li>Rank 2: Dict C</li>
<li>Rank 3: Baseline</li>
</ul>
<p>In other words, Dict A offers <strong>at least</strong> small improvements in program coverage over Dict B and Dict C, which in turn offer <strong>at least</strong> small improvements in program coverage over the baseline.</p>
<p>Medium effect ranking</p>
<ul>
<li>Rank 1: Dict A</li>
<li>Rank 1: Dict B</li>
<li>Rank 1: Dict C</li>
<li>Rank 2: Baseline</li>
</ul>
<p>In other words, Dict A, Dict B, and Dict C are roughly the same if we require the improvement in test coverage to be at least <strong>medium</strong> (A12 > 0.64). Still, each of these dictionaries offers at least a <strong>medium</strong>-sized improvement in coverage over the baseline, i.e., no dictionary.</p>
<p>Big effect ranking</p>
<ul>
<li>Rank 1: Dict A</li>
<li>Rank 1: Dict B</li>
<li>Rank 1: Dict C</li>
<li>Rank 2: Baseline</li>
</ul>
<p>The medium result holds even for a <strong>big</strong> effect: this means that each of the three dictionaries offers a <strong>big</strong> improvement in coverage compared to the baseline.</p>
<p>From this we can conclude that (1) there is a small delta between Dict A and Dict B/C; and (2) there is a big delta between Dict A/B/C and the baseline (no dictionary).
In a nutshell, the “winner” is Dict A.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I draw the following conclusions from this work:</p>
<ul>
<li>Simple statistical tests provide an understanding of the significance of a change in some fuzzing parameter</li>
<li>For the specific fuzzing target evaluated in this post, dictionaries indeed are very useful</li>
</ul>
<p>Some caveats: (1) the fuzzing window chosen for evaluation was short, and (2) the results focus on the coverage metric and not on, e.g., speed of bug finding.
However, this methodology offers a scientific basis for drawing conclusions, which is pretty cool.
Needless to say, I added Dict A to my PR to oss-fuzz and now I can say that (in a very limited way) my PR is based on scientific evidence ;-)</p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>Thanks to</p>
<ul>
<li>The authors of the “Evaluating Fuzz Testing” paper, check the <a href="https://arxiv.org/pdf/1808.09700.pdf">paper</a> out.</li>
<li>Tim Menzies whose <a href="https://gist.github.com/timm/5630491">A12 implementation</a> I used in this work</li>
<li>My wife, Divya, for teaching me basic stats</li>
</ul>IntroExploring Fuzzer Crashes2017-08-04T00:00:00+00:002017-08-04T00:00:00+00:00/2017/08/04/Exploring-Fuzzer-Crashes<p><a href="/2017/08/02/Diagnosing-Distributed-Vulnerabilities.html">Part 1</a> | <a href="/2017/08/03/Inferring-Program-Input-Format.html">Part 2</a> | <a href="/2017/08/04/Exploring-Fuzzer-Crashes.html">Part 3</a></p>
<h2 id="prologue">Prologue</h2>
<p>This post concludes the three part series on compiler assisted vulnerability diagnosis in open-source C/C++ code. “Compiler assisted” means that the presented techniques pivot around a compiler, and “vulnerability diagnosis” refers to the process of finding and fixing vulnerabilities (software weaknesses that can be used to intentionally cause harm). Software weaknesses (bugs) are a superset of vulnerabilities in that not all weaknesses are harmful from a security perspective. The challenging part of diagnosing vulnerabilities in source code is to arrive at the (usually) small subset of vulnerabilities from the (usually) larger set of bugs and non-bugs (that the source analyzer believes to be real bugs aka false positives).</p>
<h2 id="intro">Intro</h2>
<p>Software testing is arguably the most important process in the quality assurance phase of software development. Bugs found during testing achieve an important objective: helping fix programming errors before a software release. Therefore, bug count is a reasonable metric for assessing the effectiveness of the software testing process. If technique X helps find more bugs than technique Y, the former is said to be more effective.</p>
<p>This post argues that, for practical reasons, fuzz testing alone may be sub-optimal to maximize bug count, and that static analysis can help find bugs in scenarios where fuzzing is not an option.
Here is a non-exhaustive list of scenarios where fuzzing is not straightforward:</p>
<ul>
<li>Crypto code</li>
<li>Stateful application logic in networking stacks</li>
<li>No unit test to test feature X</li>
<li>No fuzzable unit test to test feature X</li>
</ul>
<p>Of course, this does not mean fuzzing in these scenarios is impossible.
It just means that fuzzing them is harder (it requires manual labor) and therefore does not scale out.</p>
<h2 id="static-exploration-of-fuzzer-crashes">Static exploration of fuzzer crashes</h2>
<p>How can we scale bug discovery beyond fuzz testing?
My proposal is to use static analysis in order to automatically explore the findings of a fuzzer.
By “findings of a fuzzer”, I mean fuzzer-discovered program crashes that can be localized (attributed) to a small portion of the program.
By “exploration”, I mean spotting recurrences of the underlying cause of fuzzer-discovered crashes.
This opens up two problems: how to automatically (1) localize fuzzer crashes and (2) explore them statically?
Considering that static analysis over-approximates, a third problem is how to handle false positives.
We shall investigate each problem in the paragraphs that follow.</p>
<h4 id="fault-localization">Fault localization</h4>
<p>In this post, we focus on fault localization in an open-source setting, although fault localization has been <a href="https://dl.acm.org/citation.cfm?id=2519842">shown to be possible in a closed source setting</a>.
So, our fault localization tool should accept source code and a fuzzer corpus (set of test inputs) as input, and produce a set of localized code segments that correspond to each unique fuzzer-discovered crash.
<a href="https://github.com/jfoote/exploitable">Crash de-duplication tools such as exploitable</a> provide us the set of uniquely crashing program inputs.
So, our problem is reduced to that of obtaining localized code segments for each unique crash in the set of deduplicated crashes.</p>
<p>For memory corruption bugs, memory-tracing tools such as AddressSanitizer and Valgrind can greatly assist fault localization.
These tools track the state of memory use at byte granularity, reporting buffer overflows, use-after-free, and other memory-related issues that are endemic to C/C++ applications.
AddressSanitizer even has a structured bug diagnostic report that can be leveraged to programmatically narrow down the lines of code that caused the bug.</p>
<p>Let’s run through a small example here. The code below contains a synthetic buffer overflow that we can spot with the help of ASan:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > example.c
#include <stdio.h>
void vulnerable(int y, char *buf) {
buf[y] = 0;
}
int main(int argc, char *argv[]) {
char buf[256];
size_t x = 0;
scanf("%lu", &x);
vulnerable(x, buf);
return 0;
}
EOF
$ clang -fsanitize=address example.c
$ ./a.out
256
=================================================================
==2290==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffcd43ff9c0 at pc 0x0000004e9be2 bp 0x7ffcd43ff860 sp 0x7ffcd43ff85
8
WRITE of size 1 at 0x7ffcd43ff9c0 thread T0
#0 0x4e9be1 in vulnerable /home/bhargava/work/github/bshastry.github.io/code/example1.c:4:11
#1 0x4e9d71 in main /home/bhargava/work/github/bshastry.github.io/code/example1.c:11:4
#2 0x7f13f2f8682f in __libc_start_main /build/glibc-bfm8X4/glibc-2.23/csu/../csu/libc-start.c:291
#3 0x418538 in _start (/home/bhargava/work/github/bshastry.github.io/a.out+0x418538)
Address 0x7ffcd43ff9c0 is located in stack of thread T0 at offset 288 in frame
#0 0x4e9bff in main /home/bhargava/work/github/bshastry.github.io/code/example1.c:7
This frame has 2 object(s):
[32, 288) 'buf' <== Memory access at offset 288 overflows this variable
[352, 360) 'x'
</code></pre></div></div>
<p>Note that the ASan diagnostic report not only shows the program stack trace at the time the buffer overflow occurred, but also the program variable that overflowed.
Moreover, the formatting of the report is regular enough for us to automatically parse this information.</p>
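<p>As a hypothetical sketch, extracting candidate fault locations from a saved report could look like this (the file name and regular expression are illustrative, not part of any real tool):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# ASan stack frames look like:
#   #0 0x4e9be1 in vulnerable /path/to/example.c:4:11
frame_re = re.compile(r"#\d+ 0x[0-9a-f]+ in (\S+) ([^\s:]+):(\d+)")

with open("asan_report.txt") as f:  # assumed: ASan output saved to a file
    for func, path, line in frame_re.findall(f.read()):
        print(func, path, line)     # candidate (function, file, line) tuples
</code></pre></div></div>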
<p>What if we are dealing with a bug that is not caused by memory corruption, say, an assertion failure?
In the synthetic example below (<code class="language-plaintext highlighter-rouge">abort.c</code>), the program aborts when the parsed input equals the string literal <code class="language-plaintext highlighter-rouge">doom</code>. More realistically, one would be dealing with an assertion failure due to an unexpected program state. Nonetheless, the example is simple enough to demonstrate how we handle non memory corruption bugs. Lines have been numbered so we can speak about execution traces in terms of a set of line numbers. This will be clear shortly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > abort.c
1. #include <string.h>
2. #include <crypt.h>
3. #include <stdlib.h>
4. #include <unistd.h>
5. #define CUSTOM() abort()
6. void fuzzable(const char *input) {
7. // Fuzzer finds this bug
8. if (!strcmp(input, "doom"))
9. abort();
10. }
11.
12. // Fuzzer test harness
13. // INPUT: stdin
14. int main() {
15. char buf[256];
16. memset(buf, 0, 256);
17. read(0, buf, 255);
18. fuzzable(buf);
19. return 0;
20. }
</code></pre></div></div>
<p>Using a coverage tracer such as <a href="http://releases.llvm.org/3.8.1/tools/docs/SanitizerCoverage.html">SanitizerCoverage</a>, we can obtain the execution trace for this program for a given input.
Let’s assume that the fuzzer discovered the program input “doom” that causes the program to abort, immediately after it mutated an input “doo” that it had previously generated.
For the input “doom”, we can see that the following lines are in the execution trace</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang -fsanitize-coverage=bb -fsanitize=undefined -g abort.c
$ perl -e 'print "doom"' | UBSAN_OPTIONS="coverage=1:coverage_direct=1" ./a.out
Aborted (core dumped)
$ sancov.py rawunpack 2900.sancov.raw
$ sancov.py print a.out.2900.sancov | llvm-symbolizer -obj a.out
/usr/local/bin/pysancov: read 8 64-bit PCs from a.out.3150.sancov
/usr/local/bin/pysancov: 1 file merged; 8 PCs total
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:6:0
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:8:7
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:8:7
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:8:7
fuzzable
/home/bhargava/work/github/bshastry.github.io/code/abort.c:8:7
main
/home/bhargava/work/github/bshastry.github.io/code/abort.c:14:0
main
/home/bhargava/work/github/bshastry.github.io/code/abort.c:16:3
main
/home/bhargava/work/github/bshastry.github.io/code/abort.c:16:3
</code></pre></div></div>
<p>After de-duplicating line numbers, we are left with the following execution trace for the input “doom”: (6,8,14,16).
The trace for the input “doo” is: (6,8,10,14,16).
Note that the coverage tracing tool might have false negatives (executed lines that are not registered), but we can live with that.
If we obtain the set difference between the traces for “doo” and “doom”, we are left with line number 10.
What this tells us is that the function <code class="language-plaintext highlighter-rouge">fuzzable</code> does not return when passed input “doom” but returns when the passed input is “doo”.
From this, we can deduce that the crashing input caused a crash between lines 8 and 10 i.e., line 9.
In doing so, we have localized the failure (somewhat) to lines 8–10.</p>
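<p>The set arithmetic above is trivial to automate; a minimal sketch with the traces from this example hard-coded:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># De-duplicated line traces obtained via SanitizerCoverage, as above.
crash_trace  = {6, 8, 14, 16}      # crashing input "doom"
benign_trace = {6, 8, 10, 14, 16}  # closest benign input "doo"

# Lines reached only by the benign input show where execution diverged:
# line 10 (the function's return) is missing, so the crash lies on 8-10.
print(sorted(benign_trace - crash_trace))  # [10]
</code></pre></div></div>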
<p>What we obtain after fault localization is a set of source code locations (say, a list of file:line tuples) that (most likely) were the root-cause of a program crash.
Our next problem is to find where similar code patterns exist.</p>
<h4 id="static-exploration-of-root-cause-of-failure">Static exploration of root-cause of failure</h4>
<p>In order to explore code patterns similar to the root-cause of fuzzer-discovered crashes, we take a compiler-based code query approach.
We will be using <a href="https://clang.llvm.org/docs/LibASTMatchers.html">clang-query</a>, a tool that lets us efficiently query the abstract syntax tree of code bases.
The query syntax of clang-query is a functional language predicated over properties of the program AST.
I will try to break down what this means.
A tool like <code class="language-plaintext highlighter-rouge">grep</code> is what we seek to emulate: given a code pattern that is known to be vulnerable, we would like to search for its recurrences.
However, unlike <code class="language-plaintext highlighter-rouge">grep</code>, we do not match the textual representation of code, but rather how it looks to the compiler.
At the risk of oversimplification, I call it compiler grepping!
If you are wondering what compiler grepping brings to the table that <code class="language-plaintext highlighter-rouge">grep</code> does not: it lets us match against the structure and semantics of code rather than its appearance.
This can make a big difference, as we shall see.</p>
<p>The next question then is: How can we formulate compiler queries from code segments that we have obtained after fault localization?
To understand this, let’s try to understand what code segments look like to the compiler. Here’s a snippet of <code class="language-plaintext highlighter-rouge">abort.c</code>’s AST.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang -fsyntax-only -ast-dump abort.c
`-FunctionDecl 0x2c61778 <line:14:1, line:20:1> line:14:5 main 'int ()'
`-CompoundStmt 0x2c61d48 <col:13, line:20:1>
|-DeclStmt 0x2c618f8 <line:15:3, col:17>
| `-VarDecl 0x2c61898 <col:3, col:16> col:8 used buf 'char [256]'
|-CallExpr 0x2c61a00 <line:16:3, col:24> 'void *'
| |-ImplicitCastExpr 0x2c619e8 <col:3> 'void *(*)(void *, int, unsigned long)' <FunctionToPointerDecay>
| | `-DeclRefExpr 0x2c61910 <col:3> 'void *(void *, int, unsigned long)' Function 0x2baf100 'memset' 'void *(void *, int, unsigned long)'
</code></pre></div></div>
<p>Here’s the break down of the AST snippet:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">FunctionDecl</code> is an AST node that represents the declaration of the <code class="language-plaintext highlighter-rouge">main()</code> function</li>
<li><code class="language-plaintext highlighter-rouge">CompoundStmt</code> is an AST node that signals the start of the function’s body. Note that this node is a child of <code class="language-plaintext highlighter-rouge">FunctionDecl</code> implying that the <code class="language-plaintext highlighter-rouge">CompoundStmt</code> in question is to be found in the function body of <code class="language-plaintext highlighter-rouge">main()</code></li>
<li><code class="language-plaintext highlighter-rouge">DeclStmt</code> is an AST node that represents the declaration of the char buffer whose name is <code class="language-plaintext highlighter-rouge">buf</code>. The referenced variable <code class="language-plaintext highlighter-rouge">VarDecl</code> is a child of <code class="language-plaintext highlighter-rouge">DeclStmt</code> implying that the variable in question binds to the said declarative statement</li>
<li>… and so on.</li>
</ul>
<p>AST features (type of AST node, and its relationship to adjacent AST nodes) can help issue efficient queries for static exploration.
For example, if we want to explore all calls to the function <code class="language-plaintext highlighter-rouge">abort()</code> we can issue the following clang-query style query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang-query abort.c
clang-query> match declRefExpr(to(
functionDecl(hasName("abort"))
))
Match #1:
/home/bhargava/work/github/bshastry.github.io/code/abort.c:9:3: note: "root" binds here
abort();
^~~~~
1 match.
</code></pre></div></div>
<p>This example demonstrates how simple functional queries may be used to explore a code base.
In this work, we focus on directed exploration i.e., we would like to explore the code base with specific issues in mind.
To demonstrate this, consider the following stack trace discovered by fuzzing a modified version of the <code class="language-plaintext highlighter-rouge">abort.c</code> program that we shall call <code class="language-plaintext highlighter-rouge">abort-mod.c</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > abort-mod.c
1. #include <string.h>
2. #include <crypt.h>
3. #include <stdlib.h>
4. #include <unistd.h>
5. #define CUSTOM() abort()
6. void fuzzable(const char *input) {
7. // Fuzzer finds this bug
8. if (!strcmp(input, "doom"))
9. abort();
10. }
11. void cov_bottleneck(const char *input) {
12. char *hash = crypt(input, "salt");
13.
14. // Fuzzer is unlikely to find this bug
15. if (!strcmp(hash, "hash_val"))
16. CUSTOM(); // grep misses this
17. }
18.
19. // Fuzzer test harness
20. // INPUT: stdin
21. int main() {
22. char buf[256];
23. memset(buf, 0, 256);
24. read(0, buf, 255);
25. fuzzable(buf);
26. cov_bottleneck(buf);
27. return 0;
28. }
EOF
$ clang -g -lcrypt abort-mod.c
$ perl -e 'print "doom"' | gdb -q -ex=r -ex=bt -ex=quit ./a.out
Reading symbols from ./a.out...done.
Starting program: /home/bhargava/work/github/bshastry.github.io/code/a.out
Program received signal SIGABRT, Aborted.
0x00007ffff780a428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007ffff780a428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff780c02a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x000000000040073a in fuzzable (input=0x7fffffffd850 "doom") at abort-mod.c:9
#3 0x0000000000400814 in main () at abort-mod.c:25
</code></pre></div></div>
<p>Essentially, as expected, the input <code class="language-plaintext highlighter-rouge">doom</code> triggers a program abort. Things like this are relatively easy to find using a fuzzer.
Note, however, that a similar “vulnerability” is hiding under crypto code.
The fuzzer would need to generate a hash collision to get past the branch leading to this vuln, which is very unlikely.
Note also that the call to the <code class="language-plaintext highlighter-rouge">abort()</code> function is lexically different: it is called <code class="language-plaintext highlighter-rouge">CUSTOM()</code> and not <code class="language-plaintext highlighter-rouge">abort()</code>.
This is intentional, to show that lexical or even textual matching tools such as <code class="language-plaintext highlighter-rouge">grep</code> will not be able to match it for the query <code class="language-plaintext highlighter-rouge">abort</code>.
Now, I will demonstrate how we deal with code scenarios like those in the example.</p>
<p>First, we localize the defect using the stack trace.
If we filter out stack frames that do not belong to our source code (i.e., systems/library code) and pick the first remaining frame, we are left with the call to <code class="language-plaintext highlighter-rouge">abort()</code> in the <code class="language-plaintext highlighter-rouge">fuzzable()</code> function.
So let’s list all calls to <code class="language-plaintext highlighter-rouge">abort()</code> in the entire code base.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > abort_query.txt
match declRefExpr(to(functionDecl(hasName("abort"))))
EOF
$ clang-query -f=abort_query.txt abort-mod.c
Match #1:
/home/bhargava/work/github/bshastry.github.io/code/abort-mod.c:9:3: note: "root" binds here
abort();
^~~~~
Match #2:
/home/bhargava/work/github/bshastry.github.io/code/abort-mod.c:16:3: note: "root" binds here
CUSTOM(); // grep misses this
^~~~~~~~
/home/bhargava/work/github/bshastry.github.io/code/abort-mod.c:5:18: note: expanded from macro
'CUSTOM'
#define CUSTOM() abort ()
^~~~~
2 matches.
</code></pre></div></div>
<p>As shown, fuzzer-directed queries can help spot issues that might have been missed by fuzzing alone. This is where directed compiler-based queries help. Being static they can explore the entire code base without being hampered by dynamic bottlenecks such as cryptographic code or more simply code that doesn’t get exercised by existing unit tests.</p>
<h4 id="dealing-with-false-positives">Dealing with false positives</h4>
<p>This sounds too good to be true. It is. Static analysis over-approximates, which leads to false positives and, eventually, manual time spent validating reports.
For example, in the synthetic example above, a query for all calls to <code class="language-plaintext highlighter-rouge">abort()</code> is too broad to find real issues: there are likely calls to <code class="language-plaintext highlighter-rouge">abort()</code> in dead code and/or code that is not relevant.
In general, the more precisely we can model fuzzer crashes from the post-failure diagnostics (stack trace, core dump etc.), the better the static matches we get.
For the time being, we have a simple but effective way to facilitate manual review.</p>
<h4 id="ranking-matches">Ranking matches</h4>
<p>First, we measure the test coverage reached by fuzzing.
We do this using a coverage tracing tool such as Gcov or SanitizerCoverage.
Second, for each match returned by the static analyzer, we check whether it lies in code that fuzzing already covered.
Matches in unfuzzed code are prioritized for review.</p>
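<p>A minimal sketch of this ranking, assuming coverage data and static matches have already been reduced to (file, line) pairs (the data below is hypothetical):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># (file, line) pairs covered during fuzzing, e.g., parsed from sancov output.
fuzz_covered = {("abort-mod.c", 9)}

# (file, line) pairs reported by clang-query.
matches = [("abort-mod.c", 9), ("abort-mod.c", 16)]

# Unfuzzed matches first: fuzzing has not vetted them dynamically.
ranked = sorted(matches, key=lambda m: m in fuzz_covered)
print(ranked)  # [('abort-mod.c', 16), ('abort-mod.c', 9)]
</code></pre></div></div>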
<h2 id="results">Results</h2>
<p>This research was evaluated on the Open vSwitch codebase. It led to the discovery of several corner cases that OvS developers appreciated.
Notably, we showed that our method could spot a security regression that appeared in one release, and catch a real issue similar to a fuzzer-discovered vuln elsewhere in the same codebase.
The analysis undertaken is fast and thus doable on a regular basis, e.g., in CI.
I think the approach taken in this work holds promise for catching other classes of recurring vulns in large codebases.</p>
<p><a href="/2017/08/02/Diagnosing-Distributed-Vulnerabilities.html">Part 1</a> | <a href="/2017/08/03/Inferring-Program-Input-Format.html">Part 2</a> | <a href="/2017/08/04/Exploring-Fuzzer-Crashes.html">Part 3</a></p>Part 1 | Part 2 | Part 3Inferring Program Input Format2017-08-03T00:00:00+00:002017-08-03T00:00:00+00:00/2017/08/03/Inferring-Program-Input-Format<p><a href="/2017/08/02/Diagnosing-Distributed-Vulnerabilities.html">Part 1</a> | <a href="/2017/08/03/Inferring-Program-Input-Format.html">Part 2</a> | <a href="/2017/08/04/Exploring-Fuzzer-Crashes.html">Part 3</a></p>
<h2 id="prologue">Prologue</h2>
<p>This post is the second of the three part series on compiler assisted vulnerability diagnosis in open-source C/C++ code. “Compiler assisted” means that the presented techniques pivot around a compiler, and “vulnerability diagnosis” refers to the process of finding and fixing vulnerabilities (software weaknesses that can be used to intentionally cause harm). Software weaknesses (bugs) are a superset of vulnerabilities in that not all weaknesses are harmful from a security perspective. The challenging part of diagnosing vulnerabilities in source code is to arrive at the (usually) small subset of vulnerabilities from the (usually) larger set of bugs and non-bugs (that the source analyzer believes to be real bugs aka false positives).</p>
<h2 id="intro">Intro</h2>
<p>Coverage guided fuzzers such as afl-fuzz are clever enough to generate inputs that exercise new program paths. However, there are instances where additional help is valuable. By valuable, I mean one of two things: (1) it reduces the time to vulnerability exposure; and/or (2) it increases the number of vulns uncovered.
This post investigates one way in which such additional support may be provided to the fuzzer.</p>
<h2 id="inferring-input-format-from-source-code">Inferring Input Format From Source Code</h2>
<p>I will be using a <a href="https://llvm.org/docs/LibFuzzer.html">libFuzzer</a> test harness to demonstrate the central idea behind this post.
Consider the following code example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > libfuzzer-example.c
#include <stdbool.h> /* bool */
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uint8_t */
bool FuzzMe(const uint8_t *Data, size_t Size) {
return Size >=3 &&
Data[0] == 'F' &&
Data[1] == 'U' &&
Data[2] == 'Z' &&
Data[3] == 'Z';
}
int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
FuzzMe(Data, Size);
return 0;
}
EOF
</code></pre></div></div>
<p>All this test harness is doing is fuzzing a buggy function called <code class="language-plaintext highlighter-rouge">FuzzMe()</code> that contains an out-of-bounds read, triggered when <code class="language-plaintext highlighter-rouge">Size == 3 && input == "FUZ"</code>.
Let’s time libFuzzer on this test case, starting from an empty corpus.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang++ -g -fsanitize=address -fsanitize-coverage=trace-pc-guard ~/FTS/tutorial/fuzz_me.cc libFuzzer.a
$ time ./a.out
...
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==15307==ABORTING
MS: 1 EraseBytes-; base unit: 6cdcffd840bb810dcdd4778c1a5caaa6cd012f0c
0x46,0x55,0x5a,
FUZ
artifact_prefix='./'; Test unit written to ./crash-0eb8e4ed029b774d80f2b66408203801cb982a60
Base64: RlVa
real 0m0.844s
user 0m0.440s
sys 0m0.180s
</code></pre></div></div>
<p>So, after roughly 0.8s, libFuzzer was able to find the input (“FUZ”) that triggered the single-byte out-of-bounds read.
That’s really fast.
However, it could be made faster still if we gain some insight into the program input format.
Let’s run a simple clang front-end tool to extract the character literals used in comparison statements, even before we start to fuzz.
Remember, we are doing a static pass over the source code here.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat <<EOF > clang-charlitmatcher.c
#include "clang/AST/ASTConsumer.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/FrontendAction.h"
#include "clang/Tooling/Tooling.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"
// Declares clang::SyntaxOnlyAction.
#include "clang/Frontend/FrontendActions.h"
#include "clang/Tooling/CommonOptionsParser.h"
// Declares llvm::cl::extrahelp.
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Regex.h"
using namespace clang::tooling;
using namespace llvm;
using namespace clang;
using namespace clang::ast_matchers;
// Apply a custom category to all command-line options so that they are the
// only ones displayed.
static cl::OptionCategory MyToolCategory("clang-sdict options");
// CommonOptionsParser declares HelpMessage with a description of the common
// command-line options related to the compilation database and input files.
// It's nice to have this help message in all tools.
static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);
// A help message for this specific tool can be added afterwards.
static cl::extrahelp MoreHelp("\nTakes a compilation database and spits out CString Literals in source files\n");
// character literal in binary op matcher
StatementMatcher CharLitMatcher = characterLiteral(hasParent(binaryOperator())).bind("charlit");
class MatchPrinter : public MatchFinder::MatchCallback {
public :
void printToken(StringRef token) {
size_t tokenlen = token.size();
if ((tokenlen == 0) || (tokenlen > 128))
return;
llvm::outs() << "\"" + token + "\"" << "\n";
}
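// Pretty-print a hex string such as "46" as an escaped byte sequence
// such as "\x46", padding the string to even length first.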
void prettyPrintIntString(std::string inString) {
if (inString.empty())
return;
size_t inStrLen = inString.size();
if (inStrLen % 2) {
inString.insert(0, "0");
inStrLen++;
}
for (size_t i = 0; i < (2 * inStrLen); i+=4)
inString.insert(i, "\\x");
printToken(inString);
}
void formatCharLiteral(const CharacterLiteral *CL) {
unsigned value = CL->getValue();
std::string valString = llvm::APInt(8, value).toString(16, false);
prettyPrintIntString(valString);
}
virtual void run(const MatchFinder::MatchResult &Result) {
if (const clang::CharacterLiteral *CL = Result.Nodes.getNodeAs<clang::CharacterLiteral>("charlit"))
formatCharLiteral(CL);
}
};
int main(int argc, const char **argv) {
CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
ClangTool Tool(OptionsParser.getCompilations(),
OptionsParser.getSourcePathList());
MatchPrinter Printer;
MatchFinder Finder;
Finder.addMatcher(CharLitMatcher, &Printer);
return Tool.run(newFrontendActionFactory(&Finder).get());
}
EOF
</code></pre></div></div>
<p>Long story short, the clang front-end tool does the following:</p>
<ul>
<li>Makes a pass over source code AST</li>
<li>Looks for character literals that are children of binary operators</li>
<li>Prints these character literals</li>
</ul>
<p>Note that all of this is done in under 100 lines of code including boilerplate code.
Now, let’s run this against our libfuzzer code example.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ clang-clmatcher libfuzzer-example.c > dict
$ cat dict
"\x46"
"\x55"
"\x5A"
"\x5A" }
</code></pre></div></div>
<p>Essentially, this gave us ‘F’, ‘U’, ‘Z’, ‘Z’ (after deduplication: ‘F’, ‘U’, and ‘Z’). Let’s put these tokens in an afl-style dictionary and reinvoke libFuzzer with it.
The idea is to compare the times libFuzzer takes with and without the dictionary. As we have already noted, it takes about 0.8s to spot the buffer over-read without a dictionary.</p>
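<p>For reference, an afl-style dictionary is just a plain-text file of quoted tokens, each optionally prefixed with a name; the tool output above is already usable as-is, but a hypothetical named variant would look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kw_f="\x46"
kw_u="\x55"
kw_z="\x5A"
</code></pre></div></div>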
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time ./a.out -dict=dict
...
MS: 3 ChangeByte-ShuffleBytes-EraseBytes-; base unit: d211f6eb0b35f1d135f354587b1a0851779fcc28
0x46,0x55,0x5a,
FUZ
artifact_prefix='./'; Test unit written to ./crash-0eb8e4ed029b774d80f2b66408203801cb982a60
Base64: RlVa
real 0m0.129s
user 0m0.012s
sys 0m0.024s
</code></pre></div></div>
<p>Naturally, it’s a lot faster because we already know some things about the input format. Of course, more information may be gathered such as the context in which certain tokens are used, the order in which they are used and so on. You may read how this can be done in the paper linked below.</p>
<h2 id="results">Results</h2>
<p>Statically generated dictionaries may make fuzzing campaigns more effective.
These dictionaries are particularly suitable for fuzzing applications that parse highly structured inputs, such as file format and network parsers.
For example, we found over 15 zero-day vulns in network parsers due to the use of dictionaries alone.
Having said that, understanding where dictionaries won’t help can inform whether using one is worthwhile.
Can they find bugs in non-parser code paths faster? No, because knowledge of the input format is irrelevant for bugs outside the parsing code path.
Will a smart fuzzer find these bugs by itself? Probably, eventually: good fuzzers usually find the same bugs given enough time.
However, dictionaries can support them by triggering these code paths much faster so that a fuzzer may “focus” on other interesting code paths.
You can read the <a href="http://users.sec.t-labs.tu-berlin.de/~bshastry/raid17.pdf">full paper</a> (to be published in the proceedings of RAID’17 by Springer) that this work produced and form your own opinion.</p>
<p><a href="/2017/08/02/Diagnosing-Distributed-Vulnerabilities.html">Part 1</a> | <a href="/2017/08/03/Inferring-Program-Input-Format.html">Part 2</a> | <a href="/2017/08/04/Exploring-Fuzzer-Crashes.html">Part 3</a></p>Part 1 | Part 2 | Part 3