As I mentioned in a previous post, the PDF specification seems bloated. Additionally, Adobe Reader makes a real effort to display something even when the PDF document is ill-formed. These two observations led me to implement a fuzzing framework, with a PDF file format fuzzer as its guinea pig application.
Looking around, I only found one fuzzing framework in Haskell, FileH. It seems very fast, but it is targeted at “dumb” binary fuzzing: flip some bits here, remove some bytes there, increment these 4 bytes as if they were an integer. It also assumes you have a test harness that executes the program in question and returns a non-zero exit code when an interesting execution occurs. My original plan was to create a generative file fuzzer that builds new PDFs using the QuickCheck module, and to integrate the debugger/execution monitor with the fuzzing framework. With these goals in mind, I set out to write FuzzyWuzzy, the Haskell file fuzzing framework. (Note: I have restrained myself from getting into the gory details, but the link to bitbucket gives you the source should it interest you. Also, keep in mind this is the fifth Haskell project I’ve developed [and the largest]… I’m learning.)
My first step was writing the launcher and monitor bits that interface with the operating system. The first generation would be Windows-specific, with half an eye towards eventually supporting a ptrace interface. The Win32 modules provided with the GHC installation were almost all I needed. I ended up writing a foreign interface to CreateProcess to support an extra flag (DEBUG_PROCESS) not exposed by the unified System.Process module. I also implemented the foreign interface to TerminateProcess, but that was trivial. With those two extra functions available, I was able to use the functions from System.Win32.DebugApi to create the launcher and a monitor that detects crashing programs. I have not yet investigated ways to detect large memory usage or CPU load, but those are on the list for the next version. Currently, the monitor ends an execution and flags it as interesting if the application would have crashed had it not been attached to a debugger.
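For readers who want to experiment without the Win32 debug API, here is a much cruder stand-in, sketched in Python rather than the Haskell used in FuzzyWuzzy. It does not attach a debugger at all; it only launches the target and guesses from the exit code whether it crashed. The names `looks_like_crash` and `run_target` are made up for this sketch, and the 10-second timeout is an arbitrary assumption.

```python
import subprocess

# NTSTATUS codes with the top two bits set have "error" severity. An
# unhandled exception such as an access violation (0xC0000005) ends a
# Windows process with an exit code in this range.
NTSTATUS_ERROR_MASK = 0xC0000000


def looks_like_crash(returncode):
    """Guess from an exit code whether the target crashed (a heuristic)."""
    if returncode < 0:
        # POSIX: subprocess reports death by signal N as -N.
        return True
    # Windows: treat the code as an unsigned 32-bit NTSTATUS value.
    return ((returncode & 0xFFFFFFFF) & NTSTATUS_ERROR_MASK) == NTSTATUS_ERROR_MASK


def run_target(argv, timeout_seconds=10.0):
    """Run one test case and return (interesting, returncode or None)."""
    try:
        done = subprocess.run(
            argv,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        # A viewer that is still running is treated as uninteresting here.
        return (False, None)
    return (looks_like_crash(done.returncode), done.returncode)
```

A call would look like `run_target(["viewer.exe", "case-0001.pdf"])`, where both names are placeholders. Unlike a real debugger-based monitor, this misses crashes that the application catches itself and gives no register or stack information.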
With the OS stuff out of the way, I turned my attention to file generation. I created an abstract representation of the PDF format and implemented a serialization function to turn a PDF value into a file. After some more serious thought about generating almost-valid PDFs from QuickCheck generators, I decided to take another direction.
Instead of generating a PDF from nothing, I wrote a PDF parser to turn an existing PDF into the abstract representation. The next step was to write mutations on the PDF abstract tree: operations like enlarging Name or String objects, adding long chains of escape sequences to Name or String objects, and deleting entries in Dictionary objects. I also wrote some mutations on the raw character stream going back to disk, similar to the mutations done by FileH. At this point, the fuzzer was a complete program. I let it run for a bit and watched Acrobat throw lots of nice message boxes complaining about ill-formed PDFs.
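To make the three tree mutations concrete, here is a minimal sketch in Python (not the Haskell from FuzzyWuzzy). The `Name` and `PdfString` classes, the use of a plain `dict` for a Dictionary object, and the default sizes are all assumptions made for illustration. The escape forms are the ones PDF defines: `#` plus two hex digits inside a Name, and a backslash plus an octal code inside a literal String.

```python
import random
from dataclasses import dataclass


@dataclass
class Name:
    value: str


@dataclass
class PdfString:
    value: str


def enlarge(obj, factor=1000):
    """Repeat the text of a Name or String object many times."""
    return type(obj)(obj.value * factor)


def add_escapes(obj, count=500):
    """Append a long run of escape sequences to a Name or String."""
    # "#41" and "\101" both encode the letter A in their respective syntax.
    escape = "#41" if isinstance(obj, Name) else "\\101"
    return type(obj)(obj.value + escape * count)


def delete_entry(dictionary, rng):
    """Return a copy of a Dictionary with one randomly chosen key removed."""
    if not dictionary:
        return dict(dictionary)
    victim = rng.choice(sorted(dictionary))
    return {key: val for key, val in dictionary.items() if key != victim}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a crashing case can be replayed
    page = {"Type": Name("Page"), "Title": PdfString("hello")}
    print(delete_entry(page, rng))
    print(len(enlarge(page["Title"]).value))
```

Passing the random generator in explicitly keeps each mutation reproducible, which matters later when you need to find which mutation caused a crash.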
In the course of writing the higher-level PDF type mutations, I realized the hierarchical PDF structure made it difficult to pick, say, a random String object (String objects are usually referenced as values in a Dictionary object and would rarely if ever be found as top-level [indirect] objects themselves). Filtering would be easier if the PDF were a flat list of objects, with each node able to reference the id of another object if needed. After adding this as a transformation from the hierarchical representation, I came up with another bunch of mutations that were much easier to formulate with this representation. With this modification I started finding some crash bugs! My little fuzzer actually works.
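The flattening step can be sketched roughly as follows, again in Python rather than the Haskell of the real code. The `Ref` class, the `flatten` and `pick_ids` functions, and the choice to model Dictionary objects as `dict`, arrays as `list`, and strings as `str` are assumptions for this example. It also assumes the input tree has no cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Pointer to another object in the flat table, by id."""
    target: int


def flatten(root):
    """Turn a nested object tree into a flat {id: node} table.

    Containers keep their shape, but every child is replaced by a Ref
    to the child's own entry in the table.
    """
    table = {}

    def visit(node):
        node_id = len(table)
        table[node_id] = None  # reserve the id before descending
        if isinstance(node, dict):
            table[node_id] = {key: Ref(visit(val)) for key, val in node.items()}
        elif isinstance(node, list):
            table[node_id] = [Ref(visit(val)) for val in node]
        else:
            table[node_id] = node
        return node_id

    visit(root)
    return table


def pick_ids(table, predicate):
    """Ids of every node satisfying predicate, e.g. all string leaves."""
    return [node_id for node_id, node in table.items() if predicate(node)]


if __name__ == "__main__":
    doc = {"Info": {"Title": "hello", "Author": "me"}, "Kids": ["a", 3]}
    table = flatten(doc)
    print(pick_ids(table, lambda node: isinstance(node, str)))
```

Once every object has an id, "pick a random String" becomes a filter over the table followed by a random choice, with no tree walking inside each mutation.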
Now what? Finding the offending mutation wasn’t difficult and now I have a minimal case to play with. Of course, I’ve been coming up with new ideas for mutations each day.
Ideas for the future:
- Implement a system for distributed fuzzing – Break up the fuzzing process so the work is easy to distribute. In other words, have a few computers generating new files to test and a pool of tester computers doing the runs.
- Decompose the PDF format further to fuzz the stream contents – Stream objects are usually compressed with the DEFLATE algorithm. This makes for boring fuzzing, since mutating compressed bytes mostly exercises the decompressor rather than the code that interprets the contents. Uncompress the Stream objects and decompose them further (embedded fonts have known formats and graphics commands are not difficult to parse; movies, pictures, and music are all stored as Streams as well).
- Notification system – Email notification of newly found crashes with unique stack/EIP backtraces. Who wouldn’t want to know *immediately*?
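The stream idea in the list above could start from something like this Python sketch, which inflates a Flate-compressed stream body, flips a few bits in the plain data, and deflates it again. The function name `mutate_stream` and the default of four flips are assumptions, and the sketch ignores predictors and any filter other than Flate.

```python
import random
import zlib


def mutate_stream(compressed, rng, flips=4):
    """Inflate a stream body, flip a few random bits, and deflate it again."""
    raw = bytearray(zlib.decompress(compressed))
    if raw:
        for _ in range(flips):
            position = rng.randrange(len(raw))
            raw[position] ^= 1 << rng.randrange(8)  # flip one bit in that byte
    return zlib.compress(bytes(raw))


if __name__ == "__main__":
    body = b"BT /F1 12 Tf (Hello) Tj ET"
    fuzzed = mutate_stream(zlib.compress(body), random.Random(1))
    print(zlib.decompress(fuzzed))
```

The recompressed body will usually differ in size from the original, so the Length entry in the stream's dictionary has to be updated when the object is written back out (unless a wrong length is itself the mutation being tested).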
After doing all this fuzzing work, it’s become apparent why many people have moved towards developing hybrid fuzzers that use dynamic information to control the future inputs. That is probably where I’ll be heading next. Simple fuzzers are hard to measure (as everyone has already said many many times).
UPDATE: I’ve implemented the e-mail notification and begun the stream mutation code. I’ve also run into some weirdness with the Haskell CreateProcess interface: I’ve gotten a few rare segfaults. Since I haven’t been successful getting a Haskell monitor written, I wrote a quick one in C, but I haven’t ported the PDF mutation stuff to use it. I’m thinking about writing some Python to manage a simple distributed fuzzing system. Of course, lately I have been spending all my time on other things entirely, including DynaTrex, an open source binary rewriting tool for Windows. It is still very young.