Automated Computer Forensics
Current Research Areas
One of my primary areas of research is the development of algorithms, techniques, and eventually tools for automating a wide variety of computer forensics tasks that are currently performed by trained analysts. Today, much of this work relies on visualization tools that let an analyst search for data on a hard drive or in captured network traffic and slowly construct a story that might be useful in a prosecution or in recovering from a security event. But as data volumes grow and the network environment becomes more complex, there is a need for increasingly automated tools that can perform autonomous analysis and correlation.<ref>Garfinkel, S. "Document and Media Exploitation," ACM Queue, November/December 2007.</ref><ref>Garfinkel, Simson, Digital Forensics Research: The Next 10 Years, DFRWS 2010, Portland, OR</ref>
Today my research into this field of automated computer forensics covers these main areas:
- Small-block forensics---Exploring approaches for working with data elements that are smaller than files. This can be used in situations where an entire file is not available for reconstruction, or only a portion of a file is available for analysis. Small block forensics can be used to enable approaches based on statistical sampling rather than full-content analysis.<ref>Simson Garfinkel, Vassil Roussev, Alex Nelson and Douglas White, Using purpose-built functions and block hashes to enable small block and sub-file forensics, DFRWS 2010, Portland, OR</ref>
- Data-rich algorithms and approaches that are designed to work in environments where there is a large collection of data from multiple users, as can be the case in law enforcement, e-discovery, and internal corporate investigations.<ref>Garfinkel, S., Forensic Feature Extraction and Cross-Drive Analysis, The 6th Annual Digital Forensic Research Workshop, Lafayette, Indiana, August 14-16, 2006.</ref>
- Media/Web correlation --- Exploring opportunities for automatic correlation of information on hard drives with information that can be found on the web.
- Corpus Creation --- Developing an unclassified Real Data Corpus (RDC) consisting of "real data from real people" that can be used to develop new algorithms, quantify results, and test automated tools.<ref>Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada. (slides)</ref>
Some of my previous work in this area includes:
- Developing open source tools for working with electronic evidence. This work is part of the AFF project<ref>"AFF: A New Format for Storing Hard Drive Images," Garfinkel, S., Communications of the ACM, February, 2006</ref>.
- Explorations of file carving. <ref>Garfinkel, S., "Carving Contiguous and Fragmented Files with Fast Object Validation", Digital Forensics Workshop (DFRWS 2007), Pittsburgh, PA, August 2007. </ref>
Related work areas that I am not personally involved in include:
- File type recognition and identification.
- Approaches for gisting and clustering documents based on their content.
- Approaches that are tuned to human languages other than English.
Recent Research Developments
File-based forensics is based on the analysis of files, deleted files, and orphan files. Most forensics currently performed for law enforcement, commercial e-discovery, and intelligence purposes is file-based. The goal is typically to find a specific file that can be shown to a jury or that contains actionable intelligence. File-based forensics is typically performed using programs such as EnCase, FTK, or SleuthKit.
- We have developed a batch analysis system called fiwalk which can take a disk image and produce an XML file describing all of the files, deleted files, and orphan files in the image, together with all of the extracted file metadata. This XML file can be used as an input to enable further automated media processing. Using this system we have created a variety of applications for reporting on and manipulating disk images. We have also developed an efficient system for allowing remote file-level access to disk images using XML-RPC and REST. Details can be found in our paper<ref>Automating Disk Forensic Processing with SleuthKit, XML and Python, Fourth International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering (IEEE/SADFE'09), May 2009</ref>.
- We have developed a prototype system for performing automated media forensic reporting. Based on PyFlag, the system performs an in-depth analysis of captured media, locates local and online identities, and presents summary information in a report that is tailored to be easy to use for the consumer of forensic intelligence<ref>A Framework for Automated Digital Forensic Reporting, Lt. Paul Farrell, Master's Thesis, Naval Postgraduate School, Monterey, CA, March 2009</ref>.
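As a rough illustration of how the XML report described in the first item above could feed further automated processing, the following Python sketch lists the files recorded in such a report. The element names (`fileobject`, `filename`, `filesize`), the sample document, and the helper names are assumptions made for this example and may not match the exact output of fiwalk; adjust them to the real schema before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample report. The element names are assumptions made for
# illustration and may differ from what fiwalk actually emits.
SAMPLE_XML = """
<fiwalk>
  <volume>
    <fileobject>
      <filename>docs/report.doc</filename>
      <filesize>20480</filesize>
    </fileobject>
    <fileobject>
      <filename>tmp/old.jpg</filename>
      <filesize>1024</filesize>
    </fileobject>
  </volume>
</fiwalk>
"""


def list_files(xml_text):
    """Return a (filename, size) tuple for every fileobject element."""
    root = ET.fromstring(xml_text)
    results = []
    for fobj in root.iter("fileobject"):
        name = fobj.findtext("filename", default="")
        size = int(fobj.findtext("filesize", default="0"))
        results.append((name, size))
    return results


def total_bytes(xml_text):
    """Sum the sizes of all files listed in the report."""
    return sum(size for _, size in list_files(xml_text))
```

A downstream reporting tool could call `list_files` on the XML produced for each disk image and never touch the image itself, which is the kind of decoupling the XML intermediate format is meant to enable.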
Bulk data forensics
Bulk data forensics is based on the bulk analysis of disk images and other kinds of forensic source data. File carving<ref>"Carving Contiguous and Fragmented Files with Fast Object Validation", Garfinkel, S., Digital Investigation, Volume 4, Supplement 1, September 2007, Pages 2--12.</ref> is a traditional form of bulk forensics. We see bulk forensics as a complement to existing forensic processing, rather than as a replacement for it.
Bulk data forensics has several important advantages over traditional forensic processing:
- It's faster, because the disk head scans the disk (or disk image) from beginning to end without having to seek from file to file.
- It can tolerate media that is damaged or incomplete, since the forensic processing does not require the reconstruction of file allocation tables, disk directories, or metadata.
- It works with obscure or unknown operating systems, since no attempt is made to reconstruct the file system or other operating system structures.
- It lends itself to statistical processing. Instead of scanning the entire disk image, the image can be sampled.
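The sampling idea in the last point can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any released tool: the function names, the 512-byte sector size, and the "blank sector" predicate are all hypothetical choices made for the example.

```python
import io
import math
import random

SECTOR_SIZE = 512  # assumed sector size for this sketch


def estimate_fraction(image, predicate, samples, seed=None):
    """Estimate the fraction of sectors for which predicate(sector) is true.

    image is a seekable binary file object. Sectors are drawn uniformly at
    random (with replacement). Returns (estimate, standard_error).
    """
    rng = random.Random(seed)
    image.seek(0, io.SEEK_END)
    total_sectors = image.tell() // SECTOR_SIZE
    if total_sectors == 0 or samples <= 0:
        raise ValueError("need a non-empty image and at least one sample")
    hits = 0
    for _ in range(samples):
        sector = rng.randrange(total_sectors)
        image.seek(sector * SECTOR_SIZE)
        if predicate(image.read(SECTOR_SIZE)):
            hits += 1
    p = hits / samples
    # Standard error of a sample proportion.
    return p, math.sqrt(p * (1 - p) / samples)


def is_blank(sector):
    """Example predicate: true when every byte in the sector is zero."""
    return not any(sector)
```

Because the standard error depends on the number of samples rather than on the size of the image, a few thousand reads can give a usable estimate even for a very large drive. Note that random reads reintroduce seeks, so sampling trades the sequential-scan advantage for a much smaller amount of data read.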
We have developed several interesting tools for bulk data forensics:
- frag_find is a tool that can report if sectors of a TARGET file are present on a disk image. This is useful in cases where a TARGET file has been stolen and you wish to establish that the file was present on a subject's drive. If most of the TARGET file's sectors are found on the IMAGE drive---and if the sectors are in consecutive sector runs---then the chances are excellent that the file was once there. Frag_find uses a three-stage filter with a high-speed but low-quality 32-bit hash, then a Bloom filter of SHA1 hashes,<ref>“Practical Applications of Bloom filters to the NIST RDS and hard drive triage,” Farrell, Garfinkel and White, ACSAC 2008</ref> and finally a linked list of all hashes. The result is that disk sectors are only hashed when necessary, allowing processing speeds of 50K sectors/sec on standard hardware. The program deals with the problem of non-unique blocks by looking for runs of matching blocks, rather than individual blocks. Frag_find is part of the NPS Bloom package, which can be downloaded from http://www.afflib.org.
- bulk_extractor is a tool that searches for recognized features in the bulk data and performs histogram analysis on the result. You can write your own feature extractors using flex. We provide bulk_extractor with an extractor that finds complete email addresses and domain names. The most common email address on a hard drive is usually that of the drive's primary user; other top-occurring email addresses tend to belong to that person's primary correspondents. By looking at the list of email addresses sorted by frequency, it is easy to rapidly infer the user's social network.
- CDA Tool takes the results of the bulk_extractor tool and performs a Cross-Drive Analysis<ref>"Forensic Feature Extraction and Cross-Drive Analysis," Garfinkel, S., Digital Investigation, Volume 3, Supplement 1, September 2006, Pages 71--81.</ref>. This can allow an investigator to discover previously unknown social networks in a set of hard drives, or to see if a newly acquired hard drive belongs to an existing social network. Currently the tool simply prints a report of the amount of connection between each pair of drives. We plan to expand this tool to show the results graphically and to allow the analyst to drill down and see the cause of the connections.
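The sector-matching approach described for frag_find above can be sketched as follows. This is a simplified Python illustration, not the actual frag_find code: `zlib.crc32` stands in for the fast 32-bit hash, a Python dict stands in for both the Bloom filter and the final list of hashes, and the function names, the 512-byte sector size, and the `min_run` parameter are assumptions made for the example.

```python
import hashlib
import zlib

SECTOR_SIZE = 512  # assumed sector size for this sketch


def sectors(data):
    """Split data into whole sectors, dropping any trailing partial sector."""
    count = len(data) // SECTOR_SIZE
    return [data[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE] for i in range(count)]


def find_runs(target, image, min_run=2):
    """Find runs of consecutive image sectors that match consecutive
    target sectors.

    Returns a sorted list of (image_sector, target_sector, length) tuples
    for every run of at least min_run sectors.
    """
    target_sectors = sectors(target)
    # Stage 1: cheap 32-bit hash used as a fast pre-filter.
    cheap = {zlib.crc32(s) for s in target_sectors}
    # Later stages: a dict of SHA-1 digests stands in for the Bloom filter
    # and the final hash list of the real tool.
    exact = {}
    for idx, s in enumerate(target_sectors):
        exact.setdefault(hashlib.sha1(s).digest(), []).append(idx)

    runs = []
    # Maps the target index expected next to (image_start, target_start, length).
    active = {}
    for img_idx, s in enumerate(sectors(image)):
        matches = []
        if zlib.crc32(s) in cheap:  # SHA-1 is computed only when needed
            matches = exact.get(hashlib.sha1(s).digest(), [])
        next_active = {}
        for t_idx in matches:
            img_start, t_start, length = active.pop(t_idx, (img_idx, t_idx, 0))
            next_active[t_idx + 1] = (img_start, t_start, length + 1)
        # Runs still in active were not extended by this sector, so they end.
        runs.extend(r for r in active.values() if r[2] >= min_run)
        active = next_active
    runs.extend(r for r in active.values() if r[2] >= min_run)
    return sorted(runs)
```

Requiring a minimum run length is what handles non-unique blocks in this sketch: an isolated match on a common sector (for example, one filled with a repeated byte) is discarded, while a sequence of matches in the same order as the target file is reported.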
We are working on two research projects that have not yet produced any tools:
- A technique that can rapidly characterize the content of a large hard drive through statistical sampling. We believe that this system will be able to accurately report the percentage of encrypted data on a 1TB hard drive with less than 10 seconds of analysis. For further information please see the page on Sub-Linear drive analysis.
- A technique for ascribing carved data to a particular individual who created that data.
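For the first project above, each sampled block must be classified before the percentages can be reported. The text does not say which classifier is used; one common heuristic, shown here purely as an assumption, is to measure the Shannon entropy of each block. The function names and the threshold of 7.0 bits per byte are hypothetical choices for this sketch.

```python
import math
from collections import Counter


def shannon_entropy(block):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())


def looks_random(block, threshold=7.0):
    """Heuristic only: high entropy suggests encrypted OR compressed data."""
    return shannon_entropy(block) >= threshold
```

Entropy alone cannot distinguish encrypted data from well-compressed data, so a practical classifier would need additional tests. The sketch only shows where a per-block decision would plug into a sampling loop.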