Automated Computer Forensics
We are developing a variety of techniques and tools for performing Automated Document and Media Exploitation (ADOMEX). This research has several thrusts:
- Developing open source tools for working with electronic evidence. This work is part of the AFF project.
- Developing an unclassified Real Data Corpus (RDC) consisting of "real data from real people" that can be used to develop new algorithms and test automated tools.
- Developing new algorithms and approaches for working in a "data-rich environment."
Recent Research Developments
File-based forensics relies on the analysis of files, deleted files, and orphan files. Most forensics currently performed for law enforcement, commercial e-discovery, and intelligence purposes is file-based. The goal is typically to find a specific file that can be shown to a jury or that contains actionable intelligence. File forensics is typically performed using programs such as EnCase, FTK, or SleuthKit.
- We have developed a batch analysis tool called fiwalk that takes a disk image and produces an XML file describing all of the files, deleted files, and orphan files in the image, together with all of the extracted file metadata. This XML file can be used as input for further automated media processing. Using this system we have created a variety of applications for reporting on and manipulating disk images. We have also developed an efficient system for remote file-level access to disk images using XML-RPC and REST. Details can be found in our paper.
- We have developed a prototype system for performing automated media forensic reporting. Based on PyFlag, the system performs an in-depth analysis of captured media, locates local and online identities, and presents summary information in a report that is tailored to the consumer of forensic intelligence.
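To illustrate how an XML file of this kind can feed further automated processing, the following minimal sketch parses a small hand-written sample and separates allocated files from deleted ones. The element names (`fileobject`, `filename`, `filesize`, `alloc`), the sample document, and the `summarize` function are assumptions made for illustration; the schema actually produced by fiwalk may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real fiwalk output may use other element names.
SAMPLE_XML = """
<fiwalk>
  <fileobject>
    <filename>docs/report.doc</filename><filesize>20480</filesize><alloc>1</alloc>
  </fileobject>
  <fileobject>
    <filename>tmp/old.jpg</filename><filesize>51200</filesize><alloc>0</alloc>
  </fileobject>
  <fileobject>
    <filename>notes.txt</filename><filesize>512</filesize><alloc>1</alloc>
  </fileobject>
</fiwalk>
"""


def summarize(xml_text):
    """Return (allocated_names, deleted_names, total_bytes) for all file objects."""
    root = ET.fromstring(xml_text.strip())
    allocated, deleted, total = [], [], 0
    for fo in root.iter("fileobject"):
        name = fo.findtext("filename")
        size = int(fo.findtext("filesize", default="0"))
        # Assumption: <alloc>1</alloc> marks a file that is still allocated.
        is_allocated = fo.findtext("alloc", default="0") == "1"
        (allocated if is_allocated else deleted).append(name)
        total += size
    return allocated, deleted, total


if __name__ == "__main__":
    allocated, deleted, total = summarize(SAMPLE_XML)
    print("allocated:", allocated)
    print("deleted:  ", deleted)
    print("bytes:    ", total)
```

Because the downstream program only reads XML, it needs no knowledge of the file system that was on the original disk image.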
Bulk Data Forensics
Bulk Data Forensics is based on the bulk analysis of disk images and other kinds of forensic source data. Carving is a traditional form of bulk forensics. We see bulk forensics as a complement to existing forensic processing, rather than as a replacement for it.
Bulk data forensics has several important advantages over traditional forensic processing:
- It's faster, because the disk head scans the disk (or disk image) from beginning to end without having to seek from file to file.
- It can tolerate media that is damaged or incomplete, since the forensic processing does not require the reconstruction of file allocation tables, disk directories, or metadata.
- It works with obscure or unknown operating systems, since no attempt is made to reconstruct the file system or other operating system structures.
- It lends itself to statistical processing. Instead of scanning the entire disk image, the image can be sampled.
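These properties can be illustrated with a minimal sketch, which is not taken from any of our tools. It reads a raw image from front to back in fixed-size chunks and counts email-like strings without interpreting any file system structure. The function name `email_histogram`, the simplified regular expression, and the chunk and overlap sizes are all assumptions chosen for illustration; the overlap is assumed to be longer than any address of interest.

```python
import io
import re
from collections import Counter

# Simplified pattern (an assumption); real extractors recognize many more forms.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def email_histogram(stream, chunk_size=1 << 20, overlap=256):
    """Scan a byte stream sequentially and count email-like strings."""
    counts = Counter()
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        at_end = not chunk
        buf += chunk
        # Matches ending past `safe` might continue in the next chunk,
        # so they are deferred until more data has been read.
        safe = len(buf) if at_end else max(0, len(buf) - overlap)
        cut = safe
        for m in EMAIL_RE.finditer(buf):
            if m.end() > safe:
                cut = min(cut, m.start())
                break
            counts[m.group().lower().decode("ascii")] += 1
        if at_end:
            return counts
        buf = buf[cut:]  # keep only the unprocessed tail


if __name__ == "__main__":
    fake_image = io.BytesIO(
        b"\x00\x00 alice@example.com \xff bob@example.org \x00 alice@example.com"
    )
    for address, n in email_histogram(fake_image).most_common():
        print(n, address)
```

The same loop works on a damaged or truncated image, since it never depends on allocation tables or directories being intact.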
We have developed several interesting tools for bulk data forensics:
- frag_find is a tool that reports whether sectors of a TARGET file are present on a disk image. This is useful in cases where a TARGET file has been stolen and you wish to establish that the file was once present on a subject's drive. If most of the TARGET file's sectors are found on the IMAGE drive, and if those sectors appear in consecutive sector runs, then the chances are excellent that the file was once there. frag_find uses a three-stage filter: first a high-speed but low-quality 32-bit hash, then a Bloom filter of SHA1 hashes, and finally a linked list of all hashes. As a result, disk sectors are hashed only when necessary, allowing processing speeds of 50K sectors/sec on standard hardware. The program deals with the problem of non-unique blocks by looking for runs of matching blocks, rather than individual blocks. frag_find is part of the NPS Bloom package, which can be downloaded from http://www.afflib.org.
- bulk_extractor is a tool that searches for recognized features in the bulk data and performs histogram analysis on the result. You can write your own feature extractors using flex. We provide bulk_extractor with an extractor that finds complete email addresses and domain names. The most common email address on a hard drive is usually that of the drive's primary user; other top-occurring email addresses tend to belong to that person's primary correspondents. By looking at the list of email addresses sorted by frequency, it is easy to rapidly infer the user's social network.
- CDA Tool takes the results of the bulk_extractor tool and performs a Cross-Drive Analysis. This can allow an investigator to discover previously unknown social networks in a set of hard drives, or to see if a newly acquired hard drive belongs to an existing social network. Currently the tool simply prints a report of the amount of connection between each drive. We plan to expand this tool to show the results graphically and to allow the analyst to drill down and see the cause of the connections.
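The three-stage filter and run detection described for frag_find can be sketched as follows. This is a rough illustration and not the code in the NPS Bloom package: `zlib.adler32` stands in for the fast 32-bit hash, the small `Bloom` class and its parameters are hypothetical, and a dictionary stands in for the final linked list of hashes.

```python
import hashlib
import zlib

SECTOR = 512


def sectors(data):
    """Split bytes into full sectors (a trailing partial sector is ignored)."""
    return [data[i:i + SECTOR] for i in range(0, len(data) - SECTOR + 1, SECTOR)]


class Bloom:
    """Tiny Bloom filter that derives its bit positions from a SHA-1 digest."""

    def __init__(self, bits=1 << 16, k=4):
        self.bits, self.k = bits, k
        self.array = bytearray(bits // 8)

    def _positions(self, digest):
        # Use k separate 4-byte slices of the 20-byte digest as hash values.
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def add(self, digest):
        for p in self._positions(digest):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, digest):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(digest))


def find_runs(target, image, min_run=2):
    """Return (image_sector, target_sector, length) for runs of matching sectors."""
    fast, bloom, exact = set(), Bloom(), {}
    for idx, sec in enumerate(sectors(target)):
        digest = hashlib.sha1(sec).digest()
        fast.add(zlib.adler32(sec))
        bloom.add(digest)
        exact.setdefault(digest, []).append(idx)

    runs, active = [], {}  # active: last target index -> (image start, target start)
    img = sectors(image)
    for i in range(len(img) + 1):  # one extra pass closes runs at the end
        matched = []
        if i < len(img) and zlib.adler32(img[i]) in fast:   # stage 1: cheap hash
            digest = hashlib.sha1(img[i]).digest()          # SHA-1 only when needed
            if digest in bloom:                             # stage 2: Bloom filter
                matched = exact.get(digest, [])             # stage 3: exact lookup
        nxt = {}
        for t in matched:
            # Extend the run that ended at target sector t-1, or start a new one.
            nxt[t] = active.pop(t - 1, (i, t))
        for i0, t0 in active.values():                      # runs that just ended
            if i - i0 >= min_run:
                runs.append((i0, t0, i - i0))
        active = nxt
    return runs
```

Requiring a minimum run length is what keeps common sectors, such as blocks of zeros, from being reported as evidence on their own.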
We are also developing a system that can rapidly characterize the content of a large hard drive through statistical sampling. We believe that this system will be able to accurately report the percentage of encrypted data on a 1TB hard drive with less than 10 seconds of analysis. For further information please see Sub-Linear_Drive_Analysis.
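The idea behind sampling can be shown with a minimal sketch. It picks random sectors, classifies each by byte entropy, and reports the estimated fraction with a normal-approximation margin of error. The function names, the 7.0 bits-per-byte threshold, and the use of entropy as a classifier are assumptions for illustration only; high entropy alone cannot separate encrypted data from compressed data, and this is not the method used by the system described above.

```python
import math
import random
from collections import Counter

SECTOR = 512


def byte_entropy(block):
    """Shannon entropy of a block, in bits per byte (0.0 to 8.0)."""
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())


def estimate_high_entropy_fraction(read_sector, total_sectors, samples, rng,
                                   threshold=7.0):
    """Estimate the fraction of high-entropy sectors from a random sample.

    read_sector(i) must return the bytes of sector i. Returns (estimate, margin),
    where margin is an approximate 95% half-width (normal approximation).
    """
    picks = rng.sample(range(total_sectors), min(samples, total_sectors))
    hits = sum(byte_entropy(read_sector(i)) >= threshold for i in picks)
    p = hits / len(picks)
    margin = 1.96 * math.sqrt(p * (1 - p) / len(picks))
    return p, margin


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic "drive": 30% random-looking sectors, 70% constant sectors.
    drive = [bytes(rng.getrandbits(8) for _ in range(SECTOR)) for _ in range(300)]
    drive += [b"A" * SECTOR] * 700
    rng.shuffle(drive)
    p, margin = estimate_high_entropy_fraction(lambda i: drive[i], len(drive), 200, rng)
    print(f"high-entropy fraction: {p:.2f} +/- {margin:.2f}")
```

The cost depends on the number of samples rather than on the size of the drive, which is why a very large drive can be characterized quickly.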
- "AFF: A New Format for Storing Hard Drive Images," Garfinkel, S., Communications of the ACM, February, 2006
- "Automating Disk Forensic Processing with SleuthKit, XML and Python," Fourth International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering (IEEE/SADFE'09), May 2009
- A Framework for Automated Digital Forensic Reporting, Lt. Paul Farrell, Master's Thesis, Naval Postgraduate School, Monterey, CA, March 2009
- "Practical Applications of Bloom filters to the NIST RDS and hard drive triage," Farrell, Garfinkel and White, ACSAC 2008
- "Forensic Feature Extraction and Cross-Drive Analysis," Garfinkel, S., Digital Investigation, Volume 3, Supplement 1, September 2006, Pages 71--81.
- "Standardizing Digital Evidence Storage," The Common Evidence Format Working Group (Carrier, B., Casey, E., Garfinkel, S., Kornblum, J., Hosmer, C., Rogers, M., and Turner, P.), Communications of the ACM, February, 2006.