Real Data Corpus

From Simson Garfinkel

The Real Data Corpus (RDC) is a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies have shown that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we have created a data set that closely mimics data as it is found in the real world.

Potential Uses

The Real Data Corpus is a one-of-a-kind scientific resource for:

  • Developing and validating forensic and data recovery tools.
  • Training students in forensics and data recovery.
  • Developing and validating document translation software.
  • Exploring and characterizing real-world computing practices, configuration choices, and option settings.
  • Studying the storage allocation strategies of file systems under real-world conditions.

For more information about the Real Data Corpus, please see:

Current Contents

As of February 21, 2011, the Non-US Person's Corpus consists of the following:

  • 1,289 hard drive images, ranging in size from 500 MB to 80 GB.
  • 643 flash memory images (USB, Sony Memory Stick, SD, and other), ranging from 128 MB to 4 GB.
  • 98 CD-ROMs.

In total, this amounts to 36 TB of uncompressed data.
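As a rough plausibility check on these figures, the per-device size ranges bound the possible total. The sketch below assumes decimal units (1 GB = 10^9 bytes) and leaves out the CD-ROMs, whose sizes are not listed above; the function and variable names are illustrative only.

```python
# Decimal storage units (an assumption; the page does not say which convention it uses).
MB = 10**6
GB = 10**9
TB = 10**12

# (image count, smallest image, largest image), taken from the list above.
HARD_DRIVES = (1289, 500 * MB, 80 * GB)
FLASH_MEDIA = (643, 128 * MB, 4 * GB)


def total_bounds(groups):
    """Return (lowest, highest) possible total size in bytes.

    The lowest total assumes every image is the smallest size in its
    group; the highest assumes every image is the largest.
    """
    lowest = sum(count * smallest for count, smallest, _ in groups)
    highest = sum(count * largest for count, _, largest in groups)
    return lowest, highest


if __name__ == "__main__":
    lowest, highest = total_bounds([HARD_DRIVES, FLASH_MEDIA])
    print(f"lowest possible total:  {lowest / TB:.2f} TB")
    print(f"highest possible total: {highest / TB:.2f} TB")
```

The stated 36 TB falls between the two bounds, which is all this check can show; it says nothing about the actual distribution of image sizes.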

Access and Availability

The Real Data Corpus can be distributed to sponsors and collaborators as a set of encrypted AFF files. The files are encrypted with AES-256 using AFF's built-in encryption, keyed by either a passphrase or X.509 PKI. Several access methods are available:

  • Disk images can be downloaded over the Internet from a secure server using SSL by authorized researchers.
  • Alternatively, we can package the files onto portable terabyte USB hard drives or optical tape.
  • Researchers can be given an account on a multi-user Linux computer on which all of the corpora reside.
  • Finally, we have developed a remote access framework: we publish XML files of each drive’s metadata; you select which sectors you need and download them over the Internet using our XML-RPC framework.
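The remote access workflow in the last item (read a drive's XML metadata, work out which sectors are needed, then request only those) might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the XML layout (`drive`, `file`, and `run` elements with `offset` and `length` attributes), the drive identifier, and the remote method name `read_sectors` are hypothetical and are not taken from the corpus's actual schema or XML-RPC interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata layout, invented for this sketch.
SAMPLE_METADATA = """\
<drive id="example-0001" sector_size="512">
  <file name="report.doc">
    <run offset="1048576" length="2000"/>
    <run offset="4194304" length="512"/>
  </file>
  <file name="notes.txt">
    <run offset="8192" length="100"/>
  </file>
</drive>
"""


def sector_ranges(metadata_xml, filename):
    """Return (first_sector, sector_count) pairs covering one file's byte runs."""
    drive = ET.fromstring(metadata_xml)
    sector_size = int(drive.get("sector_size"))
    ranges = []
    for entry in drive.iter("file"):
        if entry.get("name") != filename:
            continue
        for run in entry.iter("run"):
            offset = int(run.get("offset"))
            length = int(run.get("length"))
            first = offset // sector_size
            last = (offset + length - 1) // sector_size  # sector holding the final byte
            ranges.append((first, last - first + 1))
    return ranges


def fetch_sectors(server, drive_id, ranges):
    """Request each sector range from a server object.

    `server` is anything exposing read_sectors(drive_id, first, count);
    that method name is a placeholder, not a documented corpus API.
    """
    return [server.read_sectors(drive_id, first, count) for first, count in ranges]


if __name__ == "__main__":
    print(sector_ranges(SAMPLE_METADATA, "report.doc"))
```

With a real endpoint, `server` could be a standard-library `xmlrpc.client.ServerProxy(url, use_builtin_types=True)`, which returns binary payloads as `bytes`. Computing the ranges locally first means only the sectors that back the file of interest cross the network, rather than the whole image.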

Research Results

We have used this unique resource for a variety of research purposes:

  1. Our first publication [1] alerted the community to the scale of the problem of data left on repurposed hard drives. Following this publication, the US government passed legislation creating an affirmative responsibility on the part of American businesses to purge consumer information from hard drives before discarding them [2], and two major products were introduced that use cryptography to rapidly "shred" information stored on magnetic media [3][4].
  2. We conducted a "trace-back" study in which we contacted 20 organizations whose data was on the hard drives that we obtained. Based on interviews, we were able to identify the technical and organizational failures that resulted in the data compromises [5].
  3. We have identified patterns and principles for promoting secure human-computer interaction.
  4. While building this resource, we developed a new file format for storing disk images [6][7], and we are developing a new technique for mapping social networks among individuals whose data is on captured hard drives. These approaches could be used, for example, to allow the rapid and automated analysis of disk drives seized during the course of a police investigation or obtained as part of military operations.
  5. We have used the corpus to evaluate the effectiveness of today's computer forensic tools.
  6. We are using the corpus to create new computer forensic tools and techniques.
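To make the idea in item 4 of linking drives through the people whose data they hold more concrete, here is a deliberately simplified sketch. It is a stand-in for illustration, not the technique under development: it scans raw image bytes for email-like strings and reports which pairs of drives share an address. The regular expression, function names, and drive labels are all assumptions.

```python
import re
from collections import defaultdict
from itertools import combinations

# A loose pattern for email-like byte strings; it will miss some valid
# addresses and accept some invalid ones.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(raw):
    """Return the set of lower-cased email-like strings found in raw bytes."""
    return {m.group().decode("ascii").lower() for m in EMAIL_RE.finditer(raw)}


def shared_address_links(images):
    """Map each pair of drive labels to the addresses found on both drives.

    `images` maps a drive label to that drive's raw bytes.
    """
    drives_by_address = defaultdict(set)
    for drive_id, raw in images.items():
        for address in extract_emails(raw):
            drives_by_address[address].add(drive_id)

    links = defaultdict(set)
    for address, drives in drives_by_address.items():
        # Every pair of drives holding the same address gets a link.
        for pair in combinations(sorted(drives), 2):
            links[pair].add(address)
    return dict(links)


if __name__ == "__main__":
    sample = {
        "drive-a": b"\x00alice@example.com\x00bob@example.org\x00",
        "drive-b": b"mail from ALICE@Example.com today",
        "drive-c": b"no addresses here",
    }
    print(shared_address_links(sample))
```

Scanning raw bytes rather than parsed files means the sketch also sees addresses in deleted or unallocated areas, at the cost of more false matches. A real analysis would read images in chunks instead of loading them whole and would need to handle non-ASCII encodings.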


  1. S. Garfinkel and A. Shelat. "Remembrance of Data Passed: A Study of Disk Sanitization Practices," IEEE Security and Privacy, January/February 2003.
  2. US Congress. Fair and Accurate Credit Transactions Act of 2003.
  3. Seagate Technology. Momentus Family Overview, 2006.
  4. Decru. DataFort Security Appliances, 2005.
  5. S. Garfinkel. "Design Principles and Patterns for Computer Systems that are Simultaneously Secure and Usable," PhD Thesis, Massachusetts Institute of Technology, June 2005.
  6. S. Garfinkel. "AFF: A New Format for Storing Hard Drive Images," Communications of the ACM, February 2006.
  7. S. Garfinkel, D. Malan, K. Dubec, C. Stevens, and C. Pham. "Disk Imaging with the Advanced Forensics Format, Library and Tools," The Second Annual IFIP WG 11.9 International Conference on Digital Forensics, National Center for Forensic Science, Orlando, Florida, USA, January 29 - February 1, 2006.