
Managing Data Sets: Data Preservation

File Format

The file format in which you record, store, and transmit your data is a primary factor in whether you and others will be able to use that data in the future. Plan for both hardware and software obsolescence.

Formats likely to be accessible in the future are:

  1. Non-proprietary (e.g. use .txt rather than MS Word .doc)
  2. Open, with documented standards
  3. In common usage by the research community
  4. Encoded with standard character encodings (e.g. ASCII or UTF-8)
  5. Unencrypted
  6. Uncompressed
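The character-encoding criterion above can be checked mechanically. The following is a minimal sketch, not a definitive tool: it assumes the file is small enough to read into memory, and the function name `detect_text_encoding` is a hypothetical helper introduced here for illustration.

```python
from pathlib import Path


def detect_text_encoding(path):
    """Return 'ascii' or 'utf-8' if the file decodes cleanly, else None.

    Hypothetical helper: it only tests the two encodings named in the
    list above and does not try to guess any other encoding.
    """
    data = Path(path).read_bytes()
    # ASCII is a strict subset of UTF-8, so test the narrower one first.
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None
```

A result of `None` suggests the file is either binary or uses a legacy encoding and may be worth converting before it is archived.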

Examples of preferred format choices:

  • Image: JPEG, JPEG 2000, PNG, TIFF
  • Text: HTML, XML, PDF/A, UTF-8, ASCII
  • Audio: AIFF, WAVE
  • Containers: TAR, GZIP, ZIP
  • Databases: prefer XML or CSV to native binary formats
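As an illustration of the database recommendation above, a table held in a native binary format can be exported to CSV. This is a sketch under stated assumptions: it assumes the data lives in a SQLite file (chosen only because Python ships with SQLite support), and `export_table_to_csv` is a hypothetical helper name, not part of any referenced guideline.

```python
import csv
import sqlite3


def export_table_to_csv(db_path, table, csv_path):
    """Write every row of `table` to a UTF-8 CSV file with a header row.

    Returns the list of column names that was written as the header.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted}")
        header = [column[0] for column in cursor.description]
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(cursor)
    finally:
        conn.close()
    return header
```

CSV does not record column types or relationships between tables, so it may be worth keeping a short plain-text description of the schema alongside the exported files.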

Examples of discouraged format choices:

  • Word (prefer PDF)
  • QuickTime (prefer MPEG-4)
  • GIF (which uses LZW compression, historically covered by patents that have since expired)

For further clarification:

CDL Digital File Format Recommendations

University of Texas Recommended file formats

Digital Preservation (Library of Congress): Maintained by the National Digital Information Infrastructure and Preservation Program, this site contains digital preservation resources, news, reports, and standards.

Backup and Security

The California Digital Library provides a useful overview of data security, storage, and backup guidelines.

  • Make three copies of your data (e.g. original + external/local + external/remote) and keep them geographically distributed.
  • Backup options include local storage (hard drive, CD/DVD, flash drive, departmental server) or cloud-based storage (Pepperdine Digital Commons, Amazon S3, or Carbonite).
  • Secure your data, but store it unencrypted and uncompressed where possible so that it remains readable in the future.
  • Test your backup system periodically.
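The copy-and-test steps above can be sketched in a few lines. This is an illustrative example only: the backup locations are assumed to be ordinary directories (a mounted external drive or a synced folder, for instance), a true remote or cloud copy would use that provider's own tools, and the helper names `back_up` and `verify_backups` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dirs):
    """Copy `source` into each backup directory and return the copy paths."""
    source = Path(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        copies.append(Path(shutil.copy2(source, directory / source.name)))
    return copies


def verify_backups(source, copies):
    """Return the copies that are missing or whose checksum differs from the original."""
    expected = sha256_of(source)
    return [c for c in copies if not Path(c).is_file() or sha256_of(c) != expected]
```

Running `verify_backups` on a schedule is one way to test a backup system periodically: an empty result means every copy still matches the original byte for byte.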

Other Online Resources

UK Data Archive - Storage and Backup

Backups and Security from MIT Libraries

Organizing Data Files

Guidelines from MIT Libraries

Guidelines from California Digital Library

EZID is a service that makes it simple for digital object producers (researchers and others) to obtain and manage identifiers for their objects. You can assign identifiers to anything: scientific datasets, technical reports, audio files, digital photographs, and so forth, as well as non-digital objects. This is a fee-based service.

DataONE Dash is a self-service tool for researchers to describe, upload, and share their research data via ONEShare, a member repository of the DataONE network.