Stony Brook University

Research Data: Archiving & Preservation

Data Archiving & Preservation

The difference between backing up & archiving

The terms "backup" and "archiving" are often used interchangeably because both involve saving a specific version of a file, but they are very different processes. A backup is a copy of files that are expected to keep changing; backups are kept for a set period and can be discarded once that period has passed. Archiving preserves a file as-is, often at the end of a project, so that it serves as a static (and usually final) record. [source - DataONE education module]

Plan ahead to preserve your data

In addition to planning for local archive storage options (local server, network or SBU’s digital repository), we recommend that you investigate public data repositories within your subject area or discipline. A searchable list of repositories can be found here, and a list of repositories by discipline is here. See Data Repositories for more information on that option.

In many cases, SBU’s Open Access repository Academic Commons Data can be a suitable archive and sharing mechanism for your data. All items deposited into Academic Commons receive a persistent identifier (DOI or ARK), are freely available to anyone, and are full-text searchable, making them discoverable through Google, Google Scholar and other large search engines. If you are interested in depositing data into Academic Commons, or have further questions, please contact me.

Things to consider when archiving your data

  • File formats for long-term access: The file format in which you keep your data is a primary factor in whether the data can still be used in the future. Plan for both hardware and software obsolescence. See the section Organizing Files and File Formats for details on preferable long-term storage file formats.
  • Don’t forget the documentation and metadata: Document your research and data so others can interpret the data. It is important to begin to document your data at the very beginning of your research project and continue throughout the project.
  • SBU data retention policy
    University faculty and researchers have a responsibility to maintain research data and make that data available for preservation by the University both as a matter of research integrity, and because of the University’s ownership rights. Research data must be archived for five years after the closeout, final reporting or publication of a project, with original data retained wherever possible. Additional data sharing and/or archiving requirements may be imposed by the sponsoring agency; the PI is responsible for complying with such requirements.
  • Ownership and privacy
    Make sure that you have considered the implications of sharing data, in terms of copyright and IP ownership, and ethical requirements like privacy and confidentiality. Data generated by research projects at or under the auspices of Stony Brook University are owned by the University. However, the principal investigator (PI) is responsible for retention, preservation, distribution, and control of the data.
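
As an illustration of the documentation and retention points above, the following Python sketch writes a small metadata "sidecar" file next to a data file and records the earliest date on which the five-year retention period would end. It is a minimal sketch, not an SBU or Academic Commons requirement: the field names, the `.metadata.json` suffix, and the functions `retain_until` and `write_sidecar` are all assumptions made for this example, and a sponsoring agency may require a longer period or a specific metadata standard.

```python
import json
from datetime import date
from pathlib import Path

# Minimum retention period stated in the policy above; a sponsor may require more.
RETENTION_YEARS = 5


def retain_until(closeout: date, years: int = RETENTION_YEARS) -> date:
    """Return the date `years` after the closeout (or publication) date."""
    try:
        return closeout.replace(year=closeout.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        return closeout.replace(year=closeout.year + years, day=28)


def write_sidecar(data_file: Path, title: str, creator: str,
                  description: str, closeout: date) -> Path:
    """Write a JSON metadata file next to `data_file` and return its path.

    The field names are hypothetical, chosen only for this illustration.
    """
    record = {
        "title": title,
        "creator": creator,
        "description": description,
        "data_file": data_file.name,
        "file_format": data_file.suffix.lstrip("."),
        "closeout_date": closeout.isoformat(),
        "retain_until": retain_until(closeout).isoformat(),
    }
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Calling `write_sidecar(Path("survey.csv"), "Survey results", "A. Researcher", "Cleaned responses", date(2020, 6, 30))` would create `survey.csv.metadata.json` alongside the data file. Writing the record as plain-text JSON keeps the documentation itself in an open format.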

Maintaining the integrity of your data

Digital data are fragile, regardless of which storage medium you choose (DVD, hard disk, tapes, etc.). They are susceptible to bit rot, the gradual corruption of stored bits, and are likely to degrade over time. The recommended methods for combating bit rot are refreshment and replication.

Refreshment: Periodically copy your data onto a new drive or disk (every 2-5 years).
Replication: Maintain your original copy, an external copy, and an external remote copy. Use at least two forms of storage in two different locations.
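
One way to check that a refreshed or replicated copy still matches the original is to compare checksums of the files in each copy. The Python sketch below is a minimal illustration under stated assumptions, not a prescribed tool: the function names (`sha256_of`, `build_manifest`, `compare_copies`) and the choice of SHA-256 are made up for this example, and it uses only the Python standard library.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under `root` (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(original: Path, copy: Path) -> list[str]:
    """List relative paths that are missing from one copy or differ between copies."""
    a = build_manifest(original)
    b = build_manifest(copy)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `compare_copies(Path("original"), Path("replica"))` would suggest the two copies are byte-for-byte identical; any path it returns points to a file worth restoring from another copy. Saving the output of `build_manifest` at archive time also gives a baseline to check against at each later refreshment.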

For long-term archiving of finalized data, personal computers and external storage devices are NOT recommended.


Software Obsolescence

Does anyone remember Quattro Pro or Lotus 1-2-3? Exactly. When you archive the final version of your dataset(s), consider using an open, non-proprietary format to ensure that you will be able to fully access your data in the future. Common open formats for data include plain text (ASCII), HDF and NetCDF. Multimedia formats include JPEG 2000, MNG and PNG. For a list of many other open formats, see here.

If you prefer to keep your data in a proprietary format, there are a couple of ways to ensure continued access to older datasets. When new software versions are released and become established, migrate your older datasets to the newer version or package. If the software becomes obsolete, you may be able to emulate it using a virtual machine. The recommended best practice, however, is to convert your data to an open format, which facilitates both preservation and sharing.

Adapted from: University of Oregon | University of Virginia