The long-term preservation strategy of The Language Archive consists of two parts: replicating the data so that the bit-streams are more likely to survive in the long run, and limiting the set of archival formats so that future conversions remain feasible should one of the formats become obsolete.

Data replication

To prevent data loss in case of technical failure or force majeure, all data in The Language Archive is replicated five times, in a number of different locations. The main archival copy resides at the MPI for Psycholinguistics in Nijmegen and is backed up on tape. The MPI uses a Hierarchical Storage Management system called Oracle SAM-QFS, which automatically stores one copy on hard disk arrays and two copies on LTO-5 magnetic tape. This system keeps MD5 checksums of each archived file, which are used to verify the integrity of the bit-streams. An additional backup is stored at the Rechenzentrum Garching (RZG), near Munich, one of the data centers of the Max Planck Society. The RZG creates another backup copy at the Leibniz-Rechenzentrum (LRZ), which is also located in Garching. A further backup is stored at the Göttingen Society for Scientific Data Processing (GWDG) in Göttingen, another data center of the Max Planck Society. The GWDG in turn stores a backup at the University Medical Center in Göttingen. In total, this means that there are at least 6 copies of each file in 5 different buildings in 3 geographically distinct locations. The bit-stream preservation of the copies that reside at the data centers of the Max Planck Society is guaranteed for 50 years.
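As an illustration of how stored MD5 checksums can be used to verify a replica, the following Python sketch recomputes the digest of each file and compares it with a recorded value. It is not the SAM-QFS implementation: the manifest layout (a mapping from relative path to hex digest) and the function names `md5_of_file` and `verify_replica` are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_replica(manifest, replica_root):
    """Check every file listed in the manifest against one replica.

    manifest: dict mapping relative paths to hex MD5 digests
              (a hypothetical layout chosen for this sketch).
    Returns a sorted list of relative paths that are missing or whose
    recomputed digest differs from the recorded one.
    """
    root = Path(replica_root)
    failures = []
    for relative_path, expected in manifest.items():
        candidate = root / relative_path
        if not candidate.is_file() or md5_of_file(candidate) != expected.lower():
            failures.append(relative_path)
    return sorted(failures)
```

Reading in chunks keeps memory use constant for large media files. An empty result from `verify_replica` means every listed file is present and matches its recorded checksum.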

Data is replicated from the MPI in Nijmegen to the RZG in Garching with iRODS, and from the MPI in Nijmegen to the GWDG in Göttingen with rsync.

As part of the EUDAT project, funded by the European Commission, a further copy of the data is transferred from the RZG in Garching to the SURFsara data center in Amsterdam. This copy is also transferred using iRODS.

Archival file formats

The Language Archive archives only a limited set of file formats. These formats are chosen according to the following criteria, which may sometimes conflict with one another:

  • openness of the format and/or availability of full specifications
  • established standards or de facto standards within the research domain
  • assessment of the longevity of the format
  • no lossy compression if feasible
  • no binary formats if feasible
  • textual data in XML formats and Unicode UTF-8 encoding if feasible
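The last criterion is one that can be checked mechanically. The Python sketch below tests whether a file's raw bytes are valid UTF-8 and whether they parse as well-formed XML. It is an illustration only, not the check performed by the archive's own tools, and the function name `check_text_file` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def check_text_file(data):
    """Report whether raw bytes are valid UTF-8 and well-formed XML.

    The two properties are tested independently: an XML file that
    declares another encoding can be well-formed without being UTF-8.
    """
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

    try:
        ET.fromstring(data)  # parses bytes, honouring any encoding declaration
        is_xml = True
    except ET.ParseError:
        is_xml = False

    return {"utf8": is_utf8, "well_formed_xml": is_xml}
```

Well-formedness is a weaker property than validity against a schema. Checking a file against the schema of a specific annotation or metadata format would need a validating parser, which is outside the scope of this sketch.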

A list of currently accepted archival formats is available in Appendix A of the manual of the LAMUS archive upload and management tool.