Storing and preserving research data for the long term, and giving researchers and other interested groups access to these resources, is one of the key challenges facing data-driven research today. In the area of languages and cultures this is even more important, since we are faced with a dramatic loss and change of languages and cultures. Of the roughly 6000-7000 languages in the world, on average one dies every week, and due to globalization almost all languages and cultures are subject to extreme changes, which even raises questions about the stability of our societies. Therefore, documenting the state of languages and human language processing through experiments and observations is essential, but it needs to be accompanied by proper data management to ensure that all these digital resources will remain available to researchers and the interested public in the coming decades.

The Data Archive at the Max Planck Institute for Psycholinguistics stores a large amount of unique material from a wide variety of languages worldwide, recorded and analyzed by researchers from different linguistic disciplines. Two collections deserve particular mention: the DOBES program on Documenting Endangered Languages, funded by the Volkswagen Foundation, and the digitization of the Human Ethology Archive of Irenäus Eibl-Eibesfeldt.

The Data Archive at TLA/MPI-PL currently contains:

  • about 80 terabytes of well-described resources
  • about 20,000 hours of digitized audio/video recordings
  • about 110,000 sessions described with metadata
  • about 5 million annotated segments
  • data on more than 200 languages
  • among these, data from about 60 DOBES teams
  • acquisition, speech, multimodal, multilingual, language and cognition, brain imaging, ethnological and other data.

The archive can be accessed through our Metadata Browser or via the Java-based IMDI browser. Note, however, that not all data are immediately freely accessible: for legal and ethical reasons concerning personal privacy rights and intellectual property rights, part of the data is subject to differentiated access rules.

The archive is maintained to serve two main goals:

  • Maintaining access to all stored resources for the current generation of researchers, language communities and the interested public.
  • Preserving the valuable cultural heritage for current and future generations.

To achieve these goals, six copies of every resource are kept at locations distributed across Germany and the Netherlands. In addition, 11 regional repositories, using the same archiving software as TLA, have been set up to bring the material closer to the communities and researchers. To make sure that the archived data will remain interpretable in the future, all file formats are checked and files are converted when formats become obsolete. To restore resources and services in the event of a disaster at the Nijmegen location, a preliminary disaster recovery plan has been written. The archive’s workflow for ingesting, archiving and disseminating its holdings is also described in a document.
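
The text above does not say how the distributed copies are kept consistent with one another. As a minimal sketch, and assuming only that each copy is reachable as a file path, a checksum comparison could flag copies that have gone missing or diverged from a reference copy. The function names `sha256_of` and `find_divergent_copies` are hypothetical and are not part of the TLA archiving software.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_divergent_copies(reference: Path, copies: list[Path]) -> list[Path]:
    """Return the copies that are missing or whose content differs from the reference.

    Hypothetical helper: the paths stand in for whatever storage locations
    an archive actually uses.
    """
    expected = sha256_of(reference)
    return [
        copy
        for copy in copies
        if not copy.is_file() or sha256_of(copy) != expected
    ]
```

An empty result means every listed copy matches the reference byte for byte. Any path that is returned would need to be restored from a copy that is known to be good.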

The Language Archive should be open to all serious data about languages and language processing, whether gathered in experiments or through observation. This is particularly true for language resources from MPI, MPG, BBAW and KNAW researchers, but it also applies to other German, Dutch and even European projects or individual researchers who do not have the human and technical resources to do proper archiving. Since the archive is devoted in particular to endangered languages, it should also be open to deposits of, and access requests for, data that help to document and revitalize these languages and to preserve this aspect of cultural heritage for future generations. To continue maintaining a proper digital archive, The Language Archive needs to have the necessary human and technical resources.