The Language Archive (TLA) is a unit of the Max Planck Institute for Psycholinguistics concerned with digital language resources and tools. Its major features are:
- A large data archive holding resources on languages worldwide. Much of the data consists of annotated audio and video recordings, but several other data types, such as time series (eye tracking, brain images), are included as well. For some languages, the archive also contains language acquisition data, data obtained in psychological experiments, and so on.
Perhaps the best-known part of the archive concerns data on smaller languages and cultures, as typically obtained in field work. As of the end of 2011, we hold data on about 200 languages; data on some 60 of these come from research projects of the program “Documentation of Endangered Languages” (DOBES).
- TLA is involved in a wide variety of research projects that entail infrastructure and software development. As a result, we have been able to develop several tools for creating, managing and exploring linguistic resources. The “Language Archiving Technology” (LAT) suite of tools and web services supports, for example, annotating and analysing recordings and creating and manipulating online multimedia lexical databases. Annotated recordings can be grouped into “sessions”, which are described and organized by metadata in standard formats. Using LAT web applications, all data can then be integrated into a structured, sustainable repository with differentiated access levels and user management, and can be accessed online with dedicated LAT tools for searching and presentation.
- Our long-term expertise in archiving and software development has provided the basis for our participation in trend-setting international projects and collaborations that aim to develop lasting and functional infrastructures for the digital humanities in general. To name only a few: ISLE, DOBES, CLARIN (EU, D, NL), EUDAT, DASISH, Radieschen, TextGrid, AVATecH, INNET. We also support institutions worldwide that want to establish a LAT-based repository of their own, and we organize and participate in education and training activities.
Although TLA is primarily grounded in the research needs of its main funders, the Max Planck Gesellschaft (MPG), the Berlin Brandenburgische Akademie der Wissenschaften (BBAW) and the Koninklijke Nederlandse Akademie van Wetenschappen (KNAW), it has an open policy. We participate in national and international projects and collaborations and contribute to the currently emerging eResearch infrastructures. We are committed to advancing and promoting international standards that facilitate interoperability; for instance, we host the ISOcat data category registry.
Two facts form the background to our archiving work: (1) It has been well known at least since the early nineties that linguistic diversity is dramatically endangered worldwide: between 50% and 90% of the approximately 6000 languages currently spoken may become extinct within the next 4 to 6 generations. (2) Less well known, however, is that, according to an estimate in a UNESCO study, about 80% of the existing recordings of little-known languages and cultures are at risk of being lost over the next few decades. Most analogue recordings are still on perishable media carriers, and digital material, too, requires considerable effort (curation, constant copying, updating of formats) to ensure lasting availability.
Therefore, the primary goal of TLA is to store and preserve, in digital form, the valuable language resources created by researchers in our domain, and to give researchers and other interested users access to them according to the agreed access principles. TLA is also open to requests to deposit appropriate language-related data that is to be made available for research purposes. In most cases, data curation is required to transform the data into formats based on open standards, which increases the chances of long-term interpretability.
However, as in almost all data-oriented research disciplines, it has become apparent that the problem of data management and preservation has not yet been satisfactorily solved. TLA develops and maintains advanced software that allows archive managers to organize and maintain a consistent and coherent digital archive. State-of-the-art software also allows users to easily create, access and enrich data stored at TLA. To this end, we develop and integrate new technologies that advance language research. TLA thus applies cutting-edge technology that introduces new computational methodologies to the study of languages, such as statistically based pattern recognition algorithms.
Trust is of key importance for all our activities. Hence, the archive is subject to regular quality assessment procedures to ensure its reliability. For instance, TLA participates in regular assessments according to the Data Seal of Approval standard. In addition, deposits of and access to the data must be based on clear legal and ethical rules. We do not claim copyright of any of the material in the archive, only the right to archive the data, and the depositors decide whether the material is to be freely accessible (possibly after some identification and agreement to the code of conduct) or whether access is to be more restricted. Nevertheless, we strongly support the ideal not only of open data formats and open source tools, but also of open access and free exchange of data for the benefit of all sciences.
All software developed by TLA may be used free of charge (freeware). It is also ‘open source’: the source code is available upon request under the GNU General Public License version 2 (in some cases version 3).