by Przemek Lenkiewicz

The Max Planck Institute recently began participating in CLARA, a project whose name stands for Common Language Resources and their Applications. It is a European project that runs under the Initial Training Network framework of the Marie Curie Actions.

CLARA offers positions for researchers, both PhD students and postdocs. The project will train a new generation of researchers who can cooperate across national boundaries to establish a common language resources infrastructure and to use it in building the next generation of language models with wide theoretical and applied significance. The work of CLARA researchers will focus on two main goals:

  • to develop the next generation of data-intensive language models and applications by integrating approaches across language and country boundaries;
  • to contribute to the establishment of a pan-European infrastructure for language resources.

Recent advances in technology and widespread research efforts have expanded the size of corpora and the extent of their annotations. Corpora serve as basic resources from which other resources are derived, such as lexicons, frequency lists, word nets and term banks. Although a large number of language resources have been produced to date, many scientific and organizational challenges remain, including the following:

  • Theories and modeling approaches have not yet been applied to a wide range of languages;
  • The gap between academic models and the needs of industrial actors who aim at real-life applications remains to be bridged;
  • Many resources lack appropriate documentation. Moreover, there is no good overview of available resources for some European languages;
  • Since some resources are developed for specific purposes, converting them so they can be reused for other purposes is a challenge;
  • The long-term preservation of language resources needs to be secured;
  • Efficiency issues in accessing language resources in very large repositories must be addressed.

CLARA researchers are meant to address these challenges through means such as:

  • further work on standardization of coding and annotation practices;
  • development of registries and documentation systems for language resources;
  • transfer and integration of single-purpose resources into interoperable, reusable and extendable forms.

The Max Planck Institute is hosting three researchers of the CLARA project: two PhD students and one postdoc. Their work will be organized as a contribution to the AVATecH project, which aims at developing methods for automated annotation creation and thus addresses the areas of interest of the CLARA project.

People involved:
Peter Wittenburg – Scientist in charge.
Perry Janssen – Administrative contact.
Przemek Lenkiewicz – Experienced Researcher, Scientific contact.
Hugo García Blanco – Early Stage Researcher.
Binyam Gebrekidan Gebre – Early Stage Researcher.