by Herman Stehouwer & Sebastian Drude

As all linguistic field workers know, transcribing and further annotating audio and video recordings and other texts is a very expensive and time-consuming procedure. For a single hour of a recording of a lesser-documented language, it can take more than a hundred hours of expert time to create useful linguistic annotations such as “basic annotation” (a transcription and a translation) and “basic glossing”: additional information on individual units – usually morphs, sometimes words – such as an individual gloss (indication of meaning or function) and perhaps categorical information such as part-of-speech tags (or their equivalents on the morphological level). More advanced glossing can take even longer.

Furthermore, information on the lexical units encountered in the texts needs to be transferred to a lexical tool. After all, one of the goals of field work is often to create a usable lexicon describing the endangered language.

Currently, this work is supported best by tools like (The Field Linguist’s) Toolbox or the FieldWorks Language Explorer (FLEx), neither of which properly supports media files. Many users have asked for support for advanced annotation tasks in ELAN, ideally using LEXUS to build, access and expand a lexical database. Making this possible is the objective of TLA’s newest project, LEXAN, a modular annotation support framework coupled to a new interface in ELAN. It will support different “annotyzers”, i.e. modules that produce annotation suggestions for the researcher, including machine-learning modules.

The “annotyzers” will work on a tier or set of tiers, the “source tier[s]”, as chosen by the user, and typically produce an additional tier or a group of tiers, the “target tier[s]”, with content generated based on the source tiers and additional data, e.g. lexical data.

A first annotyzer-like functionality of ELAN (without requiring interaction with a lexicon yet) would be the possibility to copy one entire source tier, for instance a detailed transcript, or a literal translation. The created target tier can then serve as a starting point for preparing another tier with similar but edited content, for instance a cleaner adapted version of the orthographic transcript, or an idiomatic free translation.
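The tier-copying step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Annotation` class and the `copy_tier` function are hypothetical names invented here, and a tier is modelled as a plain list of time-aligned annotations rather than through any actual ELAN or LEXAN interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    """Hypothetical stand-in for one time-aligned annotation on a tier."""
    start_ms: int
    end_ms: int
    value: str


def copy_tier(source: List[Annotation]) -> List[Annotation]:
    """Return an independent copy of a source tier to be used as a target tier."""
    return [Annotation(a.start_ms, a.end_ms, a.value) for a in source]


# A detailed transcript serves as the source tier.
transcript = [
    Annotation(0, 1800, "uh the the dog barked"),
    Annotation(1800, 3000, "then it left"),
]

# The copy is the starting point for a cleaner version; the source stays untouched.
cleaned = copy_tier(transcript)
cleaned[0].value = "The dog barked."
```

Because the copy holds its own annotation objects, editing the target tier never alters the source transcript.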

Similarly, a basic tokenizer would copy the individual words (recognized by spaces and perhaps hyphens or similar punctuation) on one source tier – containing an orthographical representation of a sentence – into separate annotation units on a new (target) word-tier which can then be corrected (e.g., cells can be joined in the case of compound words such as black board, or on the contrary split in the case of clitics which may orthographically be parts of more comprehensive words).
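A minimal sketch of such a tokenizer, together with the two manual corrections mentioned (joining and splitting cells), might look as follows. The function names `tokenize`, `join_tokens` and `split_token` are hypothetical, the word-tier is reduced to a list of strings, and the set of punctuation characters stripped from word edges is an assumption chosen for the example.

```python
import re
from typing import List


def tokenize(sentence: str, split_hyphens: bool = False) -> List[str]:
    """Split an orthographic sentence into word tokens for a word-tier."""
    pattern = r"[\s\-]+" if split_hyphens else r"\s+"
    tokens = []
    for chunk in re.split(pattern, sentence.strip()):
        # Strip punctuation at the word edges only (assumed character set).
        word = chunk.strip(".,;:!?\"'()")
        if word:
            tokens.append(word)
    return tokens


def join_tokens(tokens: List[str], i: int) -> List[str]:
    """Manual correction: merge cells i and i+1, e.g. for a compound word."""
    return tokens[:i] + [tokens[i] + " " + tokens[i + 1]] + tokens[i + 2:]


def split_token(tokens: List[str], i: int, pos: int) -> List[str]:
    """Manual correction: split cell i at character offset pos, e.g. for a clitic."""
    return tokens[:i] + [tokens[i][:pos], tokens[i][pos:]] + tokens[i + 1:]


words = tokenize("She's at the black board.")
# → ["She's", "at", "the", "black", "board"]
joined = join_tokens(words, 3)
# → ["She's", "at", "the", "black board"]
corrected = split_token(joined, 0, 3)
# → ["She", "'s", "at", "the", "black board"]
```

The automatic step only proposes cells; the join and split operations stand for the corrections the researcher would make by hand.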

As a possible next step, already making use of interaction with a lexicon, an annotyzer would use the annotations on the word-tier to build an “intermediate” database of individual inflected word forms. Each entry in this database would have at least a field which contains the citation form of the lexical word for each given inflected word form, possibly together with a semantic label (lexical gloss) and a disambiguating homonym index in case two lexical words with identical citation forms exist. Some of these fields would be obtained from the lexicon once the citation form has been determined, and the citation form itself and other information (such as a “complete gloss” of the inflected word form which includes semantic effects of inflectional categories and the like) could be written back to new target tiers in ELAN. Although much of this information would still have to be added by hand the first time an inflected word form occurs, this simple setting would already help to: a) create lexical entries for new lexical units, b) reduce writing when the form occurs a second, third etc. time, and c) encourage and support consistency.
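The intermediate database can be pictured as a mapping from inflected word forms to small records. The sketch below is an assumption-laden illustration, not the LEXAN design: `WordFormEntry` and `annotate` are hypothetical names, the database is a plain dictionary, and the English sample data is invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class WordFormEntry:
    citation_form: str                   # citation form of the lexical word
    lexical_gloss: str = ""              # semantic label taken from the lexicon
    homonym_index: Optional[int] = None  # only needed for identical citation forms
    complete_gloss: str = ""             # gloss of the whole inflected form


def annotate(word_tier: List[str],
             db: Dict[str, WordFormEntry]) -> Tuple[List[str], List[str], List[str]]:
    """Fill two target tiers from the database and list forms still to be entered."""
    citation_tier, gloss_tier, pending = [], [], []
    for form in word_tier:
        entry = db.get(form)
        if entry is None:
            # First occurrence: leave the cells empty for manual annotation.
            citation_tier.append("")
            gloss_tier.append("")
            if form not in pending:
                pending.append(form)
        else:
            citation_tier.append(entry.citation_form)
            gloss_tier.append(entry.complete_gloss)
    return citation_tier, gloss_tier, pending


db = {"dogs": WordFormEntry("dog", "canine", None, "dog-PL")}
tier = ["dogs", "barked", "dogs"]
citations, glosses, pending = annotate(tier, db)

# The researcher adds the missing form once; later occurrences are filled in.
db["barked"] = WordFormEntry("bark", "make a dog sound", 1, "bark-PST")
citations2, glosses2, pending2 = annotate(tier, db)
```

Each form is entered by hand only once, and every later occurrence receives the same citation form and gloss, which is where the saving in writing and the gain in consistency come from.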

Many users acquainted with Toolbox or FLEx would expect the future LEXAN to offer a “glossing” functionality like the one they know from these tools. This would include a parser module (generic or language-specific, based on pure string matching or more advanced and context-sensitive, static or with learning capacities, etc.) which would split up the individual inflected word forms on a source word-tier into individual morphs on a new target morph-tier. This morph-tier would then serve as a source for adding further target tiers with annotations such as glosses (indication of lexical meaning or functional/categorical effects) and perhaps part-of-speech-like tags (on the morpheme level). In the lexicon, this functionality would presuppose corresponding fields in all entries, such as a part-of-speech label and a gloss for each morph, which are probably the most common fields in lexical databases in field research anyway (in addition to the citation and variant forms of the morph and possibly a way to distinguish different but related senses which are given as lexicographical definitions or translation equivalents). Again, correct parses and glosses would be stored in the intermediate database so that they can be re-used and referred to.
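The simplest variant named above, a generic parser based on pure string matching, could be sketched as follows. This does not reproduce how Toolbox, FLEx or LEXAN parse; the toy lexicon, its gloss and tag labels, and the function names `parse` and `gloss_line` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical morph lexicon: morph form -> (gloss, part-of-speech-like tag).
LEXICON: Dict[str, Tuple[str, str]] = {
    "un": ("NEG", "pfx"),
    "kind": ("kind", "adj"),
    "ness": ("NMLZ", "sfx"),
}


def parse(word: str, lexicon: Dict[str, Tuple[str, str]]) -> List[List[str]]:
    """Return every segmentation of word into morphs that are all in the lexicon."""
    if not word:
        return [[]]  # one way to segment the empty remainder: no morphs
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            for rest in parse(word[end:], lexicon):
                results.append([prefix] + rest)
    return results


def gloss_line(morphs: List[str], lexicon: Dict[str, Tuple[str, str]]) -> str:
    """Build the gloss-tier content for a chosen segmentation."""
    return "-".join(lexicon[m][0] for m in morphs)


parses = parse("unkindness", LEXICON)
# → [["un", "kind", "ness"]]
glosses = gloss_line(parses[0], LEXICON)
# → "NEG-kind-NMLZ"
```

The function returns all matching segmentations rather than one, so an ambiguous form would be presented to the researcher as a set of suggestions, and a form with no parse returns an empty list and is left for manual annotation.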

It is a well-known fact that general parsers work better for some languages than for others (for instance, morphological parsers usually score high with predominantly isolating and agglutinative languages and less well with inflectional and polysynthetic languages). It is also true that glossing schemes and set-ups are based on specific types of linguistic theories – for instance, the setting presented above (which corresponds to the default functionalities of Toolbox and FLEx) is clearly tied to an “item-and-arrangement” (less so “item-and-process”) reasoning on language structure. In principle, an infrastructure such as the one proposed here should strive to be as interoperable with different linguistic theories as possible, which would imply that “word-and-paradigm” theories could also fruitfully use the tools and functionalities. The proposal of an “intermediate” database with one entry for every individual (inflected) word form goes in that direction, allowing, for instance, forms to be characterized with respect to their functional categories without assigning these categories to individual morphs. Of course, to be fully functional for arbitrary theories and language types, complex (multiple-word) forms must also be covered, which presupposes the development of modules (parsers and the like) that recognize syntactic structures and that are able to cope with, say, discontinuous word forms.

More sophisticated and complete annotations on the morphological, syntactic and even other levels (phonetic/phonological, intonational) can be added by additional annotyzers as corresponding modules become available – for instance, morphological or syntactic constituent structures or grammatical relations could be generated (semi)automatically and represented in corresponding tiers in ELAN.




Figure: A schematic view of the architecture of LEXAN