Harald Hammarström works as a researcher at the University of Nijmegen and the Max Planck Institute for Evolutionary Anthropology.

Unsupervised Learning of Morphology for Lesser-Studied Languages

The problem of Unsupervised Learning of Morphology (ULM) can be stated simply: input raw text data and output a morphological description of the language represented in that data. Since Harris (1951), it has been thought that a computer algorithm might be able to solve this problem to some degree by exploiting regularities in character frequencies and substring co-occurrences. We will review our own and others’ work on this problem, especially as it pertains to lesser-studied languages and the corpus data for such languages hosted at the TLA. The current status appears to be that ULM technologies do not yet achieve a significantly high degree of accuracy, but it should be possible to get them to do so.
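
To give a concrete sense of what exploiting character-level regularities can mean, here is a minimal Python sketch of one simple heuristic: count how many distinct letters can follow each word-initial substring in a word list, and split a word where that count peaks. It is an illustration only, not the method of any particular system reviewed in the talk; the names `toy_vocabulary`, `successor_sets` and `split_at_peak`, and the English word list, are made up for the example.

```python
from collections import defaultdict

# Hypothetical toy word list; a real experiment would use raw corpus text.
toy_vocabulary = ["walk", "walks", "walked", "walking",
                  "talk", "talks", "talked", "talking"]


def successor_sets(vocabulary):
    """Map every proper prefix to the set of letters that follow it."""
    followers = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word)):
            followers[word[:i]].add(word[i])
    return followers


def split_at_peak(word, followers):
    """Split `word` after the prefix with the most distinct followers.

    A cut is made only if some prefix has more than one follower;
    on ties the shortest such prefix wins. Otherwise the word is
    returned unsplit, with an empty suffix.
    """
    best_cut, best_score = len(word), 1
    for i in range(1, len(word)):
        score = len(followers.get(word[:i], ()))
        if score > best_score:
            best_cut, best_score = i, score
    return word[:best_cut], word[best_cut:]


followers = successor_sets(toy_vocabulary)
for w in ("walking", "talked", "walk"):
    print(w, "->", split_at_peak(w, followers))
```

On this toy list the prefix "walk" is followed by three different letters while every other prefix of "walking" is followed by only one, so the split falls between "walk" and "ing". Real text is far noisier, which is one reason accuracy remains an open issue.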