Perl module for text segmentation at sentence and word level, morpho-syntactic annotation (simple and layered, i.e. tiered tagging) and lemmatization. It is language independent and has been trained and used mainly for English and Romanian. The TTL functions are:
– Named Entity Recognition: through regular expressions defined over sequences of characters;
– sentence splitting: a set of regular expressions that identify sentence-end markers;
– tokenization: based on regular expressions and on lists of idiomatic expressions, suffixes, and prefixes;
– POS tagging: HMM tagging that extends Brants' TnT with tiered tagging, more accurate processing of unknown words, and NE tagging; it uses the MSD tag-set (http://nl.ijs.si/ME/V2/msd/) together with its smaller CTAG tag-set;
– lemmatization: by lexicon lookup and, for unknown words, a statistical module that automatically learns normalization rules from the existing lexical stock;
– chunking: by regular expressions over sequences of POS tags.
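To make the regular-expression steps above more concrete, here is a minimal Python sketch of sentence splitting, tokenization with a list of idiomatic expressions, and chunking over POS tag sequences. It is not TTL's code or API (TTL is a Perl module with its own rules and resources); the function names, the abbreviation and expression lists, the patterns, and the simplified tags (DET, ADJ, NOUN, ...) are all assumptions chosen for illustration, and they are far coarser than the MSD or CTAG tag-sets.

```python
import re

# Hypothetical resources for illustration only; TTL uses its own lists and rules.
ABBREVIATIONS = {"Dr.", "Mr.", "e.g."}
MULTIWORD_EXPRESSIONS = ["in spite of", "as well as"]


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace and an uppercase
    letter, unless the token that ends there is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # e.g. "Dr. Smith": not a sentence boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Return tokens, keeping listed idiomatic expressions as single tokens."""
    # Longest expressions first, so overlapping entries prefer the longer match.
    expressions = sorted(MULTIWORD_EXPRESSIONS, key=len, reverse=True)
    alternatives = "|".join(re.escape(e) for e in expressions)
    pattern = re.compile(rf"\b(?:{alternatives})\b|\w+|[^\w\s]", re.IGNORECASE)
    return pattern.findall(sentence)


def chunk_noun_phrases(tagged):
    """Find noun-phrase chunks with a regular expression over the tag sequence.

    `tagged` is a list of (word, tag) pairs. The tags are encoded as a string
    such as "<DET><ADJ><NOUN>", and each match is mapped back to the words.
    """
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    np_pattern = re.compile(r"(?:<DET>)?(?:<ADJ>)*(?:<NOUN>)+")
    chunks = []
    for match in np_pattern.finditer(tag_string):
        # The number of "<" before an offset equals the token index there.
        first = tag_string[:match.start()].count("<")
        last = tag_string[:match.end()].count("<")
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks


if __name__ == "__main__":
    print(split_sentences("Dr. Smith arrived. He sat down."))
    print(tokenize("He left in spite of the rain."))
    print(chunk_noun_phrases([("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"),
                              ("barked", "VERB"), ("loudly", "ADV")]))
```

The sketch leaves out the statistical parts of the pipeline (HMM tagging and the learned lemmatization rules for unknown words), which cannot be reduced to a few patterns.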
