The Linguistic Data Consortium (LDC) has developed a range of (mainly SGML-based) formats for the transcripts and other types of annotation that it has published (see below for NIST’s UTF format, which provides a combined framework for several of these existing formats). Some online documentation is available for individual corpora, which were authored at different times by different groups (e.g. Switchboard at TI in 1991, Trains at Rochester in 1992-3), as well as for a general SGML transcription specification currently used for (orthographic) transcription of telephone conversations and broadcast news recordings. The LDC has also implemented a general data model for searching annotated text and speech corpora online, via LDC-Online.

http://www.ldc.upenn.edu/
