Affiliation: Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
The lecture will first give an overview of the background and course of the MALACH project, which was carried out from 2001 through 2007. We will introduce the goals of the project, which were to employ automatic speech recognition and information retrieval techniques to provide improved access to a large video archive of recorded testimonies of Holocaust survivors. We will talk about the origin and properties of the archive and briefly present the speech recognition and retrieval techniques that were used. The core of the lecture will be devoted to a detailed description and demonstration of the search system, which is able to handle “Google-like” queries in real time.
The system has so far been developed for the Czech part of the archive only. It takes advantage of a state-of-the-art automatic speech recognition (ASR) system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech, and emotionally loaded content) and of its close coupling with the actual search engine. The design of the algorithm, which adopts the spoken term detection approach, is focused on retrieval speed. The resulting system is able to search through the 1,000 hours of video constituting the Czech portion of the archive and find occurrences of the query word in a matter of seconds. The phonetic search, implemented alongside the search based on lexicon words, makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations, or phrases containing Jewish slang.
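The two-stage lookup described above (lexicon words first, phonetic search as a fallback for out-of-lexicon terms) can be illustrated with a minimal sketch. This is not the system presented in the lecture: the class `SpokenTermIndex`, the `Hit` record, the recording identifiers, and the phoneme symbols are all hypothetical, and exact phoneme-sequence matching over a linear scan stands in for whatever approximate, indexed matching a production system would use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    recording: str  # hypothetical recording identifier
    start: float    # offset into the recording, in seconds
    score: float    # recognizer confidence (assumed to lie in 0..1)


class SpokenTermIndex:
    """Toy spoken term detection index: word lookup with phonetic fallback."""

    def __init__(self):
        self.words = defaultdict(list)  # lowercased word -> list of Hit
        self.phoneme_streams = {}       # recording -> [(phoneme, start), ...]

    def add_word(self, recording, word, start, score):
        # Store one recognized lexicon word together with its time stamp.
        self.words[word.lower()].append(Hit(recording, start, score))

    def add_phonemes(self, recording, phonemes):
        # Store the time-stamped phoneme sequence recognized for a recording.
        self.phoneme_streams[recording] = list(phonemes)

    def search(self, term, pronunciation=None):
        # Stage 1: the term is in the lexicon, so a dictionary lookup suffices.
        hits = self.words.get(term.lower())
        if hits:
            return sorted(hits, key=lambda h: -h.score)
        # Stage 2: out-of-lexicon term, fall back to the phoneme streams.
        if not pronunciation:
            return []
        return self._phonetic_search(list(pronunciation))

    def _phonetic_search(self, pron):
        # Simplifying assumption: exact matching via a linear scan.
        n = len(pron)
        found = []
        for recording, stream in self.phoneme_streams.items():
            labels = [p for p, _ in stream]
            for i in range(len(labels) - n + 1):
                if labels[i:i + n] == pron:
                    found.append(Hit(recording, stream[i][1], 1.0))
        return found


index = SpokenTermIndex()
index.add_word("tape_01", "Praha", 12.5, 0.90)
index.add_word("tape_02", "praha", 300.0, 0.95)
index.add_phonemes("tape_01", [("t", 40.0), ("e", 40.1), ("r", 40.2),
                               ("e", 40.3), ("z", 40.4), ("i", 40.5),
                               ("n", 40.6)])

lexicon_hits = index.search("Praha")
oov_hits = index.search("Terezin", pronunciation=["t", "e", "r", "e", "z", "i", "n"])
```

In this sketch the lexicon lookup is a single dictionary access, which is why it scales to large archives, while the phonetic fallback is shown in its simplest form only to make the division of labor between the two searches visible.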
The ongoing work on the search technology for the archive, supported by a recently started national project, will also be presented.