With the research and development stage of the project complete, we are ready to share our results with the research community. The recognizers support the following functions:
- Basic video analysis is robust and usable, offering features that facilitate annotation. Examples include:
  - Shot/sub-shot detection and key-frame extraction: creating an image storyline helps with navigating unstructured field-trip or experiment recordings, where the scenes are very similar and cuts between different experiments are difficult to find.
  - Camera motion detection: helps to find relevant parts of the recording, e.g. by labeling panning, which is usually not relevant, or zooming, which usually means that something important is taking place.
- Audio recognizers focus on delivering utterance segmentation of the audio signal, which is produced in the following main steps:
  - Relative silence detection: helps to discard irrelevant parts of the recording, in which the audio activity is low.
  - Audio segmentation with:
    - standard segmentation recognizer – for detecting long, homogeneous segments
    - fine audio segmentation recognizer – for detecting unorganized, acoustic-level utterances
  - Speech/non-speech classification of the segments
  - Speaker diarization – to detect the number of speakers in the recording and assign each segment to one of the speakers
- Advanced video recognizers detect the coordinates of the hands and head in each frame of the video and use this information (together with timing information) to calculate when a gesture takes place, what its time boundaries are, and some of its features, such as speed, size, orientation, and location in the personal gesture space. The background is successfully excluded from the analyzed area, so a uniform background is not a strong requirement.
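To illustrate how per-frame coordinates and timing can yield gesture boundaries, here is a minimal sketch. It is not the project's actual recognizer: it assumes a hypothetical input format (a list of `(time, x, y)` hand positions) and a simple speed threshold, and the names `gesture_segments`, `speed_threshold`, and `min_duration` are invented for this example.

```python
from math import hypot


def gesture_segments(track, speed_threshold=50.0, min_duration=0.2):
    """Group frames with fast hand movement into gesture segments.

    track: list of (time_sec, x, y) hand positions, sorted by time
           (hypothetical format, assumed for this sketch).
    speed_threshold: minimum speed (pixels/second) that counts as movement.
    min_duration: segments shorter than this (seconds) are discarded.
    """
    runs = []
    current = None  # [start_time, end_time, list_of_speeds] of the open run

    # Walk over consecutive frame pairs and measure the hand speed between them.
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order frames
        speed = hypot(x1 - x0, y1 - y0) / dt

        if speed >= speed_threshold:
            if current is None:
                current = [t0, t1, [speed]]  # movement starts here
            else:
                current[1] = t1              # extend the open run
                current[2].append(speed)
        elif current is not None:
            runs.append(current)             # movement stopped: close the run
            current = None

    if current is not None:
        runs.append(current)                 # close a run that reaches the end

    return [
        {"start": s, "end": e, "mean_speed": sum(v) / len(v)}
        for s, e, v in runs
        if e - s >= min_duration
    ]


if __name__ == "__main__":
    # Hand is still, moves right for three seconds, then is still again.
    demo = [(0, 0, 0), (1, 0, 0), (2, 10, 0), (3, 20, 0),
            (4, 30, 0), (5, 30, 0), (6, 30, 0)]
    print(gesture_segments(demo, speed_threshold=5.0, min_duration=2.0))
```

A real recognizer would likely smooth the coordinates and derive further features such as size and orientation. The threshold-and-group step only shows how time boundaries and speed can follow from the tracked positions.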
The recognizers' performance has been evaluated in a series of annotation experiments in which actual scientific recordings were annotated by their creators or by research assistants. Experiments were performed both with and without the recognizers. The results are as follows:
Efficiency for audio segmentation increased by 49% with the recognizers.
|Annotation blocks||Annotation density (blocks/minute)||Annotation time||Average annotation speed||Annotation time/media length ratio|
Efficiency for gesture segmentation increased by 46% and 43% with the recognizers in the two experiments performed.
|Recording||Annotation time without recognizers||Annotation time with recognizers|
|Pre-intermediate.mpg (length: 1min 24sec)||30min 00sec||16min 11sec|
|Intermediate.mpg (length: 46sec)||13min 10sec||7min 30sec|
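The gesture-segmentation percentages can be checked against the annotation times in the table. The sketch below rests on two assumptions: that the two time columns are the annotation time without and with the recognizers, and that the efficiency gain is the relative reduction in annotation time. The helper names `to_seconds` and `time_saving` are introduced here for illustration only.

```python
import re


def to_seconds(text):
    """Parse a duration such as '16min 11sec' or '46sec' into seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+)\s*sec", text)
    total = 0
    if minutes:
        total += int(minutes.group(1)) * 60
    if seconds:
        total += int(seconds.group(1))
    return total


def time_saving(without_recognizers, with_recognizers):
    """Relative reduction in annotation time (assumed efficiency measure)."""
    return 1.0 - to_seconds(with_recognizers) / to_seconds(without_recognizers)


if __name__ == "__main__":
    experiments = {
        "Pre-intermediate.mpg": ("30min 00sec", "16min 11sec"),
        "Intermediate.mpg": ("13min 10sec", "7min 30sec"),
    }
    for name, (before, after) in experiments.items():
        print(f"{name}: {time_saving(before, after):.0%}")
    # Pre-intermediate.mpg: 46%
    # Intermediate.mpg: 43%
```

Under these assumptions, the times reproduce the reported figures: (1800 s − 971 s) / 1800 s ≈ 46% and (790 s − 450 s) / 790 s ≈ 43%.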