1 The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of Multimedia Information
Gerhard Rigoll, Munich University of Technology, Institute for Human-Machine Communication, Munich, Germany

2 ALERT system for selective dissemination of multimedia information
General project dates
• Official start: 01/2000; start of work: 03/2000; duration: 30 months
• Manpower effort: ~30 MY; budget: ~1.6 million euro EC funding
• Web site:

3 Media information flooding
[Figure labels: NEWS; Internet; supervision by information brokers]

4 Media monitoring in the ALERT project
[Figure labels: NEWS; information (sound, video, text); transcription; topic detection; today's headlines ... TAXES; ALERT MESSAGE; Internet]

5 General project objectives
To develop a demo system capable of identifying specific information in multimedia data consisting of text, audio and video streams, using:
• advanced speech recognition
• video processing techniques
• automatic topic detection algorithms
The demonstrator shall:
• alert a user about the existence of requested information
• send detailed information on the client's further request: extracted text, annotated audio/video data and video clips
• provide functionality in French, German and Portuguese
The demo system will be evaluated mainly by the industrial partners.

6 The ALERT consortium
[Figure: consortium partners grouped as integration, technologies, users]

7 WP structure (WP0-WP4)
[Chart legend: today, milestone, deliverable]

8 WP structure (WP5-WP7)
[Chart legend: today, milestone, deliverable]

9 Collection of pilot corpus
First step towards setting up similar resources
Purpose: testbed for assessing methods for data collection, annotation and distribution
Collection guidelines:
• Minimum amount: 5 hours
• Type of data: video, audio and annotation
• Video format: MPEG-1
• Audio format: linear PCM, 16 kHz sampling rate, 16 bits/sample, mono, collected from antenna
• Annotation based on LDC guidelines
• Thematic orientation: news and interview shows

10 Collection of final databases
Recommendations for final corpus quality, based on experimental results: MP3, 32 kbps, 16 kHz, mono
Minimum amount:
• speech recognition: 50 hours (training), 3 hours (development), 3 hours (evaluation), word-labelled
• topic detection: 300 hours, topic annotated
• text corpus: 100 million words
Full data set:
• 1300 hours, word or topic annotated
• > 10k topic annotated summaries in German
• text corpus: > 1 billion words
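To make the storage impact of the coding recommendation concrete, the following minimal sketch compares the pilot corpus audio format (linear PCM, 16 kHz, 16 bits/sample, mono) with the recommended 32 kbps MP3 for the 1300-hour full data set. The function name `audio_bytes` is a hypothetical helper, and a constant bit rate is assumed.

```python
def audio_bytes(hours, bits_per_second):
    """Bytes needed to store `hours` of audio at a constant bit rate."""
    return hours * 3600 * bits_per_second // 8

# Pilot corpus format: 16 kHz sampling rate, 16 bits/sample, mono.
PCM_BPS = 16_000 * 16 * 1
# Recommended final corpus format: MP3 at 32 kbps (constant rate assumed).
MP3_BPS = 32_000

FULL_SET_HOURS = 1300
pcm_gb = audio_bytes(FULL_SET_HOURS, PCM_BPS) / 1e9
mp3_gb = audio_bytes(FULL_SET_HOURS, MP3_BPS) / 1e9
compression = PCM_BPS / MP3_BPS

print(f"PCM: {pcm_gb:.2f} GB, MP3: {mp3_gb:.2f} GB, ratio {compression:.0f}:1")
```

Under these assumptions the full data set shrinks from roughly 150 GB of PCM audio to under 19 GB of MP3, an 8:1 reduction.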

11 Comparison of coding schemes for broadcast speech databases

12 Multimedia data labeling and alert generation
[Flowchart, starting from a multimedia document:]
• if video contained: video/image processing, video-based segmentation
• if audio contained: speech processing, transcription (best hypothesis, word graph), segmentation
• if text contained: automatic topic detection (topic, keywords)
• match topics found against user profiles, alert specific users
• storage: label database, multimedia document database
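The last step of the flow, matching the topics found against user profiles, can be sketched as a simple set intersection. This is only an illustration: the `UserProfile` structure, the function `generate_alerts`, and the topic names are hypothetical and not taken from the ALERT demonstrator.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class UserProfile:
    """Hypothetical user profile: a user name and the topics of interest."""
    user: str
    topics: Set[str] = field(default_factory=set)

def generate_alerts(segment_topics, profiles):
    """Return (user, matched topics) for each profile sharing a topic with the segment."""
    found = set(segment_topics)
    alerts = []
    for profile in profiles:
        hits = found & profile.topics
        if hits:
            alerts.append((profile.user, sorted(hits)))
    return alerts

profiles = [
    UserProfile("broker_a", {"taxes", "sports"}),
    UserProfile("broker_b", {"weather"}),
]
print(generate_alerts(["taxes", "politics"], profiles))
```

Only users whose profile overlaps with the detected topics receive an alert; all other users are left undisturbed.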

13 Basic principle of video segmentation
Stochastic video model (based on HMMs)
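As a rough illustration of how an HMM-based video model can yield a segmentation, the sketch below runs Viterbi decoding over per-frame state likelihoods and reports the frames where the decoded state changes. The two states, the transition probabilities, and the emission values are invented for the example; the actual model topology and features of the ALERT system are not given on the slide.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence.

    log_init[s], log_trans[s][t] and log_emit[frame][s] are log-probabilities.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for frame in log_emit[1:]:
        prev, pointers, score = score, [], []
        for t in range(n_states):
            best = max(range(n_states), key=lambda s: prev[s] + log_trans[s][t])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][t] + frame[t])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def boundaries(path):
    """Frame indices at which the decoded state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]

def to_log(rows):
    return [[math.log(p) for p in row] for row in rows]

# Hypothetical two-state model (e.g. 0 = anchor shot, 1 = report).
log_init = [math.log(0.5), math.log(0.5)]
log_trans = to_log([[0.9, 0.1], [0.1, 0.9]])
log_emit = to_log([[0.8, 0.2]] * 3 + [[0.2, 0.8]] * 3)

path = viterbi(log_init, log_trans, log_emit)
print(path, boundaries(path))
```

The "sticky" transition probabilities keep the decoder from switching states on every noisy frame, which is the main reason to prefer a stochastic model over frame-by-frame classification.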

14 Result of video-based segmentation

15 Combined video-audio-segmentation

16 Topic segmentation
Results: video-based detection of topic boundaries is feasible
• precision rate = 1 - insertion rate = %
• recall rate = 1 - deletion rate = %
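The two rates above can be computed from reference and detected boundary positions. The sketch below assumes that the insertion rate is the fraction of detected boundaries with no matching reference boundary, and that the deletion rate is the fraction of reference boundaries that were missed. The matching tolerance and the function name `boundary_scores` are assumptions for illustration.

```python
def boundary_scores(reference, detected, tolerance=0):
    """Return (precision, recall) for detected topic boundaries.

    Positions may be frame indices or seconds. Both lists must be non-empty.
    Each reference boundary can be matched by at most one detection.
    """
    unmatched = list(reference)
    hits = 0
    for d in detected:
        match = next((r for r in unmatched if abs(r - d) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    insertion_rate = (len(detected) - hits) / len(detected)
    deletion_rate = (len(reference) - hits) / len(reference)
    return 1 - insertion_rate, 1 - deletion_rate

precision, recall = boundary_scores(
    reference=[10, 50, 90, 130],
    detected=[11, 52, 70, 131, 200],
    tolerance=2,
)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this invented example three of five detections are correct (precision 0.60) and three of four reference boundaries are found (recall 0.75).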

17 French BN (broadcast news) speech recognizer
• continuous density HMM system
• 33 phones + 3 non-speech units (silence, filler words, breath)
• ~20% WER (on news)
• 65k dictionary, automatic pronunciation with manual verification
• 58 hours of acoustic training data, 350 million words of text
• RT decoding: 5700 states, 92k Gaussians
• 10xRT decoding: states, 350k Gaussians
• 4-gram language model: 15M bigrams, 15M trigrams, 13M four-grams
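The word error rate (WER) quoted for the recognizers is conventionally the number of word substitutions, deletions and insertions divided by the number of reference words. A minimal sketch of that standard computation follows; the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)

wer = word_error_rate(
    "the tax reform was approved today",
    "the tax reforms was approved",
)
print(f"WER = {wer:.1%}")
```

Here one substitution and one deletion against six reference words give a WER of about 33%.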

18 Portuguese BN speech recognizer
• Based on the AUDIMUS LVCSR system
• Hybrid system based on MLP/HMM techniques
• Combination of different acoustic models (product of posterior probabilities)
• 38 phones + silence, 57k dictionary
• 4-gram LM: 5M bigrams, 12M trigrams, 13M four-grams
• Trained on 13 h of BN data
• Results (15xRT): F0: ~20%, all F: ~40%
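Combining acoustic models by the product of their posterior probabilities can be sketched as follows: for each frame, the per-phone posteriors of all models are multiplied and renormalized. This is a generic illustration, not the AUDIMUS implementation; the function name, the flooring constant, and the example values are assumptions.

```python
import math

def combine_posteriors(posterior_sets):
    """Normalized product of per-model phone posteriors for a single frame.

    posterior_sets is a list with one posterior vector per acoustic model.
    The product is computed in the log domain for numerical stability.
    """
    n_phones = len(posterior_sets[0])
    log_prod = [
        sum(math.log(max(p[k], 1e-12)) for p in posterior_sets)  # floor avoids log(0)
        for k in range(n_phones)
    ]
    shift = max(log_prod)
    unnormalized = [math.exp(v - shift) for v in log_prod]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Two hypothetical MLP outputs for the same frame over three phones.
model_a = [0.6, 0.3, 0.1]
model_b = [0.5, 0.1, 0.4]
combined = combine_posteriors([model_a, model_b])
print([round(c, 3) for c in combined])
```

A phone has to score well in every model to keep a high combined posterior, so disagreements between models are penalized.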

19 German Baseline Speech Recognition System

20 German BN speech recognizer
• continuous density HMM system
• 50 phones + 17 non-speech units (silence, filler words, breath, rustle, ...)
• ~20% WER (initial DuDeutsch: >70% WER)
• 100k dictionary, initial pronunciation from CELEX, compound word construction
• 10xRT: 30-90k Gaussians
• 3-gram (cached) language model: 8M bigrams, 16M trigrams

21 Evolution of the German system
Table columns: system; phone models; #mixtures; WER
• baseline German system, 100k, spontaneous speech; triphones; WER ~30%
• baseline, not trained on broadcast data; triphones; WER ,7%
• baseline with broadcast language model; triphones; WER ,3%
• acoustic models trained on broadcast data; monophones; WER ,3%
• acoustic models optimized on; triphones; WER ,8%

22 Examples for German transcription results

23 Automatic topic detection
Objectives:
• to automatically divide audio/video streams into topic-specific homogeneous segments
• automatic assignment of requested topics to distinct segments
Test set:
• 22 topics in 2956 training and 1284 test texts
• deletion of 150 stop words
• no stemming performed
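The text preprocessing described for the test set (stop word deletion, no stemming) can be sketched in a few lines. The stop word list here is a tiny hypothetical stand-in for the 150 stop words mentioned on the slide, and the tokenization rule is an assumption.

```python
import re

# Tiny hypothetical stop list; the slide mentions 150 stop words.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on word characters, and drop stop words (no stemming)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(preprocess("The reform of the taxes is delayed"))
```

Because no stemming is applied, inflected forms such as "taxes" and "tax" remain distinct vocabulary entries.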

24 New approach to topic detection
[Figure: example text "This is a text containing important topics."; word features p(w1), p(w2), p(w3), ...; MMI neural net; VQ label]

25 Results for clean text
• Comparison of new approach and standard system
• Comparison of feature quantization with k-means clustering and MMI neural net
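The quantization step being compared can be illustrated with the simpler of the two variants: assigning a feature vector to its nearest codebook entry, as a k-means quantizer does at test time. The sketch assumes that the feature vector holds relative word frequencies p(w) over a small vocabulary, and the codebook is hand-picked rather than trained. Both are assumptions, and the MMI neural net itself is not modelled here.

```python
from collections import Counter

def word_probabilities(tokens, vocabulary):
    """Relative frequency p(w) of each vocabulary word among the in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocabulary]

def vq_label(vector, codebook):
    """Index of the nearest codebook vector (squared Euclidean distance)."""
    def distance(centroid):
        return sum((x - y) ** 2 for x, y in zip(vector, centroid))
    return min(range(len(codebook)), key=lambda k: distance(codebook[k]))

vocabulary = ["tax", "goal", "rain"]           # hypothetical vocabulary
codebook = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # hypothetical, untrained centroids

features = word_probabilities(["tax", "tax", "rain", "budget"], vocabulary)
print(features, vq_label(features, codebook))
```

In a k-means system the codebook would be learned by clustering the training vectors. Per the slide, the MMI neural net is the alternative quantizer that the new approach is compared against.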

26 Partially corrupted text
Results with partially corrupted texts:
• some words are fragmented, similar to speech recognition output
• 22 topics in 3037 training and 1319 test texts
• no stop words
• no stemming
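One way to simulate this kind of corruption is to split a random subset of words into two fragments. The sketch below is an assumed procedure for illustration only, since the slide does not say how the corrupted texts were produced. The `rate` parameter and the fixed seed are hypothetical.

```python
import random

def fragment_words(tokens, rate, seed=0):
    """Split roughly a fraction `rate` of the tokens into two fragments each."""
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    corrupted = []
    for token in tokens:
        if len(token) >= 2 and rng.random() < rate:
            cut = rng.randint(1, len(token) - 1)  # both fragments are non-empty
            corrupted.extend([token[:cut], token[cut:]])
        else:
            corrupted.append(token)
    return corrupted

tokens = "parliament approved the new tax reform".split()
print(fragment_words(tokens, rate=0.3))
```

Fragments such as "parl" and "iament" no longer match vocabulary entries, which is what makes topic detection on such text harder than on clean text.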

27 Results for corrupted text
[Figure panels: 22 topics; 173 topics]

28 Demonstrator specification (details)

29 Publications
ICASSP 2001 (7/2001)
• LIMSI: Automatic transcription of compressed broadcast audio
• GMUD: New approaches to audio-visual segmentation of TV news for automatic topic retrieval
TREC-9 (11/2000)
• LIMSI: The LIMSI SDR system for TREC-9
argus press (11/2000)
• Observer: Observer Argus Media beteiligt sich am EU-Forschungsprojekt ALERT [Observer Argus Media participates in the EU research project ALERT]
ICSLP 2000 (10/2000)
• GMUD: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches
• INESC: The Use of Syllable Segmentation Information in Continuous Speech Recognition Hybrid Systems Applied to the Portuguese Language
• INESC: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems

30 Publications (II)
ICSLP 2000 (10/2000)
• LIMSI: Fast decoding for indexation of broadcast data
• LIMSI: Investigating text normalization and pronunciation variants for German broadcast transcription
EDCL th European Conference on Research and Advanced Technology for Digital Libraries (9/2000)
• INESC: Topic Detection in Read Documents
ASR 2000 (9/2000)
• INESC: A Decoder for Finite-State Structured Search Spaces
ICASSP 2000 (6/2000)
• GMUD: A Novel Error Measure for the Evaluation of Video Indexing Systems

31 Presentations
Schaufenster der Wissenschaft (3/2001)
• GMUD: Informationen aus Radio, Fernsehen und Internet: Automatische Themenerkennung in Multimedia-Daten [Information from radio, television and the Internet: automatic topic detection in multimedia data]
Euromap Informationstag (12/2000)
• GMUD: Das Projekt ALERT - Alert system for selective dissemination of multimedia information [The ALERT project]
IV Jornadas de Arquivo e Documentação (10/2000)
• INESC: Speech recognition and topic detection applied to alert systems for broadcast news
ASR 2000 (9/2000)
• GMUD: ALERT System for Selective Dissemination of Multimedia Information
Homme Technologie et Systèmes Complexes (6/2000)
• VECSYS: Parlez Naturellement, la Machine Vous Comprend [Speak naturally, the machine understands you]
RIAO'2000 Content-based Multimedia Information Access (4/2000)
• VECSYS, LIMSI: An Audio Transcriber for Broadcast Document Indexation

32 Outlook
• use of additional data
• cross-talker situations
• enlarged number of topics
• improving rejection mechanisms for unknown topics (confidence for topics)
• detection of new topics
• summarization: scalable summarization, topic-dependent summarization
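A rejection mechanism based on topic confidence could, in its simplest form, normalize the topic scores and refuse to assign a topic when the best one falls below a threshold. The sketch below is only one possible reading of that outlook item. The scoring scheme, the threshold of 0.5, and the function name are assumptions.

```python
def assign_topic(scores, threshold=0.5):
    """Return (topic, confidence), with topic None when the confidence is too low.

    `scores` maps topic names to positive scores. Confidence is the best
    score's share of the total.
    """
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    confidence = scores[best] / total
    if confidence < threshold:
        return None, confidence  # reject: probably an unknown topic
    return best, confidence

print(assign_topic({"taxes": 8, "sports": 1, "weather": 1}))
print(assign_topic({"taxes": 4, "sports": 3, "weather": 3}))
```

Segments rejected in this way would be candidates for the "detection of new topics" item in the list above.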

