Slide 1: Multimodal Technology Integration for News-on-Demand
SRI International
News-on-Demand Compare & Contrast, DARPA, September 30, 1998
Slide 2: Personnel
- Speech: Dilek Hakkani, Madelaine Plauche, Zev Rivlin, Ananth Sankar, Elizabeth Shriberg, Kemal Sonmez, Andreas Stolcke, Gokhan Tur
- Natural language: David Israel, David Martin, John Bear
- Video analysis: Bob Bolles, Marty Fischler, Marsha Jo Hannah, Bikash Sabata
- OCR: Greg Myers, Ken Nitz
- Architectures: Luc Julia, Adam Cheyer
Slide 3: SRI News-on-Demand Highlights
- Focus on technologies
- New technologies: scene tracking, speaker tracking, flash detection, sentence segmentation
- Exploit technology fusion
- MAESTRO multimedia browser
Slide 4: Outline
- Goals for News-on-Demand
- Component technologies
- The MAESTRO testbed
- Information fusion
- Prosody for information extraction
- Future work
- Summary
Slide 5: High-level Goal
Develop techniques to provide direct and natural access to a large database of information sources through multiple modalities, including video, audio, and text.
Slide 6: Information We Want
- Geographical location
- Topic of the story
- News-makers
- Who or what is in the picture
- Who is speaking
Slide 7: Component Technologies
- Speech processing
  - Automatic speech recognition (ASR)
  - Speaker identification
  - Speaker tracking/grouping
  - Sentence boundary/disfluency detection
- Video analysis
  - Scene segmentation
  - Scene tracking/grouping
  - Camera flashes
- Optical character recognition (OCR)
  - Video captions
  - Scene text (light or dark)
  - Person identification
- Information extraction (IE)
  - Names of people, places, organizations
  - Temporal terms
  - Story segmentation/classification
Slide 8: Component Flowchart
Slide 9: MAESTRO
- Testbed for multimodal News-on-Demand technologies
- Links input data and component technology outputs through a common timeline
- The MAESTRO "score" visually correlates the outputs of the component technologies
- New technologies are easy to integrate thanks to a uniform data representation format
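The common-timeline idea can be sketched as a uniform, time-stamped annotation record that every component emits; the field names and the `overlapping` query below are illustrative assumptions, not MAESTRO's actual data format.

```python
from dataclasses import dataclass

# Hypothetical uniform annotation record; every component (ASR, OCR,
# speaker ID, ...) would emit these against one shared timeline.
@dataclass
class Annotation:
    source: str   # producing component, e.g. "ASR" or "speaker-id"
    start: float  # start time in seconds on the common timeline
    end: float    # end time in seconds
    label: str    # the component's output, e.g. a word or speaker name

def overlapping(annotations, t):
    """Return all annotations, from any component, active at time t."""
    return [a for a in annotations if a.start <= t <= a.end]

score = [
    Annotation("ASR", 12.0, 12.4, "tony"),
    Annotation("ASR", 12.4, 12.9, "blair"),
    Annotation("speaker-id", 10.0, 25.0, "anchor"),
]
print([a.label for a in overlapping(score, 12.5)])  # ['blair', 'anchor']
```

Because every output shares one representation, correlating modalities (the "score" view) reduces to time-range queries like this one.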
Slide 10: MAESTRO Interface
[Screenshot: the MAESTRO score with ASR output, video, and IR results]
Slide 11: The Technical Challenge
- Problem: knowledge sources are not always available or reliable
- Approaches:
  - Make existing sources more reliable
  - Combine multiple sources for increased reliability and functionality (fusion)
  - Exploit new knowledge sources
Slide 12: Two Examples
- Technology fusion: speech recognition + named-entity finding = better OCR
- New knowledge source: speech prosody for finding names and sentence boundaries
Slide 13: Fusion Ideas
- Use names of people detected in the audio track to suggest names in captions
- Use names of people detected in yesterday's news to suggest names in today's audio
- Use a video caption to identify the person speaking, then use their voice to recognize them later
Slide 14: Information Fusion
[Diagram: text recognition reads "Moore"; natural-language processing confirms it as a name and adds it to the lexicon, after which ASR recognizes "moore"]
Slide 15: Input Modalities, Technology Components, and Extracted Information
[Diagram: input modalities (video imagery, audio track, auxiliary text news sources) feed technology components (face detection/recognition, caption recognition, scene text detection/recognition, video object tracking, scene segmentation/clustering/classification, speaker segmentation/clustering/classification, audio event detection, speech recognition, name extraction, topic detection), which yield extracted information (story start/end, geographic focus, story topic, who/what is in view, who is speaking); arrows mark input processing paths and first-pass fusion opportunities]
Slide 16: Augmented Lexicon Improves Recognition Results
- Without lexicon: TONY BLAKJB, WNITEE SIATEE
- With lexicon: TONY BLAIR, UNITED STATES
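The slide's correction can be illustrated with a simple fuzzy match of noisy OCR tokens against a lexicon of names drawn from the audio track; `difflib` and the 0.5 similarity cutoff here are stand-ins for whatever matcher the actual system used.

```python
import difflib

def correct_caption(ocr_text, lexicon, cutoff=0.5):
    """Replace each OCR token with its closest lexicon entry, if any."""
    corrected = []
    for word in ocr_text.split():
        # Best fuzzy match above the similarity cutoff, else keep the token.
        matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

# Lexicon as augmented from the audio track (names found by ASR + NL).
lexicon = ["TONY", "BLAIR", "UNITED", "STATES"]
print(correct_caption("TONY BLAKJB", lexicon))    # TONY BLAIR
print(correct_caption("WNITEE SIATEE", lexicon))  # UNITED STATES
```

The key design point from the slide survives even in this toy version: the lexicon supplies priors the OCR alone lacks, so character-level errors get pulled back to real names.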
Slide 17: Prosody for Enhanced Speech Understanding
- Prosody = the rhythm and melody of speech
- Measured through duration (of phones and pauses), energy, and pitch
- Can help extract information crucial to speech understanding
- Examples: sentence boundaries and named entities
Slide 18: Prosody for Sentence Segmentation
- Finding sentence boundaries is important for information extraction and for structuring output for retrieval
- Example: "Any surprises? No. Tanks are in the area."
- Experiment: predict sentence boundaries from duration and pitch using decision-tree classifiers
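A minimal sketch of the decision-tree idea, reduced to hand-set thresholds on the two features the slide names (pause duration and pitch movement); the actual system learned its tree from data, and the thresholds below are invented for illustration.

```python
def is_sentence_boundary(pause_sec, pitch_drop_semitones,
                         pause_thresh=0.5, drop_thresh=3.0):
    """Toy decision rule: is this word boundary a sentence boundary?"""
    # A long pause alone is strong evidence of a boundary.
    if pause_sec >= pause_thresh:
        return True
    # Otherwise require a steep pitch fall (the speaker "finishing"
    # the sentence) plus at least a modest pause.
    return pitch_drop_semitones >= drop_thresh and pause_sec >= 0.2

print(is_sentence_boundary(0.8, 1.0))  # True: long pause alone suffices
print(is_sentence_boundary(0.3, 4.0))  # True: short pause, steep pitch fall
print(is_sentence_boundary(0.1, 4.0))  # False: essentially no pause
```

A learned tree does exactly this kind of threshold splitting, but picks the features and cut points to maximize accuracy on labeled boundaries.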
Slide 19: Sentence Segmentation: Results
- Baseline accuracy = 50% (equal numbers of boundaries and non-boundaries)
- Accuracy using prosody = 85.7%
- Boundaries are indicated by long pauses, low pitch before, and high pitch after
- Pitch cues work much better in Broadcast News than in Switchboard
Slide 20: Prosody for Named Entities
- Finding names (of people, places, organizations) is key to information extraction
- Names tend to be important to content and hence receive prosodic emphasis
- Prosodic cues can be detected even when words are misrecognized, so they could help find new named entities
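The emphasis cues reported two slides later (longer duration from more careful pronunciation, more within-word pitch variation) can be sketched as word-level features; the feature definitions and thresholds below are invented for illustration, not SRI's actual ones.

```python
import statistics

def emphasis_features(phone_durations, pitch_track):
    """Total word duration and within-word pitch spread (std. dev.)."""
    return sum(phone_durations), statistics.pstdev(pitch_track)

def looks_emphasized(phone_durations, pitch_track,
                     dur_thresh=0.4, var_thresh=10.0):
    """Flag a word as prosodically emphasized (thresholds are made up)."""
    dur, var = emphasis_features(phone_durations, pitch_track)
    return dur >= dur_thresh and var >= var_thresh

# A carefully pronounced name: long phones, pitch moving within the word.
print(looks_emphasized([0.12, 0.15, 0.11, 0.14], [180, 195, 210, 190]))  # True
# A short function word: brief and flat.
print(looks_emphasized([0.05, 0.06], [150, 152]))  # False
```

Because these features come from the acoustics, not the word identity, they remain usable when the recognizer outputs the wrong word, which is the point the slide makes.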
Slide 21: Named Entities: Results
- Baseline accuracy = 50%
- Using prosody only: accuracy = 64.9%
- Named entities are indicated by:
  - Longer duration (more careful pronunciation)
  - More within-word pitch variation
- Challenges:
  - Only first mentions are accented
  - Often only one word in a longer named entity is marked
  - Non-names are also accented
Slide 22: Using Prosody in NoD: Summary
- Prosody can help information extraction independently of word recognition
- Preliminary positive results for sentence segmentation and named-entity finding
- Other potential uses: topic boundaries, emotion detection
Slide 23: Ongoing and Future Work
- Combine prosody and words for name finding
- Implement additional fusion opportunities:
  - OCR helping speech
  - Speaker tracking helping topic tracking
- Leverage geographical information for recognition technologies
Slide 24: Conclusions
- News-on-Demand technologies are making great strides
- Robustness is still a challenge
- Reliability improves through data fusion and new knowledge sources