1
Automatic Content Extraction for Voicemail Using Ninja
Goal: Make voicemail more accessible
  Enable faster browsing of many voicemails
  Access from different devices with different capabilities
Our solution: Use Ninja infrastructure
  Augment existing Ninja email services to treat voicemails as MIME encoded email
  Create Ninja services to process and interpret voicemails
    Specialized transcoding services
    Extracts high level information from voicemails
    Includes: audio, skimmed audio, transcript, text/audio summary, and outline
Steven Czerwinski and Barbara Hohlt
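As a rough illustration of treating a voicemail as MIME encoded email, the sketch below uses Python's standard `email` package. It is not the Ninja implementation: the function names, the `audio/wav` type, the subject line, and the idea of carrying a transcript in the text body are all assumptions made for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_voicemail_message(sender: str, recipient: str,
                            audio: bytes, transcript: str = "") -> EmailMessage:
    """Wrap raw voicemail audio in a multipart MIME message (hypothetical layout)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Voicemail"
    # Text body: a transcript if one exists, otherwise a short placeholder.
    msg.set_content(transcript or "Voicemail attached.")
    # The audio travels as a base64-encoded attachment (audio/wav is assumed).
    msg.add_attachment(audio, maintype="audio", subtype="wav",
                       filename="voicemail.wav")
    return msg


def extract_audio(raw: bytes) -> bytes:
    """Parse a serialized message and return the first audio attachment."""
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.iter_attachments():
        if part.get_content_maintype() == "audio":
            return part.get_content()
    raise ValueError("no audio attachment found")
```

Because the voicemail is an ordinary MIME message, any existing mail store or protocol can carry it unchanged, and a transcoding step only has to add or replace parts.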
2
Voicemail Processing Techniques
Speech recognition/synthesis
  Transcribe voicemail to text
  IBM ViaVoice SDK and custom audio libs
Natural language processing
  Directed word spotting by “understanding” content
  ViaVoice SRCL
Pitch
  Detecting important words by emphasized pitch
Pause
  Compression through pause removal
Spurts
  Retrieve sentence structure of voicemail
3
Architecture
[Diagram: component labels only]
Client
Media Manager Interface
Media Manager Service
Mail Access Interface
  NinjaMail
  POP
  IMAP
Folder Store
Transcoder Service
  Voicemail -> Text Transcript
  Voicemail -> Text Summary
  Voicemail -> Text Outline
  Voicemail -> Audio Summary
  Voicemail -> Skimmed Audio
  Email -> Plain Audio
4
System Components
Media manager service
  Ninja iSpace service which handles email, voicemail encoded in MIME, and specialized transcoding of voicemail
  Users can access all their mail across different mail protocols with different types of devices
Transcoder service
  Ninja iSpace service which transforms data
Folder store
  Stores user protocol information
Mail access interface
  A common interface to generalize access to different mail protocols
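The mail access interface idea can be sketched as an abstract class with one adapter per protocol. This is only an illustration in Python: the slides do not give the real interface, so `MailAccess`, `list_ids`, `fetch`, and the in-memory backend are hypothetical names, and an actual adapter would wrap a POP, IMAP, or NinjaMail client.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class MailAccess(ABC):
    """Hypothetical common interface over different mail protocols."""

    @abstractmethod
    def list_ids(self) -> List[str]:
        """Return the identifiers of all available messages."""

    @abstractmethod
    def fetch(self, message_id: str) -> bytes:
        """Return the raw MIME bytes of one message."""


class InMemoryMailAccess(MailAccess):
    """Stand-in backend for the example; holds messages in a dict."""

    def __init__(self, messages: Dict[str, bytes]):
        self._messages = dict(messages)

    def list_ids(self) -> List[str]:
        return sorted(self._messages)

    def fetch(self, message_id: str) -> bytes:
        return self._messages[message_id]


def fetch_all(backend: MailAccess) -> Dict[str, bytes]:
    """Protocol-independent caller: works with any MailAccess implementation."""
    return {mid: backend.fetch(mid) for mid in backend.list_ids()}
```

Code such as `fetch_all` depends only on the interface, which is what lets one service reach mail stored behind several protocols.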
5
Pitch Detection
The idea
  A speaker’s pitch naturally changes when introducing topics or emphasizing words [hirshberg92]
  Use pitch increases as hints for “important” words
Algorithm [aaron95]
  Determine pitch for each 20 ms frame (FFT with SHS)
  Set emphasis threshold to be top 1% of pitch values (by histogram)
  Mark 1 sec interval as emphasized if it contains >= 3 emphasized frames
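The thresholding and interval-marking steps above can be sketched as follows. The sketch assumes the per-frame pitch values have already been computed elsewhere (the FFT/SHS pitch tracker is not shown), treats a pitch of 0 as an unvoiced frame, and finds the top 1% by sorting instead of by histogram. All function names are illustrative.

```python
from typing import List

FRAME_MS = 20        # frame length from the slide
WINDOW_MS = 1000     # 1-second interval from the slide
MIN_EMPHASIZED = 3   # emphasized frames needed to mark an interval


def emphasis_threshold(pitches: List[float], top_fraction: float = 0.01) -> float:
    """Return the lowest pitch that still falls in the top fraction of voiced frames."""
    voiced = sorted(p for p in pitches if p > 0)  # assumption: 0 means unvoiced
    if not voiced:
        return float("inf")
    k = max(1, int(len(voiced) * top_fraction))
    return voiced[-k]


def emphasized_intervals(pitches: List[float], top_fraction: float = 0.01) -> List[bool]:
    """Return one flag per 1-second interval: True if it has >= 3 emphasized frames."""
    threshold = emphasis_threshold(pitches, top_fraction)
    frames_per_window = WINDOW_MS // FRAME_MS  # 50 frames per second
    flags = []
    for start in range(0, len(pitches), frames_per_window):
        window = pitches[start:start + frames_per_window]
        count = sum(1 for p in window if p >= threshold)
        flags.append(count >= MIN_EMPHASIZED)
    return flags
```

Note that ties at the threshold value are all counted as emphasized, so slightly more than 1% of frames can be flagged when many frames share the same pitch.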
6
Pause Detection
Why is pause detection useful?
  Removing pauses speeds up playback
    Typically, 50-70% of original time [foulke71]
  Long pauses signify groups (talk spurts)
Noise and soft sounds create difficulties
Algorithm: smoothed histogram [lamet81]
  Calculate energy per 10 ms frame
  Threshold based on smoothed histogram (5 dB after first peak)
  Use heuristics to remove artifacts
[Figure: histogram of average energy (dB) vs. percent of frames]
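A minimal sketch of the smoothed-histogram threshold follows. Several details are assumptions that the slide does not specify: 1 dB histogram bins, a 3-bin moving average for smoothing, and taking the first local maximum as the "first peak". The artifact-removal heuristics are not implemented.

```python
import math
from typing import List


def frame_energy_db(samples: List[float], sample_rate: int, frame_ms: int = 10) -> List[float]:
    """Mean-square energy in dB for each non-overlapping frame."""
    n = sample_rate * frame_ms // 1000
    energies = []
    for i in range(0, len(samples) - n + 1, n):
        frame = samples[i:i + n]
        mean_sq = sum(s * s for s in frame) / n
        energies.append(10 * math.log10(mean_sq + 1e-12))  # epsilon avoids log(0)
    return energies


def pause_threshold_db(energies_db: List[float], smooth: int = 3,
                       offset_db: float = 5.0) -> float:
    """Threshold = position of the first peak of the smoothed histogram + offset."""
    lo = math.floor(min(energies_db))
    hi = math.floor(max(energies_db))
    hist = [0] * (hi - lo + 1)          # 1 dB bins (assumption)
    for e in energies_db:
        hist[math.floor(e) - lo] += 1
    half = smooth // 2
    smoothed = []
    for i in range(len(hist)):          # moving-average smoothing
        window = hist[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    peak = 0
    for i, value in enumerate(smoothed):  # first local maximum
        left = smoothed[i - 1] if i > 0 else -1
        right = smoothed[i + 1] if i + 1 < len(smoothed) else -1
        if value > left and value >= right:
            peak = i
            break
    return lo + peak + offset_db


def mark_pauses(energies_db: List[float]) -> List[bool]:
    """True for every frame whose energy falls below the threshold."""
    threshold = pause_threshold_db(energies_db)
    return [e < threshold for e in energies_db]
```

The first histogram peak is taken to be the background-noise level, so frames within 5 dB of it are treated as silence. Runs of `True` flags can then be dropped for faster playback, and long runs can be used as talk-spurt boundaries.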
7
Examples
Original Voicemail:
  “Hello, This is Barbara. How are you and the cats doing? I was wondering if you would feed them a little more the first time in case they eat too much. My number is (713) 465-5155. You can call me anytime. Have a very good holiday. Bye bye”
Processed Voicemail:
  Translated talk spurts (pitch emphasized words in green; audio versions: skimmed, just pitch)
    Phyllis Barbara Area in the cat staring And then if you run but feed them A little more the first time in case they eat too much On my number is (713) 465-5155 You can call me anytime. Have every holiday Of light
  Translated using NLP
    Hello this is Barbara
    My number is (713) 465-5155
8
Results
Pause detection
  Worked well for given applications
  Playback speedup by 50-70%
Pitch detection
  Problems due to high pitch sounds and transitions
Speech recognition
  Performance decrease in conversational settings
Natural language processing
  Performed well with small grammar
9
Conclusion
Overall
  System useful as navigational hints
  To achieve total comprehension, need better voice recognition
What works well
  Skimming using pause removal
  Detecting spurts for structure
What needs work
  Speech detection in conversational settings
  Pitch emphasis needs refining
Future directions
  Pause detection/word boundaries using speech detection
  Developing voicemail grammars
  Using NLP feedback with pitch emphasis detection
  Improved speech detection in noisy environments