Published by Felicia Owen. Modified over 8 years ago.
1
Enabling Speech & Multimodal Services on Mobile Devices
Distributed Speech Recognition
David Pearce, bdp003@motorola.com
“Where is 358 Madison Avenue”
MOTOROLA and the Stylized M Logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners. © Motorola, Inc. 2002.
2
2 Contents of this Presentation
–Motivation
–Technology Overview
–ETSI/3GPP/IETF DSR standards
  Overview
  Transport protocols
  3GPP Speech Enabled Services and Codec evaluations
–Multimodal Services Overview
–DSR in 3GPP2
–Summary
3
3 Motivation
4
4 Opportunity What if we can give 1.4 billion mobile phone users access to information by voice? How can we make performance reliable? How can we combine voice and data to get the advantages of multimodal interaction?
5
5 The keypad challenge
[Figure: handset models C650, C380, E398, V220, V80, E610, C357/C358, V60v, C510, A890, A845, i710, i325, i285, i215, i830, T725, V810, A768, MS280, MS250, V880, MS290]
6
6 Evolution of Dialogue Services (timeline 1990 to 2004)
–Command & Control (e.g., simple Call routing; VRCP; Voice Dialing): Simple ASR; isolated words, connected digits
–Prompt Constrained Natural Language (e.g., Travel reservations, Finance, Directory asst., E-mail access): Larger vocabulary, defined grammars
–Spoken Language Understanding (e.g., CRM, Help Desks, E-Commerce): Large vocabulary, NL, DM, TTS
–Problem Solving (e.g., Info Retrieval & Extraction, Multimedia, Human-like Problem Solving): Unlimited vocabulary, NL, DM, TTS, Mining
[Figure labels: Require Network-based Resources; Available on Devices]
7
7 Speech Enabled Services
Today’s Telephony Voice Services (1000’s of Applications)
–Communication assistance (Name dialling, Service Portal, Directory assistance)
  AT&T Directory Assistance: 1-800-555-1212, Motorola Directory Assistance
–Information retrieval (e.g., obtaining stock-quotes, checking local weather reports, flight schedules, movie/concert show times and locations)
  Jupiter (Weather): 1-888-573-8255, American Airlines, etc.
–M-Commerce and other transactions (e.g., buying movie/concert tickets, stock trades, banking transactions)
  Lastminute.com (hotel bookings, etc.): +44(0)870 872 6313, Charles Schwab, Fidelity
In the near future: Speech Enabled Services - Packet Data based
–Improved performance from packet data & DSR
–Enabling many more content based speech applications and Personal Information Manager (PIM) functions (e.g., making/checking appointments, managing contacts list, address book, corporate directory etc.)
  Messaging (IM, unified messaging, etc.)
  Information capture (e.g. dictation of short memos)
8
8 Voice & Multimodal – Definition
Voice-enabled Services
–User enters commands via: speech, keypad
–System responds: speech, sounds
Multimodal-enabled Services
–Keypad IN, Speech IN
–Audio OUT, Screen OUT (graphics, text)
9
9 Multimodal benefits – an enabler for mobile CONTENT
Compared to keypad interface on small mobile device:
–Improved data entry: spoken entry much easier than numeric keypad
–Enables access in hands busy/eyes busy situations
Compared to voice only interface:
–Improved dialogue efficiency
  Prompting: visual prompting is faster; guides user on domain vocabulary and structure
  Confirmation: visual feedback reduces need for spoken confirmation sub-dialogue; error correction is easier
  Persistent visual output (can also be stored)
–Rich visual or audio output: maps, pictures, video, graphics, music/audio clips
Flexibility: user chooses mode of interaction depending on:
–personal preferences
–social context (e.g. PIN number entry)
–environment (e.g. car vs train)
10
10 The Voice-enabled Content Ecosystem
[Figure labels: JEM Ecosystem; VoiceXML Ecosystem; Java Ecosystem; Enterprise; CRM Applications; Carrier-Controlled Content “Walled Garden”; World Wide Web “Open Internet”]
Overall Benefits
–Richer Content
–Richer Interaction
–Best-in-breed Technology
–Increased ARPU
11
11 Multimodal Business Drivers
–New value-added services not yet realized, new ARPU
–Difficult data entry and viewing no longer a barrier to data service adoption
–Voice fuels demand for 2.5 and 3G data services
–Enables the use of next generation messaging: can speak or key message; receives message as text, audio or video
–Addresses accessibility issues, thereby increasing user base
–Further establishes mobile phones as the primary communication and information device
–Flexibility: users have a choice
–Improves usability for data applications
–Enables rich user experience
–User requests and receives content in appropriate mode for situation: public places, on the go, in automobiles
–Benefits: increased usage, safety, privacy of data input, usability, productivity, accessibility (enable new user base)
–Mode can be mapped to users’ context
[Figure labels: End Users; Carriers]
12
12 DSR Technology Overview
13
13 Distributed Speech Recognition
Conventional: Speech Coder -> Circuit Switched Mobile Voice Channel -> Speech Decoder -> ISDN -> ASR Front-end -> ASR Decoder
DSR: ASR Front-end -> Packet Data Channel (e.g. 1x) -> ASR Decoder
[Diagram labels: Client Devices; [Wireless] Packet Data Network; Voice Gateway / Server: VoiceXML / SALT / X+V Browser, Speech Resources (ASR, TTS, etc.); IP Network; Content Servers]
14
14 Benefits of DSR
Improves performance over wireless channels
–Minimises impact of codec & channel errors
–Consistent performance over coverage area
Improved performance in background noise
–53% reduction in error rate
Ease of integration of combined speech and data applications
–Use packet data channel for both DSR and other data
15
15 ETSI/3GPP/IETF DSR Standards Overview
16
16 ETSI STQ-Aurora DSR Working Group
Highlights
–Feb 2000 Mel-Cepstrum Front-end & Compression (ES 201 108)
–Oct 2002 Advanced Front-end & Compression (ES 202 050)
  Noise Robust
  53% reduction in error rate in background noise
–Nov 2003 Extensions (ES 202 211 + ES 202 212)
  Speech waveform reconstruction
  Improved tonal language recognition
Free download: http://pda.etsi.org/pda/queryform.asp
Distributed Speech Recognition
17
17 DSR Advanced Front-end (ES 202 050)
Noise Robust Front-end
Half error rate cf mel-cepstrum in background noise
–Double Wiener filtering noise suppression
–Waveform processing
–Blind equalisation
Representation: 12 cepstral coeffs, C0, logE
Compression gives bit rate of 4.8kbit/s
[Block diagram, 8 & 16 kHz, Feature Extraction: input signal -> Noise Reduction -> Waveform Processing -> Cepstrum Calculation -> Blind Equalization -> to feature compression, with a VAD block]
18
18 DSR Extension (ES 202 212)
Enables speech waveform reconstruction at server for human listening
–Adds 800bps containing pitch (total 5.6kbps)
–Assists recogniser with tonal language recognition (e.g. Mandarin, Cantonese)
[Block diagram: Speech In -> ETSI Standard DSR Front-End (MFCC & log-E @ 4800 bps) and Pitch & Class Estimation (Pitch & Class @ 800 bps) -> CHANNEL -> DSR Back-End, Pitch Tracking and Smoothing, Speech Reconstruction -> Speech Out and Tonal Information]
19
19 DSR in 3GPP
Speech Enabled Services (SES) work item
SA1 (service requirements)
–Technical Report (speech & multimodal) “Feasibility Study for Speech Enabled Services”
–Technical Specification “Speech Recognition Framework for Automated Voice Services”
SA4 (codecs) [Oct 2002 – Feb 2004]
–Selection of codec for Speech Enabled Services
–Proposals
  AMR & AMR-WB
  DSR AFE & extension
–Selection Based on Recognition Performance
  ASR vendor evaluations: Scansoft and IBM
  5 Proprietary and 5 3GPP supplied databases
DSR selected as the recommended codec for SES (Approved June 04)
20
20 Results of ASR vendor evaluations in 3GPP Extensive testing on 21 different speech databases –Covering different languages, tasks and environments Tests performed with IBM and Scansoft commercial recognisers Results above are for low data-rate comparison for packet data (< 8kbit/s)
21
21 Packet Switched Channel Errors Aurora-3 Italian speech database GPRS network simulation for distribution of errors 20ms speech per RTP payload 3GPP Feb 2004
22
22 Codec vs DSR Experimental set-up
DSR processing chain
–Terminal: Advanced Front-End Processing -> DSR Coding (Quantiser) -> Bit-stream
–Server: DSR Decoding -> Additional server side processing -> End-pointing -> Recognition
Codec processing chain
–Terminal: AMR Encoding -> Bit-stream
–Server: AMR Decoding -> Advanced Front-End Processing -> Additional server side processing -> End-pointing -> Recognition
23
23 Coded speech vs DSR (Aurora-3 Italian)

                 DSR    AMR 4.75   Degradation
Well matched     96.5   94.4       -57%
Med mismatch     90.4   83.9       -68%
High mismatch    88.6   76.8       -104%
Average          92.4   86.3       -73%

                 DSR    EVRC       Degradation
Well matched     96.5   90.6       -165%
Med mismatch     90.4   75.9       -151%
High mismatch    88.6   70.5       -160%
Average          92.4   80.4       -159%
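The Degradation column appears to be the relative increase in word error rate when the codec replaces DSR, written with a minus sign on the slide. The sketch below assumes that reading, and also assumes the DSR and codec columns are word accuracies in percent, so that error rate = 100 - accuracy. The function name `relative_wer_change` is hypothetical. Some rows differ from the slide by a few points, presumably because the slide figures were computed from unrounded accuracies.

```python
def relative_wer_change(acc_ref: float, acc_test: float) -> float:
    """Relative change in word error rate, in percent, going from a
    reference system to a test system.

    Both arguments are assumed to be word accuracies in percent,
    so the error rate is taken as 100 - accuracy (an assumption).
    """
    wer_ref = 100.0 - acc_ref
    wer_test = 100.0 - acc_test
    return 100.0 * (wer_test - wer_ref) / wer_ref


# Rows from the AMR 4.75 table: (condition, DSR accuracy, AMR accuracy)
rows = [
    ("Med mismatch", 90.4, 83.9),
    ("High mismatch", 88.6, 76.8),
]

for name, dsr_acc, amr_acc in rows:
    change = relative_wer_change(dsr_acc, amr_acc)
    # Positive value = more errors with the codec than with DSR.
    print(f"{name}: error rate rises by about {round(change)}%")
```

For the medium and high mismatch rows this gives the same magnitudes as the table (68 and 104).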
24
24 Illustration of importance of performance
[Figure: number of users plotted against word error rate, with a marked error rate at which users become dissatisfied; the area beyond it gives the total number of dissatisfied users]
Recognition performance: 5% average error rate, 10% of users dissatisfied
Recognition performance: 4% average error rate, 3% of users dissatisfied
Improving recognition performance by 20% significantly reduces number of dissatisfied users (10% -> 3%)
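The slide does not state which distribution or threshold it uses. As a purely illustrative sketch, the code below assumes per-user error rates follow a normal distribution with a fixed spread, and solves for the spread and the dissatisfaction threshold that reproduce the two quoted points (5% average gives 10% dissatisfied, 4% average gives 3%). The model choice and the names `fit_model` and `dissatisfied_fraction` are assumptions, not part of the presentation.

```python
from statistics import NormalDist


def fit_model(mean_a, frac_a, mean_b, frac_b):
    """Find a shared standard deviation and a threshold such that a normal
    distribution centred on mean_a has frac_a of its mass above the
    threshold, and one centred on mean_b has frac_b above it.
    """
    z_a = NormalDist().inv_cdf(1.0 - frac_a)  # z-score of threshold from mean_a
    z_b = NormalDist().inv_cdf(1.0 - frac_b)  # z-score of threshold from mean_b
    # threshold = mean_a + z_a * sigma = mean_b + z_b * sigma
    sigma = (mean_a - mean_b) / (z_b - z_a)
    threshold = mean_a + z_a * sigma
    return sigma, threshold


def dissatisfied_fraction(mean, sigma, threshold):
    """Fraction of users whose error rate exceeds the threshold."""
    return 1.0 - NormalDist(mean, sigma).cdf(threshold)


# Error rates in percent; fractions taken from the slide.
sigma, threshold = fit_model(5.0, 0.10, 4.0, 0.03)
print(f"assumed spread: {sigma:.2f} points, threshold: {threshold:.2f}% error rate")
for mean in (5.0, 4.0):
    frac = dissatisfied_fraction(mean, sigma, threshold)
    print(f"average error {mean}% -> {frac:.0%} dissatisfied")
```

Under this assumed model, a one-point shift in the average moves a large share of users across the threshold because the threshold sits in the tail of the distribution, which is the point the slide illustrates.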
25
25 Transport Protocol: IETF RTP Packet Data Format
Payload can consist of any number of DSR frame pairs (12 or 14 bytes)
For details, see
–http://www.ietf.org/rfc/rfc3557.txt
–http://www.ietf.org/internet-drafts/draft-ietf-avt-rtp-dsr-codecs-03.txt
Packet layout: IP header | UDP header | RTP header | RTP Payload
[Figure: bit layout of payload octets 1 to 6 (bits 8 to 1), packing idx 0,1 (t), idx 2,3 (t), idx 4,5 (t), idx 6,7 (t), idx 8,9 (t), idx 10,11 (t) and idx 12,13 (t), with some indices continuing into the next octet]
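The frame-pair sizes tie back to the bit rates quoted earlier in the deck. The sketch below assumes one frame pair covers 20 ms of speech, which is consistent with 12 bytes at 4.8 kbit/s and 14 bytes at 5.6 kbit/s. It also assumes minimum-size IPv4, UDP and RTP headers (20 + 8 + 12 bytes) with no header compression, to show how packing more frame pairs per packet reduces overhead. The function names are hypothetical.

```python
FRAME_PAIR_MS = 20            # assumed duration of speech per frame pair
HEADER_BYTES = 20 + 8 + 12    # IPv4 + UDP + RTP minimum headers (assumed, uncompressed)


def payload_bit_rate(frame_pair_bytes: int) -> float:
    """Bit rate of the DSR payload alone, in bit/s."""
    return frame_pair_bytes * 8 * 1000 / FRAME_PAIR_MS


def on_wire_bit_rate(frame_pair_bytes: int, pairs_per_packet: int) -> float:
    """Bit rate including IP/UDP/RTP headers, in bit/s."""
    packet_bytes = HEADER_BYTES + frame_pair_bytes * pairs_per_packet
    packet_ms = FRAME_PAIR_MS * pairs_per_packet
    return packet_bytes * 8 * 1000 / packet_ms


for size in (12, 14):
    print(f"{size}-byte frame pair: {payload_bit_rate(size):.0f} bit/s payload")

for pairs in (1, 2, 4):
    rate = on_wire_bit_rate(12, pairs)
    print(f"{pairs} pair(s) per packet: {rate:.0f} bit/s with headers")
```

The trade-off is latency: each extra frame pair in a packet delays delivery of the first one by a further 20 ms under the same assumption.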
26
26 Multimodal Services Overview
27
27 Distributed Multimodal Architecture
Handset device
–Input modalities (i.e., DSR, keypad input, pen entry)
–Output media (e.g., Visual rendering, Decoded speech output)
–Application Environment (Java or Browser)
–Protocols (SIP / RTP, Multimodal remote control)
Multimodal Gateway
–DSR Decoder
–Multimodal VoiceXML browser
–Protocols
Applications and content
–Content authoring
–Content delivery
[Diagram labels: Handset (J2ME Application, Multimodal Browser, DSR Front End); PP/PP2 Network; RTP & SIP; MM Gateway (DSR, ASR Decoder, Multimodal Browser, VoiceXML); HTTP; Content Server (Multimodal Applications and content)]
28
28 DSR for Multimodal (X+V): Thin client configuration
[Diagram labels: Handset; Carrier Network; Multimodal/Voice Gateway or Server; Internet (HTTP); Application / Content Server; Web Server; Speech Resources; VoiceXML Browser; XHTML Browser; Codec; X+V Synchronization Manager; VoiceXML; XHTML; Synchronization; Audio I/O; Local Synch]
29
29 Multimodal in OMA
Multimodal work item
–Addresses the support of service delivery to devices with multimodal capabilities
BAC (Browsing and Content) WG
–Multimodal requirements complete
–Recently started the detailed architecture and technical specification work for multimodal
–Goal is to complete specifications by mid 2005
http://www.openmobilealliance.org/
30
30 DSR in 3GPP2
31
31 DSR in 3GPP2
Motivation: high performance speech and multimodal services for CDMA
New work item: “DSR codec for Speech Enabled Services”
–The purpose of this work item is to enable speech enabled and multimodal services using a Distributed Speech Recognition (DSR) optimized codec in 3GPP2, to harmonize with 3GPP and OMA, and to make it globally interoperable.
–Plan to re-use the existing DSR specs
–WI introduced December 2004
Reasons to adopt existing DSR codec
–Improved performance for CDMA services
–Proven performance improvement. ETSI/3GPP spent 4 years to select DSR!
–Time to market
–Interoperable services over GSM or CDMA
  3rd party customer premises and hosted services accessible from multiple networks
–Single standard for codec helps market growth and reduces service infrastructure costs
32
32 DSR in 3GPP2
Expected impacts are minimal
–DSR codec recommended for SES services on the device
–No anticipated changes to core network (uses existing supported protocols SIP & RTP)
The WI can deliver the following:
–DSR Extended Advanced Front-end as the recommended uplink codec
–RTP payload format for the transport of the DSR features
–Specification of the downlink codec and bit rates to be used for speech output over packet data as part of speech and multimodal services
33
33 Summary
A set of DSR standards is established
–Advanced Front-end
–Extension
–Transport protocols
SES Work Item:
–Reuse Existing Spec
–Enable Improved Performance for CDMA services
–Global Interoperability
Look forward to widespread deployment in next generation of mobile devices
“reserve me 2 seats for tonight”