1
Spoken Language Processing: Summing Up
Julia Hirschberg CS 4706 12/2/2018
2
What We’ve Studied
Speech phenomena: What can people convey by varying the way they say something? How do we identify this kind of variation? What tools do we have for analysis?
Speech generation (TTS)
Speech recognition (ASR) and understanding (ASRU)
Applications for speech technologies
3
What phenomena vary in speech?
Intonational contours (ToBI)
Phrasing: scope
Accent: focus, given/new
Overall contour: speech acts
Pitch range, timing: topic structure
Voice quality, intensity, …
Emotion, deception?, charisma?
4
Analyzing Speech: At the Acoustic Level
How do we capture speech data for analysis? Digitizing: sampling, quantization, filtering
How can we distinguish one speech sound from another?
Periodic vs. aperiodic waveforms
Characterizing periodic waveforms: cycle, period, phase
Displaying and analyzing spectra and pitch tracks
Comparing intensity (dB)
Tools to do all this and more: Praat
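To make these measures concrete, here is a minimal sketch (plain NumPy rather than Praat; the 200 Hz frequency, 16 kHz sampling rate, and 0.1 s duration are arbitrary illustrative choices) that synthesizes a periodic waveform, recovers its period and fundamental frequency from the autocorrelation, and reports its level in dB relative to an arbitrary reference amplitude.

```python
import numpy as np

# Synthesize 0.1 s of a 200 Hz sine wave sampled at 16 kHz
# (all values here are arbitrary, illustrative choices).
fs = 16000                            # sampling rate in Hz
t = np.arange(int(0.1 * fs)) / fs     # 0.1 second of time points
x = 0.5 * np.sin(2 * np.pi * 200 * t)

# Estimate the period via autocorrelation: for a periodic signal, the
# first peak away from lag 0 falls at the period (in samples).
ac = np.correlate(x, x, mode="full")[len(x) - 1:]
peak_lag = np.argmax(ac[50:]) + 50    # skip very short lags (very high F0s)
period = peak_lag / fs                # period in seconds
f0 = 1.0 / period                     # fundamental frequency in Hz

# Intensity in dB: 20 * log10 of the RMS amplitude relative to a reference.
rms = np.sqrt(np.mean(x ** 2))
db = 20 * np.log10(rms / 1.0)         # reference amplitude of 1.0 is arbitrary

print(f"estimated F0 ~ {f0:.1f} Hz, period ~ {period * 1000:.2f} ms, level ~ {db:.1f} dB")
```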
6
Analyzing Speech: At the Phonetic Level
Can we distinguish different languages in terms of their phoneme sets?
Are there universal constraints on possible speech sounds? Articulatory constraints
How do we characterize the sounds of a given language?
Acoustic differences associated with place and manner of articulation distinguish consonants
Vowels differ in their formant frequencies
Do we use such information in speech technologies?
7
Articulators in action
French Canadian subjects from the early 1970s (Sample from the Queen’s University / ATR Labs X-ray Film Database) “Why did Ken set the soggy net on top of his deck?”
8
Articulatory parameters for English consonants (in ARPAbet)
PLACE OF ARTICULATION runs across the columns, MANNER OF ARTICULATION down the rows. VOICING: where a cell holds a pair, the voiceless consonant is listed before its voiced counterpart.

         bilabial  labio-dental  inter-dental  alveolar  palatal  velar  glottal
stop     p b                                   t d                k g    q
fric.              f v           th dh         s z       sh zh           h
affric.                                                  ch jh
nasal    m                                     n                  ng
approx   w                                     l/r       y
flap                                           dx
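One way to read the chart programmatically is as a lookup table from ARPAbet symbol to (place, manner, voicing); the sketch below covers only a subset of the consonants above and is meant purely as an illustration of the data structure, not as a complete feature inventory.

```python
# Place, manner, and voicing for a few ARPAbet consonants,
# read directly off the chart above (not a complete inventory).
CONSONANTS = {
    "p":  ("bilabial",     "stop",      "voiceless"),
    "b":  ("bilabial",     "stop",      "voiced"),
    "t":  ("alveolar",     "stop",      "voiceless"),
    "d":  ("alveolar",     "stop",      "voiced"),
    "k":  ("velar",        "stop",      "voiceless"),
    "g":  ("velar",        "stop",      "voiced"),
    "f":  ("labio-dental", "fricative", "voiceless"),
    "v":  ("labio-dental", "fricative", "voiced"),
    "s":  ("alveolar",     "fricative", "voiceless"),
    "z":  ("alveolar",     "fricative", "voiced"),
    "ch": ("palatal",      "affricate", "voiceless"),
    "jh": ("palatal",      "affricate", "voiced"),
    "m":  ("bilabial",     "nasal",     "voiced"),
    "n":  ("alveolar",     "nasal",     "voiced"),
    "ng": ("velar",        "nasal",     "voiced"),
}

def features(phone: str):
    """Return (place, manner, voicing) for an ARPAbet consonant."""
    return CONSONANTS[phone]

print(features("ch"))   # ('palatal', 'affricate', 'voiceless')
```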
9
American English vowel space
[Vowel space chart with FRONT–BACK on the horizontal axis and HIGH–LOW on the vertical axis, locating the ARPAbet vowels iy, ih, ey, eh, ae (front), ix, ux, ax, ah (central), uw, uh, ow, ao, aa (back), and the diphthongs ay, aw, oy]
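A toy illustration of how formant frequencies separate vowels: the F1/F2 values below are rough averages in the spirit of the classic adult-male measurements (actual values vary widely by speaker and study), used here only for a hypothetical nearest-neighbor vowel guesser.

```python
import math

# Approximate average (F1, F2) in Hz for some American English monophthongs.
# These are illustrative ballpark figures; real values vary by speaker.
VOWEL_FORMANTS = {
    "iy": (270, 2290),   # high front
    "ih": (390, 1990),
    "eh": (530, 1840),
    "ae": (660, 1720),   # low front
    "aa": (730, 1090),   # low back
    "ao": (570,  840),
    "uh": (440, 1020),
    "uw": (300,  870),   # high back
    "ah": (640, 1190),   # mid central
}

def nearest_vowel(f1: float, f2: float) -> str:
    """Guess the vowel whose (F1, F2) point lies closest to the measurement."""
    return min(VOWEL_FORMANTS,
               key=lambda v: math.dist((f1, f2), VOWEL_FORMANTS[v]))

print(nearest_vowel(300, 2200))   # 'iy': low F1 + high F2 = high front vowel
```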
10
Analyzing Speech: At the Phonological Level
How do people develop models of intonation? ToBI
Tones: pitch accents, phrase accents, boundary tones
Break indices
Hand labeling vs. automatic analysis: which provides more useful information?
11
[Schematic pitch contours illustrating the pitch accents L*+H, L*, and H* combined with the phrase accent/boundary tone sequences H-H%, H-L%, L-H%, and L-L%]
12
[Schematic pitch contours illustrating the pitch accents H*, !H*, H+!H*, and L+H* combined with the phrase accent/boundary tone sequences H-H%, H-L%, L-H%, and L-L%]
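As a purely hypothetical sketch of what a ToBI annotation contains (the utterance and every label below are invented for illustration, not taken from any labeled corpus), the tone and break-index tiers can be represented as word-aligned lists:

```python
# A hypothetical ToBI-style annotation of "Marianna made the marmalade",
# represented as parallel word, tone, and break-index tiers.  The labels
# are illustrative only, not a labeling of any actual recording.
words  = ["Marianna", "made", "the", "marmalade"]
tones  = ["H*",        None,   None,  "H* L-L%"]   # pitch accents; final phrase accent + boundary tone
breaks = [1,           1,      1,     4]           # break index 4 = full intonational phrase boundary

for w, t, b in zip(words, tones, breaks):
    print(f"{w:<10} tone={t or '-':<9} break={b}")
```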
13
Speech Generation
Synthesis then and now
Open problems in TTS:
Pronunciation modeling: OOV words, homographs, abbreviations (see the sketch below)
Predicting pitch accents and phrase boundaries: corpus-based approaches
Information status: focus, given/new
Modeling discourse structure
Producing emotional speech
Evaluation
Then: the Voder, 1939 (Bell Labs); the 1951 Pattern Playback (Haskins); Walter Lawrence’s Parametric Artificial Talker (1953); Gunnar Fant’s Orator Verbis Electris (1953)
Now: Nuance; Rhetorical; AT&T
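For the pronunciation-modeling bullet above, one simplistic strategy is to key homograph pronunciations on part of speech; the sketch below is a toy illustration with made-up POS tags and ARPAbet strings, not a description of any production TTS front end.

```python
# Toy homograph disambiguation: pick an ARPAbet pronunciation by part of
# speech.  Real TTS front ends use richer context; the POS tag names here
# are invented for illustration.
HOMOGRAPHS = {
    ("lead", "NOUN"):      "L EH1 D",   # the metal
    ("lead", "VERB"):      "L IY1 D",   # to lead
    ("read", "VERB_PAST"): "R EH1 D",
    ("read", "VERB_PRES"): "R IY1 D",
}

def pronounce(word: str, pos: str) -> str:
    """Return an ARPAbet pronunciation for a homograph given its POS tag."""
    return HOMOGRAPHS[(word.lower(), pos)]

print(pronounce("lead", "NOUN"))   # L EH1 D
```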
14
Speech Recognition/Understanding
ASR then and now: from speaker-dependent digit recognition using analog circuits to HMM-based speaker-independent recognition of spontaneous speech by computer
Open problems:
Segmentation: sentence, speaker, topic
OOV recognition
Handling disfluencies
Evaluation: transcription, semantic, task-based? (a word error rate sketch is given below)
Recognizing emotion and other types of speaker state
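Transcription evaluation is conventionally scored as word error rate: the minimum number of substitutions, insertions, and deletions needed to turn the reference into the hypothesis, divided by the number of reference words. A minimal dynamic-programming sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with the standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# 4 edits (2 substitutions + 2 insertions) over 2 reference words = 2.0
print(word_error_rate("recognize speech", "wreck a nice beach"))
```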
15
Spoken Dialogue Systems
Integrating TTS and ASR with dialogue management and task-based components
Open questions:
Improving ASR accuracy
Recognizing dialogue acts
Turn-taking behavior
Confirmation strategies and initiative (see the sketch below)
Entrainment and ‘personality’
Evaluation
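To make the confirmation-strategy bullet concrete, a common textbook design lets the dialogue manager choose explicit or implicit confirmation from the ASR confidence score; the thresholds and prompts below are invented for illustration only.

```python
def confirmation_prompt(slot_value: str, asr_confidence: float) -> str:
    """Choose a confirmation strategy from ASR confidence.
    The thresholds (0.8, 0.5) and prompt wording are illustrative choices."""
    if asr_confidence >= 0.8:
        # High confidence: implicit confirmation, folded into the next prompt.
        return f"Okay, {slot_value}. What time would you like to leave?"
    elif asr_confidence >= 0.5:
        # Medium confidence: explicit confirmation before moving on.
        return f"Did you say {slot_value}?"
    else:
        # Low confidence: re-prompt rather than confirm.
        return "Sorry, I didn't catch that. Where would you like to go?"

print(confirmation_prompt("Boston", 0.92))
print(confirmation_prompt("Austin", 0.55))
```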
16
Recognizing Speaker State and Diagnosis
Emotional speech
Voice quality
Deceptive speech
Charismatic speech
Customer care rep evaluation
Medical diagnosis: paranoia and other psychiatric disorders; cancer patient prognosis
17
Take-Home Final
Due: May 14 by 4:10 pm
Submission instructions:
This examination is designed to test your ability to synthesize information and to perform critical analysis of published research.
Choose 3 of the following 4 questions to answer.
Each question should be answered with specific reference to the readings specified, all of which are linked to the syllabus for the class on the date given. That is, cite articles with page numbers to support claims about authors’ findings or claims, e.g., “McLeod et al. (1998) claim that existing Spoken Dialogue Systems’ major drawback is their lack of delightful personalities (p. 4).”
Do not attempt to answer the questions until you have read and understood the specified articles. Essays that do not show evidence of this understanding will not receive high marks.
Each essay will be worth 33 1/3 points.
Each essay should be no more than 1200 words in length; only the first 1200 words of each essay will be graded, so please do not exceed this limit. If you can answer the question in a shorter essay, feel free to do so.
Please use plain ASCII or Word and report word counts for each essay.
18
Sample Question
Agree or disagree: “It is more difficult to recognize deception automatically from acoustic/prosodic and lexical cues than from visual cues obtained from face or body gesture.” Use the readings assigned for April 28 to support your answer.
Show that you understand the question and are answering it, e.g., “I believe that it is more difficult to recognize deception automatically from visual cues than from acoustic/prosodic and lexical cues.”
For agree/disagree questions, decide whether you basically agree or disagree, e.g., “While there are difficulties recognizing deception from both types of cues, I believe it is more difficult to recognize deception from visual cues than from language-based cues.”
19
Provide evidence on both sides of the question
“While both audio and visual cues require high-quality recordings, audio recordings must be obtained in a quiet environment, whereas video recordings can be obtained in a wider variety of situations, provided that equipment is available.”
“While Mehrabian (1971) found significant effects for both visual and language-based cues, the particular language cues he identified in this study would seem to be easier to recognize automatically than the visual cues: for example, it should be easier to identify amount of speech and speaking rate than features such as ‘rocking gestures’ and ‘leg and foot movements’.”
Support your statements with specific reference to your sources, e.g., “DePaulo et al. (1983) find that…” or “Motivation greatly influences subjects’ ability to control their verbal cues (DePaulo et al., 1983).”
20
When in doubt, cite.