3. SPEECH RECOGNITION, ANALYSIS, AND SYNTHESIS MUSIC 318 MINI-COURSE ON SPEECH AND SINGING Science of Sound, Chapter 16 The Speech Chain, Chapters 7, 8.


SPEECH RECOGNITION OUR ABILITY TO RECOGNIZE THE SOUNDS OF LANGUAGE IS TRULY PHENOMENAL. WE CAN RECOGNIZE MORE THAN 30 PHONEMES PER SECOND. SPEECH CAN BE UNDERSTOOD AT RATES AS HIGH AS 400 WORDS PER MINUTE.

ARTICULATION TESTS A SET OF SPOKEN WORDS IS PRESENTED AND A LISTENER OR GROUP OF LISTENERS WRITES DOWN WHAT THEY HEAR. THE PERCENTAGE OF WORDS CORRECTLY HEARD IS CALLED THE ARTICULATION SCORE. ARTICULATION SCORES DEPEND UPON THE TEST WORDS USED. ONE TYPE OF WORD LIST CONSISTS OF SINGLE SYLLABLE WORDS SELECTED SO THAT SPEECH SOUNDS IN THE LISTS OCCUR WITH THE SAME RELATIVE FREQUENCY AS THEY DO IN SPOKEN ENGLISH. THESE ARE THE SO-CALLED PHONETICALLY BALANCED OR PB LISTS. ANOTHER TYPE OF WORD LIST IS MADE UP OF TWO-SYLLABLE WORDS LIKE “ARMCHAIR,” “SHOTGUN,” OR “RAILROAD” IN WHICH EACH WORD IS PRONOUNCED WITH EQUAL STRESS ON BOTH SYLLABLES.
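
The articulation score described above is a simple percentage, and can be sketched as follows. This is only an illustration: the function name `articulation_score`, the case-insensitive comparison, and the fourth test word ("doorstep") are assumptions, not part of the original slides.

```python
def articulation_score(presented, heard):
    """Percentage of presented words that the listener wrote down correctly."""
    if not presented:
        raise ValueError("the word list is empty")
    if len(presented) != len(heard):
        raise ValueError("one response is needed per presented word")
    # Compare each spoken word with the listener's response, ignoring case.
    correct = sum(p.strip().lower() == h.strip().lower()
                  for p, h in zip(presented, heard))
    return 100.0 * correct / len(presented)


# Hypothetical test: four equal-stress two-syllable words, one misheard.
presented = ["armchair", "shotgun", "railroad", "doorstep"]
heard = ["armchair", "shotgun", "railroad", "doorstop"]
score = articulation_score(presented, heard)
print(score)  # 75.0
```

A real test would average such scores over many listeners and over a full PB or two-syllable word list.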

ANALYSIS OF SPEECH THREE-DIMENSIONAL DISPLAY OF SOUND LEVEL VERSUS FREQUENCY AND TIME
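
One way to compute such a display of sound level versus frequency and time is a short-time Fourier analysis. The sketch below is a minimal pure-Python version, not the method of any particular spectrograph; the frame length, hop size, Hann window, and the 1000 Hz test tone are all assumptions chosen for illustration.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return a list of time frames; each frame lists the level (dB) per frequency bin."""
    # Hann window to reduce leakage between neighboring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        seg = [samples[start + n] * window[n] for n in range(frame_len)]
        levels = []
        # Direct DFT of the windowed segment, non-negative frequencies only.
        for k in range(frame_len // 2 + 1):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            levels.append(20 * math.log10(abs(acc) + 1e-12))
        frames.append(levels)
    return frames


# Hypothetical input: a 1000 Hz tone sampled at 8000 Hz.
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * rate / 64
print(len(spec), "frames; strongest component near", peak_hz, "Hz")
```

The direct DFT is used here only to keep the example self-contained; practical software would use an FFT. Long frames give fine frequency resolution, short frames give fine time resolution.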

SPEECH SPECTROGRAPH AS DEVELOPED AT BELL LABORATORIES (1945) DIGITAL VERSION

SPEECH SPECTROGRAM

SPEECH SPECTROGRAM OF A SENTENCE: This is a speech spectrogram

SPEECH SPECTROGRAM WITH COLOR ADDING COLOR ADDS ADDITIONAL INFORMATION

PATTERN PLAYBACK MACHINE STIMULUS PATTERN FOR PRODUCING /t/, /k/, AND /p/ SOUNDS. CONSONANT SOUNDS, WHICH CHANGE VERY RAPIDLY, ARE DIFFICULT TO ANALYZE. THE SOUND CUES BY WHICH THEY ARE RECOGNIZED OFTEN OCCUR IN THE FIRST FEW MILLISECONDS. MUCH EARLY KNOWLEDGE ABOUT THE RECOGNITION OF CONSONANTS RESULTED FROM THE PATTERN PLAYBACK MACHINE, DEVELOPED AT THE HASKINS LABORATORIES, WHICH WORKS LIKE A SPEECH SPECTROGRAPH IN REVERSE. PATTERNS MAY BE PRINTED ON PLASTIC BELTS IN ORDER TO STUDY THE EFFECTS OF VARYING THE FEATURES OF SPEECH ONE BY ONE. A DOT PRODUCES A “POP” LIKE A PLOSIVE CONSONANT.

TRANSITIONS MAY OCCUR IN EITHER THE FIRST OR SECOND FORMANT A FORMANT TRANSITION WHICH MAY PRODUCE /t/, /p/, OR /k/ DEPENDING ON THE VOWEL WHICH FOLLOWS

TRANSITIONS THAT APPEAR TO ORIGINATE FROM 1800 Hz SECOND-FORMANT TRANSITIONS PERCEIVED AS THE SAME PLOSIVE CONSONANT /t/ (after Delattre, Liberman, and Cooper, 1955)

PATTERNS FOR SYNTHESIS OF /b/, /d/, /g/ PATTERNS FOR THE SYNTHESIS OF /b/, /d/, AND /g/ BEFORE VOWELS (THE DASHED LINE SHOWS THE LOCUS FOR /d/)

PATTERNS FOR SYNTHESIZING /d/ (a) SECOND FORMANT TRANSITIONS THAT START AT THE /d/-LOCUS (b) COMPARABLE TRANSITIONS THAT MERELY “POINT” AT THE /d/-LOCUS TRANSITIONS IN (a) PRODUCE SYLLABLES BEGINNING WITH /b/, /d/, OR /g/ DEPENDING ON THE FREQUENCY LEVEL OF THE FORMANT; THOSE IN (b) PRODUCE ONLY SYLLABLES BEGINNING WITH /d/

SPEECH INTELLIGIBILITY vs SPL

FILTERED SPEECH FILTERS MAY HAVE HIGH-PASS, LOW-PASS, BAND-PASS, OR BAND-REJECT CHARACTERISTICS. SPEECH INTELLIGIBILITY IS USUALLY MEASURED BY ARTICULATION TESTS IN WHICH A SET OF WORDS IS SPOKEN AND LISTENERS ARE ASKED TO IDENTIFY THEM. ARTICULATION SCORES FOR SPEECH FILTERED WITH HIGH-PASS AND LOW-PASS FILTERS. THE CURVES CROSS OVER AT 1800 Hz WHERE THE ARTICULATION SCORES FOR BOTH ARE 67%. NORMAL SPEECH IS INTELLIGIBLE WITH BOTH TYPES OF FILTERS ALTHOUGH THE QUALITY CHANGES.

WAVEFORM DISTORTION PEAK CLIPPING IS A TYPE OF DISTORTION THAT RESULTS FROM OVERDRIVING AN AUDIO AMPLIFIER. IT IS SOMETIMES USED DELIBERATELY TO REDUCE BANDWIDTH. (a) ORIGINAL SPEECH (b) MODERATE CLIPPING (c) SEVERE CLIPPING. EVEN AFTER SEVERE CLIPPING IN (c) THE INTELLIGIBILITY REMAINS 50-90% DEPENDING ON THE LISTENER.
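
Peak clipping can be imitated numerically by hard-limiting each sample. The sketch below is an illustration under assumptions: the function name `peak_clip`, the sine-wave stand-in for speech, and the two clipping limits are not from the slides.

```python
import math


def peak_clip(samples, limit):
    """Hard-limit every sample to the range [-limit, +limit]."""
    return [max(-limit, min(limit, s)) for s in samples]


# Hypothetical stand-in for a speech waveform: two cycles of a sine wave.
wave = [math.sin(2 * math.pi * n / 40) for n in range(80)]
moderate = peak_clip(wave, 0.5)   # tops and bottoms flattened
severe = peak_clip(wave, 0.05)    # close to a square wave

# Clipping never changes the sign of a sample, so zero crossings survive.
same_sign = all((s > 0) == (c > 0) for s, c in zip(wave, severe))
print(max(moderate), max(severe), same_sign)
```

Even the severely clipped wave keeps its zero crossings, which is one plausible reason clipped speech can remain partly intelligible.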

EFFECT OF NOISE ON SPEECH INTELLIGIBILITY THE THRESHOLDS OF INTELLIGIBILITY AND DETECTABILITY AS FUNCTIONS OF NOISE LEVEL

CATEGORICAL PERCEPTION OUR EXPECTATIONS INFLUENCE OUR ABILITY TO PERCEIVE SPEECH. EXPECTATIONS ARE STRONGER WHEN THE TEST VOCABULARY HAS FEWER WORDS

SYNTHESIS OF SPEECH WHEATSTONE’S RECONSTRUCTION OF KEMPELEN’S TALKING MACHINE. AN EARLY ATTEMPT (1791) TO SYNTHESIZE SPEECH WAS VON KEMPELEN’S “TALKING MACHINE.” A BELLOWS SUPPLIES AIR TO A REED WHICH SERVES AS THE VOICE SOURCE. A LEATHER “VOCAL TRACT” IS SHAPED BY THE FINGERS OF ONE HAND. CONSONANTS ARE SIMULATED BY FOUR CONSTRICTED PASSAGES CONTROLLED BY THE FINGERS OF THE OTHER HAND.

SPEECH SYNTHESIS
ACOUSTIC SYNTHESIZERS: MECHANICAL DEVICES BY VON KEMPELEN, WHEATSTONE, KRATZENSTEIN, VON HELMHOLTZ, etc.
CHANNEL VOCODERS (voice coders): CHANGES IN INTENSITY IN NARROW BANDS ARE TRANSMITTED AND USED TO REGENERATE SPEECH SPECTRA IN THESE BANDS.
FORMANT SYNTHESIZERS: USE A BUZZ GENERATOR (FOR VOICED SOUNDS) AND A HISS GENERATOR (FOR UNVOICED SOUNDS) ALONG WITH A SERIES OF ELECTRICAL RESONATORS (TO SIMULATE FORMANTS).
LINEAR PREDICTIVE CODING (LPC): TEN OR TWELVE COEFFICIENTS ARE CALCULATED FROM SHORT SEGMENTS OF SPEECH AND USED TO PREDICT NEW SPEECH SAMPLES USING A DIGITAL COMPUTER.
HMM-BASED SYNTHESIS OR STATISTICAL PARAMETRIC SYNTHESIS: BASED ON HIDDEN MARKOV MODELS. USES MAXIMUM LIKELIHOOD TO COMPUTE WAVEFORMS.
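
The linear predictive coding idea (coefficients computed from a short segment, then used to predict new samples) can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is one common way to obtain the coefficients and is an assumption here, not necessarily the method the slides have in mind. The example uses only 2 coefficients on a sine wave so the result is easy to check; speech coders would use the ten or twelve mentioned above.

```python
import math


def lpc_coefficients(frame, order):
    """Return a[1..order] so that x[n] is predicted by sum(a[k] * x[n - k])."""
    n = len(frame)
    # Autocorrelation of the frame for lags 0..order.
    r = [sum(frame[i] * frame[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]  # prediction error energy, reduced at every step
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this step
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]


def predict_next(history, coeffs):
    """Predict the next sample from the most recent len(coeffs) samples."""
    return sum(c * history[-k] for k, c in enumerate(coeffs, start=1))


# Hypothetical test signal: a sine wave, for which the ideal 2-coefficient
# predictor is x[n] = 2*cos(w)*x[n-1] - x[n-2].
w = 2 * math.pi * 0.1
frame = [math.sin(w * n) for n in range(400)]
coeffs = lpc_coefficients(frame, order=2)
guess = predict_next(frame, coeffs)
actual = math.sin(w * 400)
print(coeffs, guess, actual)
```

Because the frame is finite, the estimated coefficients only approximate the ideal values, so the predicted sample is close to, but not exactly, the true next sample.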

AUTOMATIC SPEECH RECOGNITION BY COMPUTER AUTOMATIC SPEECH RECOGNITION IS THE “HOLY GRAIL” OF COMPUTER SPEECH RESEARCH. HUMAN LISTENERS HAVE LEARNED TO UNDERSTAND DIFFERENT DIALECTS, ACCENTS, VOICE INFLECTIONS, AND EVEN SPEECH OF RATHER LOW QUALITY FROM TALKING COMPUTERS. IT IS STILL DIFFICULT FOR COMPUTERS TO DO THIS. A COMMON STRATEGY FOR RECOGNIZING INDIVIDUAL WORDS IS TEMPLATE MATCHING. TEMPLATES ARE CREATED FOR THE WORDS IN THE DESIRED VOCABULARY AS SPOKEN BY SELECTED SPEAKERS. SPOKEN WORDS ARE THEN MATCHED TO THESE TEMPLATES, AND THE CLOSEST MATCH IS ASSUMED TO BE THE WORD SPOKEN. CONTINUOUS SPEECH RECOGNITION IS MUCH MORE DIFFICULT THAN RECOGNIZING INDIVIDUAL WORDS BECAUSE IT IS DIFFICULT TO RECOGNIZE THE BEGINNING AND END OF WORDS, SYLLABLES, AND PHONEMES.
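
Template matching for isolated words can be sketched as a nearest-template search. The slides do not say how "closest" is measured; the sketch below assumes dynamic time warping (DTW), a classic choice because it tolerates differences in speaking rate. The one-number-per-frame "features" and the two-word vocabulary are made up for illustration.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two feature sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])  # local frame-to-frame distance
            # Extend the cheapest of the three allowed warping moves.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def recognize(utterance, templates):
    """Return the vocabulary word whose template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical templates: one feature value per frame for each word.
templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 4, 2, 1, 1]}
# A slower rendition of "yes": some frames are repeated.
utterance = [1, 1, 3, 4, 4, 3, 1]
print(recognize(utterance, templates))  # yes
```

Real systems compare vectors of spectral features per frame rather than single numbers, but the matching step has the same shape.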

RECOGNIZING WORD BOUNDARIES “THE SPACE NEARBY”: WORD BOUNDARIES CAN BE LOCATED BY THE INITIAL OR FINAL CONSONANTS. “THE AREA AROUND”: WORD BOUNDARIES ARE DIFFICULT TO LOCATE.

HIDDEN MARKOV MODELS (HMMs) HIDDEN MARKOV MODEL REPRESENTATION. (a) Example of a word represented by four internal states 1, 2, 3, 4. (b) Abstract representation of (a) showing states 1-4; sequential transition probabilities a1...a4; self-transition probabilities d1...d4; and within-state probability distributions p1...p4 (DENES et al.). THE UNDERLYING MARKOV CHAINS WERE INTRODUCED (IN THE EARLY 1900s) BY RUSSIAN MATHEMATICIAN A.A. MARKOV, WHO APPLIED THEM TO LETTER STATISTICS IN LITERARY TEXTS. DURING THE 1980s HMMs BECAME THE MOST POPULAR SPEECH RECOGNITION METHOD.
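
The four-state word model in the figure can be scored against an observation sequence with the forward algorithm. The sketch below reuses the figure's names (d for self-transition, a for sequential transition, p for within-state distributions); the discrete symbols "A" to "D", the probability values, and the requirement that the model exits from state 4 are assumptions for illustration.

```python
def forward_likelihood(obs, d, a, p):
    """P(obs | word model) for a left-to-right HMM.

    d[s]: probability of staying in state s
    a[s]: probability of moving from state s to s + 1 (a[-1] leaves the word)
    p[s]: dict mapping an observed symbol to its probability in state s
    """
    n = len(d)
    alpha = [0.0] * n
    alpha[0] = p[0].get(obs[0], 0.0)  # the model must start in state 1
    for symbol in obs[1:]:
        new = [0.0] * n
        for s in range(n):
            stay = alpha[s] * d[s]
            move = alpha[s - 1] * a[s - 1] if s > 0 else 0.0
            new[s] = (stay + move) * p[s].get(symbol, 0.0)
        alpha = new
    return alpha[-1] * a[-1]  # finish by leaving the last state


# Hypothetical word model: each state emits exactly one symbol.
d = [0.5, 0.5, 0.5, 0.5]
a = [0.5, 0.5, 0.5, 0.5]
p = [{"A": 1.0}, {"B": 1.0}, {"C": 1.0}, {"D": 1.0}]

short = forward_likelihood(list("ABCD"), d, a, p)
slow = forward_likelihood(list("AABCD"), d, a, p)   # lingers in state 1
wrong = forward_likelihood(list("ABDC"), d, a, p)   # states out of order
print(short, slow, wrong)  # 0.0625 0.03125 0.0
```

In a recognizer, one such model would be kept per vocabulary word and the word whose model gives the highest likelihood would be chosen.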

SPEAKER IDENTIFICATION: VOICEPRINTS SPEECH SPECTROGRAMS PORTRAY SHORT-TERM VARIATIONS IN INTENSITY AND FREQUENCY IN GRAPHICAL FORM. THUS THEY GIVE MUCH USEFUL INFORMATION ABOUT SPEECH ARTICULATION. WHEN TWO PERSONS SPEAK THE SAME WORD, THEIR ARTICULATION IS SIMILAR BUT NOT IDENTICAL. THUS SPECTROGRAMS OF THEIR SPEECH WILL SHOW SIMILARITIES BUT ALSO DIFFERENCES.

SPECTROGRAMS OF THE SPOKEN WORD “SCIENCE.” WHICH TWO SPECTROGRAMS WERE MADE BY THE SAME SPEAKER?

THE TWO SPECTROGRAMS AT THE TOP WERE MADE BY THE SAME SPEAKER. THE TWO SPECTROGRAMS AT THE BOTTOM WERE MADE BY TWO OTHER SPEAKERS

FROM THE WINTER 2010 ISSUE OF ECHOES. SPEECH RECOGNITION CAN BE IMPROVED BY JOINT ANALYSIS OF THROAT AND ACOUSTIC MICROPHONE RECORDINGS, ACCORDING TO A PAPER IN THE SEPTEMBER ISSUE OF IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING. A PROPOSED MULTIMODAL SYSTEM IMPROVES PHONEME RECOGNITION RATE. A PAPER IN THE NOVEMBER 2010 ISSUE OF NATURE PROPOSES THAT THE AMINO ACID COMPOSITION IN THE GENE FOXP2 HAS UNDERGONE ACCELERATED EVOLUTION, AND THIS TWO-AMINO-ACID CHANGE OCCURRED AROUND THE TIME OF LANGUAGE EMERGENCE IN HUMANS AND MAY HAVE PLAYED AN IMPORTANT ROLE. HUMANS USE TACTILE INFORMATION DURING AUDITORY SPEECH PERCEPTION, ACCORDING TO A PAPER IN THE 26 NOVEMBER ISSUE OF NATURE. APPLYING TINY BURSTS OF ASPIRATION (SUCH AS WOULD BE PRODUCED BY A PLOSIVE CONSONANT) TO THE RIGHT HAND OR THE NECK MADE THE SYLLABLES MORE APT TO BE HEARD AS ASPIRATED ( RATHER THAN, FOR EXAMPLE).