Published by Giles Randell Leonard. Modified over 6 years ago.
1
Speech Processing August 10, 2005 11/10/2018
2
CS 224S / LINGUIST 236 Speech Recognition and Synthesis
Dan Jurafsky
Lecture 1: Overview and Articulatory Phonetics
3
Phonetics: definitions
Phonetics: the scientific study of linguistic speech sounds
  how they are produced: articulatory
  how they are perceived: auditory
  their physical aspects: acoustic
4
Decoding the speech stream
No spaces between words
  Yet we hear separate words in our own languages
No individual sounds
  Yet we hear the individual sounds of our own languages
Yet all languages treat linguistic units as discrete (sounds, words, sentences, etc.)
5
Transcription
The most widely used tool in phonetics is transcription
A standardized set of symbols for converting the continuous acoustic stream into discrete, linguistically relevant symbolic units
The International Phonetic Alphabet is the most widely used transcription tool.
6
The International Phonetic Alphabet (IPA)
Necessary because:
1.) Inadequacy of orthography (spelling):
  a.) one letter/digraph, different sounds: laugh ([f]), bright (ø, i.e. silent), ghost ([g])
  b.) one sound, different letters: believe ([i], “ee”), people ([i]), tree ([i])
2.) Cross-linguistic variation in orthography: different languages have different ways of representing the same sound:
  a.) [k]: ch in Italian (Chianti)
  b.) the initial sound in “church” is written ci in Italian (ciao)
3.) A single sound is represented by more than one spelling: gh = [f] in “laugh”, f = [f] in “fall”, etc.
7
The International Phonetic Alphabet (IPA)
The IPA: a consistent alphabet that is accepted as the standard international phonetic alphabet
Phonetic transcription is in square brackets: [aI pi eI].
IPA chart (and fonts) available:
8
Definitions
Phonology: The study of the inventory of sounds in a language, and of how human speech sounds may pattern together or contrast, i.e. the study of how speech sounds are organized.
Phoneme: A minimal unit of sound in a language, necessary for understanding how contrasts are made.
Minimal pair (minimal set): Words identical in form except for one sound segment that occurs in the same place in a string.
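The minimal-pair definition can be checked mechanically. The sketch below is only an illustration, assuming words are given as lists of ARPAbet phone symbols; the function name is_minimal_pair is hypothetical.

```python
def is_minimal_pair(phones_a, phones_b):
    """True if the two phone sequences have the same length and differ
    in exactly one segment occupying the same position."""
    if len(phones_a) != len(phones_b):
        return False
    differences = sum(1 for a, b in zip(phones_a, phones_b) if a != b)
    return differences == 1


# Illustrative ARPAbet transcriptions (assumed, not taken from a dictionary).
pad = ["p", "ae", "d"]
bad = ["b", "ae", "d"]
spat = ["s", "p", "ae", "t"]

print(is_minimal_pair(pad, bad))   # True: only the first segment differs
print(is_minimal_pair(pad, spat))  # False: different number of segments
```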
9
Articulators
Figure labels: alveolar ridge, palate, teeth, velum, uvula, lips, pharyngeal, larynx, vocal folds: glottis, trachea
10
Places of articulation
Figure labels: alveolar, post-alveolar/palatal, dental, velar, uvular, labial, pharyngeal, laryngeal/glottal
11
Articulatory parameters for English consonants (in ARPAbet)
Columns: PLACE OF ARTICULATION. Rows: MANNER OF ARTICULATION.
VOICING: in each pair the voiceless consonant is listed first, the voiced one second.

          bilabial  labio-dental  inter-dental  alveolar  palatal  velar  glottal
stop      p b                                   t d                k g    q
fric.               f v           th dh         s z       sh zh
affric.                                                   ch jh
nasal     m                                     n                  ng
approx    w                                     l/r       y
flap                                            dx
12
American English vowel space
Figure: vowel chart with axes FRONT to BACK and HIGH to LOW
Vowels plotted: iy ih eh ae aa ao uw uh ah ax ix ux ey ow aw oy ay
13
Acoustic landmarks
“Patricia and Patsy and Sally”
Phones labeled in the figure: [p] [t] [ix] [ih] [ax] [ae] [iy] [sh] [s] [n] [l]
14
Syllables
Syllabification important for:
  pronunciation: deny/denim
  speaking rate calculation: syllables per second
  word recognition in ASR
(onset) + nucleus + (coda): c-a-t, a, a-t, t-o
Lexical stress: primary, secondary, tertiary (telephone)
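The (onset) + nucleus + (coda) template can be sketched in code. This is a minimal illustration for single-syllable inputs only, assuming ARPAbet phone lists and the vowel set from the vowel-space slide; split_syllable is a hypothetical name, and real syllabification of longer words needs more than this.

```python
# Vowel symbols as listed on the American English vowel space slide.
ARPABET_VOWELS = {
    "iy", "ih", "eh", "ae", "aa", "ao", "uw", "uh", "ah",
    "ax", "ix", "ux", "ey", "ow", "aw", "oy", "ay",
}


def split_syllable(phones):
    """Split one syllable into (onset, nucleus, coda).

    Onset and coda are (possibly empty) lists of consonants;
    the nucleus is the single vowel.
    """
    vowel_positions = [i for i, p in enumerate(phones) if p in ARPABET_VOWELS]
    if len(vowel_positions) != 1:
        raise ValueError("expected exactly one vowel nucleus")
    i = vowel_positions[0]
    return phones[:i], phones[i], phones[i + 1:]


print(split_syllable(["k", "ae", "t"]))  # cat: (['k'], 'ae', ['t'])
print(split_syllable(["ae", "t"]))       # at:  ([], 'ae', ['t'])
print(split_syllable(["t", "uw"]))       # to:  (['t'], 'uw', [])
```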
15
Phonological Rules
Not all instances of a given phone [x] sound/look alike
Phoneme /x/ may have many allophones
Phonological rules map phonemes in context to allophones, e.g.
  simple rules: /{t,d}/ --> [ɾ] / V’ _ V (flapping; the flap is dx in ARPAbet)
  FSA’s, FST’s
  declarative constraints: t: V’ _ V
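A context rule of this kind can be applied directly to a phone string. The sketch below assumes the rule above is the flapping rule (output dx, the ARPAbet flap) and that stressed vowels carry a trailing "1", a CMUdict-style convention that is an assumption here, not something the slides specify. apply_flapping is a hypothetical name.

```python
VOWELS = {
    "iy", "ih", "eh", "ae", "aa", "ao", "uw", "uh", "ah",
    "ax", "ix", "ux", "ey", "ow", "aw", "oy", "ay",
}


def is_vowel(phone):
    # Strip an optional stress digit before looking the symbol up.
    return phone.rstrip("012") in VOWELS


def is_stressed_vowel(phone):
    return is_vowel(phone) and phone.endswith("1")


def apply_flapping(phones):
    """Rewrite t or d as dx when it sits between a stressed vowel and a vowel."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (phones[i] in ("t", "d")
                and is_stressed_vowel(phones[i - 1])
                and is_vowel(phones[i + 1])):
            out[i] = "dx"
    return out


print(apply_flapping(["s", "ih1", "t", "iy"]))   # city: t becomes dx
print(apply_flapping(["ax", "t", "ae1", "k"]))   # attack: unchanged
```

The same rule could be compiled into a finite-state transducer; the loop form is used here only because it is short.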
16
Allophones of /t/
What we would consider a single ‘sound’ can be pronounced differently depending on the phonetic context. For example, the phoneme /t/:
Figure 4.8: Jurafsky & Martin (2000), page 104.
17
Application: Word Pronunciation for TTS
Pronouncing dictionaries (the: [‘dhax], [‘dhiy])
Problems:
  Homographs (wind/wind, desert/desert)
  Abbreviation (Dr., St.)
  Numbers ( )
  Acronyms (NAACL, IDIAP)
  Morphological variation (unrelentingly)
  Proper names and unknown words
rules + dictionaries/dictionaries + rules
18
Unit Selection TTS Overview
Collect lots of speech (5-50 hours) from one speaker, transcribe very carefully, all the syllables and phones and whatnot
To synthesize a sentence, patch together syllables and phones from the training data.
Paradigm: search
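The "search" paradigm can be illustrated with a small dynamic-programming (Viterbi-style) search over candidate units. Everything concrete below is an assumption made for the sketch: units are (phone, pitch in Hz) pairs, the target cost penalizes distance from a desired pitch of 120 Hz, and the join cost is the pitch jump between neighbours. Real systems use richer costs; select_units is a hypothetical name.

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per position minimizing total target + join cost.

    candidates: list (one entry per target position) of non-empty unit lists.
    Returns (best unit sequence, total cost).
    """
    # best[i][j] = (cost of cheapest path ending in candidates[i][j], backpointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(prev, u)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_best] + target_cost(i, u), k_best))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy database: two recorded candidates for each phone of "cat".
candidates = [
    [("k", 100), ("k", 120)],
    [("ae", 125), ("ae", 180)],
    [("t", 118), ("t", 90)],
]
path, total = select_units(
    candidates,
    target_cost=lambda i, u: 0.1 * abs(u[1] - 120),  # assumed target pitch
    join_cost=lambda prev, u: abs(prev[1] - u[1]),   # assumed join cost
)
print(path)
```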
19
More phonetic structure
Syllables
  Composed of vowels and consonants. Not well defined.
  Something like a “vowel nucleus with some of its surrounding consonants”.
Stress
  Some syllables have more energy than others
  Stressed syllables versus unstressed syllables
  (an) ‘INsult vs. (to) in’SULT
  (an) ‘OBject vs. (to) ob’JECT
  Unstressed vowels are generally transcribed as schwa: ax
20
Acoustic Phonetics Sound Waves
21
Waveforms for speech
Waveform of the vowel [iy]
Frequency: repetitions/second of a wave
  Above vowel has 28 reps in .11 secs
  So freq is 28/.11 = 255 Hz (rounded)
  This is the rate at which the vocal folds vibrate, hence voicing
Amplitude: y axis: amount of air pressure at that point in time
  Zero is normal air pressure, negative is rarefaction
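The frequency arithmetic is just cycles divided by duration. A small sketch, using the [iy] numbers from this slide and the [ae] numbers quoted later in the deck (9 repetitions in .036 seconds); the function names are hypothetical.

```python
def frequency_hz(cycles, duration_s):
    """Frequency = number of repetitions per second."""
    return cycles / duration_s


def period_ms(freq_hz):
    """Duration of one repetition, in milliseconds."""
    return 1000.0 / freq_hz


print(round(frequency_hz(28, 0.11)))   # 255 (the [iy] vowel)
print(round(frequency_hz(9, 0.036)))   # 250 (the [ae] vowel)
print(round(period_ms(250), 1))        # 4.0 ms per cycle at 250 Hz
```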
22
What can we learn from a wavefile?
Utterance: “She just had a baby”
Vowels are voiced, long, loud
Length in time = length in space in waveform picture
Voicing: regular peaks in amplitude
When stops closed: no peaks: silence.
Peaks = voicing: .46 to .58 seconds (vowel [iy]), .65 to .74 seconds (vowel [ax]), and so on
Silence of stop closure (1.06 to 1.08 for first [b], or 1.26 to 1.28 for second [b])
Fricatives like [sh]: intense irregular pattern; see .33 to .46
23
Examples from Ladefoged
pad, bad, spat
24
Spectrum
The frequency components
Fourier analysis: every wave can be represented as a sum of many simple waves of different frequencies.
Articulatory facts:
  The vocal cord vibrations create harmonics
  The mouth is an amplifier
  Depending on shape of mouth, some harmonics are amplified more than others
25
Part of [ae] waveform from “had”
Note complex wave repeating nine times in figure
Plus a smaller wave which repeats 4 times for every large pattern
Large wave has frequency of 250 Hz (9 times in .036 seconds)
Small wave roughly 4 times this, or roughly 1000 Hz
Two little tiny waves on top of peak of 1000 Hz waves
26
A spectrum
Spectrum represents these freq components
Computed by Fourier transform, an algorithm which separates out each frequency component of a wave.
x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
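To see how a Fourier transform recovers such peaks, the sketch below builds a synthetic signal from three sine waves at the peak frequencies quoted above and runs a plain discrete Fourier transform over it. The sample rate, signal length, and component amplitudes are assumptions chosen so each frequency falls exactly on a DFT bin; this is not the slide's actual recording, and a real system would use an FFT library rather than this O(N^2) loop.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 Hz up to (not including) Nyquist."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        mags.append(abs(total))
    return mags


SAMPLE_RATE = 8000          # samples per second (assumed)
N = 800                     # 0.1 s of signal, so bins are 10 Hz apart
components = [(930, 1.0), (1860, 0.6), (3020, 0.3)]  # (Hz, amplitude)

signal = [sum(a * math.sin(2 * math.pi * f * t / SAMPLE_RATE)
              for f, a in components)
          for t in range(N)]

mags = dft_magnitudes(signal)
bin_hz = SAMPLE_RATE / N
strongest = sorted(range(len(mags)), key=mags.__getitem__, reverse=True)[:3]
peaks = [k * bin_hz for k in sorted(strongest)]
print(peaks)  # [930.0, 1860.0, 3020.0]
```

Plotting 20 * log10 of each nonzero magnitude against frequency would give the decibel spectrum described on the slide.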
27
Spectrogram
28
Formants
Vowels largely distinguished by 2 characteristic pitches.
One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u (whisper iy eh uw)
The other goes up for the first four vowels and then down for the next four.
  creaky voice iy ih eh ae (goes up)
  creaky voice aa ow uh uw (goes down)
These are called "formants" of the vowels; the lower is the 1st formant, the higher is the 2nd formant.
29
CS 224S / LINGUIST 236 Speech Recognition and Synthesis
Dan Jurafsky
Lecture 3: TTS Overview, History, and Letter-to-Sound
IP Notice: lots of info, text, and diagrams on these slides come (thanks!) from Alan Black’s excellent lecture notes and from Richard Sproat’s great new slides.
30
Modern TTS systems
1960’s: first full TTS: Umeda et al (1968)
1970’s: Joe Olive 1977, concatenation of linear-prediction diphones; Speak and Spell
1980’s: 1979 MIT MITalk (Allen, Hunnicut, Klatt)
1990’s-present: Diphone synthesis; Unit selection synthesis
31
Types of Modern Synthesis
Articulatory Synthesis: Model movements of articulators and acoustics of vocal tract
Formant Synthesis: Start with acoustics, create rules/filters to create each formant
Concatenative Synthesis: Use databases of stored speech to assemble new utterances.
Text from Richard Sproat slides
32
Concatenative Synthesis
All current commercial systems.
Diphone Synthesis
  Units are diphones; middle of one phone to middle of next.
  Why? Middle of phone is steady state.
  Record 1 speaker saying each diphone
Unit Selection Synthesis
  Larger units
  Record 10 hours or more, so have multiple copies of each unit
  Use search to find best sequence of units
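Since a diphone runs from the middle of one phone to the middle of the next, a phone string maps onto the sequence of adjacent phone pairs. A minimal sketch, assuming the utterance is padded with a silence symbol at both ends; the label "pau" and the name to_diphones are assumptions for illustration.

```python
def to_diphones(phones, silence="pau"):
    """List the diphone units needed to synthesize a phone sequence."""
    padded = [silence] + list(phones) + [silence]
    # Each diphone spans the transition between two neighbouring phones.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(to_diphones(["k", "ae", "t"]))
# ['pau-k', 'k-ae', 'ae-t', 't-pau']
```

A sequence of n phones needs n + 1 diphones under this padding, which is why the recorded inventory must cover every phone-to-phone transition.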
33
TTS Demos (all are Unit-Selection)
ATT:
Rhetorical (= Scansoft)
Festival
Cepstral
IBM
34
TTS Architecture
Raw Text in
Text Analysis
  Text Normalization
  Part-of-Speech tagging
  Homonym Disambiguation
Phonetic Analysis
  Dictionary Lookup
  Grapheme-to-Phoneme (LTS)
Prosodic Analysis
  Boundary placement
  Pitch accent assignment
  Duration computation
Waveform synthesis
Speech out
35
Text Normalization
Analysis of raw text into pronounceable words
Sample problems:
  The robbers stole Rs 100 lakhs from the bank
  It's 13/4 Modern Ave.
  The home page is
  yes, see you the following tues, that's 23/08/05
Steps
  Identify tokens in text
  Chunk tokens into reasonably sized sections
  Map tokens to words
  Identify types for words
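The token-to-word steps can be sketched on a tiny scale. The example below is a toy under stated assumptions: whitespace tokenization, a three-entry abbreviation table that ignores ambiguity (Dr. could also be "drive"), and numbers limited to 0-999 read as cardinals. All names are hypothetical, and none of the harder sample problems above (dates, currency, fractions) are handled.

```python
import re

# Assumed, deliberately tiny abbreviation table.
ABBREVIATIONS = {"ave.": "avenue", "dr.": "doctor", "tues": "tuesday"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out a cardinal number in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def classify(token):
    """Identify the type of a token."""
    if token.lower() in ABBREVIATIONS:
        return "abbreviation"
    if re.fullmatch(r"[0-9]{1,3}", token):
        return "number"
    return "word"


def normalize(text):
    """Map each whitespace-separated token to pronounceable words."""
    words = []
    for token in text.split():
        kind = classify(token)
        if kind == "abbreviation":
            words.append(ABBREVIATIONS[token.lower()])
        elif kind == "number":
            words.append(number_to_words(int(token)))
        else:
            word = token.strip(".,!?").lower()
            if word:
                words.append(word)
    return " ".join(words)


print(normalize("Dr. Smith lives at 221 Modern Ave."))
# doctor smith lives at two hundred twenty one modern avenue
```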
36
Grapheme to Phoneme
How to pronounce a word? Look in dictionary! But:
  Unknown words and names will be missing
  Turkish, German, and other hard languages
    uygarlaStIramadIklarImIzdanmISsInIzcasIna
    “(behaving) as if you are among those whom we could not civilize”
    uygar +laS +tIr +ama +dIk +lar +ImIz +dan +mIS +sInIz +casIna
    civilized +bec +caus +NegAble +ppart +pl +p1pl +abl +past +2pl +AsIf
So need Letter to Sound Rules
Also homograph disambiguation (wind, live, read)
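The dictionary-first, rules-as-fallback strategy can be sketched as follows. Both tables are toy assumptions: a two-word lexicon and a one-letter-to-one-phone mapping that would be far too crude for real English, where letter-to-sound rules must look at context. The name pronounce is hypothetical.

```python
# Assumed toy pronouncing dictionary (ARPAbet).
LEXICON = {
    "the": ["dh", "ax"],
    "cat": ["k", "ae", "t"],
}

# Assumed context-free letter-to-sound table, for illustration only.
LETTER_TO_SOUND = {
    "a": "ae", "b": "b", "d": "d", "e": "eh", "f": "f", "g": "g",
    "i": "ih", "k": "k", "l": "l", "m": "m", "n": "n", "o": "aa",
    "p": "p", "s": "s", "t": "t", "u": "ah", "v": "v", "z": "z",
}


def pronounce(word):
    """Look the word up; fall back to letter-to-sound rules if missing."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    # Unknown word: map letters one by one, skipping letters with no rule.
    return [LETTER_TO_SOUND[ch] for ch in word if ch in LETTER_TO_SOUND]


print(pronounce("the"))   # ['dh', 'ax'] from the dictionary
print(pronounce("bat"))   # ['b', 'ae', 't'] from the fallback rules
```

Homographs such as wind or read cannot be resolved by either table alone; they need context such as part of speech.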
37
Grapheme to Phoneme in Indian languages
Hindi: do not need a dictionary. Letter to sound rules can capture the pronunciation of most words.
Bengali: Harder than Hindi, but mostly can be handled using rules, and a list of exceptions.