EE2F1 Speech & Audio Technology, Sept. 26, 2002
The University of Birmingham: Electronic, Electrical & Computer Engineering
Digital Systems & Vision Processing

Slide 1: Lecture 6: Speech
Martin Russell
Electronic, Electrical & Computer Engineering, School of Engineering, The University of Birmingham
Slide 2: Overview
Loudness revisited
Phonetics: speech sounds in the real world
– How phonemes are actually realised
Beyond the phoneme
– Practical issues
Speech synthesis
– Stages in speech synthesis
Slide 3: Loudness
The loudness of a sound, or its intensity, is perceived on an approximately logarithmic scale.
So we measure it on a logarithmic scale, in decibels (dB, after A. G. Bell).
Slide 4: The decibel scale
Suppose we have two signals S1 and S2 with levels p1 and p2 respectively, and suppose p2 = 2p1.
Then the level difference in decibels is 20 log10(p2/p1) = 20 log10(2), which is approximately 6 dB.
Slide 5: The decibel scale (continued)
In other words, an increase in loudness of 6 dB corresponds to a doubling of the signal level.
Similarly, a change in loudness of –6 dB corresponds to a halving of the signal level.
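A quick Python sketch (my own illustration, not from the slides) confirming the 6 dB figure:

```python
import math

def level_change_db(p1: float, p2: float) -> float:
    """Return the level change from p1 to p2 in decibels (20 log10 of the ratio)."""
    return 20.0 * math.log10(p2 / p1)

# Doubling the level gives roughly +6 dB; halving gives roughly -6 dB.
print(round(level_change_db(1.0, 2.0), 2))   # 6.02
print(round(level_change_db(1.0, 0.5), 2))   # -6.02
```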
Slide 6: Lab exercise revisited
4-bit resolution, range –8 to +7
Change the level by –6 dB, then by +6 dB: the result is equivalent to 3-bit resolution
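One way to see the 3-bit claim (my own illustration, not from the lab sheet): attenuating a 4-bit integer sample by 6 dB (halving) and then amplifying by 6 dB (doubling) discards the least significant bit.

```python
def attenuate_then_amplify(sample: int) -> int:
    """Drop the level by 6 dB (integer halving), then restore it (doubling)."""
    return (sample // 2) * 2

# Only even values survive the round trip, so the eight non-negative 4-bit
# sample values collapse to four: effectively 3-bit resolution.
print(attenuate_then_amplify(7))                              # 6
print(sorted({attenuate_then_amplify(s) for s in range(8)}))  # [0, 2, 4, 6]
```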
Slide 7: Back to phonetics: speech sounds in the real world
Linguistic units have no boundaries: the articulators do not come to a stop between sounds, or even between words
Features are asynchronous, e.g. nasality, voicing and lip rounding spread over surrounding segments
Cues to identity may be in surrounding sounds
– pod vs pot: cued by the length of the preceding vowel
Slide 8: Speech is continuous
[Figure: illustration of continuous speech, not reproduced here]
Slide 9: Phonetics: variability
Inter-speaker variation: physical, age, gender, accent
Intra-speaker variation: health, mood, external factors
Inherent variation: rate of speaking, loudness
Style of speaking: formal, casual, read, spontaneous
Contextual variation
'Free' variation
Slide 10: Variability
[Figure: examples of variability, not reproduced here]
Slide 11: Phonetics: contextual effects
Co-articulation affects how each sound is realised in context
– /ae/ realised differently in cab, cat and can
– /r/ in train different to that in arrow
– /p/ may be different in each of pin, spin and apt
– /l/ in leap different to that in milk
– can be → cam be (assimilation)
Variants of a phoneme which are caused by contextual influences and are not contrastive are called allophones
Slide 12: Confusability
[Figure: illustration of confusable sounds, not reproduced here]
Slide 13: Phonetics: fluent speech
Reduction: 'target' positions may not be reached
– vowels tend to be neutralised (centralised)
– consonants may not be precisely articulated
Elision: sounds get missed out
– unstressed vowels disappear
– consonant clusters may be simplified
Epenthesis: sounds may be inserted
All of these depend on speaking style (and rate)
Slide 14: Beyond the phoneme
Homophones
– to, too, two
– glasses (aids to vision or drinking vessels?)
Ambiguity of segmentation
– grey tape vs great ape
– this new display will recognize speech vs this nudist play will wreck a nice beach
Intonation changes the meaning of an utterance
– He's gone. vs He's gone?
– What's that? vs What's that!
In some languages intonation changes the meaning of a word
Emphasis (a.k.a. 'stress')
Slide 15: External effects
Noise
– Lombard effect: speakers change their speech in noise
Vibration
– vibrations in the chest and the oral and nasal cavities may cause interference in the speech signal
Fatigue
– speaking rate may decrease; loss of control may result in slurring
Fear
– speaking rate may increase; pitch may rise due to muscle tightening
Cognitive loading
– interaction with other tasks, stress
Alcohol or drugs
Slide 16: Summary
The speech signal
– varies from person to person and from occasion to occasion
– is not broken up into convenient units
– is altered by stress or external physical factors
Speech sounds
– change in context
– are inherently confusable (i.e. articulated in very similar ways)
– may be missing altogether
– may be inserted
– rely on the context of the utterance for full, unambiguous interpretation
Slide 17: Introduction to Speech Synthesis
Slide 18: What is speech synthesis?
Automatic generation of an acoustic speech signal, typically from text, using a computer
Some examples (audio demos in the original presentation):
– formant synthesis from text
– PSOLA concatenative synthesis from text
– DECtalk voices
Slide 19: Text normalisation
Consider: "This morning's 6am BBC news, read by Sarah Mukherjee, announced that the OPEC countries will restrict oil exports to the UK to 22,000 barrels per day"
Problem tokens:
– 6am (time expression)
– BBC (acronym spelled out letter by letter)
– Mukherjee (proper name with non-standard spelling-to-sound)
– OPEC (acronym pronounced as a word)
– UK (acronym spelled out letter by letter)
– 22,000 (number: "twenty-two thousand")
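These cases can be sketched as token-level rewriting. A minimal illustration follows; the lookup tables and the regular expression are assumptions of this sketch, and a real TTS front end would be far more thorough.

```python
import re

LETTER_SEQUENCES = {"BBC", "UK"}        # acronyms spelled out letter by letter
WORD_ACRONYMS = {"OPEC": "oh peck"}     # acronyms pronounced as words
NUMBER_WORDS = {"6": "six", "22,000": "twenty two thousand"}

def normalise_token(token: str) -> str:
    """Rewrite one raw token as the words a synthesiser should speak."""
    bare = token.strip(",.!?")
    if bare in LETTER_SEQUENCES:
        return " ".join(bare)                       # "BBC" -> "B B C"
    if bare in WORD_ACRONYMS:
        return WORD_ACRONYMS[bare]
    m = re.fullmatch(r"(\d+)(am|pm)", bare)         # times like "6am"
    if m and m.group(1) in NUMBER_WORDS:
        return NUMBER_WORDS[m.group(1)] + " " + " ".join(m.group(2))
    if bare in NUMBER_WORDS:
        return NUMBER_WORDS[bare]
    return bare

print(normalise_token("6am"))      # six a m
print(normalise_token("BBC"))      # B B C
print(normalise_token("22,000"))   # twenty two thousand
```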
Slide 20: Text-to-phoneme conversion
The next stage is to convert the normalised text into a sequence of phonetic elements: a symbolic description of its pronunciation
This is text-to-phoneme conversion, e.g.:
"this slide is too long"
/T I s # s l aI d # I z # t u # l Q G # /
Slide 21: Text-to-phoneme conversion (continued)
The sequence of phonetic segments is typically obtained by:
– looking up individual words in a pronunciation dictionary (often referred to as the 'exceptions dictionary' for historical reasons), or
– applying letter-to-sound rules
Finally, we need a method to convert the sequence of phonetic segments into an acoustic signal.
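The dictionary-with-rule-fallback scheme can be sketched as below. All entries and rules here are invented for illustration; real systems use large dictionaries and data-driven letter-to-sound rules.

```python
# Tiny 'exceptions dictionary': word -> space-separated phone symbols.
EXCEPTIONS = {
    "this": "T I s",
    "slide": "s l aI d",
    "is": "I z",
    "too": "t u",
}

def letter_to_sound(word: str) -> str:
    """Crude one-letter-one-phone fallback, standing in for a real rule set."""
    rules = {"l": "l", "o": "Q", "n": "n", "g": "G"}
    return " ".join(rules.get(ch, ch) for ch in word)

def text_to_phonemes(text: str) -> str:
    """Dictionary lookup first; fall back to letter-to-sound rules ('#' = word boundary)."""
    segments = [EXCEPTIONS.get(w, letter_to_sound(w)) for w in text.lower().split()]
    return " # ".join(segments)

print(text_to_phonemes("this slide is too long"))
# T I s # s l aI d # I z # t u # l Q n G
```

Note that the fallback renders "long" as /l Q n G/ rather than the correct /l Q G/ (the "ng" is a single phone), which is exactly why exceptions dictionaries are needed.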
Slide 22: The role of prosody
The meaning of an utterance will affect its acoustic realisation.
In fact, the most commonly cited shortcoming of speech synthesis is its prosodic structure.
Slide 23: Prosody
Prosody is the term used to describe:
– the durational structure of a speech signal (the relative lengths of its different parts, and the presence and lengths of any pauses)
– the amplitude structure (the relative amplitudes of its different parts)
– the intonational structure, in other words the fundamental frequency contour
Prosody includes stress and rhythm
Slide 24: The role of prosody (continued)
Consider:
Joe: "Hey, did you hear? Sam took Mary out and bought her a pizza!"
Mike: "You're wrong. Sam didn't buy Mary a pizza"
– "Sam didn't buy Mary a PIZZA"
– "Sam didn't buy MARY a pizza"
– "Sam didn't BUY Mary a pizza"
– "Sam DIDN'T buy Mary a pizza"
– "SAM didn't buy Mary a pizza"
(from Altmann, "The Ascent of Babel"; reference in the notes)
Slide 25: Approaches to synthesis
Phone sequence → acoustic signal
There are many different approaches, but for the purposes of this course they will be divided into two classes:
– waveform concatenation
– model-based speech synthesis
Slide 26: Waveform concatenation
Join together, or concatenate, stored sections of real speech
Sections may correspond to whole-word or sub-word units
Early systems were based on whole words
– e.g. the speaking clock (UK telephone system, 1936)
Storage and access are major issues
Good speech quality requires data rates of 16,000 to 32,000 bits per second (bps)
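A back-of-the-envelope calculation shows why storage was a major issue for whole-word systems. The 32,000 bps figure comes from the slide; the vocabulary size and average word duration below are assumptions of this sketch.

```python
def storage_bytes(vocabulary_words: int, avg_word_seconds: float,
                  bits_per_second: int) -> int:
    """Bytes needed to store one recorded waveform per vocabulary word."""
    return vocabulary_words * int(avg_word_seconds * bits_per_second) // 8

# 1,000 words at roughly 0.5 s each, stored at 32,000 bps:
print(storage_bytes(1000, 0.5, 32000))   # 2000000 bytes, i.e. about 2 MB
```

Two megabytes is trivial today, but it was a serious engineering constraint for early telephone-network systems, which is why sub-word units and model-based synthesis became attractive.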
Slide 27: 1936 'Speaking Clock'
From John Holmes, "Speech Synthesis and Recognition"; image courtesy of British Telecommunications plc