Speech and Language Processing
Chapter 8 of SLP: Speech Synthesis
Outline
Arpabet
TTS Architectures
TTS Components
Text Analysis
  Text Normalization
  Homograph Disambiguation
  Grapheme-to-Phoneme (Letter-to-Sound)
  Intonation
Waveform Generation
  Unit Selection
  Diphones
Dave Barry on TTS
"And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us. (By 'they', I mean computers; I doubt scientists will ever be able to talk to us.)"
ARPAbet Vowels
     b_d         ARPA        b_d         ARPA
 1   bead        iy      9   bode        ow
 2   bid         ih     10   booed       uw
 3   bayed       ey     11   bud         ah
 4   bed         eh     12   bird        er
 5   bad         ae     13   bide        ay
 6   bod(y)      aa     14   bowed       aw
 7   bawd        ao     15   Boyd        oy
 8   Budd(hist)  uh
Brief Historical Interlude
Pictures and some text from Hartmut Traunmüller's web site.
Von Kempelen, 1780 (b. Bratislava 1734, d. Vienna 1804):
Leather resonator manipulated by the operator to copy the vocal tract configuration during sonorants (vowels, glides, nasals)
Bellows provided the air stream; a counterweight provided inhalation
A vibrating reed produced the periodic pressure wave
Von Kempelen: Small whistles controlled consonants
Rubber mouth and nose; the nose had to be covered with two fingers for non-nasals
Unvoiced sounds: mouth covered, auxiliary bellows driven by a string provided a puff of air
From Traunmüller's web site
Modern TTS systems
1960s: first full TTS: Umeda et al. (1968)
1970s: Joe Olive 1977, concatenation of linear-prediction diphones; Speak & Spell
1980s: 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
1990s-present: diphone synthesis, unit selection synthesis
2. Overview of TTS: Architectures of Modern Synthesis
Articulatory Synthesis: model movements of the articulators and the acoustics of the vocal tract
Formant Synthesis: start with acoustics, create rules/filters to create each formant
Concatenative Synthesis: use databases of stored speech to assemble new utterances
Text from Richard Sproat slides
Formant Synthesis
These were the most common commercial systems while computers were relatively underpowered.
1979 MIT MITalk (Allen, Hunnicutt, Klatt)
1983 DECtalk system
The voice of Stephen Hawking
Concatenative Synthesis
All current commercial systems.
Diphone Synthesis
  Units are diphones: middle of one phone to middle of the next. Why? The middle of a phone is its steady state.
  Record 1 speaker saying each diphone
Unit Selection Synthesis
  Larger units
  Record 10 hours or more, so there are multiple copies of each unit
  Use search to find the best sequence of units
TTS Demos (all are Unit-Selection)
Festival
Cepstral
IBM
Architecture
The three types of TTS: concatenative, formant, articulatory.
These only cover the segments + F0 + duration to waveform part.
A full system needs to go all the way from arbitrary text to sound.
Two steps
PG&E will file schedules on April 20.
TEXT ANALYSIS: text into an intermediate representation
WAVEFORM SYNTHESIS: from the intermediate representation into a waveform

The Hourglass

1. Text Normalization
Analysis of raw text into pronounceable words:
Sentence Tokenization
Text Normalization
  Identify tokens in text
  Chunk tokens into reasonably sized sections
  Map tokens to words
  Identify types for words
Rules for end-of-utterance detection
A dot with one or two letters is an abbrev
A dot with 3 cap letters is an abbrev
An abbrev followed by 2 spaces and a capital letter is an end-of-utterance
Non-abbrevs followed by a capitalized word are breaks
This fails for "Cog. Sci. Newsletter"
Lots of cases at end of line
Badly spaced/capitalized sentences
From Alan Black lecture notes
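A minimal sketch of these heuristics in Python; the abbreviation list, regular expression, and function names are illustrative assumptions, not part of the original rules:

```python
import re

# Toy abbreviation list; a real system would use a much larger lexicon.
ABBREVS = {"dr.", "mr.", "mrs.", "st.", "cog.", "sci.", "etc."}

def is_abbreviation(token):
    """Heuristics from the slide: a dot with one or two letters,
    or a dot with three capital letters, is an abbreviation."""
    t = token.rstrip(".")
    if token.endswith("."):
        if len(t) <= 2:
            return True
        if len(t) == 3 and t.isupper():
            return True
    return token.lower() in ABBREVS

def sentence_breaks(text):
    """Return character offsets that look like end-of-utterance points."""
    breaks = []
    # A period followed by whitespace and a capitalized word is a candidate break.
    for m in re.finditer(r"(\S+\.)(\s+)([A-Z])", text):
        token, gap = m.group(1), m.group(2)
        if not is_abbreviation(token):
            breaks.append(m.end(1))      # non-abbrev + capitalized word
        elif len(gap) >= 2:
            breaks.append(m.end(1))      # abbrev + 2 spaces + capital letter
    return breaks

print(sentence_breaks("I met Dr. Smith.  He reads the Cog. Sci. Newsletter. It is long."))
```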
Decision Tree: is a word end-of-utterance?
(decision tree figure)

Learning Decision Trees
DTs are rarely built by hand
Hand-building is only possible for very simple features and domains
Lots of algorithms exist for DT induction

Next Step: Identify Types of Tokens, and Convert Tokens to Words
Pronunciation of numbers often depends on type:
1776 as a date: seventeen seventy six
1776 as a phone number: one seven seven six
1776 as a quantifier: one thousand seven hundred (and) seventy six
25 as a day: twenty-fifth
Classify token into 1 of 20 types
EXPN: abbreviations, contractions (adv, N.Y., mph, gov't)
LSEQ: letter sequence (CIA, D.C., CDs)
ASWD: read as word, e.g. CAT, proper names
MSPL: misspelling
NUM: number, cardinal (12, 45, 1/2, 0.6)
NORD: number, ordinal, e.g. May 7, 3rd, Bill Gates II
NTEL: telephone number (or part of one)
NDIG: number as digits, e.g. Room 101
NIDE: identifier, e.g. 747, 386, I5, PC110
NADDR: number as street address, e.g. Pennsylvania
NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc.
SLNT: not spoken (KENT*REALTY)

More about the types
4 categories for alphabetic sequences:
EXPN: expand to full word or word sequence (fplc for fireplace, NY for New York)
LSEQ: say as letter sequence (IBM)
ASWD: say as standard word (either OOV or acronyms)
MSPL: misspelling
5 main ways to read numbers:
Cardinal (quantities)
Ordinal (dates)
String of digits (phone numbers)
Pair of digits (years)
Trailing unit: serial until the last non-zero digit, e.g. 8765000 is "eight seven six five thousand" (some phone numbers, long addresses)
But there are still exceptions
Finally: expanding NSW tokens
Type-specific heuristics:
ASWD expands to itself
LSEQ expands to a list of words, one for each letter
NUM expands to a string of words representing the cardinal
NYER expands to 2 pairs of NUM digits...
NTEL: string of digits with silence for punctuation
Abbreviations: use the abbreviation lexicon if it's one we've seen
Else use the training set to learn how to expand
Cute idea: if "eat in kit" occurs in the text, "eat-in kitchen" will also occur somewhere.
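A small sketch of type-specific expansion for a few of these classes; the word lists and helper names are illustrative, and real systems handle far more cases (ordinals, money, dates, large cardinals):

```python
# Minimal sketch of type-specific NSW expansion. The type tags follow the
# taxonomy above; the word lists and helpers here are illustrative only.

ONES = "zero one two three four five six seven eight nine".split()
TEENS = "ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split()
TENS = "zero ten twenty thirty forty fifty sixty seventy eighty ninety".split()

def two_digits(n):
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def expand(token, nsw_type):
    if nsw_type == "LSEQ":                        # IBM -> "I B M"
        return " ".join(token)
    if nsw_type in ("NDIG", "NTEL"):              # 101 -> "one zero one"
        return " ".join(ONES[int(d)] for d in token if d.isdigit())
    if nsw_type == "NYER":                        # 1776 -> "seventeen seventy six"
        return two_digits(int(token[:2])) + " " + two_digits(int(token[2:]))
    if nsw_type == "NUM" and int(token) < 100:    # small cardinals only
        return two_digits(int(token))
    return token                                  # fall back: leave unchanged

print(expand("1776", "NYER"))   # seventeen seventy six
print(expand("IBM", "LSEQ"))    # I B M
print(expand("101", "NDIG"))    # one zero one
```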
2. Homograph disambiguation
The 19 most frequent homographs, from Liberman and Church, with frequencies:
use 319, increase 230, close 215, record 195, house 150, contract 143, lead 131, live 130, lives 105, protest 94, survey 91, project 90, separate 87, present 80, read 72, subject 68, rebel 48, finance 46, estimate 46
Not a huge problem, but still important
POS Tagging for homograph disambiguation
Many homographs can be distinguished by POS:
use: y uw s vs. y uw z
close: k l ow s vs. k l ow z
house: h aw s vs. h aw z
live: l ay v vs. l ih v
REcord vs. reCORD, INsult vs. inSULT, OBject vs. obJECT, OVERflow vs. overFLOW, DIScount vs. disCOUNT, CONtent vs. conTENT
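A minimal sketch of the idea: keep a small table keyed by (word, POS) and let the tagger's output pick the pronunciation. The table structure is an illustration; the ARPAbet strings are the ones on the slide:

```python
# Sketch: choose a pronunciation by POS tag. A real system would get the POS
# from a tagger and the pronunciations from its lexicon.

HOMOGRAPHS = {
    ("use",   "NOUN"): "y uw s",
    ("use",   "VERB"): "y uw z",
    ("close", "ADJ"):  "k l ow s",
    ("close", "VERB"): "k l ow z",
    ("house", "NOUN"): "h aw s",
    ("house", "VERB"): "h aw z",
}

def pronounce(word, pos):
    return HOMOGRAPHS.get((word.lower(), pos))

print(pronounce("use", "VERB"))    # y uw z
print(pronounce("house", "NOUN"))  # h aw s
```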
3. Letter-to-Sound: Getting from words to phones
Two methods:
Dictionary-based
Rule-based (letter-to-sound = LTS)
Early systems were all LTS; MITalk was radical in having a huge 10K-word dictionary
Modern systems use a combination

Pronunciation Dictionaries: CMU
CMU dictionary: 127K words
Some problems:
Has errors
Only American pronunciations
No syllable boundaries
Doesn't tell us which pronunciation to use for which homographs (no POS tags)
Doesn't distinguish case: the word US has 2 pronunciations, [AH1 S] and [Y UW1 EH1 S]
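A sketch of loading a CMUdict-style plain-text file (entries look like `WORD  PH ON ES`, alternates are marked with a parenthesized index, comments start with `;;;`); the file path is an assumption:

```python
# Sketch of reading a CMUdict-format lexicon into a word -> pronunciations map.
# The path is an assumption; adjust it to wherever the dictionary file lives.

def load_cmudict(path="cmudict.dict"):
    lexicon = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;") or not line.strip():
                continue                       # skip comments and blank lines
            word, phones = line.split(maxsplit=1)
            word = word.split("(")[0].lower()  # strip "(2)"-style variant markers
            lexicon.setdefault(word, []).append(phones.split())
    return lexicon

lex = load_cmudict()
print(lex.get("us"))   # e.g. two variants, with no POS or case information
```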
Pronunciation Dictionaries: UNISYN
UNISYN dictionary: 110K words (Fitt 2002)
Benefits:
Has syllabification, stress, and some morphological boundaries
Pronunciations can be read off in General American, RP British, Australian, etc.
(Other dictionaries like CELEX are not used because they are too small and British-only)
Dictionaries aren't sufficient
Unknown words (= OOV = "out of vocabulary")
Increase with the (square root of the) number of words in unseen text
Black et al. (1998), OALD on the 1st section of the Penn Treebank: 1775 word tokens were OOV (4.6%; 943 unique types):
  names        1360   76.6%
  unknown       351   19.8%
  typos/other    64    3.6%
So commercial systems have a 4-part system:
Big dictionary
Names handled by special routines
Acronyms handled by special routines (previous lecture)
Machine-learned g2p algorithm for other unknown words
Names
Big problem area is names
Names are common: 20% of tokens in typical newswire text are names
The 1987 Donnelly list (72 million households) contains about 1.5 million names
Personal names: McArthur, D'Angelo, Jiminez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, Nguyen
Company/brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe
Names: Methods
Can do morphology (Walters -> Walter, Lucasville)
Can write stress-shifting rules (Jordan -> Jordanian)
Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
Liberman and Church: for the 250K most common names, got 212K (85%) from these modified-dictionary methods, used LTS for the rest
Can do automatic country detection (from letter trigrams) and then apply country-specific rules
Can train a g2p system specifically on names
Or specifically on types of names (brand names, Russian names, etc.)
Acronyms
As we saw above, use machine learning to classify acronyms as EXPN, ASWORD, or LETTERS
Use an acronym dictionary and hand-written rules to augment
Letter-to-Sound Rules
Earliest algorithms: handwritten Chomsky+Halle-style rules
Festival version of such LTS rules:
  ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
Examples:
  ( # [ c h ] C = k )
  ( # [ c h ] = ch )
# denotes beginning of word; C means all consonants
Rules apply in order:
"christmas" is pronounced with [k]
but a word with "ch" followed by a non-consonant is pronounced [ch], e.g. "choice"
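A rough sketch of how ordered, context-sensitive rules of this form could be applied; the rule encoding, consonant set, and pass-through fallback are assumptions for illustration:

```python
# Sketch: apply ordered Festival-style LTS rules of the form
# (LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS). "#" is a word boundary and
# "C" stands for any consonant; the encoding below is an illustration.

CONSONANTS = set("bcdfghjklmnpqrstvwxz")

def matches(pattern, text):
    """Match a context pattern against a slice of the same length."""
    if len(text) != len(pattern):
        return False
    for p, ch in zip(pattern, text):
        if p == "C":
            if ch not in CONSONANTS:
                return False
        elif p != ch:
            return False
    return True

# Rules are tried in order; the first one whose contexts match fires.
RULES = [
    ("#", "ch", "C", ["k"]),    # ( # [ c h ] C = k )   e.g. "christmas"
    ("#", "ch", "",  ["ch"]),   # ( # [ c h ] = ch )    e.g. "choice"
]

def apply_rules(word):
    text = "#" + word + "#"     # pad with word-boundary symbols
    i, phones = 1, []
    while i < len(text) - 1:
        for left, items, right, out in RULES:
            if (text[i:i + len(items)] == items
                    and matches(left, text[i - len(left):i])
                    and matches(right, text[i + len(items):i + len(items) + len(right)])):
                phones.extend(out)
                i += len(items)
                break
        else:
            phones.append(text[i])  # no rule fired: pass the letter through
            i += 1
    return phones

print(apply_rules("christmas"))  # starts with 'k'
print(apply_rules("choice"))     # starts with 'ch'
```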
Stress rules in hand-written LTS
English is famously evil; one rule from Allen et al. 1987 (where X must contain all prefixes):
Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final syllable containing a short vowel and 0 or more consonants (e.g. difficult)
Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final vowel (e.g. oregano)
etc.
Modern method: Learning LTS rules automatically
Induce LTS from a dictionary of the language
Black et al. 1998
Applied to English, German, French
Two steps: alignment, then (CART-based) rule induction
Alignment
Letters: c  h  e  c  k  e  d
Phones:  ch _  eh _  k  _  t
Black et al. Method 1:
First scatter epsilons in all possible ways to align letters and phones
Then collect stats for P(phone | letter) and select the best to generate new stats
This is iterated a number of times until it settles (5-6 iterations)
This is an EM (expectation maximization) algorithm
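A toy, hard-EM sketch of Method 1 under simplifying assumptions (each letter maps to exactly one phone or to epsilon, phones stay in order, and only the single best alignment is counted when re-estimating); the corpus is a tiny stand-in for a real dictionary, so it only illustrates the loop:

```python
from itertools import combinations

def alignments(letters, phones):
    """All ways to align a word with its phones by sending some letters to
    epsilon ('_'), assuming len(phones) <= len(letters) and order is kept."""
    n, m = len(letters), len(phones)
    for keep in combinations(range(n), m):        # which letters carry a phone
        it = iter(phones)
        yield [(l, next(it) if i in keep else "_")
               for i, l in enumerate(letters)]

def em_align(corpus, iterations=5):
    letters = {l for w, _ in corpus for l in w}
    phone_inv = {p for _, ps in corpus for p in ps} | {"_"}
    # add-one start: every letter may map to every phone or to epsilon
    counts = {l: {p: 1.0 for p in phone_inv} for l in letters}
    best = {}
    for _ in range(iterations):
        for word, phones in corpus:
            def score(al):
                prob = 1.0
                for l, ph in al:
                    prob *= counts[l][ph] / sum(counts[l].values())
                return prob
            best[word] = max(alignments(word, phones), key=score)
        # re-estimate P(phone | letter) from the chosen alignments
        counts = {l: {p: 1.0 for p in phone_inv} for l in letters}
        for word, _ in corpus:
            for l, ph in best[word]:
                counts[l][ph] += 1.0
    return best

corpus = [("checked", ["ch", "eh", "k", "t"]),
          ("check",   ["ch", "eh", "k"]),
          ("bed",     ["b", "eh", "d"])]
for word, alignment in em_align(corpus).items():
    print(word, alignment)
```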
Alignment: Black et al. method 2
Hand-specify which letters can be rendered as which phones:
  C goes to k/ch/s/sh
  W goes to w/v/f, etc.
Once the mapping table is created, find all valid alignments, compute p(letter | phone), score all alignments, and take the best
Alignment
Some alignments will turn out to be really bad.
These are just the cases where the pronunciation doesn't match the letters:
  Dept        d ih p aa r t m ah n t
  CMU         s iy eh m y uw
  Lieutenant  l eh f t eh n ax n t (British)
Also foreign words
These can just be removed from alignment training
Building CART trees
Build a CART tree for each letter in the alphabet (26 plus accented), using a context of +-3 letters:
  # # # c h e c -> ch
  c h e c k e d -> _
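A small sketch of turning an aligned word into per-letter training instances with a +-3 letter window, matching the "checked" example above; a CART learner (not shown) would then be trained on these instances:

```python
# Sketch: one training instance per letter, using "#" padding at the edges.
# A decision-tree learner would then be trained per letter on these windows.

def letter_instances(letters, phones, width=3):
    padded = ["#"] * width + list(letters) + ["#"] * width
    instances = []
    for i, target in enumerate(phones):
        window = padded[i:i + 2 * width + 1]   # 3 left + letter + 3 right
        instances.append((window, target))
    return instances

for window, phone in letter_instances("checked", ["ch", "_", "eh", "_", "k", "_", "t"]):
    print(" ".join(window), "->", phone)
# first line:  # # # c h e c -> ch
```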
Add more features
Even more: for French liaison, we need to know what the next word is, and whether it starts with a vowel
French 'six':
  [s iy s] in "j'en veux six"
  [s iy z] in "six enfants"
  [s iy]   in "six filles"
Prosody: from words+phones to boundaries, accent, F0, duration
Prosodic phrasing:
  Need to break utterances into phrases
  Punctuation is useful, but not sufficient
Accents:
  Prediction of accents: which syllables should be accented
  Realization of the F0 contour: given accents/tones, generate the F0 contour
Duration:
  Predicting the duration of each phone

Defining Intonation
Ladd (1996), "Intonational Phonology":
"The use of suprasegmental phonetic features to convey sentence-level pragmatic meanings"
Suprasegmental = above and beyond the segment/phone: F0, intensity (energy), duration
i.e. meanings that apply to phrases or utterances as a whole; not lexical stress, not lexical tone.
Three aspects of prosody
Prominence: some syllables/words are more prominent than others
Structure/boundaries: sentences have prosodic structure
  Some words group naturally together
  Others have a noticeable break or disjuncture between them
Tune: the intonational melody of an utterance
From Ladd (1996)

Prosodic Prominence: Pitch Accents
A: What types of foods are a good source of vitamins?
B1: Legumes are a good source of VITAMINS.
B2: LEGUMES are a good source of vitamins.
Prominent syllables are:
  Louder
  Longer
  Have higher F0 and/or sharper changes in F0 (higher F0 velocity)
Slide from Jennifer Venditti
Stress vs. accent (2)
The speaker decides to make the word "vitamin" more prominent by accenting it.
Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.
Which word receives an accent?
It depends on the context. For example, the 'new' information in the answer to a question is often accented, while the 'old' information usually is not.
Q1: What types of foods are a good source of vitamins?
A1: LEGUMES are a good source of vitamins.
Q2: Are legumes a source of vitamins?
A2: Legumes are a GOOD source of vitamins.
Q3: I've heard that legumes are healthy, but what are they a good source of?
A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti
Factors in accent prediction
Part of speech:
Content words are usually accented
Function words are rarely accented: of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, am, are, was, were, etc.
Complex Noun Phrase Structure
Sproat, R. (1994). English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.
Proper names: stress the right-most word
  New York CITY; Paris, FRANCE
Adjective-noun combinations: stress the noun
  Large HOUSE, red PEN, new NOTEBOOK
Noun-noun compounds: stress the left noun
  HOT dog (food) versus hot DOG (overheated animal)
  WHITE house (the place) versus white HOUSE (made of stucco)
  More examples: MEDICAL Building, APPLE cake, cherry PIE
What about: Madison Avenue, Park Street?
Some rules (a sketch follows below):
  Furniture + Room -> RIGHT (e.g., kitchen TABLE)
  Proper-name + Street -> LEFT (e.g., PARK street)
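A toy sketch of a few of these accent rules for two-word noun phrases; the word classes are stand-ins for the semantic categories a system like Sproat (1994) would actually consult, and membership here is illustrative only:

```python
# Toy word classes; a real system would look these up in a semantic lexicon.
ROOM_WORDS = {"kitchen", "bedroom"}
ADJECTIVES = {"large", "red", "new", "white", "hot"}

def accented_word(w1, w2):
    """Which word of a two-word noun phrase gets the accent (a rough sketch)."""
    if w1 in ROOM_WORDS and w2 not in ROOM_WORDS:
        return w2                      # Furniture/Room rule: kitchen TABLE
    if w2 == "street":
        return w1                      # Proper-name + Street: PARK street
    if w1 in ADJECTIVES:
        return w2                      # Adjective + Noun: large HOUSE, hot DOG
    return w1                          # default Noun-Noun compound: APPLE cake, HOT dog

print(accented_word("kitchen", "table"))  # table
print(accented_word("park", "street"))    # park
print(accented_word("apple", "cake"))     # apple
```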
State of the art
Hand-label large training sets
Use CART, SVM, CRF, etc. to predict accent
Lots of rich features from context (parts of speech, syntactic structure, information structure, contrast, etc.)
Classic lit: Hirschberg, Julia (1993). Pitch Accent in Context: Predicting Intonational Prominence from Text. Artificial Intelligence 63.
Levels of prominence
Most phrases have more than one accent
The last accent in a phrase is perceived as more prominent: the nuclear accent
Emphatic accents like the nuclear accent are often used for semantic purposes, such as indicating that a word is contrastive, or the semantic focus
  The kind of thing you represent via ***s in IM, or capitalized letters
  'I know SOMETHING interesting is sure to happen,' she said to herself.
Can also have words that are less prominent than usual: reduced words, especially function words
Often use 4 classes of prominence: emphatic accent, pitch accent, unaccented, reduced
Yes-No question
are legumes a good source of VITAMINS
Rise from the main accent to the end of the sentence.
Slide from Jennifer Venditti

'Surprise-redundancy' tune
[How many times do I have to tell you ...] legumes are a good source of vitamins
Low beginning followed by a gradual rise to a high at the end.
Slide from Jennifer Venditti

'Contradiction' tune
"I've heard that linguini is a good source of vitamins."
linguini isn't a good source of vitamins [... how could you think that?]
Sharp fall at the beginning, flat and low, then rising at the end.
Slide from Jennifer Venditti
Duration
Simplest: fixed size for all phones (100 ms)
Next simplest: average duration for that phone (from training data). Samples from Switchboard, in ms:
  aa 118    b  68
  ax  59    d  68
  ay 138    dh 44
  eh  87    f  90
  ih  77    g  66
Next next simplest: add in phrase-final and phrase-initial lengthening, plus stress.
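A sketch of the three baselines; the mean durations are the Switchboard numbers above, while the lengthening factors are assumed values for illustration, not figures from the chapter:

```python
# Three increasingly less naive duration models, as sketched on the slide.

MEAN_MS = {"aa": 118, "ax": 59, "ay": 138, "eh": 87, "ih": 77,
           "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}

def duration_fixed(phone):
    return 100                                   # simplest: 100 ms for everything

def duration_mean(phone):
    return MEAN_MS.get(phone, 100)               # next simplest: per-phone mean

def duration_with_context(phone, stressed=False, phrase_final=False, phrase_initial=False):
    ms = duration_mean(phone)
    if stressed:
        ms *= 1.2                                # assumed stress lengthening
    if phrase_final:
        ms *= 1.4                                # assumed final lengthening
    if phrase_initial:
        ms *= 1.1                                # assumed initial lengthening
    return ms

print(duration_with_context("aa", stressed=True, phrase_final=True))  # ~198 ms
```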
Intermediate representation: using Festival
Do you really want to see all of it?

Waveform Synthesis
Given:
  String of phones
  Prosody
    Desired F0 for the entire utterance
    Duration for each phone
    Stress value for each phone, possibly an accent value
Generate:
  Waveforms
Diphone TTS architecture
Training:
  Choose units (kinds of diphones)
  Record 1 speaker saying 1 example of each diphone
  Mark the boundaries of each diphone, cut each diphone out, and create a diphone database
Synthesizing an utterance:
  Grab the relevant sequence of diphones from the database
  Concatenate the diphones, doing slight signal processing at the boundaries
  Use signal processing to change the prosody (F0, energy, duration) of the selected sequence of diphones
Diphones
Mid-phone is more stable than edge.
Diphones
Mid-phone is more stable than edge
Need O(phones^2) units; some combinations don't exist (hopefully)
The AT&T (Olive et al. 1998) system had 43 phones
  43^2 = 1849 possible diphones
  Phonotactics helps ([h] only occurs before vowels; no need to keep diphones across silence)
  Only 1172 actual diphones
  May include stress and consonant clusters, so could have more
Lots of phonetic knowledge in the design
Database relatively small (by today's standards): around 8 megabytes for English (16 kHz, 16 bit)
Slide from Richard Sproat
Voice
Speaker: called a voice talent
Diphone database: called a voice

Prosodic Modification
Modifying pitch and duration independently
Changing the sample rate modifies both: chipmunk speech
Duration: duplicate/remove parts of the signal
Pitch: resample to change pitch
Text from Alan Black
Speech as Short-Term Signals
Alan Black

Duration modification
Duplicate/remove short-term signals
Slide from Richard Sproat
Pitch Modification
Move short-term signals closer together / further apart
Slide from Richard Sproat
TD-PSOLA (TM)
Time-Domain Pitch-Synchronous Overlap-and-Add
Patented by France Telecom (CNET)
Windowed, pitch-synchronous overlap-and-add
Very efficient: no FFT (or inverse FFT) required
Can modify Hz up to two times or by half
Slide from Richard Sproat
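A very rough, simplified sketch of the overlap-add idea (not the patented algorithm): cut Hann-windowed frames around given pitch marks, then lay them back down at a new spacing to change duration and F0. Real TD-PSOLA needs accurate epoch detection and more careful frame selection; the epoch list and factors here are assumptions:

```python
import numpy as np

def psola_like(signal, epochs, pitch_factor=1.0, duration_factor=1.0):
    """Crude pitch-synchronous overlap-and-add.
    epochs: sample indices of pitch marks (assumed already detected).
    pitch_factor > 1 raises F0 (frames re-spaced closer together);
    duration_factor > 1 lengthens the output (frames get reused)."""
    avg_period = int(np.mean(np.diff(epochs)))
    frame_len = 2 * avg_period                    # two periods per frame
    window = np.hanning(frame_len)

    out = np.zeros(int(len(signal) * duration_factor) + frame_len)
    new_period = avg_period / pitch_factor        # new spacing between frames

    t_out = 0.0
    while t_out < len(signal) * duration_factor - frame_len:
        # map this output time back to the nearest source pitch mark
        src_sample = t_out / duration_factor
        src_idx = min(int(np.searchsorted(epochs, src_sample)), len(epochs) - 1)
        start = max(epochs[src_idx] - avg_period, 0)
        frame = signal[start:start + frame_len]
        if len(frame) < frame_len:
            break
        out[int(t_out):int(t_out) + frame_len] += frame * window
        t_out += new_period
    return out

# toy usage: a 200 Hz tone at 16 kHz, made 1.5x longer and 1.25x higher
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 200 * t)
epochs = np.arange(0, sr, sr // 200)              # one mark per 80-sample period
y = psola_like(sig, epochs, pitch_factor=1.25, duration_factor=1.5)
print(len(sig), len(y))
```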
Unit Selection Synthesis
Generalization of the diphone intuition
Larger units: from diphones up to sentences
Many, many copies of each unit: 10 hours of speech instead of a few minutes of diphones

Unit Selection Intuition
Given a big database, find the unit in the database that is the best to synthesize some target segment
What does "best" mean?
"Target cost": closest match to the target description, in terms of
  Phonetic context
  F0, stress, phrase position
"Join cost": best join with neighboring units
  Matching formants and other spectral characteristics
  Matching energy
  Matching F0
Targets and Target Costs
Target cost T(u_t, s_t): how well the target specification s_t matches the potential database unit u_t
Features, costs, and weights
Examples:
  /ih-t/   +stress, phrase internal, high F0, content word
  /n-t/    -stress, phrase final, high F0, function word
  /dh-ax/  -stress, phrase initial, low F0, word "the"
Target Costs
Comprised of k subcosts:
  Stress
  Phrase position
  F0
  Phone duration
  Lexical identity
Target cost for a unit:
  $C^t(t_i, u_i) = \sum_{k=1}^{p} w_k^t \, C_k^t(t_i, u_i)$
Slide from Paul Taylor
Join (Concatenation) Cost
Measure of the smoothness of a join
Measured between two database units (the target is irrelevant)
Features, costs, and weights
Comprised of k subcosts:
  Spectral features
  F0
  Energy
Join cost:
  $C^j(u_{i-1}, u_i) = \sum_{k=1}^{p} w_k^j \, C_k^j(u_{i-1}, u_i)$
Slide from Paul Taylor
Total Costs
Hunt and Black 1996
We now have weights (per phone type) for features set between target and database units
Find the best path of units through the database that minimizes:
  $C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{target}(t_i, u_i) + \sum_{i=2}^{n} C^{join}(u_{i-1}, u_i)$
  $\hat{u}_1^n = \arg\min_{u_1, \dots, u_n} C(t_1^n, u_1^n)$
This is a standard problem, solvable with Viterbi search with a beam-width constraint for pruning
Slide from Paul Taylor
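A minimal Viterbi sketch for this search (without beam pruning); target_cost and join_cost are placeholders standing in for the weighted subcost sums defined above:

```python
# Minimal Viterbi sketch for Hunt & Black-style unit selection.

def viterbi_unit_selection(targets, candidates, target_cost, join_cost):
    """targets: list of target specifications t_1..t_n.
    candidates: candidates[i] is the list of database units usable for t_i.
    Returns the unit sequence minimizing total target + join cost."""
    n = len(targets)
    # best[i][u] = (cost of best path ending in candidate u at step i, backpointer)
    best = [{u: (target_cost(targets[0], cand), None)
             for u, cand in enumerate(candidates[0])}]
    for i in range(1, n):
        column = {}
        for u, cand in enumerate(candidates[i]):
            tc = target_cost(targets[i], cand)
            prev_u, prev_cost = min(
                ((p, c + join_cost(candidates[i - 1][p], cand))
                 for p, (c, _) in best[i - 1].items()),
                key=lambda x: x[1])
            column[u] = (prev_cost + tc, prev_u)
        best.append(column)
    # backtrace from the cheapest final unit
    u = min(best[-1], key=lambda k: best[-1][k][0])
    path = [u]
    for i in range(n - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    path.reverse()
    return [candidates[i][u] for i, u in enumerate(path)]

# toy usage with numeric "units": target cost = distance to target, join cost = jump size
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 1.5], [1.8, 2.6], [2.9, 3.5]]
best_units = viterbi_unit_selection(
    targets, candidates,
    target_cost=lambda t, u: abs(t - u),
    join_cost=lambda a, b: abs(b - a))
print(best_units)   # [1.5, 1.8, 2.9]
```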
Unit Selection Summary
Advantages:
  Quality is far superior to diphones
  Natural prosody selection sounds better
Disadvantages:
  Quality can be very bad in places (HCI problem: a mix of very good and very bad is quite annoying)
  Synthesis is computationally expensive
  Can't synthesize everything you want: the diphone technique can move emphasis; unit selection gives a good (but possibly incorrect) result
Slide from Richard Sproat
Evaluation of TTS
Intelligibility Tests
Diagnostic Rhyme Test (DRT) and Modified Rhyme Test (MRT)
Humans do a listening identification choice between two words differing by a single phonetic feature: voicing, nasality, sustention, sibilation
DRT: 96 rhyming pairs (dense/tense, bond/pond, etc.)
  Subject hears "dense", chooses either "dense" or "tense"
  % of right answers is the intelligibility score
MRT: 300 words, 50 sets of 6 words (went, sent, bent, tent, dent, rent)
  Embedded in carrier phrases: Now we will say "dense" again
Mean Opinion Score
Have listeners rate speech on a scale from 1 (bad) to 5 (excellent)
More natural: reading addresses out loud, reading news text, using two different systems; do a preference test (prefer A, prefer B)
Recent stuff
Problems with unit selection synthesis:
  Can't modify the signal (mixing modified and unmodified sounds bad)
  But the database often doesn't have exactly what you want
Solution: HMM (Hidden Markov Model) synthesis
  Won a recent TTS bakeoff
  Sounds less natural to researchers, but naive subjects preferred it
  Has the potential to improve on both diphone and unit selection.
HMM Synthesis
Audio examples: Unit selection (Roger), HMM (Roger), Unit selection (Nina), HMM (Nina)
Summary
ARPAbet
TTS Architectures
TTS Components
Text Analysis
  Text Normalization
  Homograph Disambiguation
  Grapheme-to-Phoneme (Letter-to-Sound)
  Intonation
Waveform Generation
  Diphones
  Unit Selection
  HMM