1st and 2nd Generation Synthesis

1st and 2nd Generation Synthesis
Speech synthesis generations
- First: ground-up synthesis
- Second: data-driven synthesis by concatenation
Input (for both): a sequence of phonetic symbols, durations, F0 contours, and amplification factors
Data
- Rule-based parameters (formant synthesis)
- Linear prediction: stored diphone parameters

Early Synthesis History
Klatt, 1987, "Review of text-to-speech conversion for English"
http://americanhistory.si.edu/archives/speechsynthesis/dk_737b.htm
Audio: http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html
Milestones
- 1939: World's Fair, the Voder (Dudley)
- 1968: First TTS (Umeda)
- 1980: Low-rate resynthesis, Speak & Spell (Wiggins)
- 1982: Natural-sounding resynthesis with multi-pulse linear prediction (Atal)
- 1986: Natural-sounding synthesis (Klatt)

Formant Synthesizer Design
Concept
- Create an individual component for each synthesizer unit
- Feed the system with a set of parameters
Advantage
- If the parameters are set properly, perfectly natural-sounding speech results
Disadvantages
- Interactions between the parameters become obscure as the system grows
- Parameter settings cannot easily be derived by an automated algorithm
Demo program: http://www.asel.udel.edu/speech/tutorials/synthesis/

Formant Synthesizer
Design an IIR filter for each individual formant component
- Second-order IIR filter: y[n] = b0·x[n] + a1·y[n-1] + a2·y[n-2]
- Transfer function: H(z) = b0 / (1 - a1·z^-1 - a2·z^-2)
- Factored over its poles: H(z) = 1 / ((1 - p1·z^-1)(1 - p2·z^-1))
- Because the poles are a conjugate pair, p1 = r·e^(iθ) and p2 = r·e^(-iθ):
  H(z) = 1 / ((1 - r·e^(iθ)·z^-1)(1 - r·e^(-iθ)·z^-1))
       = 1 / (1 - r·(e^(iθ) + e^(-iθ))·z^-1 + r^2·z^-2)
       = 1 / (1 - 2r·cosθ·z^-1 + r^2·z^-2)
- The filter: y[n] = x[n] + 2r·cosθ·y[n-1] - r^2·y[n-2]
Parameters (θ controls the formant frequency; r controls the bandwidth)
- θ = 2πf/F, r = e^(-πβ/F)
- where β = desired bandwidth, F = sampling rate, f = formant frequency
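
A minimal sketch of this resonator in Python; the function name and the use of SciPy's lfilter are illustrative, not from the slides:

```python
import numpy as np
from scipy.signal import lfilter

def formant_resonator(x, f, bw, fs):
    """Filter x through one two-pole formant resonator.

    f: formant frequency (Hz), bw: bandwidth (Hz), fs: sampling rate (Hz).
    Implements y[n] = x[n] + 2r*cos(theta)*y[n-1] - r^2*y[n-2].
    """
    theta = 2.0 * np.pi * f / fs          # theta sets the formant frequency
    r = np.exp(-np.pi * bw / fs)          # r sets the bandwidth
    a = [1.0, -2.0 * r * np.cos(theta), r * r]  # denominator of H(z)
    return lfilter([1.0], a, x)
```

For example, formant_resonator(x, f=500, bw=60, fs=16000) imposes a 500 Hz resonance with a 60 Hz bandwidth on the source x.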

Parallel or Cascade
Cascaded connections
- Lose control over individual components because the skirts of the poles interact
Parallel connections
- Add the filtered signals together to maintain control over each component
System input parameters
- A1,2,3 = amplitudes
- F1,2,3 = frequencies
- BW1,2,3 = bandwidths
- Gain = output multiplier
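
A sketch of the two topologies, reusing the hypothetical formant_resonator from the previous example:

```python
# Assumes formant_resonator(x, f, bw, fs) from the previous sketch is in scope.

def cascade(x, formants, fs):
    """Cascade: route the source through each resonator in series."""
    for f, bw in formants:
        x = formant_resonator(x, f, bw, fs)
    return x

def parallel(x, formants, amps, fs):
    """Parallel: filter the source per formant, scale, and sum,
    keeping independent amplitude control over each component."""
    return sum(a * formant_resonator(x, f, bw, fs)
               for (f, bw), a in zip(formants, amps))
```

A typical call might be cascade(src, [(500, 60), (1500, 90), (2500, 120)], 16000); the per-formant amplitudes A1..A3 are only meaningful in the parallel form.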

Periodic Source
Glottis approximation formulas
Flanagan model: an explicit periodic function
- u[n] = ½(1 - cos(πn/L)) if 0 ≤ n ≤ L (opening phase)
- u[n] = cos(π(n - L)/(2M)) if L < n ≤ L + M (closing phase)
- u[n] = 0 otherwise
Liljencrants-Fant (LF) model (figure)
- The flow rises from 0 to amplitude Av at time Tp
- Te, where the derivative reaches its negative extreme Ee, is the glottal closing instant
- The open quotient is Oq = Te / T0
- αm is the ratio between the opening and closing phases
- Abrupt closure after the maximum excitation, between Oq·T0 and T0
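
A sketch of the Flanagan-style pulse and a periodic source built from it; the segment lengths L, M and the function names are illustrative:

```python
import numpy as np

def glottal_pulse(L, M):
    """One glottal flow pulse: rising half-cosine over 0..L,
    falling quarter-cosine over L..L+M, zero elsewhere."""
    n_open = np.arange(0, L + 1)
    n_close = np.arange(L + 1, L + M + 1)
    opening = 0.5 * (1.0 - np.cos(np.pi * n_open / L))
    closing = np.cos(np.pi * (n_close - L) / (2.0 * M))
    return np.concatenate([opening, closing])

def pulse_train(L, M, period, n_samples):
    """Repeat the pulse every `period` samples, so F0 = fs / period."""
    u = np.zeros(n_samples)
    p = glottal_pulse(L, M)
    for start in range(0, n_samples - len(p), period):
        u[start:start + len(p)] += p
    return u

# e.g. at fs = 16000, pulse_train(40, 10, 80, 16000) gives a 200 Hz source
```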

Radiation From the Lips
Actual modeling of the lips is very complicated, so rule-based synthesizers use a specific approximation formula
Experiments show
- Lip radiation contains at least one anti-resonance (a zero in the transfer function)
The approximation formula often used: R(z) = 1 - αz^-1 where 0.95 ≤ α ≤ 0.98
This turns out to be the same formula used for preemphasis
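
As a sketch, this radiation (or preemphasis) filter is a single first difference; the function name and default α are illustrative, within the range quoted above:

```python
import numpy as np

def lip_radiation(u, alpha=0.97):
    """Apply R(z) = 1 - alpha*z^-1, i.e. y[n] = u[n] - alpha*u[n-1]."""
    y = np.asarray(u, dtype=float).copy()
    y[1:] -= alpha * y[:-1] + 0.0 * y[1:]  # first difference against the input
    return y
```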

Consonants and Nasals
Nasals
- One resonator models the oral cavity
- Another resonator models the nasal cavity
- Add a zero in series with the resonators
- The outputs are added to generate the result
Fricatives
- The source is noise, the glottis, or both
- One set of resonators models the tract in front of the place of constriction
- Another set models the tract behind the constriction
- The outputs are added together

The Klatt Synthesizer

Klatt Parameters

Evaluation of Formant Synthesizers
Quality
- The speech produced is understandable
- The output sounds metallic (not natural)
Problems
- The system uses lumped parameters (like the components of a spring); it is not distributed (like the real vocal tract)
- Assumptions that are individually valid become invalid when joined together in one system
- Speech subtleties are too complex for the formant model
- Transitions between sounds are not modeled
- Formants are not present in obstruent sounds

Classical Linear Prediction (LP Synthesis)
Concept
- Use the all-pole tube model of linear prediction
- Y(z) = X(z) / (1 - a1·z^-1 - a2·z^-2 - … - aP·z^-P) leads to the linear prediction formula
  y[n] = x[n] + a1·y[n-1] + a2·y[n-2] + … + aP·y[n-P]
Improvements over formant synthesis
- Parameters are obtained directly from speech, not from experimentation or human intervention
- The glottal filter is subsumed in the LP equation, so synthesizing the glottal source becomes unnecessary
Tradeoffs
- Lose modularity and the physical interpretation of the coefficients
- The lack of zeros makes modeling nasals and fricatives difficult
- Modeling transitions between sounds is problematic
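
A sketch of the synthesis side of this equation; estimating the coefficients (e.g. by autocorrelation and Levinson-Durbin analysis) is assumed to happen elsewhere:

```python
import numpy as np
from scipy.signal import lfilter

def lp_synthesize(excitation, a):
    """All-pole LP synthesis: y[n] = x[n] + a1*y[n-1] + ... + aP*y[n-P].

    `a` holds the prediction coefficients a1..aP for one frame.
    """
    # H(z) = 1 / (1 - a1*z^-1 - ... - aP*z^-P)
    denom = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return lfilter([1.0], denom, excitation)
```

Driving this filter with an impulse train (voiced) or white noise (unvoiced) reproduces the classical LP vocoder structure.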

LP Diphone-Concatenation Synthesis
Diphone: the unit that starts in the middle of one phone and ends in the middle of the next phone
Concept
- Capture and store the vocal tract dynamics of each frame
- Alter F0 by changing the impulse rate
- Alter duration as needed
- Concatenate the stored frames to accomplish synthesis
Input: an array of {phone symbol, F0 value, duration}
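
One plausible representation of that input array, sketched as a dataclass; the field names and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class SynthUnit:
    phone: str       # phonetic symbol, e.g. "eh"
    f0: float        # target fundamental frequency in Hz (0 if unvoiced)
    duration: float  # target duration in seconds

utterance = [SynthUnit("h", 0.0, 0.05),
             SynthUnit("eh", 120.0, 0.12),
             SynthUnit("l", 115.0, 0.07),
             SynthUnit("ow", 100.0, 0.18)]
```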

LP Difficulties
Boundary-point transition artifacts
- Approach: interpolate the LP parameters between adjacent frames
The output has a metallic or buzzy quality because the LP filter does not entirely capture the character of the source; the residual contains spikes at each pitch period
Experiment: resynthesize a speech waveform
- Resynthesize with the residual: the speech sounds perfect
- Resynthesize without the residual:
  - Same pitch and duration: degraded but okay
  - Altered pitch: the speech becomes buzzy
  - Altered duration: degraded but okay

Articulatory Synthesis
The oldest approach: mimic the components of the vocal tract
Kempelen
- A mechanical device with tubes, bellows, and pipes
- Played as one plays a musical instrument
Digital version
- The controls are the tubes, not the formants
- The LP tube parameters can be obtained from the LP filter
Difficulties
- It is difficult to obtain values that shape the tubes
- The glottis and lip radiation still need to be modeled
- Existing models produce poor speech
Current applicable research: articulatory physiology, gestures, audio-visual synthesis, talking heads

2nd Generation Synthesis by Concatenation
An extension of 1st generation LP concatenation
Comparisons to 1st generation models
- Input: still explicitly defines the phonetic symbols, F0 contour, and duration
- Output: the source waveform is generated from a database of diphones (one stored example per diphone); the impulse and noise generators are discarded
- Concatenation: pitch and duration algorithms glue the diphones together

Diphone Inventory Requirements
- With 40 phonemes, 40 left phones and 40 right phones can combine in 1600 ways
- A phonotactic grammar can reduce the database size
- Pick long units rather than short ones (it is easier to shorten duration than to lengthen it)
- Normalize the phases of the diphones
- All diphones should have equal pitch
Finding diphone sound waves to build the inventory
- Search a corpus (if one exists)
- Specifically record words containing the diphones
- Record nonsense words (logatomes) with the desired features

Pitch-Synchronous Overlap and Add (PSOLA)
Purpose: modify the pitch or timing of a signal
PSOLA is a time domain algorithm
Pseudo code (a Python sketch follows the pseudo-code slide below)
- Find the exact pitch periods in the speech signal
- Create overlapping frames centered on the epochs, extending back and forward one pitch period
- Apply a Hamming window to each frame
- Add the windowed frames back together: closer together for higher pitch, further apart for lower pitch
- Remove frames to shorten, or insert frames to lengthen
Undetectable if the epochs are accurately found. Why? We are not altering the vocal tract filter, only changing the amplitude and spacing of its input.

PSOLA Illustrations (figures)
- Pitch modification: window and add
- Duration modification: insert or remove frames

PSOLA Epochs
PSOLA requires exact marking of the pitch points in a time domain signal
Pitch marks
- Marking any point within a pitch period is okay, as long as the algorithm marks the same point in every period
- The most common marking point is the instant of glottal closure, which shows up as a quick descent in the time domain
- Create an array of sample numbers comprising the analysis epoch sequence P = {p1, p2, …, pn}
- Estimate the local pitch period around epoch pk as (pk+1 - pk-1)/2

PSOLA Pseudo Code
Identify the epochs using an array of sample indices, P
FOR each input object
  Extract the desired F0, phoneme, and duration
  speech = the phoneme sound wave looked up from stored data
  Identify the epochs in the phoneme with the array P
  Break the phoneme into frames
  IF the F0 value differs from that of the phoneme
    Window each frame into an array of frames
    speech = overlap and add the frames using the desired F0
  IF the duration is longer than desired
    Delete extra frames from speech at regular intervals
  ELSE IF the duration is shorter than desired
    Duplicate frames at regular intervals in speech
Note: multiple F0 points in one phoneme require multiple input objects
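
A minimal Python sketch of the pseudo code above, assuming the epochs are already marked; the Hann window, edge handling, and function name are illustrative choices:

```python
import numpy as np

def td_psola(x, epochs, pitch_scale=1.0, time_scale=1.0):
    """Time-domain PSOLA for one voiced segment.

    x           : 1-D float array (a stored phoneme waveform)
    epochs      : increasing sample indices of its pitch marks
    pitch_scale : >1 raises F0 (synthesis epochs are spaced closer)
    time_scale  : >1 lengthens the output (frames are duplicated)
    """
    epochs = np.asarray(epochs, dtype=int)
    y = np.zeros(int(len(x) * time_scale) + 1)

    t = float(epochs[1])                      # current synthesis epoch
    while t < len(y) - 1:
        # Map the synthesis position back to analysis time and pick the
        # nearest analysis epoch (duplicates/drops frames when stretching).
        k = int(np.argmin(np.abs(epochs - t / time_scale)))
        k = min(max(k, 1), len(epochs) - 2)   # keep neighbors in range
        period = int((epochs[k + 1] - epochs[k - 1]) // 2)
        if period <= 0:
            break

        # Extract a two-period frame centered on the epoch and window it.
        lo, hi = epochs[k] - period, epochs[k] + period
        if lo < 0 or hi > len(x):
            break
        frame = x[lo:hi] * np.hanning(hi - lo)

        # Overlap-add at the synthesis epoch; the spacing of these
        # epochs (period / pitch_scale) determines the output pitch.
        pos = int(t) - period
        a, b = max(pos, 0), min(pos + len(frame), len(y))
        y[a:b] += frame[a - pos:b - pos]

        t += period / pitch_scale
    return y
```

For example, td_psola(x, epochs, pitch_scale=1.1) raises F0 by roughly 10% without changing the duration.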

PSOLA Evaluation
Advantages
- As a time domain algorithm, it is unlikely that any other approach is more efficient (O(N))
- If the pitch and timing differences are within 25%, listeners cannot detect the alterations
Disadvantages
- Epoch marking must be exact
- Only pitch and timing changes are possible
- If used with unit selection, several hundred megabytes of storage could be needed

LP-PSOLA
Algorithm
- If the synthesizer uses linear prediction to compress the phoneme sound waves, the residual portion of the signal is already available for additional waveform modifications
- Mark the epoch points of the LP residual and overlap/combine with the PSOLA approach
Analysis
- The resulting speech is competitive with PSOLA, but not superior

Sinusoidal Models
Find the contributing sinusoids in a signal using linear regression techniques
- Definition: statistically estimate the relationships between variables that are related in a linear fashion
- Advantage: the algorithm is less sensitive to finding exact pitch points
General approach (see the sketch below)
- Filter the noise component from the signal
- Successively match the signal against a high-frequency sinusoidal wave, subtracting each match from the wave
- The lowest remaining wave is F0
- Use a PSOLA-type algorithm to alter pitch and duration
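
A sketch of the regression step: because A·sin(ωn + φ) = a·sin(ωn) + b·cos(ωn) is linear in (a, b), each component can be fit by ordinary least squares. The candidate frequencies are assumed known here (e.g. from FFT peaks), and the function names are illustrative:

```python
import numpy as np

def fit_sinusoid(x, freq, fs):
    """Least-squares fit of one sinusoid at a known frequency.

    The fit is linear in (a, b), so ordinary linear regression recovers
    amplitude and phase without needing exact pitch marks.
    """
    n = np.arange(len(x))
    w = 2.0 * np.pi * freq / fs
    basis = np.column_stack([np.sin(w * n), np.cos(w * n)])
    (a, b), *_ = np.linalg.lstsq(basis, x, rcond=None)
    return basis @ np.array([a, b])           # the fitted component

def peel_components(x, freqs, fs):
    """Successively subtract fitted sinusoids, highest frequency first;
    the lowest remaining component corresponds to F0."""
    parts = []
    for f in sorted(freqs, reverse=True):
        comp = fit_sinusoid(x, f, fs)
        parts.append((f, comp))
        x = x - comp                          # remove the matched component
    return parts, x
```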

MBROLA Overview
PSOLA synthesis has very poor quality (a hoarse voice quality) if the pitch points are not correctly marked. MBROLA addresses this issue by preprocessing the stored database:
- Ensure that all units have the same phase
- Force all units to have the same pitch
Overlap-add synthesis then works with complete accuracy.
Home page: http://tcts.fpms.ac.be/synthesis/mbrola/

Issues and Discussion: Concatenation Synthesis
Micro-concatenation
- Problem: joining phonemes can cause clicks at the boundary
  - Solution: taper the waveforms at the edges (sketched below)
- Problem: joining segments with mismatched phases
  - Solution: force all segments to be phase aligned
- Problem: finding optimal coupling points
  - Solutions: algorithms for matching trajectories; interpolate the LP parameters
Macro-concatenation: ensure a natural spectral envelope
- Requires an accurate F0 contour
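
A minimal sketch of the tapering fix for boundary clicks; the linear fade and the overlap length are illustrative choices, and both segments are assumed longer than the overlap:

```python
import numpy as np

def taper_join(a, b, overlap):
    """Join two segments with a linear crossfade over `overlap` samples,
    avoiding the click a hard boundary would cause."""
    fade = np.linspace(1.0, 0.0, overlap)
    mixed = a[-overlap:] * fade + b[:overlap] * (1.0 - fade)
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])
```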