Spring Wave Oscillations External force causes oscillations Governing equation: f = ½π(k/m) ½ – The spring stiffness and quantity of mass determines the frequency Force determines displacement Damping resistance leads to exponential decay (like a filter) – No resistance would create an unstable system Differences from sound – Components are discrete, not distributed – There is no traveling wave k = spring constant x = displacement F = external force B = damper m = mass f = oscillation frequency Spring Oscillation
Rope Impulse Waves A hand jerk is an impulse – An impulse causes a wave to travel. – Speed of impulses determines frequency Rope stiffness determines wave velocity Reverse waves start at the fixed end – Resonates when waves correlate – Cancels when waves don’t correlate – Eventually a steady state is reached and we perceive no motion (standing wave) – Boundary conditions model the interaction Sound differences – lateral verses transverse motion – Waves escape for non-stop sounds
Acoustic Compression Waves Resistance – sources of energy loss – Ex: Vocal tract walls Impedence (Z) – Resistance to wave motion Capacitance – air resistance to compression Reflection – Portion of waves that reflect back to the source Compression and Rarefaction
Vocal Tract Tube Model Series of short uniform tubes connected in series – Each slice has a fixed area – Add slices for the model to become more continuous Mathematic equations – y(n) = u(t) * v(t) * r(t) – z-domain: Y(z) = U(z) V(z) R(z) – U = Glottal source – V = Vocal model – R = Lip radiation model Reminder: * means convolution
Cylinder model(s) Rough model of throat and mouth cavity With nasal cavity Voice Excitation Voice Excitation open open/closed
Vocal Tract Tube Model Assumptions that have little impact on accuracy 1.Vocal system is linear time invariant (LTI) – Speech is not fully linear, but the model still is a good approximation – LTI: an impulse to the system has the same response regardless of time 2.Vocal tract shape is static – Vocal tract moves with a rate of change slower than the sampling rate. 3.Model employs straight tubes which don’t bend – The bends between tubes do not significantly alter the acoustics 4.Model assumes a discrete number of tubes – Has no impact if we set the number of tubes to a large enough number 5.Vocal tract is lossless (Note: An extended IIR model can monitor these) – Vocal tract walls add resistance and reverberation that tend to cancel – Glottal and lip loss quantities dominates vocal tube loss – The IIR model can be extended to model loss, if necessary
Vocal Tract Tube Model Assumptions that impact accuracy All-pole modeling – The ear is more sensitive to peaks than valleys, which is fortunate – Adding poles can sometimes approximate valleys, but not exactly – Accurate for modeling vowels, but not perfect for obstruents – Poles have difficulty modeling ripples in the speech signal Radiation from the lips – Effects are much more complicated than assuming a simple tube – Example: high frequencies are more narrowly directed than low ones Source-filter separation – Interrelationships between vocal tract muscles is complicated Glottal source – Source signals are complicated, not represented by simple F0 impulses – Example: Louder speaking more than increasing the source strength
Definition of Terms Consider sound propagating along a tube Define: – u(x, t) = particle velocity of wave – U(x, t) = volume velocity (u(x,t) * A where A is area) – p(x,t) = wave pressure variation compared to when x = 0 – ρ = density of air at sea level – c = velocity of sound – d = x F s / c discrete normalized distance by sample rate; x = tube length – n = normalized time of n th sample Notes: – Electrical circuit voltage/current is analogous to acoustic p(x,t) & u(x, t) – Discrete versions of u(x,t) and p(x,t) are respectively u(d,n) and p(d,n) – u[n] is a discrete sample wave measurement at point n – u + [n] is a forward traveling wave; u - [n] is a backward traveling wave
Tube Model Characteristics Determined by the boundaries between tubes – Some of the input travels forward and some reflects back – Junctions between tubes completely defines velocity/pressure – u[d,n] = u + [n-d] – u - [n+(D-d)] {D = length of tube, d = point from start} Interpretation: Wave velocity at some point in a tube equals forward wave (u + ) from the previous tube minus backward reflected wave (u - ) from the next tube – p[d,n] = ρc/A (u + [n-d] + u - [n+(D-d)]) Interpretation: Wave pressure at some point in a tube equals forward wave (u + ) from the previous tube plus backward reflected wave (u - ) from the next tube Author states: These expressions are derived from first-principle solution of wave equations and properties of air.
Flow and Pressure at Junctions Wave flow at junctions between tube k and k+1 u k = u k [D k,n] = u k+1 = u k+1 [0,n] and p k [D k, n] = p k+1 [0,n] Using formulas on previous slide – u + k – u - k = u + k+1 – u - k+1 and 1/A k (u + k + u - k ) = 1/A k+1 (u + k+1 + u - k+1 ) – Forward minus backward wave velocity from the end of tube k = Forward minus backward wave velocity from beginning of k+1. – Pressure at junction is the sum of the forward and backward pressure Multiply pressure formula by A k and add two formulas 2u + k = A k /A k+1 (u + k+1 + u - k+1 ) + u + k+1 – u - k+1 2u + k = (A k +A k+1 )/A k+1 u + k+1 – (A k -A k+1 )/A k+1 u - k+1 ) u + k = (A k +A k+1 )/(2A k+1 )u + k+1 – (A k -A k+1 )/(2A k+1 )u - k+1 ) u + k = (A k +A k+1 )/(2A k+1 )u + k+1 + (A k+1 -A k )/(2A k+1 )u - k+1 ) By similar math (subtracting two formulas instead of adding) we can derive formula for u - k
Reflection Coefficients Previous page formula u + k = (A k +A k+1 )/(2A k+1 )u + k+1 + (A k+1 -A k )/(2A k+1 )u - k+1 ) We want to convert U + k formula to use reflection coefficients 1/(1+r k ) = 1/(1 + (A k+1 –A k )/(A k+1 +A k )) Multiply numerator and denominator by A k+1 +A k 1/(1+r k ) = 1/(1+(A k+1 –A k )/(A k+1 +A k ))(A k+1 +A k )/(A k+1 +A k ) Simplify 1/(1+r k ) = (A k+1 + A k )/(A k+1 + A k + A k+1 – A k ) = (A k+1 +A k )/2A k+1 Similarly, we can show that: r k /(1+r k ) = (A k+1 –A k )/2A k+1 Finally: u + k = (1/(1+r k )u + k+1 + r k /(1+r k ) u - k+1 (Formula 11.16a in book) And by similar derivation we derive Formula 11.16b in book u - k = -r k /(1+r k )u + k+1 + 1/(1+r k )u - k+1 Definition: r k = (A k+1 –A k )/(A k+1 +A k )
General Model Wave at junction u + k = (1/(1+r k )u + k+1 + r k /(1+r k ) u - k+1 u - k = -r k /(1+r k )u + k+1 + 1/(1+r k )u - k+1 Take Z transform of wave at junction U + k (z) = (z D k /(1+r k )U + k+1 + r k z D k /(1+r k ) U - k+1 U - k (z) = -r k z -D k /(1+r k )U + k+1 + z -D k /(1+r k )U - k+1 Note – The Z exponent is positive for forward traveling waves and negative for backward traveling waves. In one case the wave is sped up and in the other the wave is being delayed. It is like moving the imaginary axis of the Z plane left or right – Assumptions: Reflection waves from the lips = 0 – All reflection back to the glottis is absorbed by the lungs
Single Tube Model Flow from tube to Lips (U + L ) without reflection and flow back from tube (U - 1(z)) to the glottis U + 1 (z) = (z D /(1+r L )U + L and U - 1 (z) = -r L z -D /(1+r L )U + L Flow from glottis to tube U + G (z) = (z D G /(1+r G )U + 1 (z) -r G z -D G /(1+r G )U - 1 We model the glottis as a tube of length 0, so Z D G = 1 U + G (z) = (1/(1+r G )U + 1 (z) -r G /(1+r G )U - 1 = (1/(1+r G )(z D /(1+r L )U + L + r G /(1+r G )r L z -D /(1+r L ) U + L = Z D + r G r L Z -D /((1+r G ) (1+r L )) Transfer function for Vocal Tract V(z) = U L / U G V(z) = (1+r G )(1+r L ) / (z D +r G r L z -D k ) = (1+r G )(1 + r L ) z -D /(1+r G r L z -2D ) Conclusion: A one tube models the vocal tract as a filter that resonates when it is applied to a source generated signal.
Analysis: Single Tube Vocal transform function V(z) = (1+r G )(1 + r L ) z -D /(1+r G r L z -2D ) Compute D – D = x F s / c = 0.17 (16000)/34 = 8 –.17 = length of male vocal tract, is sampling rate, 34 meters per second is speed of sound – Exponent of denominator is -16. Conclusions – A filter to model the vocal tract only needs 16 terms – One tube needed for every thousand samples per second
Frequency Response: Single Tube Vocal transform function V(z)=(1+r G )(1+r L )z -D /(1+r G r L z -2D ) Replace (1+r G )(1+r L ) by a constant gain factor G Special case – If Glottis closed: r G = 1 – If Lips fully open: r L = -1 – V(e i ω ) = e -i5ω /(1 – e --i10ω ) = 1/ (e -i5ω + e i5ω ) = 2/cos(5ω) Peaks when cos(5 ω) = 0 Frequency Response
Multi-Tube Model Start from the lips Add tube and plug in the formula Continuing to add tubes working backwards Finish the recursion when we reach the glottis z -P/2 Π k=0,N (1+r k ) 1 – a 1 z -1 - a 2 z -2 - … - a n z -P This turns out to be a standard IIR filter Linear prediction can help us automate how to accurately compute these coefficients V(z) = Note: The text derives this using successive matrix multiplication. His approach is fine and easy to implement in a for loop. For our purposes though, the final formula is what is most important.
The All Pole Model Using a Gain factor, G, an IIR filter can model the vocal tract The model needs a pole for each 1k of the sample rate – A multi-pole model results in more non-zero filter coefficients – Adding extra poles will not increase accuracy – Factoring a filter’s polynomial roughly estimates the formants – Actual pole locations don’t correspond to specific tubes. They result from all of the interactions between the tubes – Adding zeros to the model can account for anti resonances CELP (Code Excited Linear Prediction) – memorizes banks of vocal tract IIR filters for signal compression – The residue is the difference between the actual and modeled signal – Idea: It takes less bits to save the residue than the original signal
Radiation A speech signal travel some distance before being heard There is impedance at the lips that alters the signal to somewhat increases the high frequency amplitudes This can be modeled by an additional filter, but an accurate model is quite complicated An adequate model – Transfer function: R(z) = P(z)/U L (z) – Pressure (P) = Output (U L ) times impedance (R) – Typical implementation: R(z) = 1 – αz -1 where.95<α<.98 – Result: approximate 6dB boost per octave
Glottal Source Designing a highly accurate model is an open research area Simplistic model – Feeding impulses at regular intervals – Unfortunately, it doesn’t accurately mimic the vocal fold vibration Alternate models – u[n] = ½ (1-cos(πn/N 1 )) if 0≤n≤N 1 ; cos(π (n-N 1 )/2N 2 )) if N 1 ≤n ≤N 2 ; N 2 ≤n≤ N 3 – Adding zeroes U(z) = ∏ k=0,M (1-u k z -1 )/(1-z) 2 to model jitter (varying periods) and shimmer (varying amplitudes) – Modeling with a formula that models an initial negative impulse (Lijencrants-Fant)
Nasal Cavity A parallel filter is needed – It’s is a static articulator – A simpler all pole filter will suffice Difficulties – Nasal consonants: the oral wave reflects back causing anti resonances – Nasalized vowels: Some of the nasal wave filters back to the oral cavity Possible complete model – Anti-resonances are modeled with zeros – All pole model for the pharnyx + back of mouth – All pole model for nasal cavity – Poles and zeroes for the oral cavity – A splitting and feedback operation
Speech Models for Nasal Speech Vowel or glide U(z) = P(z)M(z) R L (z) P(z)M(z) R L (z) Nasalized Vowel or glide U(z) = P = Pharnyx, M = Mouth, R L = Radiation from lips, N = Nasal, R N Radiation from nose N(z) R N (z) P(z)M(z) Nasal U(z) = N(z) R N (z)
Speech Models for Obstruent Sounds Placement of tongue – Separates the mouth into two sections V b (z) is the vocal tract to the back of the mouth; M(z) is the mouth The air reflects to the back portion causing anti-resonances – Frequency increases as constriction the point moves towards the front – Poles and zeroes are needed to model the anti resonances. Voiced Obstruent U(z) = V b (z) M(z) R L (z) Unvoiced Noise source V b (z) M(z) R L (z) Unvoiced Noise source