Speech Recognition
Raymond Sastraputera
Introduction
Frame/Buffer
Algorithm
Silent Detector
Estimate Pitch
◦ Correlation and Candidate
◦ Optimal Candidate
◦ Buffer Delay
Added Bias
Test and Result
Conclusion
Estimates the pitch of a speech signal
Written in C++
Frame segments are shifted with no overlap
[Figure: frame segments and the buffer]
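As a minimal sketch of the non-overlapping framing described above, the hypothetical helper `split_frames` advances by exactly one frame length per step, so consecutive frames share no samples. The function name, the `float` sample type, and the choice to drop a trailing partial frame are assumptions made for illustration, not details taken from the original implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: cut a signal into consecutive frames with no overlap.
// A trailing partial frame is dropped (an assumption, not from the slides).
std::vector<std::vector<float>> split_frames(const std::vector<float>& signal,
                                             std::size_t frame_size) {
    std::vector<std::vector<float>> frames;
    if (frame_size == 0) return frames;
    // The hop equals the frame size, so frames never overlap.
    for (std::size_t start = 0; start + frame_size <= signal.size();
         start += frame_size) {
        const auto first = signal.begin() + static_cast<std::ptrdiff_t>(start);
        frames.emplace_back(first,
                            first + static_cast<std::ptrdiff_t>(frame_size));
    }
    return frames;
}
```

For a 10-sample signal and a frame size of 4, this yields two frames and discards the last two samples.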
Initial detection of silence
|max(x)| + |max(y)| + |max(z)| + |min(x)| + |min(y)| + |min(z)|
Threshold value (50 dB)
[Figure: segments X, Y, Z]
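The sum above can be sketched as follows. The slide gives the expression and the threshold but not the direction of the comparison, so this sketch assumes a frame counts as silent when the sum falls below the threshold. The names `peak_extent` and `is_silent` are hypothetical, and the threshold is passed in as a plain number without any dB conversion.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// |max(seg)| + |min(seg)| for one segment; an empty segment contributes 0.
float peak_extent(const std::vector<float>& seg) {
    if (seg.empty()) return 0.0f;
    const auto mm = std::minmax_element(seg.begin(), seg.end());
    return std::fabs(*mm.second) + std::fabs(*mm.first);
}

// Assumption: the frame is silent when the summed extents of segments
// x, y and z fall below the threshold.
bool is_silent(const std::vector<float>& x, const std::vector<float>& y,
               const std::vector<float>& z, float threshold) {
    return peak_extent(x) + peak_extent(y) + peak_extent(z) < threshold;
}
```

Because only the extremes of each segment are inspected, this check is cheap enough to run before any correlation is computed.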
Correlation of two vectors
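The formula from this slide is not preserved in the text. The sketch below therefore assumes the normalized cross-correlation coefficient, a common way to correlate two equal-length vectors; the original implementation may use a different form. The function name `normalized_correlation` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed form: sum(a[i]*b[i]) / sqrt(sum(a[i]^2) * sum(b[i]^2)).
// Returns 0 when the lengths differ or either vector has no energy.
double normalized_correlation(const std::vector<double>& a,
                              const std::vector<double>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0;
    double dot = 0.0, energy_a = 0.0, energy_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        energy_a += a[i] * a[i];
        energy_b += b[i] * b[i];
    }
    const double denom = std::sqrt(energy_a * energy_b);
    return denom > 0.0 ? dot / denom : 0.0;
}
```

With this form the result lies in [-1, 1], which makes a fixed threshold meaningful regardless of signal level.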
Correlation P(x,y)
Calculate for different window sizes (n_m)
◦ Window size will be the pitch value (in samples)
◦ Correlation values above the threshold become candidates with score 1
[Figure: segments X, Y, Z with vector x and vector y, each of length n_m]
Correlation P(y,z)
Calculate for different n_m
◦ Only for window sizes among the candidates with score 1
◦ Correlation values above the threshold become candidates with score 2
[Figure: segments X, Y, Z with vector y and vector z, each of length n_m]
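The two correlation passes can be sketched together as shown below. Several details are assumptions, since the slides do not spell them out: vectors x, y and z are taken as three adjacent windows of n_m samples around a hypothetical `center` index, the correlation is the normalized cross-correlation coefficient, and the threshold is passed in by the caller. All names (`Candidate`, `find_candidates`, `window_correlation`) are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Candidate {
    std::size_t n;  // window size n_m in samples (the candidate pitch period)
    int score;      // 1 after passing P(x,y), 2 after also passing P(y,z)
    double pxy;
    double pyz;     // only computed for candidates that reached score 1
};

// Assumed correlation: normalized cross-correlation of two n-sample windows.
double window_correlation(const double* a, const double* b, std::size_t n) {
    double dot = 0.0, ea = 0.0, eb = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        dot += a[i] * b[i];
        ea += a[i] * a[i];
        eb += b[i] * b[i];
    }
    const double denom = std::sqrt(ea * eb);
    return denom > 0.0 ? dot / denom : 0.0;
}

// Assumed layout: x = [center-n, center), y = [center, center+n),
// z = [center+n, center+2n).
std::vector<Candidate> find_candidates(const std::vector<double>& buf,
                                       std::size_t center, std::size_t n_min,
                                       std::size_t n_max, double threshold) {
    std::vector<Candidate> out;
    for (std::size_t n = n_min; n <= n_max; ++n) {
        // Skip window sizes that do not fit inside the buffer.
        if (n == 0 || center < n || center + 2 * n > buf.size()) continue;
        const double* x = buf.data() + (center - n);
        const double* y = buf.data() + center;
        const double* z = buf.data() + center + n;
        const double pxy = window_correlation(x, y, n);
        if (pxy <= threshold) continue;  // not a candidate at all
        Candidate c{n, 1, pxy, 0.0};
        // P(y,z) is evaluated only for window sizes that already have score 1.
        c.pyz = window_correlation(y, z, n);
        if (c.pyz > threshold) c.score = 2;
        out.push_back(c);
    }
    return out;
}
```

For a clean periodic signal, the true period and its neighbours typically all pass both tests, which is why a further step is needed to choose among several candidates.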
Correlation Q(n,m)
Calculate for different n_m
◦ n_MAX is the maximum n_m among the candidates
Optimal Candidate
◦ The current candidate becomes optimal if its Q(n,m) * 0.77 is higher than the preceding candidate's Q(n,m)
[Figure: segments X, Y, Z with vector x and vector z; n_MAX and n_m]
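The selection rule can be sketched as below, given Q(n,m) values that have already been computed. The sketch reads "preceding candidate" as the candidate currently held as optimal, which is one interpretation of the slide, and it assumes candidates are visited in the order supplied. The names `QCandidate` and `pick_optimal` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct QCandidate {
    std::size_t n;  // window size n_m in samples
    double q;       // Q(n,m) value for this candidate
};

// Returns the index of the optimal candidate, or -1 if the list is empty.
// The first candidate starts as optimal; a later one replaces it only when
// its Q value, scaled by the multiplier, still exceeds the current optimum.
int pick_optimal(const std::vector<QCandidate>& candidates,
                 double multiplier = 0.77) {
    if (candidates.empty()) return -1;
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        if (candidates[i].q * multiplier > candidates[best].q) best = i;
    }
    return static_cast<int>(best);
}
```

The multiplier below 1 favours earlier candidates: a later candidate has to be clearly better, not just marginally better, to take over.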
Candidate score 1: Correlation P(x,y)
◦ No candidate: silence
◦ Single candidate: compute P(y,z)
  Score stays at 1: hold
  Score 2: estimated pitch
◦ Multi candidate: compute P(y,z)
Candidate score 2: Correlation P(y,z)
◦ No candidate: compute Q(n,m) on the candidates with score 1
◦ Single candidate: estimated pitch
◦ Multi candidate: compute Q(n,m)
Optimal Pitch: Correlation Q(n,m)
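The decision flow above can be sketched as a single function over candidates that already carry a score of 1 or 2. The types and names (`Action`, `ScoredCandidate`, `Decision`, `decide`) are hypothetical, and the sketch only decides what happens next; it does not compute any correlation itself.

```cpp
#include <cstddef>
#include <vector>

enum class Action { Silence, Hold, Pitch, ResolveWithQ };

struct ScoredCandidate {
    std::size_t n;  // window size n_m in samples
    int score;      // 1 = passed P(x,y) only, 2 = also passed P(y,z)
};

struct Decision {
    Action action;
    std::vector<ScoredCandidate> pool;  // candidates the action applies to
};

Decision decide(const std::vector<ScoredCandidate>& candidates) {
    // No candidate from P(x,y): the frame is treated as silence.
    if (candidates.empty()) return {Action::Silence, {}};

    std::vector<ScoredCandidate> score2;
    for (const auto& c : candidates) {
        if (c.score == 2) score2.push_back(c);
    }

    // Single candidate: score 1 means hold, score 2 is the estimated pitch.
    if (candidates.size() == 1) {
        if (score2.empty()) return {Action::Hold, candidates};
        return {Action::Pitch, score2};
    }

    // Multiple candidates: fall back to Q(n,m) unless exactly one has score 2.
    if (score2.empty()) return {Action::ResolveWithQ, candidates};
    if (score2.size() == 1) return {Action::Pitch, score2};
    return {Action::ResolveWithQ, score2};
}
```

When the action is `ResolveWithQ`, the returned pool is the set of candidates on which Q(n,m) would be evaluated to pick the optimal pitch.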
Single candidate with score 2
From Q(n,m) of
◦ Candidates with score 2
◦ Candidates with score 1
On hold, and the next frame's estimated pitch is neither silence nor on hold
Delay the returned value of the estimated pitch
◦ Needed to limit the duration of the on-hold state
Conditions:
◦ The two previous frames are not silent
◦ The previous frame is not on hold
◦ The previous frame's pitch is between 5/8 and 7/4 of the pitch of the frame preceding it
P(x,y) is doubled
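A minimal sketch of the bias test follows. It checks the three conditions literally and then doubles P(x,y). The slides do not say which candidates receive the doubled value or whether the ratio bounds are inclusive, so the sketch treats the bounds as inclusive and leaves the choice of candidates to the caller. The names `FrameState`, `bias_applies` and `apply_bias` are hypothetical.

```cpp
struct FrameState {
    bool silent;
    bool on_hold;
    double pitch;  // estimated pitch; only meaningful when neither flag is set
};

// True when all bias conditions hold for the two frames before the current one.
bool bias_applies(const FrameState& prev, const FrameState& before_prev,
                  double min_ratio = 5.0 / 8.0, double max_ratio = 7.0 / 4.0) {
    if (prev.silent || before_prev.silent) return false;  // both must be voiced
    if (prev.on_hold) return false;
    if (before_prev.pitch <= 0.0) return false;  // guard against division by 0
    const double ratio = prev.pitch / before_prev.pitch;
    return ratio >= min_ratio && ratio <= max_ratio;  // inclusive: an assumption
}

// Doubles P(x,y) when the bias is active, otherwise returns it unchanged.
double apply_bias(double pxy, bool biased) {
    return biased ? 2.0 * pxy : pxy;
}
```

The ratio test keeps the bias from firing across abrupt pitch jumps, so it only reinforces a pitch track that is already reasonably stable.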
correlation_threshold_silent(0.88)
Qnm_optimal_multiplier(0.77)
sample_rate( F)
max_pitch(400)
min_pitch(50)
pitch_buffer_size(20)
bias_max_frequency(7/4)
bias_min_frequency(5/8)
silent_threshold(50.0F)
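These parameters could be grouped as in the sketch below, which also shows how the pitch limits translate into the window-size search range, since the window size is the pitch period in samples. The struct layout and the helpers `min_window` and `max_window` are hypothetical. The sketch assumes `max_pitch` and `min_pitch` are in Hz, and it leaves `sample_rate` at 0 for the caller to set because its value is not legible in the slides.

```cpp
#include <cstddef>

struct PitchParams {
    float correlation_threshold_silent = 0.88f;
    float Qnm_optimal_multiplier = 0.77f;
    float sample_rate = 0.0f;  // not given in the slides; caller must set it
    float max_pitch = 400.0f;  // assumed to be in Hz
    float min_pitch = 50.0f;   // assumed to be in Hz
    std::size_t pitch_buffer_size = 20;
    float bias_max_frequency = 7.0f / 4.0f;
    float bias_min_frequency = 5.0f / 8.0f;
    float silent_threshold = 50.0f;
};

// Smallest window size in samples: one period of the highest pitch.
std::size_t min_window(const PitchParams& p) {
    return static_cast<std::size_t>(p.sample_rate / p.max_pitch);
}

// Largest window size in samples: one period of the lowest pitch.
std::size_t max_window(const PitchParams& p) {
    return static_cast<std::size_t>(p.sample_rate / p.min_pitch);
}
```

As an illustration only, a sample rate of 8000 would give a search range of 20 to 160 samples.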
Some improvements can be made to increase the performance of the pitch estimation.
◦ Reduce the search space
◦ Add the 1st-order derivative of the pitch
◦ Filter outliers / noise
The current algorithm might not be fast enough to run in real time
Bagshaw, Paul Christopher. Automatic Prosodic Analysis for Computer Aided Pronunciation Teaching. The University of Edinburgh (1994).