Computer Science Department
A Speech / Music Discriminator using RMS and Zero-crossings
Costas Panagiotakis and George Tziritas
Department of Computer Science, University of Crete, Heraklion, Greece
Presentation Organization
  I. Introduction
  II. Segmentation
  III. Classification
  IV. Results
  V. Conclusion
EUSIPCO 2002, Toulouse, France
Introduction (1/3)
  Input
    Figure 1: Original sound signal (44100 or sample rate)
  Output
    Figure 2: Real-time segmentation and classification (speech, music, silence)
Introduction (2/3)
  Approaches
    Feature extraction (energy, frequency)
    Feature-based segmentation and classification
  Basic purpose
    Real-time segmentation and classification
  Algorithmic and computational constraints
    Small number of features
    Low change-detection error (20 msec)
    Low minimum distance between two changes (1 sec)
    High accuracy (95%)
Introduction (3/3)
  Basic features, computed every 20 msec
    Root Mean Square (RMS), denoted A: signal energy
    Zero Crossings (ZC): mean frequency
  RMS and ZC are independent characteristics
  Figure 3: RMS in music
  Figure 4: RMS in speech
  Figure 5: ZC in music
  Figure 6: ZC in speech
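A minimal Python sketch of the two basic features, assuming the usual textbook definitions: RMS as the square root of the mean squared sample value in a frame, and ZC as the number of sign changes between consecutive samples in that frame. The function name `frame_features`, the non-overlapping framing, and the use of NumPy are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def frame_features(signal, sample_rate, frame_sec=0.020):
    """Return (rms, zc): one value per non-overlapping frame of frame_sec seconds."""
    x = np.asarray(signal, dtype=float)
    n = int(round(sample_rate * frame_sec))        # samples per frame
    n_frames = len(x) // n                         # trailing partial frame is dropped
    frames = x[: n_frames * n].reshape(n_frames, n)

    # RMS: square root of the mean squared amplitude in each frame.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))

    # ZC: count positions where consecutive samples differ in sign.
    signs = np.signbit(frames)
    zc = np.count_nonzero(signs[:, 1:] != signs[:, :-1], axis=1)
    return rms, zc
```

At 44100 Hz a 20 msec frame holds 882 samples, so one second of audio yields 50 RMS values and 50 ZC values.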
Segmentation (1/3)
  Basic characteristics
    RMS based
    The χ² distribution fits the RMS histograms well
      Distribution parameters: m (mean), s² (variance); normalizing term Γ(a + 1)
    Two-stage algorithm
      Stage 1: 1 sec accuracy (low computation cost)
      Stage 2: 20 msec accuracy (high computation cost)
  Figure 7: Histogram RMS in speech, approximation by χ² distribution
  Figure 8: Histogram RMS in speech, approximation by χ² distribution
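As an illustration of fitting such a distribution to RMS values, the sketch below assumes the slide's χ² family can be written as a gamma density p(x) = x^a e^(-x/b) / (b^(a+1) Γ(a+1)) and sets its parameters by moment matching, a = m²/s² - 1 and b = s²/m. That parameterization and the names `fit_gamma` and `gamma_pdf` are assumptions made for this example; the slide itself only shows m, s² and Γ(a + 1).

```python
import math
import numpy as np

def fit_gamma(rms):
    """Moment-matched parameters (a, b) of the assumed gamma-type density."""
    m = float(np.mean(rms))      # m: mean of the RMS values
    s2 = float(np.var(rms))      # s^2: variance of the RMS values
    a = m * m / s2 - 1.0         # shape minus one, so that Gamma(a + 1) normalizes
    b = s2 / m                   # scale
    return a, b

def gamma_pdf(x, a, b):
    """Evaluate x^a * exp(-x/b) / (b^(a+1) * Gamma(a+1)) for x > 0."""
    x = np.asarray(x, dtype=float)
    log_p = a * np.log(x) - x / b - (a + 1.0) * math.log(b) - math.lgamma(a + 1.0)
    return np.exp(log_p)
```

Working in the log domain with `math.lgamma` avoids overflow in Γ(a + 1) when the variance is small relative to the mean.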
Segmentation (2/3)
  Stage 1
    Partition the signal into 1 sec frames (50 RMS values each)
    A change in frame i means that frame i-1 and frame i+1 have to differ
    Compute the frame distance D (Matusita distance) from the frame similarity ρ(p1, p2)
    Frame i is a candidate for Stage 2 (there is a change) if D(i) > threshold and D(i) is a local maximum
  [Diagram: RMS vs. time in 1 sec frames; frame pairs (i-1, i+1) and (i, i+2); distance labels HIGH and LOW; change in frame i]
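Stage 1 can be sketched as follows, under several assumptions that are not stated on the slide: each frame's RMS distribution is modeled as a moment-matched gamma density, the similarity ρ is the integral of sqrt(p1 p2) (which has a closed form for two gamma densities), and the distance is taken as D = 1 - ρ. The classical Matusita distance is sqrt(2 - 2ρ), which ranks frames identically but would need a different threshold. The function names and the default threshold are hypothetical.

```python
import math
import numpy as np

def gamma_params(values, eps=1e-12):
    """Moment-matched gamma shape k and scale b for one frame of RMS values."""
    m = max(float(np.mean(values)), eps)
    s2 = max(float(np.var(values)), eps)   # eps guards constant (e.g. silent) frames
    return m * m / s2, s2 / m

def similarity(p1, p2):
    """Closed-form integral of sqrt(p1 * p2) for two gamma densities."""
    (k1, b1), (k2, b2) = p1, p2
    k = 0.5 * (k1 + k2)
    c = 0.5 * (1.0 / b1 + 1.0 / b2)
    log_rho = (math.lgamma(k) - k * math.log(c)
               - 0.5 * (k1 * math.log(b1) + k2 * math.log(b2)
                        + math.lgamma(k1) + math.lgamma(k2)))
    return min(1.0, math.exp(log_rho))

def stage1_candidates(rms, frame_len=50, threshold=0.3):
    """Return (D, candidates): D[i] compares frame i-1 with frame i+1."""
    rms = np.asarray(rms, dtype=float)
    n_frames = len(rms) // frame_len
    params = [gamma_params(rms[i * frame_len:(i + 1) * frame_len])
              for i in range(n_frames)]
    d = np.zeros(n_frames)
    for i in range(1, n_frames - 1):
        d[i] = 1.0 - similarity(params[i - 1], params[i + 1])
    # Candidate: distance above threshold and not smaller than both neighbors.
    candidates = [i for i in range(1, n_frames - 1)
                  if d[i] > threshold and d[i] >= d[i - 1] and d[i] >= d[i + 1]]
    return d, candidates
```

Only two moments per frame and one closed-form evaluation per frame pair are needed, which is in line with the low computation cost the slides attribute to this stage.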
Segmentation (3/3)
  Stage 2: 20 msec accuracy
    For each candidate frame i from Stage 1:
    1. Move 2 successive frames (1 sec each) located before and after frame i
    2. Find the time instant where the 2 successive frames have the maximum Matusita distance in RMS distribution
    Possible oversegmentation
  Figure 10: The RMS data and the distance D
  Figure 11: The segmentation result and the RMS data
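One possible reading of Stage 2, sketched in Python: slide a boundary t across the candidate frame in steps of one RMS value (20 msec), compare the 1 sec window ending at t with the 1 sec window starting at t, and keep the t with the largest distance. The gamma moment matching, the distance D = 1 - ρ, and the names `gamma_distance` and `refine_change` are assumptions for this example rather than details given on the slide.

```python
import math
import numpy as np

def gamma_distance(x, y, eps=1e-12):
    """1 - rho for moment-matched gamma models of two windows of RMS values."""
    params = []
    for v in (x, y):
        m = max(float(np.mean(v)), eps)
        s2 = max(float(np.var(v)), eps)
        params.append((m * m / s2, s2 / m))        # (shape, scale)
    (k1, b1), (k2, b2) = params
    k = 0.5 * (k1 + k2)
    c = 0.5 * (1.0 / b1 + 1.0 / b2)
    log_rho = (math.lgamma(k) - k * math.log(c)
               - 0.5 * (k1 * math.log(b1) + k2 * math.log(b2)
                        + math.lgamma(k1) + math.lgamma(k2)))
    return 1.0 - min(1.0, math.exp(log_rho))

def refine_change(rms, frame_index, frame_len=50):
    """Return (t, distance): the RMS index inside the candidate frame that best splits it."""
    rms = np.asarray(rms, dtype=float)
    start = frame_index * frame_len
    best_t, best_d = None, -1.0
    for t in range(start, start + frame_len + 1):
        if t < frame_len or t + frame_len > len(rms):
            continue                               # need a full window on each side
        d = gamma_distance(rms[t - frame_len:t], rms[t:t + frame_len])
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d
```

Multiplying the returned index by 0.020 gives the change time in seconds. Each candidate costs about 50 distance evaluations, which is why this step is reserved for the frames flagged by Stage 1.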
Classification (1/4)
  Basic purpose
    Classify each segment into one of the following classes: music, speech, silence
  Hypothesis
    Segmentation gives homogeneous segments
  Main algorithm
    Input: basic characteristics (RMS, ZC)
    Compute the actual features of the segment
    Classify based on the actual feature values
Classification (2/4)
  Actual features specification
    Normalized RMS variance, σ²_A
      Usually (86%): σ²_A (music) < σ²_A (speech)
    Probability of null ZC, ZC0
      Always: ZC0 (music) = 0
      Usually (40%): ZC0 (speech) > 0
    Maximal mean frequency, max(ZC)
      Almost always in speech: max(ZC) 2.4 kHz
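The three features on this slide could be computed per segment roughly as below. Two points are assumptions: the normalized RMS variance is taken as variance divided by squared mean, and a ZC count is converted to a mean frequency in Hz as count / (2 x frame duration), since a sinusoid crosses zero twice per period. The function name and dictionary keys are hypothetical.

```python
import numpy as np

def actual_features(rms, zc, frame_sec=0.020):
    """Segment-level features from per-frame RMS values and zero-crossing counts."""
    rms = np.asarray(rms, dtype=float)
    zc = np.asarray(zc, dtype=float)

    mean = rms.mean()
    # Assumed normalization: variance of RMS divided by the squared mean.
    norm_var = float(rms.var() / (mean * mean)) if mean > 0 else 0.0

    # ZC0: fraction of frames with no zero crossing at all.
    zc0 = float(np.mean(zc == 0))

    # Largest per-frame ZC count, expressed as a frequency in Hz.
    max_freq_hz = float(zc.max()) / (2.0 * frame_sec)

    return {"norm_rms_var": norm_var, "zc0": zc0, "max_freq_hz": max_freq_hz}
```

Under this conversion, 2.4 kHz corresponds to 96 zero crossings in a 20 msec frame.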
Classification (3/4)
  Actual features specification
    Joint RMS/ZC measure, Cz
      Speech: high correlation between RMS and ZC; many void intervals with low RMS and ZC
      Music: RMS and ZC essentially independent
    Void intervals frequency, Fu
      Void interval detection (per 20 msec frame): (RMS < T1) && (RMS < 0.1 max(RMS(i))) && (RMS < T2) || (ZC = 0)
      Group neighboring silent intervals
      Fu: frequency of grouped silent intervals
      Always in speech: Fu > 0.6
      In at least 65% of music: Fu < 0.6
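A hedged sketch of the void-interval feature: the detection test is evaluated as written on the slide with C-style precedence (the three RMS conditions are ANDed, then ORed with ZC = 0), runs of consecutive void frames are merged into one group, and Fu is taken to be the number of groups per second of segment. The precedence, the per-second unit, and the thresholds `t1` and `t2` (whose values the slide does not give) are all assumptions.

```python
import numpy as np

def void_interval_frequency(rms, zc, t1, t2, frame_sec=0.020):
    """Number of grouped void intervals per second of the segment."""
    rms = np.asarray(rms, dtype=float)
    zc = np.asarray(zc)

    # Per-frame void test, following the slide's expression.
    quiet = (rms < t1) & (rms < 0.1 * rms.max()) & (rms < t2)
    void = quiet | (zc == 0)

    # A group starts at every void frame whose predecessor is not void.
    previous = np.concatenate(([False], void[:-1]))
    n_groups = int(np.count_nonzero(void & ~previous))

    return n_groups / (len(rms) * frame_sec)
```

For example, a 10-frame (0.2 sec) segment with two separate quiet runs gives 2 / 0.2 = 10 groups per second.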
Classification (4/4)
  Silence segment recognition
    Segment is silence if E < threshold
  Decision making algorithm
    [Flowchart: Silence segment check → Silence; otherwise Actual features check → speech or music]
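The flowchart could be turned into code along the following lines. The silence test comes first, as on the slide. The feature checks use two statements from the earlier slides (ZC0 is always 0 for music, so ZC0 > 0 indicates speech; Fu > 0.6 always holds for speech, so Fu < 0.6 indicates music). The order of those checks, the final fallback on the normalized RMS variance, and the parameters `silence_threshold` and `variance_threshold` are assumptions made for illustration, not the authors' decision rule.

```python
def classify_segment(energy, zc0, fu, norm_rms_var,
                     silence_threshold, variance_threshold):
    """Label one homogeneous segment as 'silence', 'speech' or 'music'."""
    # Silence segment check: segment energy E below a threshold.
    if energy < silence_threshold:
        return "silence"
    # Music never shows frames without zero crossings, per the slides.
    if zc0 > 0:
        return "speech"
    # Speech always has Fu > 0.6, per the slides.
    if fu < 0.6:
        return "music"
    # Hypothetical fallback: speech usually has the larger normalized RMS variance.
    return "speech" if norm_rms_var > variance_threshold else "music"
```

A call such as `classify_segment(0.5, 0.0, 0.2, 0.5, silence_threshold=0.01, variance_threshold=0.2)` returns `"music"`, because the segment is not silent, has no null-ZC frames, and has Fu below 0.6.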
Results
  Data
    sec speech
    sec music
  Data source
    70% audio CDs
    15% WWW
    15% recordings
  Segmentation performance
    97% detection probability
    Change accuracy ~ 0.2 sec
  Actual features performance
    [Table: Accuracy by Features; feature sets drawn from σ²_A, Cz, ZC0, Fu, and All]
Conclusion
  Complexity
    Minimum complexity O(N)
    Low computation cost
  Summary
    Real-time segmentation and classification into three classes
    Energy distribution (RMS) suffices for segmentation
    RMS and ZC suffice for classification
    Purpose: minimum cost and high performance
  Future extension
    Content-based indexing and retrieval of audio signals
    Pre-processing stage for speech recognition
Segmentation - Classification Demo
Sound Player Demo