AdvAIR: An Advanced Audio Information Retrieval System
Supervised by Prof. Michael R. Lyu
Prepared by Alex Fok, Shirley Ng
2002 Fall
Outline
– Introduction
– System Overview
– Applications
– Experiment
– Future Work
– Q&A
Introduction
Motivation
– Rapid expansion of audio information due to the growth of the internet
– Little attention paid to audio mining
– Lack of a framework for generic audio information processing
Targets
– An open platform that provides a basis for various voice-oriented applications
– Enhanced audio information retrieval performance with guaranteed accuracy
– Generic speech analysis tools for data mining
Approaches
– A robust low-level sound information preprocessing module
– Speed-oriented yet accurate algorithms
– A generalized model concept for various uses
– A visual framework for presentation
System Design
System Flow Chart
[Figure: in the core platform, the audio signal passes through preprocessing, feature extraction, segmentation and clustering, and training and modeling, backed by database storage; extended tools built on the platform implement scene cutting, speaker identification, linguistic identification, and video scene change and speaker tracking]
Feature Extraction
– Energy measurement
– Zero-crossing rate
– Pitch
– Humans resolve frequencies non-linearly across the audio spectrum
– MFCC approach: simulates the vocal tract shape
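As a minimal sketch (not the project's actual code), the first two frame-level features could be computed as follows; `frame` is assumed to be a 1-D numpy array of audio samples:

```python
import numpy as np

def frame_energy(frame):
    """Short-time energy: mean squared amplitude of one frame."""
    return float(np.mean(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return float(np.mean(np.sign(frame[1:]) != np.sign(frame[:-1])))
```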
Feature Extraction (cont'd)
– The idea of a filter bank, which approximates the non-linear frequency resolution
– Bins hold a weighted sum representing the spectral magnitude of the channels
– Lower and upper frequency cut-offs
[Figure: triangular filter magnitudes across the frequency axis]
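The filter-bank idea above can be sketched as follows. This is an illustrative implementation rather than AdvAIR's own; the mel warping formula is the commonly used one, and the parameter names (`n_filters`, `f_lo`, `f_hi`) are assumptions:

```python
import numpy as np

def mel_filterbank(n_filters=20, n_fft=512, sr=16000, f_lo=0.0, f_hi=8000.0):
    """Triangular filter bank with centres spaced linearly on the mel scale,
    so low frequencies get narrow bins and high frequencies wide ones,
    mimicking the ear's non-linear frequency resolution."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_filters + 2 edge points, equally spaced in mel between the cut-offs
    edges = inv_mel(np.linspace(mel(f_lo), mel(f_hi), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge of the triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb
```

Multiplying a power spectrum by this matrix gives one weighted-sum "bin" value per channel, exactly the quantity the bullet above describes.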
Segmentation
– Segmentation cuts the audio stream at acoustic change points
– BIC (Bayesian Information Criterion) is used
– It is threshold-free and robust
– The input audio stream is modeled as Gaussians
[Figure: the stream modeled by a Gaussian with its mean]
Segmentation
Notations for an audio stream:
– N: number of frames
– X = {x_i : i = 1, 2, …, N}: a set of feature vectors
– μ is the mean
– Σ is the full covariance matrix
Segmentation for a Single Change Point
– Assume the change point is at frame i
– H0, H1: two different models
– H0 models the data as one Gaussian: x_1 … x_N ~ N(μ, Σ)
– H1 models the data as two Gaussians: x_1 … x_i ~ N(μ1, Σ1) and x_{i+1} … x_N ~ N(μ2, Σ2)
[Figure: audio stream from frame 1 to frame N, with the change point at frame i]
Segmentation for a Single Change Point (cont'd)
– The maximum likelihood ratio statistic is
  R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|
  where N1 = i and N2 = N − i are the frame counts of the two segments
[Figure: audio stream from frame 1 to frame N, with the change point at frame i]
Segmentation for a Single Change Point (cont'd)
– BIC(i) = R(i) − λP, where P is a penalty for the extra model parameters
– If BIC(i) is positive, i is a change point
– If BIC(i) is negative, i is not a change point
– The question is which model fits the data better: a single Gaussian (H0) or two Gaussians (H1)
[Figure: model H0 vs. model H1]
Segmentation for a Single Change Point (cont'd)
– To detect a single change point, calculate BIC(i) for all i = 1, 2, …, N
– The frame i with the largest BIC value is the change point
– O(N) to detect a single change point
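The single change point search can be sketched in numpy as below. This is illustrative rather than the project's implementation; since the slides leave P unspecified, the standard penalty for one extra full-covariance Gaussian (half the number of free parameters times log N) is assumed:

```python
import numpy as np

def bic_change_point(X, lam=1.0):
    """Search every candidate frame i for the best single change point in
    the feature matrix X (N frames x d dims) using BIC.
    Returns (frame, bic), or (None, bic) when no BIC value is positive."""
    N, d = X.shape
    # assumed penalty form: free parameters of one extra full-covariance Gaussian
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    logdet = lambda S: np.linalg.slogdet(S)[1]
    full = logdet(np.cov(X.T))
    best_i, best_bic = None, -np.inf
    for i in range(d + 1, N - d - 1):  # keep both segments large enough for a covariance
        R = N * full - i * logdet(np.cov(X[:i].T)) - (N - i) * logdet(np.cov(X[i:].T))
        bic = R - lam * P
        if bic > best_bic:
            best_i, best_bic = i, bic
    return (best_i, best_bic) if best_bic > 0 else (None, best_bic)
```

The frame with the largest BIC(i) is taken as the change point; a non-positive maximum means the single-Gaussian model H0 fits the whole window better.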
Segmentation for Multiple Change Points
– Step 1: Initialize the interval [a, b]; set a = 1, b = 2
– Step 2: Detect a change point in the interval [a, b] with the single change point BIC detection algorithm
– Step 3: If there is no change point in [a, b], set b = b + 1; otherwise, let t be the change point detected and set a = t + 1, b = t + 2
– Step 4: Go to Step 2
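The four steps above can be sketched as a driver loop. `detect(a, b)` stands in for the single change point BIC search on frames [a, b) and is passed as a callable, which is an assumption of this sketch (0-based frames here):

```python
def multiple_change_points(N, detect, grow=1):
    """Steps 1-4 above: widen the window [a, b) until a change point is
    found, then restart just after it.  `detect(a, b)` returns the change
    frame inside [a, b) or None."""
    a, b = 0, grow
    changes = []
    while b <= N:
        t = detect(a, b)
        if t is None:
            b += grow            # Step 3, no change point: widen the window
        else:
            changes.append(t)    # Step 3, change point at t: restart after it
            a, b = t + 1, t + 1 + grow
    return changes
```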
Enhanced Implementation Algorithm
Original multiple change point detection algorithm:
– Starts detecting change points within an interval of 2 frames
– Increases the investigation interval by 1 each time
Enhanced implementation algorithm:
– The minimum processing interval used in our engine is 100 frames
– The investigation interval is increased by 100 each time
Enhanced Implementation Algorithm (cont'd)
Why do we increase the interval by 100 frames?
– If the increment is too large, a scene change may be missed
– It must be smaller than 170 frames, because there are around 170 frames in one second
– If the increment is too small, processing is too slow
Enhanced Implementation Algorithm (cont'd)
– Advantage: speed-up
– Trade-off: the detected change point is not very accurate
– To compensate: investigate the frames around the change point again, with the investigation interval incremented by 1, to locate a more accurate change point
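The coarse-then-fine compensation described above might look like this sketch (again with a pluggable `detect(a, b)` single change point search, an assumption of the sketch rather than AdvAIR's actual interface):

```python
def coarse_to_fine(N, detect, coarse=100):
    """Coarse pass with a 100-frame window increment, then a fine pass
    with increment 1 around each coarse hit, as described above.
    `detect(a, b)` returns the change frame inside [a, b) or None."""
    changes = []
    a, b = 0, coarse
    while b <= N:
        t = detect(a, b)
        if t is None:
            b += coarse                          # coarse step: +100 frames
        else:
            # fine pass: re-run the search frame by frame near the coarse hit
            lo, hi = max(a, t - coarse), min(N, t + coarse)
            refined, fb = t, lo + 1
            while fb <= hi:                      # increment 1 per step
                ft = detect(lo, fb)
                if ft is not None:
                    refined = ft
                    break
                fb += 1
            changes.append(refined)
            a, b = refined + 1, refined + 1 + coarse
    return changes
```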
Training and Modeling
– Before performing the various identification tasks, training and modeling are needed
– A probability-based model, the Gaussian Mixture Model (GMM), is used
– GMM is used for language identification, gender identification, and speaker identification
– A GMM is a weighted mixture of many different Gaussian distributions
– A Gaussian distribution is represented by its mean and variance
Gaussian Mixture Model (GMM)
[Figure: a model for speaker i as a weighted combination of Gaussian components]
– To train a model is to calculate the mean, variance, and weight (λ) for each of the Gaussian distributions
Training of Speaker GMMs
– Collect sound clips that are long enough for each speaker (e.g. 20-minute sound clips)
Steps for training one speaker model:
– Step 1: Start with an initial model λ
– Step 2: Calculate a new mean, variance, and weighting (a new λ) by training
– Step 3: Use the new λ if it represents the model better than the old λ
– Step 4: Repeat Steps 2 and 3
Finally, we obtain a λ that represents the model.
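The training loop above is essentially the EM algorithm for a GMM. A minimal diagonal-covariance sketch in numpy (not the project's code; the initialization scheme is an assumption) might look like:

```python
import numpy as np

def train_gmm(X, K=4, iters=50):
    """EM training of a diagonal-covariance GMM on X (N frames x d dims),
    following the steps above: start from an initial model, re-estimate
    mean/variance/weight, and keep the new model while it fits better."""
    N, d = X.shape
    # Step 1: initial model -- means spread along the sorted first dimension
    idx = np.argsort(X[:, 0])[np.linspace(0, N - 1, K).astype(int)]
    mu = X[idx].astype(float)
    var = np.full((K, d), X.var(axis=0))
    w = np.full(K, 1.0 / K)
    prev = -np.inf
    for _ in range(iters):
        # E-step: log-probability of each frame under each component
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2 / var)
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
        norm = np.logaddexp.reduce(logp, axis=1)
        ll = norm.sum()
        if ll - prev < 1e-6:        # Step 3: new model no better, stop
            break
        prev = ll
        r = np.exp(logp - norm[:, None])       # responsibilities
        # M-step (Step 2): new weight, mean, variance per component
        nk = r.sum(0)
        w = nk / N
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var, prev
```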
Applications
Applications
– Video scene change and speaker tracking
– Speaker identification
– Telephony message notification
Video Scene Change and Speaker Tracking
[Figure: a video clip is segmented by the AdvAIR core; the resulting timing and speaker index information feeds a multimedia presentation and video playing mechanism]
Usage
– Speaker tracking enhances data mining about a particular person (e.g. a politician in a conference)
– Audio information indexing and sorting for audio library storage
– An auxiliary tool for video cutting and editing applications
Screenshot
[Screenshot: input clip, multimedia player, and time information with indexing]
Speaker Identification
[Figure: in the training stage, preprocessed speaker clips are used for GMM model training and stored in a speaker model database; in the testing stage, a sound source is run through the speaker comparison mechanism to produce a speaker identity]
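The testing-stage comparison can be sketched as scoring the test clip against every stored model and keeping the best match. This sketch assumes diagonal-covariance GMMs stored as (weights, means, variances) tuples, which is an illustrative convention rather than AdvAIR's storage format:

```python
import numpy as np

def gmm_loglik(X, w, mu, var):
    """Average per-frame log-likelihood of frames X under a diagonal GMM."""
    logp = (-0.5 * (((X[:, None, :] - mu) ** 2 / var)
                    + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
    return float(np.logaddexp.reduce(logp, axis=1).mean())

def identify(X, models):
    """Closed-set speaker identification: pick the speaker whose GMM
    scores the test clip highest.  `models` maps name -> (w, mu, var)."""
    return max(models, key=lambda name: gmm_loglik(X, *models[name]))
```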
Usage
– Security authentication
– Speaker identification for telephone-based systems
– Criminal investigation (used, for example, like a fingerprint)
Screenshot
[Screenshot: input source, flexible-length comparison, speaker identity, and a media player for visual verification]
Telephony Message Notification
[Figure: when the user cannot answer, the caller's message is recorded; AdvAIR segmentation and GMM model comparison match it against a model database of desired and non-desired groups, and messages from the desired group trigger notification through a Messaging API, short message system, or e-mail system]
Experiment Results
Threshold-free BIC Criterion

Test | Wave Length | Actual Turning Points | False Alarms | Missed Points | Time Used
1 | 9 seconds | 2 | 0 | 0 | 2 seconds
2 | 12 seconds | 4 | 0 | 0 | 4 seconds
3 | 25 seconds | 3 | 0 | 0 | 8 seconds
4 | 120 seconds | 8 | 1 | 0 | 134 seconds
5 | 540 seconds | 12 | 8 | 0 | 1200 seconds

Background noise affects accuracy.
Enhanced Implementation

Test | Method | Wave Length | Actual Turning Points | False Alarms | Missed Points | Time Used
1 | Old | 9 seconds | 2 | 0 | 0 | 10 seconds
1 | New | 9 seconds | 2 | 0 | 0 | 2 seconds
2 | Old | 12 seconds | 4 | 0 | 0 | 40 seconds
2 | New | 12 seconds | 4 | 0 | 0 | 4 seconds
3 | Old | 25 seconds | 3 | 1 | 0 | 1300 seconds
3 | New | 25 seconds | 3 | 2 | 0 | 8 seconds
4 | Old | 540 seconds | 18 | 7 | 2 | over 1 day
4 | New | 540 seconds | 18 | 8 | 2 | 1200 seconds

The speed-up depends on the number of change points relative to the clip length.
GMM Model Closed-Set Speaker Identification
Training stage:
– 10 speakers: 5 male, 5 female
– 20 minutes of speech for each speaker
Testing stage:
– 50 sound clips of 5 seconds duration
– 48 sound clips identified correctly, i.e. 96%
GMM Model Open-Set Speaker Identification
– Accept or reject as the result
– Same setting as the closed-set test, i.e. 10 speakers with 20 minutes each
– Correct: 45/50 = 90%
– False reject: 3/50 = 6%
– False accept: 2/50 = 4%
Problems and Limitations
Problems and Limitations
– Accuracy is affected by background noise
– Some speakers have very similar sound features
– The open-set speaker identification decision function is not very accurate when the clip duration is short
– Segmentation is still a time-consuming process
Future Work
– Speaker gender identification
– Robust open-set speaker identification
– Speech content recognition
– Music pattern matching
– A distributed system for segmentation
Q & A