Zero Resource Spoken Term Detection on STD 06 dataset Justin Chiu Carnegie Mellon University 07/24/2012, JHU.

1 Zero Resource Spoken Term Detection on STD 06 dataset Justin Chiu Carnegie Mellon University 07/24/2012, JHU

2 Motivation Given an unknown language, can we do unsupervised spoken term detection? Using a high-level representation with some structural assumptions, we can make spoken term detection more robust – Query by example – Modeling – ASR approach

3 Proposed Approach Signals -> MFCC (13-dimensional vectors) – One frame every 10 ms, each frame covering 25 ms of audio Each utterance = a sequence of MFCC frames Goal: – Cluster the MFCC frames – Represent each MFCC frame with its cluster label – Perform term detection using the SDTW algorithm
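The framing arithmetic on this slide can be made concrete with a small sketch. It assumes that only full 25 ms windows are kept and that no padding is applied at the end of the signal; the function name `frame_bounds` is hypothetical and not taken from the talk.

```python
def frame_bounds(duration_ms, window_ms=25, hop_ms=10):
    """Return (start_ms, end_ms) for every full analysis window.

    Assumption: partial windows at the end of the signal are dropped.
    """
    if duration_ms < window_ms:
        return []
    n_frames = (duration_ms - window_ms) // hop_ms + 1
    return [(i * hop_ms, i * hop_ms + window_ms) for i in range(n_frames)]


# One second of audio yields 98 overlapping frames under these assumptions.
bounds = frame_bounds(1000)
print(len(bounds), bounds[0], bounds[-1])
```

Because consecutive windows overlap by 15 ms, neighbouring MFCC frames are highly correlated, which is worth keeping in mind when they are clustered independently.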

4 Clustering K-means clustering – K-means with 10 random starts – Store every cluster center as the model Gaussian mixture model – Clustering with Gaussian mixtures – Store the means and variances as the model Number of clusters decided on development data
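A minimal sketch of K-means with several random starts, written in plain Python for illustration. It assumes frames are given as tuples of floats, initial centers are drawn from the data, and the restart with the lowest total squared error is kept; the iteration count and all function names are hypothetical choices, not details from the talk.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, n_iter=20):
    """Run one K-means pass from a random initialisation."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost


def kmeans(points, k, n_starts=10, seed=0):
    """Keep the restart with the lowest total squared error."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[1])


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, cost = kmeans(points, k=2)
print(sorted(centers), cost)
```

The returned centers play the role of the stored model: a new frame is labelled with the index of its nearest center.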

5 Representation Hard representation (Vector -> Label) – Each audio file becomes a sequence of cluster labels, e.g. 14 14 22 22 22 25 25 26 … – Similar to text retrieval Soft representation (Vector -> Vector) – Represent every MFCC frame as its vector of posterior probabilities over the Gaussian mixture components – A better vector for distance measurement
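The two representations can be sketched as follows. The sketch assumes diagonal-covariance Gaussians (the slides only say that means and variances are stored) and uses the hypothetical names `hard_label` and `posteriors`; it is an illustration, not the implementation used in the talk.

```python
import math


def hard_label(frame, centers):
    """Index of the nearest cluster center (Vector -> Label)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((x - c) ** 2 for x, c in zip(frame, centers[j])),
    )


def log_gauss(frame, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )


def posteriors(frame, weights, means, variances):
    """Posterior probability of each mixture component (Vector -> Vector)."""
    logs = [
        math.log(w) + log_gauss(frame, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


means = [(-1.0,), (1.0,)]
print(hard_label((0.9,), means))
print(posteriors((0.9,), [0.5, 0.5], means, [(1.0,), (1.0,)]))
```

A frame lying between two components gets a split posterior instead of an arbitrary hard label, which is the sense in which the soft vector is better suited to distance measurement.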

6 Segmental Dynamic Time Warping Distance measurement – Hard distance: match (0) / no match (1) – Soft distance: -log(aq) Each jump: 500 ms x-y distance limit: 500 ms [Figure: alignment grid of audio frames a1 … a9 against query frames q1 … q4]
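A simplified sketch of the search described on this slide, under several assumptions that the slides do not confirm: "aq" in the soft distance is read as the inner product of two posterior vectors, the 500 ms limits are converted to 50 frames using the 10 ms frame rate, and each candidate segment has the same length as the query. The names `banded_dtw` and `search` are hypothetical.

```python
import math


def hard_dist(a, q):
    """0 when two cluster labels match, 1 otherwise."""
    return 0.0 if a == q else 1.0


def soft_dist(a, q, eps=1e-10):
    """-log of the inner product of two posterior vectors (assumed reading)."""
    dot = sum(x * y for x, y in zip(a, q))
    return -math.log(max(dot, eps))


def banded_dtw(query, segment, dist, band):
    """Length-normalised DTW cost, restricted to cells with |i - j| <= band."""
    n, m = len(query), len(segment)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(query[i - 1], segment[j - 1]) + best
    return cost[n][m] / (n + m)


def search(query, utterance, dist, band=50, step=50):
    """Score one candidate segment per start offset, best match first."""
    last_start = max(1, len(utterance) - len(query) + 1)
    hits = []
    for start in range(0, last_start, step):
        segment = utterance[start:start + len(query)]
        hits.append((banded_dtw(query, segment, dist, band), start))
    return sorted(hits)


labels = [3, 3, 14, 22, 25, 7]
print(search([14, 22, 25], labels, hard_dist, band=1, step=1)[0])
```

A detection threshold on the normalised cost would then decide which hits are reported; how to set that threshold is one of the open questions raised later in the talk.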

7 NIST STD 06 Dataset One of the datasets used to evaluate spoken term detection performance Advantage – Widely used because of the 2006 STD Evaluation Workshop, so results are easy to compare with others Disadvantage – Only text queries are provided; there are no spoken queries

8 Choosing the Dataset The 2006 STD dataset covers 3 languages – Each language (English, Mandarin, Arabic) has different subsets – We select the English CTS (Conversational Telephone Speech) subset Reason: it has the most reported results Spoken query generation – Synthesized speech queries: Flite – Extracted speech queries: extracted from the dev set

9 Evaluation Measurement ATWV (Actual Term-Weighted Value) Term-Weighted Value (TWV) is one minus the average value lost by the system per term: TWV = 1 - Avg(P_miss + w * P_FA) Reference ATWV numbers (supervised): – English: 0.85 – Mandarin: 0.38 – Arabic: 0.34
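The formula above can be sketched in a few lines. Two details are assumptions not stated on the slide: the weight w defaults to 999.9, and the false-alarm probability is normalised by the seconds of speech minus the number of true occurrences, as recalled from the NIST 2006 evaluation plan. The function name `term_weighted_value` and its input layout are hypothetical.

```python
def term_weighted_value(term_stats, speech_seconds, weight=999.9):
    """1 - Avg(P_miss + w * P_FA) over terms with at least one true occurrence.

    term_stats: iterable of (n_true, n_correct, n_false_alarm) per term.
    weight and the P_FA normalisation are assumptions, see the text above.
    """
    losses = []
    for n_true, n_correct, n_false_alarm in term_stats:
        if n_true == 0:
            continue  # P_miss is undefined for terms that never occur
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + weight * p_fa)
    if not losses:
        raise ValueError("no term has a reference occurrence")
    return 1.0 - sum(losses) / len(losses)


# A system that finds every occurrence but also raises 50 false alarms.
print(term_weighted_value([(5, 5, 50)], speech_seconds=3600))
```

Because false alarms are weighted so heavily, a system that returns nothing scores 0 while one that over-detects can score far below 0, which is how negative ATWV values arise.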

10 Query Comparison Primary experiments on the development set Synthesized queries – 1100 queries, ATWV: << 0 Extracted queries – 411 extracted / combined queries, ATWV: -0.93 – 135 longer queries (length > 1), ATWV: 0.185

11 Evaluation Set Result Overrun by a tide of false alarms

12 Further Struggle Remove the first MFCC dimension – It represents the power of the speech and takes large values Inverted frequency – A frame label that appears too often may be less important (e.g. background noise) Content-related bonus – Consecutive identical tags earn a bonus
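Two of these ideas can be sketched briefly. The weighting below is one possible reading of "inverted frequency", modelled on inverse document frequency from text retrieval; the exact scheme used in the talk is not given, and both function names are hypothetical.

```python
import math
from collections import Counter


def drop_energy(frame):
    """Discard the first MFCC coefficient, which mostly reflects signal power."""
    return frame[1:]


def inverse_frequency_weights(label_sequences):
    """Weight each cluster label by log(total frames / frames with that label).

    Assumption: an IDF-style weight; frequent labels (e.g. background noise)
    end up with weights close to zero.
    """
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    return {label: math.log(total / n) for label, n in counts.items()}


weights = inverse_frequency_weights([[14, 14, 14, 22], [14, 14, 25, 14]])
print(weights)
```

Such weights could scale the per-frame match cost so that agreeing on a very common label counts for less than agreeing on a rare one.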


14 What We Have Learned A single MFCC frame is too short a unit for representing speech Mismatch in the speech signal matters a lot – Synthesized speech vs. extracted speech Lots of false alarms occur for short queries – "at" vs. "hat" vs. "bat"

15 Threshold How similar must two segments be before we decide they are the same word (detected or not)? How many abstract representation units should we use to represent an unknown language? – Possibly this can be handled with regularization

16 Representation We need to find a better representation (other than MFCC frames) for the clustering – Phones work, so an appropriate representation should work too; it is expected to come from a data-driven approach Advanced approaches for representation – Lee, Glass – Jansen, Church – SSS + clustering

17 Spoken Term Detection Experiments Dataset – NIST Spoken Term Detection 2006 Evaluation set – Advantage: the dataset is designed for the STD task Evaluation metric – ATWV – Advantage: the evaluation tool is available, and results can be compared with many supervised baselines

18 Summary Clustering on MFCC frames is an inappropriate representation for speech A better representation of speech units is needed Channel/speaker mismatch harms performance a lot The extracted spoken queries and audio for the English CTS data are available.

19 Personal Belief in Zero Resource STD Speaker Dependent Speaker Independent

20 Special Thanks Alex Rudnicky Florian Metze Alan Black Rita Singh Jack Mostow
