1
Audio Indexing as a First Step in an Audio Information Retrieval System
Jean-Pierre Martens, An Vandecatseye, Frederik Stouten
ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent
CAIR, Twente (10/10/2003)
2
Information retrieval from audio: general scheme
audio signal → audio indexing → time stamps + audio labels → speech transcription → time stamps + audio labels + text (summary) → information querying → topic labels + info
Audio indexing is the subject of this talk; speech transcription and information querying are covered in the talks of Steve & Roeland.
3
Why audio indexing?
– Extract extra-linguistic information: commercial, intro, football report, etc.
– Save time: let the speech recognizer process only the parts that are expected to contain speech
– Raise speech transcription accuracy: allow the speech recognizer to select the right models at the right time
4
Audio indexing in the ATRANOS project
Project name: ATRANOS
Main project objectives:
– automatic segmentation/labeling of audio files
– automatic transcription of the speech parts
– conversion (normalization) of the transcriptions for an application (captioning is the test vehicle in this project)
Partners: ESAT+CCL/KULeuven, ELIS/UGent, CNTS/UIA
Status: entering its final year
6
Audio indexing in the ATRANOS project
– Mark the parts that need no transcription: speech / non-speech segmentation
– Detect important change points in the speech: a change of speaker or of acoustics (bandwidth, background); the segment between two change points is a speaker turn
– Assign a speaker label to each turn: all frames of one speaker get the same label
– Assign a speech mode to each turn: prepared versus spontaneous speech
7
Audio indexing in ATRANOS
Additional design goals:
– aim for continuous input processing (stream-based)
– restrict the computational load (real-time on a PC)
– restrict the maximum delay (memory)
– aim for language independence
Evaluation data:
– American Broadcast News database (LDC)
– pan-European Broadcast News database (COST278)
– Spoken Dutch Corpus (CGN)
8
1. Speech / non-speech segmentation
Approach:
– construct statistical models (GMMs) for typical situations
– let these models score the individual audio frames
– group the frames on the basis of these scores
Which models to build?
– one clean-speech model
– some common background models (e.g. music)
– corresponding speech + common-background models
– a garbage model for all the rest
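A minimal sketch of the frame-scoring step, assuming diagonal-covariance GMMs have already been trained for each class in the list above; all function and variable names here are illustrative, not from the project:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood of feature frames (T, D) under a
    diagonal-covariance GMM with K components."""
    per_component = np.stack([
        np.log(w) + multivariate_normal.logpdf(frames, mean=m, cov=np.diag(v))
        for w, m, v in zip(weights, means, variances)
    ])                                                  # (K, T)
    return np.logaddexp.reduce(per_component, axis=0)   # (T,)
```

Stacking the scores of all class GMMs gives a (classes × frames) matrix that the grouping step on the next slide operates on.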
9
1. Speech / non-speech segmentation
How to group the frames?
– put the models in a loop model (with a transition penalty Pt)
– compute the best state sequence (on-line Viterbi algorithm with forced decisions)
– perform some post-processing on the output sequence
[diagram: loop model with begin state B, model states 1–4, end state E, and transition penalty Pt on the loop transitions]
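A sketch of the grouping step over the per-class scores from the previous sketch. For clarity this version is the offline Viterbi with full backtracking; the slide's on-line variant instead forces a decision once a bounded delay is reached. The transition-penalty formulation is an assumption consistent with the loop model on the slide:

```python
import numpy as np

def viterbi_loop(frame_scores, penalty):
    """Best class sequence for frames scored by competing class models.

    frame_scores: (C, T) array, log-likelihood of each frame under each
                  of the C class GMMs.
    penalty:      positive log-domain cost for switching class, i.e. the
                  loop-model transition penalty Pt.
    """
    C, T = frame_scores.shape
    delta = frame_scores[:, 0].copy()           # best score ending in class i
    back = np.zeros((C, T), dtype=int)
    switch_cost = penalty * (1.0 - np.eye(C))   # zero cost on the diagonal
    for t in range(1, T):
        candidates = delta[None, :] - switch_cost       # (C, C): from j to i
        back[:, t] = np.argmax(candidates, axis=1)
        delta = candidates[np.arange(C), back[:, t]] + frame_scores[:, t]
    path = np.empty(T, dtype=int)               # backtrack the best path
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path
```

Post-processing (e.g. removing implausibly short segments) then smooths the resulting label sequence.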
10
1. Speech / non-speech segmentation
Evaluation results (7 data sets):
– training and parameter setting on American BN
– performance degrades for unseen situations (e.g. football reports)
11
2. Speaker segmentation
Objective: detect changes in speaker or acoustics
Approach:
– identify change points by comparing properties of the observations in two intervals, one on each side of the candidate point
– advantage: self-organizing (no speaker models required)
12
2. Speaker segmentation
Step 1: potential change position detection
– select candidate positions on a grid (to limit CPU time)
– determine a fixed-length left and right context around each candidate position n, built from blocks of 10 frames
– build 3 models for the data: M(both), M(left), M(right)
– retain the significant maxima of LLR(n), the log-likelihood ratio of the two-model versus the one-model hypothesis
[diagram: left and right context windows of 10-frame blocks around candidate position n]
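A sketch of the two-versus-one model test, assuming each window is modeled by a single full-covariance Gaussian (a common choice for this test; the slide does not name the model family):

```python
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of frames X (T, D) under the maximum-likelihood
    full-covariance Gaussian fitted to X itself."""
    T, D = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(D)
    _, logdet = np.linalg.slogdet(cov)
    # at the ML estimate the summed Mahalanobis terms equal T * D
    return -0.5 * T * (D * np.log(2.0 * np.pi) + logdet + D)

def llr_two_vs_one(left, right):
    """LLR(n): model left and right separately (M(left), M(right))
    versus pooled (M(both))."""
    both = np.vstack([left, right])
    return gauss_loglik(left) + gauss_loglik(right) - gauss_loglik(both)
```

A candidate position n is retained when LLR(n) is a significant local maximum over the grid.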
13
2. Speaker segmentation
Step 2: boundary elimination
– pool all boundaries within a speech part (over a window of at most Tmax, or until the end of the speech part)
– evaluate the variable-length context around each n using BIC:
  ΔBIC(n) = LL(M2) − LL(M1) − λ [#par(M2) − #par(M1)] log N(n)
  where M1 models the whole context with one model, M2 splits it at n, and N(n) is the number of frames in the context
– select the n with minimal ΔBIC(n):
  if ΔBIC(n) < 0: eliminate n and reiterate
  if ΔBIC(n) ≥ 0: move on to the next speech part
[diagram: a speech part bounded by non-speech (NS), with window Tmax and left/right contexts around n]
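A sketch of the ΔBIC test, reusing gauss_loglik from the previous sketch; the value of λ and the single-Gaussian model family are assumptions:

```python
import numpy as np

def delta_bic(context, n_split, lam=1.0):
    """Delta-BIC for splitting `context` (T, D) at frame index n_split.

    M2 = two Gaussians (left and right of the split); M1 = one Gaussian.
    Negative values mean the split is not justified and the boundary
    should be eliminated.
    """
    T, D = context.shape
    n_par = D + D * (D + 1) // 2          # parameters of one full-cov Gaussian
    ll_m1 = gauss_loglik(context)
    ll_m2 = gauss_loglik(context[:n_split]) + gauss_loglik(context[n_split:])
    # #par(M2) - #par(M1) equals the parameter count of one extra Gaussian
    return ll_m2 - ll_m1 - lam * n_par * np.log(T)
```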
14
2. Speaker segmentation
Evaluation (7 data sets):
– recall: how many of the real changes are detected?
– precision: how many of the detected changes are real?
[results table: 5 out of 7 data sets]
15
3. Speaker labeling
Objective: assign the same label to all turns of the same speaker
Approach:
– on-line clustering, fully integrated in the segmentation
– BIC as the decision criterion
Clustering strategy:
– for all turns in a speech part: compute the ΔBIC between the turn and the 'closest' cluster center
– select the turn with the maximal ΔBIC:
  if ΔBIC > 0: take that turn as a new cluster
  else: take the turn with the smallest ΔBIC and add it to its closest cluster
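A sketch of the clustering loop, with single-Gaussian cluster models standing in for the (unspecified) models on the slide and gauss_loglik reused from the speaker-segmentation sketch; 'closest cluster' is interpreted here as the one with the lowest ΔBIC:

```python
import numpy as np

def delta_bic_merge(turn, cluster, lam=1.0):
    """Delta-BIC of modeling `turn` and `cluster` separately versus
    jointly; positive values mean the turn does not fit the cluster."""
    T = turn.shape[0] + cluster.shape[0]
    D = turn.shape[1]
    n_par = D + D * (D + 1) // 2
    ll_sep = gauss_loglik(turn) + gauss_loglik(cluster)
    ll_joint = gauss_loglik(np.vstack([turn, cluster]))
    return ll_sep - ll_joint - lam * n_par * np.log(T)

def label_turns(turns):
    """Greedy clustering of speaker turns following the slide's strategy."""
    clusters, labels = [], [None] * len(turns)
    pending = list(range(len(turns)))
    while pending:
        # Delta-BIC of every pending turn against its closest cluster
        best = {}
        for i in pending:
            if not clusters:
                best[i] = (np.inf, -1)      # the first turn founds cluster 0
            else:
                vals = [delta_bic_merge(turns[i], c) for c in clusters]
                j = int(np.argmin(vals))
                best[i] = (vals[j], j)
        i_new = max(pending, key=lambda i: best[i][0])
        if best[i_new][0] > 0:              # clearly a new speaker
            labels[i_new] = len(clusters)
            clusters.append(turns[i_new])
            pending.remove(i_new)
        else:                               # merge the best-fitting turn
            i_old = min(pending, key=lambda i: best[i][0])
            j = best[i_old][1]
            labels[i_old] = j
            clusters[j] = np.vstack([clusters[j], turns[i_old]])
            pending.remove(i_old)
    return labels
```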
16
3. Speaker labeling
Evaluation methodology
– step 1: assign an official speaker label to each cluster (per frame in a turn, select the label of the dominant speaker in that turn)
– step 2: cluster purity = % of frames with the correct label
– step 3: ideal cluster purity = the purity of the ideal clustering (per speaker: one cluster carrying that speaker's label)
[diagram: official versus computed label sequences, e.g. ABABA versus ABAAB, with the error zones marked]
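A sketch of the purity computation over frame-level label arrays (variable names are illustrative):

```python
import numpy as np
from collections import Counter

def cluster_purity(official, computed):
    """Percentage of frames whose cluster's dominant official label
    matches their own official label."""
    official = np.asarray(official)
    computed = np.asarray(computed)
    correct = 0
    for c in np.unique(computed):
        members = official[computed == c].tolist()
        correct += Counter(members).most_common(1)[0][1]
    return 100.0 * correct / len(official)
```

The ideal purity follows from running the same function on a clustering with exactly one cluster per official speaker.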
17
3. Speaker labeling
Evaluation results (7 data sets):
– training and parameter setting on American BN
– still room for improvement (the number of clusters also exceeds the ideal number)
18
Demonstration
19
4. Speech mode labeling
Objective:
– label each turn as spontaneous versus prepared speech
– how: via the presence of disfluencies (detected prior to recognition)
Disfluencies:
– filled pauses (uh's, abnormally lengthened sounds)
– repetitions of words or word groups
– abbreviations of words
At present:
– no speech mode labeling results yet
– therefore, the remainder of this talk focuses on the underlying disfluency detection
21
4. Disfluency detection
Feature identification:
– CGN (Spoken Dutch Corpus): conversational speech
– bootstrap data set (11 h) with 3255 annotated uh's; manual word alignments available (giving the locations of the uh's)
Approach:
– segment the signal into phoneme-sized parts on the basis of a cepstral difference measure
– identify features revealing the FP/NFP (filled pause / non-filled-pause) nature of these parts
– supply these features to a statistical classifier
– keep everything stream-based (to fit with the rest of the system)
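A sketch of the phoneme-sized segmentation driven by a cepstral difference measure, assuming MFCC frames as input; the exact distance measure and thresholding are not specified on the slide:

```python
import numpy as np

def cepstral_segments(mfcc, threshold=1.0, min_len=3):
    """Cut where the cepstral difference between successive frames has a
    local maximum above `threshold`; returns (start, end) frame indices."""
    diff = np.linalg.norm(np.diff(mfcc, axis=0), axis=1)       # (T-1,)
    peaks = [t + 1 for t in range(1, len(diff) - 1)
             if diff[t] > threshold
             and diff[t] >= diff[t - 1] and diff[t] >= diff[t + 1]]
    segs, start = [], 0
    for b in peaks + [len(mfcc)]:
        if b - start >= min_len:        # too-short chunks are merged
            segs.append((start, b))     # into the following segment
            start = b
    if start < len(mfcc):               # attach any leftover tail
        if segs:
            segs[-1] = (segs[-1][0], len(mfcc))
        else:
            segs.append((0, len(mfcc)))
    return segs
```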
22
4. Disfluency detection
Feature detection on the bootstrap data:
– segment duration
– spectral stability
– stable interval durations
– silence present
– center of gravity
– output of a simple spectral FP model (a GMM on 12 MFCCs)
Result: 12 useful features identified in total
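The slides do not define these features precisely; as an illustration, here are plausible stand-ins for two of them, computed per segment:

```python
import numpy as np

def spectral_stability(mfcc_seg):
    """Mean frame-to-frame cepstral distance within a segment; filled
    pauses tend to have an unusually stable (low) value."""
    return float(np.mean(np.linalg.norm(np.diff(mfcc_seg, axis=0), axis=1)))

def center_of_gravity(power_spectrum, freqs):
    """Spectral centroid (Hz) of the segment's averaged power spectrum."""
    weights = power_spectrum / power_spectrum.sum()
    return float(np.dot(freqs, weights))
```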
28
4. Disfluency detection
Statistical classifier:
– an MLP to estimate P(FP|x) (x = 12 features + 12 MFCCs)
– problem: very low prior P(FP) (on the order of 1%)
– therefore: design a filter that first eliminates the most certain NFP segments
GMM-based filter:
– two GMMs, P(x|FP) and P(x|NFP) (x = 12 features)
– prior probability P(FP) = 0.01 yields the posterior P(FP|x)
– retain a segment if P(FP|x) > threshold
– results: 90% of the NFP segments removed at the cost of < 10% of the FPs, raising P(FP) from 1% to 10%
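A sketch of the filter's posterior computation; gmm_fp and gmm_nfp are hypothetical names for the two trained class-conditional GMMs (scikit-learn's GaussianMixture exposes per-sample log-likelihoods via score_samples):

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

def fp_posterior(x, gmm_fp, gmm_nfp, p_fp=0.01):
    """P(FP | x) via Bayes' rule from the class-conditional GMMs."""
    log_fp = gmm_fp.score_samples(x) + np.log(p_fp)         # log P(x|FP)P(FP)
    log_nfp = gmm_nfp.score_samples(x) + np.log(1 - p_fp)   # log P(x|NFP)P(NFP)
    return expit(log_fp - log_nfp)

# keep only the segments the filter cannot confidently reject:
# candidates = features[fp_posterior(features, gmm_fp, gmm_nfp) > threshold]
```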
29
4. Disfluency detection
Evaluation on an independent test set:
– size: 47 min, containing 415 FPs
– available information:
  all uh's (including word-internal ones) were annotated
  all abnormal sound lengthenings were annotated
  all corresponding time intervals were manually checked
30
4. Disfluency detection
Evaluation on the test data:
– recall–precision (ROC) curves
– embedded training (on 15 h of unlabeled data) does not help
– our method: R = 75%, P = 85%; the Gabrea method: R = 60%, P = 65%
31
4. Disfluency detection and ASR
Baseline system:
– 40K lexicon + uh (FP), trigram LM
– WER = 51.3% (spontaneous CGN dialogues, uh's excluded from scoring)
Cheating experiment:
– remove the manually labeled FP segments from the input stream (equivalent to recognizing the FPs correctly and ignoring them in the LM context)
– WER = 47.6% (7.5% relative gain, 1.25 word corrections per FP)
First real experiment:
– remove the detected FP segments from the input
– WER = 49.4% (3.7% relative gain, 0.62 word corrections per FP)
32
Conclusions
Good audio indexing techniques exist for:
– speech / non-speech segmentation
– speaker turn segmentation
– speaker identity labeling
– filled-pause detection
These techniques can be used:
– to extract extra-linguistic information for audio information retrieval
– to guide the speech transcription module