
1 Exploiting lexical information for Meeting Structuring
Alfred Dielmann, Steve Renals (University of Edinburgh)
{ a.dielmann@ed.ac.uk, s.renals@ed.ac.uk }

2 Meeting Structuring (1)
Goal: recognise events that involve one or more communicative modalities: Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
Working environment, the "IDIAP framework": 30+23 five-minute meetings with 4 participants
4 audio-derived features:
– Speaker turns (derived from microphone-array localisation)
– Prosodic features: RMS energy, F0, rate of speech (a sketch of the simplest of these follows)
But we'd like to integrate other features…
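Of the audio-derived features above, RMS energy is the easiest to reproduce; here is a minimal Python sketch of frame-level extraction. The 16 kHz sample rate, 25 ms frame and 10 ms hop are assumptions, and F0 / rate-of-speech would come from dedicated tools rather than this snippet:

```python
import numpy as np

def frame_rms_energy(samples, frame_len=400, hop=160):
    """Frame-level RMS energy: 25 ms frames every 10 ms at 16 kHz."""
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.array([
        np.sqrt(np.mean(samples[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])

audio = np.random.randn(16000)    # 1 s of stand-in audio at 16 kHz
energy = frame_rms_energy(audio)  # one value per 10 ms frame
```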

3 Meeting Structuring (2)
We're working with models based on Dynamic Bayesian Networks, and at a previous meeting we proposed two models.
[Figure: DBN in which each action A_t is decomposed into a chain of hidden sub-actions S_t with observations Y_t, plus the counter variables C_t / E_t]
The first one is characterised by:
– Decomposition of actions {A_t} into "sub-actions" {S_t}
– Early feature integration
– A counter structure (reduces the number of insertions)

               Corr   Sub   Del   Ins   AER
W.o. counter   93.2   2.3   4.5   4.6   11.4
With counter   89.4   5.3   5.3   0.8   11.4

4 Meeting Structuring (3)
The second model extends the previous one through multi-stream processing (avoiding early integration):
[Figure: multi-stream DBN in which each action A_t drives parallel sub-action chains {S_t^n}, one per feature stream Y_t^n, plus the counter variables C_t / E_t]
– Different feature groups are processed independently
– Parallel independent HMM chains are each responsible for only one part of the feature set
– The cardinalities of {S_t^n} are part of the model
– The hidden sub-states {S_t^n} are a result of the training process

               Corr   Sub   Del   Ins   AER
W.o. counter   90.9   2.3   6.8   2.3   11.4
With counter   94.7   1.5   3.8   0.8    6.1
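The AER column in both tables is consistent with the usual error-rate decomposition; assuming AER is the sum of the substitution, deletion and insertion rates (as for word error rate), a worked check on the last row:

```latex
\mathrm{AER} = \mathrm{Sub} + \mathrm{Del} + \mathrm{Ins} = 1.5 + 3.8 + 0.8 = 6.1\,\%
```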

5 Further developments
The adopted feature set has to be extended!
– Integration of gestures / body movements (VIDEO)
– Integration of lexical information (ASR)
The problem presents some analogies with "Topic Detection and Tracking". Work in progress!
– Correlate the sequence of transcribed words with the sequence of "meeting actions"
– Discover homogeneous partitions of a transcription, according to the communicative phase of the meeting (Dialogue, Monologue, …)

6 TDT approaches
Lexical cohesion based, like "TextTiling":
– Given two adjacent windows, a lexical cohesion function is evaluated at the transition between the two windows, in order to find topically coherent passages and highlight topic-boundary candidates (see the sketch after this list)
Feature based, reducing topic segmentation to a statistical classification problem over:
– Lexical features: perplexity, mutual information, other information-content measures
– Cue phrases (words frequent at topic changes)
– Short/long-range language models: n-gram, binomial distribution
– Prosodic features: pauses, F0, cross-talk, speaker changes, …
Mixed approaches (lexical cohesion + feature classification)
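A minimal sketch of the lexical-cohesion idea behind TextTiling: cosine similarity between the word-count vectors of the two windows around a candidate boundary. The window size, tokenisation and toy text are assumptions; the real algorithm also smooths the score and picks boundaries at depth minima:

```python
import re
from collections import Counter
from math import sqrt

def cohesion(words, boundary, window=20):
    """Cosine similarity between the word-count vectors of the two
    windows adjacent to a candidate boundary; low values suggest
    a topic boundary."""
    left = Counter(words[max(0, boundary - window):boundary])
    right = Counter(words[boundary:boundary + window])
    dot = sum(left[w] * right[w] for w in left)
    norm = sqrt(sum(c * c for c in left.values())) * \
           sqrt(sum(c * c for c in right.values()))
    return dot / norm if norm else 0.0

text = "we agree on the budget ... now let me present the slides ..."
words = re.findall(r"[a-z']+", text.lower())
scores = [cohesion(words, b) for b in range(1, len(words))]
# Local minima of `scores` are topic-boundary candidates.
```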

7 Feature based approach
A lexical-feature-based approach could be:
– Easily transferred from the TDT problem to the meeting segmentation one
– Quickly integrated with the proposed DBN models
Mutual Information: "the amount of information that one random variable X contains about another variable Y"
– Investigate a lexical function that discriminates between different communicative phases
– Look for a list of cue phrases that highlight "Meeting Action" boundaries
– …
These could be interesting starting points for further experiments!
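For reference, the standard definition behind that quote is:

```latex
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)}
```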

8 Problems (1)
We must cope with speech recognition errors!
– Challenging conditions: spoken, multiparty dialogue, in an unrestricted domain
Insufficient training/testing data:
– Especially for Agreement/Disagreement (4% of the corpus)
– Only 30 meetings are fully transcribed (~25k words)
These first experiments are based on hand-labelled transcriptions, and attempt to discriminate only between Monologue and Dialogue

9 Lexical classification (1)
Each word of the testing corpus is compared with every "Meeting Action" lexical model, and then classified by maximising the mutual information (a sketch of one plausible reading of this classifier follows)
[Figure: ASR transcript → { Monologue model, Dialogue model, … model } → MAX → output filter]
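The slide does not spell out the scoring, so the sketch below reads "maximising the mutual information" as picking the action whose pointwise mutual information with the word is highest; the function names and toy training pairs are hypothetical:

```python
from collections import Counter
from math import log

def train_pmi(labelled_words):
    """labelled_words: (word, action) pairs from the training transcripts.
    Returns pointwise mutual information scores for each seen pair."""
    joint = Counter(labelled_words)
    words = Counter(w for w, _ in labelled_words)
    actions = Counter(a for _, a in labelled_words)
    n = len(labelled_words)
    return {
        (w, a): log((joint[(w, a)] / n) / ((words[w] / n) * (actions[a] / n)))
        for (w, a) in joint
    }

def classify_word(word, pmi, actions=("Monologue", "Dialogue")):
    """Assign the word to the action with the highest PMI score;
    unseen (word, action) pairs default to a very low score."""
    return max(actions, key=lambda a: pmi.get((word, a), float("-inf")))

train = [("ok", "Dialogue"), ("yeah", "Dialogue"), ("slide", "Monologue"),
         ("next", "Monologue"), ("yeah", "Dialogue"), ("slide", "Monologue")]
pmi = train_pmi(train)
print(classify_word("slide", pmi))   # -> Monologue
```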

10 Lexical classification (2)
Each recognised word is classified as a "Monologue word" or a "Dialogue word". This stream of symbols is then filtered (de-noised):
– Considering a moving window
– Estimating the temporal density of each class (Monologue, Dialogue)
– The class with the higher symbol density (frequency) within the window wins
We expect that during a "Dialogue act", the temporal density of "Monologue words" is lower than that of "Dialogue words" (see the filter sketch below)
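A minimal sketch of this de-noising step, assuming a simple sliding majority vote stands in for the density comparison; the window length is a free parameter:

```python
from collections import Counter

def density_filter(labels, window=11):
    """Smooth a per-word label stream with a sliding majority vote:
    each position takes the most frequent class inside its window."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        neighbourhood = labels[max(0, i - half):i + half + 1]
        out.append(Counter(neighbourhood).most_common(1)[0][0])
    return out

raw = ["D", "D", "M", "D", "D", "M", "M", "M", "D", "M", "M"]
print(density_filter(raw, window=5))
# Isolated minority labels are voted away, leaving two clean runs.
```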

11 Initial results
We evaluated(*) the proposed system using two different classification criteria (correct classification percentage):

Tri-gram probability   63.8 %
Shannon M.I.           66.8 %

The two criteria achieve very close results, but mutual information seems to be more effective than the simple 3-gram language model.
(*) Using 13 meetings to build the Monologue and Dialogue lexical models, and the remaining 17 to evaluate performance

12 Integration (1)
The next step is to integrate these results into the previously adopted framework. We therefore assume that:
– The lexical classifier output can be seen as a new independent feature (analysing a further communicative modality), and combined with speaker turns and prosodic features
– The developed models can easily be adapted (and eventually re-engineered) to support the newly introduced feature, thanks to the flexibility of Dynamic Bayesian Networks

13 Problems (2)
The new lexical feature lives on a new, different time-scale:
– Different from both the speaker-turn and prosodic-feature time-scales
– And different from the time-scale of the events we'd like to recognise
Meeting events usually appear across different modalities (turn-taking, prosody, the lexical environment) without precise synchrony, speech and gestures for example. Some features are asynchronous because they are calculated per participant; others derive from the interaction of different participants (speaker turns).
The model must provide at least a minimum degree of asynchrony (one common workaround is sketched below), and a true multi-time-scale model would probably be more compact and more efficient!?
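One common workaround, sketched here under the assumption that each word carries start/end times from the ASR output, is to resample the word-level labels onto the frame clock of the prosodic features; the 10 ms frame step and the toy spans are assumptions:

```python
def words_to_frames(word_spans, n_frames, frame_step=0.010, default="silence"):
    """Map word-level labels with (start, end) times in seconds onto a
    fixed frame clock, so the lexical and prosodic streams share one
    time-scale. word_spans: list of (start, end, label) tuples."""
    frames = [default] * n_frames
    for start, end, label in word_spans:
        first = int(start / frame_step)
        last = min(n_frames, int(end / frame_step) + 1)
        for i in range(first, last):
            frames[i] = label
    return frames

spans = [(0.00, 0.30, "M"), (0.42, 0.55, "D")]   # hypothetical ASR word times
print(words_to_frames(spans, n_frames=60)[:12])  # frames between words stay "silence"
```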

14 Integration (2)
Soon (hopefully!) the new feature will be integrated into the multi-stream model:
– The model will be adapted in order to support multiple time-scales, at least at the feature level
– The lack of synchronism between features will be investigated, to verify to what extent the proposed models are able to manage it
[Figure: extended multi-stream DBN with a third, lexical feature stream Y_t^3 alongside the existing streams]

15 Summary
Open problems:
– Choose the best way to process lexical data
– Integration into the existing framework
– More data (transcriptions) are needed
– Multi-modal = multi-time-scale
– Synchronism
Suggestions?

