1
Detecting Action Items in Multi-Party Meetings: Annotation and Initial Experiments Matthew Purver, Patrick Ehlen, John Niekrasz Computational Semantics Laboratory Center for the Study of Language and Information Stanford University
2
The CALO Project Multi-institution, multi-disciplinary project Working towards an intelligent personal assistant that learns Three major areas – managing personal data clustering email, documents, managing contacts – assisting with task execution learning to carry out computer-based tasks – observing interaction in meetings
3
The CALO Meeting Assistant Observe human-human meetings – Audio recording & speech recognition (ICSI/CMU) – Video recording & processing (MIT/CMU) – Written notes, via digital ink (NIS) or typed (CMU) – Whiteboard sketch recognition (NIS) Produce a useful record of the interaction – answer questions about what happened – can be used by attendees or non-attendees Learn to do this better over time (LITW)
4
The CALO Meeting Assistant Primary focus on the end user Develop something that can really help people deal with all of the meetings they have to attend
5
What do people want to know from meetings?
6
Banerjee et al. (2005) survey of 12 academics: – Missed meeting - what do you want to know? – Topics: which were discussed, what was said? – Decisions: what decisions were made? – Action items/tasks: was I assigned something?
7
What do people want to know from meetings? Banerjee et al. (2005) survey of 12 academics: – Missed meeting - what do you want to know? – Topics: which were discussed, what was said? – Decisions: what decisions were made? – Action items/tasks: was I assigned something? Lisowska et al. (2004) survey of 28 people: – What would you ask a meeting reporter system? – Similar responses about topics, decisions – who attended, who asked/decided what? – Did they talk about me?
8
Purpose A helpful system not only records and transcribes a meeting, but also extracts (from streams of potentially messy human-human speech): – topics discussed – decisions made – tasks assigned (“action items”) The system should highlight this information over meeting “noise”
9
Example Impromptu meeting you might have after your team has boarded a rebel spacecraft in search of stolen plans, and you’re trying to figure out what to do next
10
Commander, tear this ship apart until you’ve found those plans!
12
A section of discourse in a meeting where someone is made responsible for taking care of something
14
Action Items Concrete decisions; public commitments to be responsible for a particular task Want to know: – Can we find them? – Can we produce useful descriptions of them? Not aware of previous discourse-based work
15
Action Item Detection in Email Corston-Oliver et al., 2004 Marked a corpus of email with “dialogue acts” Task act: – “items appropriate to add to an ongoing to-do list” Good inter-annotator agreement (kappa > 0.8) Per-sentence classification using SVMs – lexical features e.g. n-grams; punctuation; message features – f-scores around 0.6
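As an illustration of this style of approach (not the authors' code), here is a minimal per-sentence classification sketch, assuming scikit-learn: an SVM over word n-gram features, with invented sentences and labels.

# Minimal sketch: binary SVM over word n-grams for per-sentence
# "task act" detection, in the spirit of Corston-Oliver et al. (2004).
# Sentences and labels are invented; scikit-learn is assumed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

sentences = [
    "please add the budget review to your to-do list",
    "can you send me the revised figures by friday",
    "the weather was terrible at the conference",
    "i think the results look reasonable",
]
labels = [1, 1, 0, 0]  # 1 = task / action-item sentence, 0 = other

# Word unigrams and bigrams as lexical features; the punctuation and
# message-level features mentioned in the original paper are omitted.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)

print(clf.predict(["remember to add this to the to-do list"]))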
16
A First Try: Flat Annotation Gruenstein et al. (2005) annotated 65 meetings from: – ICSI Meeting Corpus (Janin et al., 2003) – ISL Meeting Corpus (Burger et al., 2002) Two human annotators asked to “mark utterances relating to action items” – create groups of utterances for each AI – no distinction made between utterance type/role
17
A First Try: Flat Annotation (cont’d) Annotators identified 921 / 1267 (respectively) action item-related utterances Human agreement poor (κ < 0.4) Tried binary classification using SVMs (like Corston-Oliver) Precision, recall, f-score: all below 0.25
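The agreement figure above is Cohen's kappa; a minimal sketch of computing it, assuming scikit-learn and made-up per-utterance labels from two annotators:

# Cohen's kappa over two annotators' binary "action-item related" labels.
# The label sequences are invented for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
annotator_b = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]

print(cohen_kappa_score(annotator_a, annotator_b))  # kappa in [-1, 1]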
18
Try a more restricted dataset? Sequence of 5 (related) CALO meetings – similar amount of ICSI/ISL data for training Same annotation schema SVMs with words & n-grams as features – Also tried other discriminative classifiers, and 2- & 3-grams, with no improvements Similar performance – Improved f-scores (0.30 - 0.38), but still poor – Recall up to 0.67, precision still low (< 0.36)
19
Should we be surprised? Our human annotator agreement was poor The DAMSL schema has dialogue acts Commit and Action-directive – annotator agreement poor (κ ~ 0.15) – (Core & Allen, 1997) The ICSI MRDA schema has a dialogue act commit – Most DA tagging work concentrates on 5 broad DA classes Perhaps “action items” comprise a more heterogeneous set of utterances
20
Rethinking Action Item Acts Maybe action items are not aptly described as singular “dialogue acts” Rather: multiple people making multiple contributions of several types Action item-related utterances represent a form of group action, or social action That social action has several components, giving rise to a heterogeneous set of utterances What are those components?
21
Commander, tear this ship apart until you’ve found those plans! A person commits or is committed to “own” the action item
22
Commander, tear this ship apart until you’ve found those plans! A person commits or is committed to “own” the action item A description of the task itself is given
23
Commander, tear this ship apart until you’ve found those plans! A person commits or is committed to “own” the action item A description of the task itself is given A timeframe is specified
24
Yes, Lord Vader! A person commits or is committed to “own” the action item A description of the task itself is given A timeframe is specified Some form of agreement
25
Exploiting discourse structure Action items have distinctive properties – Task description, owner, timeframe, agreement Action item utterances can simultaneously play different roles – assigning properties – agreeing/committing These classes may be more homogeneous & distinct than a single “action item” utterance class – Could improve classification performance
26
New annotation schema Annotated and classified again using the new schema Classify utterances by their role in the action item discourse – can play more than one role Define action items by grouping subclass utterances together in an action-item discussion – a subclass can be missing
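A minimal sketch of this idea (not the paper's system): one binary classifier per subclass, so a single utterance can receive several role labels; scikit-learn is assumed and the utterances and labels are invented.

# One-vs-rest classification over the four subclasses (description,
# owner, timeframe, agreement); an utterance may carry several roles.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

utterances = [
    "jack could you pull together the printer details",
    "i'd like you to come back to me on that",
    "say by the start of week three",
    "okay sure",
]
roles = [
    {"description", "owner"},  # an utterance can play more than one role
    {"owner"},
    {"timeframe"},
    {"agreement"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(roles)

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                    OneVsRestClassifier(LinearSVC()))
clf.fit(utterances, Y)

pred = clf.predict(["can you get back to me with the server details"])
print(mlb.inverse_transform(pred))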
27
Action Item discourse: an example
28
New Experiment Annotated the same set of CALO/ICSI/ISL data using the new schema Trained classifiers to identify utterances containing each of the 4 subclasses
29
Encouraging signs Between-class distinction (cosine distances) – Agreement vs. any other is good: 0.05 to 0.12 – Timeframe vs. description is OK: 0.25 – Owner/timeframe/description: 0.36 to 0.47 Improved inter-annotator agreement? – Timeframe: κ = 0.86 – Owner 0.77, agreement & description 0.73 – Warning: this is only on one meeting, although it’s the most difficult one we could find
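As a rough illustration of the between-class comparison (assumptions: one bag-of-words vector per subclass, cosine distance via scikit-learn, invented per-class text):

# Pairwise cosine distances (1 - cosine similarity) between per-subclass
# bag-of-words vectors; the per-class utterance text is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

class_text = {
    "agreement":   "okay sure yes that works sounds good",
    "owner":       "jack i'd like you to take that on",
    "timeframe":   "by the start of week three in a couple of days",
    "description": "the details on the printer and the server",
}

names = list(class_text)
vectors = CountVectorizer().fit_transform(class_text.values())
dist = cosine_distances(vectors)

for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(f"{a} vs {b}: {dist[i, j]:.2f}")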
30
Combined classification Still don’t have enough data for proper combined classification – Recall 0.3 to 0.5, precision 0.1 to 0.5 – Agreement subclass is best, with f-score = 0.40 Overall decision based on sub-classifier outputs Ad-hoc heuristic: – prior context window of 5 utterances – agreement plus one other class
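A minimal sketch of this heuristic: the 5-utterance window and the "agreement plus one other class" condition follow the slide, while the function name and input format are hypothetical (sub-classifier outputs are given as precomputed per-utterance label sets).

# Hypothesise an action item when an "agreement" utterance appears and
# some other subclass fired within the preceding window of 5 utterances.
from collections import deque

def detect_action_items(utterance_labels, window=5):
    """utterance_labels: list of sets of subclass labels per utterance."""
    recent = deque(maxlen=window)
    hits = []
    for i, labels in enumerate(utterance_labels):
        other_fired = any(lbls - {"agreement"} for lbls in recent)
        if "agreement" in labels and (other_fired or labels - {"agreement"}):
            hits.append(i)  # utterance that completes the action item
        recent.append(labels)
    return hits

labels = [set(), {"owner", "description"}, set(), {"timeframe"}, {"agreement"}]
print(detect_action_items(labels))  # -> [4]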
31
Questions we can ask Does overall classification look useful? – Whole-AI-based f-score 0.40 to 1.0 (one meeting perfectly correlated with human annotation) Does overall output improve sub-classifiers? – Agreement: f-score 0.40 → 0.43 – Timescale: f-score 0.26 → 0.07 – Owner: f-score 0.12 → 0.24 – Description: f-score 0.33 → 0.24
32
Example output From a CALO meeting: t = [the, start, of, week, three, just, to] o = [reconfirm, everything, and, at, that, time, jack, i'd, like, you, to, come, back, to, me, with, the] d = [the, details, on, the, printer, and, server] a = [okay] Another (less nice?) example: o = [/h#/, so, jack, /uh/, for, i'd, like, you, to] d = [have, one, more, meeting, on, /um/, /h#/, /uh/] t = [in, in, a, couple, days, about, /uh/] a = [/ls/, okay]
33
Where next for action items? More data annotation – Using NOMOS, our annotation tool Meeting browser to get user feedback Improved individual classifiers Improved combined classifier – maximum entropy model – not enough data yet Moving from words to symbolic output – Gemini (Dowding et al., 1990) bottom-up parser