Slide 1
RADAR Evaluation: Goals, Targets, Review & Discussion
Jaime Carbonell & (soon) the full SRI/CMU/IET RADAR Team
1 February 2005, School of Computer Science, Carnegie Mellon University
Supported by the DARPA IPTO PAL Program: “Personalized Assistant That Learns”
Slide 2
Outline: RADAR Evaluation
– Brief review of the RADAR challenge task
– Evaluation objectives: obligations and desiderata
– Evaluation components: RADAR tasks
– RADAR metrics: tasks → meaningful measures
– Putting it all together: tin-man formula proposal
Slide 3
Test: RADAR will assist a conference planner in a crisis situation. The original plan has been disrupted: conference Wing A is no longer available, and other rooms may be affected. The resolver needs to replan: gather information, commandeer other rooms, change schedules, post to websites, and inform participants. The test will be evaluated on the quality and completeness of the new plan and on the successful completion of related tasks.
[Slide diagram: the Crisis Resolver interacting with RADAR (NLP, Planning & Scheduling, E-Mail Handler, Learning, Knowledge Base), Conference Participants, the Website, Conference Organizers, and Wings A and B.]
Slide 4
Conference Re-planning Tasks
Situation assessment
– Which resources have become unavailable
– What alternative resources exist, and at what price
Tentative re-planning of the conference schedule
– Elicit and satisfy as many preferences as possible
Validating the conference schedule & resource allocation
– Securing buy-in from key stakeholders (requires a meeting)
– Awaiting external confirmations (or default assumptions)
– Modifying the plan as/when needed
Informing all stakeholders
– Briefings to VIPs; update the website for participants
Cope with background tasks (time permitting)
Slide 5
Scoring Criteria (adapted from Garvey)
Task realism
– Must reflect RADAR challenge performance
Sensitive to learning
– Must allow headroom beyond Y2 (no low ceiling)
– Must include measurement of learning effects
Auditable with pride
– Objective, simple, clear, transparent, statistically sound, replicable, …
Comprehensive & research-useful
– All RADAR modules included, albeit differentially
– Responsive to RADAR scientific objectives
Slide 6
Evaluation Components
All RADAR modules
– Time-Space Planning (TSP): schedule quality
– Meeting Scheduling (CMRadar): meetings, bumps
– Webmaster + Briefing Assistant (VIO)
– Email + NLP: other (background) tasks completed
Additional learning targets (?)
– Relevant facts & preferences acquired
– Strategic knowledge (when/how to apply knowledge)
Combination function (utility-like)
– Linear weighted sum with +/- terms (see the sketch below)
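As an illustration only of the “linear weighted sum with +/- terms” idea, one plausible combination function is sketched below; the component scores q_i, penalty terms p_j, and weights w_i, v_j are assumptions for this sketch, not definitions taken from the deck.

\[
\text{Score}_{\text{RADAR}} \;=\; \sum_{i} w_i\, q_i \;-\; \sum_{j} v_j\, p_j
\]

Here the q_i would be normalized scores for the components listed above (schedule quality, meeting scheduling, website/briefing, background email tasks), and the p_j would be the negative terms, e.g. penalties for bumped meetings or unfinished background tasks.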
Slide 7
Example: Schedule Quality Metric
– W = weight = importance of the session (e.g., keynote > posters)
– P = penalty for distance from ideal (e.g., room smaller than target); linear or step function
– f = factors of sessions (e.g., room size, duration, equipment, …)
– r = resource (e.g., ballroom at Flagstaff)
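As a hedged sketch only (the slide’s actual formula may differ), a metric consistent with these definitions could discount each session’s weight by the penalties accumulated over its factors:

\[
Q_{\text{sched}} \;=\; \sum_{s \in \text{Sessions}} W_s \left( 1 \;-\; \sum_{f \in F} P\big(f(s),\, r(s)\big) \right)
\]

where P(f(s), r(s)) measures how far the resource r(s) assigned to session s falls from the ideal value of factor f, using either a linear or a step penalty.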
Slide 8
Putting It All Together
– Normalizing components
– Summing (one of two alternative formulations)
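As a sketch only, assuming the intended scheme is to normalize each component score and then aggregate: each raw score s_i could be divided by its maximum attainable value, and the normalized scores combined either additively or multiplicatively. The weights w_i and both aggregation forms below are assumptions, not the deck’s actual formulas.

\[
q_i \;=\; \frac{s_i}{s_i^{\max}}, \qquad
Q \;=\; \sum_i w_i\, q_i
\quad\text{or}\quad
Q \;=\; \prod_i q_i^{\,w_i}
\]

The additive form matches the “linear weighted sum” on slide 6; the multiplicative form is one possible alternative that rewards balance across components (slide 9 asks whether something other than a weighted sum should be used).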
Slide 9
Next Steps for Evaluation Metrics
Metrics for the other components
Metrics for the learning boost
Discuss/refine/redo the combination
– True open-ended scale?
– Something other than a weighted sum?
– Quality metric without penalties (positive terms only)
Test in a full walk-through scenario
– Refine the details
– Don’t lose sight of objectives