1
AQUAINT Testbed
John Aberdeen, John Burger, Conrad Chang, John Henderson, Scott Mardis
The MITRE Corporation
© 2002, The MITRE Corporation
2
AQUAINT Activities @ MITRE
- Testbed
  - Provide access to Q&A systems on classified data.
  - Solicit user feedback (user studies).
- Testweb
  - Provide public access to a broad variety of Q&A capabilities.
  - Evaluate systems and architectures for inclusion in testbed.
- User Studies
  - Determine tasks for which Q&A is a useful technology.
  - Determine advantages of Q&A over related information technologies.
  - Obtain feedback from users on utility and usability.
3
Non-AQUAINT studies
- AFIWC (Air Force Information Warfare Center): installed both classified & unclassified systems
  - QANDA configured to answer information security questions on BUGTRAC data
  - Example: How can CDE be exploited?
- IRS study: questions about completing tax forms
  - Example: What is required to claim a child as a dependent?
4
Lessons learned
- Answers are only useful in context (long answers are preferred)
- Source text must be available
- Chapter and section headings are important context
- Issues with a classified system
  - may not be able to know the top-level objectives of the users
  - may not be able to be told or record any actual questions
  - feedback is largely qualitative
5
Testbed
- Classified network (ICTESTNET)
  - access to users, data, and scenarios will be restricted
- Evaluate systems prior to installation
  - Testweb becomes more important
  - MITRE installations are more than rehearsal
- To facilitate feedback, initial deployment should use open source data, possibly on a different network
6
Testbed Activity
- MITRE installations (need to assess portability to the IC environment, maintainability, features, resources, etc.)
  - QUIRK (CYCorp/IBM)
  - Javelin (CMU) - in progress
  - who's next?
- Support scenario development on CNS data with a search and Q/A interface. Centralize collection of user questions. Available for analysts, reservists, AQUAINT executive committee members, etc.
7
Testweb
- Clarity measure
- ISI TextMap integrated into web demo
Soon:
- CNS data: search + Google API
- QANDA on CNS
[Architecture diagram: Users connect to a Q/A Portal/Demo, which draws on an IR service, a Clarity service, and a Q/A repository; Q/A systems include Javelin, LCC, Qanda, and TextMap; collections include CNS, TREC 2002, and other collections via the Google API.]
9
System Interoperability
[Diagram (systems labeled A, B, C): question "When did Columbus discover America?"; candidate answers "1492" and "the previous year"; supporting text "Columbus discovered America in 1492"; source www.columbus.org.]
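To make the exchanged answer concrete, here is a minimal sketch of how one such record might be represented; the field names (question, answers, justification, source) are assumptions chosen to mirror the labels in the diagram, not a format specified by the slides.

```python
from dataclasses import dataclass

@dataclass
class AnswerRecord:
    """One system's answer to a question, in a form another system could consume.
    Field names are hypothetical; they mirror the diagram labels above."""
    question: str
    answers: list[str]        # candidate answer strings, best first
    justification: str = ""   # supporting text from the source document
    source: str = ""          # where the supporting document came from

record = AnswerRecord(
    question="When did Columbus discover America?",
    answers=["1492", "the previous year"],
    justification="Columbus discovered America in 1492",
    source="www.columbus.org",
)
print(record)
```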
10
Answer Combination
- 67 systems submitted to the TREC-11 main QA task
  - Including some variants
- Average raw submission accuracy was 22%
  - 28% for loosely correct (judgments {1, 2, 3})
- How well can we do by combining systems in some way?
  - Simplest approach: voting (see the sketch below)
  - More sophisticated approaches?
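As a concrete illustration of the simplest approach, here is a minimal voting sketch. It assumes each system contributes one answer string per question and that a crude normalization is enough to pool trivially different strings; the function names and normalization are illustrative, not taken from the TREC-11 systems themselves.

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Crude normalization so trivially different strings vote together."""
    return " ".join(answer.lower().split()).strip(" .,;:")

def vote(answers: list[str]) -> str:
    """Return the (normalized) answer string submitted by the most systems."""
    counts = Counter(normalize(a) for a in answers)
    best, _ = counts.most_common(1)[0]
    return best

# Hypothetical submissions from four systems for one question:
print(vote(["1492", "1492.", "the previous year", "1492"]))  # -> "1492"
```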
11
Basic Approach
- Define distance measures between answer pairs
  - Generalization of simple voting
  - Can use partial matches, other evidence sources
- Select near-"centroid" of all submissions
  - Minimize sum of pairwise distances (SOP), as sketched below
  - Previously used to select DNA sequences (Gusfield 1993)
- Endless possibilities for distance measures
  - Edit distance, ngrams, geo & time distances …
- Also used a document source prior
  - NY Times vs. Associated Press vs. Xinhua vs. NIL
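A minimal sketch of the sum-of-pairwise-distances idea, assuming one answer string per system and normalized edit distance as the pairwise measure (exact voting is the special case where the distance is 0 for identical strings and 1 otherwise). The distance choice, normalization, and sample submissions are illustrative assumptions, not the measures or data actually used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def distance(a: str, b: str) -> float:
    """Normalized edit distance in [0, 1]; partial matches get partial credit."""
    if not a and not b:
        return 0.0
    return edit_distance(a.lower(), b.lower()) / max(len(a), len(b))

def select_by_sop(answers: list[str]) -> tuple[str, float]:
    """Pick the submitted answer minimizing the sum of pairwise distances (SOP)
    to all submissions; the SOP score can also be reused for confidence ranking."""
    scored = [(sum(distance(a, b) for b in answers), a) for a in answers]
    sop, best = min(scored)
    return best, sop

# Hypothetical submissions for a question like 1674 (cf. the example slide below):
subs = ["1969", "1969", "July 20, 1969", "on July 20, 1969", "July 18, 1969", "20"]
print(select_by_sop(subs))  # partial matches pool, so "July 20, 1969" beats "1969"
```

A document source prior could be folded in by adding a per-answer penalty term to the SOP score before taking the minimum.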
12
Sample Results
- Simple voting is more than twice as good as the average submission
- More sophisticated measures are even better
- SOP scores can also be used for confidence ranking
13
Example: Question 1674
What day did Neil Armstrong land on the moon?
- 22 different answer strings submitted
  - 1969 (plurality of submissions; incorrect)
  - July 20, 1969; on July 20, 1969 (correct)
  - July 18, 1969; July 14, 1999 …
  - 20
  - Plus variants differing in punctuation
- Best-scoring selector chooses the correct answer
  - The answers above all contribute
14
Future Work
- Did not have access to
  - System identity (even anonymized)
  - Confidence rankings
- Would like to use both
  - Simple system-specific priors would be easy
  - More sophisticated models possible
- Better confidence estimation
  - Should do better than using the SOP score directly
15
Initial User Study: Comparison to traditional IR
- Establish a baseline for relative utility via a task-based comparison of Q/A to traditional IR.
- Initial task: collect a set of geographic, temporal, and monetary facts regarding Hurricane Mitch
- Data: TREC-11
- Measures: task completeness, accuracy, time
- Analyze logs for query reformulations, document usage, etc.
16
Preliminary Results
- Initial subjects are MITRE employees
- We have run N subjects each on Q/A and IR (Lucene)
17
What’s Next
- Testbed system appraisals
- Testweb stability & facelift
- Studies with other Q/A systems & features
- Other tasks (based on CNS data)
- Other component integrations:
  - answer combination & summarization