
Overview of the TDT 2004 Evaluation and Results
Jonathan Fiscus, Barbara Wheatley
National Institute of Standards and Technology, Gaithersburg, Maryland
December 2-3, 2004

Outline
– TDT Evaluation Overview
– Changes in TDT Evaluation
– Result Summaries
  – New Event Detection
  – Link Detection
  – Topic Tracking
  – Experimental Tasks: Supervised Adaptive Topic Tracking, Hierarchical Topic Detection

Topic Detection and Tracking
5 TDT applications ("applications for organizing text" over terabytes of unorganized data):
– Story Segmentation (not evaluated in 2004)
– Topic Tracking
– Topic Detection
– First Story Detection
– Link Detection

TDT's Research Domain
– Technology challenge: develop applications that organize and locate relevant stories from a continuous feed of news stories
– Research driven by evaluation tasks
– Composite applications built from:
  – Document Retrieval
  – Speech-to-Text (STT): not included this year
  – Story Segmentation: not included this year

Definitions
– An event is a specific thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences.
– A topic is an event or activity, along with all directly related events and activities.
– A broadcast news story is a section of transcribed text with substantive information content and a unified topical focus.

Evaluation Corpus: TDT4 (last year's corpus) vs. TDT5 (this year's corpus)
– Collection dates: TDT4: October 1, 2000 to January 31, 2001; TDT5: April 1, 2003 to September 30, 2003
– Newswire sources: TDT4: 3 Arabic, 2 English, 2 Mandarin; TDT5: 6 Arabic, 7 English, 4 Mandarin
– Broadcast news sources: TDT4: 2 Arabic, 5 English, 5 Mandarin; TDT5: none
– Story counts: TDT4: 90,735 news stories, 7,513 non-news stories; TDT5: all news, 0 non-news
– Average topic size: TDT4: 79 stories; TDT5: 40 stories
– Same languages as last year
Summary of differences:
– New time period
– No broadcast news, no non-news stories
– 4.5 times more stories
– 3.1 times more annotated topics
– Topics have half as many on-topic stories

Topic Size Distribution (chart)
Topic counts by language combination: 63 Eng, 62 Arb, 62 Man, 21 Eng+Man, 7 Arb+Eng, 35 Arb+Eng+Man

Multilingual Topic Overlap (diagram)
Overlapping topics share common stories while retaining unique stories; the diagram distinguishes single-overlap topics from multiply-overlapping topics. Example topics on terrorism: 107 Casablanca bombs, 71 Demonstrations in Casablanca.

Topic labels (grouped on the slide into single-overlap and multiply-overlapping topics)
– 72 Court indicts Liberian President
– 89 Liberian former president arrives in exile
– 29 Swedish Foreign Minister killed
– 125 Sweden rejects the Euro
– 151 Egyptian delegation in Gaza
– 189 Palestinian public uprising suspended for three months
– 69 Earthquake in Algeria
– 145 Visit of Morocco Minister of Foreign Affairs to Algeria
– 186 Press conference between Lebanon and US foreign ministers
– 193 Colin Powell plans to visit Middle East and Europe
– 105 UN official killed in attack
– 126 British soldiers attacked in Basra
– 215 Jerusalem: bus suicide bombing
– 227 Bin Laden videotape
– 171 Morocco: death sentences for bombing suspects
– 107 Casablanca bombs
– 71 Demonstrations in Casablanca
– 106 Bombing in Riyadh, Saudi Arabia
– 118 World Economic Forum in Jordan
– 154 Saudi suicide bomber dies in shootout
– 60 Saudi King has eye surgery
– 80 Spanish elections

Participation by Task: Number of Submitted System Runs (table; per-site run counts not legible in this transcript)
Tasks: New Event Detection, Hierarchical Topic Detection, Topic Tracking (traditional and supervised adaptation), Link Detection
Domestic sites: CMU (Carnegie Mellon Univ.), IBM (International Business Machines), SHAI (Stottler Henke Associates, Inc.), UIowa (Univ. of Iowa), UMd (Univ. of Maryland), UMass (Univ. of Massachusetts)
Foreign sites: CUHK (Chinese Univ. of Hong Kong), ICT (Institute of Computing Technology, Chinese Academy of Sciences), NEU (Northeastern University in China), TNO (The Netherlands Organisation for Applied Scientific Research)

New Event Detection Task
System goal: to detect the first story that discusses each topic.
(Diagram: a story stream with the first stories on two topics marked, distinguishing first stories from non-first stories on Topic 1 and Topic 2.)

TDT Evaluation Methodology
– Tasks are modeled as detection tasks: systems are presented with many trials and must answer the question "Is this example a target trial?"
– Systems respond YES (this is a target) or NO (this is not)
– Each decision includes a likelihood score indicating the system's confidence in the decision
– System performance is measured by linearly combining the system's missed detection rate and false alarm rate

Detection Evaluation Methodology
Performance is measured in terms of Detection Cost:
– C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)
– Constants: C_Miss = 1 and C_FA = 0.1 are preset costs; P_target = 0.02 is the a priori probability of a target
– System performance estimates P_Miss and P_FA
– Normalized Detection Cost generally lies between 0 and 1: (C_Det)_Norm = C_Det / min{C_Miss * P_target, C_FA * (1 - P_target)}
Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between P_Miss and P_FA, making use of the likelihood scores attached to the YES/NO decisions.
Two important scores per system:
– Actual Normalized Detection Cost: based on the system's own YES/NO decision threshold
– Minimum Normalized DET point: based on the DET curve; the minimum score achievable with the proper threshold
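The cost computation above is simple enough to state as code. The following is a minimal illustrative sketch (Python, not NIST's scoring software) of the normalized detection cost and of the threshold sweep behind the minimum DET point; the constants are the ones given on this slide, and the printed example reproduces one operating point from the next slide.

```python
# Illustrative sketch of the TDT detection cost; not the official scoring code.
C_MISS, C_FA, P_TARGET = 1.0, 0.1, 0.02

def detection_cost(p_miss, p_fa):
    """C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1.0 - P_TARGET)

def normalized_cost(p_miss, p_fa):
    """Normalize by the cheaper of the two trivial systems (all YES / all NO)."""
    return detection_cost(p_miss, p_fa) / min(C_MISS * P_TARGET,
                                              C_FA * (1.0 - P_TARGET))

def minimum_normalized_cost(trials):
    """trials: list of (likelihood, is_target) pairs.  Sweep every threshold
    implied by the likelihood scores and keep the lowest normalized cost (the
    minimum DET point); the actual cost uses the system's own YES/NO threshold."""
    n_targets = sum(1 for _, is_target in trials if is_target)
    n_nontargets = len(trials) - n_targets
    best = float("inf")
    for threshold in [score for score, _ in trials] + [float("inf")]:
        p_miss = sum(1 for s, t in trials if t and s < threshold) / n_targets
        p_fa = sum(1 for s, t in trials if not t and s >= threshold) / n_nontargets
        best = min(best, normalized_cost(p_miss, p_fa))
    return best

# One operating point from the example on the next slide:
print(round(normalized_cost(0.055, 0.011), 2))   # -> 0.11
```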

Performance Measures Example
DET curve and bar chart (bottom-left is better), annotated with two operating points:
– P(miss) = 5.5%, P(fa) = 1.1%, minimum normalized DET cost = 0.11
– P(miss) = 0.7%, P(fa) = 1.5%, minimum normalized DET cost = 0.08

Primary New Event Detection Results (chart)
Newswire, English texts; last year's best score shown for reference.

New Event Detection Performance History (year, condition, site, score; most scores are not legible in this transcript)
– 1999: SR=nwt+bnasr, TE=eng,nat, boundary, DEF=10 (UMass)
– 2000: SR=nwt+bnasr, TE=eng,nat, noboundary, DEF=10 (UMass)
– same condition (UMass)
– 2002: SR=nwt+bnasr, TE=eng,nat, boundary, DEF=10 (CMU)
– same condition (CMU), 1.5971 on 2002 topics
– 2004: SR=nwt, TE=eng,nat, DEF=10 (UMass)

TDT Link Detection Task
System goal: to detect whether a pair of stories discuss the same topic.
(Can be thought of as a "primitive operator" used to build a variety of applications.)

Primary Link Detection Results (chart)
Newswire, multilingual links, 10-file deferral period; scores are better than last year.

Link Detection Performance History (year, condition, site, score; most scores are not legible in this transcript)
– 1999: SR=nwt+bnasr, TE=eng,nat, DEF=10 (CMU)
– 2000: SR=nwt+bnasr, TE=eng+man,eng, boundary, DEF=10 (UMass)
– same condition (CMU)
– 2002: SR=nwt+bnasr, TE=eng+man+arb,eng, boundary, DEF=10 (PARC)
– 2003: SR=nwt+bnasr, TE=eng+man+arb,eng, boundary, DEF=10 (UMass, on 2002 topics)
– 2004: SR=NWT, TE=eng+man+arb, DEF=10 (CMU)

Topic Tracking Task
System goal: to detect stories that discuss the target topic, in multiple source streams.
– Supervised training: given N_t sample stories that discuss a given target topic
– Testing: find all subsequent stories that discuss the target topic
(Diagram: a source stream split into training data and test data, with on-topic stories marked in the training portion and unknown stories in the test portion.)

Primary Tracking Results (chart)
Newswire, multilingual texts, 1 English training story; last year's best score shown for reference.

Tracking Performance History (year, condition, site, score; most scores are not legible in this transcript)
– 1999: SR=nwt+bnasr, TR=eng, TE=eng+man,eng, boundary, NT=4 (BBN)
– 2000: SR=nwt+bnman, TR=eng, TE=eng+man,eng, boundary, Nt=1, Nn=0 (IBM)
– same condition (LIMSI)
– 2002: SR=nwt+bnman, TR=eng, TE=eng+man+arb,eng, boundary, Nt=1, Nn=0 (UMass)
– 2003: SR=nwt+bnman, TR=eng, TE=eng+man+arb,eng, boundary, Nt=1, Nn=0 (UMass, 1.1949 on 2002 topics)
– 2004: SR=nwt, TR=eng, TE=eng+man+arb, Nt=1 (CMU)

Supervised Adaptive Tracking Task
Variation of the Topic Tracking system goal: to detect stories that discuss the target topic when a human provides feedback to the system.
– The system receives a human judgment (on- or off-topic) for every retrieved story
– Same task as TREC 2002 Adaptive Filtering
(Diagram: training and test data streams, with stories marked as on-topic, unknown, un-retrieved, retrieved on-topic, or retrieved off-topic.)
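To make the protocol concrete, here is a minimal sketch of the feedback loop just described. It assumes a hypothetical `tracker` object with `score` and `update` methods and an `oracle` callable standing in for the human assessor; these names are illustrative and do not come from the evaluation itself.

```python
# Sketch of the supervised adaptive tracking protocol; names are illustrative.
def supervised_adaptive_tracking(tracker, stream, threshold, oracle):
    """Score each story; for every retrieved (YES) story, immediately obtain
    the human on/off-topic judgment and let the tracker adapt."""
    decisions = []
    for story in stream:
        likelihood = tracker.score(story)      # system confidence
        retrieved = likelihood >= threshold    # YES / NO decision
        decisions.append((story, retrieved, likelihood))
        if retrieved:
            on_topic = oracle(story)           # human feedback, only for retrieved stories
            tracker.update(story, on_topic)    # supervised adaptation
    return decisions
```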

Supervised Adaptive Tracking Metrics
– Normalized Detection Cost: same measure as for the basic Tracking task
– Linear Utility Measure:
  – As defined for the TREC 2002 Filtering Track (Robertson & Soboroff)
  – Measures the value of the stories sent to the user: credit for relevant stories, debit for non-relevant stories; equivalent to thresholding based on the estimated probability of relevance
  – No penalty for missing relevant stories (i.e. all precision, no recall)
  – Implication: the challenge is to beat the "do-nothing" baseline (i.e. a system that rejects all stories)

Supervised Adaptive Tracking Metrics
Linear Utility Measure computation:
– Basic formula: U = W_rel * R - NR, where
  – R = number of relevant stories retrieved
  – NR = number of non-relevant stories retrieved
  – W_rel = relative weight of relevant vs. non-relevant stories (set to 10, by analogy with the C_Miss vs. C_FA weights for C_Det)
– Normalization across topics: divide by the maximum possible utility score for each topic
– Scaling across topics: define an arbitrary minimum possible score U_min, to avoid having the average dominated by a few topics with huge NR counts; this corresponds to an application scenario in which the user stops looking at stories once the system exceeds some tolerable false alarm rate
– Scaled, normalized value: U_scale = (max(U_norm, U_min) - U_min) / (1 - U_min)
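As a worked example, the sketch below computes the scaled utility. The floor U_min is not stated on this slide; the value -0.5 used here is an assumption carried over from the TREC 2002 filtering track, and it is consistent with the do-nothing baseline of 0.33 quoted two slides later.

```python
# Illustrative sketch of the scaled linear utility; U_MIN = -0.5 is an
# assumption (TREC 2002 convention), not a value stated on this slide.
W_REL = 10.0    # relative weight of relevant vs. non-relevant stories
U_MIN = -0.5    # assumed scaling floor

def scaled_utility(n_relevant_retrieved, n_nonrelevant_retrieved, n_relevant_total):
    """U = W_rel * R - NR, normalized by the topic's maximum possible utility
    (retrieving every relevant story and nothing else), then floored at U_MIN
    and rescaled to lie in [0, 1]."""
    u = W_REL * n_relevant_retrieved - n_nonrelevant_retrieved
    u_norm = u / (W_REL * n_relevant_total)
    return (max(u_norm, U_MIN) - U_MIN) / (1.0 - U_MIN)

# A "do-nothing" system retrieves no stories at all:
print(round(scaled_utility(0, 0, n_relevant_total=40), 2))   # -> 0.33
```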

Supervised Adaptive Tracking: Best Two Submissions per Site (chart)
Newswire, multilingual texts, 1 English training story; the best 2004 standard tracking result is shown for reference.

Effect of Supervised Adaptation (chart)
CMU4 is a simple cosine-similarity tracker: a contrastive run submitted without supervised adaptation.

Supervised Adaptive Tracking: Utility vs. Detection Cost
Performance on the utility measure:
– 2/3 of systems surpassed the baseline scaled utility score (0.33)
– Most systems optimized for detection cost, not utility
Detection Cost and Utility are uncorrelated (R^2 of 0.23), even for CMU3, which was tuned for utility.

Hierarchical Topic Detection
System goal: to detect topics in terms of the (clusters of) stories that discuss them.
Problems with past Topic Detection evaluations:
– Topics are at different levels of granularity, yet systems had to choose a single operating point for creating a new cluster
– Stories may pertain to multiple topics, yet systems had to assign each story to only one cluster

Topic Hierarchy Solves Problems
System operation:
– Unsupervised topic training: no topic instances as input
– Assign each story to one or more clusters
– Clusters may overlap or include other clusters
– Clusters must be organized as a directed acyclic graph (DAG) with a single root
– Treated as retrospective search
Semantics of the topic hierarchy:
– Root = entire collection
– Leaf nodes = the most specific topics
– Intermediate nodes represent different levels of granularity
Performance assessment: given a topic, find the matching cluster with the lowest cost.
(Diagram: an example hierarchy with labeled vertices a through j, edges, and story IDs s1 through s16.)
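For illustration only, here is a minimal sketch of the structure just described: a single-rooted DAG of clusters, each holding a set of story IDs, searched exhaustively for the vertex that best matches a reference topic. The toy cost function is a stand-in; the actual evaluation combines the detection and travel costs defined on the following slides.

```python
# Sketch of a topic hierarchy as a single-rooted DAG; illustrative only.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    stories: set                       # story IDs assigned to this cluster
    children: list = field(default_factory=list)

def best_vertex(root, topic_stories, cost_fn):
    """Visit every vertex reachable from the root and return the one with the
    lowest cost against the reference topic."""
    best, best_cost = None, float("inf")
    stack, seen = [root], set()
    while stack:
        v = stack.pop()
        if id(v) in seen:
            continue
        seen.add(id(v))
        cost = cost_fn(v.stories, topic_stories)
        if cost < best_cost:
            best, best_cost = v, cost
        stack.extend(v.children)
    return best, best_cost

# Toy cost (size of the symmetric difference), NOT the official metric:
toy_cost = lambda cluster, topic: len(cluster ^ topic)
root = Vertex("root", {1, 2, 3, 4, 5, 6}, [
    Vertex("a", {1, 2, 3}, [Vertex("a1", {1, 2}), Vertex("a2", {3})]),
    Vertex("b", {4, 5, 6}),
])
print(best_vertex(root, {1, 2}, toy_cost)[0].name)   # -> a1
```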

Hierarchical Topic Detection Metric: Minimal Cost
Weighted combination of Detection Cost and Travel Cost:
– Cost = W_DET * (C_det(topic, bestVertex))_Norm + (1 - W_DET) * (C_travel(topic, bestVertex))_Norm
– Detection Cost: same as for the other tasks
– Travel Cost: a function of the hierarchy
– Detection Cost is weighted twice the Travel Cost (W_DET = 0.66)
The Minimal Cost metric was selected based on a study at UMass (Allan et al.):
– Effectively eliminates the power-set solution
– Favors a balance of cluster purity vs. number of clusters
– Computationally tractable
– Good behavior in UMass experiments
Analytic use model:
– Find the best-matching cluster by traversing the DAG, starting from the root
– Corresponds to the analytic task of exploring an unknown collection
Drawbacks:
– Does not model the analytic task of finding other stories on the same or neighboring topics
– Not obvious how to normalize the travel cost

Hierarchical Topic Detection Metric: Travel Cost
Travel Cost computation:
– C_travel(topic, vertex) = C_travel(topic, parentOf(vertex)) + CBRANCH * NumChildren(parentOf(vertex)) + CTITLE
– CBRANCH = cost per branch, for each vertex on the path to the best match
– CTITLE = cost of examining each vertex
– The relative values of CBRANCH and CTITLE determine the preference for a shallow, bushy hierarchy vs. a deep, less bushy one
– The evaluation values were chosen to favor a branching factor of 3
Travel Cost normalization:
– The absolute travel cost depends on the size of the corpus and the diversity of topics, so it must be normalized to combine with the Detection Cost
– The normalization scheme for the trial evaluation was chosen to yield C_travel_Norm = 1 for an "ignorant" hierarchy (by analogy with the use of the prior probability for C_det_Norm):
  C_travel_Norm = C_travel / (CBRANCH * MAXVTS * NSTORIES / AVESPT) + CTITLE
– MAXVTS = 3 (maximum number of vertices per story, controls overlap)
– AVESPT = 88 (average stories per topic, computed from TDT4 multilingual data)
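The two formulas above can be combined as in the sketch below. The CBRANCH and CTITLE values are placeholders (the actual evaluation constants are not preserved in this transcript), and the travel-cost recursion is unrolled as a loop over the path from the root to the chosen vertex.

```python
# Illustrative sketch of travel cost and the combined minimal cost.
CBRANCH = 1.0   # placeholder cost per branch (actual value not given here)
CTITLE = 1.0    # placeholder cost per vertex title (actual value not given here)
W_DET = 0.66    # detection cost weighted twice the travel cost

def travel_cost(children_counts_along_path):
    """Unrolled recursion: for each vertex on the path from the root to the
    best match, pay CBRANCH for every child of its parent, plus CTITLE."""
    return sum(CBRANCH * n_children + CTITLE
               for n_children in children_counts_along_path)

def minimal_cost(norm_detection_cost, norm_travel_cost):
    """Weighted combination, evaluated at the best-matching vertex."""
    return W_DET * norm_detection_cost + (1.0 - W_DET) * norm_travel_cost

# Example: the best match sits two levels below the root, and each parent on
# the path has 3 children (the branching factor the evaluation favors).
print(travel_cost([3, 3]))                 # -> 8.0 with the placeholder constants
print(round(minimal_cost(0.2, 0.1), 3))    # -> 0.166
```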

Hierarchical Topic Detection (results chart)

Hierarchical Topic Detection Observations
– All systems structured the hierarchy as a tree: each vertex has one parent
– Travel cost has very little effect on finding the best cluster; setting W_DET to 1.0 has little effect on the topic mapping
– The cost parameters favor false alarms: average mapped cluster sizes are between 1,262 and 7,757 stories, while the average topic size is 40 stories

Summary
– Eleven research groups participated in five evaluation tasks
– Error rates increased for new event detection (why?)
– Error rates decreased for tracking
– Error rates decreased for link detection
– Dry run of hierarchical topic detection completed
  – Solves previous problems with the topic detection task, but raises new issues
  – Questions to consider: Is the specified hierarchical structure (single-root DAG) appropriate? Is the minimal cost metric appropriate? If so, is the normalization right?
– Dry run of supervised adaptive tracking completed
  – Promising results for including relevance feedback
  – Questions to consider: Should we continue the task? If so, should we continue using both metrics?