
A Critique and Improvement of an Evaluation Metric for Text Segmentation
A paper by Lev Pevzner (Harvard University) and Marti A. Hearst (UC Berkeley)
Presented by Saima Aman, SITE, University of Ottawa, Nov 10, 2005

Presentation Outline
● Problem description: text segmentation
● Evaluation measures: precision and recall
● Evaluation metric P_k
● Problems with the evaluation metric P_k
● Solution: the modified metric WindowDiff
● Simulation results
● Conclusions

What is Text Segmentation?
● Documents generally consist of multiple sub-topics.
● Text segmentation is the task of determining the positions at which topics change in a document.
● Applications of text segmentation:
  – Information Retrieval (IR): retrieval of relevant passages
  – Automated summarization
  – Story segmentation of video
  – Detection of topic and story boundaries in news feeds

Approaches to Text Segmentation
● Patterns of lexical co-occurrence and distribution
  – Large shifts in vocabulary indicate subtopic boundaries
  – Clustering based on word co-occurrences
● Lexical chains
  – A large number of lexical chains are found to originate and end at segment boundaries
● Cue words that tend to be used near segment boundaries
  – Hand-selected cue words
  – Machine learning techniques used to select cue words

Segmentation Evaluation
Challenges of evaluation:
● It is difficult to choose a reference segmentation.
  – Human judges disagree over the placement of boundaries.
  – There is disagreement on how fine-grained the segmentation should be.
● The criticality of errors is often application dependent.
  – Near misses may be acceptable in information retrieval.
  – Near misses are critical in news boundary detection.

How to Evaluate Segmentation?
● There is a set of true boundaries given by the reference segmentation.
● A segmentation algorithm may identify correct as well as incorrect boundaries.
● The set of boundaries identified by the algorithm may not perfectly match the set of true boundaries.

Precision & Recall
● Recall: the ratio of the number of true segment boundaries identified to the total number of true segment boundaries in the document.
● Precision: the ratio of the number of correct segment boundaries identified to the total number of boundaries identified.
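A minimal sketch of boundary-level precision and recall, assuming segmentations are given as 0/1 indicator lists in which a 1 at index i marks a boundary after unit i; the representation and function name are illustrative, not taken from the paper:

```python
def precision_recall(ref_bounds, hyp_bounds):
    """Boundary-level precision and recall.

    ref_bounds / hyp_bounds are 0/1 lists; a 1 at index i marks a
    segment boundary after unit i.  Only exact matches count, so a
    near miss is both a false positive and a false negative.
    """
    ref = {i for i, b in enumerate(ref_bounds) if b}
    hyp = {i for i, b in enumerate(hyp_bounds) if b}
    correct = len(ref & hyp)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall
```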

Precision and Recall – Challenges
● Inherent trade-off between precision and recall: improving one quantity may deteriorate the other.
  – The F1-measure is sometimes maximized instead.
  – Placing more boundaries may improve recall but reduces precision.
● Not sensitive to "near misses".
  – In the paper's example, both algorithms A-0 and A-1 receive scores of 0 for both precision and recall.
  – It is desirable to have a metric that penalizes A-0 less harshly than A-1.

A New Metric: P_k
● Proposed by Beeferman, Berger, and Lafferty (1997).
● Attempts to resolve the problems with precision and recall.
● P_k measures the probability that two sentences k units apart are incorrectly classified as being in the same or in different segments.
● P_k = (total number of disagreements with the reference) / (number of measurements taken)
● It computes penalties via a moving window of length k, where k = (average segment size) / 2.
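As a rough illustration of the definition above (not the authors' reference implementation), P_k can be computed by sliding a probe of width k and checking, at each position, whether the reference and the hypothesis agree on the same-segment vs. different-segments question; the 0/1 boundary-list representation is the same assumption used in the earlier sketch:

```python
def p_k(ref_bounds, hyp_bounds, k=None):
    """P_k for two 0/1 boundary lists of equal length.

    k defaults to half the average reference segment size, as in
    Beeferman et al.; the rounding here is a simplification.
    """
    n = len(ref_bounds)
    if k is None:
        n_segments = sum(ref_bounds) + 1
        k = max(1, round(n / n_segments / 2))
    disagreements = 0
    for i in range(n - k):
        # units i and i+k are in the same segment iff no boundary lies between them
        same_ref = sum(ref_bounds[i:i + k]) == 0
        same_hyp = sum(hyp_bounds[i:i + k]) == 0
        if same_ref != same_hyp:
            disagreements += 1
    return disagreements / (n - k)
```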

How is P_k Calculated?
● Example: segment size = 8 and window size k = 4.
● At each location, the algorithm determines whether the two ends of the probe are in the same segment or in different segments.
● A penalty is assigned whenever the two units are incorrectly labelled with respect to the reference.
● (In the slide's figure, solid lines indicate that no penalty is assigned; dashed lines indicate that a penalty is assigned.)

Scope of the Paper
The authors:
● identify several limitations of the metric P_k,
● propose a modified metric, WindowDiff,
● claim that the new metric solves most of the problems associated with P_k, and
● present results of simulations suggesting that the modified metric is an improvement over the original.

Problems with Metric P_k
● False negatives are penalized more than false positives
  – False negatives are always assigned a penalty of k
  – On average, false positives are assigned a penalty of k/2
● The number of boundaries between the probe ends is ignored
  – Causes some errors to go unpenalized
● Sensitivity to variations in segment size
  – As segment size gets smaller, the penalty for both false positives and false negatives decreases
  – As segment size increases, the penalty for false positives increases
● Near-miss errors are penalized too much
● P_k is non-intuitive and difficult to interpret

Modified Metric – WindowDiff
For each position of the probe, compute:
● r_i – the number of reference segmentation boundaries that fall between the two ends of a fixed-length probe
● a_i – the number of boundaries assigned in this interval by the algorithm
The algorithm is penalized if the two numbers do not match, i.e., if |r_i – a_i| > 0.
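Under the same 0/1 boundary-list assumption, WindowDiff replaces the same/different check with a comparison of boundary counts inside the window (a sketch, not the paper's code):

```python
def window_diff(ref_bounds, hyp_bounds, k):
    """WindowDiff for two 0/1 boundary lists of equal length."""
    n = len(ref_bounds)
    errors = 0
    for i in range(n - k):
        r_i = sum(ref_bounds[i:i + k])   # reference boundaries inside the probe
        a_i = sum(hyp_bounds[i:i + k])   # boundaries assigned by the algorithm
        if abs(r_i - a_i) > 0:           # penalize any mismatch in the counts
            errors += 1
    return errors / (n - k)
```

NLTK ships comparable pk and windowdiff functions in nltk.metrics.segmentation (operating on boundary strings such as "0100") if a ready-made implementation is preferred.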

Validation via Simulation
Simulations were performed for the following metrics:
● the evaluation metric P_k,
● the metric P'_k (which doubles the penalty for false positives), and
● WindowDiff.
Simulation details:
● A single trial consisted of generating a reference segmentation of 1,000 segments,
● generating experimental segmentations of a specific type 100 times, and
● computing the metrics and averaging over the 100 results.
● Different segment size distributions were used.
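A sketch of the kind of trial described above, using made-up segment-size and error parameters (the paper's exact size distributions and error types are not reproduced here); it reuses the p_k and window_diff functions from the earlier sketches:

```python
import random

def random_segmentation(n_segments, avg_size):
    """Reference as a 0/1 boundary list; segment sizes drawn uniformly
    around avg_size (a stand-in for the paper's size distributions)."""
    bounds = []
    for _ in range(n_segments):
        size = random.randint(max(1, avg_size - 2), avg_size + 2)
        bounds.extend([0] * (size - 1) + [1])
    bounds[-1] = 0                     # no boundary after the last unit
    return bounds

def add_false_positives(ref_bounds, p=0.05):
    """Experimental segmentation: copy the reference and insert spurious
    boundaries with probability p (one illustrative error type)."""
    return [int(b or random.random() < p) for b in ref_bounds]

trials, k = 100, 4                     # k = half an average segment size of 8
pk_scores, wd_scores = [], []
for _ in range(trials):
    ref = random_segmentation(n_segments=1000, avg_size=8)
    hyp = add_false_positives(ref)
    pk_scores.append(p_k(ref, hyp, k))
    wd_scores.append(window_diff(ref, hyp, k))
print(sum(pk_scores) / trials, sum(wd_scores) / trials)
```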

Results for WindowDiff
● Successfully distinguishes 'near misses' as a separate kind of error.
● Penalizes near misses less than pure false positives and pure false negatives.
● Gives equal weight to false positive and false negative penalties (eliminating the asymmetry seen in the P_k metric).
● Catches false positives and false negatives within segments of length less than k.
● Is only slightly affected by variation in the segment size distribution.

Interpretation of WindowDiff
● Test results show that the WindowDiff metric grows roughly linearly with the difference between the reference and the experimental segmentations.
● The WindowDiff value can be interpreted as an indication of the number of discrepancies between the reference and the algorithm's result.
● The evaluation metric P_k, by contrast, measures how often two text units are incorrectly classified as being in the same or in different segments; this interpretation is less intuitive.

Conclusions
● The evaluation metric P_k suffers from several drawbacks.
● A modified version, P'_k, which doubles the false positive penalty, only solves the problem of over-penalizing false negatives, not the other problems.
● The WindowDiff metric is able to solve all of the problems associated with P_k.
● Popularity of the new metric WindowDiff:
  – An internet search shows several citations of this paper.
  – Most work in text and media segmentation now uses the WindowDiff measure.