
1 A Critique and Improvement of an Evaluation Metric for Text Segmentation
A paper by Lev Pevzner (Harvard University) and Marti A. Hearst (UC Berkeley)
Presented by Saima Aman, SITE, University of Ottawa, Nov 10, 2005

2 Presentation Outline
● Problem Description
  – Text Segmentation
  – Evaluation Measures: Precision and Recall
  – Evaluation Metric P_k
  – Problems with Evaluation Metric P_k
● Solution
  – Modified Metric: WindowDiff
  – Simulation Results
● Conclusions

3 What is Text Segmentation?
● Documents generally comprise multiple sub-topics.
● Text segmentation is the task of determining the positions at which topics change in a document.
● Applications of text segmentation:
  – Information Retrieval (IR): retrieval of relevant passages
  – Automated summarization
  – Story segmentation of video
  – Detection of topic and story boundaries in news feeds

4 Approaches to Text Segmentation
● Patterns of lexical co-occurrence and distribution
  – Large shifts in vocabulary indicate subtopic boundaries
  – Clustering based on word co-occurrences
● Lexical chains
  – A large number of lexical chains are found to begin and end at segment boundaries
● Cue words that tend to be used near segment boundaries
  – Hand-selected cue words
  – Machine learning techniques used to select cue words

5 Segmentation Evaluation
Challenges of evaluation:
● It is difficult to choose a reference segmentation:
  – Human judges disagree over the placement of boundaries.
  – Judges also disagree on how fine-grained the segmentation should be.
● The criticality of errors is often application dependent:
  – Near misses may be acceptable in information retrieval.
  – Near misses are critical in news boundary detection.

6 How to Evaluate Segmentation?
● The reference segmentation defines a set of true boundaries.
● A segmentation algorithm may identify correct as well as incorrect boundaries.
● The set of segment boundaries identified by the algorithm may not perfectly match the set of true boundaries.

7 Precision and Recall
● Recall: the ratio of the number of true segment boundaries identified to the total number of true segment boundaries in the document.
● Precision: the ratio of the number of correct segment boundaries identified to the total number of boundaries identified.
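To make the definitions concrete, here is a minimal Python sketch (not from the paper; representing a segmentation as a set of integer boundary positions is an assumption) that scores a hypothesis against a reference:

```python
# Minimal sketch: a segmentation is represented as a set of integer
# boundary positions between text units (an assumed representation).

def precision_recall(reference, hypothesis):
    """Boundary-level precision and recall with exact matching only."""
    correct = len(reference & hypothesis)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

# A near miss earns no credit: the boundary hypothesized at position 10
# is only one unit away from the true boundary at 11, yet scores zero.
print(precision_recall({5, 11, 20}, {5, 10, 20}))  # (0.666..., 0.666...)
```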

8 Precision and Recall: Challenges
● There is an inherent trade-off between precision and recall: trying to improve one quantity may deteriorate the other.
  – Placing more boundaries may improve recall but reduces precision.
  – The F1-measure, which balances the two, is sometimes maximized instead.
● Neither measure is sensitive to "near misses".
  – Both example algorithms A-0 and A-1 would receive scores of 0 for both precision and recall.
  – It is desirable to have a metric that penalizes A-0 less harshly than A-1.

9 A New Metric: P_k
● Proposed by Beeferman, Berger, and Lafferty (1997).
● Attempts to resolve the problems with precision and recall.
● P_k measures the probability that two sentences k units apart are incorrectly labeled as being in different segments.
● P_k = (total number of disagreements with the reference) / (number of measurements taken)
● Penalties are computed via a moving window of length k, where k = (average segment size) / 2.

10 How is P_k Calculated?
● Example: segment size = 8, window size k = 4.
● At each location, the algorithm determines whether the two ends of the probe are in the same or different segments.
● A penalty is assigned whenever the two units are incorrectly labelled with respect to the reference.
● (In the original slide's figure, solid lines mark probe positions where no penalty is assigned; dashed lines mark positions where a penalty is assigned.)
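As a concrete illustration, here is a hedged Python sketch of P_k; representing a segmentation as a list of per-unit segment ids is an assumption, not the paper's notation:

```python
# Sketch of P_k: ref and hyp are lists of segment ids, one per text
# unit, e.g. [0, 0, 0, 1, 1, ...] (an assumed representation).

def p_k(ref, hyp, k):
    """Fraction of probe positions (i, i + k) on which the reference and
    the hypothesis disagree about whether the two units share a segment."""
    n = len(ref)
    disagreements = sum(
        1 for i in range(n - k)
        if (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
    )
    return disagreements / (n - k)

# Two segments of size 8 and window k = 4, as in the example above; the
# hypothesis misses the true boundary by one unit (a near miss).
ref = [0] * 8 + [1] * 8
hyp = [0] * 9 + [1] * 7
print(p_k(ref, hyp, 4))  # 2/12 ~ 0.167: penalized at two probe positions
```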

11 Scope of the Paper
● The authors identify several limitations of the metric P_k.
● They propose a modified metric, WindowDiff.
● They claim that the new metric solves most of the problems associated with P_k.
● They present simulation results suggesting that the modified metric is an improvement over the original.

12 Problems with Metric P_k
● False negatives are penalized more than false positives:
  – A false negative is always assigned a penalty of k.
  – On average, a false positive is assigned a penalty of k/2.
● The number of boundaries between the probe ends is ignored:
  – This causes some errors to go unpenalized.
● Sensitivity to variations in segment size:
  – As segment size gets smaller, the penalty for both false positives and false negatives decreases.
  – As segment size increases, the penalty for false positives increases.
● Near-miss errors are penalized too much.
● P_k is non-intuitive and difficult to interpret.

13 Modified Metric: WindowDiff
For each position of the probe, compute:
● r_i: the number of reference segmentation boundaries that fall between the two ends of a fixed-length probe.
● a_i: the number of boundaries assigned in this interval by the algorithm.
The algorithm is penalized if the two numbers do not match, that is, if |r_i − a_i| > 0.
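A minimal Python sketch of WindowDiff under the same hedged assumptions; segmentations are 0/1 lists with a 1 wherever a boundary follows a unit (for production use, NLTK's nltk.metrics.segmentation module provides comparable pk and windowdiff functions):

```python
# Sketch of WindowDiff: ref and hyp are 0/1 lists where a 1 at index i
# marks a boundary after text unit i (an assumed representation).

def window_diff(ref, hyp, k):
    """Penalize every length-k window in which the number of reference
    boundaries (r_i) differs from the number assigned by the algorithm (a_i)."""
    n = len(ref)
    penalties = sum(
        1 for i in range(n - k)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return penalties / (n - k)

# The near miss from the P_k example now draws a graded penalty, and
# multiple boundaries inside a single window are no longer ignored.
ref = [0] * 7 + [1] + [0] * 8   # true boundary after unit 7
hyp = [0] * 8 + [1] + [0] * 7   # hypothesized boundary after unit 8
print(window_diff(ref, hyp, 4))
```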

14 Validation via Simulation
● Simulations were performed for the following metrics:
  – The evaluation metric P_k
  – The metric P'_k, which doubles the penalty for false positives
  – WindowDiff
● Simulation details (a sketch of one such trial follows this list):
  – A single trial consisted of generating a reference segmentation of 1,000 segments, then generating experimental segmentations of a specific type 100 times and averaging the computed metrics over the 100 results.
  – Different segment-size distributions were used.
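The sketch below mirrors one such trial; all specifics (uniform segment sizes, the false-positive rate, and the reuse of the window_diff function sketched above) are assumptions made for illustration:

```python
import random

def random_segmentation(num_segments, size):
    """0/1 boundary list for num_segments segments of uniform size."""
    seg = ([0] * (size - 1) + [1]) * num_segments
    seg[-1] = 0  # no boundary after the final unit
    return seg

def add_false_positives(ref, p):
    """Copy of ref with spurious boundaries inserted with probability p."""
    return [b or int(random.random() < p) for b in ref]

random.seed(0)
ref = random_segmentation(1000, 10)
k = 5  # half the average segment size
scores = [window_diff(ref, add_false_positives(ref, 0.01), k)
          for _ in range(100)]
print(sum(scores) / len(scores))  # metric averaged over 100 runs
```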

15 Results for WindowDiff
● Successfully distinguishes near misses as a separate kind of error:
  – Penalizes near misses less than pure false positives and pure false negatives.
● Gives equal weight to false-positive and false-negative penalties, eliminating the asymmetry seen in P_k.
● Catches false positives and false negatives within segments of length less than k.
● Is only slightly affected by variation in the segment-size distribution.

16 Interpretation of WindowDiff
● Test results show that WindowDiff grows roughly linearly with the difference between the reference and the experimental segmentation.
● The WindowDiff value can therefore be read as an indication of the number of discrepancies between the reference and the algorithm's result.
● By contrast, P_k measures how often two text units are incorrectly labelled as being in different segments, an interpretation that is less intuitive.

17 Conclusions
● The evaluation metric P_k suffers from several drawbacks.
● A modified version, P'_k, which doubles the false-positive penalty, solves only the over-penalization of false negatives, not the other problems.
● The metric WindowDiff solves all of the problems associated with P_k.
● Popularity of WindowDiff:
  – An internet search turns up several citations of this paper.
  – Most work in text and media segmentation now uses the WindowDiff measure.

