Intelligent Database Systems Lab, N.Y.U.S.T. I. M.
Evaluation of novelty metrics for sentence-level novelty mining
Presenter: Lin, Shu-Han
Authors: Flora S. Tsai, Wenyin Tang, Kap Luk Chan
Information Sciences, InS (2010)
Outline
Introduction
Motivation
Objectives
Methodology
Comparative study
Experiments
Conclusions
Comments
Introduction
What is novelty? Novelty is the opposite of "similarity" or "redundancy".
Novelty mining: given the set of relevant sentences in all documents, identify all novel sentences.
How are novel sentences identified? Each sentence receives a novelty score, computed by a novelty metric.
Motivation
Sentence 1 (S1): U.S. Stocks set for big sell-off
Sentence 2 (S2, the incoming sentence): U.S. Stocks
S2 is covered by S1.
Novelty(S1, S2) = 1 – similarity(S1, S2)
The similarity between S1 and S2 is low, so is S2 novel?
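The example above can be checked numerically. The following is a minimal sketch, assuming lowercase whitespace tokenization and binary term vectors (the slide specifies neither); `tokens` and `cosine_similarity` are hypothetical helper names.

```python
import math

def tokens(sentence):
    # Assumption: lowercase whitespace tokenization; the slide does not specify one.
    return set(sentence.lower().split())

def cosine_similarity(a, b):
    # Cosine over binary term vectors: |A ∩ B| / sqrt(|A| * |B|).
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

s1 = tokens("U.S. Stocks set for big sell-off")
s2 = tokens("U.S. Stocks")

# Symmetric novelty as defined on the slide: 1 - similarity.
symmetric_novelty = 1 - cosine_similarity(s1, s2)

# Words of the incoming sentence S2 that S1 does not already contain.
new_words_in_s2 = s2 - s1

print(round(symmetric_novelty, 3), new_words_in_s2)
```

Under these assumptions the symmetric score for S2 is clearly above zero (about 0.42) although S2 contributes no word that S1 lacks, which is the mismatch the slide points at.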
Objectives
How to choose a novelty metric?
How to set a suitable threshold automatically?
Methodology - Novelty Metrics
Symmetric (1 – similarity): S1 is novel to S2, and S2 is novel to S1
Asymmetric: S1 is not novel to S2, but S2 is novel to S1
Methodology - Symmetric metrics
Cosine similarity
Jaccard similarity
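The formulas for this slide did not survive in the text, so the sketch below uses the standard textbook definitions of cosine and Jaccard similarity. The aggregation over earlier sentences (one minus the maximum similarity) and the whitespace tokenizer are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(sentence):
    # Assumption: lowercase whitespace tokenization into term frequencies.
    return Counter(sentence.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Jaccard coefficient on the word sets: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

def symmetric_novelty(current, history, similarity):
    """1 - max similarity to any earlier sentence (assumed aggregation)."""
    if not history:
        return 1.0
    return 1.0 - max(similarity(current, h) for h in history)

s1 = vectorize("U.S. Stocks set for big sell-off")
s2 = vectorize("U.S. Stocks")
print(symmetric_novelty(s2, [s1], cosine_similarity))
print(symmetric_novelty(s2, [s1], jaccard_similarity))
```

Both similarities give the same value when the two sentences are swapped, which is what makes these metrics symmetric.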
Methodology - Asymmetric metrics
Overlap metric
New word count metric
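The slide's formulas are not present in the text. As a hedged illustration, the sketch below uses common formulations: overlap as the fraction of the current sentence's words found in an earlier sentence, and new word count as the number of words unseen in all earlier sentences. These definitions, the tokenizer, and the function names are assumptions, not necessarily the authors' exact forms.

```python
def words(sentence):
    # Assumption: lowercase whitespace tokenization.
    return set(sentence.lower().split())

def overlap_novelty(current, history):
    """1 - largest fraction of the current sentence's words found in one earlier sentence."""
    if not current:
        return 0.0
    if not history:
        return 1.0
    covered = max(len(current & h) / len(current) for h in history)
    return 1.0 - covered

def new_word_count(current, history):
    """Number of words in the current sentence that no earlier sentence contains."""
    seen = set().union(*history)
    return len(current - seen)

s1 = words("U.S. Stocks set for big sell-off")
s2 = words("U.S. Stocks")

# The direction matters: S2 after S1 versus S1 after S2.
print(overlap_novelty(s2, [s1]), new_word_count(s2, [s1]))
print(overlap_novelty(s1, [s2]), new_word_count(s1, [s2]))
```

Unlike the symmetric metrics, the scores depend on which sentence arrives first: S2 after S1 adds nothing, while S1 after S2 adds four words.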
Comparative study
Performance requirements (trade-off): high recall, high precision, or high F-score
The distribution: high, medium, or low novelty ratio
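For reference, the performance measures named above can be computed as follows. This is a minimal sketch using the standard definitions of precision, recall, and F-score; reading the novelty ratio as the share of relevant sentences that are truly novel is an assumption, and the sentence ids are made-up illustration data.

```python
def precision_recall_f(predicted_novel, truly_novel, beta=1.0):
    """Precision, recall, and F-beta for sets of sentence ids."""
    true_positives = len(predicted_novel & truly_novel)
    precision = true_positives / len(predicted_novel) if predicted_novel else 0.0
    recall = true_positives / len(truly_novel) if truly_novel else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score

def novelty_ratio(truly_novel, relevant):
    """Assumed definition: fraction of relevant sentences that are novel."""
    return len(truly_novel) / len(relevant) if relevant else 0.0

# Hypothetical sentence ids, for illustration only.
relevant = {1, 2, 3, 4, 5, 6}
truly_novel = {2, 3, 5}
predicted_novel = {1, 2, 3, 4}

precision, recall, f1 = precision_recall_f(predicted_novel, truly_novel)
print(precision, recall, f1, novelty_ratio(truly_novel, relevant))
```

Lowering the novelty threshold marks more sentences as novel, which tends to raise recall at the cost of precision; that is the trade-off the slide refers to.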
Comparative study – Performance requirements
Figure panels: F-score/precision; F-score/recall
Comparative study – Prior probability
Comparative study – Prior probability (continued)
Methodology – A new framework
Combine symmetric and asymmetric metrics
Two problems:
The scaling problem: the metrics must be made comparable and consistent
The combining strategy
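One way to picture the two problems is sketched below for a Jaccard plus new-word-count mixture (the pairing the experiments label as jacc+new). The scaling step (dividing the new word count by sentence length) and the combining step (a weighted sum with weight `alpha`) are assumptions chosen for illustration; the slide does not state the authors' actual choices.

```python
def words(sentence):
    # Assumption: lowercase whitespace tokenization.
    return set(sentence.lower().split())

def jaccard_novelty(current, history):
    """Symmetric part: 1 - max Jaccard similarity to an earlier sentence."""
    if not current:
        return 0.0
    if not history:
        return 1.0
    return 1.0 - max(len(current & h) / len(current | h) for h in history)

def scaled_new_word_novelty(current, history):
    """Asymmetric part, scaled to [0, 1] by sentence length (assumed scaling)."""
    if not current:
        return 0.0
    seen = set().union(*history)
    return len(current - seen) / len(current)

def mixed_novelty(current, history, alpha=0.5):
    """Weighted sum of the two scores (assumed combining strategy)."""
    return (alpha * jaccard_novelty(current, history)
            + (1 - alpha) * scaled_new_word_novelty(current, history))

s1 = words("U.S. Stocks set for big sell-off")
s2 = words("U.S. Stocks")

print(mixed_novelty(s2, [s1]))  # S2 arrives after S1
print(mixed_novelty(s1, [s2]))  # S1 arrives after S2
```

With both scores on the same [0, 1] scale, the weight controls how much the direction-sensitive part pulls the score down for a sentence that is fully covered by an earlier one.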
Experiments – Mixed metrics vs. individual metrics
Figure labels: M3 (jacc+new); tf.isf
Experiments – Mixed metric M3 vs. individual metrics for novelty ratio
Experiments – Mixed metric M3 vs. mixture of two symmetric metrics vs. mixture of two asymmetric metrics vs. mixture of all metrics for novelty ratio
Experiments – Weight
Conclusions
Comparative study of different types of novelty metrics
Symmetric: cosine / Jaccard
Asymmetric: new word count / overlap
Observes the strengths of each type
Introduces a mixture of the two types of novelty metrics
More stable than using an individual metric
Comments
Advantage: a comparative study; mixture; intuitive
Drawback: …
Application: novelty mining