Finding document topics for improving topic segmentation. Source: ACL 2007. Author: Olivier Ferret (18 route du Panorama, BP6). Reporter: Yong-Xiang Chen.


Motivation
Most documents are about more than one subject
– hence topic segmentation
– useful for automatic summarization
The problem of linear topic segmentation
– topically homogeneous segments that do not overlap each other
Topic segmentation and topic identification are often tackled as separate problems
This research studies how topic identification can help to improve a topic segmenter based on word reiteration

Principles
Documents are represented as sequences of basic discourse units
– sentences are the units in this paper
The similarity between the basic units is evaluated
– each unit is turned into a vector of words
– a similarity measure is computed between the vectors
This similarity can be based on
– the strict reiteration of words (when no external knowledge is used)
– semantic relations between words

Principles
The purpose is to improve the detection of topical similarity between text segments without relying on any external knowledge
For each text to segment,
– its topics are identified by performing an unsupervised clustering of its words according to their co-occurrents in the text; each of its topics is represented by a subset of the text's vocabulary
– the similarity between two segments is evaluated from both the words they share and the presence of words of the same topic

Unsupervised Topic Identification
Discover the topics of texts
– assume that the most representative words of each topic of a text occur in similar contexts
– evaluate the pairwise similarity of these selected text words by relying on their co-occurrents

Building the similarity matrix of text words
The pairwise similarity between all the selected text words is evaluated to build their similarity matrix
The Cosine measure is applied between their co-occurrence vectors
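As a minimal sketch of this step (the co-occurrence window size and the use of raw counts are assumptions of mine, not stated on the slides):

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=3):
    """Represent each word by its co-occurrents; window=3 is an assumed size."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for i, w in enumerate(sentence):
            for c in sentence[max(0, i - window):i + window + 1]:
                if c != w:
                    vectors[w][c] += 1
    return vectors

def cosine(u, v):
    """Cosine measure between two sparse co-occurrence vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(sentences):
    """Pairwise Cosine similarity between all text words."""
    vectors = cooccurrence_vectors(sentences)
    words = sorted(vectors)
    return {(a, b): cosine(vectors[a], vectors[b])
            for a in words for b in words if a < b}
```

Words that never share co-occurrents get similarity 0, which later yields no edge between them in the similarity graph.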

From a similarity matrix to text topics
Unsupervised clustering of the words of a text from their similarity matrix
The Shared Nearest Neighbor (SNN) algorithm is adapted
The SNN algorithm performs clustering by detecting high-density areas in a similarity graph
In our case, the similarity graph is built directly from the similarity matrix
– each vertex represents a text word
– an edge links two words whose similarity is not null

SNN
1. Find the elements that are the most representative of their neighborhood and take them as seeds
2. Aggregate the remaining elements to those seeds
The algorithm proceeds in seven steps:
– 1. sparsification of the similarity graph
– 2. building of the SNN graph
– 3. computation of the distribution of strong links
– 4. search for topic seeds and filtering of noise
– 5. building of text topics
– 6. removal of insignificant topics
– 7. extension of text topics

The resulting graph for a two-topic document

First stage
1. Keep only the links towards the k (k=10) most similar neighbors of each text word
2. The similarity graph is transposed into an SNN graph: the similarity between two words is given by the number of direct neighbors they share in the similarity graph
3. Strong links in the SNN graph are detected by applying a fixed threshold
4. A word with a high number of strong links is taken as the seed of a topic

Second stage
5. Build text topics by associating the remaining words to topic seeds
6. When the number of words of a topic is too small (size < 3), the topic is judged insignificant
7. The remaining text topics are extended by associating to them the words that are neither noise nor already part of a topic: the integration of words that are not strongly linked to a topic seed can be safely performed by relying on the average strength of their links in the SNN graph with the words of the topic
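The two stages can be condensed into a sketch. Here k=10 and the minimum topic size of 3 follow the slides; the strong-link threshold value, the seed-selection rule, and the omission of the noise-filtering and topic-extension steps are simplifications of mine:

```python
def snn_topics(words, sim, k=10, strong=3, min_size=3):
    """Shared Nearest Neighbor clustering of text words (compressed sketch).

    sim(a, b) returns the similarity of two words.  k-NN sparsification
    (k=10) and min topic size (3) follow the slides; 'strong' is assumed.
    """
    # 1. sparsification: keep only the k most similar neighbors of each word
    neighbors = {w: set(sorted((x for x in words if x != w and sim(w, x) > 0),
                               key=lambda x: sim(w, x), reverse=True)[:k])
                 for w in words}

    # 2. SNN graph: link strength = number of shared direct neighbors
    def snn(a, b):
        return len(neighbors[a] & neighbors[b])

    # 3. strong links: SNN strength above a fixed threshold
    strong_links = {w: {x for x in words if x != w and snn(w, x) >= strong}
                    for w in words}

    # 4-5. words with many strong links act as topic seeds, densest first;
    # remaining strongly-linked words are aggregated to them
    topics, assigned = [], set()
    for seed in sorted(words, key=lambda w: len(strong_links[w]), reverse=True):
        if seed in assigned or not strong_links[seed]:
            continue
        topic = {seed} | (strong_links[seed] - assigned)
        assigned |= topic
        topics.append(topic)

    # 6. remove insignificant topics (size < 3)
    return [t for t in topics if len(t) >= min_size]
```

On a toy text whose words fall into two clearly separated co-occurrence groups, this recovers two topics, one per group.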

Using Text Topics for Segmentation
– Topic segmentation using word reiteration
– Using text topics to enhance segmentation

Topic segmentation using word reiteration
The topic segmenter we propose, called F06
– evaluates the lexical cohesion of texts and then finds their topic shifts by identifying breaks in this cohesion
– starts with a linguistic pre-processing of texts
– evaluates lexical cohesion, as TextTiling does, with a fixed-size focus window that is moved over the text

The cohesion delimited by this window is evaluated by measuring the word reiteration between its two sides
The cohesion in the window at position x is given by an overlap measure between Wl and Wr (Equation 1 in the paper; the formula did not survive on the slide), where Wl (Wr) refers to the vocabulary of the left (right) side of the focus window
A cohesion value is computed for each sentence break of the text to segment, and the final result is a cohesion graph of the text
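Since the slide's equation is missing, here is one plausible instantiation of such an overlap, a Dice-style measure; this is an assumption, not necessarily the paper's Equation 1:

```python
def reiteration_cohesion(left_words, right_words):
    """Word-reiteration cohesion between the two sides of the focus window.

    A Dice-style overlap between the side vocabularies Wl and Wr is used
    here as a stand-in for the paper's Equation 1 (assumed, not verbatim).
    """
    wl, wr = set(left_words), set(right_words)
    if not wl or not wr:
        return 0.0
    return 2 * len(wl & wr) / (len(wl) + len(wr))
```

The value is 0 when the two sides share no word and 1 when their vocabularies are identical, matching the idea of a cohesion dip at topic shifts.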

The last part of our algorithm is mainly taken from the LCseg system
– computation of a score evaluating the probability of each minimum of the cohesion graph to be a topic shift
– removal of segments that are too small
– selection of topic shifts

The computation of the score of a minimum m begins by finding the pair of maxima l and r around it
Scores are between 0 and 1 and measure how large the difference is between the minimum and the maxima around it

The next step removes, as a possible topic shift, each minimum that is not farther than 2 sentences from its preceding neighbor
Finally, the selection of topic shifts is performed by applying a threshold computed from the distribution of minimum scores: a minimum m is kept as a topic shift if its score exceeds this threshold
– µ is the average of the minimum scores
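A sketch of the scoring and selection steps; the depth-based score, the µ − α·σ threshold form, and the value of α are assumptions in the spirit of LCseg (the slide only names µ), and the minimum-distance filter between adjacent minima is omitted for brevity:

```python
import statistics

def local_extrema(values):
    """Indices of the local minima and maxima of a cohesion curve."""
    minima, maxima = [], []
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            minima.append(i)
        elif values[i] > values[i - 1] and values[i] > values[i + 1]:
            maxima.append(i)
    return minima, maxima

def select_topic_shifts(values, alpha=0.5):
    """Score cohesion minima and keep those above a distribution threshold.

    score(m) = (c(l) - c(m) + c(r) - c(m)) / 2, with l and r the maxima
    around m (an assumed depth measure); m is kept if
    score(m) > mu - alpha * sigma, where mu and sigma describe the
    distribution of minimum scores.  alpha=0.5 is a hypothetical value.
    """
    minima, maxima = local_extrema(values)
    scores = {}
    for m in minima:
        left = max((x for x in maxima if x < m), default=None)
        right = min((x for x in maxima if x > m), default=None)
        if left is None or right is None:
            continue
        scores[m] = (values[left] - values[m] + values[right] - values[m]) / 2
    if not scores:
        return []
    mu = statistics.mean(scores.values())
    sigma = statistics.pstdev(scores.values())
    return [m for m, s in scores.items() if s > mu - alpha * sigma]
```

On a cohesion curve with one deep dip and one shallow one, only the deep dip survives the threshold.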

Using text topics to enhance segmentation
The weak point of the evaluation of lexical cohesion in the focus window is that it relies only on word reiteration
– two different words that respectively belong to Wl and Wr but also belong to the same text topic cannot contribute to the identification of a possible topical similarity between the two sides of the focus window

F06T
Extends the evaluation of lexical cohesion by taking into account the topical proximity of words
The evaluation of the cohesion in the focus window consists of
– computation of the word reiteration cohesion
– determination of the topic(s) of the window
– computation of the cohesion based on text topics and fusion of the two kinds of cohesion
Hence, a topic is considered representative of the content of the focus window only if it matches each side of this window

The last step of the cohesion evaluation
– first determines, for each side of the focus window, the number of its words that belong to one of the topics associated with the window
– the cohesion of the window is then computed from these two counts (Equation 3 in the paper; the formula did not survive on the slide)

Global Cohesion
The global cohesion in the focus window is computed as the sum of the two kinds of cohesion: the one computed from word reiteration (Equation 1) and the one computed from text topics (Equation 3)
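The combination can be sketched as follows; the Dice-style reiteration term and the per-side-fraction topic term are assumed stand-ins for Equations 1 and 3, whose exact normalizations are not on the slides:

```python
def topic_cohesion(left_words, right_words, window_topics):
    """Topic-based cohesion: mean fraction, over the two sides, of words
    belonging to one of the topics associated with the window (assumed
    form of Equation 3)."""
    topical = set().union(*window_topics) if window_topics else set()

    def side_fraction(words):
        ws = set(words)
        return len(ws & topical) / len(ws) if ws else 0.0

    return (side_fraction(left_words) + side_fraction(right_words)) / 2

def global_cohesion(left_words, right_words, window_topics):
    """Global cohesion = reiteration cohesion + topic cohesion (the sum,
    as stated on the slide)."""
    wl, wr = set(left_words), set(right_words)
    rei = 2 * len(wl & wr) / (len(wl) + len(wr)) if wl and wr else 0.0
    return rei + topic_cohesion(left_words, right_words, window_topics)
```

Two window sides that share few words but draw on the same text topic thus still obtain a non-zero cohesion, which is exactly what F06T adds over F06.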

Evaluation framework
Goal: verify that taking into account the discovered text topics can improve a topic segmentation algorithm that is initially based on word reiteration
Based on building artificial texts made of segments extracted from different documents
– only two source documents are used
– each of them is split into a set of segments whose size is between 3 and 11 sentences
– documents are taken from the corpus of the CLEF 2003 evaluation for cross-lingual information retrieval

The error metric Pk proposed in (Beeferman et al., 1999) is used to measure segmentation accuracy
– Pk evaluates the probability that a randomly chosen pair of sentences, separated by k sentences, is wrongly classified
WindowDiff (WD), a variant of Pk proposed in (Pevzner and Hearst, 2002), corrects some of its insufficiencies
All systems were used, as F06 and F06T were, without fixing the number of topic shifts to find
For each result, we also give the significance level pval of its difference in Pk with F06 and F06T
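A sketch of Pk over segmentations given as lists of segment lengths; setting k to half the mean reference segment length is the usual convention, assumed here rather than taken from the slides:

```python
def p_k(reference, hypothesis, k=None):
    """Pk error metric (Beeferman et al., 1999), sketched.

    reference / hypothesis are segmentations given as lists of segment
    lengths in sentences.  Pk is the proportion of sentence pairs at
    distance k that the hypothesis classifies differently from the
    reference (same segment vs. different segments)."""
    def labels(seg_lengths):
        out = []
        for seg_id, length in enumerate(seg_lengths):
            out.extend([seg_id] * length)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    if k is None:
        # usual convention: half the mean reference segment length
        k = max(1, round(len(ref) / len(reference) / 2))
    errors = 0
    total = len(ref) - k
    for i in range(total):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
    return errors / total
```

A perfect hypothesis scores 0; merging two reference segments into one raises the score, since cross-boundary pairs are now wrongly classified as same-segment.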

Conclusion
F06T has significantly better results than F06
F06T has the best results among all the tested algorithms, with a significant difference in most cases
A difference of language may explain part of the results
– repetitions are more discouraged in French than in English from a stylistic viewpoint