
DOCUMENT UPDATE SUMMARIZATION USING INCREMENTAL HIERARCHICAL CLUSTERING
CIKM'10 (Dingding Wang, Tao Li)
Advisor: Koh, Jia-Ling
Presenter: Nonhlanhla Shongwe

Preview
- Introduction
- Incremental Hierarchical Clustering Based Document Update Summarization
  - Incremental Hierarchical Sentence Clustering (IHSC)
    - The COBWEB algorithm
    - COBWEB for text
  - The Algorithm
- Evaluation measures
- Experiments and results

Introduction
Document summarization has been receiving much attention because:
- the number of documents on the internet keeps increasing
- it helps readers extract the information they are interested in efficiently
Most document summarization techniques operate in batch mode.

Introduction (cont'd)
The two most widely used summarization approaches are clustering-based and graph-ranking-based. First, clustering-based:
- A term-sentence matrix is formed from the documents
- Sentences are grouped into different clusters
- A score is attached to each sentence using average cosine similarity
- The highest-scoring sentence in each cluster is included in the summary

Introduction (cont'd)
Second, graph-ranking-based:
- Constructs a sentence graph in which each node is a sentence in the document collection
- An edge is formed between a sentence pair if:
  - the similarity between the two sentences is above a threshold, or
  - they belong to the same document
- Sentences are selected to form the summary by voting from their neighbors, as in the sketch below
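To make the "voting from neighbors" idea concrete, here is a minimal LexRank-style sketch in Python. The threshold, damping factor d = 0.85, and iteration count are assumed values, and the same-document edge rule is omitted for brevity; this is an illustration, not the exact method of the cited systems.

```python
def graph_rank(sim_matrix, threshold=0.1, d=0.85, iters=50):
    """Sketch of graph-based sentence ranking.  sim_matrix[i][j] is the
    cosine similarity between sentences i and j.  An edge is kept where
    similarity exceeds the threshold; sentences are then scored by power
    iteration (eigenvector-centrality-style voting from neighbors)."""
    n = len(sim_matrix)
    adj = [[1.0 if i != j and sim_matrix[i][j] > threshold else 0.0
            for j in range(n)] for i in range(n)]
    out_deg = [sum(row) or 1.0 for row in adj]  # guard isolated nodes
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n +
                  d * sum(adj[j][i] / out_deg[j] * scores[j] for j in range(n))
                  for i in range(n)]
    return scores  # highest-scoring sentences form the summary
```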

Introduction (cont'd)
With the rapid growth of document collections, existing summaries must be updated when new documents arrive. Traditional methods are not suitable for this task: most of them work in batch mode, meaning all of the documents have to be processed again every time new documents come in, which is inefficient.

Introduction (cont'd)
This paper:
- integrates document summarization techniques into an incremental hierarchical clustering framework
- re-organizes the sentence clusters immediately after new documents arrive, so that the corresponding summaries can be updated efficiently

INCREMENTAL HIERARCHICAL CLUSTERING BASED DOCUMENT UPDATE SUMMARIZATION
1. Framework
2. Preprocessing
3. Incremental Hierarchical Sentence Clustering (IHSC)
   I. The COBWEB algorithm
   II. COBWEB for text
4. Representative Sentence Selection for Each Node of the Hierarchy
5. The Algorithm

Framework (system framework figure)

Preprocessing
Given a collection of documents:
1. Decompose the documents into sentences
2. Remove stop words
3. Perform word stemming
4. Construct the term-sentence matrix, in which each element is a term frequency
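A minimal sketch of this pipeline, using a toy stopword list, a naive regex sentence splitter, and a crude suffix stemmer as stand-ins for real components (e.g., a Porter stemmer):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # toy list

def naive_stem(word):
    # Crude stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(documents):
    """Return (sentences, term-frequency vectors), one vector per sentence."""
    sentences, tf_vectors = [], []
    for doc in documents:
        # Step 1: decompose the document into sentences.
        for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if not sent:
                continue
            tokens = re.findall(r"[a-z]+", sent.lower())
            # Steps 2-3: remove stop words, then stem.
            terms = [naive_stem(t) for t in tokens if t not in STOPWORDS]
            sentences.append(sent)
            # Step 4: each matrix entry is a term frequency.
            tf_vectors.append(Counter(terms))
    return sentences, tf_vectors
```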

Incremental Hierarchical Sentence Clustering (IHSC)
The update summarization system uses Incremental Hierarchical Clustering (IHC). Benefits of the IHC method:
- It can efficiently process dynamic document collections as new documents are added
- A hierarchy is built to facilitate browsing by users
- The number of clusters does not need to be pre-defined

The COBWEB algorithm
COBWEB is one of the most popular incremental hierarchical clustering algorithms. It is based on a heuristic measure called Category Utility (CU):

CU(C_1, ..., C_K) = (1/K) * Σ_k P(C_k) * [ Σ_i Σ_j P(A_i = V_ij | C_k)^2 − Σ_i Σ_j P(A_i = V_ij)^2 ]

where C_1, ..., C_K are the clusters, P(C_k) is the probability of a document belonging to cluster C_k, and K is the total number of clusters.

The COBWEB algorithm (cont'd)
- A_i: the i-th attribute of the items being clustered
- V_ij: the j-th value of the i-th attribute
For example, with A1 ∈ {male, female} and A2 ∈ {red, green, blue}: V_12 = female, V_22 = green.
Under a probability-matching guessing strategy, Σ_i Σ_j P(A_i = V_ij | C_k)^2 is the expected number of times we can correctly guess the value of the multinomial variable A_i to be V_ij for an item in cluster k. A good cluster, in which the attributes of the items take similar values, will have a high score. COBWEB maximizes the summed CU score over all possible assignments of a document to a cluster.
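A small sketch of CU for a flat partition, directly following the formula above; for text, attribute = term and value = its count. The example data mirrors the slide's attributes:

```python
from collections import Counter

def category_utility(clusters):
    """Category Utility for a flat partition.  Each item is a mapping
    {attribute: value}; for text, attribute = term, value = its count."""
    items = [item for cluster in clusters for item in cluster]
    n, K = len(items), len(clusters)

    def sum_sq(group):
        # sum over i, j of P(A_i = V_ij | group)^2
        counts = Counter((a, v) for item in group for a, v in item.items())
        return sum((c / len(group)) ** 2 for c in counts.values())

    base = sum_sq(items)  # unconditional term: sum of P(A_i = V_ij)^2
    return sum((len(c) / n) * (sum_sq(c) - base) for c in clusters) / K

# The slide's example attributes: A1 in {male, female}, A2 in {red, green, blue}.
c1 = [{"A1": "male", "A2": "red"}, {"A1": "male", "A2": "red"}]
c2 = [{"A1": "female", "A2": "green"}]
print(category_utility([c1, c2]))  # well-separated clusters -> positive CU
```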

The COBWEB algorithm (cont'd)
For each incoming item, the COBWEB algorithm can perform:
- Insert: add the sentence to an existing cluster
- Create: create a new cluster
- Merge: combine two clusters into a single cluster
- Split: divide an existing cluster into several clusters
The operation actually applied is the one yielding the highest CU, as sketched below.
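A deliberately simplified sketch of that choice. It works on a single flat level (real COBWEB recurses down the tree), and the merge/split candidates are picked naively rather than from the two best hosts; only the choose-the-max-CU control flow is faithful:

```python
def apply_best_operator(clusters, item, cu):
    """One COBWEB-style step over one level of the hierarchy.  Every
    candidate partition is scored with the category-utility function
    `cu` (e.g., category_utility above); the best one wins."""
    candidates = []
    # Insert: add the item to each existing cluster in turn.
    for i in range(len(clusters)):
        p = [c + [item] if j == i else list(c) for j, c in enumerate(clusters)]
        candidates.append(("insert", p))
    # Create: start a new singleton cluster for the item.
    candidates.append(("create", [list(c) for c in clusters] + [[item]]))
    if len(clusters) >= 2:
        # Merge: combine two clusters (naively the first two here) and
        # insert the item into the merged cluster.
        rest = [list(c) for c in clusters[2:]]
        candidates.append(("merge", rest + [clusters[0] + clusters[1] + [item]]))
    if len(clusters[0]) >= 2:
        # Split: break a cluster in two (a crude stand-in for promoting a
        # node's children) and give the item its own cluster.
        half = len(clusters[0]) // 2
        rest = [list(c) for c in clusters[1:]]
        candidates.append(
            ("split", rest + [clusters[0][:half], clusters[0][half:], [item]]))
    op, best = max(candidates, key=lambda t: cu(t[1]))
    return op, best
```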

The COBWEB algorithm (cont'd)
Example (figure: tree operations)

COBWEB for text
The original COBWEB assumes normally distributed attributes, which is not suitable for text data. Documents are represented in the "bag of words" model, where the terms are the attributes. A better approach is to calculate CU using Katz's distribution.

COBWEB for text (cont'd)
Katz's model: suppose word i occurs k times in a document. Then

P(0) = 1 − df/N
- df: document frequency
- N: total number of documents

p = (cf − df) / cf = Pr(the word repeats | the word occurs)
- cf: collection frequency

Therefore (1 − p) is the probability of the word occurring only once, given that it occurs.

COBWEB for text (cont'd)
Katz's distribution for the number of occurrences k of a word is

P(k) = (1 − α) * δ_{k,0} + α * (1 − p) * p^k,  with δ_{k,0} = 1 when k = 0

Substituting k = 0 (so δ_{k,0} = 1) and adding both terms gives

P(0) = (1 − α) + α(1 − p) = 1 − αp,  hence  α = (1 − P(0)) / p
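A sketch of the parameter estimation and the resulting distribution, following the slide's estimators (df, cf, N); the guard for cf = df is my addition:

```python
def katz_params(df, cf, N):
    """Estimate Katz parameters for one word from corpus statistics:
    df = document frequency, cf = collection frequency, N = number of docs."""
    p0 = 1.0 - df / N                          # P(word does not occur)
    p = (cf - df) / cf if cf > df else 1e-9    # Pr(repeats | occurs); guard cf == df
    alpha = (1.0 - p0) / p                     # from P(0) = 1 - alpha * p
    return p, alpha

def katz_pk(k, p, alpha):
    """P(word occurs exactly k times in a document) under Katz's model."""
    if k == 0:
        return 1.0 - alpha * p
    return alpha * (1.0 - p) * p ** k
```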

COBWEB for text (cont'd)
This maps each attribute value f = V_ij (a term count) to the contribution of attribute i towards the category utility of cluster k: the sum Σ_f P(A_i = f | C_k)^2 in the CU formula is evaluated under Katz's distribution.
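Assuming that contribution is the sum of squared Katz probabilities, the geometric tail Σ_{k≥1} p^(2k) = p^2 / (1 − p^2) gives a closed form. This derivation is a sketch of mine from the formulas above, not a formula quoted from the paper:

```python
def katz_sum_sq(p, alpha):
    """sum over k of P(k)^2 under Katz's model, in closed form:
    (1 - alpha*p)^2 + alpha^2 * (1 - p) * p^2 / (1 + p)."""
    return (1 - alpha * p) ** 2 + alpha ** 2 * (1 - p) * p ** 2 / (1 + p)
```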

Representative Sentence Selection for Each Node of the Hierarchy
The update summarization system selects the most representative sentence to summarize each node and its subtree. Once a new sentence arrives, the sentence hierarchy is changed by one of the four operations, and the representatives are updated accordingly.

Representative Sentence Selection (cont'd)
Case 1: Insert a sentence into cluster k
Recalculate the representative sentence R_k of cluster k as the sentence maximizing a weighted combination (weight α) of its average similarity to the other sentences in the cluster and its similarity to the query, where
- K: the number of sentences in the cluster
- Sim(): the similarity function between sentence pairs (cosine similarity)
- α: a weighting parameter
A sketch follows below.
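A sketch of Case 1 under these definitions. The exact blend is reconstructed from the slide's description, and alpha = 0.5 is a placeholder default, since the slide's value for the parameter is cut off:

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse term-frequency vectors (dicts)."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def representative(cluster, query_vec, alpha=0.5):
    """Case 1: re-pick R_k as the sentence maximizing a weighted blend of
    average intra-cluster similarity and query similarity."""
    K = len(cluster)
    def score(s):
        avg = sum(cosine(s, t) for t in cluster if t is not s) / max(K - 1, 1)
        return alpha * avg + (1 - alpha) * cosine(s, query_vec)
    return max(cluster, key=score)
```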

Representative Sentence Selection (cont'd)
Case 2: Create a new cluster k
- The newly arrived sentence represents the new cluster: R_k = s_new
Case 3: Merge two clusters (cluster a and cluster b) into a new cluster (cluster c)
- The sentence obtaining the higher similarity with the query is selected as the representative sentence of the new merged node

Representative Sentence Selection (cont'd)
Case 4: Split a cluster into a set of clusters (cluster a into cluster 1, cluster 2, ..., cluster n)
- Remove node a and substitute it with the roots of its subtrees
- The corresponding representative sentences are the representative sentences of the original subtree roots

The Algorithm
Input:
- a query/topic the user is interested in
- a sequence of documents/sentences
1. Read one sentence and check whether it is relevant to the given topic, i.e., checkRelevance(sentence, topic)

The Algorithm (cont'd)
2. If relevant: initialize the hierarchy tree with the sentence as the root. Otherwise: remove it, read in the next sentence, and repeat Step 1 until a root node is formed.
3. Repeat:

The Algorithm (cont'd)
4. Read in the next sentence, starting from the root node. If the node is a leaf, go to Step 5; otherwise choose one of the following operations, whichever has the highest CU score:
   1. Insert into a node and conduct Case 1 summarization
   2. Create a node and conduct Case 2 summarization
   3. Merge nodes and conduct Case 3 summarization
   4. Split a node and conduct Case 4 summarization
5. If a leaf node is reached, create a new leaf node and merge the old leaf and the new leaf into one node; Case 2 and Case 3 summarization are conducted.

The Algorithm (cont'd)
6. Until the stopping condition is satisfied.
7. Cut the hierarchy tree at one layer to obtain a summary of the corresponding length.
Output:
- a sentence hierarchy
- the updated summary
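Putting the pieces together, a high-level sketch of the loop, reusing cosine(), representative(), and apply_best_operator() from the earlier sketches. The relevance check is approximated by a query-similarity threshold (an assumption standing in for checkRelevance), and flat clusters stand in for the real tree:

```python
def update_summarize(sentence_vecs, query_vec, cu, rel_threshold=0.1):
    """End-to-end sketch of the incremental update-summarization loop."""
    clusters, summary = [], []
    for s in sentence_vecs:
        # Step 1: keep only sentences relevant to the query/topic.
        if cosine(s, query_vec) < rel_threshold:
            continue
        if not clusters:
            clusters = [[s]]  # Step 2: first relevant sentence forms the root
        else:
            # Steps 3-5: pick insert/create/merge/split by highest CU.
            _, clusters = apply_best_operator(clusters, s, cu)
        # Step 7 analogue: one representative sentence per cluster.
        summary = [representative(c, query_vec) for c in clusters]
    return clusters, summary
```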

EXPERIMENTS
- Data Description
- Baselines
- Evaluation Measures
- Experimental Results

Data Description
Hurricane Wilma releases (Hurricane):
- 1,700 documents divided into 3 phases
TAC 2008 Update Summarization Track (TAC08):
- a benchmark dataset for update summarization
- 48 topics, with 20 newswire articles per topic

Baselines
The following widely used multi-document summarization methods were implemented as baseline systems:

Baseline     | Description
Random       | Selects sentences randomly for each document collection
Centroid     | Extracts sentences according to centroid value, positional value, and first-sentence overlap
LexPageRank  | Constructs a sentence connectivity graph based on cosine similarity, then selects important sentences based on the concept of eigenvector centrality
LSA          | Performs latent semantic analysis on the term-sentence matrix to select sentences having the greatest combined weights across all important topics

Evaluation Measures
The ROUGE toolkit is used to compare system summaries with the human summaries:

Method    | Description
ROUGE-1   | Uses unigrams
ROUGE-2   | Uses bigrams
ROUGE-L   | Uses the longest common subsequence (LCS)
ROUGE-SU  | Skip-bigram plus unigram

ROUGE-N = Σ_{S ∈ refs} Σ_{gram_n ∈ S} Count_match(gram_n) / Σ_{S ∈ refs} Σ_{gram_n ∈ S} Count(gram_n)
- Count_match(gram_n): the maximum number of n-grams co-occurring in the candidate summary and the reference summaries
- Count(gram_n): the number of n-grams in the reference summaries
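A sketch of ROUGE-N recall using exactly the two counts defined above (the real toolkit adds stemming, stopword handling, and the other variants):

```python
from collections import Counter

def rouge_n(candidate, references, n=1):
    """ROUGE-N recall: matched n-grams over total n-grams in the references.
    candidate: list of tokens; references: list of token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    matched = total = 0
    for ref in references:
        r = ngrams(ref)
        matched += sum(min(cand[g], c) for g, c in r.items())  # Count_match
        total += sum(r.values())                               # Count
    return matched / total if total else 0.0
```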

Experimental Results (results tables/figures)

Conclusion
- Traditional methods work in batch mode and are not suitable for incrementally updating summaries
- The paper proposes Incremental Hierarchical Clustering Based Document Update Summarization, built on Incremental Hierarchical Sentence Clustering (IHSC) with a COBWEB-for-text algorithm
- The method can perform Insert, Create, Merge, and Split operations
- IHSC outperforms the traditional methods and is more efficient

THANK YOU!