Presentation transcript:

Hierarchical Document Clustering Using Frequent Itemsets
Benjamin C. M. Fung, Ke Wang, and Martin Ester
Proceedings of the SIAM International Conference on Data Mining (SDM), 2003
Presenter: 吳建良
2015/12/25

Outline
Hierarchical Document Clustering
Proposed Approach: Frequent Itemset-based Hierarchical Clustering (FIHC)
Experimental Evaluation
Conclusions

Hierarchical Document Clustering
Document clustering: automatically organize documents into clusters so that documents within a cluster are highly similar, while documents in different clusters are very dissimilar.
Hierarchical document clustering additionally organizes the clusters into a topic hierarchy.
(Slide figure: an example hierarchy with Sports at the top, Soccer and Tennis below it, and Tennis ball below Tennis.)

Challenges in Hierarchical Document Clustering
High dimensionality
High volume of data
Consistently high clustering quality
Meaningful cluster description

Overview of FIHC
Pipeline: Preprocess documents (high-dimensional document vectors) → Generate frequent itemsets (reduced-dimension feature vectors) → Construct clusters → Build a cluster tree → Prune the cluster tree

Preprocessing
Remove stop words and apply stemming.
Construct the vector model: doc_i = (if_1, if_2, if_3, ..., if_m), where if_k is the frequency of item (word) k in the document.
(The slide's example table of document vectors is not preserved in the transcript.)
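A minimal Python sketch of this step (illustrative only: the stop-word list is a toy subset, and stemming is omitted to keep the example self-contained):

    # Minimal preprocessing sketch: stop-word removal and term-frequency vectors.
    # A real pipeline would also apply stemming (e.g. a Porter stemmer).
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for"}  # toy subset

    def preprocess(text):
        """Tokenize, lowercase, drop stop words, and return a term-frequency dict."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return Counter(t for t in tokens if t not in STOP_WORDS)

    docs = {
        "med.1": "the patient received treatment and the result was good",
        "cran.1": "laminar flow over the boundary layer forms a thin layer",
    }
    doc_vectors = {doc_id: preprocess(text) for doc_id, text in docs.items()}
    print(doc_vectors["cran.1"])   # Counter({'layer': 2, 'laminar': 1, 'flow': 1, ...})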

Generate Frequent Itemsets
Use the Apriori algorithm proposed by Agrawal et al. to find the global frequent itemsets.
Minimum global support: a percentage of all documents.
Global frequent itemset: a set of items (words) that appear together in more than a minimum global support of the whole document set.
Global frequent item: an item that belongs to some global frequent itemset.
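A rough Apriori-style sketch of mining global frequent itemsets over document item sets (not the authors' implementation; function and variable names are illustrative):

    # Apriori-style mining of global frequent itemsets: support is the fraction of
    # documents that contain every item of the itemset.
    from itertools import combinations

    def global_frequent_itemsets(doc_items, min_global_support):
        """doc_items: dict mapping doc id -> set of items. Returns {itemset: support}."""
        n_docs = len(doc_items)
        counts = {}
        for items in doc_items.values():                 # count 1-itemsets
            for item in items:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c / n_docs for s, c in counts.items() if c / n_docs >= min_global_support}
        result = dict(frequent)
        k = 2
        while frequent:
            prev = list(frequent)
            # candidate k-itemsets from unions of frequent (k-1)-itemsets
            candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
            counts = {c: sum(1 for items in doc_items.values() if c <= items) for c in candidates}
            frequent = {s: c / n_docs for s, c in counts.items() if c / n_docs >= min_global_support}
            result.update(frequent)
            k += 1
        return result

    doc_items = {
        "med.1": {"patient", "treatment", "result"},
        "med.2": {"patient", "treatment"},
        "cran.1": {"flow", "layer"},
        "cran.2": {"flow", "layer"},
    }
    print(global_frequent_itemsets(doc_items, min_global_support=0.5))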

Reduced-Dimension Vector Model
Start from the high-dimensional vector model.
Set the minimum global support (35% in the slide's example).
Store the frequencies only for the global frequent items.
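A small sketch of the projection step, assuming doc_vectors and the set of global frequent items as produced by the sketches above:

    # Keep frequencies only for global frequent items (dimensionality reduction).
    def reduce_dimensions(doc_vectors, global_frequent_items):
        return {
            doc_id: {item: vec.get(item, 0) for item in global_frequent_items}
            for doc_id, vec in doc_vectors.items()
        }

    doc_vectors = {"med.1": {"patient": 3, "treatment": 2, "hospital": 1}}
    print(reduce_dimensions(doc_vectors, {"patient", "treatment", "flow"}))
    # {'med.1': {'patient': 3, 'treatment': 2, 'flow': 0}}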

Construct Initial Clusters
Construct one cluster for each global frequent itemset; all documents containing this itemset are included in the same cluster, so the initial clusters overlap.
Each cluster's label is its defining itemset; e.g. the cluster label of C(result) is {result}.
(Slide figure: the initial clusters of the running example, C(flow), C(form), C(layer), C(patient), C(result), C(treatment), C(flow, layer) and C(patient, treatment), each listing the cran.*, cisi.* and med.* documents it contains.)
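A sketch of the initial-cluster construction under the same assumptions (clusters are keyed by their label itemsets; names are illustrative):

    # One initial cluster per global frequent itemset; a document joins every
    # cluster whose label itemset it contains, so clusters overlap at this stage.
    def initial_clusters(doc_items, frequent_itemsets):
        """doc_items: doc id -> set of items; frequent_itemsets: iterable of frozensets."""
        clusters = {}
        for label in frequent_itemsets:
            clusters[label] = {doc_id for doc_id, items in doc_items.items() if label <= items}
        return clusters

    doc_items = {"med.1": {"patient", "treatment"}, "med.5": {"patient"}, "cran.1": {"flow", "layer"}}
    labels = [frozenset({"patient"}), frozenset({"patient", "treatment"}), frozenset({"flow"})]
    print(initial_clusters(doc_items, labels))
    # {frozenset({'patient'}): {'med.1', 'med.5'}, frozenset({'patient', 'treatment'}): {'med.1'}, ...}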

Cluster Frequent Items
A global frequent item is cluster frequent in a cluster C_i if the item is contained in some minimum fraction of the documents in C_i.
Suppose the minimum cluster support is set to 70%.
Example for C(patient), whose documents med.1 ... med.6 have feature vectors over (flow, form, layer, patient, result, treatment) (the numeric values on the slide are not preserved):
  Item        Cluster support
  form        33%
  patient     100%
  result      66%
  treatment   83%
With a 70% threshold, patient and treatment are cluster frequent in C(patient).
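A sketch of computing cluster supports and the resulting cluster frequent items (the data below is illustrative, not the paper's exact numbers):

    # An item is cluster frequent in C_i if it appears in at least
    # `min_cluster_support` of the documents currently assigned to C_i.
    def cluster_frequent_items(cluster_doc_ids, doc_vectors, min_cluster_support):
        n = len(cluster_doc_ids)
        counts = {}
        for doc_id in cluster_doc_ids:
            for item, freq in doc_vectors[doc_id].items():
                if freq > 0:
                    counts[item] = counts.get(item, 0) + 1
        return {item: c / n for item, c in counts.items() if c / n >= min_cluster_support}

    doc_vectors = {
        "med.1": {"patient": 2, "treatment": 1, "result": 1},
        "med.2": {"patient": 1, "treatment": 1},
        "med.3": {"patient": 3, "result": 1},
    }
    print(cluster_frequent_items({"med.1", "med.2", "med.3"}, doc_vectors, 0.7))
    # {'patient': 1.0} -- treatment and result appear in only 2/3 of the documents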

(Slide figure: the initial clusters with their cluster frequent items, computed with minimum cluster support = 70%.)

Cluster Label vs. Cluster Frequent Items
Cluster label: the global frequent itemset used as the cluster's label; a set of mandatory items, i.e. every document in the cluster must contain all the items in the cluster label; used when establishing the hierarchical structure.
Cluster frequent items: items that appear in some minimum fraction of the documents in the cluster; used in the similarity measurement and as the topic description of the cluster.

Make Clusters Disjoint
The initial clusters are not disjoint, so the overlap is removed by assigning each document to the single "best" initial cluster.
Define a score function Score(C_i ← doc_j) that measures the goodness of a cluster C_i for a document doc_j.

Score Function
Assign each doc_j to the initial cluster C_i that has the highest score:
  Score(C_i ← doc_j) = Σ_x [ n(x) · cluster_support(x) ] − Σ_x' [ n(x') · global_support(x') ]
where
  x is a global frequent item in doc_j that is also cluster frequent in C_i,
  x' is a global frequent item in doc_j that is not cluster frequent in C_i,
  n(x) and n(x') are the frequencies of x and x' in the feature vector of doc_j.

Score Function (cont.)
If more than one cluster attains the highest score, choose the one whose cluster label contains the largest number of items.
Key idea: a cluster C_i is good for a document doc_j if many of the global frequent items in doc_j appear in many documents of C_i.

Score Function - Example
Cluster frequent items (with cluster supports) of the initial clusters:
  C(flow): flow 100%, layer 100%
  C(form): form 100%
  C(layer): flow 100%, layer 100%
  C(patient): patient 100%, treatment 83%
  C(result): result 100%, patient 80%, treatment 80%
  C(treatment): patient 100%, treatment 100%, result 80%
  C(flow, layer): flow 100%, layer 100%
  C(patient, treatment): patient 100%, treatment 100%, result 80%
Score med.6, whose feature vector over (flow, form, layer, patient, result, treatment) contains patient, result and treatment (the numeric values are not shown in the transcript, but the arithmetic below uses frequencies 9, 1 and 1):
  Against a cluster in which none of these items is cluster frequent (e.g. C(flow)): 0 + 0 − [(9 × 0.5) + (1 × 0.42) + (1 × 0.42)] = −5.34, using the global supports of patient, result and treatment (0.5, 0.42, 0.42).
  Against C(patient, treatment): (9 × 1) + (1 × 1) + (1 × 0.8) = 10.8, the highest score.
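A sketch of the score function under the definitions above; the med.6 frequencies (9, 1, 1 for patient, result, treatment) are inferred from the arithmetic shown on this slide:

    # Score(C_i <- doc_j): global frequent items of doc_j that are cluster frequent
    # in C_i add weight, the remaining ones subtract weight.
    def score(doc_vector, cluster_support, global_support):
        """
        doc_vector: item -> frequency in doc_j (global frequent items only)
        cluster_support: item -> support within the candidate cluster (cluster frequent items)
        global_support: item -> support over the whole document set
        """
        s = 0.0
        for item, freq in doc_vector.items():
            if freq == 0:
                continue
            if item in cluster_support:               # x: cluster frequent in C_i
                s += freq * cluster_support[item]
            else:                                     # x': not cluster frequent in C_i
                s -= freq * global_support.get(item, 0.0)
        return s

    doc = {"patient": 9, "result": 1, "treatment": 1}   # assumed med.6 frequencies
    print(score(doc, {"patient": 1.0, "treatment": 1.0, "result": 0.8},
                {"patient": 0.5, "result": 0.42, "treatment": 0.42}))   # 10.8
    print(score(doc, {"flow": 1.0, "layer": 1.0},
                {"patient": 0.5, "result": 0.42, "treatment": 0.42}))   # -5.34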

Recompute the Cluster Frequent Items
After the clusters are made disjoint, recompute the cluster frequent items of each C_i, also including the documents of all descendants of C_i.
A cluster is a descendant of C_i if its cluster label is a superset of the cluster label of C_i.
Example: consider C(patient), which after the disjoint step contains only med.5 (its feature vector over (flow, form, layer, patient, result, treatment) is not preserved on the slide).
  C(patient) alone: form 100%, patient 100%
  C(patient) including its descendant C(patient, treatment): form 33%, patient 100%, result 66%, treatment 83%

Building the Cluster Tree
Put the more specific clusters at the bottom of the tree.
Put the more general clusters at the top of the tree.

Building the Cluster Tree (cont.)
Tree levels: level 0 is the root, marked "null", and stores the unclustered documents; at level k, a cluster's label is a global frequent k-itemset.
The tree is built bottom-up, starting from the clusters C_i with the largest number k of items in their cluster labels.
For each C_i, identify all potential parents, i.e. the (k−1)-clusters whose cluster label is a subset of C_i's cluster label, and choose the "best" among the potential parents.

Building the Cluster Tree (cont.)
The criterion for selecting the best parent is similar to choosing the best cluster for a document:
(1) Merge all the documents in the subtree of C_i into a single conceptual document doc(C_i).
(2) Compute the score of doc(C_i) against each potential parent C_j.
The potential parent with the highest score becomes the parent of C_i.
All leaf clusters that contain no document can be removed.
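A sketch of this parent-selection step; score is the scoring function sketched earlier, and the support dictionaries are assumed precomputed per cluster (all names are illustrative):

    def choose_parent(child_label, subtree_doc_vectors, potential_parents,
                      cluster_supports, global_support, score):
        """
        child_label: frozenset label of C_i
        subtree_doc_vectors: term-frequency dicts of all documents in C_i's subtree
        potential_parents: frozenset labels of the candidate (k-1)-clusters
        cluster_supports: parent label -> {item: cluster support of that item}
        score: the scoring function sketched earlier
        """
        # (1) merge the subtree's documents into one conceptual document doc(C_i)
        conceptual_doc = {}
        for vec in subtree_doc_vectors:
            for item, freq in vec.items():
                conceptual_doc[item] = conceptual_doc.get(item, 0) + freq
        # (2) score doc(C_i) against every potential parent whose label is a subset
        candidates = [p for p in potential_parents if p < child_label]
        if not candidates:
            return None
        return max(candidates,
                   key=lambda p: score(conceptual_doc, cluster_supports[p], global_support))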

Example
Start from the 2-clusters C(flow, layer) and C(patient, treatment).
C(flow, layer) is empty, so it is removed.
For C(patient, treatment), the potential parents are C(patient) and C(treatment); C(treatment) is empty and is removed, and C(patient) gets the higher score and becomes the parent of C(patient, treatment).
Resulting tree:
  null (root)
    C(flow): cran.1, cran.2, cran.3, cran.4, cran.5
    C(form): cisi.1
    C(patient): med.5
      C(patient, treatment): med.1, med.2, med.3, med.4, med.6

Prune Cluster Tree
With a small minimum global support, the cluster tree can be broad and deep, documents of the same topic get distributed over several small clusters, and the clustering accuracy is poor.
The aim of tree pruning is to produce a natural topic hierarchy for browsing and to increase the clustering accuracy.

Inter-Cluster Similarity
Inter_Sim of C_a and C_b is the geometric mean of the two directed similarities:
  Inter_Sim(C_a ↔ C_b) = [ Sim(C_a ← C_b) × Sim(C_b ← C_a) ]^(1/2)
Sim(C_i ← C_j) is calculated by reusing the score function, treating all the documents of C_j as one conceptual document.

Property of Sim(C_i ← C_j)
Global support and cluster support are between 0 and 1, so the maximum value of Score is Σ_x n(x) and the minimum is −Σ_x' n(x').
Normalizing Score by Σ_x n(x) + Σ_x' n(x') keeps Sim within −1 to 1; adding the term +1 avoids negative similarity values:
  Sim(C_i ← C_j) = Score(C_i ← doc(C_j)) / ( Σ_x n(x) + Σ_x' n(x') ) + 1
The range of the Sim function is therefore 0 to 2, and so is the range of Inter_Sim.
If the Inter_Sim value is below 1, the weight of dissimilar items has exceeded the weight of similar items, so 1 is a good threshold to distinguish two clusters.
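A sketch of Sim and Inter_Sim with the normalization described above; the conceptual documents and support dictionaries are assumed available, and all names are illustrative:

    import math

    def sim(conceptual_doc, cluster_support, global_support):
        """Sim(C_i <- C_j): score of C_j's conceptual document against C_i, mapped into [0, 2]."""
        s, total = 0.0, 0
        for item, freq in conceptual_doc.items():
            total += freq
            if item in cluster_support:            # item is cluster frequent in C_i
                s += freq * cluster_support[item]
            else:                                  # item is not cluster frequent in C_i
                s -= freq * global_support.get(item, 0.0)
        return s / total + 1 if total else 1.0

    def inter_sim(doc_a, doc_b, support_a, support_b, global_support):
        """Inter_Sim(C_a <-> C_b): geometric mean of Sim(C_a <- C_b) and Sim(C_b <- C_a)."""
        return math.sqrt(sim(doc_b, support_a, global_support) *
                         sim(doc_a, support_b, global_support))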

Child Pruning
Objective: shorten the depth of the tree.
Procedure:
1. Scan the tree in bottom-up order.
2. For each non-leaf node, calculate Inter_Sim between the node and each of its children.
3. If Inter_Sim is above 1, prune the child cluster.
4. If a cluster is pruned, its children become the children of their grandparent.
Child pruning is only applicable at level 2 and below (level-1 clusters are handled by sibling merging).
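A sketch of this pruning pass over a simple dict-based tree (node -> list of child clusters); tree, node and inter_sim_fn are illustrative names, with inter_sim_fn assumed to compute Inter_Sim as above. The function is meant to be called on each level-1 cluster:

    def prune_children(tree, node, inter_sim_fn):
        """tree: dict mapping a cluster to the list of its child clusters."""
        for child in list(tree.get(node, [])):
            prune_children(tree, child, inter_sim_fn)      # bottom-up: prune grandchildren first
            if inter_sim_fn(node, child) > 1:              # child is too similar to its parent
                tree[node].remove(child)
                tree[node].extend(tree.pop(child, []))     # its children move up to the grandparent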

Example
Determine whether cluster C(patient, treatment) should be pruned.
Compute the inter-cluster similarity between C(patient) and C(patient, treatment).
For Sim(C(patient) ← C(patient, treatment)), combine all the documents in cluster C(patient, treatment) by adding up their feature vectors over (flow, form, layer, patient, result, treatment) (the numeric vectors of med.1, med.2, med.3, med.4, med.6 and their sum are not preserved in the transcript).

Example (cont.)
Sim(C(patient, treatment) ← C(patient)) = 1.92.
The resulting Inter_Sim(C(patient) ↔ C(patient, treatment)) is above 1, so cluster C(patient, treatment) is pruned.
Tree after pruning:
  null (root)
    C(flow): cran.1, cran.2, cran.3, cran.4, cran.5
    C(form): cisi.1
    C(patient): med.1, med.2, med.3, med.4, med.5, med.6

Sibling Merging
Sibling merging is applicable at level 1.
Procedure:
1. Calculate Inter_Sim for each pair of clusters at level 1.
2. Merge the cluster pair that has the highest Inter_Sim.
Repeat these steps until either the user-specified number of clusters is reached, or all cluster pairs at level 1 have Inter_Sim below or equal to 1.
Tree after sibling merging (C(form) has been merged into C(flow)):
  null (root)
    C(flow): cran.1, cran.2, cran.3, cran.4, cran.5, cisi.1
    C(patient): med.1, med.2, med.3, med.4, med.5, med.6
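A sketch of sibling merging at level 1; inter_sim_fn is assumed as above, and the bookkeeping of actually folding one cluster's documents and children into the other is left out:

    from itertools import combinations

    def merge_siblings(level1_clusters, inter_sim_fn, target_k):
        """Merge level-1 clusters pairwise until target_k clusters remain or no pair exceeds 1."""
        clusters = list(level1_clusters)
        while len(clusters) > max(target_k, 1):
            a, b = max(combinations(clusters, 2), key=lambda pair: inter_sim_fn(*pair))
            if inter_sim_fn(a, b) <= 1:
                break                       # all remaining pairs are sufficiently dissimilar
            clusters.remove(b)              # fold b into a; document/children bookkeeping omitted
        return clusters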

Experimental Evaluation
Datasets (the slide's dataset table is not preserved in the transcript).
Clustering quality is measured by the F-measure. Let n_ij be the number of documents of natural class i that appear in cluster j, n_i the size of class i, n_j the size of cluster j, and n the total number of documents:
  Recall(i, j) = n_ij / n_i
  Precision(i, j) = n_ij / n_j
  Corresponding F-measure: F(i, j) = 2 · Recall(i, j) · Precision(i, j) / ( Recall(i, j) + Precision(i, j) )
  F-measure for the whole clustering result: F = Σ_i ( n_i / n ) · max_j F(i, j)
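A sketch of this F-measure, given the natural classes and the clusters of the tree as sets of document ids (the sample data is illustrative):

    # Overall F-measure: for each natural class take the best F score over all
    # clusters, weighted by the class size.
    def f_measure(classes, clusters):
        """classes / clusters: dicts mapping label -> set of document ids."""
        n = sum(len(docs) for docs in classes.values())
        total = 0.0
        for class_docs in classes.values():
            best = 0.0
            for cluster_docs in clusters.values():
                n_ij = len(class_docs & cluster_docs)
                if n_ij == 0:
                    continue
                recall = n_ij / len(class_docs)
                precision = n_ij / len(cluster_docs)
                best = max(best, 2 * recall * precision / (recall + precision))
            total += len(class_docs) / n * best
        return total

    classes = {"med": {"med.1", "med.2", "med.3"}, "cran": {"cran.1", "cran.2"}}
    clusters = {"C1": {"med.1", "med.2", "cran.1"}, "C2": {"med.3", "cran.2"}}
    print(round(f_measure(classes, clusters), 3))   # 0.6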


Efficiency & Scalability 31

Conclusions & Discussion This research exploits frequent itemsets for Define a cluster Use score function, construct initial clusters, make disjoint clusters Organize the cluster hierarchy Build cluster tree, prune cluster tree Discussion: Use unordered frequent word sets Different order of words may deliver different meaning Multiple topics of documents 32