Frequent Item Based Clustering
M.Sc. Student: Homayoun Afshar
Supervisor: Martin Ester
Contents
- Introduction and motivation
- Frequent item sets
- Text data as transactional data
- Cluster set definition
- Our approach
- Test data set, results, challenges
- Related work
- Conclusion
Introduction and Motivation
- Huge amount of information online
- Much of this information is in text format, e.g. e-mails, web pages, newsgroup postings, ...
- Need to group related documents
- A nontrivial task
Frequent Item Sets
- Given a dataset D = {t_1, t_2, ..., t_n}
- Each t_i is a transaction, t_i ⊆ I, where I is the set of all items
- Given a threshold min_sup, an item set i ⊆ I such that |{t : i ⊆ t and t ∈ D}| > min_sup is a frequent item set with respect to minimum support min_sup
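The definition above can be sketched directly in code. This is a hypothetical brute-force illustration (the function name and the toy dataset are my own, not from the slides); a real implementation would use a level-wise algorithm such as Apriori [AS94] rather than enumerating every candidate subset.

```python
from itertools import combinations

def frequent_item_sets(transactions, min_sup, max_size=3):
    # Brute-force illustration of the definition: an item set i is frequent
    # iff it is contained in more than min_sup transactions of D.
    items = sorted({item for t in transactions for item in t})
    result = []
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support > min_sup:
                result.append((cand, support))
    return result

D = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
print(frequent_item_sets(D, min_sup=2))
# each single item occurs in 3 of the 4 transactions; every pair in only 2
```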
Text Data as Transactional Data
- Treat each word as an item
- Treat each document as a transaction
- Using a minimum support, find the frequent item sets (frequent word sets)
- Frequent word sets = frequent item sets over this representation
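A minimal sketch of this document-to-transaction mapping (the stop-word list and tokenization are placeholders; the experiments later in the talk also apply stemming, which is omitted here):

```python
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and"}  # toy stop-word list

def doc_to_transaction(text):
    # A document becomes the set of its distinct non-stop words:
    # word order and frequency are discarded, exactly as in a transaction.
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return {w for w in words if w and w not in STOP_WORDS}

print(doc_to_transaction("The net profit of the year, said Reuter."))
```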
Cluster Set Definition
- f = {X_1, X_2, ..., X_n} is the set of all frequent item sets with respect to some minimum support
- c = {C_1, C_2, ..., C_m} is a cluster set, where each C_i is the set of documents covered by some X_k ∈ f
- And ...
Cluster Set Definition ...
An optimal cluster set has to:
- Cover the whole data set
- Minimize the mutual overlap between its clusters
- Keep the clusters roughly the same size
Our Approach: Frequent-Item Based Clustering ...
- Find all the frequent word sets
- Form cluster sets with just one cluster
  - Overlap is zero
  - Coverage is the support of the frequent item set representing the cluster
- Form cluster sets with two clusters
  - Find the overlap and coverage
Our Approach: Frequent-Item Based Clustering ...
- Prune the candidate list of cluster sets
  - If Cov(c_i) ≤ Cov(c_j) and Overlap(c_i) > Overlap(c_j), where c_i and c_j are candidates at the same level, remove c_i
  - Remove c_i if Overlap(c_i) >= |Cov(c_i)|
- Generate the next level
- Find overlap and coverage, then prune again
- Stop when there are no more candidates left
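The two pruning rules applied at each level might look like the following sketch (the data structures and names are assumptions for illustration, not the actual implementation; a candidate here is just a labeled pair of coverage and overlap counts):

```python
def prune_level(candidates):
    """candidates: list of (name, coverage, overlap) tuples for one level."""
    survivors = []
    for name, cov, ov in candidates:
        # Rule 2: drop a candidate whose overlap reaches its coverage.
        if ov >= cov:
            continue
        # Rule 1: drop a candidate dominated by another candidate at the
        # same level, i.e. one with at least its coverage but less overlap.
        dominated = any(
            cov <= cov2 and ov > ov2
            for name2, cov2, ov2 in candidates
            if name2 != name
        )
        if not dominated:
            survivors.append((name, cov, ov))
    return survivors

level2 = [("c1", 5259, 1), ("c2", 5259, 40), ("c3", 100, 120)]
print(prune_level(level2))  # c2 is dominated by c1; c3 violates rule 2
```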
Our Approach: Coverage and Overlap ...
- Use a bit matrix
  - Each column is a document
  - Each row is a frequent word set
- Coverage: OR the rows, then count the 1s
- Overlap: a combination of XOR, OR, and AND, then count the 1s
Our Approach: Coverage and Overlap ...
[Worked example on three bit rows (1st, 2nd, 3rd); the bit patterns did not survive extraction]
- Coverage: OR all rows, then count the 1s -> coverage = 6
- Cost = 2 ORs + counting the 1s
- Cost of counting the 1s = 8 operations (shifts, ANDs, adds)
Our Approach: Coverage and Overlap ...
Overlap (same three rows):
- AND the 1st and 2nd rows = (i)
- XOR the 1st and 2nd rows = (ii)
- AND the 3rd row with (ii) = (iii)
- OR (i) and (iii), then count the 1s -> overlap = 4
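Using Python integers as the bit rows, coverage and overlap can be computed with exactly these bitwise operations. The AND/XOR/OR cascade above generalizes to "documents covered by at least two clusters", which is what the loop below tracks; this is a sketch under that reading, not the actual implementation, and the example rows are my own (the originals were lost).

```python
def coverage(rows):
    # OR all rows together, then count the 1s:
    # the number of documents covered at least once.
    acc = 0
    for r in rows:
        acc |= r
    return bin(acc).count("1")

def overlap(rows):
    # Track documents seen once vs. seen at least twice. For three rows this
    # reduces to the (1 AND 2) OR (3 AND (1 XOR 2)) cascade from the slide.
    seen = multi = 0
    for r in rows:
        multi |= seen & r   # covered before and covered again -> overlap
        seen |= r
    return bin(multi).count("1")

rows = [0b001111, 0b011110, 0b110000]  # three frequent word sets, 6 documents
print(coverage(rows), overlap(rows))   # -> 6 4
```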
Test Data, Results, Challenges
- Test data set: Reuters news documents
  - 8655 of them have exactly one topic
- Preprocessing: remove stop words, stem all words
- Number of frequent word sets at three min_sup levels: 5% min_sup = ..., ...% min_sup = ..., ...% min_sup = 78 [counts partially lost in extraction]
Test Data, Results, Challenges
With 20% min support:
- Sample 2-cluster candidate set: {(said, reuter), (line, ct, vs)}
  - Overlap = 1, Coverage = 5259
- Sample 5-cluster candidate set: {(reuter), (vs), (net), (line, ct, net), (vs, net, shr)}
  - Overlap = 3303, Coverage = 8609
Test Data, Results, Challenges
More results with min_sup = 10%:
- {(reuter), (includ), (mln, includ), (mln, profit), (year, ct), (year, mln, net)}
  - 6-cluster cluster set: Coverage = 8616, Overlap = 2553
- {(reuter), (loss), (profit), (year, 1986), (mln, profit), (year, ct), (year, mln, net)}
  - 7-cluster cluster set: Coverage = 8611, Overlap = 2705
- {(reuter), (loss), (profit), (year, 1986), (mln, includ), (mln, profit), (year, ct), (year, mln, net)}
  - 8-cluster cluster set: Coverage = 8616, Overlap = 3033
Test Data, Results, Challenges
- At lower support values, pruning is very slow
- 2-cluster sets with min_sup = 20%:
  - Creating = ... seconds, Updating (overlap and coverage) = ... seconds, Pruning = ... seconds, Sorting = ... seconds [timings lost in extraction]
- Number of candidates: 3003 before pruning, 73 after pruning
Test Data, Results, Challenges
Open challenges:
- Hierarchical clustering
- Measuring clustering quality
  - In our test data set: entropy
  - In real data sets, the classes are not known
- Making the pruning more efficient
  - Defining an upper threshold
  - Using ratios to prune candidates [the ratio formulas were lost in extraction]
  - Using only maximal item sets
Related Work
- Similar idea: Frequent Term-Based Text Clustering [BEX02]
  - Florian Beil, Martin Ester, Xiaowei Xu
  - FTC: focuses on finding one optimal, non-overlapping clustering
  - HFTC: hierarchical, overlapping clustering
Conclusion
To get an optimal clustering:
- Reduce the minimum support
- Reduce the number of frequent items
  - Introduce a maximum support
  - Use only maximal item sets
- Better (faster) pruning
- Hierarchical clustering
References
[AS94] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. Int. Conf. Very Large Data Bases (VLDB'94), Santiago, Chile, Sept. 1994.
[BEX02] F. Beil, M. Ester, X. Xu. Frequent Term-Based Text Clustering. In Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD'02), 2002.
J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.