Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan


Succinct Summarization of Transactional Databases: An Overlapped Hyperrectangle Scheme
Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan
Kent State University
Presented by: Yang Xiang

Introduction
How can we succinctly describe a transactional database? Cover it with hyperrectangles (Cartesian products of a transaction set and an itemset), e.g. {t1,t2,t7,t8}×{i1,i2,i8,i9}, {t4,t5}×{i4,i5,i6}, and {t2,t3,t6,t7}×{i2,i3,i7,i8}.
Summarization = a set of hyperrectangles (Cartesian products) with minimal cost that covers ALL the cells (transaction-item pairs) of the transactional database.

Problem Formulation
For a hyperrectangle Ti × Ii, we define its cost to be |Ti| + |Ii|.
Example: over transactions t1..t8 and items i1..i9, the cover {t1,t2,t7,t8}×{i1,i2,i8,i9}, {t2,t3,t6,t7}×{i2,i3,i7,i8}, {t4,t5}×{i4,i5,i6} has total covering cost = 8 + 8 + 5 = 21.
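
As a quick check of the cost definition above, a minimal Python sketch (the function and variable names are illustrative, not from the paper):

```python
# Covering cost of a set of hyperrectangles, where each hyperrectangle is a
# (transaction_set, item_set) pair and cost(T x I) = |T| + |I|.

def covering_cost(hyperrectangles):
    """Total cost of a list of (transaction_set, item_set) pairs."""
    return sum(len(ts) + len(its) for ts, its in hyperrectangles)

cover = [
    ({"t1", "t2", "t7", "t8"}, {"i1", "i2", "i8", "i9"}),  # cost 4 + 4 = 8
    ({"t2", "t3", "t6", "t7"}, {"i2", "i3", "i7", "i8"}),  # cost 4 + 4 = 8
    ({"t4", "t5"}, {"i4", "i5", "i6"}),                    # cost 2 + 3 = 5
]
print(covering_cost(cover))  # 21
```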

Related Work Data Descriptive Mining and Rectangle Covering [Agrawal94] [Lakshmanan02] [Gao06] Summarization for categorical databases [Wang06] [Chandola07] Data Categorization and Comparison [Siebes06] [Leeuwen06] [Vreeken07] Others: Co-clustering [Li05], Approximate Frequent Itemset Mining [Afrati04] [Pei04] [Steinbach04], Data Compression [Johnson04]…

Hardness Results Unfortunately, this problem and several variations are provably NP-hard! (Proof hint: reduce the minimum set cover problem to this problem.)

Weighted Set Cover Problem
The summarization problem is closely related to the weighted set cover problem:
Ground set  →  all cells of the database
Candidate sets (each with a weight)  →  all possible hyperrectangles (each with a cost)
Weighted set cover problem: use a subset of the candidate sets to cover the ground set with minimum total weight  →  use a subset of all possible hyperrectangles to cover the database with minimum total cost

Naïve Greedy Algorithm
Each round, choose the hyperrectangle Hi = Ti × Ii with the lowest price, price(Hi) = (|Ti| + |Ii|) / |(Ti × Ii) \ R|, where |Ti| + |Ii| is the hyperrectangle cost and R is the set of already-covered cells.
The approximation ratio is ln(k) + 1 [V.Chvátal 1979], where k is the number of selected hyperrectangles.
The problem? The number of candidate hyperrectangles is 2^(|T|+|I|) !!!
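
The greedy step can be sketched as follows, with an explicit (tiny) candidate list; the slide's point is that enumerating all 2^(|T|+|I|) candidates this way is infeasible. This is an assumed implementation, not the authors' code:

```python
# Greedy weighted set cover over hyperrectangle candidates: repeatedly pick
# the candidate T x I with the lowest price (|T| + |I|) / |newly covered cells|.
from itertools import product

def greedy_cover(cells, candidates):
    """cells: set of (t, i) pairs to cover; candidates: list of (T, I) pairs."""
    covered, chosen = set(), []
    while covered != cells:
        best, best_price, best_new = None, float("inf"), None
        for T, I in candidates:
            new = (set(product(T, I)) & cells) - covered  # cells this adds
            if new:
                price = (len(T) + len(I)) / len(new)
                if price < best_price:
                    best, best_price, best_new = (T, I), price, new
        if best is None:
            break  # remaining cells not coverable by any candidate
        chosen.append(best)
        covered |= best_new
    return chosen
```

On the three-hyperrectangle example above, the greedy loop selects all three candidates and covers the database exactly.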

Basic Idea-1: Restrict the number of candidates
A candidate is a hyperrectangle whose itemset is either frequent or a single item: Cα = {Ti × Ii | Ii ∈ Fα ∪ Is} is the set of candidates.
However, each such itemset still corresponds to an exponential number of hyperrectangles. For example, {1,2,3}×{a} corresponds to the hyperrectangles {1}×{a}, {2}×{a}, {3}×{a}, {1,2}×{a}, {1,3}×{a}, {2,3}×{a}, {1,2,3}×{a}. The number of hyperrectangles is still exponential.

Basic Idea-2
Given an itemset, we do NOT enumerate the exponential number of hyperrectangles sharing it. Instead, a linear-time algorithm finds the hyperrectangle with the lowest price among all hyperrectangles sharing the same itemset.

Idea-2 Illustration
For the itemset {i2,i4,i5,i7} with supporting transactions {t1,t3,t4,t6,t7,t9}, growing the transaction set gives prices 5/4 = 1.25, ..., 7/10 = 0.70, 8/12 = 0.67, 9/13 = 0.69, ... The hyperrectangle of lowest price is {t1,t4,t6,t7}×{i2,i4,i5,i7}, with price 8/12 = 0.67.
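
The linear scan behind this illustration can be sketched as follows (an assumed implementation, with hypothetical per-transaction gains chosen only to reproduce the slide's prices): for a fixed itemset, sort its supporting transactions by how many uncovered cells each contributes; then the best transaction set of each size k is a prefix of that order, so one pass over prefixes finds the global minimum price.

```python
def lowest_price_block(itemset, supporting_ts, uncovered):
    """itemset: the fixed itemset I; supporting_ts: transactions containing I;
    uncovered(t): number of still-uncovered cells in {t} x I.
    Returns (best transaction subset, its price)."""
    # For any fixed size k, the cheapest k-transaction block uses the k
    # largest gains, so scanning prefixes of the sorted order suffices.
    ranked = sorted(supporting_ts, key=uncovered, reverse=True)
    best_T, best_price = None, float("inf")
    T, new_cells = [], 0
    for t in ranked:
        T.append(t)
        new_cells += uncovered(t)
        if new_cells:
            price = (len(T) + len(itemset)) / new_cells
            if price < best_price:
                best_price, best_T = price, list(T)
    return best_T, best_price

# Hypothetical uncovered-cell counts consistent with the slide's price sequence:
gains = {"t1": 4, "t4": 3, "t6": 3, "t7": 2, "t3": 1, "t9": 0}
T, price = lowest_price_block({"i2", "i4", "i5", "i7"}, list(gains), gains.get)
# T == ["t1", "t4", "t6", "t7"], price == 8/12
```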

HYPER Algorithm
While there are some cells not covered:
[STEP 1] Calculate the lowest-price hyperrectangle for each frequent or single itemset (Basic Idea-2).
[STEP 2] Find the frequent or single itemset whose lowest-price hyperrectangle has the lowest price among all.
[STEP 3] Output this hyperrectangle.
[STEP 4] Update the coverage of the database.

HYPER
We assume the Apriori algorithm provides Fα. HYPER is able to find the best cover drawn from the exponential number of hyperrectangles described by the candidate set Cα = {Ti × Ii | Ii ∈ Fα ∪ Is}.
Properties: the approximation ratio is still ln(k) + 1 w.r.t. Cα, and the running time is polynomial in the number of candidate itemsets |Fα ∪ Is|.

Pruning Technique for HYPER
One important observation: as coverage grows, the price of each itemset's lowest-price hyperrectangle can only increase! (For example, the lowest price for {t1,t4,t6,t7}×{i2,i4,i5,i7} rises from 0.67 to 0.80 once some of its cells are covered.) An itemset whose previously computed price already exceeds the current best therefore need not be re-evaluated, which significantly reduces the time of STEP 1.
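
One way to exploit this monotonicity, sketched here as an assumption about the implementation (the slide states only the observation), is a lazy priority queue: stale prices are lower bounds, so an entry whose freshly recomputed price still beats every other stale bound must be the true minimum, and most itemsets are never re-evaluated.

```python
import heapq

def lazy_best(heap, current_price):
    """heap entries: (stale_price, itemset_id); current_price(i) recomputes the
    up-to-date price. Because prices only increase, a stale entry is a lower
    bound, so a fresh price that still beats the heap top is globally minimal."""
    while heap:
        stale, i = heapq.heappop(heap)
        fresh = current_price(i)
        if not heap or fresh <= heap[0][0]:
            return i, fresh  # fresh price beats every other stale lower bound
        heapq.heappush(heap, (fresh, i))  # price rose; reinsert and keep looking
    return None, None
```

Itemsets deeper in the queue are only recomputed when their stale bound reaches the top, which is exactly the saving in STEP 1.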

Further Summarization: HYPER+
The number of hyperrectangles returned by HYPER may be too large, or the cost too high. We can summarize further by allowing a false positive budget β, i.e. (false cells)/(true cells) ≤ β.
Example: with β = 0, the cover {t1,t2,t7,t8}×{i1,i2,i8,i9}, {t2,t3,t6,t7}×{i2,i3,i7,i8}, {t4,t5}×{i4,i5,i6} has cost 8 + 8 + 5 = 21. With β = 2/7, the first two merge into {t1,t2,t3,t6,t7,t8}×{i1,i2,i3,i7,i8,i9}, giving cost 12 + 5 = 17.
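
The budget test behind this merging can be sketched as follows (an assumed helper, not the authors' exact HYPER+ procedure). On the slide's example, merging the first two hyperrectangles yields 28 true cells and 8 false cells, and 8/28 = 2/7, so the merge is admissible exactly at β = 2/7.

```python
# Budget check for merging two hyperrectangles r1, r2 into
# (T1 u T2) x (I1 u I2): allowed only if (false cells)/(true cells) <= beta.
from itertools import product

def merge_within_budget(r1, r2, true_cells, beta):
    """r1, r2: (transaction_set, item_set); true_cells: cells of the database.
    Returns the merged hyperrectangle, or None if it exceeds the budget."""
    T = r1[0] | r2[0]
    I = r1[1] | r2[1]
    cells = set(product(T, I))
    false = len(cells - true_cells)   # covered cells absent from the database
    true = len(cells & true_cells)    # covered cells present in the database
    return (T, I) if true and false / true <= beta else None
```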

Experimental Results
Two important observations: convergence behavior and threshold behavior.
Two important conclusions: min-sup does not need to be very low, and we can reduce k to a relatively small number without increasing the false positive ratio too much.

Conclusion and Future Work
HYPER utilizes an exponential number of candidates to achieve a ln(k) + 1 approximation bound, yet runs in polynomial time. The pruning technique speeds HYPER up further. HYPER and HYPER+ work effectively, and we observe threshold behavior and convergence behavior in the experiments.
Future work: apply this method to real-world applications; determine the best α for summarization.

Thank you! Questions?

References
[Agrawal94] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD Conference, pp. 94–105, 1998.
[Lakshmanan02] Laks V. S. Lakshmanan, Raymond T. Ng, Christine Xing Wang, Xiaodong Zhou, and Theodore J. Johnson. The generalized MDL approach for summarization. In VLDB '02, pp. 766–777, 2002.
[Gao06] Byron J. Gao and Martin Ester. Turning clusters into patterns: Rectangle-based discriminative data description. In ICDM, pp. 200–211, 2006.
[Wang06] Jianyong Wang and George Karypis. On efficiently summarizing categorical databases. Knowl. Inf. Syst., 9(1):19–37, 2006.
[Chandola07] Varun Chandola and Vipin Kumar. Summarization: compressing data into an informative representation. Knowl. Inf. Syst., 12(3):355–378, 2007.
[Siebes06] Arno Siebes, Jilles Vreeken, and Matthijs van Leeuwen. Itemsets that compress. In SDM, 2006.
[Leeuwen06] Matthijs van Leeuwen, Jilles Vreeken, and Arno Siebes. Compression picks item sets that matter. In PKDD, pp. 585–592, 2006.
[Vreeken07] Jilles Vreeken, Matthijs van Leeuwen, and Arno Siebes. Characterising the difference. In KDD '07, pp. 765–774, 2007.
[Li05] Tao Li. A general model for clustering binary data. In KDD, pp. 188–197, 2005.
[Afrati04] Foto N. Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. In KDD, pp. 12–19, 2004.
[Pei04] Jian Pei, Guozhu Dong, Wei Zou, and Jiawei Han. Mining condensed frequent-pattern bases. Knowl. Inf. Syst., 6(5):570–594, 2004.
[Steinbach04] Michael Steinbach, Pang-Ning Tan, and Vipin Kumar. Support envelopes: a technique for exploring the structure of association patterns. In KDD '04, pp. 296–305, 2004.
[Johnson04] David Johnson, Shankar Krishnan, Jatin Chhugani, Subodh Kumar, and Suresh Venkatasubramanian. Compressing large boolean matrices using reordering techniques. In VLDB '04, pp. 13–23, 2004.
[V.Chvátal 1979] V. Chvátal. A greedy heuristic for the set-covering problem. Math. Oper. Res., 4:233–235, 1979.