CARPENTER Find Closed Patterns in Long Biological Datasets

CARPENTER Find Closed Patterns in Long Biological Datasets Zhiyu Wang Knowledge Discovery and Data Mining Dr. Osmar Zaiane Department of Computing Science University of Alberta

Biological Datasets: Gene expression data, which consist of a large number of genes.

Biological Datasets: Lung cancer dataset (gene expression) with 181 samples, each described by 12,533 genes. How can we find frequent patterns in such a dataset? CARPENTER

Overview: Motivation; Problem statement; Preliminaries; CARPENTER algorithm (transposed table, row enumeration tree, pruning methods); Performance; Comments and Conclusion

Motivation: The challenge is to find closed patterns in biological datasets that contain a large number of columns but a small number of rows, for example 10,000 to 100,000 columns with 100 to 1,000 rows.

Motivation: The running time of most existing algorithms increases exponentially with the average row length. For example, a dataset can have up to 2^i potential frequent itemsets, where i is the maximum row size. What if i = 12533? Then there are up to 2^12533 candidates (a huge search space).
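
To get a feel for the size of that search space, the short sketch below counts the decimal digits of 2^12533 and compares it with 2^181, the largest possible number of row sets for the 181-sample lung cancer dataset. The helper name `digits_of_power_of_two` is made up for this illustration, and both numbers are only loose upper bounds, not the work an actual algorithm performs.

```python
import math

GENES = 12533    # maximum row size i quoted on the slide
SAMPLES = 181    # number of rows (samples) in the lung cancer dataset

def digits_of_power_of_two(exponent: int) -> int:
    """Number of decimal digits of 2**exponent, computed via log10."""
    return math.floor(exponent * math.log10(2)) + 1

# Upper bound on candidate itemsets when enumerating columns (features).
print("2^12533 has", digits_of_power_of_two(GENES), "decimal digits")
# Upper bound on candidate row sets when enumerating rows instead.
print("2^181 has", digits_of_power_of_two(SAMPLES), "decimal digits")
```

Enumerating row sets rather than feature sets shrinks the bound from a number with thousands of digits to one with a few dozen, which is the intuition behind searching by rows when rows are scarce.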

Problem Statement Discover all the frequent closed patterns with respect to user specified support threshold in such biological datasets efficiently.

Preliminaries
Features: the items in the dataset.
Feature support set R(F'): the maximal set of rows that contain a set of features F'.
Example table:
i | r_i
1 | a, b, c
2 | b, c, d
3 |
4 | d
Features: {a, b, c, d}. Feature support set: if F' = {b, c}, then R(F') = {1, 2, 3}.

Preliminaries
Row support set F(R'): the maximal set of features common to a set of rows R'.
Frequent closed pattern: a frequent pattern for which there is no superset with the same support value.
Example table:
i | r_i
1 | a, b, c
2 | b, c, d
3 |
4 | d
Row support set: if R' = {1, 2}, then F(R') = {b, c}. Frequent closed patterns: {b, c}, {d}, {b, c, d}, ...
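
The two support sets and the closedness test can be sketched in a few lines of Python. The function names are invented for this illustration. Row 3 of the example table is blank in the transcript, so the value {b, c, d} used below is an assumption, chosen only because it agrees with the stated result for {b, c} and with the listed closed patterns.

```python
# Example table from the Preliminaries slides (row id -> features).
# Row 3 is blank in the transcript; {b, c, d} is an ASSUMED value.
TABLE = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"b", "c", "d"},  # assumption, see the note above
    4: {"d"},
}

def feature_support_set(features, table=TABLE):
    """Maximal set of rows that contain every feature in `features`."""
    return {i for i, row in table.items() if set(features) <= row}

def row_support_set(rows, table=TABLE):
    """Maximal set of features common to all rows in `rows` (non-empty)."""
    return set.intersection(*(table[i] for i in rows))

def is_closed(pattern, table=TABLE):
    """A pattern is closed if no superset has the same supporting rows."""
    rows = feature_support_set(pattern, table)
    return bool(rows) and row_support_set(rows, table) == set(pattern)

print(feature_support_set({"b", "c"}))  # rows supporting {b, c}
print(row_support_set({1, 2}))          # features shared by rows 1 and 2
print(is_closed({"b"}))                 # {b} always occurs together with c
```

Under the assumed row 3, {b} is not closed because every row containing b also contains c, while {b, c}, {d} and {b, c, d} are all closed.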

CARPENTER algorithm: Proposed by F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. J. Zaki at ACM SIGKDD 2003. The main idea is to find frequent closed patterns by depth-first, row-wise enumeration. Assumption: the dataset satisfies the condition:

CARPENTER: There are two phases: (1) transpose the dataset; (2) traverse the row enumeration tree by recursively searching the conditional transposed tables.

Transpose table: [Figure: the original table, its transposed table, and the 23-conditional transposed table obtained by projecting the transposed table on rows {2, 3}]
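
Since the figure did not survive, here is a minimal sketch of both tables. The five-row table is an assumed example, not taken from the transcript. The definition of the conditional table is also an assumption about what the slide shows: keep the features that occur in every row of X and, for each, only the row ids larger than the largest row in X.

```python
# ASSUMED example table (row id -> features); not given in the transcript.
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def transpose(table):
    """Transposed table: feature -> set of row ids that contain it."""
    tt = {}
    for row_id, features in table.items():
        for feature in features:
            tt.setdefault(feature, set()).add(row_id)
    return tt

def conditional_table(tt, x):
    """X-conditional transposed table (assumed definition, see lead-in)."""
    x = set(x)
    last = max(x)
    return {
        feature: {r for r in rows if r > last}
        for feature, rows in tt.items()
        if x <= rows  # the feature must occur in every row of X
    }

tt = transpose(TABLE)
print(sorted(tt["a"]))                 # rows containing feature a
print(conditional_table(tt, {2, 3}))   # the 23-conditional table
```

For this table the 23-conditional table keeps only the features a, e and h, each with the single remaining row 4.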

Row enumeration tree: The bottom-up row enumeration tree is based on conditional tables. Each node is a conditional table; for example, the 23-conditional table represents node 23.

Not a real tree structure. [Figure: row enumeration tree with its nodes numbered 1 to 10]

CARPENTER: Recursively generate conditional transposed tables, performing a depth-first traversal of the row enumeration tree, in order to find the frequent closed patterns.

Example Without pruning strategies, minsup=3

Example: Frequent closed patterns (minsup = 3)
Pattern | Rows
a | 1, 2, 3, 4
l | 1, 2, 5
aeh | 2, 3, 4
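
The unpruned depth-first row enumeration can be sketched as follows. This is an illustrative reimplementation, not the authors' code. The input table is an assumption (the transcript does not show it), picked so that the output agrees with the three patterns listed on the slide, and the function names are made up.

```python
# ASSUMED input table (row id -> features); not shown in the transcript.
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def transpose(table):
    """Transposed table: feature -> set of row ids that contain it."""
    tt = {}
    for row_id, features in table.items():
        for feature in features:
            tt.setdefault(feature, set()).add(row_id)
    return tt

def mine_closed_patterns(table, minsup):
    """Row enumeration without pruning: visit every row set X whose
    rows share at least one feature, and report F(X) if it is frequent."""
    tt = transpose(table)
    closed = {}

    def visit(x, cond):
        # cond is the X-conditional table: the features common to all
        # rows of X, each mapped to the later rows that contain it.
        if not cond:
            return
        pattern = frozenset(cond)
        support = set.intersection(*(tt[f] for f in pattern))
        if len(support) >= minsup:
            closed[pattern] = support
        for r in sorted(set().union(*cond.values())):
            child = {f: {s for s in rows if s > r}
                     for f, rows in cond.items() if r in rows}
            visit(x | {r}, child)

    for r in sorted(table):
        visit({r}, {f: {s for s in tt[f] if s > r} for f in table[r]})
    return closed

for pattern, rows in mine_closed_patterns(TABLE, minsup=3).items():
    print("".join(sorted(pattern)), sorted(rows))
```

Every reported pattern is an intersection of rows, so it is closed by construction; the cost is that the traversal touches every row combination with a non-empty intersection, which is what the pruning methods are meant to reduce.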

Prune methods: A complete traversal of the row enumeration tree is clearly inefficient. CARPENTER proposes three pruning methods.

Prune method 1: Prune any branch that can never generate a closed pattern whose support reaches the minsup threshold.

If minsup=4, then these branches will be pruned.
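
One way to turn this rule into a test is an upper bound on the support reachable below a node. The sketch assumes the bound |X| plus the largest number of remaining rows in any tuple of the X-conditional table; this is a reading of the slide rather than a quotation from the paper. The table and the function names are likewise assumptions made for illustration.

```python
# ASSUMED example table (row id -> features); not given in the transcript.
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def conditional_table(table, x):
    """Features common to all rows of X, each with the later rows containing it."""
    last = max(x)
    common = set.intersection(*(table[r] for r in x))
    return {f: {r for r, row in table.items() if r > last and f in row}
            for f in common}

def can_prune(table, x, minsup):
    """True if no node in the subtree of X can reach `minsup` rows.

    Assumed bound: any pattern below X keeps at least one feature of
    F(X), so it can gain at most as many extra rows as the longest
    tuple of the X-conditional table."""
    cond = conditional_table(table, x)
    most_extra_rows = max((len(rows) for rows in cond.values()), default=0)
    return len(x) + most_extra_rows < minsup

print(can_prune(TABLE, {2}, minsup=4))  # node 2: at most 1 + 2 rows
print(can_prune(TABLE, {1}, minsup=4))  # node 1: up to 1 + 3 rows
```

With minsup = 4, the branch under node 2 can be dropped because no feature of row 2 occurs in more than two later rows, while node 1 must still be explored.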

Prune method 2: If some rows appear in every tuple of the conditional transposed table, the branches for those rows are pruned and the conditional table is reconstructed.
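
A possible reading of this step: rows that occur in every tuple share all features of the current node, so they can be added to the node directly instead of being explored as separate branches. The sketch below follows that interpretation; the table and the helper names are assumptions made for illustration.

```python
# ASSUMED example table (row id -> features); not given in the transcript.
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def conditional_table(table, x):
    """Features common to all rows of X, each with the later rows containing it."""
    last = max(x)
    common = set.intersection(*(table[r] for r in x))
    return {f: {r for r, row in table.items() if r > last and f in row}
            for f in common}

def absorb_common_rows(table, x):
    """Move rows found in every tuple of the X-conditional table into X
    and rebuild the table without them (interpretation of prune method 2)."""
    x = set(x)
    cond = conditional_table(table, x)
    shared = set.intersection(*cond.values()) if cond else set()
    rebuilt = {f: rows - shared for f, rows in cond.items()}
    return x | shared, rebuilt

print(absorb_common_rows(TABLE, {2, 3}))  # row 4 occurs in every tuple
print(absorb_common_rows(TABLE, {1, 2}))  # no row is shared by all tuples
```

For node 23, row 4 occurs in the tuples of a, e and h alike, so the search can continue from node 234 with the same pattern aeh.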

Prune method 3: At each node, if the corresponding set of features (the pattern) has already been found, prune the branch.
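
The sketch below adds such a check to a plain depth-first row enumeration. It is a simplified illustration rather than the procedure from the paper: it remembers every pattern seen so far (frequent or not) and skips the subtree of any node whose pattern is already known. The table, the function name and the `skip_seen_patterns` switch are assumptions introduced for the example.

```python
# ASSUMED example table (row id -> features); not given in the transcript.
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def mine_closed(table, minsup, skip_seen_patterns=True):
    """Depth-first row enumeration in ascending row order.

    With skip_seen_patterns=True, a node whose pattern was already
    produced by an earlier node is not expanded."""
    row_ids = sorted(table)
    seen = set()    # every pattern produced so far
    result = {}     # frequent closed pattern -> supporting rows

    def visit(start, common):
        for idx in range(start, len(row_ids)):
            row = table[row_ids[idx]]
            pattern = frozenset(row if common is None else common & row)
            if not pattern:
                continue              # no shared feature: dead branch
            if pattern in seen:
                if skip_seen_patterns:
                    continue          # pattern already found: prune
            else:
                seen.add(pattern)
                support = {r for r in row_ids if pattern <= table[r]}
                if len(support) >= minsup:
                    result[pattern] = support
            visit(idx + 1, pattern)

    visit(0, None)
    return result

pruned = mine_closed(TABLE, minsup=3)
plain = mine_closed(TABLE, minsup=3, skip_seen_patterns=False)
print(pruned == plain)
```

On this table the pruned and unpruned runs return the same patterns. For example, node 34 yields aeh, which node 23 has already produced, so nothing below node 34 needs to be visited.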

Performance: CARPENTER is compared with CHARM and CLOSET, both of which use a column enumeration approach. The experiments use the lung cancer dataset (181 samples with 12,533 features). Two parameters are varied: minsup and length ratio, where the length ratio is the percentage of columns kept from the original dataset.

Performance Length ratio =60%, varying minsup

Performance Minsup=4% varying length ratio

Comments: The bottom-up approach of CARPENTER is not efficient (example with minsup = 3).

Comments: TD-Close uses a top-down approach (example with minsup = 3).

Conclusion: CARPENTER finds frequent closed patterns in biological datasets. It uses row enumeration instead of column enumeration to overcome the high dimensionality of these datasets. However, its bottom-up row enumeration is not very efficient.

References: F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. J. Zaki. CARPENTER: Finding closed patterns in long biological datasets. In Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003.

Thank you! Questions?