Sequence Clustering
COMP 790-90 Research Seminar, Spring 2011
The University of North Carolina at Chapel Hill



Outline
- Sequential Pattern Mining
- Support Framework
- Multiple Alignment Framework
- ApproxMAP
- Evaluation
- Conclusion

Inherent Problems
- Exact match: a pattern gets support from a sequence in the database if and only if the pattern is exactly contained in the sequence.
  - Often fails to find general long patterns in the database.
  - For example, many customers may share similar buying habits, but few of them follow exactly the same pattern.
- Mining the complete set returns too many trivial patterns.
  - Given long sequences with noise, mining is too expensive and yields too many patterns.
  - Finding max/closed sequential patterns is non-trivial.
  - In a noisy environment, there are still too many max/closed patterns.
  - The result does not summarize the trend.
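The exact-match support notion above can be sketched as follows. This is a minimal illustration, not the authors' code: the `contains`/`support` helpers and the toy database (the three sequences reused later in the slides) are hypothetical.

```python
def contains(pattern, sequence):
    """True iff every itemset of `pattern` is a subset of some itemset of
    `sequence`, in order -- the exact-containment test of the support
    framework."""
    pos = 0
    for p_itemset in pattern:
        while pos < len(sequence) and not p_itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # next pattern itemset must match strictly later
    return True

def support(pattern, db):
    """Number of database sequences that exactly contain `pattern`."""
    return sum(contains(pattern, seq) for seq in db)

db = [
    [{'A'}, {'B'}, {'D', 'E'}],
    [{'A', 'E'}, {'H'}, {'B', 'C'}, {'E'}],
    [{'A'}, {'B', 'C', 'G'}, {'D'}],
]
print(support([{'A'}, {'B'}, {'D'}], db))  # supported by seq 1 and seq 3
```

Note how the second sequence contributes nothing even though it is clearly similar: it lacks a (D) after the (BC), which is exactly the brittleness the slide criticizes.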

Multiple Alignment
- Line up the sequences to detect the trend.
- Find common patterns among strings, e.g., DNA / bio sequences.

Example: noisy variants of PATTERN lined up (gaps shown as "()"):
  PA()TTERN
  PATTTERN
  PA()TERM
  P TT RN
  OA TTERB
  P SYYRTN
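"Lining up" two sequences is a pairwise global alignment. A minimal sketch in the spirit of the slide, using plain strings, unit INDEL/REPLACE costs, and `-` as the gap symbol (the slides render gaps as "()"); the `align` helper is an illustration, not ApproxMAP's implementation:

```python
def align(s, t, gap='-'):
    """Globally align two strings under unit INDEL/REPLACE costs
    (Needleman-Wunsch style DP); returns the two gapped strings."""
    n, m = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i-1][j-1] + (s[i-1] != t[j-1]),  # match/replace
                           dp[i-1][j] + 1,                      # delete
                           dp[i][j-1] + 1)                      # insert
    # Backtrack along an optimal path to recover one alignment.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (s[i-1] != t[j-1]):
            a.append(s[i-1]); b.append(t[j-1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            a.append(s[i-1]); b.append(gap); i -= 1
        else:
            a.append(gap); b.append(t[j-1]); j -= 1
    return ''.join(reversed(a)), ''.join(reversed(b))

a, b = align('PATTTERN', 'PATERM')
print(a)
print(b)
```

Stripping the gap characters recovers the original strings, and the number of differing columns equals the edit distance between them.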

Multiple Alignment Score
- Multiple alignment score = Σ PS(seq_i, seq_j) over all pairs 1 ≤ i < j ≤ N.
- Optimal alignment: the one with minimum score.
- Pairwise score PS = edit distance dist(S_1, S_2):
  - minimum number of edit operations required to change S_1 into S_2;
  - operations are INDEL(a) and/or REPLACE(a, b).
- Example: dist(PATTTERN, PA()TERM).
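For sequences of itemsets, the same DP applies once REPLACE has a cost between itemsets. The normalized set-difference cost below (0 for identical itemsets, 1 for disjoint ones) is one plausible choice and an assumption here, not taken verbatim from the slides:

```python
def repl_cost(x, y):
    # Normalized set difference between two non-empty itemsets:
    # 0 when identical, 1 when disjoint. (Hypothetical cost choice.)
    return (len(x) + len(y) - 2 * len(x & y)) / (len(x) + len(y))

INDEL = 1.0  # cost of inserting or deleting one whole itemset

def dist(s1, s2):
    """Edit distance between two sequences of itemsets under the
    INDEL/REPLACE costs above (standard dynamic program)."""
    n, m = len(s1), len(s2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * INDEL
    for j in range(1, m + 1):
        dp[0][j] = j * INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i-1][j-1] + repl_cost(s1[i-1], s2[j-1]),
                           dp[i-1][j] + INDEL,
                           dp[i][j-1] + INDEL)
    return dp[n][m]

# seq1 and seq3 from the running example:
print(dist([{'A'}, {'B'}, {'D', 'E'}], [{'A'}, {'B', 'C', 'G'}, {'D'}]))
```

With these costs the two sequences align position by position: (A) matches (A) at cost 0, (B) vs (BCG) costs 2/4, and (DE) vs (D) costs 1/3, for a total of 5/6.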

Weighted Sequence
- A weighted sequence is a profile: it compresses a set of aligned sequences into one sequence.

Example:
  seq1: (A)           (B)        (DE)
  seq2: (AE)   (H)    (BC)       (E)
  seq3: (A)           (BCG)      (D)
  Weighted sequence: (A:3,E:1):3 (H:1):1 (B:3,C:2,G:1):3 (D:2,E:2):3, with n = 3 sequences.
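The compression step can be sketched directly from the example: for each alignment column, count item occurrences and record how many sequences are actually present (not gapped) there. The representation (a list of `(Counter, support)` pairs plus the total n) is an assumption for illustration:

```python
from collections import Counter

GAP = None  # a gap introduced by the alignment

def weighted_sequence(aligned):
    """Compress aligned itemset sequences (gaps as None) into a weighted
    sequence: per position, item counts plus the number of sequences
    present at that position; also return n, the total sequence count."""
    ws = []
    for j in range(len(aligned[0])):
        items, present = Counter(), 0
        for seq in aligned:
            if seq[j] is not GAP:
                items.update(seq[j])
                present += 1
        ws.append((items, present))
    return ws, len(aligned)

aligned = [
    [{'A'},      GAP,   {'B'},           {'D', 'E'}],  # seq1
    [{'A', 'E'}, {'H'}, {'B', 'C'},      {'E'}],       # seq2
    [{'A'},      GAP,   {'B', 'C', 'G'}, {'D'}],       # seq3
]
ws, n = weighted_sequence(aligned)
for items, present in ws:
    print(dict(items), present)
```

The printed profile matches the slide: (A:3,E:1):3, (H:1):1, (B:3,C:2,G:1):3, (D:2,E:2):3, with n = 3.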

Consensus Sequence
- strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences).
- Consensus itemset(j) = { i_a | i_a ∈ I and strength(i_a, j) ≥ min_strength }.
- Consensus sequence: the concatenation of the consensus itemsets for all positions, excluding any null consensus itemsets.

Example (min_strength = 2):
  seq1: (A)           (B)        (DE)
  seq2: (AE)   (H)    (BC)       (E)
  seq3: (A)           (BCG)      (D)
  Weighted sequence: (A:3,E:1):3 (H:1):1 (B:3,C:2,G:1):3 (D:2,E:2):3, n = 3
  Consensus sequence: (A)(BC)(DE)
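Extracting the consensus from a weighted sequence is then a simple filter. Following the slide's example, the threshold is applied to raw occurrence counts (the example uses min_strength = 2 over 3 sequences); the `(Counter, support)` representation is the same hypothetical one as above:

```python
from collections import Counter

def consensus(ws, min_strength):
    """Keep every item whose occurrence count reaches min_strength and
    drop positions whose consensus itemset comes out empty."""
    result = []
    for items, _present in ws:
        itemset = {i for i, c in items.items() if c >= min_strength}
        if itemset:
            result.append(itemset)
    return result

# Weighted sequence from the slide:
# (A:3,E:1):3 (H:1):1 (B:3,C:2,G:1):3 (D:2,E:2):3
ws = [(Counter(A=3, E=1), 3), (Counter(H=1), 1),
      (Counter(B=3, C=2, G=1), 3), (Counter(D=2, E=2), 3)]
print(consensus(ws, min_strength=2))  # the slide's (A)(BC)(DE)
```

Raising the threshold tightens the pattern: with min_strength = 3, only (A)(B) survives, which is how ApproxMAP distinguishes a tight pattern consensus from a looser variation consensus.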

Multiple Alignment Pattern Mining
Given:
- N sequences of itemsets,
- operation costs (INDEL and REPLACE) for itemsets, and
- a strength threshold for consensus sequences (a different level can be specified for each partition),
find:
- (1) a partition of the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimized,
- (2) the optimal multiple alignment for each partition, and
- (3) the pattern consensus sequence and the variation consensus sequence for each partition.

ApproxMAP (APPROXimate Multiple Alignment Pattern mining)
- The exact solution is too expensive, so ApproxMAP approximates in three steps:
  - Group: partition the sequences by clustering (k-NN) under the distance metric; O(kN) + O(N^2 L^2 I).
  - Compress: greedy multiple alignment within each cluster; O(nL^2).
  - Summarize: derive the pattern and variation consensus sequences; O(1).
- Overall time complexity: O(N^2 L^2 I).
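To make the grouping step concrete, here is a deliberately simplified stand-in: plain string edit distance instead of the itemset-sequence metric, and greedy single-linkage grouping under a distance threshold instead of ApproxMAP's k-NN density-based clustering. Everything here (threshold value, union-find grouping, toy data) is an illustrative assumption:

```python
def edit_dist(s, t):
    """Plain string edit distance (row-by-row DP)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j-1] + (cs != ct),  # match/replace
                           prev[j] + 1,             # delete
                           cur[j-1] + 1))           # insert
        prev = cur
    return prev[-1]

def cluster(seqs, threshold):
    """Group sequences whose edit distance is within `threshold`
    (single-linkage via union-find) -- a much-simplified stand-in for
    ApproxMAP's k-NN clustering step."""
    parent = list(range(len(seqs)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if edit_dist(seqs[i], seqs[j]) <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(seqs[i])
    return list(groups.values())

seqs = ['PATTERN', 'PATTERM', 'PATTTERN', 'HELLO', 'HELLP', 'HELLOO']
print(cluster(seqs, threshold=2))
```

Each resulting group would then be fed to the Compress and Summarize steps; the noisy PATTERN-like variants end up together, separated from the HELLO-like ones.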

Multiple Alignment: Weighted Sequence
Step 1: align seq2 and seq3:
  seq3: (A)           (B)   (DE)
  seq2: (AE)   (H)    (B)   (D)
  WS_1: (A:2,E:1):2 (H:1):1 (B:2):2 (D:2,E:1):2, n = 2
Step 2: align seq4 against WS_1:
  seq4: (A)           (BCG) (D)
  WS_2: (A:3,E:1):3 (H:1):1 (B:3,C:1,G:1):3 (D:3,E:1):3, n = 3

Evaluation Method: Criteria & Datasets
Criteria:
- Recoverability: the degree to which the underlying (max) patterns in the DB are detected,
  R = Σ_B E(F_B) * [ max over result patterns P of |B ∩ P| / E(L_B) ],
  cut off so that 0 ≤ R ≤ 1.
- Number of spurious patterns.
- Number of redundant patterns.
- Degree of extraneous items in the patterns: (total # of extraneous items in P) / (total # of items in P).
Datasets:
- Random data: independence between and across itemsets.
- Patterned data: IBM synthetic data (Agrawal and Srikant).
- Robustness w.r.t. noise: corruption level alpha (Yang et al., SIGMOD 2002).
- Robustness w.r.t. random sequences (outliers).

Evaluation: Comparison
Random data:
- ApproxMAP: no patterns with more than 1 item returned.
- Support framework: lots of spurious patterns.
Patterned data (10 patterns embedded into 1,000 sequences):
- ApproxMAP (k = 6, MinStrgh = 30%): recoverability 92.5%; 10 patterns returned; 2 redundant patterns; 0 spurious patterns; 0 extraneous items.
- Support framework (MinSup = 5%): recoverability 91.6%; 253,924 patterns returned; 247,266 redundant patterns; 6,648 spurious patterns; 93,043 (5.2%) extraneous items.
Noise:
- ApproxMAP: robust.
- Support framework: not robust; recoverability degrades fast.
Outliers:
- ApproxMAP: robust.

Robustness w.r.t. Noise [figure]

Results: Scalability [figure]

Evaluation: Real Data
- Successfully applied ApproxMAP to sequences of monthly social welfare services given to clients in North Carolina.
- Found interpretable and useful patterns that revealed information in the data.

Conclusion: Why Does It Work Well?
- Robust to random and weakly patterned noise:
  - Noise can almost never be aligned to generate patterns, so it is ignored.
  - If some alignment is possible, the pattern is detected.
- Very good at organizing sequences:
  - When there are "enough" sequences with a certain pattern, they are clustered and aligned.
  - When aligning, we start with the sequences with the least noise and add those with progressively more noise.
  - This builds a center of mass to which the noisier sequences can attach.
- Long sequence data that are not random have unique signatures.

Conclusion
- Works very well with market basket data: high dimensional, sparse, with massive outliers.
- Scales reasonably well:
  - Scales very well w.r.t. the number of patterns.
  - k: scales very well, O(1).
  - DB size: scales reasonably well, O(N^2 L^2 I); less than 1 minute for N = 1,000 on an Intel Pentium.