
Similar presentations
Huffman Codes and Association Rules (II) Prof. Sin-Min Lee Department of Computer Science.

Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Association rules and frequent itemsets mining
Frequent Closed Pattern Search By Row and Feature Enumeration
Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rules: Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski & A. Swami); Fast Algorithms for.
Rakesh Agrawal Ramakrishnan Srikant
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
732A02 Data Mining - Clustering and Association Analysis, Jose M. Peña: Association rules, Apriori algorithm, FP grow algorithm.
Association Analysis: Basic Concepts and Algorithms.
Association Rule Mining. Generating assoc. rules from frequent itemsets  Assume that we have discovered the frequent itemsets and their support  How.
CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
Association Analysis (3). FP-Tree/FP-Growth Algorithm Use a compressed representation of the database using an FP-tree Once an FP-tree has been constructed,
Mining Association Rules
Entity Tables, Relationship Tables We Classify using any Table (as the Training Table) on any of its columns, the class label column. Medical Expert System:
SEG Tutorial 2 – Frequent Pattern Mining.
Association Discovery from Databases Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical.
Data Mining on Streams  We should use runlists for stream data mining (unless there is some spatial structure to the data, of course, then we need to.
MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING
AR mining Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen CSCI6405 class project.
Bit Sequential (bSQ) Data Model and Peano Count Trees (P-trees) Department of Computer Science North Dakota State University, USA (the bSQ and P-tree technology.
Data Mining 1 Data Mining is one aspect of Database Query Processing (on the "what if" or pattern and trend end of Query Processing, rather than the "please.
Association Rule Mining on Remotely Sensed Imagery Using Peano-trees (P-trees) Qin Ding, Qiang Ding, and William Perrizo Computer Science Department North.
1 Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining -SIGKDD’03 Mohammad El-Hajj, Osmar R. Zaïane.
The Universality of Nearest Neighbor Sets in Classification and Prediction Dr. William Perrizo, Dr. Gregory Wettstein, Dr. Amal Shehan Perera and Tingda.
Vertical Data In Data Processing, you run up against two curses immediately. Curse of cardinality: solutions don’t scale well with respect to record volume.
Our Approach: Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees, processed horizontally (rather than processing horizontal data vertically).
Association Analysis (3)
Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University.
Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,
Bootstrapped Optimistic Algorithm for Tree Construction
Reducing Number of Candidates Apriori principle: – If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due.
Data Mining Association Rules Mining Frequent Itemset Mining Support and Confidence Apriori Approach.
Efficient Quantitative Frequent Pattern Mining Using Predicate Trees Baoying Wang, Fei Pan, Yue Cui William Perrizo North Dakota State University.
Vertical Set Square Distance Based Clustering without Prior Knowledge of K Amal Perera,Taufik Abidin, Masum Serazi, Dept. of CS, North Dakota State University.
Data Mining – Association Rules
Item-Based P-Tree Collaborative Filtering applied to the Netflix Data
Decision Tree Induction for High-Dimensional Data Using P-Trees
Efficient Ranking of Keyword Queries Using P-trees
Frequent Pattern Mining
North Dakota State University Fargo, ND USA
Yue (Jenny) Cui and William Perrizo North Dakota State University
Chapter 6 Tutorial.
Market Basket Analysis and Association Rules
Market Basket Many-to-many relationship between different objects
All Shortest Path pTrees for a unipartite undirected graph, G7 (SP1, SP2, SP3, SP4, SP5)
PTrees (predicate Trees): fast, accurate, DM-ready horizontal processing of compressed, vertical data structures. Project onto each attribute (4 files).
DIRECT HASHING AND PRUNING (DHP) ALGORITHM
Vertical K Median Clustering
Incremental Interactive Mining of Constrained Association Rules from Biological Annotation Data Imad Rahal, Dongmei Ren, Amal Perera, Hassan Najadat and.
Entity Tables, Relationship Tables is in Course Student Enrollments
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
3. Vertical Data LECTURE 2 Section 3.
Functional Analytic Unsupervised and Supervised data mining Technology
The Multi-hop closure theorem for the Rolodex Model using pTrees
Presentation transcript:

Review of Vertical Data and 1-D Ptrees

Given a table R(A1, A2, A3, A4) structured into horizontal records, the traditional way is Vertical Processing of Horizontal Data (VPHD): scan vertically to find the number of occurrences of 7,0,1,4 (the answer is 2). R is shown in base 10 and base 2; its bit columns are R11 R12 R13, R21 R22 R23, R31 R32 R33, R41 R42 R43 under R[A1], R[A2], R[A3], R[A4].

Predicate trees (Ptrees): vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic 1-D Ptree, giving P11 P12 P13, P21 P22 P23, P31 P32 P33, P41 P42 P43. E.g., compression of R11 into P11 goes as follows. Top-down construction of the 1-dimensional Ptree of R11, denoted P11: record the truth of the universal predicate pure1 in a tree, recursively on halves (1/2^1 subsets), until purity is achieved.

1. Whole is pure1? false → 0
2. Left half pure1? false → 0 (but it is pure, i.e. pure0, so this branch ends)
3. Right half pure1? false → 0
4. Left half of right half? false → 0
5. Right half of right half? true → 1

To find the number of occurrences of 7,0,1,4, AND these basic Ptrees (next slide). That is Horizontal Processing of Vertical Data (HPVD)!
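A minimal Python sketch (an illustration, not the lecture's implementation) of this top-down construction: record the truth of the pure1 predicate on recursive halves, ending a branch as soon as the segment is pure. The nested-tuple node layout and the sample bit slice (chosen to match the five steps above) are assumptions for illustration.

def build_ptree(bits):
    # A node is (pure1?, children); children is () when the segment is pure
    # (the branch ends), otherwise the pair (left-half subtree, right-half subtree).
    if all(bits):               # pure1: record true = 1, branch ends
        return (1, ())
    if not any(bits):           # pure0: record false = 0, branch ends
        return (0, ())
    mid = len(bits) // 2        # split into halves (1/2^1 subsets) and recurse
    return (0, (build_ptree(bits[:mid]), build_ptree(bits[mid:])))

# Whole not pure1 -> 0; left half pure0 -> branch ends with 0;
# right half mixed -> 0; right half of right half pure1 -> 1.
print(build_ptree([0, 0, 0, 0, 1, 0, 1, 1]))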

Vertical Data: Shortcuts in the processing of 1-Dimensional Ptrees

R(A1, A2, A3, A4). To count occurrences of 7,0,1,4, use:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

(7 = 111, 0 = 000, 1 = 001, 4 = 100: a basic Ptree for each 1-bit of the pattern, a complement Ptree P' for each 0-bit.)

Shortcuts while ANDing: a single 0 makes the entire left branch 0; a few 0s make a node 0; 1s, together with 0s that become 1s when complemented, make a node 1. In the result, the 2^1 level (2nd level) has the only 1-bit, so the 1-count of the Ptree is 1 × 2^1 = 2.
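To make the counting recipe concrete, here is a hedged Python sketch using uncompressed bit slices (plain lists instead of compressed Ptrees): AND in the slice Pij where the pattern bit is 1 and its complement P'ij where the pattern bit is 0, then take the 1-count. The 8-row table is a hypothetical stand-in (the slide's actual values are not recoverable from the transcript), constructed so that only its last two rows equal (7,0,1,4), reproducing the count of 2.

def bit_slices(rows, width=3):
    # P[i][j] = j-th bit slice (MSB first) of attribute i, as a list of 0/1.
    return [[[(row[i] >> (width - 1 - j)) & 1 for row in rows]
             for j in range(width)]
            for i in range(len(rows[0]))]

def count_tuple(rows, pattern, width=3):
    P = bit_slices(rows, width)
    result = [1] * len(rows)                     # running AND over all slices
    for i, value in enumerate(pattern):
        for j in range(width):
            bit = (value >> (width - 1 - j)) & 1
            # AND in the slice itself for a 1-bit, its complement for a 0-bit
            result = [r & (s if bit else 1 - s) for r, s in zip(result, P[i][j])]
    return sum(result)                           # the 1-count = # of occurrences

rows = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
        (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]  # hypothetical
print(count_tuple(rows, (7, 0, 1, 4)))           # -> 2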

Example: ARM using uncompressed Ptrees (note: the 1-count is placed at the root of each Ptree).

Build Ptrees: scan D once; after that, every support count is an AND of Ptrees (classical Apriori on the horizontal transaction table would scan D again for each of C1, C2, C3).

P1 = 2 //\\ 1010   P2 = 3 //\\ 0111   P3 = 3 //\\ 1110   P4 = 1 //\\ 1000   P5 = 3 //\\ 0111
F1 = L1 = {1} {2} {3} {5}

P1^P2 = 1 //\\ 0010   P1^P3 = 2 //\\ 1010   P1^P5 = 1 //\\ 0010
P2^P3 = 2 //\\ 0110   P2^P5 = 3 //\\ 0111   P3^P5 = 2 //\\ 0110
F2 = L2 = {1,3} {2,3} {2,5} {3,5}

C3 itemsets: {1,2,3} {1,3,5} {2,3,5}. {1,2,3} is pruned since {1,2} is not frequent; {1,3,5} is pruned since {1,5} is not frequent.
P1^P2^P3 = 1 //\\ 0010   P1^P3^P5 = 1 //\\ 0010   P2^P3^P5 = 2 //\\ 0110
F3 = L3 = {2,3,5}
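The level-wise pass above can be written compactly. Below is a Python sketch (an illustration, not the lecture's code) that uses the leaf bit-vectors shown above (P1 = 1010, P2 = 0111, P3 = 1110, P4 = 1000, P5 = 0111, minsup = 2); the support of an itemset is just the 1-count of the AND of its item vectors, and the prune step drops candidates such as {1,2,3} and {1,3,5}.

from itertools import combinations

P = {1: [1, 0, 1, 0], 2: [0, 1, 1, 1], 3: [1, 1, 1, 0],
     4: [1, 0, 0, 0], 5: [0, 1, 1, 1]}
minsup = 2

def support(itemset):
    vec = [1, 1, 1, 1]                    # running AND over the 4 transactions
    for item in itemset:
        vec = [a & b for a, b in zip(vec, P[item])]
    return sum(vec)                       # 1-count = # supporting transactions

L = [frozenset([i]) for i in P if support([i]) >= minsup]   # L1 = {1}{2}{3}{5}
k = 2
while L:
    print("L%d:" % (k - 1), sorted(tuple(sorted(s)) for s in L))
    prev = set(L)
    # join step: unions of frequent (k-1)-itemsets that form k-itemsets
    cands = {a | b for a, b in combinations(L, 2) if len(a | b) == k}
    # prune step: every (k-1)-subset must itself be frequent
    cands = {c for c in cands
             if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
    L = [c for c in cands if support(c) >= minsup]
    k += 1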

1-itemsets don't support association rules (they would have no antecedent or no consequent).

Are there any strong rules supported by frequent (= large) 2-itemsets (at minconf = .75)?

{1,3}: conf({1}→{3}) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75 STRONG
       conf({3}→{1}) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf({2}→{3}) = supp{2,3}/supp{2} = 2/3 = .67 < .75
       conf({3}→{2}) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf({2}→{5}) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75 STRONG
       conf({5}→{2}) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75 STRONG
{3,5}: conf({3}→{5}) = supp{3,5}/supp{3} = 2/3 = .67 < .75
       conf({5}→{3}) = supp{3,5}/supp{5} = 2/3 = .67 < .75

So 2-itemsets do support ARs. Are there any strong rules supported by frequent (= large) 3-itemsets?

{2,3,5}: conf({2,3}→{5}) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75 STRONG
         conf({2,5}→{3}) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
         conf({3,5}→{2}) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75

No subset antecedent can yield a strong rule either: there is no need to check conf({2}→{3,5}), conf({3}→{2,5}), or conf({5}→{2,3}), since each denominator (the antecedent's support) will be at least as large, and therefore each confidence will be at least as low. DONE!
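A short Python sketch of this rule-generation step, using the supports found above. It brute-forces every antecedent for illustration; the shrinking-antecedent argument on the slide is what lets you skip the last few checks by hand.

from itertools import combinations

supp = {frozenset(s): v for s, v in [
    ((1,), 2), ((2,), 3), ((3,), 3), ((5,), 3),
    ((1, 3), 2), ((2, 3), 2), ((2, 5), 3), ((3, 5), 2), ((2, 3, 5), 2)]}
minconf = 0.75

for itemset, s in supp.items():
    if len(itemset) < 2:
        continue                  # 1-itemsets: no antecedent or no consequent
    for r in range(1, len(itemset)):
        for ante in map(frozenset, combinations(sorted(itemset), r)):
            conf = s / supp[ante]     # conf(A -> B) = supp(A u B) / supp(A)
            if conf >= minconf:
                print(sorted(ante), "->", sorted(itemset - ante),
                      "conf = %.2f STRONG" % conf)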

Ptree-ARM versus Apriori on aerial photo (RGB) data together with yield data.

Dataset: a 1320 × 1320 pixel TIFF-Yield dataset (~1,700,000 transactions in total): aerial TIFF images (R, G, B) with synchronized yield (Y). P-ARM is compared to horizontal Apriori (classical) and to FP-growth (an improvement of it). For fairness, P-ARM finds all frequent itemsets, not just those containing Yield.

[Graphs: scalability with support threshold; scalability with number of transactions.]

The algorithms produce identical results; P-ARM is more scalable for lower support thresholds, and the P-ARM algorithm is more scalable to large spatial datasets.

P-ARM versus FP-growth (an efficient, tree-based frequent pattern mining method; details later).

Dataset: 17,424,000 pixels (transactions).

[Graphs: scalability with support threshold; scalability with number of transactions.]

For a dataset of 100K bytes, FP-growth runs very fast, but for images of large size P-ARM achieves better performance. P-ARM also achieves better performance in the case of low support thresholds.