Query Optimization: Relational Queries to Data Mining


Query Optimization: Relational Queries to Data Mining
Most people have data from which they want information. So, most people need DBMSs whether they know it or not. A major component of any DBMS is the query processor. Queries can range from structured to unstructured:
- SELECT ... FROM ... WHERE
- complex queries (nested, EXISTS, ...)
- fuzzy queries (e.g., BLAST searches, ...)
- OLAP (rollup, drilldown, slice/dice, ...)
- machine learning / data mining: supervised (classification, regression) and unsupervised (clustering, association rule mining)
The spectrum thus runs from relational querying (simple searching and aggregating) out to machine learning and data mining. Although we just looked closely at the structured end of this spectrum, much research is yet to be done even on that end to solve the problem of delivering standard-workload answers with low response times and high throughput (D. DeWitt, ACM SIGMOD'02 plenary symposium). On the data mining end, we have barely scratched the surface. (But those scratches have made the difference between becoming the world's biggest corporation and filing for bankruptcy: Walmart vs. KMart.)

Recall the ER Model notion of a Relationship
- Relationship: an association among 2 or more entities (the number of entities is the degree).
- The graph of a relationship: a degree=2 relationship between entities T and I generates a bipartite undirected graph (bipartite means the node set is a disjoint union of two subsets and all edges run from one subset to the other).
- Example: a degree=2 relationship, Works_In (with attribute since), between the entities Employees (ssn, name, lot) and Departments (did, dname, budget). [ER diagram shown on the slide.]
- A degree=2 relationship between an entity and itself, e.g., Employee Reports_To Employee, generates a unipartite undirected graph; to distinguish roles in a unipartite graph, one can specify the "role" of each entity (e.g., supervisor, subordinate).
- Relationships can have attributes too!

Association Rule Mining (ARM)  Given a relationship between entities T and I – E.g., in a retail market, Transactions (at checkout) and Items  An I-Association Rule relates 2 disjoint subsets of I (itemsets), A and C, and is written, A  C and has 2 measures, support and confidence A is the antecedent and is disjoint from C, called consequent T I A t1t1 t2t2 t3t3 t4t4 t5t5 i1i1 i2i2 i3i3 i4i4 C  There are also T-association rules, of course. –Examples: Relationship between customer cash-register transactions, T, and purchasable items, I (t related to i iff i is being bought during that cash-register transaction) Relationship between Aspects, T, and Code Modules, I (t is related to i iff module, i, is part of the aspect, t) Relationship between experiments, T, and genes, I (t is related to i iff gene, i, expresses at a threshold level during experiment, t) Any “part of” relationship, i  I is part of t  T (t is related to i iff i is part of t) Any “IS A” relationship, i  I IS A t  T (t is related to i iff i IS A t) …  The support of an I-set, A, is the fraction of T-instances related to each I-instance in A, e.g., A={i 1,i 2 } supp(A)= |{t 2,t 4 }| / 5 =.4 and the support of a rule A  C is defined as supp{A  C}  The confidence of a rule, A  C, is supp(A  C) / supp(A) (conditional probability of t being related to C given that it is related to A), e.g., conf(A  C) is.2/.4 =.5, since supp(A  C) = |{t 2 }|/5 =.2

Association Rule Mining (ARM)  Given a many to many relationship between entities T and I – E.g., in a retail market, Transactions (at checkout) and Items The support of A  C is the support of A  C The Confidence of A  C is the support of A  C divided by the support of A –Users usually define a minimum threshold for support (minsupp) and for confidence (minconf) to indicate which rules are important to them. –Users usually want STRONG RULES with supp ≥minsupp and conf ≥minconf

Finding Strong Association Rules
The m-m relationship between Transactions and Items can be expressed in a transaction table, where each transaction is a row with its ID in one column and the list of the items related to it in the other (or the item lists can be expressed using item bit vectors instead). [The slide shows such a table over items A..F, with minsupp = .5 and minconf = .75.]
APRIORI METHOD: iteratively find the frequent (or large) itemsets (support >= minsupp), of size 1 up to k, then generate the association rules supported by the frequent (large) itemsets. Ck denotes the candidate k-itemsets generated; Lk denotes the large k-itemsets. Start by finding the large 1-itemsets.
Key property used: every subset of a frequent itemset must also be frequent (if {A, B} is a frequent itemset, {A} and {B} must be frequent).
Suppose the items in Lk-1 are listed in an order.
Step 1 (self-joining Lk-1):
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2 (pruning):
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) delete c from Ck
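A compact Python sketch of this Apriori loop (join, prune, count); this is an illustrative implementation, not the P-tree version used later in the lecture. The toy dataset D below matches the four transactions that can be read off the bit vectors on the next slide.

from itertools import combinations

def apriori(transactions, minsupp):
    """Return {frozenset(itemset): support} for all itemsets with support >= minsupp."""
    n = len(transactions)
    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # L1: large 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsupp}
    large = {s: support(s) for s in Lk}
    k = 2
    while Lk:
        # Step 1: self-join -- union pairs of (k-1)-itemsets that share k-2 items
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Step 2: prune candidates having an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count supports and keep the large k-itemsets
        Lk = {c for c in Ck if support(c) >= minsupp}
        large.update({c: support(c) for c in Lk})
        k += 1
    return large

# Toy run on the classic 4-transaction example used on the next slide:
D = [frozenset(t) for t in ({1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5})]
print(apriori(D, minsupp=0.5))   # large itemsets include {1,3}, {2,3}, {2,5}, {3,5}, {2,3,5}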

Using P-trees: scan database D once to build a P-tree per item; root counts give the supports.
The four transactions of D (read off the bit vectors below) are {1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}.
Item P-trees (root count; leaf bits): P1: 2; 1010. P2: 3; 0111. P3: 3; 1110. P4: 1; 1000. P5: 3; 0111.
L1 = {1, 2, 3, 5}.
C2 ANDs: P1^P2: 1; 0010. P1^P3: 2; 1010. P1^P5: 1; 0010. P2^P3: 2; 0110. P2^P5: 3; 0111. P3^P5: 2; 0110.
L2 = {13, 23, 25, 35}.
C3: {123} is pruned since {12} is not large; {135} is pruned since {15} is not large.
C3 ANDs: P1^P2^P3: 1; 0010. P1^P3^P5: 1; 0010. P2^P3^P5: 2; 0110.
L3 = {235}.
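The P-tree counts above are just root counts of ANDed bit vectors. A minimal (uncompressed) bit-vector sketch of the same computation, using Python ints as stand-ins for the P-trees:

# Uncompressed bit-vector stand-ins for the item P-trees above
# (one bit per transaction); counts are pop-counts of ANDs.
P = {1: 0b1010, 2: 0b0111, 3: 0b1110, 4: 0b1000, 5: 0b0111}

def count(*items):
    """Root count: number of transactions containing all the given items."""
    v = 0b1111                      # all four transactions
    for i in items:
        v &= P[i]                   # AND the item bit vectors
    return bin(v).count("1")

print(count(1))        # 2
print(count(2, 5))     # 3
print(count(2, 3, 5))  # 2 -> {2,3,5} is large at minsupp = 2 of 4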

Generating Strong Rules from the Large Itemsets
1-itemsets don't support meaningful rules (they have either no antecedent or no consequent).
Are there any strong rules supported by large 2-itemsets at minconf = .75?
{1,3}: conf({1}=>{3}) = supp{1,3}/supp{1} = 2/2 = 1 >= .75 STRONG; conf({3}=>{1}) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf({2}=>{3}) = supp{2,3}/supp{2} = 2/3 = .67 < .75; conf({3}=>{2}) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf({2}=>{5}) = supp{2,5}/supp{2} = 3/3 = 1 >= .75 STRONG; conf({5}=>{2}) = supp{2,5}/supp{5} = 3/3 = 1 >= .75 STRONG
{3,5}: conf({3}=>{5}) = supp{3,5}/supp{3} = 2/3 = .67 < .75; conf({5}=>{3}) = supp{3,5}/supp{5} = 2/3 = .67 < .75
Any confident rules supported by large 3-itemsets?
{2,3,5}: conf({2,3}=>{5}) = supp{2,3,5}/supp{2,3} = 2/2 = 1 >= .75 STRONG; conf({2,5}=>{3}) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75; conf({3,5}=>{2}) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75
No subset antecedent can yield a strong rule either, i.e., there is no need to check conf({2}=>{3,5}), conf({3}=>{2,5}) or conf({5}=>{2,3}), since each denominator is at least as large and therefore each confidence is at least as low. DONE!
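A short Python sketch of this rule-generation step (check every antecedent/consequent split of each large itemset against minconf), assuming the apriori function and dataset D from the earlier sketch:

from itertools import combinations

def strong_rules(large, minconf):
    """Yield (antecedent, consequent, confidence) for rules meeting minconf.
    `large` maps frozenset itemsets to their supports (as returned by apriori)."""
    for itemset, supp_ic in large.items():
        if len(itemset) < 2:
            continue                          # 1-itemsets support no rules
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                c = supp_ic / large[ante]     # supp(A ∪ C) / supp(A)
                if c >= minconf:
                    yield set(ante), set(itemset - ante), c

large = apriori(D, minsupp=0.5)               # from the earlier sketch
for a, cons, c in strong_rules(large, minconf=0.75):
    print(a, "=>", cons, round(c, 2))         # e.g. {1} => {3} 1.0, {2,3} => {5} 1.0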

P-ARM versus Apriori for (R,G,B,Y)
[Slide plots: scalability with support threshold and scalability with number of transactions.]
- 1320 x 1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000); aerial TIFF images (R,G,B) with synchronized yield (Y).
- Comparison with horizontal Apriori (the classical method) and FP-growth (an improvement).
- In P-ARM, we find all frequent itemsets, not just those containing Yield, for fairness.
- Identical results; P-ARM is more scalable for lower support thresholds and more scalable to large spatial datasets.

P-ARM versus FP-growth
[Slide plots: scalability with support threshold (17,424,000 pixels/transactions) and scalability with number of transactions.]
- FP-growth is an efficient, tree-based frequent pattern mining method (details later).
- Identical results.
- For a dataset of 100K bytes, FP-growth runs very fast, but for images of large size P-ARM achieves better performance.
- P-ARM also achieves better performance in the case of a low support threshold.

Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
The core of the Apriori algorithm: use frequent (k-1)-itemsets to generate candidate frequent k-itemsets, then use a database scan and pattern matching to collect counts for the candidate itemsets.
The bottleneck of Apriori is candidate generation:
1. Huge candidate sets: 10^4 frequent 1-itemsets may generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, ..., a100}, one needs to generate on the order of 2^100 (about 10^30) candidates.
2. Multiple scans of the database: needs (n + 1) scans, where n is the length of the longest pattern.

Classification
[Slide graphic: unclassified samples enter an INPUT hopper, pass through the MODEL (classifier), and exit on the OUTPUT conveyor as classified tuples.]
Classify data (construct a model) from a training set whose tuples have class-column values (class labels), then use the model to classify unclassified samples (data that does not yet have class-column values). The assumption is that the training-set relationship between the non-class attribute values and the class labels is typical; we build a model to approximate that relationship.
3 steps: Build Model, Test Model, Use Model (to predict the class of samples).
Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

Eager Classifiers
TRAINING PHASE: Training Data -> Classification Algorithm (creates the Classifier, or Model, during the training phase) -> Classifier or Model, e.g., as a rule set: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
CLASSIFICATION (USE) PHASE: unclassified samples enter the INPUT hopper, the model assigns each one a class, and classified tuples exit on the OUTPUT conveyor; e.g., the unclassified sample (Joe, Assistant Prof, 5) is assigned a class.
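A tiny Python sketch of this eager (model-first) pattern, with the slide's example rule set; the training step is elided and the learned rule is hard-coded purely for illustration.

def classify(rank, years):
    """Model produced by the training phase (here hard-coded as the slide's rule)."""
    return "yes" if rank == "professor" or years > 6 else "no"

# USE phase: push unclassified samples through the already-built model.
print(classify("assistant prof", 5))  # 'no'  (Joe from the slide, under this rule)
print(classify("professor", 2))       # 'yes'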

Test Process
Usually some of the training tuples are set aside as a Test Set; after a model is constructed, the test tuples are run through the model. The model is acceptable if, e.g., the percentage of correct classifications exceeds 60%; if not, the model is rejected (never used).
Testing Data (NAME, RANK, YEARS, TENURED): Tom, Assistant Prof, 2, no; Merlisa, Associate Prof, 7, no; George, Associate Prof, 5, yes; Joseph, Assistant Prof, 7, no.
Correct = 3, Incorrect = 1, i.e., 75% correct. Since 75% is above the acceptability threshold, accept the model!
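A minimal sketch of this test step in Python: run the held-out tuples through a model and compute the fraction classified correctly. The model object here is assumed (any callable taking a feature tuple), not the slide's specific classifier.

def accuracy(model, test_set):
    """Fraction of test tuples whose predicted class matches the actual class."""
    correct = sum(model(x) == label for x, label in test_set)
    return correct / len(test_set)

def acceptable(model, test_set, threshold=0.60):
    """Accept the model only if it clears the acceptability threshold (e.g., 60%)."""
    return accuracy(model, test_set) >= threshold

# Hypothetical usage with some model m and held-out (features, label) pairs:
# print(acceptable(m, [(("Tom", "assistant prof", 2), "no"), ...]))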

Classification by Decision Tree Induction
- Decision tree: each internal node denotes a test on an attribute (the test attribute for that node); each branch represents an outcome of the test (a value of the test attribute); leaf nodes represent class-label decisions (the plurality class of the leaf is the predicted class).
- Decision tree model development consists of two phases: tree construction (at the start, all the training examples are at the root; partition the examples recursively based on selected attributes) and tree pruning (identify and remove branches that reflect noise or outliers).
- Decision tree use: classify unclassified samples by filtering them down the decision tree to their proper leaf, then predict the plurality class of that leaf (often only one class, depending on the stopping condition of the construction phase).

Algorithm for Decision Tree Induction
Basic ID3 algorithm (a simple greedy top-down algorithm):
- At the start, the current node is the root and all the training tuples are at the root.
- Repeat, down each branch, until a stopping condition is true: at the current node, choose a decision attribute (e.g., the one with the largest information gain). Each value of that decision attribute labels a link to the next level down and is used as the selection criterion of that link; each new level thus produces a partition of the parent training subset based on the selection value assigned to its link.
- Stopping conditions: all samples for a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is employed to classify the leaf); there are no samples left.
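A compact Python sketch of this greedy loop (entropy-based information gain, recursive partitioning, the three stopping conditions), under the usual ID3 assumption of categorical attributes; an illustration, not the lecture's exact code.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def id3(rows, labels, attributes):
    """rows: list of dicts attribute -> value; labels: parallel list of class labels."""
    # Stopping conditions: no samples, pure node, or no attributes left.
    if not rows:
        return None
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]     # majority vote at the leaf

    def gain(a):
        remainder = 0.0
        for v in {r[a] for r in rows}:
            sub = [lab for r, lab in zip(rows, labels) if r[a] == v]
            remainder += len(sub) / len(rows) * entropy(sub)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)                    # largest information gain
    tree = {"attribute": best, "branches": {}}
    for v in {r[best] for r in rows}:                   # one branch per attribute value
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == v]
        tree["branches"][v] = id3(sub_rows, sub_labels,
                                  [a for a in attributes if a != best])
    return tree

def classify(tree, sample):
    """Filter the sample down the tree to its leaf class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["attribute"]])
    return tree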

Bayesian Classification
A Bayesian classifier is a statistical classifier based on Bayes' theorem:
Let X be a data sample whose class label is unknown, and let H be the hypothesis that X belongs to a particular class. P(H|X) is the conditional probability of H given X and P(H) is the prior probability of H; then
P(H|X) = P(X|H) P(H) / P(X)

Naïve Bayesian Classification
- Given a training set R(A1..An, C), where C = {C1..Cm} is the class label attribute.
- The naïve Bayesian classifier predicts the class of an unknown data sample X to be the class Cj having the highest conditional probability conditioned on X: P(Cj|X) >= P(Ci|X) for all i != j.
- From Bayes' theorem: P(Cj|X) = P(X|Cj) P(Cj) / P(X). P(X) is constant for all classes, so we maximize P(X|Cj) P(Cj).
- To reduce the computational complexity of calculating all the P(X|Cj)'s, make the naïve assumption of class-conditional independence: P(X|Cj) is the product of the per-attribute probabilities P(xk|Cj).
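A minimal categorical naïve Bayes sketch in Python following these formulas (argmax over P(Cj) times the product of P(xk|Cj)); frequency estimates only, so it is an illustration rather than a production classifier.

from collections import Counter, defaultdict

def train_nb(rows, labels):
    """rows: list of attribute-value tuples; labels: parallel list of class labels."""
    priors = Counter(labels)                       # counts of each class Cj
    cond = defaultdict(Counter)                    # (class, attr index) -> value counts
    for x, c in zip(rows, labels):
        for k, v in enumerate(x):
            cond[(c, k)][v] += 1
    return priors, cond, len(labels)

def predict_nb(model, x):
    priors, cond, n = model
    def score(c):
        p = priors[c] / n                          # P(Cj)
        for k, v in enumerate(x):                  # naive assumption: multiply P(xk|Cj)
            p *= cond[(c, k)][v] / priors[c]       # note: unseen value gives 0;
        return p                                   # Laplace smoothing is the usual fix
    return max(priors, key=score)                  # class maximizing P(X|Cj) P(Cj)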

Neural Networks
- Advantages: prediction accuracy is generally high; robust (works when training examples contain errors); output may be discrete, real-valued, or a vector of several discrete or real-valued attributes; fast evaluation of the learned target function.
- Criticism: difficult to understand the learned function (weights); not easy to incorporate domain knowledge; long training time.

A Neuron
The n-dimensional input vector x is mapped into the variable y by means of the scalar product with the weight vector w and a nonlinear function mapping:
y = f( Σ_i w_i x_i - μ_k )
[Slide diagram: inputs x0..xn with weights w0..wn feed a weighted sum, the bias μ_k is subtracted, and the activation function f produces the output y.]
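A one-function Python sketch of this neuron (weighted sum minus a bias, passed through an activation); the sigmoid choice is an assumption on my part, since the slide only says the mapping is nonlinear.

import math

def neuron(x, w, bias):
    """Map input vector x to output y = f(sum_i w_i * x_i - bias)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias   # weighted sum minus bias
    return 1.0 / (1.0 + math.exp(-s))                 # sigmoid activation f

print(neuron([1.0, 0.5], [0.4, -0.2], bias=0.1))      # a single scalar output y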

Network Training
- The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classify correctly.
- Steps: initialize the weights with random values; feed the input tuples into the network; for each unit, compute the net input to the unit as a linear combination of all its inputs, compute the output value using the activation function, compute the error, and update the weights and the bias.

Multi-Layer Perceptron
[Slide diagram: an input vector x_i feeds the input nodes, which connect through weights w_ij to hidden nodes, which in turn connect to output nodes producing the output vector.]

For Nearest Neighbor Classification, we need a distance (to make sense of "nearest"). A distance is a function d of two n-dimensional points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) such that:
- d(X, Y) is positive definite: if X != Y then d(X, Y) > 0; if X = Y then d(X, Y) = 0.
- d(X, Y) is symmetric: d(X, Y) = d(Y, X).
- d satisfies the triangle inequality: d(X, Y) + d(Y, Z) >= d(X, Z).
Examples: Minkowski or L_p distance, d_p(X,Y) = (Σ_i |x_i - y_i|^p)^(1/p); Manhattan distance (p = 1); Euclidean distance (p = 2); Max distance (p = ∞), d_∞(X,Y) = max_i |x_i - y_i|; Canberra distance; squared cord distance; squared chi-squared distance (formulas shown on the slide).

An Example (a two-dimensional space)
Let X = (2,1), Y = (6,4), and let Z be the corner point where the right angle sits, so XZ = 4 and ZY = 3.
Manhattan: d1(X,Y) = XZ + ZY = 4 + 3 = 7
Euclidean: d2(X,Y) = XY = 5
Max: d∞(X,Y) = max(XZ, ZY) = XZ = 4
So d1 >= d2 >= d∞ always; in fact, for any positive integers p <= q, d_p >= d_q >= d_∞.
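A few lines of Python checking these numbers with a general Minkowski distance (passing p = float('inf') gives the Max distance):

def minkowski(X, Y, p):
    """L_p distance; p = float('inf') gives the Max (Chebyshev) distance."""
    diffs = [abs(x - y) for x, y in zip(X, Y)]
    return max(diffs) if p == float("inf") else sum(d ** p for d in diffs) ** (1 / p)

X, Y = (2, 1), (6, 4)
print(minkowski(X, Y, 1))             # 7.0  Manhattan
print(minkowski(X, Y, 2))             # 5.0  Euclidean
print(minkowski(X, Y, float("inf")))  # 4    Max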

Neighborhoods of a Point
A neighborhood (disk neighborhood) of a point T with radius r is a set of points S such that X is in S iff d(T, X) <= r; if X is a point on the boundary, d(T, X) = r.
[Slide diagrams: the disk of diameter 2r around T under the Manhattan, Euclidean, and Max distances, i.e., a diamond, a circle, and a square, respectively.]

Classical k-Nearest Neighbor Classification
- Select a suitable value for k.
- Determine a suitable distance metric.
- Find the k nearest training-set points to the unclassified sample.
- Let them vote.
- Assign the highest class vote (plurality class) from among the k-nearest-neighbor set.
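A minimal Python sketch of these steps (Euclidean distance is assumed as the metric; any of the distances above would do):

from collections import Counter
import math

def knn_classify(sample, training, k=3):
    """training: list of (point, label); returns the plurality class of the k nearest."""
    def dist(p, q):                       # Euclidean distance
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(training, key=lambda t: dist(sample, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]     # plurality class wins

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 7), "B"), ((1, 2), "A")]
print(knn_classify((2, 2), train, k=3))   # 'A'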

Closed-KNN
T is the unclassified sample. Using k = 3, traditional KNN finds the three nearest neighbors, arbitrarily selecting one point from any tie on the boundary; closed-KNN instead includes all points on the boundary. Closed-KNN yields higher classification accuracy than traditional KNN (thesis of Md Maleq Khan, 2001). The P-tree method always produces closed neighborhoods (and is faster!).

Performance: Accuracy (1997 dataset)
[Slide plot: accuracy (%) vs. training set size (number of pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree closed-KNN (Perfect Centering), and P-tree Hobbit (closed-KNN).]
Hobbit methods use a simplified distance that takes only the highest-order bit difference, rather than the square root of the sum of the squared differences (as in Euclidean distance).

Performance: Accuracy, continued (1998 dataset)
[Slide plot: accuracy (%) vs. training set size (number of pixels) for the same six methods: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree Perfect Centering (closed-KNN), and P-tree Hobbit (closed-KNN).]

Performance: Time (1997 dataset; both axes in logarithmic scale)
[Slide plot: per-sample classification time (sec) vs. training set size (number of pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree Perfect Centering (closed-KNN), and P-tree Hobbit (closed-KNN).]
(Note on the slide: this needs re-running using the EINring formulas.)

Performance: Time, continued (1998 dataset; both axes in logarithmic scale)
[Slide plot: per-sample classification time (sec) vs. training set size (number of pixels) for the same six methods.]
(Note on the slide: this needs re-running using the EINring formulas.)

3-Nearest-Neighbor Classification using Hamming distance (= number of mismatches)
[Slide table: training tuples over attributes (d1, d2, a1..a9, C, a1'..a9', C'); the relevant attributes are a5, a6, a1', a2', a3', a4' and the class label is C. The table values did not survive extraction.]
The unclassified sample is given by its bits on (a5, a6, a1', a2', a3', a4'). Scan the training tuples, keeping a running set of the 3 nearest so far: a tuple at Hamming distance d replaces the current farthest of the 3 only if d is smaller (the slide's trace shows decisions such as "d=2, don't replace", "d=3, don't replace", ..., "d=1, replace"). At the end, the 3 nearest neighbors vote, and C=1 wins!
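A short sketch of this scan-and-vote procedure in Python; the six-bit training tuples below are invented for illustration, since the slide's table did not survive extraction.

from collections import Counter

def hamming(u, v):
    """Number of mismatching bit positions."""
    return sum(a != b for a, b in zip(u, v))

def knn3_hamming(sample, training):
    """training: list of (six-bit tuple, class); vote among the 3 nearest."""
    nearest = sorted(training, key=lambda t: hamming(sample, t[0]))[:3]
    return Counter(c for _, c in nearest).most_common(1)[0][0]

# Hypothetical training bits over (a5, a6, a1', a2', a3', a4') with class C:
train = [((0, 0, 0, 0, 1, 0), 1), ((1, 1, 0, 1, 0, 0), 0), ((0, 0, 1, 0, 0, 0), 1),
         ((1, 0, 1, 1, 0, 1), 0), ((0, 1, 0, 0, 0, 0), 1)]
print(knn3_hamming((0, 0, 0, 0, 0, 0), train))   # 1 -- "C=1 wins!"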

To find all training points within distance 2 (a fairer prediction?) requires another scan.
[Slide: the same training table is rescanned; this time every tuple at Hamming distance <= 2 is included in the neighbor set ("d=2, include it also", "d=3, don't include", "d=1, already have", ...), rather than keeping only the 3 nearest.]

Using P-trees: find all training points within distance 1 of a = (a5, a6, a1', a2', a3', a4') = (0,0,0,0,0,0)
The 1-ring of a (the training points differing from a in exactly one of the six bit positions) is given by the 1-bits of the P-tree P constructed by ORing the six AND-P-trees in which exactly one of the six attribute conditions is flipped (the other five agree with a).
Then the C=1 vote count is the root count of P ^ PC, and the C=0 vote count is the root count of P ^ PC'. We never need to know which tuples voted.
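A small bit-vector sketch of this idea (Python ints standing in for uncompressed P-trees, one bit per training tuple); the attribute vectors and the class vector here are invented for illustration.

# Hypothetical 8-tuple training set over the six attributes and class C.
attrs = [0b00101100, 0b01000010, 0b00010001,
         0b10000100, 0b00100010, 0b01001000]   # a5, a6, a1', a2', a3', a4'
C = 0b01101001
ALL = 0b11111111

def popcount(v):
    return bin(v).count("1")

def ring1_votes(attrs, C):
    """Votes of all tuples at Hamming distance exactly 1 from the all-zero sample."""
    P = 0
    for j in range(len(attrs)):
        term = attrs[j]                        # position j differs from the sample
        for k in range(len(attrs)):
            if k != j:
                term &= ALL ^ attrs[k]         # every other position agrees (is 0)
        P |= term                              # OR the six one-mismatch masks
    return popcount(P & C), popcount(P & (ALL ^ C))   # (C=1 votes, C=0 votes)

print(ring1_votes(attrs, C))                   # (1, 2) with these made-up vectors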

a's 2-ring: a = (a5, a6, a1', a2', a3', a4') = (0,0,0,0,0,0)
[The next several slides list, and then evaluate line by line against the training table, the 15 P-trees whose 1-bits correspond to training points in a's 2-ring: one AND-P-tree for each of the C(6,2) = 15 ways of flipping exactly two of the six bit conditions. As with the 1-ring, ORing these P-trees gives the 2-ring mask, and the vote counts are root counts of that mask ANDed with PC and with PC'. The individual bit columns did not survive extraction.]

Clustering Methods
Clustering is partitioning into mutually exclusive and collectively exhaustive subsets, such that each point is very similar to (close to) the points in its component and very dissimilar to (far from) the points in the other components.
A categorization of the major clustering methods:
- Partitioning methods (k-means, k-medoids, ...)
- Hierarchical methods (AGNES, DIANA, ...)
- Density-based methods
- Grid-based methods
- Model-based methods

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps (it assumes the partitioning criterion is: maximize intra-cluster similarity and minimize inter-cluster similarity; of course, a heuristic is used, so the method isn't really an optimization):
1. Partition the objects into k nonempty subsets (or pick k initial means).
2. Compute the mean (center) or centroid of each cluster of the current partition (if one started with k means, this step is already done); the centroid is roughly the point that minimizes the sum of dissimilarities from the mean, or the sum of the squared errors from the mean. Assign each object to the cluster with the most similar (closest) center.
3. Go back to Step 2.
4. Stop when the new set of means doesn't change (or some other stopping condition holds).
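A compact Python sketch of these steps (random initial means, assign, recompute, repeat until the means stop changing); the data points are made up for illustration.

import math
import random

def kmeans(points, k, seed=0):
    """points: list of numeric tuples; returns (means, clusters)."""
    random.seed(seed)
    means = random.sample(points, k)                  # pick k initial means (step 1)
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                              # assign each point to the closest mean
            j = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[j].append(p)
        new_means = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]    # recompute centroids (step 2)
        if new_means == means:                        # stop when means don't change (step 4)
            return means, clusters
        means = new_means                             # otherwise repeat (step 3)

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(kmeans(pts, k=2)[0])                            # two centers near (1.3,1.3) and (8.3,8.3)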

k-Means, continued
[Slide figure: four panels illustrating Steps 1-4 on a 2-D point set.]
Strength: relatively efficient, O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n.
Weakness: applicable only when a mean is defined (e.g., a vector space); need to specify k, the number of clusters, in advance; sensitive to noisy data and outliers.

The K-Medoids Clustering Method
- Find representative objects, called medoids (a medoid must be an actual object in the cluster, whereas the mean seldom is).
- PAM (Partitioning Around Medoids, 1987): starts from an initial set of medoids; iteratively replaces one of the medoids by a non-medoid and, if the swap improves the aggregate similarity measure, retains it; does this over all medoid/non-medoid pairs. PAM works for small data sets but does not scale to large data sets.
- CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990): sub-samples, then applies PAM.
- CLARANS (Clustering Large Applications based on RANdom Search) (Ng & Han, 1994): randomizes the sampling.

Hierarchical Clustering Methods: AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990).
- Uses the single-link method (the distance between two sets is the minimum pairwise distance); other options are complete link (maximum pairwise distance), average link, ...
- Merge the nodes (clusters) that are most similar.
- Eventually all nodes belong to the same cluster.
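A short single-link agglomerative sketch in Python (repeatedly merge the two closest clusters until the requested number of clusters remains); this stops early rather than building the full dendrogram, purely to keep the example small.

import math
from itertools import combinations

def single_link(points, num_clusters):
    """Agglomerative clustering with single-link (minimum pairwise) distance."""
    clusters = [[p] for p in points]                 # start: every point is its own cluster
    def d(c1, c2):                                   # single-link inter-cluster distance
        return min(math.dist(a, b) for a in c1 for b in c2)
    while len(clusters) > num_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: d(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)               # merge the two most similar clusters
    return clusters

print(single_link([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], num_clusters=3))
# e.g. [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]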

DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990).
- Inverse order of AGNES: initially all objects are in one cluster, which is then split according to some criterion (e.g., again maximize some aggregate measure of pairwise dissimilarity).
- Eventually each node forms a cluster on its own.

Contrasting Hierarchical Clustering Techniques
- Partitioning algorithms: partition a dataset into k clusters, e.g., k = 3.
- Hierarchical algorithms: create a hierarchical decomposition into ever-finer partitions, either top down (divisively) or bottom up (agglomeratively).
[Slide figures: a flat 3-way partition vs. a nested hierarchy of partitions.]

Hierarchical Clustering (bottom up): Agglomerative
[Slide figure: Step 0: the individual objects a, b, c, d, e; Step 1: {a,b}; Step 2: {d,e}; Step 3: {c,d,e}; Step 4: {a,b,c,d,e}.]

Hierarchical Clustering (top down): Divisive
[Slide figure: the same hierarchy read top down, starting with {a,b,c,d,e} at Step 0 and splitting until each object is its own cluster.]
In either case, one gets a nice dendrogram in which any maximal anti-chain (no 2 nodes linked) is a clustering (partition).

Hierarchical Clustering (cont.)
Recall that any maximal anti-chain (a maximal set of nodes in which no 2 are chained) is a clustering; a dendrogram therefore offers many clusterings.
[Slide figure: a dendrogram with one maximal anti-chain highlighted.]

Hierarchical Clustering (Cont.) But the “horizontal” anti-chains are the clusterings resulting from the top down (or bottom up) method(s).

Data Mining Summary
Data mining on a given table of data includes:
- Association Rule Mining (ARM) on bipartite relationships
- Clustering: partitioning methods (k-means, k-medoids, ...), hierarchical methods (AGNES, DIANA, ...), model-based methods, ...
- Classification: decision tree induction, Bayesian, neural network, k-nearest-neighbor, ...
But most data mining is on a database, not just one table; that is, often one must first apply the appropriate SQL query to a database to get the table to be data mined. The next slides discuss vertical data methods for doing that.

Vertical Select-Project-Join (SPJ) Queries
A Select-Project-Join query has joins, selections and projections. Typically there is a central fact relation to which several dimension relations are joined (a standard STAR DW), e.g., Student (S), Course (C), Enrol (E) in the STAR DB below (bit encodings are shown alongside certain attributes):

S (s, name, gen):          C (c, name, st, term):      E (s, c, grade):
0=000  CLAY   M=0          0=000  BI  ND  F=0          000  001  B=10
1=001  THAIS  M=0          1=001  DB  ND  S=1          000  000  A=11
2=010  GOOD   F=1          2=010  DM  NJ  S=1          011  001  A=11
3=011  BAID   F=1          3=011  DS  ND  F=0          011  011  D=00
4=100  PERRY  M=0          4=100  SE  NJ  S=1          001  011  D=00
5=101  JOAN   F=1          5=101  AI  ND  F=0          001  000  B=10
                                                       010  010  B=10
                                                       010  011  A=11
                                                       100  100  B=10
                                                       101  101  B=10

The numeric attributes are stored as vertical bit-sliced (uncompressed) columns: S.s2, S.s1, S.s0, S.g, C.c2, C.c1, C.c0, C.t, E.s2, E.s1, E.s0, E.c2, E.c1, E.c0, E.g1, E.g0 (the individual bit columns are not reproduced here). The vertical (un-bit-sliced) categorical attributes are stored as single-column files: S.name, C.name, C.st.

Vertical Preliminary Select-Project-Join Query Processing (SPJ)
In the SCORE database (Students, Courses, Offerings, Rooms, Enrollments), numeric attributes are represented vertically as P-trees (not compressed) and categorical attributes are projected to one-column vertical files.

Query:
SELECT S.n, C.n
FROM S, C, O, R, E
WHERE S.s=E.s AND C.c=O.c AND O.o=E.o AND O.r=R.r
  AND S.g=M AND C.r=2 AND E.g=A AND R.c=20;

S (s, n, gen): (0,A,M), (1,T,M), (2,S,F), (3,B,F), (4,C,M), (5,J,F)
C (c, n, cred): (0,B,1), (1,D,3), (2,M,3), (3,S,2)
R (r, cap): (0,30), (1,20), (2,30), (3,10)
O (o, c, r): (0,0,0), (1,0,1), (2,1,0), (3,1,1), (4,2,0), (5,2,2), (6,2,3), (7,3,2)
E (s, o, grade): (0,1,B), (0,0,A), (3,1,A), (3,3,D), (1,3,D), (1,0,B), (2,2,B), (2,7,A), (4,4,B), (5,5,B)

[On the slide each value also carries its binary encoding (e.g., grades A=11, B=10, D=00; S.s is encoded 0=000, 1=001, 2=100, 3=111, 4=010, 5=011), and each numeric attribute is further decomposed into its individual bit-slice columns.]

Selection step (same query as above)
For the selections S.g=M=1b, C.r=2=10b, E.g=A=11b, R.c=20=10b, create the selection masks using ANDs and COMPLEMENTS of the bit-slice P-trees:
SM   = P(S.g)
Cr2  = P(C.r1) AND P'(C.r0)
EgA  = P(E.g1) AND P(E.g0)
Rc20 = P(R.c1) AND P'(R.c0)
Apply these selection masks (zero out the numeric values and blank the others for non-qualifying tuples).
[The slide shows the resulting masked bit columns for S, E, C, O and R.]
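A small Python sketch of this masking idea, using ints as uncompressed bit-slice P-trees over the 4 R tuples from the table above, and computing the R.c = 20 = 10b mask by ANDing one slice with the complement of the other:

# R.cap values 30, 20, 30, 10 encode as 11, 10, 11, 01 (one bit per tuple, tuple r=0 leftmost).
R_c1 = 0b1110                   # high bit of the cap code for tuples r=0..3
R_c0 = 0b1011                   # low bit of the cap code
ALL = 0b1111
Rc20 = R_c1 & (ALL ^ R_c0)      # cap code == 10b  ->  bit mask of qualifying tuples
print(f"{Rc20:04b}")            # 0100 -> only the tuple r=1 (cap=20) qualifies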

Join step (same query as above)
For the joins S.s=E.s, C.c=O.c, O.o=E.o, O.r=R.r, one approach is to follow an indexed-nested-loop-like method (noting that attribute P-trees ARE an index for that attribute).
The join O.r=R.r is simply part of a selection on O (R contributes no output and participates in no further operations):
- Use the Rc20-masked R as the outer relation.
- Use O as the indexed inner relation to produce the O-selection mask: get the 1st (and here only) masked R.r value, 01b, and mask the O tuples with the corresponding AND of O.r bit-slice P-trees. This is the only R.r value; if there were more, one would do the same for each and then OR those masks to get the final O-mask, OM.
- Next, apply the O-mask OM to O.
[The slide shows the resulting masked O bit columns.]

For the final 3 joins, C.c=O.c, O.o=E.o, E.s=S.s, the same indexed-nested-loop-like method can be used:
- Get the 1st masked C.c value, 11b, and mask the corresponding O tuples: P(O.c1) AND P(O.c0), giving O-mask OM.
- Get the 1st masked O.o value, 111b, and mask the corresponding E tuples: P(E.o2) AND P(E.o1) AND P(E.o0), giving E-mask EM.
- Get the 1st masked E.s value, 010b, and mask the corresponding S tuples: P'(S.s2) AND P(S.s1) AND P'(S.s0), giving S-mask SM.
- Get the S.n value(s), C, pair with the C.n value(s), S, and output the concatenation: C.n S.n.
There was just one masked tuple at each stage in this example. In general, one would loop through the masked portion of the extant domain at each level (thus, Indexed Horizontal Nested Loop, or IHNL).

Vertical Select-Project-Join-Classification Query
Given the previous SCORE training database (not presented as just one training table), predict which course a male student will register for who got an A in a previous course held in a room with a capacity of 20. This is a matter of applying the previous complex SPJ query first, to get the pertinent training table, and then classifying the unclassified sample above (e.g., using 1-nearest-neighbor classification). The result of the SPJ is the single-row training set (S, C), so the prediction is course = C.