Query Optimization: Relational Queries to Data Mining


1 Query Optimization: Relational Queries to Data Mining
Most people have data from which they want information, so most people need DBMSs whether they know it or not. A major component of any DBMS is the query processor. Queries range from structured to unstructured:
- Relational querying: SELECT ... FROM ... WHERE ..., complex queries (nested, EXISTS, ...), FUZZY queries (e.g., BLAST searches, ...), OLAP (rollup, drilldown, slice/dice, ...)
- Simple searching and aggregating
- Machine Learning / Data Mining: Supervised (classification, regression); Unsupervised (clustering, association rule mining)
Although we just looked closely at the structured end of this spectrum, much research is yet to be done on that end to solve the problem of delivering standard workload answers with low response times and high throughput (D. DeWitt, ACM SIGMOD'02 plenary). On the data mining end, we have barely scratched the surface. (But those scratches have made the difference between becoming the world's biggest corporation and filing for bankruptcy: Walmart vs. Kmart.)

2 Recall the ER Model notion of a Relationship
- Relationship: an association among 2 or more entities (the number of entities is the degree).
- The graph of a relationship: a degree=2 relationship between entities T and I generates a bipartite undirected graph (bipartite means that the node set is a disjoint union of two subsets and that all edges run from one subset to the other).
[ER diagram: Employee(ssn, name, lot) -- Works_In(since) -- Department(did, dname, budget), a degree=2 relationship between the entities Employees and Departments.]
- A degree=2 relationship between an entity and itself, e.g., Employee Reports_To Employee, generates a uni-partite undirected graph. To distinguish roles in a unipartite graph, one can specify the "role" of each entity (e.g., supervisor and subordinate in Reports_To).
- Relationships can have attributes too!

3 Association Rule Mining (ARM)
- Given a relationship between entities T and I, e.g., in a retail market, Transactions (at checkout) and Items.
- An I-association rule relates 2 disjoint subsets of I (itemsets), A and C, and is written A => C. A is the antecedent and C, disjoint from A, is the consequent. A rule has 2 measures, support and confidence.
- There are also T-association rules, of course.
- Examples:
  - Relationship between customer cash-register transactions, T, and purchasable items, I (t is related to i iff i is being bought during that cash-register transaction).
  - Relationship between Aspects, T, and Code Modules, I (t is related to i iff module i is part of aspect t).
  - Relationship between experiments, T, and genes, I (t is related to i iff gene i expresses at a threshold level during experiment t).
  - Any "part of" relationship, i in I is part of t in T (t is related to i iff i is part of t).
  - Any "IS A" relationship, i in I IS A t in T (t is related to i iff i IS A t).
- The support of an I-set A is the fraction of T-instances related to every I-instance in A, e.g., for A = {i1, i2}, supp(A) = |{t2, t4}| / 5 = 0.4. The support of a rule A => C is defined as supp(A ∪ C).
- The confidence of a rule A => C is supp(A ∪ C) / supp(A) (the conditional probability of t being related to C given that it is related to A), e.g., conf(A => C) = 0.2 / 0.4 = 0.5, since supp(A ∪ C) = |{t2}| / 5 = 0.2.
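To make these two definitions concrete, here is a minimal Python sketch (not from the slides; the names transactions, supp and conf are illustrative). The example data is chosen so that the numbers match the slide: A = {i1, i2} occurs in 2 of 5 transactions and A ∪ {i3} in 1 of 5.

```python
# Five transactions over items i1..i4, arranged so supp({i1,i2}) = 0.4
# and conf({i1,i2} => {i3}) = 0.5, as on the slide.
transactions = [
    {"i4"},                 # t1
    {"i1", "i2", "i3"},     # t2
    {"i3"},                 # t3
    {"i1", "i2"},           # t4
    {"i2", "i4"},           # t5
]

def supp(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def conf(antecedent, consequent, transactions):
    """supp(A u C) / supp(A): conditional probability of C given A."""
    return supp(set(antecedent) | set(consequent), transactions) / supp(antecedent, transactions)

print(supp({"i1", "i2"}, transactions))          # 0.4
print(conf({"i1", "i2"}, {"i3"}, transactions))  # 0.5
```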

4 Association Rule Mining (ARM)
- Given a many-to-many relationship between entities T and I, e.g., in a retail market, Transactions (at checkout) and Items.
- The support of A => C is the support of A ∪ C.
- The confidence of A => C is the support of A ∪ C divided by the support of A.
- Users usually define a minimum threshold for support (minsupp) and for confidence (minconf) to indicate which rules are important to them.
- Users usually want STRONG RULES, with supp ≥ minsupp and conf ≥ minconf.

5 Finding strong Association Rules
The m-m relationship between Transactions and Items can be expressed in a Transaction Table, where each transaction is a row with its ID in one column and the list of items related to it in the other (or the item lists can be expressed using "item bit vectors" instead):

TID  | A B C D E F
2000 | 1 1 1 0 0 0
1000 | 1 0 1 0 0 0
4000 | 1 0 0 1 0 0
5000 | 0 1 0 0 1 1

minsupp = .5, minconf = .75

APRIORI METHOD: iteratively find frequent (or large) itemsets (support ≥ minsupp) of size 1 up to k, then generate the association rules supported by those frequent itemsets. C_k denotes the candidate k-itemsets generated; L_k denotes the large k-itemsets.
Use: a subset of a frequent itemset must also be frequent (if {A, B} is a frequent itemset, {A} and {B} must be frequent).
Start by finding large 1-itemsets. Here the 1-itemset supports are A:3, B:2, C:2, D:1, E:1, F:1, so the large 1-itemsets are A, B, C (supports 3, 2, 2).
To build C_k from L_{k-1}, suppose the items in L_{k-1} are listed in an order.
Step 1 (self-joining L_{k-1}):
  insert into C_k
  select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2 (pruning):
  forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
      if (s is not in L_{k-1}) delete c from C_k
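The self-join and prune steps translate directly into code. The following is a hedged Python sketch (function and variable names are my own, not from the slides) of generating C_k from L_{k-1}, with itemsets represented as sorted tuples.

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Generate candidate k-itemsets C_k from the large (k-1)-itemsets L_prev.

    L_prev is a set of sorted tuples, each of length k-1.
    Step 1 (self-join): join p and q that agree on their first k-2 items,
    with p[-1] < q[-1].  Step 2 (prune): drop any candidate that has a
    (k-1)-subset not in L_prev.
    """
    L_prev = set(L_prev)
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # pruning step: every (k-1)-subset must itself be large
    return {c for c in candidates
            if all(s in L_prev for s in combinations(c, len(c) - 1))}

# Example: from L2 = {13, 23, 25, 35} of the slides, the only surviving
# candidate 3-itemset is {2, 3, 5}.
print(apriori_gen({(1, 3), (2, 3), (2, 5), (3, 5)}))   # {(2, 3, 5)}
```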

6 Apriori example, using P-trees
Database D (TID : items): 100 : {1,3,4}, 200 : {2,3,5}, 300 : {1,2,3,5}, 400 : {2,5}. Minimum support count = 2.
Build P-trees (one bit vector per item, one bit per transaction, with root count = support count):
  P1 = 1010 (2), P2 = 0111 (3), P3 = 1110 (3), P4 = 1000 (1), P5 = 0111 (3)
Scan D for C1; this gives L1 = {1, 2, 3, 5}.
C2 counts are obtained by ANDing P-trees:
  P1^P2 = 0010 (1), P1^P3 = 1010 (2), P1^P5 = 0010 (1), P2^P3 = 0110 (2), P2^P5 = 0111 (3), P3^P5 = 0110 (2)
giving L2 = {13, 23, 25, 35}.
C3: P1^P2^P3 = 0010 (1), P1^P3^P5 = 0010 (1), P2^P3^P5 = 0110 (2); {123} is pruned since {12} is not large and {135} is pruned since {15} is not large, giving L3 = {235}.
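To illustrate the AND-based counting above, here is a small sketch using plain Python integers as uncompressed bit vectors (a stand-in for P-trees; real P-trees are compressed trees, which this does not model). The "root count" is just the popcount of the ANDed mask.

```python
# Each item's "P-tree" is an uncompressed bit vector over the 4 transactions
# (bit order: T100, T200, T300, T400), written as a binary literal.
P = {
    1: 0b1010,  # item 1 occurs in T100, T300       -> root count 2
    2: 0b0111,  # item 2 occurs in T200, T300, T400 -> root count 3
    3: 0b1110,
    4: 0b1000,
    5: 0b0111,
}

def root_count(itemset):
    """Support count of an itemset = popcount of the AND of its item vectors."""
    mask = ~0
    for item in itemset:
        mask &= P[item]
    return bin(mask & 0b1111).count("1")

print(root_count({2, 3, 5}))   # 2: {2,3,5} is large at minimum support count 2
print(root_count({1, 2, 3}))   # 1: {1,2,3} is not
```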

7 Generating strong rules from the large itemsets
1-itemsets don't support meaningful rules (they have either no antecedent or no consequent).
Are there any strong rules supported by large 2-itemsets at minconf = .75?
{1,3}: conf({1} => {3}) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75  STRONG
       conf({3} => {1}) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf({2} => {3}) = supp{2,3}/supp{2} = 2/3 = .67 < .75
       conf({3} => {2}) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf({2} => {5}) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75  STRONG
       conf({5} => {2}) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75  STRONG
{3,5}: conf({3} => {5}) = supp{3,5}/supp{3} = 2/3 = .67 < .75
       conf({5} => {3}) = supp{3,5}/supp{5} = 2/3 = .67 < .75
Any confident rules supported by large 3-itemsets?
{2,3,5}: conf({2,3} => {5}) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75  STRONG
         conf({2,5} => {3}) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
         conf({3,5} => {2}) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75
No subset antecedent of a failed rule can yield a strong rule either: there is no need to check conf({2} => {3,5}), conf({3} => {2,5}) or conf({5} => {2,3}), since each of those denominators is at least as large as that of a rule already below minconf, so each of those confidences is at least as low. DONE!
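The rule-generation pass above can be sketched as follows (helper names are assumed; supports are taken from a precomputed dictionary of frequent-itemset counts, as the Apriori phase would produce for this example).

```python
from itertools import combinations

# Support counts for the frequent itemsets of the running example (out of 4 transactions).
support = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}

def strong_rules(support, minconf):
    """Yield (antecedent, consequent, confidence) for every strong rule A => C."""
    for itemset, supp_ac in support.items():
        if len(itemset) < 2:
            continue  # 1-itemsets support no meaningful rules
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp_ac / support[antecedent]
                if conf >= minconf:
                    yield set(antecedent), set(itemset - antecedent), conf

for a, c, conf in strong_rules(support, minconf=0.75):
    print(a, "=>", c, round(conf, 2))
# Prints the four strong rules of the slide:
# {1} => {3}, {2} => {5}, {5} => {2}, {2, 3} => {5}, all with confidence 1.0
```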

8 P-ARM versus Apriori
[Charts: scalability with support threshold and scalability with number of transactions, for a 1320 x 1320 pixel TIFF-Yield dataset (R,G,B,Y); the total number of transactions is ~1,700,000.]
- Comparison with horizontal Apriori (the classical method) and FP-growth (an improvement).
- Aerial TIFF images (R,G,B) with synchronized yield (Y).
- In P-ARM, we find all frequent itemsets, not just those containing Yield, for fairness.
- Identical results.
- P-ARM is more scalable for lower support thresholds, and more scalable to large spatial datasets.

9 P-ARM versus FP-growth
[Charts: scalability with support threshold and scalability with number of transactions; 17,424,000 pixels (transactions).]
- FP-growth is an efficient, tree-based frequent pattern mining method (details later).
- Identical results.
- For a dataset of 100K bytes, FP-growth runs very fast, but for images of large size P-ARM achieves better performance.
- P-ARM achieves better performance in the case of low support thresholds.

10 Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mine on a subset of the given data with a lower support threshold, plus a method to determine completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets.
- Use database scans and pattern matching to collect counts for the candidate itemsets.
The bottleneck of Apriori is candidate generation:
1. Huge candidate sets: 10^4 frequent 1-itemsets may generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1 ... a100}, one needs to generate 2^100 ≈ 10^30 candidates.
2. Multiple scans of the database: Apriori needs n+1 scans, where n is the length of the longest pattern.

11 Classification
3 steps: Build Model, Test Model, Use Model (to predict the class of samples).
Construct a model (classifier) from a training set whose tuples have class-column values (class labels); then use the model to classify unclassified samples (data that does not yet have class-column values). The assumption is that the training-set relationship between the non-class attribute values and the class labels is typical, so we build a model to approximate that relationship.
[Diagram: unclassified samples enter the INPUT hopper of the MODEL (classifier); classified tuples exit on the OUTPUT conveyor.]
Typical applications: credit approval, target marketing, medical diagnosis, treatment-effectiveness analysis.

12 Eager Classifiers
TRAINING PHASE: a classification algorithm creates the classifier (or model) from the training data during the training phase, e.g., as a rule set:
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
CLASSIFICATION (USE) PHASE: unclassified samples (e.g., Joe, Assistant Prof, 5) are fed into the INPUT hopper of the classifier, and classified tuples come out on the OUTPUT conveyor.

13 Test Process
Usually some of the training tuples are set aside as a Test Set; after a model is constructed, the test tuples are run through the model. The model is acceptable if, e.g., the percentage of correct classifications exceeds 60%; if not, the model is rejected (never used).
Testing Data:
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Associate Prof | 5     | yes
Joseph  | Assistant Prof | 7     | no
Correct = 3, Incorrect = 1, so 75% correct classifications. Since 75% is above the acceptability threshold, accept the model!

14 Classification by Decision Tree Induction
- Decision tree:
  - Each internal node denotes a test on an attribute (the test attribute for that node).
  - Each branch represents an outcome of the test (a value of the test attribute).
  - Leaf nodes represent class-label decisions (the plurality class of the leaf is the predicted class).
- Decision tree model development consists of two phases:
  - Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes.
  - Tree pruning: identify and remove branches that reflect noise or outliers.
- Decision tree use: classify unclassified samples by filtering them down the decision tree to their proper leaf, then predict the plurality class of that leaf (often only one class remains, depending upon the stopping condition of the construction phase).

15 Algorithm for Decision Tree Induction
Basic ID3 algorithm (a simple greedy top-down algorithm):
- At the start, the current node is the root and all the training tuples are at the root.
- Repeat, down each branch, until a stopping condition is true: at the current node, choose a decision attribute (e.g., the one with the largest information gain). Each value of that decision attribute is associated with a link to the next level down, and that value is used as the selection criterion of that link. Each new level produces a partition of the parent training subset based on the selection value assigned to its link.
- Stopping conditions:
  - All samples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
  - There are no samples left.
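Below is a compact sketch of the greedy ID3 loop just described (my own simplified implementation for categorical attributes, using information gain; not the exact classroom code, and with no pruning).

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def id3(rows, attributes, target):
    """rows: list of dicts; attributes: candidate split attributes; target: class column."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                     # all samples belong to one class
        return labels[0]
    if not attributes:                            # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                                  # information gain of splitting on a
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [r[target] for r in rows if r[a] == v]
            g -= len(sub) / len(rows) * entropy(sub)
        return g

    best = max(attributes, key=gain)              # attribute with the largest gain
    tree = {}
    for v in set(r[best] for r in rows):          # one branch per attribute value
        subset = [r for r in rows if r[best] == v]
        tree[(best, v)] = id3(subset, [a for a in attributes if a != best], target)
    return tree
```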

16 Bayesian Classification
A Bayesian classifier is a statistical classifier based on the following theorem, known as Bayes' theorem.
Bayes' theorem: let X be a data sample whose class label is unknown, and let H be the hypothesis that X belongs to a particular class. P(H|X) is the conditional probability of H given X, and P(H) is the prior probability of H. Then
P(H|X) = P(X|H) P(H) / P(X).

17 Naïve Bayesian Classification
- Given a training set R(A1..An, C), where C = {C1..Cm} is the class label attribute.
- The naïve Bayesian classifier predicts the class of an unknown data sample X to be the class Cj with the highest conditional probability given X: P(Cj|X) ≥ P(Ci|X) for all i ≠ j.
- From Bayes' theorem, P(Cj|X) = P(X|Cj) P(Cj) / P(X). P(X) is constant for all classes, so we maximize P(X|Cj) P(Cj).
- To reduce the computational complexity of calculating all the P(X|Cj)'s, make the naïve assumption of class-conditional independence: estimate P(X|Cj) as the product of the per-attribute probabilities P(xk|Cj).
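A minimal sketch of the naïve Bayesian classifier just described, for categorical attributes (names are illustrative; a real implementation would also add Laplace smoothing to avoid zero probabilities for unseen values).

```python
from collections import Counter, defaultdict

def train_nb(rows, target):
    """Estimate P(Cj) and P(attribute = value | Cj) by simple counting."""
    prior = Counter(r[target] for r in rows)
    cond = defaultdict(Counter)          # (class, attribute) -> Counter of values
    for r in rows:
        for a, v in r.items():
            if a != target:
                cond[(r[target], a)][v] += 1
    return prior, cond, len(rows)

def predict_nb(x, prior, cond, n):
    """argmax_j P(x|Cj) P(Cj), with P(x|Cj) taken as a product of per-attribute probabilities."""
    best, best_score = None, -1.0
    for cls, cls_count in prior.items():
        score = cls_count / n                            # P(Cj)
        for a, v in x.items():
            score *= cond[(cls, a)][v] / cls_count       # P(a=v | Cj); 0 if unseen (no smoothing)
        if score > best_score:
            best, best_score = cls, score
    return best
```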

18 Neural Networks
- Advantages: prediction accuracy is generally high; robust (works when training examples contain errors); output may be discrete, real-valued, or a vector of several discrete or real-valued attributes; fast evaluation of the learned target function.
- Criticism: the learned function (the weights) is difficult to understand; not easy to incorporate domain knowledge; long training time.

19 A Neuron
The n-dimensional input vector x is mapped into the variable y by means of a scalar product and a nonlinear function mapping.
[Diagram: inputs x0..xn enter with weights w0..wn; their weighted sum, offset by the bias μk, is passed through the activation function f to produce the output y.]

20 Network Training
- The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classify correctly.
- Steps:
  - Initialize the weights with random values.
  - Feed the input tuples into the network.
  - For each unit: compute the net input to the unit as a linear combination of all the inputs to the unit; compute the output value using the activation function; compute the error; update the weights and the bias.
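The steps above can be sketched for a single neuron with a sigmoid activation (a toy illustration of the training procedure under those assumptions, not the full backpropagation needed for the multi-layer network on the next slide).

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, epochs=100, lr=0.5):
    """samples: list of (input_vector, target in {0,1}).  Returns (weights, bias)."""
    n = len(samples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]    # initialize weights randomly
    b = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, t in samples:
            net = sum(wi * xi for wi, xi in zip(w, x)) + b   # net input (linear combination)
            y = sigmoid(net)                                 # output via the activation function
            err = (t - y) * y * (1 - y)                      # error term (sigmoid derivative)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)] # update the weights
            b += lr * err                                    # update the bias
    return w, b

# Usage sketch: learn logical AND.
# w, b = train_neuron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)])
```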

21 Multi-Layer Perceptron
[Diagram: the input vector x_i feeds the input nodes; weights w_ij connect the input nodes to the hidden nodes and the hidden nodes to the output nodes, which produce the output vector.]

22 For Nearest Neighbor Classification, we need a distance (to make sense of "nearest").
A distance is a function d of two n-dimensional points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) such that:
- d(X, Y) is positive definite: if X ≠ Y, then d(X, Y) > 0; if X = Y, then d(X, Y) = 0.
- d(X, Y) is symmetric: d(X, Y) = d(Y, X).
- d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z).
Examples:
- Minkowski or Lp distance: d_p(X,Y) = (Σ_i |x_i - y_i|^p)^(1/p)
- Manhattan distance (p = 1): d_1(X,Y) = Σ_i |x_i - y_i|
- Euclidean distance (p = 2): d_2(X,Y) = sqrt(Σ_i (x_i - y_i)^2)
- Max distance (p = ∞): d_∞(X,Y) = max_i |x_i - y_i|
- Canberra distance, squared cord distance, squared chi-squared distance
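For reference, the first four of these distances can be written down directly (a small sketch; the function names are mine).

```python
def minkowski(x, y, p):
    """L_p distance between equal-length numeric sequences x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):                  # p = 1
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):                  # p = 2
    return minkowski(x, y, 2)

def max_distance(x, y):               # p = infinity
    return max(abs(a - b) for a, b in zip(x, y))

# The example on the next slide: X = (2,1), Y = (6,4) gives 7, 5.0 and 4.
print(manhattan((2, 1), (6, 4)), euclidean((2, 1), (6, 4)), max_distance((2, 1), (6, 4)))
```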

23 An Example
A two-dimensional space with X = (2,1), Y = (6,4), and the corner point Z = (6,1):
- Manhattan: d_1(X,Y) = XZ + ZY = 4 + 3 = 7
- Euclidean: d_2(X,Y) = XY = 5
- Max: d_∞(X,Y) = max(XZ, ZY) = XZ = 4
Always d_1 ≥ d_2 ≥ d_∞; in fact, for any positive integers p ≤ q, d_p(X,Y) ≥ d_q(X,Y).

24 Neighborhoods of a Point
A neighborhood (disk neighborhood) of a point T with radius r is the set of points S such that X ∈ S iff d(T, X) ≤ r. If X is a point on the boundary, d(T, X) = r.
[Diagram: the disk of diameter 2r about T is a diamond under the Manhattan distance, a circle under the Euclidean distance, and a square under the Max distance.]

25 Classical k-Nearest Neighbor Classification
- Select a suitable value for k.
- Determine a suitable distance metric.
- Find the k nearest training set points to the unclassified sample.
- Let them vote.
- Assign the class with the highest vote (the plurality class) among the k-nearest-neighbor set.
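Put together with a distance function, the classical procedure above is only a few lines (a sketch with names of my choosing; euclidean is assumed to be a distance function like the ones listed earlier).

```python
from collections import Counter

def knn_classify(sample, training, k, dist):
    """training: list of (point, class_label) pairs.  Returns the plurality class
    of the k nearest training points to sample under the distance dist."""
    nearest = sorted(training, key=lambda pc: dist(sample, pc[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Usage sketch:
# knn_classify((2, 1), [((0, 0), "a"), ((5, 5), "b"), ((1, 1), "a")], k=3, dist=euclidean)
```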

26 Closed-KNN
T is the unclassified sample. Using k = 3, traditional KNN finds the three nearest neighbors, arbitrarily selecting one point from any tie on the boundary; Closed-KNN includes all points on the boundary.
Closed-KNN yields higher classification accuracy than traditional KNN (thesis of Md Maleq Khan, 2001). The P-tree method always produces closed neighborhoods (and is faster!).

27 Performance – Accuracy (1997 dataset)
[Chart: accuracy (%) versus training set size (256, 1024, 4096, 16384, 65536, 262144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree closed-KNN, and P-tree Hobbit closed-KNN.]
The Hobbit methods use a simplified distance that takes only the highest-order bit of the difference, rather than the square root of the sum of the squared differences (as in Euclidean distance).

28 Performance – Accuracy (cont.), 1998 dataset
[Chart: accuracy (%) versus training set size (256, 1024, 4096, 16384, 65536, 262144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree Perfect Centering (closed-KNN), and P-tree Hobbit (closed-KNN).]

29 Performance – Time, 1997 dataset (both axes on a logarithmic scale)
[Chart: per-sample classification time (seconds, 0.00001 to 1) versus training set size (256 to 262144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree Perfect Centering (closed-KNN), and P-tree Hobbit (closed-KNN).]
This needs re-running using the EINring formulas?

30 Performance – Time (cont.), 1998 dataset (both axes on a logarithmic scale)
[Chart: per-sample classification time (seconds, 0.00001 to 1) versus training set size (256 to 262144 pixels) for the same six methods.]
This needs re-running using the EINring formulas?

31 3-Nearest-Neighbor Classification, using Hamming distance (= number of mismatches)
[Table: training relation keyed by (d1, d2) over 17 rows (t1t2, t1t3, t1t5, t1t6, t2t1, t2t7, t3t1, t3t2, t3t3, t3t5, t5t1, t5t3, t5t5, t5t7, t6t1, t7t2, t7t5), with bit attributes a1..a9, class label C, bit attributes a1'..a9', and C'. The relevant attributes are a5, a6, a1', a2', a3', a4' and the class label is C.]
The unclassified sample has (a5, a6, a1', a2', a3', a4') = (0, 0, 0, 0, 0, 0). Scan the table once, keeping a running list of the 3 nearest neighbors: the first three rows seed the list (distances 2, 1, 2); each later row at distance 2, 3 or 4 does not replace a current neighbor, but the row (t5, t3) at distance 1 replaces one of the distance-2 neighbors.
The final 3-nearest-neighbor vote: C = 1 wins!

32 To find all training points within distance 2 (a fairer prediction?) requires another scan.
[Table: the same training relation is scanned again; this time every training point with Hamming distance ≤ 2 from the sample is included in the vote (the distance-1 rows are already included, the distance-2 rows are added, and the rows at distance 3 or 4 are not), rather than an arbitrary set of 3 neighbors.]

33 Using P-trees: find all training points within distance 1 of a = (a5, a6, a1', a2', a3', a4') = (000000)
[Bit columns: the training attributes a1..a9, C, a1'..a9', and C' are stored as vertical bit vectors over the 17 training rows (d1, d2).]
The 1-ring of a is given by the 1-bits of the P-tree P obtained by ORing the six ANDs of the relevant attribute P-trees in which exactly one of a5, a6, a1', a2', a3', a4' is left uncomplemented and the other five are complemented (i.e., the patterns at Hamming distance exactly 1 from a). Here P = 01000000000100000.
The C=1 vote count is the root count of P^C, and the C=0 vote count is the root count of P^C'. (We never need to know which tuples voted.)
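The slide's vote counting can be sketched with uncompressed bit vectors standing in for P-trees (plain integers as bitmasks; a simplification of the compressed P-tree structure, and my own naming). The neighborhood mask is built with ANDs and ORs of attribute columns and their complements, and the votes are popcounts of the mask ANDed with the class column or its complement.

```python
def popcount(x, nbits):
    return bin(x & ((1 << nbits) - 1)).count("1")

def ring_vote(columns, class_col, sample, nbits):
    """columns: dict attr -> bit vector (one bit per training row); sample: dict attr -> 0/1.
    Builds the Hamming-distance <= 1 neighborhood mask of the sample and returns
    (votes for class 1, votes for class 0), without ever materializing the rows."""
    full = (1 << nbits) - 1
    def match(attr, bit):                 # rows agreeing with `bit` on attr
        return columns[attr] if bit else (~columns[attr] & full)
    # distance 0: agree on every relevant attribute
    mask = full
    for a, bit in sample.items():
        mask &= match(a, bit)
    # distance exactly 1: disagree on exactly one attribute, agree on the rest
    for flipped in sample:
        m = full
        for a, bit in sample.items():
            m &= match(a, bit ^ 1) if a == flipped else match(a, bit)
        mask |= m
    return popcount(mask & class_col, nbits), popcount(mask & ~class_col, nbits)
```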

34 a's 2-ring, for a = (a5, a6, a1', a2', a3', a4') = (000000)
[Bit columns as on the previous slide.] For each of the fifteen P-trees formed by ANDing the six relevant attribute P-trees with exactly two of them left uncomplemented (the C(6,2) = 15 patterns at Hamming distance exactly 2 from a), a 1-bit corresponds to a training point in a's 2-ring. This slide evaluates the 1st line of that list of ANDs against the bit columns.

35 a's 2-ring (continued): the 2nd line of the fifteen 2-ring P-tree ANDs is evaluated against the bit columns in the same way.

36 a's 2-ring (continued): the 3rd line of the fifteen 2-ring P-tree ANDs is evaluated against the bit columns.

37 a's 2-ring (continued): the 4th line of the fifteen 2-ring P-tree ANDs is evaluated against the bit columns.

38 a's 2-ring (continued): the 5th line of the fifteen 2-ring P-tree ANDs is evaluated against the bit columns.

39 Clustering Methods
Clustering is partitioning into mutually exclusive and collectively exhaustive subsets, such that each point is very similar to (close to) the points in its component and very dissimilar to (far from) the points in the other components.
A categorization of major clustering methods:
- Partitioning methods (K-means, K-medoids, ...)
- Hierarchical methods (AGNES, DIANA, ...)
- Density-based methods
- Grid-based methods
- Model-based methods

40 The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps (it assumes the partitioning criterion is to maximize intra-cluster similarity and minimize inter-cluster similarity; of course, a heuristic is used, so the method isn't really an optimization):
1. Partition the objects into k nonempty subsets (or pick k initial means).
2. Compute the mean (center) or centroid of each cluster of the current partition (if one started with k means, this step is already done). The centroid is roughly the point that minimizes the sum of dissimilarities from the cluster's points, or the sum of the square errors.
3. Assign each object to the cluster with the most similar (closest) center, then go back to step 2.
4. Stop when the new set of means doesn't change (or some other stopping condition holds).
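The four steps translate into a short loop. Here is a hedged sketch using squared Euclidean distance (a toy implementation of the heuristic described above, not an optimized one).

```python
import random

def kmeans(points, k, max_iter=100):
    """points: list of numeric tuples.  Returns (means, assignment list)."""
    means = random.sample(points, k)                        # step 1: pick k initial means
    for _ in range(max_iter):
        # step 3: assign each point to the closest current mean
        assign = [min(range(k),
                      key=lambda j: sum((p - m) ** 2 for p, m in zip(pt, means[j])))
                  for pt in points]
        # step 2: recompute each cluster's mean (centroid)
        new_means = []
        for j in range(k):
            members = [pt for pt, a in zip(points, assign) if a == j]
            new_means.append(tuple(sum(c) / len(members) for c in zip(*members))
                             if members else means[j])      # keep the old mean if a cluster empties
        if new_means == means:                              # step 4: stop when the means don't change
            return means, assign
        means = new_means
    return means, assign
```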

41 k-Means
[Diagram: scatter plots of the objects over four successive steps of k-means, showing the cluster assignments and means converging.]
Strength: relatively efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
Weaknesses:
- Applicable only when a mean is defined (e.g., a vector space).
- Need to specify k, the number of clusters, in advance.
- Sensitive to noisy data and outliers.

42 The K-Medoids Clustering Method
- Find representative objects, called medoids (a medoid must be an actual object in the cluster, whereas the mean seldom is).
- PAM (Partitioning Around Medoids, 1987):
  - Starts from an initial set of medoids.
  - Iteratively replaces one of the medoids by a non-medoid; if the swap improves the aggregate similarity measure, retain it. Do this over all medoid/non-medoid pairs.
  - PAM works for small data sets but does not scale to large data sets.
- CLARA (Clustering LARge Applications) (Kaufmann and Rousseeuw, 1990): sub-samples, then applies PAM.
- CLARANS (Clustering Large Applications based on RANdom Search) (Ng and Han, 1994): randomized the sampling.

43 Hierarchical Clustering Methods: AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990).
- Uses the single-link method (the distance between two sets is the minimum pairwise distance); other options are complete link (maximum pairwise distance), average link, ...
- Merges the nodes that are most similar.
- Eventually all nodes belong to the same cluster.

44 DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990).
- The inverse order of AGNES: initially all objects are in one cluster, which is then split according to some criterion (e.g., again, maximize some aggregate measure of pairwise dissimilarity).
- Eventually each node forms a cluster on its own.

45 Contrasting Clustering Techniques
- Partitioning algorithms: partition a dataset into k clusters, e.g., k = 3.
- Hierarchical algorithms: create a hierarchical decomposition into ever-finer partitions, either top down (divisively) or bottom up (agglomeratively).

46 Hierarchical Clustering (bottom up)
[Diagram: agglomerative clustering of objects a, b, c, d, e. Step 0 starts with singletons; Step 1 merges a and b; Step 2 merges d and e; Step 3 merges c with {d, e}; Step 4 merges everything into the single cluster {a, b, c, d, e}.]

47 Hierarchical Clustering (top down)
[Diagram: divisive clustering of a, b, c, d, e, the reverse of the agglomerative sequence: Step 0 starts with the single cluster {a, b, c, d, e} and later steps split it down to singletons.]
In either case, one gets a nice dendrogram in which any maximal anti-chain (no 2 nodes linked) is a clustering (partition).

48 Hierarchical Clustering (cont.)
Recall that any maximal anti-chain (a maximal set of nodes in which no 2 are chained) is a clustering (a dendrogram offers many).

49 Hierarchical Clustering (Cont.) But the “horizontal” anti-chains are the clusterings resulting from the top down (or bottom up) method(s).

50 Data Mining Summary
Data mining on a given table of data includes:
- Association Rule Mining (ARM) on bipartite relationships
- Clustering: partitioning methods (K-means, K-medoids, ...), hierarchical methods (AGNES, DIANA, ...), model-based methods, ...
- Classification: decision tree induction, Bayesian, neural network, k-nearest-neighbor, ...
But most data mining is done on a database, not just one table; that is, often one must first apply the appropriate SQL query to a database to get the table to be data mined. The next slides discuss vertical data methods for doing that.

51 Vertical Select-Project-Join (SPJ) Queries
A Select-Project-Join query has joins, selections and projections. Typically there is a central fact relation to which several dimension relations are to be joined (a standard STAR DW), e.g., the Student(S), Course(C), Enrol(E) STAR DB below (bit encodings are shown alongside certain attribute values):

S(s, name, gen):
|0 000|CLAY |M 0|
|1 001|THAIS|M 0|
|2 010|GOOD |F 1|
|3 011|BAID |F 1|
|4 100|PERRY|M 0|
|5 101|JOAN |F 1|

C(c, name, st, term):
|0 000|BI |ND|F 0|
|1 001|DB |ND|S 1|
|2 010|DM |NJ|S 1|
|3 011|DS |ND|F 0|
|4 100|SE |NJ|S 1|
|5 101|AI |ND|F 0|

E(s, c, grade):
|0 000|1 001|B 10|
|0 000|0 000|A 11|
|3 011|1 001|A 11|
|3 011|3 011|D 00|
|1 001|3 011|D 00|
|1 001|0 000|B 10|
|2 010|2 010|B 10|
|2 010|3 011|A 11|
|4 100|4 100|B 10|
|5 101|5 101|B 10|

The numeric attributes are stored as vertical, bit-sliced (uncompressed) columns: S.s2, S.s1, S.s0, S.g, C.c2, C.c1, C.c0, C.t, E.s2, E.s1, E.s0, E.c2, E.c1, E.c0, E.g1, E.g0 (one bit column per slice; the bit columns are omitted here). The vertical (un-bit-sliced) attributes S.name, C.name and C.st are stored as one-column vertical files.

52 Vertical preliminary Select-Project-Join Query Processing (SPJ)
The SCORE database (Students, Courses, Offerings, Rooms, Enrollments), with values shown as decimal and binary. Numeric attributes are represented vertically as P-trees (not compressed); categorical attributes are projected to a one-column vertical file.

S(s, n, gen):
|0 000|A|M|  |1 001|T|M|  |2 100|S|F|  |3 111|B|F|  |4 010|C|M|  |5 011|J|F|

C(c, n, cred):
|0 00|B|1 01|  |1 01|D|3 11|  |2 10|M|3 11|  |3 11|S|2 10|

O(o, c, r):
|0 000|0 00|0 01|  |1 001|0 00|1 01|  |2 010|1 01|0 00|  |3 011|1 01|1 01|
|4 100|2 10|0 00|  |5 101|2 10|2 10|  |6 110|2 10|3 11|  |7 111|3 11|2 10|

R(r, cap):
|0 00|30 11|  |1 01|20 10|  |2 10|30 11|  |3 11|10 01|

E(s, o, grade):
|0 000|1 001|2 10|  |0 000|0 000|3 11|  |3 011|1 001|3 11|  |3 011|3 011|0 00|  |1 001|3 011|0 00|
|1 001|0 000|2 10|  |2 010|2 010|2 10|  |2 010|7 111|3 11|  |4 100|4 100|2 10|  |5 101|5 101|2 10|

The query:
SELECT S.n, C.n
FROM S, C, O, R, E
WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r & S.g=M & C.r=2 & E.g=A & R.c=20;

[The bit-slice P-tree columns for S.s, S.g, C.c, C.r, O.o, O.c, O.r, R.r, R.c, E.s, E.o and E.g are omitted here.]

53 (Query as above.) For the selections S.g=M=1b, C.r=2=10b, E.g=A=11b and R.c=20=10b, create the selection masks using ANDs and COMPLEMENTS of the bit-slice P-trees: SM comes directly from S.g; Cr2 from C.r1 ANDed with the complement of C.r0; EgA from E.g1 ANDed with E.g0; and Rc20 from R.c1 ANDed with the complement of R.c0. [Bit-slice columns and the resulting masks omitted.]
Then apply these selection masks: the bits of non-qualifying tuples are zeroed out in the numeric slices, and their values are blanked out in the other (categorical) columns.
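The mask construction above can be sketched with integers as bit columns, one bit per tuple (illustrative only, not the actual P-tree implementation): a selection on a bit-sliced numeric attribute is the AND of each slice or its complement, chosen according to the bits of the constant.

```python
def selection_mask(slices, value, nbits, ntuples):
    """slices: list of bit columns, most significant first (slices[0] is the highest bit).
    Returns the mask of tuples whose bit-sliced attribute equals `value`."""
    full = (1 << ntuples) - 1
    mask = full
    for i, col in enumerate(slices):
        bit = (value >> (nbits - 1 - i)) & 1
        mask &= col if bit else (~col & full)   # AND the slice or its complement
    return mask

# Usage sketch for R.c = 20 = 10b over R's two cap bit slices
# (R_c1 and R_c0 are hypothetical bit-column variables):
# Rc20 = selection_mask([R_c1, R_c0], value=0b10, nbits=2, ntuples=4)
```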

54 (Query as above.) For the joins S.s=E.s, C.c=O.c, O.o=E.o and O.r=R.r, one approach is to follow an indexed-nested-loop-like method (noting that attribute P-trees ARE an index for that attribute).
The join O.r=R.r is simply part of a selection on O (R doesn't contribute output nor participate in any further operations): use the Rc20-masked R as the outer relation and O as the indexed inner relation to produce the O-selection mask. Get the 1st R.r value, 01b (there's only 1), and mask the O tuples whose O.r equals 01b by ANDing the appropriate O.r bit-slice P-tree with the complement of the other, giving OM. This is the only R.r value (if there were more, one would do the same for each and then OR those masks to get the final O-mask). Next, apply the O-mask, OM, to O. [Bit columns omitted.]

55 (Query as above.) For the final 3 joins, C.c=O.c, O.o=E.o and E.s=S.s, the same indexed-nested-loop-like method can be used:
- Get the 1st masked C.c value, 11b. Mask the corresponding O tuples: P_O.c1 ^ P_O.c0.
- Get the 1st masked O.o value, 111b. Mask the corresponding E tuples: P_E.o2 ^ P_E.o1 ^ P_E.o0.
- Get the 1st masked E.s value, 010b. Mask the corresponding S tuples: P'_S.s2 ^ P_S.s1 ^ P'_S.s0.
- Get the S.n value(s), C, pair it with the C.n value(s), S, and output the concatenation, C.n S.n.
There was just one masked tuple at each stage in this example. In general, one would loop through the masked portion of the extant domain at each level (thus, Indexed Horizontal Nested Loop, or IHNL). [Bit columns omitted.]
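As a rough, self-contained sketch of this indexed-horizontal-nested-loop idea (again with integers as bit columns, and my own function names): each surviving value of the outer relation's join attribute is turned into an equality mask over the inner relation's bit slices, and the masks for multiple values are ORed together.

```python
def join_select_mask(outer_mask, outer_slices, inner_slices, n_outer, n_inner):
    """For each outer tuple surviving outer_mask, reassemble its join value from the
    outer bit slices (most significant slice first), then build and OR together the
    equality masks that value induces on the inner relation's bit slices."""
    full_inner = (1 << n_inner) - 1
    result = 0
    for row in range(n_outer):
        if (outer_mask >> row) & 1:
            bits = [(col >> row) & 1 for col in outer_slices]   # this tuple's join value, MSB first
            eq = full_inner
            for col, bit in zip(inner_slices, bits):
                eq &= col if bit else (~col & full_inner)       # AND slice or its complement
            result |= eq
    return result
```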

56 Vertical Select-Project-Join-Classification Query
Given the previous SCORE training database (not presented as just one training table), predict what course a male student will register for, who got an A in a previous course in a room with a capacity of 20. This is a matter of applying the previous complex SPJ query first, to get the pertinent training table, and then classifying the unclassified sample (e.g., using 1-nearest-neighbor classification). The result of the SPJ is the single-row training set, (S, C), and so the prediction is course = C.

