BITMAPS & Starjoins

Indexing datacubes. Objective: speed queries up.
Traditional databases (OLTP): B-Trees
– Time and space logarithmic in the number of indexed keys.
– Dynamic, stable, and good performance under updates. (But OLAP is not about updates…)
Bitmaps:
– Space efficient.
– Difficult to update (but we don't care in a DW).
– Can effectively prune searches before looking at the data.

Bitmaps. For a relation R = (…, A, …, M), a bitmap index on A keeps one bitmap per attribute value (B8, B7, …, B1, B0); bit i of the bitmap for value v is set iff tuple i has A = v. [The example table from the slide is not reproduced here.]

Query optimization. Consider a high-selectivity-factor query with predicates on two attributes. The query optimizer builds plans:
(P1) Full relation scan (filter as you go).
(P2) Index scan on the predicate with the lower selectivity factor, followed by a temporary relation scan to filter out non-qualifying tuples using the other predicate. (Works well if the data is clustered on the first index key.)
(P3) Index scan for each predicate (separately), followed by a merge of the RID lists.

Query optimization (continued). [Diagram: plan P2 fetches blocks of data via the index on predicate 1 and then filters them with predicate 2; plan P3 scans an index for each predicate, producing two tuple (RID) lists that are then merged.]

Query optimization (continued). When using bitmap indexes, (P3) can be an easy winner! CPU operations on bitmaps (AND, OR, XOR, etc.) are more efficient than regular RID merges: just apply the binary operation to the bitmaps. (With B-trees you would have to scan the two lists and select the tuples present in both -- a merge operation.) Of course, you could build B-trees on the compound key, but you would need one for every compound predicate (an exponential number of trees…).
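To make the comparison concrete, here is a minimal sketch (not from the slides) that represents each predicate's qualifying rows as bits of a Python integer; the row count and bit patterns are invented for illustration. Intersecting the two predicates is a single bitwise AND instead of a merge of two sorted RID lists.

```python
# Rows satisfying each predicate, encoded as bitmaps (bit i set <=> row i qualifies).
# The patterns below are made-up example data.
n_rows = 8
pred1_bitmap = 0b10110010   # rows 1, 4, 5, 7 satisfy predicate 1
pred2_bitmap = 0b11010010   # rows 1, 4, 6, 7 satisfy predicate 2

both = pred1_bitmap & pred2_bitmap          # one CPU-friendly bitwise AND
qualifying_rids = [i for i in range(n_rows) if (both >> i) & 1]
print(qualifying_rids)                      # [1, 4, 7]
```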

Bitmaps and predicates. To evaluate A = a1 AND B = b2: take the bitmap for a1, AND it with the bitmap for b2, and the result is the bitmap for the tuples satisfying (a1 and b2).

Tradeoffs. Small dimension cardinality: dense bitmaps. Large dimension cardinality: sparse bitmaps, which call for compression (and pay a decompression cost).

Query strategy for star joins. Maintain join indexes between the fact table and the dimension tables: one bitmap per dimension value (a bitmap per product, per product type a … k, per location, …) marking the fact-table rows that join with that value. [The fact-table/dimension-table diagram is not reproduced here.]

Strategy example. Aggregate all sales for products of three given locations: OR the three location bitmaps to obtain the bitmap for the predicate.

Star-Joins.
Select F.S, D1.A1, D2.A2, …, Dn.An
from F, D1, D2, …, Dn
where F.A1 = D1.A1 and F.A2 = D2.A2 and … and F.An = Dn.An
and D1.B1 = 'c1' and D2.B2 = 'p2' ….
Likely strategy: for each Di, find the values of Ai such that Di.Bi = 'xi' (unless you have a bitmap index on Bi). Use the bitmap index on those Ai values to form a bitmap for the related rows of F (OR-ing the bitmaps). At this stage you have n such bitmaps; the result can be found by AND-ing them.
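A rough sketch of that strategy, assuming the join-index bitmaps already exist (the dimension names, values, and bit patterns below are hypothetical): OR the bitmaps of the qualifying values within each dimension, then AND the per-dimension results to get the qualifying fact rows.

```python
from functools import reduce

n_facts = 12
all_rows = (1 << n_facts) - 1

# Hypothetical join-index bitmaps: one bitmap per dimension value, marking the
# fact-table rows that join with that value.
location_index = {"loc1": 0b000011110000, "loc2": 0b110000000011}
product_index = {"prodA": 0b010011010101, "prodB": 0b101100101010}

# Step 1: within each dimension, OR the bitmaps of the values selected by Di.Bi = 'xi'.
location_pred = location_index["loc1"] | location_index["loc2"]   # location in (loc1, loc2)
product_pred = product_index["prodA"]                             # product = prodA

# Step 2: AND the per-dimension bitmaps to get the fact rows in the answer.
answer = reduce(lambda x, y: x & y, [location_pred, product_pred], all_rows)
print([row for row in range(n_facts) if (answer >> row) & 1])
```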

Example. Selectivity per predicate = 0.01 (predicates on the dimension tables), n statistically independent predicates, so total selectivity = 10^(-2n). With a fact table of 10^8 rows and n = 3, the answer has 10^8 / 10^6 = 100 rows; in the worst case that is 100 blocks to read. Still better than reading all the blocks in the relation (e.g., assuming 100 tuples/block, that would be 10^6 blocks!).

Design Space of Bitmap Indexes. The basic bitmap design is called a Value-list index; its focus is on the columns. If we change the focus to the rows, the index becomes, for each tuple (row), an attribute value (an integer) that can be represented in a particular way, and we can encode that value in many different ways…

Attribute value decomposition. Let C be the attribute cardinality. Consider a value v of the attribute and a base sequence $\langle b_{n-1}, \ldots, b_1 \rangle$, and define $b_n = \lceil C / \prod_{i=1}^{n-1} b_i \rceil$. Then v can be decomposed into a sequence of n digits as follows:
$v = V_1 = V_2 b_1 + v_1 = V_3 (b_2 b_1) + v_2 b_1 + v_1 = \cdots = v_n \Big(\prod_{j=1}^{n-1} b_j\Big) + \cdots + v_i \Big(\prod_{j=1}^{i-1} b_j\Big) + \cdots + v_2 b_1 + v_1$
where $v_i = V_i \bmod b_i$ and $V_i = \lfloor V_{i-1} / b_{i-1} \rfloor$.

Number systems. How do you write 576 in different bases?
Decimal: 576/100 = 5 rem 76, 76/10 = 7 rem 6, leaving 6, so 576 = 5 x 100 + 7 x 10 + 6, digits (5, 7, 6).
Binary: 576/2^9 = 1 rem 64, 64/2^8 = 0, 64/2^7 = 0, 64/2^6 = 1 rem 0, and all lower powers give 0, so 576 = 2^9 + 2^6, digits (1,0,0,1,0,0,0,0,0,0).
Mixed base (7, 7, 5, 3): 576/(7x5x3) = 576/105 = 5 rem 51; 51/(5x3) = 51/15 = 3 rem 6; 6/3 = 2 rem 0; so 576 = 5 x (7x5x3) + 3 x (5x3) + 2 x 3 + 0, digits (5, 3, 2, 0).
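A small sketch of the decomposition for an arbitrary base sequence (the function name is mine, not from the slides); it reproduces the digits worked out above.

```python
def decompose(v, bases):
    """Decompose v into digits <v_n, ..., v_1> for the base sequence <b_n, ..., b_1>,
    given most-significant component first (e.g. [7, 7, 5, 3])."""
    digits = []
    for b in reversed(bases):     # peel off the least-significant digit first
        digits.append(v % b)      # v_i = V_i mod b_i
        v //= b                   # V_{i+1} = floor(V_i / b_i)
    return list(reversed(digits))

print(decompose(576, [10, 10, 10]))   # [5, 7, 6]
print(decompose(576, [2] * 10))       # [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(decompose(576, [7, 7, 5, 3]))   # [5, 3, 2, 0]
```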

Bitmaps. The relation R = (…, A, …, M) and the value-list index on A: one bitmap per attribute value, B8 … B1, B0. [The example table is not reproduced here.]

Example. The same column indexed with a base-(3,3) value-list index (equality encoding): each value is decomposed into two digits (e.g., 3 = 1x3 + 0), and each component gets three bitmaps, B^2_2 B^2_1 B^2_0 for the high-order digit and B^1_2 B^1_1 B^1_0 for the low-order digit. [The example table is not reproduced here.]

Encoding scheme. Equality encoding: all bits set to 0 except the one that corresponds to the value. Range encoding: the v_i rightmost bits set to 0, the remaining bits set to 1.
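A minimal sketch of the two encodings for a single component, with bitmaps stored as Python ints (bit i = row i); the function names and the sample column are mine, not from the slides.

```python
def equality_encode(values, cardinality):
    """One bitmap per value v: bit i is set iff row i has the value v."""
    bitmaps = [0] * cardinality
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row
    return bitmaps

def range_encode(values, cardinality):
    """Bitmap j has bit i set iff row i's value is <= j; equivalently, for a row with
    value v, the v rightmost bitmaps are 0 for that row and the rest are 1.
    (The topmost bitmap is all ones and is usually not stored.)"""
    bitmaps = [0] * cardinality
    for row, v in enumerate(values):
        for j in range(v, cardinality):
            bitmaps[j] |= 1 << row
    return bitmaps

column = [3, 0, 8, 2, 3]              # made-up attribute values, cardinality 9
eq = equality_encode(column, 9)
rng = range_encode(column, 9)
```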

Range encoding, single component, base 9: one bitmap B_j per value j (B8 … B1, B0), where bit i of B_j is set iff tuple i has A <= j. [The example table is not reproduced here.]

Example (revisited). The base-(3,3) value-list index with equality encoding, shown again for comparison with the range-encoded version below. [The example table is not reproduced here.]

Example. The same column with a base-(3,3) range-encoded index: within each component the highest bitmap is all ones and can be dropped, so only B^2_1, B^2_0, B^1_1, B^1_0 are kept. [The example table is not reproduced here.]

Design space. [Diagram: the space of bitmap-index designs, spanned by the choice of base decomposition and by equality vs. range encoding per component.]

RangeEval. Evaluates each range predicate by computing two bitmaps: the BEQ bitmap and either BGT or BLT. RangeEval-Opt uses only <= (sketched in code below):
– A < v is the same as A <= v-1
– A > v is the same as NOT(A <= v)
– A >= v is the same as NOT(A <= v-1)

RangeEval-OPT
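The slide's figure is not reproduced here; as a stand-in, this is a minimal sketch of the <=-only rewriting, assuming a single-component range-encoded index (bitmaps as Python ints, one per value, e.g. as produced by the range_encode sketch above). It illustrates the rewriting idea only, not the RangeEval-Opt algorithm from the paper.

```python
def leq_bitmap(range_bitmaps, v, n_rows, cardinality):
    """Bitmap of rows with A <= v, using a single-component range-encoded index."""
    if v < 0:
        return 0                          # no row satisfies A <= -1
    if v >= cardinality - 1:
        return (1 << n_rows) - 1          # every row satisfies A <= max value
    return range_bitmaps[v]

def range_predicate(op, v, range_bitmaps, n_rows, cardinality):
    """Rewrite every comparison in terms of '<=' only."""
    all_rows = (1 << n_rows) - 1
    if op == "<=":
        return leq_bitmap(range_bitmaps, v, n_rows, cardinality)
    if op == "<":    # A < v   ==  A <= v-1
        return leq_bitmap(range_bitmaps, v - 1, n_rows, cardinality)
    if op == ">":    # A > v   ==  NOT(A <= v)
        return all_rows & ~leq_bitmap(range_bitmaps, v, n_rows, cardinality)
    if op == ">=":   # A >= v  ==  NOT(A <= v-1)
        return all_rows & ~leq_bitmap(range_bitmaps, v - 1, n_rows, cardinality)
    raise ValueError("unknown operator: " + op)
```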

Classification vs. Prediction.
Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Prediction:
– models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications:
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis

Pros: fast; rules easy to interpret; handles high-dimensional data. Cons: no correlations; axis-parallel cuts only.
Supervised learning (classification):
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
– New data is classified based on the training set.
Unsupervised learning (clustering):
– The class labels of the training data are unknown.
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Decision tree:
– A flow-chart-like tree structure.
– An internal node denotes a test on an attribute.
– A branch represents an outcome of the test.
– Leaf nodes represent class labels or class distributions.
Decision tree generation consists of two phases:
– Tree construction: at the start, all training examples are at the root; partition the examples recursively based on selected attributes.
– Tree pruning: identify and remove branches that reflect noise or outliers.
Use of a decision tree (classifying an unknown sample): test the attribute values of the sample against the decision tree.

Algorithm for Decision Tree Induction.
Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner.
– At the start, all the training examples are at the root.
– Attributes are categorical (continuous-valued attributes are discretized in advance).
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
– All samples for a given node belong to the same class.
– There are no remaining attributes for further partitioning (majority voting is used to label the leaf).
– There are no samples left.
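As a rough illustration of this greedy procedure (my sketch, not code from the course), here is a compact recursive builder for categorical attributes; `score_split` stands for the selection measure, e.g. the information gain defined a few slides below.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes, score_split):
    """Top-down, recursive, divide-and-conquer induction over categorical attributes.
    rows: list of dicts {attribute: value}; labels: class label per row."""
    # Stopping conditions: pure node, or no attributes / no samples left.
    if not rows or not attributes or len(set(labels)) == 1:
        return {"leaf": majority(labels) if labels else None}

    def partition(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], ([], []))
            groups[row[attr]][0].append(row)
            groups[row[attr]][1].append(label)
        return groups

    # Greedy choice: the attribute whose partition scores best.
    best = max(attributes,
               key=lambda a: score_split(labels, [g[1] for g in partition(a).values()]))
    remaining = [a for a in attributes if a != best]
    children = {value: build_tree(sub_rows, sub_labels, remaining, score_split)
                for value, (sub_rows, sub_labels) in partition(best).items()}
    return {"split_on": best, "children": children, "majority": majority(labels)}
```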

Decision tree algorithms.
Building phase: recursively split nodes using the best splitting attribute and value for the node.
Pruning phase: a smaller (yet imperfect) tree achieves better prediction accuracy; prune leaf nodes recursively to avoid over-fitting.
Data types:
– Numerically ordered: values are ordered and can be represented on the real line (e.g., salary).
– Categorical: values from a finite set with no natural ordering (e.g., color).
– Ordinal: values from a finite set with a clear ordering, but unknown distances between values (e.g., a preference scale: good, fair, bad).

Some probability... Let S be the set of cases and freq(Ci, S) the number of cases in S that belong to class Ci.
Prob("this case belongs to Ci") = freq(Ci, S)/|S|.
Information conveyed: $-\log\big(\mathrm{freq}(C_i,S)/|S|\big)$.
Entropy (expected information): $\mathrm{info}(S) = -\sum_i \frac{\mathrm{freq}(C_i,S)}{|S|}\,\log\frac{\mathrm{freq}(C_i,S)}{|S|}$.
Gain of a test X: $\mathrm{info}_X(T) = \sum_i \frac{|T_i|}{|T|}\,\mathrm{info}(T_i)$ and $\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T)$.

PROBLEM: What is the best predictor to segment on: windy or outlook?

Problem with Gain. Strong bias towards tests with many outcomes. Example: Z = Name, so |Ti| = 1 for every outcome (each name is unique). Then $\mathrm{info}_Z(T) = \sum_i \frac{1}{|T|}\,\mathrm{info}(T_i) = 0$, since each $T_i$ contains a single case. Maximal gain!! (but a useless division -- overfitting).

Split. $\mathrm{split\text{-}info}(X) = -\sum_i \frac{|T_i|}{|T|}\,\log\frac{|T_i|}{|T|}$ and $\mathrm{gain\text{-}ratio}(X) = \mathrm{gain}(X)/\mathrm{split\text{-}info}(X)$. Since gain <= log(k) (k classes) while split-info grows up to log(n) for a test with n outcomes, the ratio stays small for tests with many outcomes.
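The measures above written out as small Python helpers (a sketch; the function names are mine, and logarithms are taken in base 2, i.e., information in bits). They can be plugged in as the `score_split` of the induction sketch earlier.

```python
from collections import Counter
from math import log2

def info(labels):
    """info(S) = - sum_i freq(Ci,S)/|S| * log(freq(Ci,S)/|S|)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_x(partitions):
    """info_X(T) = sum_i |Ti|/|T| * info(Ti), for the partition induced by test X."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * info(p) for p in partitions)

def gain(labels, partitions):
    return info(labels) - info_x(partitions)

def split_info(partitions):
    """split-info(X) = - sum_i |Ti|/|T| * log(|Ti|/|T|)."""
    n = sum(len(p) for p in partitions)
    return -sum(len(p) / n * log2(len(p) / n) for p in partitions)

def gain_ratio(labels, partitions):
    """gain-ratio(X) = gain(X) / split-info(X); penalizes tests with many outcomes."""
    return gain(labels, partitions) / split_info(partitions)
```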

The generated tree may overfit the training data:
– Too many branches, some of which may reflect anomalies due to noise or outliers.
– The result is poor accuracy for unseen samples.
Two approaches to avoid overfitting:
– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. (It is difficult to choose an appropriate threshold.)
– Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees; use a data set different from the training data to decide which is the "best pruned tree".
Approaches to determine the final tree size:
– Separate training (2/3) and testing (1/3) sets.
– Use cross-validation, e.g., 10-fold cross-validation.
– Use all the data for training but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node improves the distribution.
– Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.

Gini Index (IBM IntelligentMiner). If a data set T contains examples from n classes, the gini index is defined as $\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$, where $p_j$ is the relative frequency of class j in T. If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is $\mathrm{gini}_{\mathrm{split}}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$. The attribute that provides the smallest $\mathrm{gini}_{\mathrm{split}}(T)$ is chosen to split the node (this requires enumerating all possible split points for each attribute).
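A sketch of the two definitions plus the split-point enumeration mentioned in the last sentence (the function names are mine; the data is assumed small enough to rescan per candidate split).

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """gini_split(T) = N1/N * gini(T1) + N2/N * gini(T2)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

def best_numeric_split(values, labels):
    """Enumerate candidate split points (midpoints between consecutive distinct sorted
    values) and return (smallest gini_split, threshold)."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        best = min(best, (gini_split(left, right), threshold))
    return best
```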

Training set. [Table: the example training set with attributes Age and Car and its class label, together with the attribute lists built from it, is not reproduced here.] Problem: what is the best way to determine risk, Age or Car?

Splits. [Diagram: the split Age < 27.5 partitions the attribute lists into Group 1 and Group 2.]

Histograms. For continuous attributes, a pair of histograms (C_above, C_below) is associated with each node: C_below holds the class distribution of the records already processed, C_above that of the records still to process.

ANSWER. The winner is the split Age <= 18.5. [The resulting tree diagram is not reproduced here.]

Summary. Classification is an extensively studied problem (mainly in statistics, machine learning and neural networks). It is probably one of the most widely used data mining techniques, with a lot of extensions. Scalability is still an important issue for database applications, so combining classification with database techniques should be a promising topic. Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data.

Association rules: the Apriori paper; the student-plays-basketball example.

Chapter 6: Mining Association Rules in Large Databases
– Association rule mining
– Mining single-dimensional Boolean association rules from transactional databases
– Mining multilevel association rules from transactional databases
– Mining multidimensional association rules from transactional databases and data warehouse
– From association mining to correlation analysis
– Summary

Association Rules. Market basket data: your "supermarket" basket contains {bread, milk, beer, diapers, …}. Find rules that correlate the presence of one set of items X with another set Y.
– Ex: X = diapers, Y = beer; X => Y with confidence 98%.
– Maybe constrained: e.g., consider only female customers.

Applications. Market basket analysis: tell me how I can improve my sales by attaching promotions to "best seller" itemsets. Marketing: "people who bought this book also bought…". Fraud detection: a claim for immunizations always comes with a claim for a doctor's visit on the same day. Shelf planning: given the "best sellers", how do I organize my shelves?

Association Rules: Basic Concepts. Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in a visit). Find: all rules that correlate the presence of one set of items with that of another set of items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done.

Association Rule Mining: A Road Map. Boolean vs. quantitative associations (based on the types of values handled):
– buys(x, "SQLServer") ^ buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
– age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
Single-dimensional vs. multi-dimensional associations (see the examples above).

Road map (continued). Single-level vs. multiple-level analysis:
– What brands of beer are associated with what brands of diapers?
Various extensions:
– Correlation and causality analysis. Association does not necessarily imply correlation or causality. Causality: does Beer => Diapers or Diapers => Beer (i.e., did the customer buy the diapers because he bought the beer, or was it the other way around)? Correlation: 90% buy coffee, 25% buy tea, 20% buy both; the support is less than the expected support = 0.9 x 0.25 = 0.225.
– Maxpatterns and closed itemsets.
– Constraints enforced, e.g., small sales (sum < 1,000)?

Mining Association Rules: An Example. For the rule A => C: support = support({A, C}) = 50%; confidence = support({A, C})/support({A}) = 66.6%. The Apriori principle: any subset of a frequent itemset must be frequent. (Min. support 50%, min. confidence 50%; the example transaction table is not reproduced here.)

Mining Frequent Itemsets: the Key Step. Find the frequent itemsets: the sets of items that have minimum support.
– A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Then use the frequent itemsets to generate association rules.

Problem decomposition. Two phases:
1. Generate all itemsets whose support is above a threshold. Call them large (or hot) itemsets; any other itemset is small. How? Generating all combinations is exponential. (HARD.)
2. For a given large itemset Y = I1 I2 … Ik, k >= 2, generate (at most k) rules X => Ij with X = Y - {Ij}; the confidence is support(Y)/support(X), so pick a threshold c and keep only the rules whose confidence reaches it. (EASY.)

Examples. Assume s = 50% and c = 80%. With minimum support 50%, the large itemsets include {a, b} and {a, c}. Candidate rules:
– a => b with support 50% and confidence 66.6%
– a => c with support 50% and confidence 66.6%
– c => a with support 50% and confidence 100%
– b => a with support 50% and confidence 100%
(The underlying transaction table is not reproduced here.)

The Apriori Algorithm. Join step: Ck is generated by joining L(k-1) with itself. Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
  L1 = {frequent items};
  for (k = 1; Lk is not empty; k++) do begin
      C(k+1) = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in C(k+1) that are contained in t;
      L(k+1) = candidates in C(k+1) with min_support;
  end
  return the union over k of Lk;

The Apriori Algorithm: an example. [Figure: scanning database D yields the candidate set C1 and frequent set L1, then C2 and L2, then C3 and L3; the tables are not reproduced here.]

How to Generate Candidates? Suppose the items in L(k-1) are listed in an order.
Step 1: self-joining L(k-1):
  insert into Ck
  select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
  from L(k-1) p, L(k-1) q
  where p.item1 = q.item1, …, p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)
Step 2: pruning:
  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in L(k-1)) then delete c from Ck

Candidate generation (example). [Figure: joining L2 with itself yields C3; candidates containing {1,5} or {1,2} are pruned, since those 2-itemsets do not have enough support.]
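A compact Python sketch of the level-wise algorithm above (a simplified illustration, not an efficient implementation; `min_support` is a fraction of the transactions, and the join step is simplified to "union of two frequent k-itemsets of size k+1" followed by the prune step).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining, following the pseudo-code above."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def frequent_of(itemsets):
        counts = {c: 0 for c in itemsets}
        for t in transactions:               # one pass over the database per level
            for c in itemsets:
                if c <= t:
                    counts[c] += 1
        return {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}

    # L1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = frequent_of(items)
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Join step (simplified): unions of frequent k-itemsets that have k+1 items.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step: drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = frequent_of(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent      # itemset -> support count
```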

Is Apriori Fast Enough? Performance Bottlenecks. The core of the Apriori algorithm:
– Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets.
– Use database scans and pattern matching to collect counts for the candidate itemsets.
The bottleneck of Apriori is candidate generation:
– Huge candidate sets: 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate on the order of 2^100 (about 10^30) candidates.
– Multiple scans of the database: it needs (n + 1) scans, where n is the length of the longest pattern.

Mining Frequent Patterns Without Candidate Generation. Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
– highly condensed, but complete for frequent pattern mining
– avoids costly database scans
Develop an efficient, FP-tree-based frequent pattern mining method:
– a divide-and-conquer methodology: decompose mining tasks into smaller ones
– avoid candidate generation: sub-database tests only!

Construct FP-tree from a Transaction DB (min_support = 0.5, i.e., 3 of the 5 transactions):

TID   Items bought                   (ordered) frequent items
100   {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
200   {a, b, c, f, l, m, o}          {f, c, a, b, m}
300   {b, f, h, j, o}                {f, b}
400   {b, c, k, s, p}                {c, b, p}
500   {a, f, c, e, l, p, m, n}       {f, c, a, m, p}

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3.
Steps: 1. Scan the DB once and find the frequent 1-itemsets (single-item patterns). 2. Order the frequent items in each transaction in descending frequency order. 3. Scan the DB again and construct the FP-tree by inserting each ordered transaction. [The tree diagram with node counts (f:4, c:3, a:3, m:2, p:2, b:1, …) is not reproduced here.]
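A minimal sketch of steps 1-3 (class and function names are mine; conditional-pattern mining on the tree is not shown, and frequency ties are broken alphabetically, so the exact item order may differ from the slide).

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_count):
    # Step 1: scan the DB once and count single items.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    frequent = {i: c for i, c in freq.items() if c >= min_count}

    # Step 2: keep only frequent items, ordered by descending frequency.
    def ordered(t):
        return sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))

    # Step 3: scan the DB again and insert each ordered transaction into the tree.
    root, header = FPNode(), defaultdict(list)   # header: item -> node links
    for t in transactions:
        node = root
        for item in ordered(t):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, frequent

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, frequent = build_fptree(db, min_count=3)   # min_support 0.5 of 5 rows
print(frequent)   # frequent items and counts: f:4, c:4, a:3, b:3, m:3, p:3
```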

Chapter 6: Mining Association Rules in Large Databases
– Association rule mining
– Mining single-dimensional Boolean association rules from transactional databases
– Mining multilevel association rules from transactional databases
– Mining multidimensional association rules from transactional databases and data warehouse
– From association mining to correlation analysis
– Constraint-based association mining
– Summary

Interestingness Measurements. Objective measures: two popular measurements are (1) support and (2) confidence. Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if (1) it is unexpected (surprising to the user) and/or (2) actionable (the user can do something with it).

Criticism of Support and Confidence. Example 1 (Aggarwal & Yu, PODS'98): among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal.
– play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball => not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

Criticism of Support and Confidence (cont.). We need a measure of dependent or correlated events: $\mathrm{corr}_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)}$. If corr < 1, A is negatively correlated with B (A discourages B); if corr > 1, A and B are positively correlated; $P(A \wedge B) = P(A)P(B)$ when the itemsets are independent (corr = 1). P(B|A)/P(B) is also called the lift of the rule A => B (we want positive lift, i.e., lift > 1).
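Plugging in the numbers from the basketball/cereal example above as a small check:

```python
n = 5000
p_basketball = 3000 / n          # 0.60
p_cereal = 3750 / n              # 0.75
p_both = 2000 / n                # 0.40

corr = p_both / (p_basketball * p_cereal)   # lift of basketball => cereal
print(round(corr, 3))                       # 0.889 < 1: negatively correlated
```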

Why Is the Big Pie Still There? More on constraint-based mining of associations:
– Boolean vs. quantitative associations; association on discrete vs. continuous data.
– From association to correlation and causal structure analysis: association does not necessarily imply correlation or causal relationships.
– From intra-transaction associations to inter-transaction associations, e.g., breaking the barriers of transactions (Lu, et al., TOIS'99).
– From association analysis to classification and clustering analysis, e.g., clustering association rules.

Summary. Association rule mining is probably the most significant contribution from the database community to KDD, and a large number of papers have been published. Many interesting issues have been explored. An interesting research direction is association analysis in other types of data: spatial data, multimedia data, time series data, etc.

Apriori: some products and free software available for association rule mining: Business Miner, Clementine, Darwin, Data Surveyor (www.ddi.nl/), DBMiner, Delta Miner, Decision Series, IDIS, Intelligent Miner, MineSet, MLC++, MSBN, SuperQuery, Weka.

K-means clustering

BIRCH uses summary information – bonus question

STUDY QUESTIONS. Some sample questions on the data mining part. You may practice by yourself; no need to hand in.
1. Given the transaction table:
TID   List of items
T1    1, 2, 5
T2    2, 4
T3    2, 3
T4    1, 2, 4
T5    1, 3
T6    2, 3
T7    1, 3
T8    1, 2, 3, 5
T9    1, 2, 3
1) If min_sup = 2/9, apply the Apriori algorithm to get all the frequent itemsets; show the steps.
2) If min_conf = 50%, show all the association rules generated from L3 (the large itemsets containing 3 items).
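For self-checking, question 1.1 can be run through the Apriori sketch given earlier (the `apriori` function is the hypothetical one defined above, not something required by the exercise):

```python
db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
frequent = apriori(db, min_support=2 / 9)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```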

STUDY QUESTIONS. 2. Assume we have the following association rules with min_sup = s and min_conf = c: A => B (s1, c1), B => C (s2, c2), C => A (s3, c3). Express P(A), P(B), P(C), P(AB), P(BC), P(AC), P(B|A), P(C|B), P(C|A) in terms of these values, and show the conditions under which we can get A => C.

STUDY QUESTIONS. 3. Given the following table (not reproduced here), apply the SPRINT algorithm to build a decision tree. (The measure is gini.)

STUDY QUESTIONS. 4. Apply k-means to cluster the following 8 points into 3 clusters. The distance function is Euclidean distance. Assume that initially we assign A1, B1, and C1 as the center of each cluster, respectively. The 8 points are: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). Show: the three cluster centers after the first round of execution, and the final three clusters.