BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of.


1 BITMAPS & Starjoins

2 Indexing datacubes
Objective: speed queries up.
Traditional databases (OLTP): B-Trees
–Time and space logarithmic in the number of indexed keys.
–Dynamic, stable, and exhibit good performance under updates. (But OLAP is not about updates.)
Bitmaps:
–Space efficient.
–Difficult to update (but we don't care in a DW).
–Can effectively prune searches before looking at data.

3 Bitmaps
R = (…, A, …, M); one bitmap B_v per value v of attribute A, one bit per row:

R(A) | B8 B7 B6 B5 B4 B3 B2 B1 B0
  3  |  0  0  0  0  0  1  0  0  0
  2  |  0  0  0  0  0  0  1  0  0
  1  |  0  0  0  0  0  0  0  1  0
  2  |  0  0  0  0  0  0  1  0  0
  8  |  1  0  0  0  0  0  0  0  0
  2  |  0  0  0  0  0  0  1  0  0
  2  |  0  0  0  0  0  0  1  0  0
  0  |  0  0  0  0  0  0  0  0  1
  7  |  0  1  0  0  0  0  0  0  0
  5  |  0  0  0  1  0  0  0  0  0
  6  |  0  0  1  0  0  0  0  0  0
  4  |  0  0  0  0  1  0  0  0  0
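
A minimal sketch of how such a value-list index could be built, assuming each bitmap is stored as a Python integer whose bit i stands for row i. The function name `value_list_index` is hypothetical; the column is the one shown on the slide.

```python
def value_list_index(column, cardinality):
    """Return one bitmap per attribute value (illustrative helper).

    Bit i of bitmaps[v] is 1 exactly when row i of the column holds value v.
    """
    bitmaps = [0] * cardinality
    for row, value in enumerate(column):
        bitmaps[value] |= 1 << row
    return bitmaps


# Column A from the slide, rows numbered 0..11 from top to bottom.
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
B = value_list_index(A, 9)

# Rows whose value is 2, read back from bitmap B[2].
rows_with_2 = [i for i in range(len(A)) if B[2] >> i & 1]
print(rows_with_2)
```

Every row sets exactly one bit across all bitmaps, which is why the bitmaps get sparser as the attribute cardinality grows.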

4 Query optimization
Consider a high-selectivity-factor query with predicates on two attributes. The query optimizer builds plans:
(P1) Full relation scan (filter as you go).
(P2) Index scan on the predicate with the lower selectivity factor, followed by a scan of the temporary relation to filter out non-qualifying tuples using the other predicate. (Works well if data is clustered on the first index key.)
(P3) Index scan for each predicate (separately), followed by a merge of the RID lists.

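5 Query optimization (continued)
[Figure: (P2) the index on Pred1 yields tuples t1 … tn; their data blocks are then filtered by Pred. 2 to give the answer. (P3) the indexes on Pred1 and Pred2 yield tuple list 1 and tuple list 2, which are combined into the merged list.]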

6 Query optimization (continued)
When using bitmap indexes, (P3) can be an easy winner! CPU operations on bitmaps (AND, OR, XOR, etc.) are more efficient than regular RID merges: just apply the binary operations to the bitmaps. (With B-trees, you would have to scan the two RID lists and select the tuples that appear in both: the merge operation.) Of course, you can build B-trees on the compound key, but you would need one for every compound predicate (an exponential number of trees).
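
To make the contrast concrete, here is a small sketch that answers the same two-predicate query both ways. The RID lists are made-up data and the helper names are hypothetical; bitmaps are modeled as Python integers, so the AND is a single `&`.

```python
def merge_rids(list1, list2):
    """Intersect two sorted RID lists with a two-pointer merge (the B-tree way)."""
    i = j = 0
    out = []
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            out.append(list1[i])
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            i += 1
        else:
            j += 1
    return out


def to_bitmap(rids):
    """Set bit r for every RID r."""
    bitmap = 0
    for rid in rids:
        bitmap |= 1 << rid
    return bitmap


def to_rids(bitmap):
    """List the positions of the 1 bits."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


# Hypothetical RID lists returned by the two index scans.
rids_pred1 = [1, 3, 5, 6, 9]
rids_pred2 = [0, 3, 6, 7, 9]

both = to_bitmap(rids_pred1) & to_bitmap(rids_pred2)  # one bitwise AND
print(to_rids(both), merge_rids(rids_pred1, rids_pred2))
```

Both routes return the same rows; the bitmap route replaces the element-by-element comparison loop with word-wide bit operations.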

7 Bitmaps and predicates
A = a1 AND B = b2
(Bitmap for a1) AND (Bitmap for b2) = Bitmap for a1 and b2

8 Tradeoffs
Small dimension cardinality: dense bitmaps.
Large dimension cardinality: sparse bitmaps, which call for compression (and decompression).

9 Query strategy for Star joins
Maintain join indexes between the fact table and the dimension tables.
[Figure: fact table joined to the Prod. dimension table, with types a … k. Bitmaps over the fact table: one per product, one per type (a … k), and one per location.]

10 Strategy example
Aggregate all sales for products of any one of three given locations.
(Bitmap for the first location) OR (Bitmap for the second) OR (Bitmap for the third) = Bitmap for the predicate

11 Star-Joins
Select F.S, D1.A1, D2.A2, …, Dn.An
from F, D1, D2, …, Dn
where F.A1 = D1.A1 and F.A2 = D2.A2 and … and F.An = Dn.An
  and D1.B1 = 'c1' and D2.B2 = 'p2' and …
Likely strategy:
1. For each Di, find the values of Ai such that Di.Bi = 'xi' (unless you have a bitmap index for Bi).
2. Use the bitmap index on those Ai values to form a bitmap for the related rows of F (OR-ing the bitmaps).
3. At this stage you have n such bitmaps; the result can be found by AND-ing them.
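
The three steps can be sketched on a toy star schema. Everything below (table contents, key names, the helpers `join_bitmaps` and `dimension_bitmap`) is an assumption made up for illustration; bitmaps are again Python integers with bit i standing for fact row i.

```python
# Hypothetical fact table: one foreign key per dimension in each row.
fact_product = ["p1", "p2", "p1", "p3", "p2", "p3"]  # F.A1
fact_store = ["s1", "s1", "s2", "s2", "s3", "s1"]    # F.A2

# Hypothetical dimension tables: key Ai -> descriptive attribute Bi.
product_type = {"p1": "a", "p2": "k", "p3": "a"}     # D1
store_city = {"s1": "x", "s2": "y", "s3": "x"}       # D2


def join_bitmaps(fact_column):
    """Bitmap join index: dimension key -> bitmap of matching fact rows."""
    bitmaps = {}
    for row, key in enumerate(fact_column):
        bitmaps[key] = bitmaps.get(key, 0) | (1 << row)
    return bitmaps


def dimension_bitmap(dimension, wanted, bitmaps):
    """Steps 1 and 2: find qualifying keys, then OR their bitmaps."""
    result = 0
    for key, attribute in dimension.items():
        if attribute == wanted:
            result |= bitmaps.get(key, 0)
    return result


by_product = join_bitmaps(fact_product)
by_store = join_bitmaps(fact_store)

# Step 3: AND the per-dimension bitmaps.
answer = (dimension_bitmap(product_type, "a", by_product)
          & dimension_bitmap(store_city, "x", by_store))
matching_rows = [i for i in range(len(fact_product)) if answer >> i & 1]
print(matching_rows)
```

Only the fact rows left in `answer` need to be fetched and aggregated.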

12 Example
Selectivity per predicate = 0.01 (predicates on the dimension tables); n predicates (statistically independent).
Total selectivity = $10^{-2n}$.
Fact table = $10^8$ rows, n = 3: tuples in answer = $10^8 / 10^6 = 100$ rows. In the worst case these sit in 100 different blocks. Still better than reading all the blocks in the relation (e.g., assuming 100 tuples/block, this would be $10^6$ blocks!).
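
The arithmetic on this slide can be checked with a few lines. This is only a sketch; the function name is hypothetical and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction


def expected_answer_rows(fact_rows, selectivity, n_predicates):
    """Rows surviving n statistically independent predicates."""
    return fact_rows * selectivity ** n_predicates


rows = expected_answer_rows(10**8, Fraction(1, 100), 3)
worst_case_blocks = rows                 # every qualifying row in its own block
full_scan_blocks = 10**8 // 100          # 100 tuples per block
print(rows, worst_case_blocks, full_scan_blocks)
```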

13 Design Space of Bitmap Indexes
The basic bitmap design is called the value-list index. The focus there is on the columns. If we change the focus to the rows, the index becomes a set of attribute values (integers), one per tuple (row), that can be represented in a particular way. For example, the row for value 5:
5 | 0 0 0 1 0 0 0 0 0
We can encode this row in many ways...

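14 Attribute value decomposition
C = attribute cardinality. Consider a value of the attribute, v, and a sequence of (n-1) numbers $\langle b_{n-1}, b_{n-2}, \ldots, b_1 \rangle$. Also, define $b_n = \lceil C / \prod_{i=1}^{n-1} b_i \rceil$. Then v can be decomposed into a sequence of n digits $\langle v_n, v_{n-1}, \ldots, v_1 \rangle$ as follows:

$$
\begin{aligned}
v &= V_1 \\
  &= V_2\, b_1 + v_1 \\
  &= V_3\, (b_2 b_1) + v_2\, b_1 + v_1 \\
  &\;\;\vdots \\
  &= v_n \Big(\prod_{j=1}^{n-1} b_j\Big) + \cdots + v_i \Big(\prod_{j=1}^{i-1} b_j\Big) + \cdots + v_2\, b_1 + v_1
\end{aligned}
$$

where $v_i = V_i \bmod b_i$ and $V_i = \lfloor V_{i-1} / b_{i-1} \rfloor$.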

15 Number systems
(Decimal system!)
576 = 5 x 10 x 10 + 7 x 10 + 6
576/100 = 5 | 76
76/10 = 7 | 6
6
How do you write 576 in other bases?
Base 2:
576 = 1 x $2^9$ + 0 x $2^8$ + 0 x $2^7$ + 1 x $2^6$ + 0 x $2^5$ + 0 x $2^4$ + 0 x $2^3$ + 0 x $2^2$ + 0 x $2^1$ + 0 x $2^0$
576/$2^9$ = 1 | 64, 64/$2^8$ = 0 | 64, 64/$2^7$ = 0 | 64, 64/$2^6$ = 1 | 0, 0/$2^5$ = 0 | 0, 0/$2^4$ = 0 | 0, 0/$2^3$ = 0 | 0, 0/$2^2$ = 0 | 0, 0/$2^1$ = 0 | 0, 0/$2^0$ = 0 | 0
Bases 7, 7, 5, 3:
576/(7x7x5x3) = 576/735 = 0 | 576
576/(7x5x3) = 576/105 = 5 | 51, so 576 = 5 x (7x5x3) + 51
51/(5x3) = 51/15 = 3 | 6, so 576 = 5 x (7x5x3) + 3 x (5x3) + 6
6/3 = 2 | 0, so 576 = 5 x (7x5x3) + 3 x (5x3) + 2 x 3 + 0
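
The same decompositions can be reproduced mechanically. The sketch below assumes the bases are given most significant first, as on the slide; `decompose` and `recompose` are hypothetical helper names.

```python
def decompose(v, bases):
    """Split v into digits for the given bases (most significant base first)."""
    digits = []
    for b in reversed(bases):        # peel off the least significant digit first
        v, digit = divmod(v, b)
        digits.append(digit)
    if v:
        raise ValueError("value does not fit in the given bases")
    return digits[::-1]


def recompose(digits, bases):
    """Inverse of decompose: Horner evaluation in mixed radix."""
    v = 0
    for digit, b in zip(digits, bases):
        v = v * b + digit
    return v


print(decompose(576, [10, 10, 10]))
print(decompose(576, [2] * 10))
print(decompose(576, [7, 7, 5, 3]))
```

With bases 7, 7, 5, 3 the digits 5, 3, 2, 0 correspond to the place values 105, 15, 3, and 1.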

16 Bitmaps
R = (…, A, …, M); value-list index:

R(A) | B8 B7 B6 B5 B4 B3 B2 B1 B0
  3  |  0  0  0  0  0  1  0  0  0
  2  |  0  0  0  0  0  0  1  0  0
  1  |  0  0  0  0  0  0  0  1  0
  2  |  0  0  0  0  0  0  1  0  0
  8  |  1  0  0  0  0  0  0  0  0
  2  |  0  0  0  0  0  0  1  0  0
  2  |  0  0  0  0  0  0  1  0  0
  0  |  0  0  0  0  0  0  0  0  1
  7  |  0  1  0  0  0  0  0  0  0
  5  |  0  0  0  1  0  0  0  0  0
  6  |  0  0  1  0  0  0  0  0  0
  4  |  0  0  0  0  1  0  0  0  0

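17 Example: sequence <3,3>, value-list index (equality)
Bc^j is the bitmap for digit value j in component c (component 2 is the most significant digit).

R(A)      | B2^2 B2^1 B2^0 | B1^2 B1^1 B1^0
3 (1x3+0) |  0    1    0   |  0    0    1
2         |  0    0    1   |  1    0    0
1         |  0    0    1   |  0    1    0
2         |  0    0    1   |  1    0    0
8         |  1    0    0   |  1    0    0
2         |  0    0    1   |  1    0    0
2         |  0    0    1   |  1    0    0
0         |  0    0    1   |  0    0    1
7         |  1    0    0   |  0    1    0
5         |  0    1    0   |  1    0    0
6         |  1    0    0   |  0    0    1
4         |  0    1    0   |  0    1    0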

18 Encoding scheme
Equality encoding: all bits set to 0 except the one that corresponds to the value.
Range encoding: the v_i rightmost bits set to 0, the remaining bits set to 1.

19 Range encoding: single component, base-9

R(A) | B8 B7 B6 B5 B4 B3 B2 B1 B0
  3  |  1  1  1  1  1  1  0  0  0
  2  |  1  1  1  1  1  1  1  0  0
  1  |  1  1  1  1  1  1  1  1  0
  8  |  1  0  0  0  0  0  0  0  0
  0  |  1  1  1  1  1  1  1  1  1
  7  |  1  1  0  0  0  0  0  0  0
  5  |  1  1  1  1  0  0  0  0  0
  6  |  1  1  1  0  0  0  0  0  0
  4  |  1  1  1  1  1  0  0  0  0

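20 Example (revisited): sequence <3,3>, value-list index (equality)
Bc^j is the bitmap for digit value j in component c (component 2 is the most significant digit).

R(A)      | B2^2 B2^1 B2^0 | B1^2 B1^1 B1^0
3 (1x3+0) |  0    1    0   |  0    0    1
2         |  0    0    1   |  1    0    0
1         |  0    0    1   |  0    1    0
2         |  0    0    1   |  1    0    0
8         |  1    0    0   |  1    0    0
2         |  0    0    1   |  1    0    0
2         |  0    0    1   |  1    0    0
0         |  0    0    1   |  0    0    1
7         |  1    0    0   |  0    1    0
5         |  0    1    0   |  1    0    0
6         |  1    0    0   |  0    0    1
4         |  0    1    0   |  0    1    0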

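21 Example: sequence <3,3>, range-encoded index
Bc^j is the bitmap for "digit of component c <= j"; the all-ones bitmaps Bc^2 are not stored.

R(A) | B2^1 B2^0 | B1^1 B1^0
  3  |  1    0   |  1    1
  2  |  1    1   |  0    0
  1  |  1    1   |  1    0
  2  |  1    1   |  0    0
  8  |  0    0   |  0    0
  2  |  1    1   |  0    0
  2  |  1    1   |  0    0
  0  |  1    1   |  1    1
  7  |  0    0   |  1    0
  5  |  1    0   |  0    0
  6  |  0    0   |  1    1
  4  |  1    0   |  1    0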

22 Design Space
[Figure: the design space of bitmap indexes, with equality encoding and range encoding as the two encoding schemes.]

23 RangeEval
Evaluates each range predicate by computing two bitmaps: the BEQ bitmap and either BGT or BLT.
RangeEval-Opt uses only <=:
–A < v is the same as A <= v-1
–A > v is the same as NOT (A <= v)
–A >= v is the same as NOT (A <= v-1)
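
A minimal single-component sketch of these rewrites, assuming a range-encoded index in which bitmap j marks the rows with A <= j. It is not the RangeEval-Opt algorithm itself, only the predicate identities it relies on; the helper names are hypothetical.

```python
def range_encoded_index(column, cardinality):
    """bitmaps[j] has bit i set exactly when column[i] <= j."""
    bitmaps = [0] * cardinality
    for row, value in enumerate(column):
        for j in range(value, cardinality):
            bitmaps[j] |= 1 << row
    return bitmaps


def select(bitmaps, n_rows, op, v):
    """Evaluate 'A op v' using only the (A <= x) bitmaps and NOT."""
    all_rows = (1 << n_rows) - 1

    def le(x):
        if x < 0:
            return 0                       # nothing is below the smallest value
        return bitmaps[min(x, len(bitmaps) - 1)]

    if op == "<=":
        return le(v)
    if op == "<":
        return le(v - 1)
    if op == ">":
        return all_rows & ~le(v)
    if op == ">=":
        return all_rows & ~le(v - 1)
    raise ValueError(f"unsupported operator: {op}")


def rows(bitmap):
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
index = range_encoded_index(A, 9)
print(rows(select(index, len(A), "<=", 2)))
print(rows(select(index, len(A), ">", 5)))
```

Each predicate touches at most one stored bitmap, which is the appeal of range encoding for range queries.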

24 RangeEval-OPT

26 Classification vs. Prediction
Classification:
–predicts categorical class labels
–classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
Prediction:
–models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
–credit approval
–target marketing
–medical diagnosis
–treatment effectiveness analysis

27 Supervised learning (classification)
–Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
–New data is classified based on the training set
Unsupervised learning (clustering)
–The class labels of the training data are unknown
–Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Decision tree
–A flow-chart-like tree structure
–Internal node denotes a test on an attribute
–Branch represents an outcome of the test
–Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
–Tree construction: at start, all the training examples are at the root; partition examples recursively based on selected attributes
–Tree pruning: identify and remove branches that reflect noise or outliers
Use of decision tree: classifying an unknown sample
–Test the attribute values of the sample against the decision tree
Pros:
–Fast.
–Rules easy to interpret.
–Handles high-dimensional data.
Cons:
–No correlations.
–Axis-parallel cuts.

28 Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) –Tree is constructed in a top-down recursive divide-and-conquer manner –At start, all the training examples are at the root –Attributes are categorical (if continuous-valued, they are discretized in advance) –Examples are partitioned recursively based on selected attributes –Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning –All samples for a given node belong to the same class –There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf –There are no samples left

29 Decision tree algorithms
Building phase:
–Recursively split nodes using the best splitting attribute and value for each node
Pruning phase:
–A smaller (yet imperfect) tree achieves better prediction accuracy.
–Prune leaf nodes recursively to avoid over-fitting.
DATA TYPES
Numerically ordered: values are ordered and can be represented on the real line. (E.g., salary.)
Categorical: takes values from a finite set that has no natural ordering. (E.g., color.)
Ordinal: takes values from a finite set whose values possess a clear ordering, but the distances between them are unknown. (E.g., preference scale: good, fair, bad.)

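30 Some probability...
S = cases; freq($C_i$, S) = number of cases in S that belong to $C_i$.
Gain (entropic measure):
Prob("this case belongs to $C_i$") = freq($C_i$, S)/|S|
Information conveyed: $-\log(\mathrm{freq}(C_i,S)/|S|)$
Entropy = expected information:
$$\mathrm{info}(S) = -\sum_i \frac{\mathrm{freq}(C_i,S)}{|S|}\,\log\frac{\mathrm{freq}(C_i,S)}{|S|}$$
GAIN of a test X that partitions T into subsets $T_i$:
$$\mathrm{info}_X(T) = \sum_i \frac{|T_i|}{|T|}\,\mathrm{info}(T_i)$$
$$\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T)$$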
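
As a sketch of how these formulas are applied, the code below computes info and gain from class counts. The counts are a hypothetical 14-case, two-class set split by an "outlook"-like attribute (three outcomes) and a "windy"-like attribute (two outcomes); base-2 logarithms are an assumption, since the slide does not fix the base.

```python
from math import log2


def info(counts):
    """Entropy of a set described by its per-class case counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gain(partitions):
    """gain(X) = info(T) - info_X(T); partitions holds class counts per subset T_i."""
    class_totals = [sum(column) for column in zip(*partitions)]
    n = sum(class_totals)
    info_x = sum(sum(p) / n * info(p) for p in partitions)
    return info(class_totals) - info_x


# Hypothetical class counts [yes, no] in each subset produced by the test.
outlook = [[2, 3], [4, 0], [3, 2]]
windy = [[3, 3], [6, 2]]

print(round(gain(outlook), 4), round(gain(windy), 4))
```

On these counts the three-way test has the larger gain, so it would be chosen as the split.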

31 PROBLEM: What is the best predictor to segment on: windy or outlook?

34 Problem with Gain
Strong bias towards tests with many outcomes.
Example: Z = Name, $|T_i| = 1$ (each name unique), so with $N = |T_i| = 1$:
$$\mathrm{info}_Z(T) = \sum_i \frac{1}{|T|}\Big(-\frac{1}{N}\log\frac{1}{N}\Big) = 0$$
Maximal gain! (But a useless division: overfitting.)

35 Split
$$\text{split-info}(X) = -\sum_i \frac{|T_i|}{|T|}\,\log\frac{|T_i|}{|T|}$$
gain-ratio(X) = gain(X) / split-info(X)
Gain <= log(k), Split <= log(n) => ratio small
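
A small sketch of the gain ratio on hypothetical counts: a 14-case, two-class set, split once by a unique "name" attribute (14 singleton subsets) and once by a three-outcome attribute. Base-2 logarithms and all counts are assumptions for illustration.

```python
from math import log2


def info(counts):
    """Entropy of a list of counts (class counts or subset sizes)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gain_and_ratio(partitions):
    """Return (gain, gain ratio) for a test given class counts per subset."""
    class_totals = [sum(column) for column in zip(*partitions)]
    sizes = [sum(p) for p in partitions]
    n = sum(sizes)
    gain = info(class_totals) - sum(s / n * info(p) for s, p in zip(sizes, partitions))
    split_info = info(sizes)            # entropy of the subset sizes
    return gain, gain / split_info


name = [[1, 0]] * 9 + [[0, 1]] * 5       # every case in its own subset
outlook = [[2, 3], [4, 0], [3, 2]]

gain_name, ratio_name = gain_and_ratio(name)
gain_outlook, ratio_outlook = gain_and_ratio(outlook)
print(round(gain_name, 3), round(ratio_name, 3))
print(round(gain_outlook, 3), round(ratio_outlook, 3))
```

Dividing by split-info (here log 14 for the unique attribute) shrinks its advantage considerably, although on a set this small the unique attribute can still come out ahead.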

36 The generated tree may overfit the training data
–Too many branches, some of which may reflect anomalies due to noise or outliers
–The result is poor accuracy for unseen samples
Two approaches to avoid overfitting
–Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
–Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".
Approaches to Determine the Final Tree Size
–Separate training (2/3) and testing (1/3) sets
–Use cross validation, e.g., 10-fold cross validation
–Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
–Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized

37 Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$
where $p_j$ is the relative frequency of class j in T.
If T is split into two subsets $T_1$ and $T_2$ with sizes $N_1$ and $N_2$ respectively ($N = N_1 + N_2$), the gini index of the split data is defined as
$$\mathrm{gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$$
The attribute that provides the smallest $\mathrm{gini}_{split}(T)$ is chosen to split the node (need to enumerate all possible splitting points for each attribute).
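
The enumeration of splitting points for one numeric attribute can be sketched as follows. The six (age, risk) records are hypothetical, and candidate thresholds are taken as midpoints between consecutive distinct values, which is a common convention rather than something the slide prescribes.

```python
def gini(counts):
    """gini(T) = 1 - sum of squared class frequencies."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)


def gini_split(left, right):
    """Size-weighted gini of a binary split, given class counts per side."""
    n1, n2 = sum(left), sum(right)
    return (n1 * gini(left) + n2 * gini(right)) / (n1 + n2)


def best_numeric_split(values, labels):
    """Try every midpoint between consecutive distinct values; keep the smallest gini."""
    classes = sorted(set(labels))
    pairs = sorted(zip(values, labels))
    best = None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        threshold = (v1 + v2) / 2
        left = [sum(1 for v, c in pairs if v <= threshold and c == k) for k in classes]
        right = [sum(1 for v, c in pairs if v > threshold and c == k) for k in classes]
        score = gini_split(left, right)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best


# Hypothetical training records: age and risk class (H = high, L = low).
ages = [23, 17, 43, 68, 32, 20]
risk = ["H", "H", "H", "L", "L", "H"]
score, threshold = best_numeric_split(ages, risk)
print(round(score, 4), threshold)
```

The recounting inside the loop keeps the sketch short; a scalable implementation would update the class histograms incrementally while scanning the sorted attribute list.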

38 Training set
[Figure: the training set and its attribute lists for Age and Car.]
Problem: What is the best way to determine risk? Is it Age or Car?

39 Splits Age < 27.5 Group1 Group2

40 Histograms
For continuous attributes, two histograms are associated with each node: C_above (records still to process) and C_below (records already processed).

51 ANSWER
The winner is Age <= 18.5
[Figure: the resulting tree; surviving node labels: H, Y, N.]

52 Summary
–Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
–Classification is probably one of the most widely used data mining techniques, with a lot of extensions
–Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic
–Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.

53 Association rules: the Apriori paper; the "student plays basketball" example

54 Chapter 6: Mining Association Rules in Large Databases Association rule mining Mining single-dimensional Boolean association rules from transactional databases Mining multilevel association rules from transactional databases Mining multidimensional association rules from transactional databases and data warehouse From association mining to correlation analysis Summary

55 Association Rules
Market basket data: your "supermarket" basket contains {bread, milk, beer, diapers, …}
Find rules that correlate the presence of one set of items X with another set Y.
–Ex: X = diapers, Y = beer, X => Y with confidence 98%
–Maybe constrained: e.g., consider only female customers.

56 Applications
Market basket analysis: tell me how I can improve my sales by attaching promotions to "best seller" itemsets.
Marketing: "people who bought this book also bought…"
Fraud detection: a claim for immunizations always comes with a claim for a doctor's visit on the same day.
Shelf planning: given the "best sellers," how do I organize my shelves?

57 Association Rule: Basic Concepts Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit) Find: all rules that correlate the presence of one set of items with that of another set of items –E.g., 98% of people who purchase tires and auto accessories also get automotive services done

58 Association Rule Mining: A Road Map
Boolean vs. quantitative associations (based on the types of values handled)
–buys(x, "SQLServer") ^ buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
–age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
Single dimension vs. multiple dimensional associations (see the examples above)

59 Road-map (continuation)
Single level vs. multiple-level analysis
–What brands of beers are associated with what brands of diapers?
Various extensions
–Correlation, causality analysis. Association does not necessarily imply correlation or causality.
 Causality: does Beer => Diapers or Diapers => Beer? (I.e., did the customer buy the diapers because he bought the beer, or was it the other way around?)
 Correlation: 90% buy coffee, 25% buy tea, 20% buy both; the support is less than the expected support = 0.9 * 0.25 = 0.225.
–Maxpatterns and closed itemsets
–Constraints enforced. E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

61 Mining Association Rules: An Example
Min. support 50%, min. confidence 50%.
For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.
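
The two measures can be sketched directly from their definitions. The four transactions below are a hypothetical database chosen so that the rule A => C gets the support and confidence quoted on the slide.

```python
# Hypothetical transaction database (one set of items per transaction).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


print(support({"A", "C"}, transactions))
print(confidence({"A"}, {"C"}, transactions))
print(confidence({"C"}, {"A"}, transactions))
```

Note that confidence is not symmetric: A => C and C => A share the same support but differ in confidence.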

62 Mining Frequent Itemsets: the Key Step Find the frequent itemsets: the sets of items that have minimum support –A subset of a frequent itemset must also be a frequent itemset i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset –Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset) Use the frequent itemsets to generate association rules.

63 Problem decomposition
Two phases:
1. Generate all itemsets whose support is above a threshold. Call them large (or hot) itemsets. (Any other itemset is small.) How? Generate all combinations? (Exponential!) (HARD.)
2. For a given large itemset $Y = I_1 I_2 \ldots I_k$, $k \ge 2$, generate (at most k) rules $X \Rightarrow I_j$ with $X = Y - \{I_j\}$ and confidence c = support(Y) / support(X). So, have a threshold c and decide which ones you keep. (EASY.)

64 Examples
Assume s = 50% and c = 80%.
Minimum support 50% => itemsets {a,b} and {a,c}
Rules:
a => b with support 50% and confidence 66.6%
a => c with support 50% and confidence 66.6%
c => a with support 50% and confidence 100%
b => a with support 50% and confidence 100%
Only the last two rules reach the confidence threshold c = 80%.

65 The Apriori Algorithm
Join Step: C_k is generated by joining L_{k-1} with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Pseudo-code:
  C_k: candidate itemsets of size k
  L_k: frequent itemsets of size k
  L_1 = {frequent items};
  for (k = 1; L_k != ∅; k++) do begin
      C_{k+1} = candidates generated from L_k;
      for each transaction t in database do
          increment the count of all candidates in C_{k+1} that are contained in t
      L_{k+1} = candidates in C_{k+1} with min_support
  end
  return ∪_k L_k;

66 The Apriori Algorithm: Example
[Figure: Database D; scan D to get C1, then L1; C2 is generated from L1 and counted to give L2; C3 is generated from L2 and counted to give L3.]

67 How to Generate Candidates?
Suppose the items in L_{k-1} are listed in an order
Step 1: self-joining L_{k-1}
  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning
  forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
          if (s is not in L_{k-1}) then delete c from C_k
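
The join and prune steps, together with the level-wise loop, can be sketched in a few lines. This is an illustrative version, not a tuned implementation: itemsets are sorted tuples, support is an absolute count, every candidate is counted by a full pass over the data, and the four-transaction database is a hypothetical example.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Join L_{k-1} with itself on the first k-2 items, then prune."""
    candidates = set()
    for p in prev_frequent:
        for q in prev_frequent:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:            # join step
                c = p + (q[-1],)
                if all(s in prev_frequent for s in combinations(c, k - 1)):
                    candidates.add(c)                         # survived pruning
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {(item,) for t in transactions for item in t}
    frequent, k = {}, 1
    while candidates:
        counts = {c: sum(t.issuperset(c) for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        candidates = apriori_gen(set(level), k)
    return frequent


# Hypothetical database of four transactions; min support = 2 transactions (50%).
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, 2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

In this database the only 3-itemset candidate is (2, 3, 5): any candidate containing the infrequent pairs (1, 2) or (1, 5) is pruned before counting.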

68 Candidate generation (example)
[Figure: C2 is counted to give L2; L2 joined with L2 gives C3.]
Candidates containing {1,5} or {1,2} are dropped, since {1,5} and {1,2} do not have enough support.

69 Is Apriori Fast Enough? Performance Bottlenecks
The core of the Apriori algorithm:
–Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
–Use database scans and pattern matching to collect counts for the candidate itemsets
The bottleneck of Apriori: candidate generation
–Huge candidate sets: $10^4$ frequent 1-itemsets will generate $10^7$ candidate 2-itemsets. To discover a frequent pattern of size 100, e.g., $\{a_1, a_2, \ldots, a_{100}\}$, one needs to generate $2^{100} \approx 10^{30}$ candidates.
–Multiple scans of the database: needs (n + 1) scans, where n is the length of the longest pattern

70 Mining Frequent Patterns Without Candidate Generation Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure –highly condensed, but complete for frequent pattern mining –avoid costly database scans Develop an efficient, FP-tree-based frequent pattern mining method –A divide-and-conquer methodology: decompose mining tasks into smaller ones –Avoid candidate generation: sub-database test only!

71 Construct FP-tree from a Transaction DB
min_support = 0.5

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Steps:
1. Scan DB once, find the frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency descending order
3. Scan DB again, construct the FP-tree

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

Resulting FP-tree (item:count, children indented):
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
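
The two scans can be sketched as below. The `Node` class and `build_fp_tree` are hypothetical names; header-table node links are omitted for brevity, and the item order is passed in explicitly (it must list every frequent item) so that ties between equally frequent items are broken the same way as on the slide.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, and its children by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count, order):
    # Scan 1: count items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, n in counts.items() if n >= min_count}
    rank = {item: r for r, item in enumerate(order)}

    # Scan 2: insert each transaction's ordered frequent items as a path.
    root = Node(None)
    for t in transactions:
        path = sorted((i for i in set(t) if i in frequent), key=rank.__getitem__)
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root, counts


db = [
    list("facdgimp"),
    list("abcflmo"),
    list("bfhjo"),
    list("bcksp"),
    list("afcelpmn"),
]
# min_support = 0.5 of 5 transactions, i.e. at least 3 occurrences.
root, counts = build_fp_tree(db, 3, order=list("fcabmp"))

f = root.children["f"]
print(f.count, f.children["c"].count, root.children["c"].count)
```

Transactions that share a prefix of frequent items share a path, which is where the compression comes from.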

73 Chapter 6: Mining Association Rules in Large Databases Association rule mining Mining single-dimensional Boolean association rules from transactional databases Mining multilevel association rules from transactional databases Mining multidimensional association rules from transactional databases and data warehouse From association mining to correlation analysis Constraint-based association mining Summary

75 Interestingness Measurements
Objective measures: two popular measurements are (1) support and (2) confidence.
Subjective measures (Silberschatz & Tuzhilin, KDD95): a rule (pattern) is interesting if (1) it is unexpected (surprising to the user) and/or (2) actionable (the user can do something with it).

76 Criticism to Support and Confidence
Example 1: (Aggarwal & Yu, PODS98)
–Among 5000 students: 3000 play basketball, 3750 eat cereal, 2000 both play basketball and eat cereal
–play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
–play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

77 Criticism to Support and Confidence (Cont.)
We need a measure of dependent or correlated events:
$$\mathrm{Corr}_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)}$$
If Corr < 1, A is negatively correlated with B (discourages B).
If Corr > 1, A and B are positively correlated.
$P(A \wedge B) = P(A)P(B)$ if the itemsets are independent (Corr = 1).
P(B|A)/P(B) is also called the lift of rule A => B (we want positive lift!)
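
Applying this measure to the basketball and cereal counts from the previous slide gives a quick sanity check. The function name `lift` and its count-based signature are choices made for this sketch.

```python
def lift(n_total, n_a, n_b, n_both):
    """P(A and B) / (P(A) * P(B)), computed from raw counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both / (p_a * p_b)


# 5000 students: 3000 play basketball, 3750 eat cereal, 2000 do both.
basketball_cereal = lift(5000, 3000, 3750, 2000)
# 1250 students do not eat cereal; 1000 of them play basketball.
basketball_no_cereal = lift(5000, 3000, 1250, 1000)

print(round(basketball_cereal, 3), round(basketball_no_cereal, 3))
```

The first value is below 1 and the second above 1, which matches the slide's point that "play basketball => eat cereal" is misleading despite its higher confidence.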

79 Why Is the Big Pie Still There?
–More on constraint-based mining of associations
–Boolean vs. quantitative associations: association on discrete vs. continuous data
–From association to correlation and causal structure analysis: association does not necessarily imply correlation or causal relationships
–From intra-transaction associations to inter-transaction associations: e.g., break the barriers of transactions (Lu, et al. TOIS'99)
–From association analysis to classification and clustering analysis: e.g., clustering association rules

80 Summary Association rule mining –probably the most significant contribution from the database community in KDD –A large number of papers have been published Many interesting issues have been explored An interesting research direction –Association analysis in other types of data: spatial data, multimedia data, time series data, etc.

81 Some Products and Free Software Available for Association Rule Mining
Business Miner: http://www.businessobjects.com
Clementine: http://www.isl.co.uk/clem.html
Darwin: http://www.oracle.com/ip/analyze/warehouse/datamining/
Data Surveyor: http://www.ddi.nl/
DBMiner: http://db.cs.sfu.ca/DBMiner
Delta Miner: http://www.bissantz.de
Decision Series: http://www.neovista.com
IDIS: http://wwwdatamining.com
Intelligent Miner: http://www.software.ibm.com/data/intelli-mine
MineSet: http://www.sgi.com/software/mineset/
MLC++: http://www.sgi.com/Technology/mlc/
MSBN: http://www.research.microsoft.com/research./dtg/msbn
SuperQuery: http://www.azmy.com
Weka: http://www.cs.waikato.ac.nz/ml/weka
Apriori: http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/apriori.html

82 K-means clustering

83 Birch uses summary information – bonus question

84 STUDY QUESTIONS
Some sample questions on the data mining part. You may practice by yourself. No need to hand in.
1. Given the transaction table:

TID | List of items
T1  | 1, 2, 5
T2  | 2, 4
T3  | 2, 3
T4  | 1, 2, 4
T5  | 1, 3
T6  | 2, 3
T7  | 1, 3
T8  | 1, 2, 3, 5
T9  | 1, 2, 3

1) If min_sup = 2/9, apply the Apriori algorithm to get all the frequent itemsets; show the steps.
2) If min_con = 50%, show all the association rules generated from L3 (the large itemsets that contain 3 items).

85 STUDY QUESTIONS
2. Assume we have the following association rules with min_sup = s and min_con = c:
A=>B (s1, c1)
B=>C (s2, c2)
C=>A (s3, c3)
Show the probabilities P(A), P(B), P(C), P(AB), P(BC), P(AC), P(B|A), P(C|B), P(C|A).
Show the conditions under which we can derive A=>C.

86 STUDY QUESTIONS
3. Given the following table, apply the SPRINT algorithm to build a decision tree. (The measure is gini.)

87 STUDY QUESTIONS
4. Apply k-means to cluster the following 8 points into 3 clusters. The distance function is Euclidean distance. Assume that initially we assign A1, B1, and C1 as the centers of the three clusters, respectively. The 8 points are: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
Show:
- the three cluster centers after the first round of execution;
- the final three clusters.

