Lecture 10 (Big Data): Knowledge Induction Using Association Rules and Decision Trees (Understanding Customer Behavior Using Data Mining Skills)
Part I. Association Rule Analysis: Market Basket Analysis
What Is Association Mining?
Association rule mining:
–Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications:
–Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples:
–Rule form: Body → Head [support, confidence]
–buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
–major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]
Support and Confidence
Support
–Percentage of samples containing both A and B
–support(A → B) = P(A ∩ B)
Confidence
–Percentage of samples containing A that also contain B
–confidence(A → B) = P(B|A)
Example
–computer → financial_management_software [support = 2%, confidence = 60%]
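These two measures are easy to compute directly. Below is a minimal Python sketch, assuming transactions are represented as sets of item names; the tiny transaction list is invented for illustration:

def support(transactions, itemset):
    # Fraction of transactions that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # P(RHS | LHS) = support(LHS and RHS together) / support(LHS).
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

transactions = [
    {"computer", "financial_management_software"},
    {"computer", "printer"},
    {"computer"},
    {"printer"},
]
print(support(transactions, {"computer", "financial_management_software"}))      # 0.25
print(confidence(transactions, {"computer"}, {"financial_management_software"}))  # 0.333...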
Association Rules: Basic Concepts
Given: (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in one visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
–e.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications
–Home electronics: what other products should the store stock up on?
–Retailing: shelf design, promotion structuring, direct marketing
Rule Measures: Support and Confidence
Find all rules A → C with minimum confidence and support
–Support (s): probability that a transaction contains {A, C}
–Confidence (c): conditional probability that a transaction containing {A} also contains {C}
With minimum support 50% and minimum confidence 50%, we have
–A → C (50%, 66.6%)
–C → A (50%, 100%)
[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap who buy both]
Mining Association Rules: An Example
For rule A → C:
–support = support({A, C}) = 50%
–confidence = support({A, C})/support({A}) = 66.6%
Target: minimum support 50%, minimum confidence 50%
An Example of Market Basket (1)
There are 8 transactions on three items A, B, C. Check the associations for the two cases below: (1) A → B and (2) (A, B) → C.
#  Basket
1  A
2  B
3  C
4  A, B
5  A, C
6  B, C
7  A, B, C
8  A, B, C
An Example of Market Basket (2)
The basic probabilities are below:
            (1) A → B                       (2) (A, B) → C
LHS         P(A) = 5/8 = 0.625              P(A,B) = 3/8 = 0.375
RHS         P(B) = 5/8 = 0.625              P(C) = 5/8 = 0.625
Coverage    LHS = 0.625                     LHS = 0.375
Support     P(A∩B) = 3/8 = 0.375            P((A,B)∩C) = 2/8 = 0.25
Confidence  P(B|A) = 0.375/0.625 = 0.6      P(C|(A,B)) = 0.25/0.375 = 0.67
Lift        0.375/(0.625*0.625) = 0.96      0.25/(0.375*0.625) = 1.07
Leverage    0.375 - 0.391 = -0.016          0.25 - 0.234 = 0.016
Lift
What are good association rules? (How do we interpret them?)
–If lift is close to 1, there is no association between the two items (sets).
–If lift is greater than 1, there is a positive association between the two items (sets).
–If lift is less than 1, there is a negative association between the two items (sets).
Leverage
–Leverage = P(A∩B) - P(A)*P(B); it has three cases:
① Leverage > 0: the two items (sets) are positively associated
② Leverage = 0: the two items (sets) are independent
③ Leverage < 0: the two items (sets) are negatively associated
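A short Python check of the worked market basket example above (the same 8 baskets), computing support, confidence, lift, and leverage for both candidate rules:

baskets = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
           {"A", "B", "C"}, {"A", "B", "C"}]
n = len(baskets)

def p(items):
    # Fraction of baskets containing every item in `items`.
    return sum(items <= b for b in baskets) / n

for lhs, rhs in [({"A"}, {"B"}), ({"A", "B"}, {"C"})]:
    support = p(lhs | rhs)
    confidence = support / p(lhs)
    lift = support / (p(lhs) * p(rhs))
    leverage = support - p(lhs) * p(rhs)
    print(sorted(lhs), "->", sorted(rhs), round(support, 3),
          round(confidence, 3), round(lift, 2), round(leverage, 3))
# ['A'] -> ['B']        0.375  0.6    0.96  -0.016
# ['A', 'B'] -> ['C']   0.25   0.667  1.07   0.016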
12
Lab on Association Rules (1)
SPSS Clementine and SAS Enterprise Miner include association rule software. This exercise uses Magnum Opus. Go to http://www.rulequest.com and download the Magnum Opus evaluation version.
After you install the program, you will see the initial screen. From the menu, choose File – Import Data (Ctrl-O).
Demo data sets are already there. Magnum Opus accepts two types of data sets: transaction data (*.idi, *.itl) and attribute-value data (*.data, *.nam). The transaction data format comes in the two forms below.
idi (identifier-item file):
001, apples
001, oranges
001, bananas
002, apples
002, carrots
002, lettuce
002, tomatoes
itl (item list file):
apples, oranges, bananas
apples, carrots, lettuce, tomatoes
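The two forms carry the same information, so converting between them is mechanical. A small Python sketch, assuming the .idi file is comma-separated with the transaction identifier first, as in the example above:

from collections import OrderedDict

def idi_to_itl(idi_path, itl_path):
    baskets = OrderedDict()  # transaction id -> items, in file order
    with open(idi_path) as src:
        for line in src:
            if not line.strip():
                continue
            tid, item = (part.strip() for part in line.split(",", 1))
            baskets.setdefault(tid, []).append(item)
    with open(itl_path, "w") as dst:
        for items in baskets.values():
            dst.write(", ".join(items) + "\n")

idi_to_itl("tutorial.idi", "tutorial.itl")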
If you open tutorial.idi in Notepad, you can see the inside of the file. The example contains 5 transactions (baskets).
Choose File – Import Data. Select tutorial.idi, check Identifier-item file, and click Next >.
Click Yes, then click Next > through the following screens.
When asked what percentage of the whole file you want to use, type 50% and click Next >.
Click Import Data. You will then see the main analysis screen.
Leave the settings as they are:
–Search by: LIFT
–Minimum lift: 1
–Maximum no. of rules: 10
Click GO.
Results are saved in the tutorial.out file. Below is one of the derived rules:
lettuce & carrots are associated with tomatoes with strength = 0.857
–coverage = 0.042: 21 cases satisfy the LHS
–support = 0.036: 18 cases satisfy both the LHS and the RHS
–lift = 3.51: the strength is 3.51 times greater than the strength if there were no association
–leverage = 0.0258: the support is 0.0258 (12.9 cases) greater than if there were no association
lettuce & carrots → tomatoes
–When lettuce and carrots are purchased, the customer also buys tomatoes
coverage = 0.042: 21 cases satisfy the LHS
–P(lettuce & carrots) = 21/500 = 0.042
support = 0.036: 18 cases satisfy both the LHS and the RHS
–P((lettuce & carrots) ∩ tomatoes) = 18/500 = 0.036
strength (confidence) = 0.857
–P(RHS|LHS) = 18/21 = 0.036/0.042 = 0.857
lift = 3.51: the strength is 3.51 times greater than the strength if there were no association
–That is, (18/21)/(122/500) = 3.51
leverage = 0.0258: the support is 0.0258 (12.9 cases) greater than if there were no association
–P(LHS ∩ RHS) - P(LHS)*P(RHS) = 0.036 - 0.042*0.244 = 0.0258
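The same arithmetic in Python, using the counts reported above (500 transactions; 21 LHS cases, 18 joint cases, 122 tomato cases):

n, lhs_count, both_count, rhs_count = 500, 21, 18, 122

coverage = lhs_count / n                         # 0.042
support = both_count / n                         # 0.036
strength = both_count / lhs_count                # 0.857 (confidence)
lift = strength / (rhs_count / n)                # 3.51
leverage = support - coverage * (rhs_count / n)  # 0.0258, i.e., about 12.9 cases
print(coverage, support, round(strength, 3), round(lift, 2), round(leverage, 4))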
Part II: Knowledge Induction from Data
Poll: Which data mining technique..?
Classification Process Step 1: Model Construction
Training Data → Classification Algorithms → Classifier (Model)
Example of a learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process Step 2: Use the Model in Prediction
Testing Data → Classifier → prediction for unseen data, e.g., (Jeff, Professor, 4) → Tenured?
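As a toy illustration of the two steps, the rule learned in Step 1 can be written as a function and then applied to the unseen record (Jeff, Professor, 4):

def tenured(rank, years):
    # The rule induced in Step 1: IF rank = 'professor' OR years > 6 THEN 'yes'.
    return "yes" if rank == "professor" or years > 6 else "no"

print(tenured("professor", 4))  # Jeff -> 'yes' (his rank matches the rule)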
Classification by Decision Tree Induction
Decision tree
–A flow-chart-like tree structure
–Internal nodes denote tests on attributes
–Branches represent outcomes of the tests
–Leaf nodes represent class labels or class distributions
Decision tree generation consists of tree construction:
–At the start, all the training examples are at the root
–Examples are recursively partitioned based on selected attributes (see the sketch below)
Use of a decision tree: classifying an unknown sample
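A bare-bones Python sketch of that recursive partitioning, assuming rows are tuples of categorical attribute values indexed by position; how the split attribute is chosen is the topic of the attribute-selection slide below:

def build_tree(rows, labels, attrs):
    # Stop when the node is pure or no attributes remain; emit a majority-class leaf.
    if len(set(labels)) == 1 or not attrs:
        return max(set(labels), key=labels.count)
    attr = attrs[0]  # placeholder choice; ID3 would pick the highest-gain attribute here
    branches = {}
    for value in set(row[attr] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx], attrs[1:])
    return (attr, branches)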
Training Dataset
This follows an example from Quinlan's ID3.
Output: A Decision Tree for Credit Approval
[Decision tree: age? at the root with branches <=30, 31...40, and >40; the <=30 branch tests student? (no → no, yes → yes); the 31...40 branch leads directly to yes; the >40 branch tests credit rating? (excellent → yes, fair → no)]
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
–One rule is created for each path from the root to a leaf
–Each attribute-value pair along a path forms a conjunction
–The leaf node holds the class prediction
–Rules are easier for humans to understand
Example
–IF age = "<=30" AND student = "no" THEN buys_computer = "no"
–IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
–IF age = "31...40" THEN buys_computer = "yes"
–IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
–IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
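The path-to-rule idea is easy to mechanize. A Python sketch, with the tree above written as a nested dict (one attribute test per internal node):

tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}},
}}

def extract_rules(node, conditions=()):
    if isinstance(node, str):  # leaf: it holds the class prediction
        print("IF " + " AND ".join(conditions) + ' THEN buys_computer = "' + node + '"')
        return
    (attr, branches), = node.items()  # the attribute tested at this node
    for value, child in branches.items():
        extract_rules(child, conditions + (attr + ' = "' + value + '"',))

extract_rules(tree)  # prints the five IF-THEN rules listed above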
32
Decision Trees: Attribute Selection Methods
Information gain (ID3/C4.5)
–ID3: all attributes are assumed to be categorical
–C4.5: modified ID3 that also handles continuous attributes
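A minimal sketch of the information-gain criterion itself: the entropy of the class labels, minus the weighted entropy after splitting on one attribute. The tiny weather-style rows are invented for illustration:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    # Gain = entropy before the split minus the weighted entropy after it.
    return entropy(labels) - sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values())

rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, 0, labels))  # 1.0 bit: this split is perfect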
33
Concept of a Decision Tree
–A tree-like graph used for classification
–Built through recursive partitioning; it consists of a root node, internal nodes, links, and leaves
An Example of ‘Car Buyers’
No  Job  M/F  Area  Age  Y/N
1   NJ   M    N     35   N
2   NJ   F    N     51   N
3   OW   F    N     31   Y
4   EM   M    N     38   Y
5   EM   F    S     33   Y
6   EM   M    S     54   Y
7   OW   F    S     49   Y
8   NJ   F    N     32   N
9   NJ   M    N     32   Y
10  EM   M    S     35   Y
11  NJ   F    S     54   Y
12  OW   M    N     50   Y
13  OW   F    S     36   Y
14  EM   M    N     49   N
Induced tree: Job (14,5,9)
–Employee (5,2,3) → Age: Below 43 (3,0,3) → Y; Above 43 (2,2,0) → N
–Owner (4,0,4) → Y
–No Job (5,3,2) → Res. Area: South (2,0,2) → Y; North (3,3,0) → N
* (a,b,c) means a: total # of records, b: 'N' count, c: 'Y' count
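The induced tree reads directly as nested conditionals. A sketch, assuming the job codes in the table mean NJ = no job, EM = employee, OW = owner:

def will_buy(job, sex, area, age):
    if job == "EM":                     # Employee (5,2,3): split on Age
        return "Y" if age < 43 else "N"
    if job == "OW":                     # Owner (4,0,4): every owner bought
        return "Y"
    return "Y" if area == "S" else "N"  # No job (5,3,2): split on Res. Area

print(will_buy("NJ", "M", "N", 35))  # record 1 -> 'N', matching the table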
35
Lab on Decision Tree (1)
SPSS Clementine and SAS Enterprise Miner also offer decision trees. This exercise uses See5/C5.0. Download the See5/C5.0 2.02 evaluation version from http://www.rulequest.com.
Lab on Decision Tree (2)
From the initial screen, choose File – Locate Data.
Lab on Decision Tree (3)
Select housing.data from the Samples folder and click Open.
Lab on Decision Tree (4)
This data set is about predicting house prices in the Boston area. It has 350 cases and 13 variables.
Lab on Decision Tree (5)
Input variables
–crime rate
–proportion large lots: share of residential land in large lots
–proportion industrial: share of non-retail business acres
–CHAS: dummy variable
–nitric oxides ppm: pollution level in ppm
–av rooms per dwelling: average number of rooms per dwelling
–proportion pre-1940: share of units built before 1940
–distance to employment centers
–accessibility to radial highways
–property tax rate per $10,000
–pupil-teacher ratio
–B: racial statistics
–percentage low income earners
Decision variable
–Top 20% vs. bottom 80% of prices (a rough Python equivalent of the lab is sketched below)
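For comparison, a rough Python equivalent of this lab, assuming housing.data is comma-separated with the price in the last column and no header (the See5 sample's exact layout may differ):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

data = np.loadtxt("housing.data", delimiter=",")
X, price = data[:, :-1], data[:, -1]
y = price >= np.percentile(price, 80)  # decision variable: top 20% vs. bottom 80%

model = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(model))  # text rules, comparable to See5's ruleset view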
Lab on Decision Tree (6)
To run the analysis, click the Construct Classifier button, or choose Construct Classifier from the File menu.
Lab on Decision Tree (7)
Check Global pruning, then click OK.
Lab on Decision Tree (8)
The output shows the decision tree, its evaluation on the training data, and its evaluation on the test data.
Lab on Decision Tree (9)
Interpreting the output
–We can see that (av rooms per dwelling) is the most important variable in deciding house price.
Lab on Decision Tree (11)
The rules are hard to read from the decision tree diagram. To view them as rules, close the current screen and click Construct Classifier again, or choose Construct Classifier from the File menu.
Lab on Decision Tree (12)
Check Rulesets, then click OK.
Lab on Decision Tree (13)