Slide 1: KDD and Data Mining
Database Processing: Fundamentals, Design, and Implementation, 9/e
Instructor: Dragomir R. Radev, Winter 2005
Slide 2: The big problem
- Billions of records
- A small number of interesting patterns
- “Data rich but information poor”
Slide 3: Data mining
Also known as:
- Knowledge discovery
- Knowledge extraction
- Data/pattern analysis
Slide 4: Types of source data
- Relational databases
- Transactional databases
- Web logs
- Textual databases
Slide 5: Association rules
- Example: 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings
- Association rules have the form X => Y
Slide 6: Association analysis
- IF 20 < age < 30 AND 20K < income < 30K THEN buys("CD player")
- support = 2%, confidence = 60%
Slide 7: Basic concepts
- Minimum support threshold
- Minimum confidence threshold
- Itemsets
- Occurrence frequency of an itemset
Slide 8: Association rule mining
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
Slide 9: Support and confidence
- support(X) = fraction of transactions that contain X
- confidence(X => Y) = support(X ∪ Y) / support(X)
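A minimal Python sketch of these two definitions (transactions represented as frozensets; the function names are mine, not from the slides):

```python
from typing import FrozenSet, List

def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x: FrozenSet[str], y: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule X => Y: support(X union Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)
```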
Slide 10: Example

TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Slide 11: Example (cont’d)
The frequent itemset l = {I1, I2, I5} generates six candidate rules:
- I1 AND I2 => I5, confidence = 2/4 = 50%
- I1 AND I5 => I2, confidence = 2/2 = 100%
- I2 AND I5 => I1, confidence = 2/2 = 100%
- I1 => I2 AND I5, confidence = 2/6 = 33%
- I2 => I1 AND I5, confidence = 2/7 = 29%
- I5 => I1 AND I2, confidence = 2/2 = 100%
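These confidences can be checked mechanically. A short sketch, reusing the support/confidence helpers from slide 9, recomputes each rule from the transaction table:

```python
# The nine transactions from slide 10.
transactions = [frozenset(t) for t in (
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
)]

# All six rules generated from the frequent itemset {I1, I2, I5}.
rules = [({"I1", "I2"}, {"I5"}), ({"I1", "I5"}, {"I2"}), ({"I2", "I5"}, {"I1"}),
         ({"I1"}, {"I2", "I5"}), ({"I2"}, {"I1", "I5"}), ({"I5"}, {"I1", "I2"})]
for x, y in rules:
    c = confidence(frozenset(x), frozenset(y), transactions)
    print(f"{sorted(x)} => {sorted(y)}: confidence = {c:.0%}")
```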
Slide 12: Example 2

TID    Date      Items
T100   10/15/99  {K, A, D, B}
T200   10/15/99  {D, A, C, E, B}
T300   10/19/99  {C, A, B, E}
T400   10/22/99  {B, A, D}

min_sup = 60%, min_conf = 80%
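At this scale the exercise can be worked by brute force. The sketch below (a naive enumeration, not the Apriori-style candidate pruning a real miner would use) lists every itemset meeting min_sup = 60%:

```python
from itertools import combinations

# The four transactions from slide 12.
transactions = [frozenset(t) for t in (
    {"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"}, {"B", "A", "D"},
)]
min_sup = 0.60

items = sorted(set().union(*transactions))
for k in range(1, len(items) + 1):
    for cand in map(frozenset, combinations(items, k)):
        sup = sum(cand <= t for t in transactions) / len(transactions)
        if sup >= min_sup:
            print(sorted(cand), f"support = {sup:.0%}")
```

Running it yields {A}, {B}, {D}, {A, B}, {A, D}, {B, D}, and {A, B, D} as the frequent itemsets.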
Slide 13: Correlations
- Corr(A, B) = P(A AND B) / (P(A) P(B))
- If Corr < 1, A discourages B (negative correlation)
- This quantity is also the lift of the association rule A => B
Slide 14: Contingency table

          game    ^game     Sum
video     4,000   3,500   7,500
^video    2,000     500   2,500
Sum       6,000   4,000  10,000
Slide 15: Example
- P({game}) = 0.60
- P({video}) = 0.75
- P({game, video}) = 0.40
- P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89
Slide 16: Example 2

              hotdogs   ^hotdogs     Sum
hamburgers      2,000        500   2,500
^hamburgers     1,000      1,500   2,500
Sum             3,000      2,000   5,000
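Both tables reduce to a single lift computation from raw counts. A small sketch (the function name is mine):

```python
def lift(n_both: int, n_a: int, n_b: int, n_total: int) -> float:
    """Corr(A, B) = P(A and B) / (P(A) * P(B)), from contingency-table counts."""
    return (n_both / n_total) / ((n_a / n_total) * (n_b / n_total))

# Slide 14: 4,000 of 10,000 buy both; 6,000 buy games; 7,500 buy videos.
print(lift(4000, 6000, 7500, 10000))  # ~0.89 < 1: negative correlation
# Slide 16: 2,000 of 5,000 buy both; 3,000 buy hotdogs; 2,500 buy hamburgers.
print(lift(2000, 3000, 2500, 5000))   # ~1.33 > 1: positive correlation
```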
Slide 17: Classification using decision trees
Expected information needed to classify a sample:
$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$
where s is the number of data samples, m is the number of classes, and $p_i = s_i / s$ is the fraction of samples in class i.
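As a sketch, the same formula in Python (class counts in, expected information out):

```python
from math import log2

def info(counts):
    """I(s1, ..., sm) = -sum(p_i * log2(p_i)) with p_i = s_i / s."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)
```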
Slide 18: Training data

RID   Age      Income   Student   Credit      Buys?
1     <=30     High     No        Fair        No
2     <=30     High     No        Excellent   No
3     31..40   High     No        Fair        Yes
4     >40      Medium   No        Fair        Yes
5     >40      Low      Yes       Fair        Yes
6     >40      Low      Yes       Excellent   No
7     31..40   Low      Yes       Excellent   Yes
8     <=30     Medium   No        Fair        No
9     <=30     Low      Yes       Fair        Yes
10    >40      Medium   Yes       Fair        Yes
11    <=30     Medium   Yes       Excellent   Yes
12    31..40   Medium   No        Excellent   Yes
13    31..40   High     Yes       Fair        Yes
14    >40      Medium   No        Excellent   No
Slide 19: Decision tree induction
$I(s_1, s_2) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
Slide 20: Entropy and information gain
$E(A) = \sum_{j} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$
- Entropy = expected information based on the partitioning into subsets by attribute A
- $\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
Slide 21: Entropy
- Age <= 30: s_11 = 2, s_21 = 3, I(s_11, s_21) = 0.971
- Age 31..40: s_12 = 4, s_22 = 0, I(s_12, s_22) = 0
- Age > 40: s_13 = 3, s_23 = 2, I(s_13, s_23) = 0.971
Slide 22: Entropy (cont’d)
- E(age) = 5/14 × I(s_11, s_21) + 4/14 × I(s_12, s_22) + 5/14 × I(s_13, s_23) = 0.694
- Gain(age) = I(s_1, s_2) − E(age) = 0.246
- Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
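These numbers are easy to reproduce with the info() helper sketched after slide 17:

```python
i_root = info([9, 5])  # 0.940: 9 "yes" and 5 "no" samples overall

# Partitions induced by age: <=30 (2 yes, 3 no), 31..40 (4, 0), >40 (3, 2).
partitions = [(2, 3), (4, 0), (3, 2)]
e_age = sum(sum(p) / 14 * info(p) for p in partitions)  # 0.694
print(f"Gain(age) = {i_root - e_age:.3f}")              # 0.246
```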
Slide 23: Final decision tree
[Figure: decision tree rooted at age.]
- age <= 30: split on student (no -> No, yes -> Yes)
- age 31..40: Yes
- age > 40: split on credit (excellent -> No, fair -> Yes)
Slide 24: Other techniques
- Bayesian classifiers
- Example: X = (age <= 30, income = medium, student = yes, credit = fair)
- Priors: P(yes) = 9/14 = 0.643, P(no) = 5/14 = 0.357
Slide 25: Example
- P(age <= 30 | yes) = 2/9 = 0.222
- P(age <= 30 | no) = 3/5 = 0.600
- P(income = medium | yes) = 4/9 = 0.444
- P(income = medium | no) = 2/5 = 0.400
- P(student = yes | yes) = 6/9 = 0.667
- P(student = yes | no) = 1/5 = 0.200
- P(credit = fair | yes) = 6/9 = 0.667
- P(credit = fair | no) = 2/5 = 0.400
Slide 26: Example (cont’d)
- P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
- P(X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
- P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
- P(X | no) P(no) = 0.019 × 0.357 = 0.007
- Answer: yes (0.028 > 0.007)
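A direct transcription of this computation, with the probabilities hard-coded from the previous slide:

```python
# Class-conditional likelihoods for X, assuming attribute independence.
p_x_yes = (2/9) * (4/9) * (6/9) * (6/9)   # ~0.044
p_x_no  = (3/5) * (2/5) * (1/5) * (2/5)   # ~0.019

# Multiply by the class priors and compare.
score_yes = p_x_yes * (9/14)   # ~0.028
score_no  = p_x_no  * (5/14)   # ~0.007
print("buys =", "yes" if score_yes > score_no else "no")   # yes
```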
Slide 27: Predictive models
- Inputs (e.g., medical history, age)
- Output (e.g., will the patient experience side effects?)
- Some models are better than others
Slide 28: Principles of data mining
- Training/test sets
- Error analysis and overfitting
- Cross-validation
- Supervised vs. unsupervised methods
[Figure: training error vs. test error as a function of input size]
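One of these ideas, cross-validation, is concrete enough for a quick sketch: split the data into k folds and let each fold serve once as the test set (a minimal version, ignoring shuffling and stratification):

```python
def k_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n))
    for i in range(k):
        test = indices[i::k]          # every k-th index, offset by i
        test_set = set(test)
        train = [j for j in indices if j not in test_set]
        yield train, test

for train, test in k_fold_splits(10, k=5):
    print(test)   # [0, 5], [1, 6], [2, 7], [3, 8], [4, 9]
```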
Slide 29: Representing data
- Vector space
[Figure: data points plotted in a two-dimensional vector space with axes salary and credit, labeled “pay off” or “default”]
Slide 30: Decision surfaces
[Figure: the same salary/credit scatter plot with a decision surface separating “pay off” from “default”]
Slide 31: Decision trees
[Figure: axis-parallel decision boundaries induced by a decision tree on the salary/credit plot]
Slide 32: Linear boundary
[Figure: a single linear decision boundary separating “pay off” from “default” on the salary/credit plot]
Slide 33: kNN models
- Assign each element to the class of its k nearest neighbors (majority vote)
- Demo: http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
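A minimal kNN classifier in the spirit of the salary/credit example from the previous slides (the data points are invented for illustration):

```python
from collections import Counter

def knn_predict(query, examples, k=3):
    """Majority vote among the k training examples nearest to `query`.
    `examples` is a list of ((salary, credit), label) pairs; Euclidean distance."""
    nearest = sorted(
        examples,
        key=lambda e: (e[0][0] - query[0]) ** 2 + (e[0][1] - query[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [((70, 700), "pay off"), ((80, 720), "pay off"), ((65, 680), "pay off"),
        ((30, 540), "default"), ((25, 600), "default")]
print(knn_predict((60, 650), data))   # "pay off"
```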
Slide 34: Other methods
- Decision trees
- Neural networks
- Support vector machines
- Demo: http://www.cs.technion.ac.il/~rani/LocBoost/
Slide 35: arff files

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
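For working with such files from Python, scipy ships an ARFF reader; a quick sketch, assuming the listing above is saved as weather.arff:

```python
from scipy.io import arff

data, meta = arff.loadarff("weather.arff")
print(meta.names())   # ['outlook', 'temperature', 'humidity', 'windy', 'play']
print(data[0])        # first instance: (b'sunny', 85., 85., b'FALSE', b'no')
```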
Slide 36: Weka
http://www.cs.waikato.ac.nz/ml/weka
Methods:
- rules.ZeroR
- bayes.NaiveBayes
- trees.j48.J48
- lazy.IBk
- trees.DecisionStump
Slide 37: kMeans clustering
http://www.cc.gatech.edu/~dellaert/html/software.html
java weka.clusterers.SimpleKMeans -t data/weather.arff
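The algorithm SimpleKMeans implements is short enough to sketch directly: alternate between assigning points to their nearest centroid and recomputing centroids (2-D points, Euclidean distance, fixed iteration count, no convergence test):

```python
import random

def kmeans(points, k, iters=20):
    """Plain k-means on a list of 2-D points."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
            clusters[j].append((x, y))
        # Update step: move each centroid to its cluster's mean.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c
            else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

print(kmeans([(1, 1), (1.5, 2), (8, 8), (9, 9), (8.5, 7)], k=2)[0])
```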
Slide 38: More useful pointers
- http://www.kdnuggets.com/
- http://www.twocrows.com/booklet.htm
Slide 39: More types of data mining
- Classification and prediction
- Cluster analysis
- Outlier analysis
- Evolution analysis