Classification supplemental
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT'96, Mehta et al.)
– builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB'96, J. Shafer et al.)
– constructs an attribute-list data structure
PUBLIC (VLDB'98, Rastogi & Shim)
– integrates tree splitting and tree pruning: stops growing the tree earlier
RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that determine the quality of the tree
– builds an AVC-list (attribute, value, class label)
SPRINT
For large data sets.
[Figure: a small example decision tree. The root tests Age < 25; one branch leads to leaf H, the other to a node testing Car = Sports whose branches lead to leaves H and L.]
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
    gini(T) = 1 - Σ_{j=1..n} (p_j)^2
where p_j is the relative frequency of class j in T.
If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split is defined as
    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible split points for each attribute).
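A minimal sketch of these two formulas in Python; gini and gini_split are illustrative names, not IntelligentMiner APIs:

from collections import Counter

def gini(labels):
    # gini(T) = 1 - sum_j (p_j)^2, where p_j is the relative
    # frequency of class j among the labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    # gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)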
SPRINT
Partition(S):
    if all points of S are in the same class:
        return
    else:
        for each attribute A:
            evaluate splits on A
        use the best split to partition S into S1 and S2
        Partition(S1)
        Partition(S2)
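A runnable sketch of this recursion, reusing the gini_split helper above. Representing records as dicts and considering only continuous "attr <= v" splits are simplifying assumptions for illustration, not details from the SPRINT paper:

def best_split(records, attr):
    # Evaluate every candidate split "attr <= thresh" and return the
    # (gini_split, threshold) pair with the smallest gini_split.
    values = sorted({r[attr] for r in records})
    best = (float('inf'), None)
    for lo, hi in zip(values, values[1:]):
        thresh = (lo + hi) / 2          # midpoint between consecutive values
        left  = [r['label'] for r in records if r[attr] <= thresh]
        right = [r['label'] for r in records if r[attr] >  thresh]
        best = min(best, (gini_split(left, right), thresh))
    return best

def partition(records, attrs):
    labels = [r['label'] for r in records]
    if len(set(labels)) <= 1:           # all points of S are in the same class
        return {'leaf': labels[0]}
    # evaluate splits on every attribute, keep the best one
    (score, thresh), attr = min(((best_split(records, a), a) for a in attrs),
                                key=lambda t: t[0][0])
    if thresh is None:                  # no attribute separates the records
        return {'leaf': max(set(labels), key=labels.count)}
    s1 = [r for r in records if r[attr] <= thresh]
    s2 = [r for r in records if r[attr] >  thresh]
    return {'split': (attr, thresh),
            'S1': partition(s1, attrs),
            'S2': partition(s2, attrs)}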
SPRINT Data Structures
[Figure: the training set (Age, Car, class label) and the attribute lists derived from it. SPRINT keeps one list per attribute; each entry holds the attribute value, the class label, and the record id, and lists for continuous attributes are kept sorted by value.]
Splits
[Figure: performing the split Age < 27.5. Each attribute list is partitioned into Group 1 (entries whose record falls on the Age < 27.5 side) and Group 2 (the rest).]
Histograms
For continuous attributes, two histograms are associated with each node:
– C_above: class distribution of the entries still to be processed
– C_below: class distribution of the entries already processed
As the scan moves down the sorted attribute list, counts move from C_above to C_below, so every candidate split point can be evaluated in a single pass.
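A sketch of this one-pass evaluation, assuming the attribute list is already sorted by value; evaluate_continuous_splits and the two Counter histograms are illustrative names:

from collections import Counter

def evaluate_continuous_splits(attr_list):
    # attr_list: (value, label) pairs sorted by value.
    # C_below starts empty, C_above starts with all class counts;
    # each step moves one record's count from C_above to C_below.
    c_below = Counter()
    c_above = Counter(label for _, label in attr_list)
    n = len(attr_list)
    results = []
    for i, (value, label) in enumerate(attr_list):
        c_below[label] += 1
        c_above[label] -= 1
        n1, n2 = i + 1, n - i - 1
        g1 = 1 - sum((c / n1) ** 2 for c in c_below.values())
        g2 = 1 - sum((c / n2) ** 2 for c in c_above.values() if c) if n2 else 0.0
        results.append((n1 / n * g1 + n2 / n * g2, value))
    return results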
Example
The six records are sorted by Age; split i puts the first i records in S1 and the rest in S2. The sorted class labels are H, H, H, L, H, L (4 H and 2 L overall).
gini_split0 = 0/6 gini(S1) + 6/6 gini(S2)
  gini(S2) = 1 - [(4/6)^2 + (2/6)^2] = 0.444
  gini_split0 = 0.444
gini_split1 = 1/6 gini(S1) + 5/6 gini(S2)
  gini(S1) = 1 - (1/1)^2 = 0
  gini(S2) = 1 - [(3/5)^2 + (2/5)^2] = 0.480
  gini_split1 = 0.400
gini_split2 = 2/6 gini(S1) + 4/6 gini(S2)
  gini(S1) = 1 - (2/2)^2 = 0
  gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
  gini_split2 = 0.333
gini_split3 = 3/6 gini(S1) + 3/6 gini(S2)
  gini(S1) = 1 - (3/3)^2 = 0
  gini(S2) = 1 - [(1/3)^2 + (2/3)^2] = 0.444
  gini_split3 = 0.222
gini_split4 = 4/6 gini(S1) + 2/6 gini(S2)
  gini(S1) = 1 - [(3/4)^2 + (1/4)^2] = 0.375
  gini(S2) = 1 - [(1/2)^2 + (1/2)^2] = 0.5
  gini_split4 = 0.417
gini_split5 = 5/6 gini(S1) + 1/6 gini(S2)
  gini(S1) = 1 - [(4/5)^2 + (1/5)^2] = 0.320
  gini(S2) = 1 - (1/1)^2 = 0
  gini_split5 = 0.267
gini_split6 = 6/6 gini(S1) + 0/6 gini(S2)
  gini(S1) = 1 - [(4/6)^2 + (2/6)^2] = 0.444
  gini_split6 = 0.444
The smallest value is gini_split3 = 0.222, so the best split point for this attribute is Age <= 18.5.
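These numbers can be reproduced with the one-pass routine from the Histograms slide. The ages below are invented for illustration (the slide does not give them); only their order and the label sequence H, H, H, L, H, L matter:

# Hypothetical ages consistent with a best split point of 18.5.
attr_list = [(15, 'H'), (17, 'H'), (18, 'H'), (19, 'L'), (23, 'H'), (30, 'L')]
for g, value in evaluate_continuous_splits(attr_list):
    print(f"split after Age = {value}: gini_split = {g:.3f}")
# Prints 0.400, 0.333, 0.222, 0.417, 0.267, 0.444; the minimum is after
# Age = 18, i.e. the midpoint split Age <= 18.5.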
Splitting categorical attributes
A single scan through the attribute list collects counts in a count matrix, one cell per combination of class label and attribute value; as the sketch below shows, the gini index of any candidate split can then be computed from the matrix alone.
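A sketch of that scan and of evaluating one "value vs. rest" split from the matrix; count_matrix and gini_split_for_value are illustrative names:

from collections import Counter, defaultdict

def count_matrix(attr_list):
    # One scan: matrix[value][label] = number of records with that
    # attribute value and class label.
    matrix = defaultdict(Counter)
    for value, label in attr_list:
        matrix[value][label] += 1
    return matrix

def gini_split_for_value(matrix, value, n):
    # Evaluate "attribute == value" (S1) vs. "attribute != value" (S2)
    # purely from the count matrix, without rescanning the data.
    s1 = matrix[value]
    s2 = Counter()
    for v, counts in matrix.items():
        if v != value:
            s2.update(counts)
    def g(counts):
        m = sum(counts.values())
        return 1 - sum((c / m) ** 2 for c in counts.values()) if m else 0.0
    n1, n2 = sum(s1.values()), sum(s2.values())
    return n1 / n * g(s1) + n2 / n * g(s2)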
Example
gini_split(family) = 3/6 gini(S1) + 3/6 gini(S2)
  gini(S1) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
  gini(S2) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
  gini_split(family) = 0.444
gini_split(sports) = 2/6 gini(S1) + 4/6 gini(S2)
  gini(S1) = 1 - (2/2)^2 = 0
  gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
  gini_split(sports) = 0.333
gini_split(truck) = 1/6 gini(S1) + 5/6 gini(S2)
  gini(S1) = 1 - (1/1)^2 = 0
  gini(S2) = 1 - [(4/5)^2 + (1/5)^2] = 0.32
  gini_split(truck) = 0.267
The smallest value is gini_split(truck), so the best categorical split is Car Type = Truck.
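Running the count-matrix sketch on labels consistent with these numbers (family: 2 H + 1 L, sports: 2 H, truck: 1 L, as implied by the gini values above):

cars = [('family', 'H'), ('family', 'H'), ('family', 'L'),
        ('sports', 'H'), ('sports', 'H'), ('truck', 'L')]
m = count_matrix(cars)
for v in ('family', 'sports', 'truck'):
    print(v, round(gini_split_for_value(m, v, len(cars)), 3))
# family 0.444, sports 0.333, truck 0.267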
Example (2 attributes)
Comparing the best split on each attribute, Age <= 18.5 (gini_split = 0.222) beats Car Type = Truck (gini_split = 0.267), so the winner is Age <= 18.5.
[Figure: the resulting node tests Age <= 18.5; the Y branch leads to leaf H, the N branch to the remaining records.]
Example for Bayes Rules
The patient either has cancer or does not. Prior knowledge: over the entire population, 0.008 have cancer. The lab test result (+ or -) is imperfect:
– it returns a correct positive result in only 98% of the cases in which the cancer is actually present
– it returns a correct negative result in only 97% of the cases in which the cancer is not present
What should we conclude about a new patient for whom the lab test returns +?
Example for Bayes Rules
Pr(cancer) = 0.008        Pr(not cancer) = 0.992
Pr(+|cancer) = 0.98       Pr(-|cancer) = 0.02
Pr(+|not cancer) = 0.03   Pr(-|not cancer) = 0.97
Pr(+|cancer) Pr(cancer) = 0.98 * 0.008 = 0.0078
Pr(+|not cancer) Pr(not cancer) = 0.03 * 0.992 = 0.0298
Hence, Pr(cancer|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
Despite the positive test, the posterior probability of cancer is only about 21%.
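The same computation as a short Python function (posterior is an illustrative name):

def posterior(prior, p_pos_given_true, p_pos_given_false):
    # Bayes rule: Pr(cancer|+) = Pr(+|cancer) Pr(cancer) / Pr(+)
    num = p_pos_given_true * prior
    den = num + p_pos_given_false * (1 - prior)
    return num / den

print(posterior(0.008, 0.98, 0.03))   # ~0.21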