1
Data Mining - Knowledge Presentation: The ID3 Algorithm. Prof. Sin-Min Lee, Department of Computer Science
11
Data Mining Tasks
Predicting onto new data by using rules or patterns and behaviors
– Classification
– Estimation
Understanding the groupings, trends, and characteristics of your customers
– Segmentation
Visualizing the Euclidean spatial relationships, trends, and patterns of your data
– Description
12
Stages of the Data Mining Process
1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction: obtaining only the interesting attributes of the data, e.g., "date acquired" is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as "data mining" and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
14
Clusters of Galaxies. Skycat clustered 2×10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum. The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe.
15
Clustering: Examples - Cholera outbreak in London
17
Decision trees are an alternative way of structuring rule information.
18
Decision tree for the weather example: the root tests outlook; the sunny branch tests humidity (normal → P, high → N); the overcast branch is P; the rain branch tests windy (false → P, true → N).
19
A classification rule based on the tree:
if outlook = overcast, or outlook = sunny & humidity = normal, or outlook = rain & windy = false, then P.
Equivalently, as three rules:
if outlook = overcast then P
if outlook = sunny & humidity = normal then P
if outlook = rain & windy = false then P
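As a minimal illustration (not from the slides; the attribute values and labels follow the rule above), the same rule written as a Python function:

```python
def classify(outlook, humidity, windy):
    """Return 'P' (play) or 'N' (don't play) using the rule read off the tree."""
    if outlook == "overcast":
        return "P"
    if outlook == "sunny" and humidity == "normal":
        return "P"
    if outlook == "rain" and not windy:
        return "P"
    return "N"

print(classify("sunny", "high", windy=False))   # -> N
print(classify("rain", "normal", windy=False))  # -> P
```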
20
Decision tree structure (Outlook: Sunny / Overcast / Rain; Humidity: High → No, Normal → Yes):
Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.
24
Top-Down Induction of Decision Trees (ID3)
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for the node.
3. For each value of A, create a new descendant.
4. Sort the training examples to the leaf nodes according to the attribute value of the branch.
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
26
Which attribute is "best"? Consider a sample S = [29+, 35-]. Splitting on A1 (true/false) gives [21+, 5-] and [8+, 30-]; splitting on A2 gives [18+, 33-] and [11+, 2-].
27
Entropy
S is a sample of training examples; p+ is the proportion of positive examples and p- is the proportion of negative examples. Entropy measures the impurity of S:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
28
Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code). Why? Information theory: an optimal-length code assigns -log2(p) bits to a message having probability p, so the expected number of bits to encode the class (+ or -) of a random member of S is -p+ log2(p+) - p- log2(p-).
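A minimal sketch of this two-class entropy computation (the helper name `entropy2` is my own):

```python
from math import log2

def entropy2(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                 # treat 0 * log2(0) as 0
            p = count / total
            result -= p * log2(p)
    return result

print(round(entropy2(29, 35), 2))  # 0.99, Entropy([29+, 35-]) used on the next slide
print(round(entropy2(9, 5), 2))    # 0.94, the Play Tennis root entropy used later
```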
29
Information Gain
Gain(S, A): the expected reduction in entropy due to sorting S on attribute A:
Gain(S, A) = Entropy(S) - Σ_{v ∈ values(A)} (|Sv| / |S|) · Entropy(Sv)
For the splits above, Entropy([29+, 35-]) = -(29/64) log2(29/64) - (35/64) log2(35/64) = 0.99.
30
For A1: Entropy([21+, 5-]) = 0.71 and Entropy([8+, 30-]) = 0.74, so
Gain(S, A1) = Entropy(S) - (26/64)·Entropy([21+, 5-]) - (38/64)·Entropy([8+, 30-]) = 0.27.
For A2: Entropy([18+, 33-]) = 0.94 and Entropy([11+, 2-]) = 0.62, so
Gain(S, A2) = Entropy(S) - (51/64)·Entropy([18+, 33-]) - (13/64)·Entropy([11+, 2-]) = 0.12.
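The same arithmetic as a small self-contained sketch (`entropy2` and `gain` are my own helper names; the split counts are the ones on the slide):

```python
from math import log2

def entropy2(pos, neg):
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c)

def gain(parent, splits):
    """Gain(S, A) = Entropy(S) - sum over children of |Sv|/|S| * Entropy(Sv)."""
    total = sum(parent)
    return entropy2(*parent) - sum(
        (p + n) / total * entropy2(p, n) for p, n in splits
    )

print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # A1 split: 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # A2 split: 0.12
```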
31
Training Examples

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
32
Selecting the Next Attribute. S = [9+, 5-], E = 0.940.
Humidity: High → [3+, 4-] (E = 0.985), Normal → [6+, 1-] (E = 0.592).
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Wind: Weak → [6+, 2-] (E = 0.811), Strong → [3+, 3-] (E = 1.0).
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
33
Outlook: Sunny → [2+, 3-] (E = 0.971), Overcast → [4+, 0-] (E = 0.0), Rain → [3+, 2-] (E = 0.971).
Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247
Outlook has the highest gain, so it is selected as the root attribute.
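These gains can be reproduced with a short sketch over the training table above (a sketch, not from the slides; attributes are treated as purely categorical):

```python
from collections import Counter
from math import log2

# (Outlook, Temp, Humidity, Wind, PlayTennis) rows D1..D14 from the training table
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return sum(-c / n * log2(c / n) for c in counts.values())

def gain(rows, col):
    """Information gain of splitting `rows` on the attribute in column `col`."""
    remainder = 0.0
    for v in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for name, col in [("Outlook", 0), ("Temp", 1), ("Humidity", 2), ("Wind", 3)]:
    print(name, round(gain(data, col), 3))
# Outlook 0.247, Humidity 0.151, Wind 0.048 as on the slides (Temp comes out lowest, about 0.029)
```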
34
ID3 Algorithm (partially grown tree). The root tests Outlook on [D1, D2, …, D14] = [9+, 5-]:
Sunny: S_sunny = [D1, D2, D8, D9, D11] = [2+, 3-] (still to be split)
Overcast: [D3, D7, D12, D13] = [4+, 0-] → Yes
Rain: [D4, D5, D6, D10, D14] = [3+, 2-] (still to be split)
Gain(S_sunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(S_sunny, Temp.) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(S_sunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019
Humidity has the highest gain, so it is chosen for the Sunny branch.
35
The final tree:
Outlook = Overcast → Yes [D3, D7, D12, D13]
Outlook = Sunny → Humidity: High → No [D1, D2]; Normal → Yes [D8, D9, D11]
Outlook = Rain → Wind: Strong → No [D6, D14]; Weak → Yes [D4, D5, D10]
36
The ID3 Algorithm
Given a set of disjoint target classes {C1, C2, …, Ck} and a set of training data, S, containing objects of more than one class. Let T be any test on a single attribute of the data, with O1, O2, …, On representing the possible outcomes of applying T to any object x (written as T(x)). T produces a partition {S1, S2, …, Sn} of S such that Si = {x | T(x) = Oi}.
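A tiny sketch of the partition step (representing the test T as a plain Python function; the names are mine, not from the slides):

```python
def partition(S, T):
    """Split S into {outcome Oi: Si} with Si = {x | T(x) = Oi}."""
    parts = {}
    for x in S:
        parts.setdefault(T(x), []).append(x)
    return parts

# e.g. partitioning (outlook, play) pairs by the outlook test
S = [("sunny", "N"), ("overcast", "P"), ("rain", "P"), ("sunny", "P")]
print(partition(S, lambda x: x[0]))
# {'sunny': [('sunny', 'N'), ('sunny', 'P')], 'overcast': [...], 'rain': [...]}
```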
37
Proceed recursively to replace each Si with a decision tree. The crucial factor is selecting the tests. (Figure: the test on S with outcomes O1, O2, …, On yields subsets S1, S2, …, Sn.)
38
In making this decision, Quinlan employs the notion of uncertainty (entropy from information theory):
M = {m1, m2, …, mn}: set of messages
p(mi): probability of the message mi being received
I(mi) = -log p(mi): amount of information of message mi
U(M) = Σi p(mi) I(mi): uncertainty of the set M
Quinlan's assumptions: a correct decision tree for S will classify objects in the same proportion as their representation in S; given a case to classify, a test can be regarded as the source of a message about that case.
39
Let Ni be the number of cases in S that belong to class Ci: p(c ∈ Ci) = Ni / |S|. The uncertainty, U(S), measures the average amount of information needed to determine the class of a random case c ∈ S. After S has been partitioned by a test T, the uncertainty is U_T(S) = Σi (|Si| / |S|) U(Si). Select the test T that gains the most information, i.e., for which G_S(T) = U(S) - U_T(S) is maximal.
40
Evaluation of ID3
The ID3 algorithm tends to favor tests with a large number of outcomes over tests with a smaller number.
Its computational complexity depends on the cost of choosing the next test to branch on.
It was adapted to deal with noisy and incomplete data.
It is a feasible alternative to knowledge elicitation if sufficient data of the right kind are available.
However, this method is not incremental.
Further modifications were introduced in C4.5, e.g.: pruning the decision tree in order to avoid overfitting, and a better test-selection heuristic.
42
Search Space and Search Trees
The search space is a logical space composed of
– nodes, which are search states
– links, which are all legal connections between search states (e.g., in chess, there is no link to a state where White castles after having previously moved the king)
It is always just an abstraction; think of search algorithms as trying to navigate this extremely complex space.
43
Search Trees
Search trees do not summarise all possible searches
– instead, each is an abstraction of one possible search
The root is the null state; edges represent one choice
– e.g., to set the value of A first
Child nodes represent extensions
– children give all possible choices
Leaf nodes are solutions/failures
Example in SAT:
– the algorithm detects failure early
– it need not pick the same variables everywhere
50
Definition: a tree-shaped structure that represents a set of decisions. These decisions are used as a basis for predictions; they represent rules for classifying datasets. Useful knowledge can be extracted from this classification.
51
DT Structure: Node Types
– Decision nodes: specify some test to be carried out on a single attribute value. Each outcome is assigned to a branch that connects to a leaf node or to another decision node.
– Leaf nodes: indicate the classification of an example.
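One possible in-memory representation of these two node types (a sketch; the class and field names are my own, not from the slides):

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class LeafNode:
    label: str                  # the classification assigned at this leaf

@dataclass
class DecisionNode:
    attribute: str              # the attribute tested at this node
    branches: Dict[str, Union["DecisionNode", "LeafNode"]] = field(default_factory=dict)

def classify(node, example):
    """Follow branches from the root until a leaf is reached."""
    while isinstance(node, DecisionNode):
        node = node.branches[example[node.attribute]]
    return node.label

# The weather tree built earlier, written out by hand:
tree = DecisionNode("Outlook", {
    "Sunny": DecisionNode("Humidity", {"High": LeafNode("No"), "Normal": LeafNode("Yes")}),
    "Overcast": LeafNode("Yes"),
    "Rain": DecisionNode("Wind", {"Strong": LeafNode("No"), "Weak": LeafNode("Yes")}),
})
print(classify(tree, {"Outlook": "Rain", "Wind": "Weak"}))  # Yes
```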
52
An Example
53
Growing a Decision Tree
54
Iterative Dichotomiser 3 (ID3) Algorithm
Invented by J. Ross Quinlan in 1975.
Based on a greedy search algorithm.
55
ID3 (cont.)
The goal is to create the best possible tree that works on all available data.
– An example strictly belongs to one class or the other.
– We need to select the attribute that best classifies the examples (i.e., the attribute with the smallest entropy over the examples).
– The lower the entropy, the higher the information gain; we want high IG.
56
Entropy: a quantitative measurement of the homogeneity of a set of examples. It tells us how well an attribute separates the training examples according to their target classification.
57
Entropy (cont.)
Given a set S with only positive or negative examples (the 2-class case):
Entropy(S) = -Pp log2(Pp) - Pn log2(Pn)
where Pp = proportion of positive examples and Pn = proportion of negative examples.
58
Entropy (cont.)
Example: given 25 examples with 15 positive and 10 negative,
Entropy(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) = 0.97.
If Entropy(S) = 0, all members of S belong to strictly one class. If Entropy(S) = 1 (the maximum value), members are split equally between the two classes.
59
In General…
In general, if the classification takes more than two values (n classes),
Entropy(S) = Σ_{i=1}^{n} -p_i log2(p_i)
where n is the number of values and p_i is the proportion of S belonging to value i.
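A sketch of this general n-class formula, taking a list of per-class counts (the function name is mine):

```python
from math import log2

def entropy(counts):
    """Entropy of a sample described by per-class counts, e.g. [15, 10]."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c)

print(round(entropy([15, 10]), 2))   # 0.97, the two-class example above
print(round(entropy([5, 5, 5]), 2))  # 1.58 = log2(3): three equally likely classes
```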
60
Information Gain
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
where A is an attribute of S, Values(A) is the set of possible values of A, v is a particular value in Values(A), and Sv is the subset of S for which attribute A has value v.
61
Actual Algorithm
ID3(Examples, Target_Attribute, Attributes)
  Create a root node for the tree.
  If all examples are positive, return the single-node tree Root with label = +.
  If all examples are negative, return the single-node tree Root with label = -.
  If the set of predicting attributes is empty, return the single-node tree Root with label = the most common value of the target attribute in the examples.
  Otherwise begin:
    A = the attribute that best classifies the examples.
    Decision tree attribute for Root = A.
    For each possible value vi of A:
      Add a new tree branch below Root, corresponding to the test A = vi.
      Let Examples(vi) be the subset of examples that have the value vi for A.
      If Examples(vi) is empty, then below this new branch add a leaf node with label = the most common target value in the examples.
      Else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes - {A}).
  End
  Return Root
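A runnable sketch of this pseudocode (the dictionary-of-dictionaries tree encoding and the helper names are my own choices, not part of the original listing):

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return sum(-c / n * log2(c / n) for c in counts.values())

def info_gain(examples, attr, target):
    rem = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        rem += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - rem

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    most_common = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:          # all examples share one label
        return labels[0]
    if not attributes:                 # no predicting attributes left
        return most_common
    # A = the attribute that best classifies the examples
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    # Here the values vi are taken from the current examples, so the
    # "Examples(vi) is empty" case of the pseudocode cannot arise; it would be
    # needed if all possible attribute values were enumerated up front.
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree

# Example call on dictionaries such as {"Outlook": "Sunny", ..., "PlayTennis": "No"}:
# id3(rows, "PlayTennis", ["Outlook", "Temp", "Humidity", "Wind"])
# yields {'Outlook': {'Overcast': 'Yes', 'Sunny': {'Humidity': ...}, 'Rain': {'Wind': ...}}}
```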
63
Independent attributes: Outlook, Temperature, Humidity, Windy. Dependent attribute: Play.

Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        FALSE  Don't play
sunny     80           90        TRUE   Don't play
overcast  83           78        FALSE  Play
rain      70           95        FALSE  Play
rain      68           80        FALSE  Play
rain      65           70        TRUE   Don't play
overcast  64           65        TRUE   Play
sunny     72           95        FALSE  Don't play
sunny     69           70        FALSE  Play
rain      75           80        FALSE  Play
sunny     75           70        TRUE   Play
overcast  72           90        TRUE   Play
overcast  81           75        FALSE  Play
rain      71           80        TRUE   Don't play

We choose Play as our dependent attribute. Don't play = negative; Play = positive.
65
Entropy of each candidate split. The entropy column is the weighted contribution (|Sv|/|S|)·Entropy(Sv); the numeric attributes are split at 70.

Outlook    play  don't play  total  weighted entropy
sunny      2     3           5      0.34676807
overcast   4     0           4      0
rain       3     2           5      0.34676807
total                               0.69353614

Temp
>70        5     4           9      0.63712032
<=70       4     1           5      0.25783146
total                               0.89495179

Humidity
>70        6     4           10     0.69353614
<=70       3     1           4      0.23179375
total                               0.92532989

Windy
TRUE       3     3           6      0.42857143
FALSE      6     2           8      0.4635875
total                               0.89215893

Outlook has the smallest total entropy, so it becomes the root.
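The totals in this table can be reproduced with a short sketch (the split of the numeric attributes at 70 follows the table; the helper names are mine):

```python
from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Windy, Play) rows from the table above
rows = [
    ("sunny", 85, 85, False, "Don't play"), ("sunny", 80, 90, True, "Don't play"),
    ("overcast", 83, 78, False, "Play"),    ("rain", 70, 95, False, "Play"),
    ("rain", 68, 80, False, "Play"),        ("rain", 65, 70, True, "Don't play"),
    ("overcast", 64, 65, True, "Play"),     ("sunny", 72, 95, False, "Don't play"),
    ("sunny", 69, 70, False, "Play"),       ("rain", 75, 80, False, "Play"),
    ("sunny", 75, 70, True, "Play"),        ("overcast", 72, 90, True, "Play"),
    ("overcast", 81, 75, False, "Play"),    ("rain", 71, 80, True, "Don't play"),
]

def entropy(subset):
    counts = Counter(r[-1] for r in subset)
    n = len(subset)
    return sum(-c / n * log2(c / n) for c in counts.values())

def weighted_entropy(key):
    """The 'total' row of the table: sum of |Sv|/|S| * Entropy(Sv) over the groups."""
    groups = {}
    for r in rows:
        groups.setdefault(key(r), []).append(r)
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

print(round(weighted_entropy(lambda r: r[0]), 3))       # Outlook  0.694  (smallest)
print(round(weighted_entropy(lambda r: r[1] > 70), 3))  # Temp     0.895
print(round(weighted_entropy(lambda r: r[2] > 70), 3))  # Humidity 0.925
print(round(weighted_entropy(lambda r: r[3]), 3))       # Windy    0.892
```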
67
Subset where Outlook = sunny:

Outlook  Temperature  Humidity  Windy  Play
sunny    85           85        FALSE  Don't play
sunny    80           90        TRUE   Don't play
sunny    72           95        FALSE  Don't play
sunny    69           70        FALSE  Play
sunny    75           70        TRUE   Play
68
playdon’t playtotalEntropy Temp >701340.6490225 <=701010 total0.6490225 Humidity >700330 <=702020 total0 Windy TRUE1120.4 FALSE1230.5509775 total0.9509775
70
Subset where Outlook = rain:

Outlook  Temperature  Humidity  Windy  Play
rain     70           95        FALSE  Play
rain     68           80        FALSE  Play
rain     65           70        TRUE   Don't play
rain     75           80        FALSE  Play
rain     71           80        TRUE   Don't play
71
playdon’t playtotalEntropy Temp >701120.4 <=702130.5509775 total0.9509775 Humidity >701030 <=703140.6490225 total0.6490225 Windy TRUE0220 FALSE3030 total0
73
Subset where Outlook = overcast:

Outlook   Temperature  Humidity  Windy  Play
overcast  83           78        FALSE  Play
overcast  64           65        TRUE   Play
overcast  72           90        TRUE   Play
overcast  81           75        FALSE  Play
74
playdon’t playtotalEntropy Temp >702020 <=700220 total0 Humidity >703030 <=701010 total0 Windy TRUE2020 FALSE2020 total0
76
ID3 Summary
Step 1: Take all unused attributes and compute their entropy with respect to the training samples.
Step 2: Choose the attribute with the smallest entropy.
Step 3: Make a node containing that attribute.
77
Growth stops when:
Every attribute already exists along the path through the tree, or
the training examples associated with a leaf all have the same target attribute value.
78
References
An Overview of Data Mining Techniques – http://www.thearling.com/text/dmtechniques/dmtechniques.htm
Decision tree – http://en.wikipedia.org/wiki/Decision_tree
Decision Trees – http://dms.irb.hr/tutorial/tut_dtrees.php