Data Mining and Knowledge Presentation: The ID3 Algorithm. Prof. Sin-Min Lee, Department of Computer Science
Data Mining Tasks
Predicting on new data by using rules, patterns, and behaviors:
– Classification
– Estimation
Understanding the groupings, trends, and characteristics of your customers:
– Segmentation
Visualizing the Euclidean spatial relationships, trends, and patterns of your data:
– Description
Stages of the Data Mining Process
1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data, e.g., an impossible patient fever value.
3. Feature extraction: obtaining only the interesting attributes of the data, e.g., "date acquired" is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as "data mining" and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
Clusters of Galaxies
Skycat clustered 2 × 10⁹ sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum. The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe.
Clustering: Examples. Cholera outbreak in London.
Decision trees are an alternative way of structuring rule information.
[Decision tree figure: outlook at the root with branches sunny, overcast, and rain. The sunny branch tests humidity (normal → P, high → N), the overcast branch is a P leaf, and the rain branch tests windy (true → N, false → P).]
A classification rule based on the tree:
if outlook = overcast
   or (outlook = sunny & humidity = normal)
   or (outlook = rain & windy = false)
then P
Equivalently, as separate rules:
if outlook = overcast then P
if outlook = sunny & humidity = normal then P
if outlook = rain & windy = false then P
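As a concrete illustration, here is a minimal Python sketch of the rule above as an if/else classifier. The dictionary encoding, the function name, and the use of 'P'/'N' strings are choices made for this example, not anything prescribed by the slides.

```python
def classify(example):
    """Apply the classification rule read off the decision tree above.

    `example` is a dict with keys 'outlook', 'humidity', 'windy';
    returns 'P' (play) or 'N' (don't play).
    """
    if example['outlook'] == 'overcast':
        return 'P'
    if example['outlook'] == 'sunny' and example['humidity'] == 'normal':
        return 'P'
    if example['outlook'] == 'rain' and not example['windy']:
        return 'P'
    return 'N'

print(classify({'outlook': 'sunny', 'humidity': 'normal', 'windy': False}))  # P
print(classify({'outlook': 'rain', 'humidity': 'high', 'windy': True}))      # N
```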
[Tree fragment: Outlook (Sunny / Overcast / Rain) at the root, Humidity (High / Normal) tested below it, with No/Yes leaves.]
– Each internal node tests an attribute.
– Each branch corresponds to an attribute value.
– Each leaf node assigns a classification.
Top-Down Induction of Decision Trees (ID3)
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for the node.
3. For each value of A, create a new descendant.
4. Sort the training examples to the leaf nodes according to the attribute value of the branch.
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
Which Attribute is "best"?
Two candidate splits of the same sample [29+, 35−]:
– A1 = ? : True → [21+, 5−], False → [8+, 30−]
– A2 = ? : True → [18+, 33−], False → [11+, 2−]
Entropy
S is a sample of training examples; p+ is the proportion of positive examples and p− is the proportion of negative examples. Entropy measures the impurity of S:
Entropy(S) = −p+ log2 p+ − p− log2 p−
Entropy(S) = the expected number of bits needed to encode the class (+ or −) of a randomly drawn member of S (under the optimal, shortest-length code).
Why? Information theory: the optimal-length code assigns −log2 p bits to a message having probability p. So the expected number of bits to encode the class (+ or −) of a random member of S is:
−p+ log2 p+ − p− log2 p−
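To make the definition concrete, here is a small sketch that computes the two-class entropy directly from the proportions. The function and parameter names are illustrative, and the usual convention that 0 · log2 0 = 0 is assumed.

```python
import math

def entropy(p_pos, p_neg):
    """Two-class entropy from the class proportions (0 * log2 0 taken as 0)."""
    total = 0.0
    for p in (p_pos, p_neg):
        if p > 0:
            total -= p * math.log2(p)
    return total

# A pure sample carries no information; a 50/50 split needs a full bit.
print(entropy(1.0, 0.0))    # 0.0
print(entropy(0.5, 0.5))    # 1.0
print(entropy(9/14, 5/14))  # ~0.940, the PlayTennis sample used later
```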
Information Gain
Gain(S, A): the expected reduction in entropy due to sorting S on attribute A:
Gain(S, A) = Entropy(S) − Σ v∈values(A) (|Sv| / |S|) · Entropy(Sv)
For the two candidate splits of [29+, 35−] shown earlier (A1 and A2):
Entropy([29+, 35−]) = −29/64 log2 29/64 − 35/64 log2 35/64 = 0.99
Split on A1: True → [21+, 5−], False → [8+, 30−]
Entropy([21+, 5−]) = 0.71, Entropy([8+, 30−]) = 0.74
Gain(S, A1) = Entropy(S) − (26/64)·Entropy([21+, 5−]) − (38/64)·Entropy([8+, 30−]) = 0.27

Split on A2: True → [18+, 33−], False → [11+, 2−]
Entropy([18+, 33−]) = 0.94, Entropy([11+, 2−]) = 0.62
Gain(S, A2) = Entropy(S) − (51/64)·Entropy([18+, 33−]) − (13/64)·Entropy([11+, 2−]) = 0.12
So A1 is the better split.
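The arithmetic above is easy to verify mechanically. The sketch below recomputes both gains from the raw positive/negative counts; the helper names entropy and gain are ad hoc for this example.

```python
import math

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def gain(parent, subsets):
    """Information gain: parent entropy minus the size-weighted child entropies.

    `parent` and each subset are (pos, neg) pairs of example counts.
    """
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in subsets)

print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # ~0.27 for A1
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # ~0.12 for A2
```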
Training Examples

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
Selecting the Next Attribute
Humidity on S = [9+, 5−], E = 0.940: High → [3+, 4−] (E = 0.985), Normal → [6+, 1−] (E = 0.592)
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Wind on S = [9+, 5−], E = 0.940: Weak → [6+, 2−] (E = 0.811), Strong → [3+, 3−] (E = 1.0)
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
Selecting the Next Attribute (continued)
Outlook on S = [9+, 5−], E = 0.940: Sunny → [2+, 3−] (E = 0.971), Overcast → [4+, 0−] (E = 0.0), Rain → [3+, 2−] (E = 0.971)
Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
Outlook gives the largest gain, so it is chosen as the root attribute.
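The same calculation can be reproduced for every attribute of the training table shown earlier (D1 to D14). The sketch below hard-codes the 14 PlayTennis examples and recomputes the gains; the tuple encoding and function names are illustrative choices, not part of the original slides.

```python
import math
from collections import Counter, defaultdict

# The 14 PlayTennis examples as (outlook, temperature, humidity, wind, play).
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Information gain of splitting `rows` on the attribute at `attr_index`."""
    labels = [r[-1] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr_index]].append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

for name, idx in ATTRS.items():
    print(f"Gain(S, {name}) = {gain(DATA, idx):.3f}")
# Outlook 0.247, Humidity 0.152, Wind 0.048, Temperature 0.029 -> Outlook is chosen.
```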
ID3 Algorithm
Splitting [D1, …, D14] = [9+, 5−] on Outlook: Overcast → [D3, D7, D12, D13] = [4+, 0−] (leaf: Yes); Sunny → S_sunny = [D1, D2, D8, D9, D11] = [2+, 3−] (?); Rain → [D4, D5, D6, D10, D14] = [3+, 2−] (?).
Choosing the test for the Sunny branch:
Gain(S_sunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(S_sunny, Temp.) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(S_sunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Humidity gives the largest gain, so it is tested under the Sunny branch.
[Final decision tree: Outlook at the root. Sunny → Humidity: High → No [D1, D2, D8], Normal → Yes [D9, D11]. Overcast → Yes [D3, D7, D12, D13]. Rain → Wind: Strong → No [D6, D14], Weak → Yes [D4, D5, D10].]
The ID3 Algorithm
Given a set of disjoint target classes {C1, C2, …, Ck} and a set of training data S containing objects of more than one class. Let T be any test on a single attribute of the data, with O1, O2, …, On representing the possible outcomes of applying T to any object x (written as T(x)). T produces a partition {S1, S2, …, Sn} of S such that Si = { x | T(x) = Oi }.
Proceed recursively to replace each Si with a decision tree. Crucial factor: selecting the tests.
[Diagram: the outcomes O1, O2, …, On of test T partition S into subsets S1, S2, …, Sn.]
In making this decision, Quinlan employs the notion of uncertainty (entropy from information theory):
M = {m1, m2, …, mn}  the set of messages
p(mi)  the probability of message mi being received
I(mi) = −log p(mi)  the amount of information of message mi
U(M) = Σi p(mi) · I(mi)  the uncertainty of the set M
Quinlan's assumptions:
– A correct decision tree for S will classify objects in the same proportion as their representation in S.
– Given a case to classify, a test can be regarded as the source of a message about that case.
Let Ni be the number of cases in S that belong to class Ci:
p(c ∈ Ci) = Ni / |S|
The uncertainty U(S) measures the average amount of information needed to determine the class of a random case c ∈ S. After S has been partitioned by a test T, the uncertainty becomes:
U_T(S) = Σi (|Si| / |S|) · U(Si)
Select the test T that gains the most information, i.e., for which G_S(T) = U(S) − U_T(S) is maximal.
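Quinlan's formulation generalizes the two-class entropy to any number of classes. A minimal sketch, assuming base-2 logarithms as in the earlier slides; the list-based encoding of cases and outcomes is purely for illustration.

```python
import math
from collections import Counter

def uncertainty(cases):
    """U(S): average information needed to identify the class of a case in S."""
    n = len(cases)
    return -sum(k / n * math.log2(k / n) for k in Counter(cases).values())

def gain(cases, outcomes):
    """G_S(T) = U(S) - U_T(S), where `outcomes` gives each case's test outcome."""
    n = len(cases)
    partition = {}
    for cls, out in zip(cases, outcomes):
        partition.setdefault(out, []).append(cls)
    u_t = sum(len(s) / n * uncertainty(s) for s in partition.values())
    return uncertainty(cases) - u_t

# Three classes, a test with two outcomes.
classes  = ["C1", "C1", "C2", "C2", "C3", "C3"]
outcomes = ["O1", "O1", "O1", "O2", "O2", "O2"]
print(round(gain(classes, outcomes), 3))  # ~0.667
```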
Evaluation of ID3
– The ID3 algorithm tends to favor tests with a large number of outcomes over tests with a smaller number.
– Its computational complexity depends on the cost of choosing the next test to branch on.
– It was adapted to deal with noisy and incomplete data.
– It is a feasible alternative to knowledge elicitation if sufficient data of the right kind are available.
– However, the method is not incremental.
Further modifications were introduced in C4.5, e.g.:
– pruning the decision tree in order to avoid overfitting
– a better test-selection heuristic
Search Space and Search Trees
The search space is a logical space composed of:
– nodes, which are search states
– links, which are all legal connections between search states (e.g., in chess there is no link to a state where White castles after having previously moved the king)
It is always just an abstraction; think of search algorithms as trying to navigate this extremely complex space.
Search Trees
Search trees do not summarise all possible searches; instead, each is an abstraction of one possible search.
– The root is the null state.
– Edges represent one choice, e.g., setting the value of A first.
– Child nodes represent extensions; the children give all possible choices.
– Leaf nodes are solutions/failures.
Example in SAT: the algorithm detects failure early and need not pick the same variables everywhere.
Definition
A tree-shaped structure that represents a set of decisions. These decisions are used as a basis for predictions; they represent rules for classifying datasets. Useful knowledge can be extracted from this classification.
DT Structure: Node Types (see the sketch below)
– Decision nodes: each specifies a test to be carried out on a single attribute value. Each outcome is assigned a branch that connects to a leaf node or to another decision node.
– Leaf nodes: each indicates the classification of an example.
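A minimal sketch of this node structure in Python; the class and field names are illustrative assumptions rather than a prescribed implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DecisionNode:
    """Internal node: tests one attribute; each outcome leads to a child node."""
    attribute: str
    children: dict = field(default_factory=dict)  # outcome value -> DecisionNode or LeafNode

@dataclass
class LeafNode:
    """Leaf node: assigns a classification."""
    label: str

# A fragment of the earlier weather tree.
tree = DecisionNode("outlook", {
    "overcast": LeafNode("Play"),
    "sunny": DecisionNode("humidity", {"normal": LeafNode("Play"),
                                       "high": LeafNode("Don't play")}),
})
```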
An Example
Growing a Decision Tree
Iterative Dichotomiser 3 (ID3) Algorithm
– Invented by J. Ross Quinlan in 1975.
– Based on a greedy search algorithm.
ID3 (continued)
The goal is to create the best possible tree that works on all available data.
– Each example belongs strictly to one class or the other.
– We need to select the attribute that best classifies the examples, i.e., the attribute whose split leaves the smallest (weighted) entropy over the examples.
– The lower that entropy, the higher the information gain; we want high information gain.
Entropy
A quantitative measurement of the homogeneity of a set of examples. It tells us how well an attribute separates the training examples according to their target classification.
Entropy (continued)
Given a set S containing only positive or negative examples (the 2-class case):
Entropy(S) = −P_P log2 P_P − P_N log2 P_N
where P_P is the proportion of positive examples and P_N is the proportion of negative examples.
Entropy (continued)
Example: given 25 examples with 15 positive and 10 negative,
Entropy(S) = −(15/25) log2(15/25) − (10/25) log2(10/25) = 0.97
If Entropy(S) = 0, all members of S belong to exactly one class. If Entropy(S) = 1 (the maximum value), the members are split equally between the two classes.
In General…
If the classification takes more than two values, the entropy generalizes to
Entropy(S) = Σ (i = 1 to n) −p_i log2 p_i
where n is the number of values and p_i is the proportion of S having the i-th value.
Information Gain
Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|S_v| / |S|) · Entropy(S_v)
where A is an attribute of S, Values(A) is the set of possible values of A, v is a particular value in Values(A), and S_v is the subset of S whose value of A is v.
Actual Algorithm
ID3(Examples, Target_Attribute, Attributes):
1. Create a root node for the tree.
2. If all examples are positive, return the single-node tree Root with label = +.
3. If all examples are negative, return the single-node tree Root with label = −.
4. If the set of predicting attributes is empty, return the single-node tree Root with label = the most common value of the target attribute in the examples.
5. Otherwise:
   – A ← the attribute that best classifies the examples.
   – Decision-tree attribute for Root ← A.
   – For each possible value v_i of A:
     – Add a new tree branch below Root, corresponding to the test A = v_i.
     – Let Examples(v_i) be the subset of examples that have the value v_i for A.
     – If Examples(v_i) is empty, then below this new branch add a leaf node with label = the most common target value in the examples.
     – Else below this new branch add the subtree ID3(Examples(v_i), Target_Attribute, Attributes − {A}).
6. Return Root.
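A compact Python rendering of this pseudocode, as a sketch under the assumptions that examples are dictionaries mapping attribute names to values and that the tree is returned as nested dicts. Unlike the pseudocode, it only branches on attribute values actually observed in the examples, so the "Examples(v_i) is empty" case does not arise here.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the target attribute over a list of example dicts."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_attribute(rows, attributes, target):
    """The attribute with the highest information gain on `rows`."""
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - remainder
    return max(attributes, key=gain)

def id3(rows, attributes, target):
    """Return a nested-dict tree: {attribute: {value: subtree_or_label}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all examples in one class
        return labels[0]
    if not attributes:                        # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, attributes, target)
    tree = {a: {}}
    # Branch on the values of `a` that actually occur in the examples.
    for value in {r[a] for r in rows}:
        subset = [r for r in rows if r[a] == value]
        tree[a][value] = id3(subset, [x for x in attributes if x != a], target)
    return tree

# Small demonstration on a few PlayTennis-style examples.
rows = [
    {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Overcast", "Humidity": "High", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Humidity": "Normal", "Wind": "Strong", "Play": "No"},
]
print(id3(rows, ["Outlook", "Humidity", "Wind"], "Play"))  # prints a nested dict
```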
Independent attributes: Outlook, Temperature, Humidity, Windy. Dependent attribute: Play.

Outlook   Temperature  Humidity  Windy  Play
sunny     85                     FALSE  Don't play
sunny     80           90        TRUE   Don't play
overcast  83           78        FALSE  Play
rain      70           95        FALSE  Play
rain      68           80        FALSE  Play
rain      65           70        TRUE   Don't play
overcast  64           65        TRUE   Play
sunny     72           95        FALSE  Don't play
sunny     69           70        FALSE  Play
rain      75           80        FALSE  Play
sunny     75           70        TRUE   Play
overcast  72           90        TRUE   Play
overcast  81           75        FALSE  Play
rain      71           80        TRUE   Don't play

We choose Play as our dependent attribute. Don't play = Negative; Play = Positive.
Entropy of each candidate split over the full data set (play / don't play / total; the total row's Entropy is the weighted entropy after the split):

Outlook    play  don't play  total  Entropy
sunny      2     3           5      0.971
overcast   4     0           4      0
rain       3     2           5      0.971
total      9     5           14     0.694

Temp       play  don't play  total  Entropy
>
<=
total      9     5           14

Humidity   play  don't play  total  Entropy
>
<=
total      9     5           14

Windy      play  don't play  total  Entropy
TRUE       3     3           6      1.0
FALSE      6     2           8      0.811
total      9     5           14     0.892
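Only a few cells of this table survived, so the sketch below recomputes the Outlook and Windy rows directly from the data table above; the temperature and humidity thresholds used by the slide are not shown there, so those splits are omitted. The row encoding and helper names are illustrative.

```python
import math
from collections import Counter, defaultdict

# (outlook, windy, play) triples taken from the table above; temperature and
# humidity are left out because their split thresholds are not given.
ROWS = [
    ("sunny", False, "Don't play"), ("sunny", True, "Don't play"),
    ("overcast", False, "Play"),    ("rain", False, "Play"),
    ("rain", False, "Play"),        ("rain", True, "Don't play"),
    ("overcast", True, "Play"),     ("sunny", False, "Don't play"),
    ("sunny", False, "Play"),       ("rain", False, "Play"),
    ("sunny", True, "Play"),        ("overcast", True, "Play"),
    ("overcast", False, "Play"),    ("rain", True, "Don't play"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def summarise(column):
    """Per-value play / don't-play counts and entropy, as in the table above."""
    groups = defaultdict(list)
    for row in ROWS:
        groups[row[column]].append(row[2])
    for value, labels in groups.items():
        plays = labels.count("Play")
        print(f"{value}: play={plays}, don't={len(labels) - plays}, "
              f"total={len(labels)}, entropy={entropy(labels):.3f}")

summarise(0)  # Outlook: sunny 0.971, overcast 0.0, rain 0.971
summarise(1)  # Windy:   True 1.0,    False 0.811
```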
Sunny subset:

Outlook  Temperature  Humidity  Windy  Play
sunny    85                     FALSE  Don't play
sunny    80           90        TRUE   Don't play
sunny    72           95        FALSE  Don't play
sunny    69           70        FALSE  Play
sunny    75           70        TRUE   Play
Entropy of each candidate split of the sunny subset (play / don't play / total):

Temp      play  don't play  total  Entropy
>
<=
total     2     3           5

Humidity  play  don't play  total  Entropy
>
<=
total     2     3           5      0

Windy     play  don't play  total  Entropy
TRUE      1     1           2      1.0
FALSE     1     2           3      0.918
total     2     3           5      0.951
Rain subset:

Outlook  Temperature  Humidity  Windy  Play
rain     70           95        FALSE  Play
rain     68           80        FALSE  Play
rain     65           70        TRUE   Don't play
rain     75           80        FALSE  Play
rain     71           80        TRUE   Don't play
Entropy of each candidate split of the rain subset (play / don't play / total):

Temp      play  don't play  total  Entropy
>
<=
total     3     2           5

Humidity  play  don't play  total  Entropy
>
<=
total     3     2           5

Windy     play  don't play  total  Entropy
TRUE      0     2           2      0
FALSE     3     0           3      0
total     3     2           5      0
Overcast subset:

Outlook   Temperature  Humidity  Windy  Play
overcast  83           78        FALSE  Play
overcast  64           65        TRUE   Play
overcast  72           90        TRUE   Play
overcast  81           75        FALSE  Play
Every split of the overcast subset has entropy 0, because all four examples are Play:

Temp      play  don't play  total  Entropy
>
<=
total     4     0           4      0

Humidity  play  don't play  total  Entropy
>
<=
total     4     0           4      0

Windy     play  don't play  total  Entropy
TRUE      2     0           2      0
FALSE     2     0           2      0
total     4     0           4      0
ID3 Summary
Step 1: Take all unused attributes and compute their entropy over the current set of examples.
Step 2: Choose the attribute with the smallest (weighted) entropy, i.e., the highest information gain.
Step 3: Make a node containing that attribute.
Growth stops when:
– every attribute already exists along the path through the tree, or
– the training examples associated with a leaf all have the same target attribute value.