Data Mining-Knowledge Presentation—ID3 algorithm Prof. Sin-Min Lee Department of Computer Science

Data Mining Tasks
- Predicting on new data by using rules, patterns, and behaviors: Classification, Estimation
- Understanding the groupings, trends, and characteristics of your customers: Segmentation
- Visualizing the Euclidean spatial relationships, trends, and patterns of your data: Description

Stages of the Data Mining Process
1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data, e.g., a physiologically impossible patient fever reading.
3. Feature extraction: obtaining only the interesting attributes of the data; e.g., "date acquired" is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as "data mining" and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results: not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.

Clusters of Galaxies
- Skycat clustered 2x10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum.
- The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe.

Clustering: Examples
- Cholera outbreak in London

Decision trees are an alternative way of structuring rule information.

[Decision tree figure: outlook at the root with branches sunny, overcast, and rain; the sunny branch tests humidity (normal: P, high: N), the overcast branch is P, and the rain branch tests windy (false: P, true: N).]

A classification rule based on the tree:
if outlook = overcast ∨ (outlook = sunny & humidity = normal) ∨ (outlook = rain & windy = false) then P

Equivalently:
if outlook = overcast then P
if outlook = sunny & humidity = normal then P
if outlook = rain & windy = false then P
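The same rule can be written as a small predicate; the sketch below is added here (not from the original slides) and returns True for class P:

def play(outlook, humidity, windy):
    # Returns True for class P (play), False for class N (don't play).
    if outlook == "overcast":
        return True
    if outlook == "sunny" and humidity == "normal":
        return True
    if outlook == "rain" and not windy:
        return True
    return False

print(play("sunny", "normal", windy=False))  # True  (class P)
print(play("rain", "high", windy=True))      # False (class N)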

[Decision tree figure: Outlook at the root with branches Sunny, Overcast, and Rain; the Sunny branch leads to a Humidity test with branches High (No) and Normal (Yes).]
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
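As an illustration (added, not part of the original slides), such a tree can be held in a nested dictionary: an internal node maps an attribute to its value branches, and a leaf is a class label. The Rain branch below is filled in from the final tree that appears later in these slides.

tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, example):
    # Walk the nested dict until a leaf (a plain string label) is reached.
    while isinstance(node, dict):
        attribute = next(iter(node))              # attribute tested at this node
        node = node[attribute][example[attribute]]
    return node

print(classify(tree, {"Outlook": "Rain", "Wind": "Weak"}))  # Yes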

Top-Down Induction of Decision Trees (ID3)
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.

Which Attribute is "best"?
Two candidate splits of S = [29+, 35-]:
- A1 = ?   True: [21+, 5-]    False: [8+, 30-]
- A2 = ?   True: [18+, 33-]   False: [11+, 2-]

Entropy
- S is a sample of training examples
- p+ is the proportion of positive examples
- p- is the proportion of negative examples
- Entropy measures the impurity of S:
  Entropy(S) = -p+ log2(p+) - p- log2(p-)

Entropy
Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).
Why? Information theory: an optimal-length code assigns -log2(p) bits to a message having probability p. So the expected number of bits to encode (+ or -) of a random member of S is:
  -p+ log2(p+) - p- log2(p-)
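As a quick check (a sketch added here, not from the slides), the two-class entropy can be computed directly; the example reproduces the Entropy([29+, 35-]) value used on the next slide.

import math

def entropy(pos, neg):
    # Two-class entropy of a sample with `pos` positive and `neg` negative examples.
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(29, 35), 2))       # 0.99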

Information Gain
Gain(S, A): the expected reduction in entropy due to sorting S on attribute A:
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

For the two candidate splits of S = [29+, 35-] above (A1: True [21+, 5-], False [8+, 30-]; A2: True [18+, 33-], False [11+, 2-]):
  Entropy([29+, 35-]) = -29/64 log2(29/64) - 35/64 log2(35/64) = 0.99

A1: Entropy([21+, 5-]) = 0.71, Entropy([8+, 30-]) = 0.74
  Gain(S, A1) = Entropy(S) - (26/64) Entropy([21+, 5-]) - (38/64) Entropy([8+, 30-]) = 0.27

A2: Entropy([18+, 33-]) = 0.94, Entropy([11+, 2-]) = 0.62
  Gain(S, A2) = Entropy(S) - (51/64) Entropy([18+, 33-]) - (13/64) Entropy([11+, 2-]) = 0.12
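A short sketch (added, not from the slides) that reproduces these two gains, using the same two-class entropy definition as above:

import math

def entropy(pos, neg):
    # Two-class entropy of (pos, neg) counts.
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def gain(parent, splits):
    # parent and each split are (pos, neg) pairs; returns the information gain.
    total = sum(parent)
    g = entropy(*parent)
    for pos, neg in splits:
        g -= (pos + neg) / total * entropy(pos, neg)
    return g

print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))    # 0.27 for A1
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))   # 0.12 for A2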

Training Examples

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Selecting the Next Attribute
S = [9+, 5-], Entropy(S) = 0.940

Humidity split: High -> [3+, 4-] (E = 0.985), Normal -> [6+, 1-] (E = 0.592)
  Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Wind split: Weak -> [6+, 2-] (E = 0.811), Strong -> [3+, 3-] (E = 1.0)
  Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

Selecting the Next Attribute
Outlook split: Sunny -> [2+, 3-] (E = 0.971), Overcast -> [4+, 0-] (E = 0.0), Rain -> [3+, 2-] (E = 0.971)
  Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
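These gains can be checked mechanically. The sketch below is added here (not from the slides) and recomputes them from the training table; the same gain_of() function can be reapplied to the Sunny subset used on the next slide.

import math

def entropy_of(labels):
    # Entropy of a list of class labels (e.g. "Yes"/"No").
    total = len(labels)
    return -sum((labels.count(v) / total) * math.log2(labels.count(v) / total)
                for v in set(labels))

def gain_of(rows, attribute, target="Play"):
    labels = [r[target] for r in rows]
    g = entropy_of(labels)
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        g -= len(subset) / len(rows) * entropy_of(subset)
    return g

data = [  # D1..D14 from the training table above
    ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No"),
]
columns = ("Outlook", "Temp", "Humidity", "Wind", "Play")
rows = [dict(zip(columns, r)) for r in data]

for attribute in ("Outlook", "Humidity", "Wind", "Temp"):
    print(attribute, round(gain_of(rows, attribute), 3))
# Outlook 0.247, Humidity 0.152 (0.151 on the slide), Wind 0.048, Temp 0.029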

ID3 Algorithm (continuing on the Sunny branch)
Outlook splits [D1, D2, ..., D14] = [9+, 5-] into:
  Sunny:    S_sunny = [D1, D2, D8, D9, D11] = [2+, 3-]  (subtree still to be chosen)
  Overcast: [D3, D7, D12, D13] = [4+, 0-]  -> Yes
  Rain:     [D4, D5, D6, D10, D14] = [3+, 2-]  (subtree still to be chosen)

Gain(S_sunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(S_sunny, Temp.)    = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(S_sunny, Wind)     = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019

[Final decision tree: Outlook at the root. Sunny -> Humidity: High -> No (D1, D2, D8), Normal -> Yes (D9, D11). Overcast -> Yes (D3, D7, D12, D13). Rain -> Wind: Strong -> No (D6, D14), Weak -> Yes (D4, D5, D10).]

The ID3 Algorithm
Given a set of disjoint target classes {C_1, C_2, ..., C_k} and a set of training data S containing objects of more than one class, let T be any test on a single attribute of the data, with O_1, O_2, ..., O_n representing the possible outcomes of applying T to any object x (written as T(x)). T produces a partition {S_1, S_2, ..., S_n} of S such that S_i = { x | T(x) = O_i }.

Proceed recursively to replace each S_i with a decision tree. Crucial factor: selecting the tests.
[Diagram: the test on S has outcomes O_1, O_2, ..., O_n, leading to the subsets S_1, S_2, ..., S_n.]

In making this decision, Quinlan employs the notion of uncertainty (entropy from information theory).
  M = {m_1, m_2, ..., m_n}     set of messages
  p(m_i)                       probability of the message m_i being received
  I(m_i) = -log p(m_i)         amount of information of message m_i
  U(M) = Σ_i p(m_i) I(m_i)     uncertainty of the set M
Quinlan's assumptions:
- A correct decision tree for S will classify objects in the same proportion as their representation in S.
- Given a case to classify, a test can be regarded as the source of a message about that case.

Let N_i be the number of cases in S that belong to class C_i:
  p(c ∈ C_i) = N_i / |S|
The uncertainty, U(S), measures the average amount of information needed to determine the class of a random case, c ∈ S.
Uncertainty measure after S has been partitioned by a test T:
  U_T(S) = Σ_i (|S_i| / |S|) U(S_i)
Select the test T that gains the most information, i.e., for which
  G_S(T) = U(S) - U_T(S)
is maximal.
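A minimal sketch (added, not from the slides) of U(S), U_T(S), and G_S(T) in this multi-class notation; the names and the tiny example are purely illustrative.

import math
from collections import Counter

def U(cases, class_of):
    # Uncertainty (entropy) of a set of cases; class_of maps a case to its class.
    counts = Counter(class_of(c) for c in cases)
    n = len(cases)
    return -sum((m / n) * math.log2(m / n) for m in counts.values())

def G(cases, test, class_of):
    # G_S(T) = U(S) - U_T(S); `test` maps a case to one of the outcomes O_i.
    u_t = 0.0
    for outcome in set(test(c) for c in cases):
        subset = [c for c in cases if test(c) == outcome]
        u_t += len(subset) / len(cases) * U(subset, class_of)
    return U(cases, class_of) - u_t

# Example with three classes: cases are (colour, class) pairs.
cases = [("red", "A"), ("red", "A"), ("blue", "B"), ("blue", "C"), ("green", "A")]
print(round(G(cases, test=lambda c: c[0], class_of=lambda c: c[1]), 3))   # 0.971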

Evaluation of ID3
- The ID3 algorithm tends to favor tests with a large number of outcomes over tests with a smaller number.
- Its computational complexity depends on the cost of choosing the next test to branch on.
- It was adapted to deal with noisy and incomplete data.
- It is a feasible alternative to knowledge elicitation if sufficient data of the right kind are available.
- However, this method is not incremental.
- Further modifications were introduced in C4.5, e.g.: pruning the decision tree in order to avoid overfitting, and a better test-selection heuristic (a brief sketch follows below).
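One such heuristic is C4.5's gain ratio, which divides the information gain by the split information of the test to penalise many-valued attributes. The sketch below is an illustration added here, not taken from the slides.

import math

def split_info(sizes):
    # sizes: number of cases falling into each outcome subset of the test.
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s)

def gain_ratio(information_gain, sizes):
    si = split_info(sizes)
    return information_gain / si if si else 0.0

# Outlook on the 14-example table: gain 0.247, subset sizes 5 (Sunny), 4 (Overcast), 5 (Rain).
print(round(gain_ratio(0.247, [5, 4, 5]), 3))   # 0.157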

Search Space and Search Trees
The search space is a logical space composed of:
- nodes, which are search states
- links, which are all legal connections between search states; e.g., in chess, there is no link between states where White castles after having previously moved the king.
It is always just an abstraction; think of search algorithms as trying to navigate this extremely complex space.

Search Trees
- Search trees do not summarise all possible searches; instead, each is an abstraction of one possible search.
- The root is the null state.
- Edges represent one choice, e.g., to set the value of A first.
- Child nodes represent extensions; children give all possible choices.
- Leaf nodes are solutions/failures.
- Example in SAT: the algorithm detects failure early and need not pick the same variables everywhere.

Definition
A tree-shaped structure that represents a set of decisions. These decisions are used as a basis for predictions. They represent rules for classifying datasets. Useful knowledge can be extracted by this classification.

DT Structure
Node types:
- Decision nodes: specify some test to be carried out on a single attribute value. Each outcome is assigned to a branch that connects to a leaf node or another decision node.
- Leaf nodes: indicate the classification of an example.

An Example

Growing a Decision Tree

Iterative Dichotomiser 3 (ID3) Algorithm Invented by J. Ross Quinlan in 1975 Based on a greedy search algorithm.

ID3 Cont.
The goal is to create the best possible tree that works on all available data.
- An example strictly belongs to one class or the other.
- We need to select the attribute that best classifies the examples (i.e., the attribute with the smallest entropy over the examples).
- The lower the entropy, the higher the Information Gain. We desire high IG.

Entropy
A quantitative measurement of the homogeneity of a set of examples. It tells us how well an attribute separates the training examples according to their target classification.

Entropy cont.
Given a set S with only positive or negative examples (2-class case):
  Entropy(S) = -P_p log2(P_p) - P_n log2(P_n)
where
  P_p = proportion of positive examples
  P_n = proportion of negative examples

Entropy cont.
Ex. Given 25 examples with 15 positive and 10 negative:
  Entropy(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) = 0.97
If Entropy(S) = 0, all members of S belong to strictly one class.
If Entropy(S) = 1 (the maximum value), members are split equally between the two classes.

In General…
In general, if an attribute takes more than two values:
  Entropy(S) = -Σ_{i=1}^{n} p_i log2(p_i)
where n is the number of values and p_i is the proportion of examples taking the i-th value.

Information Gain
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where
  A is an attribute of S
  Values(A) is the set of possible values of A
  v is a particular value in Values(A)
  S_v is the subset of S whose value of A is v.

Actual Algorithm
ID3(Examples, Target_Attribute, Attributes):
  Create a root node for the tree.
  If all examples are positive, return the single-node tree Root, with label = +.
  If all examples are negative, return the single-node tree Root, with label = -.
  If the set of predicting attributes is empty, return the single-node tree Root, with label = the most common value of the target attribute in the examples.
  Otherwise:
    A = the attribute that best classifies the examples.
    Decision tree attribute for Root = A.
    For each possible value v_i of A:
      Add a new tree branch below Root, corresponding to the test A = v_i.
      Let Examples(v_i) be the subset of examples that have the value v_i for A.
      If Examples(v_i) is empty:
        Below this new branch add a leaf node with label = the most common target value in the examples.
      Else:
        Below this new branch add the subtree ID3(Examples(v_i), Target_Attribute, Attributes - {A}).
  Return Root.
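A runnable Python sketch of this pseudocode (added here, not from the slides). It builds the same kind of nested-dict tree used earlier, with examples given as attribute-to-value dictionaries; the mini-dataset at the end is hypothetical.

import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                       # all examples in one class
        return labels[0]
    if not attributes:                              # no attributes left
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    # Branch only on values that actually occur, so the pseudocode's
    # "Examples(v_i) is empty" case cannot arise here.
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target, remaining)
    return tree

# Tiny illustrative run:
toy = [
    {"Outlook": "Sunny",    "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Sunny",    "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Wind": "Strong", "Play": "No"},
]
print(id3(toy, "Play", ["Outlook", "Wind"]))
# e.g. {'Outlook': {'Sunny': 'No', 'Overcast': 'Yes', 'Rain': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}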

Independent attributes, and Play as the dependent attribute:

Outlook    Temperature  Humidity  Windy   Play
sunny      85           85        FALSE   Don't play
sunny      80           90        TRUE    Don't play
overcast   83           78        FALSE   Play
rain       70           95        FALSE   Play
rain       68           80        FALSE   Play
rain       65           70        TRUE    Don't play
overcast   64           65        TRUE    Play
sunny      72           95        FALSE   Don't play
sunny      69           70        FALSE   Play
rain       75           80        FALSE   Play
sunny      75           70        TRUE    Play
overcast   72           90        TRUE    Play
overcast   81           75        FALSE   Play
rain       71           80        TRUE    Don't play

We choose Play as our dependent attribute. Don't play = Negative, Play = Positive.

Outlookplay don’t playtotalEntropy sunny ovecast4040 rain total Temp > <= total Humidity > <= total Windy TRUE FALSE total

Sunny subset:

Outlook  Temperature  Humidity  Windy   Play
sunny    85           85        FALSE   Don't play
sunny    80           90        TRUE    Don't play
sunny    72           95        FALSE   Don't play
sunny    69           70        FALSE   Play
sunny    75           70        TRUE    Play

playdon’t playtotalEntropy Temp > <= total Humidity > <= total0 Windy TRUE FALSE total

Rain subset:

Outlook  Temperature  Humidity  Windy   Play
rain     70           95        FALSE   Play
rain     68           80        FALSE   Play
rain     65           70        TRUE    Don't play
rain     75           80        FALSE   Play
rain     71           80        TRUE    Don't play

playdon’t playtotalEntropy Temp > <= total Humidity > <= total Windy TRUE0220 FALSE3030 total0

Overcast subset:

Outlook   Temperature  Humidity  Windy   Play
overcast  83           78        FALSE   Play
overcast  64           65        TRUE    Play
overcast  72           90        TRUE    Play
overcast  81           75        FALSE   Play

playdon’t playtotalEntropy Temp > <= total0 Humidity > <= total0 Windy TRUE2020 FALSE2020 total0

ID3 Summary
Step 1: Take all unused attributes and compute their entropy with respect to the training samples.
Step 2: Choose the attribute with the smallest entropy.
Step 3: Make a node containing that attribute.

Growth stops when:
- Every attribute already exists along the path through the tree.
- The training examples associated with a leaf all have the same target attribute value.
