CS344: Introduction to Artificial Intelligence
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 29: Decision Tree Learning; ID3; Entropy
Information in the data
The to-play-or-not-to-play-tennis data vs. climatic conditions, from Ross Quinlan's papers on ID3 (1986) and C4.5 (1993). Each example records the weather attributes and the decision:
Outlook / Weather (O): Sunny, Cloudy, Rain
Temp (T): High, Med, Cold
Humidity (H): High, Low
Windy (W): T, F
Decision to play (D): Y, N
The decision tree learned from the data:
Outlook?
  Sunny  -> Humidity? (High -> No, Low -> Yes)
  Cloudy -> Yes
  Rain   -> Windy?    (T -> Yes, F -> No)
Rule Base
R1: If outlook is sunny and humidity is high, then decision is No.
R2: If outlook is sunny and humidity is low, then decision is Yes.
R3: If outlook is cloudy, then decision is Yes.
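As a quick illustration, rules R1-R3 can be written as a small classifier in Python. This is only a sketch: the value strings ("Sunny", "High", …) are assumed encodings, not taken from Quinlan's paper, and the Rain case is left to the Windy test in the tree above.

```python
def decide(outlook, humidity):
    """Decision from rules R1-R3; value strings are assumed encodings."""
    if outlook == "Sunny" and humidity == "High":
        return "No"      # R1
    if outlook == "Sunny" and humidity == "Low":
        return "Yes"     # R2
    if outlook == "Cloudy":
        return "Yes"     # R3
    return None          # not covered by R1-R3 (e.g. the Rain branch)

print(decide("Sunny", "Low"))   # Yes (rule R2)
```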
Making Sense of Information
Classification
Clustering
Giving a short and nice description
Short Description
The Occam's Razor principle: the shortest/simplest description is the best for generalization.
Representation Languages
Decision tree
Neural network
Rule base
Boolean expression
Information & Entropy
The example data, presented as rows of attribute values and labels, carries less ordered/structured information than the succinct descriptions (the decision tree and the rule base).
Lack of structure in information is measured by "entropy".
Define the entropy of S (labeled data):
E(S) = -(P+ log2 P+ + P- log2 P-)
P+ = proportion of positively labeled data
P- = proportion of negatively labeled data
Example
P+ = 9/14, P- = 5/14
E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
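A minimal Python sketch of this calculation, assuming the binary-entropy definition above and the convention 0 · log2(0) = 0:

```python
from math import log2

def entropy(pos, neg):
    """E(S) for a set with pos positive and neg negative examples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:                      # convention: 0 * log2(0) = 0
            p = count / total
            e -= p * log2(p)
    return e

print(round(entropy(9, 5), 3))         # 0.94, matching the example above
```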
Partitioning the Data
Take "Windy" as the attribute; Windy takes values in {T, F}.
Partition the data into the subsets with Windy = T and Windy = F.
Partitioning the Data (contd.)
Partitioning by focusing on a particular attribute produces "information gain", i.e., a reduction in entropy.
Partitioning the Data (contd.)
Information gain when we partition on Windy = {T, F}:
Windy = T: #+ = 6, #- = 2
Windy = F: #+ = 3, #- = 3
(# denotes "number of")
Partitioning the Data (contd.)
Windy
  T -> 6 +, 2 -
  F -> 3 +, 3 -
Partitioning the Data (contd.)
Gain(S, A) = E(S) - Σ_{v ∈ values(A)} (|Sv| / |S|) E(Sv)
E(S) = 0.940
Entropies of the partitions on Windy:
E(Windy = T) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Partitioning the Data (contd.)
E(Windy = F) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
Partitioning the Data (contd.)
Gain(S, Windy) = 0.940 - (8/14 × 0.811 + 6/14 × 1.0) = 0.048
Exercise: find the information gain for each attribute: Outlook, Temp, Humidity and Windy.
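Using the entropy function sketched earlier, the same gain computation in Python (the partition counts are the Windy = T and Windy = F counts above):

```python
def info_gain(pos, neg, partitions):
    """Gain(S, A) = E(S) - sum over values v of (|Sv| / |S|) * E(Sv).
    partitions: list of (pos, neg) counts, one pair per value of A."""
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(pos, neg) - remainder

# Windy = T -> (6, 2), Windy = F -> (3, 3)
print(round(info_gain(9, 5, [(6, 2), (3, 3)]), 3))   # 0.048
```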
ID3 Algorithm
The ID3 algorithm builds a decision-tree classifier by calculating the gain for every attribute at each node and splitting on the attribute with maximum gain, until the decision tree is complete.
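A compact recursive sketch of this idea, not Quinlan's actual implementation: rows are assumed to be dicts mapping attribute names to values, with the decision stored under the key "D" as "Y" or "N", and it reuses the entropy and info_gain helpers sketched above.

```python
from collections import Counter

def id3(rows, attributes, target="D"):
    """Build a decision tree as nested dicts: {attribute: {value: subtree_or_label}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                        # pure node: return the label
        return labels[0]
    if not attributes:                               # nothing left to split on
        return Counter(labels).most_common(1)[0][0]  # majority label

    def counts(subset):
        c = Counter(r[target] for r in subset)
        return c["Y"], c["N"]

    def gain(attr):
        values = set(r[attr] for r in rows)
        parts = [counts([r for r in rows if r[attr] == v]) for v in values]
        return info_gain(*counts(rows), parts)

    best = max(attributes, key=gain)                 # attribute with maximum gain
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, remaining, target)
    return tree
```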
Neural Computation
Some Observations on the Brain
Ray Kurzweil, The Singularity is Near, 2005: machines will be able to out-think people within a few decades.
But the brain arose through natural selection: it contains layers of systems that arose for one function and were then adopted for another, even if they do not work perfectly.
Differences between the Brain and Computers
Highly efficient use of energy in the brain
High adaptability
Tremendous amount of compression: space is at a premium in the cranium
One cubic centimeter of human brain tissue contains 50 million neurons, several hundred miles of axons (the "wires" for transmitting signals), and close to a trillion synapses (the connections between neurons)
Immense Memory Capacity
1 cc contains 1 terabyte of information
About 1000 cc makes up the whole brain
So about 1 million gigabytes, or 1 petabyte, of information
The entire archived content of the internet is 3 petabytes
Moore's Law
Storage capacity doubles every year.
A single computer the size of the brain will contain a petabyte of information by 2030.
Question mark: power consumption?
Power Issues
By 2025, the memory of an artificial brain will use nearly a gigawatt of power: the amount currently consumed by all of Washington, DC.
In contrast, the brain uses only 12 watts of power, less than the energy used by a typical refrigerator light.
Brain vs. Computer Processing
Associative memory vs. addressable memory
Parallel distributed processing (PDP) vs. serial computation
Fast responses to complex situations vs. precisely repeatable steps
Preference for approximations and "good enough" solutions vs. exact solutions
Mistakes and biases vs. cold logic
Brain vs. Computers (contd.)
Excellent pattern recognition vs. excellent number crunching
Emotion, the brain's steersman that assigns values to experiences and future possibilities, vs. a computer that is insensitive to emotions
Evaluating potential outcomes efficiently and rapidly when information is uncertain vs. a "garbage in, garbage out" situation
Properties of Entropy
Example: tossing a coin
S = {s1, s2} with probabilities P1 and P2, P1 + P2 = 1
E(S) = -[P1 log2 P1 + P2 log2 P2]
With P1 = P2 = 0.5, E(S) = 1.0
[Figure: entropy with respect to Boolean classification — E(S) plotted against P+; the curve rises from 0 to a maximum of 1.0 at P+ = 0.5 and falls back to 0 at P+ = 1.]
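A short matplotlib sketch that reproduces the curve described above (the binary entropy of the classification as a function of P+):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-6, 1 - 1e-6, 500)              # P+ values, avoiding log2(0)
e = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # binary entropy

plt.plot(p, e)
plt.xlabel("P+")
plt.ylabel("Entropy")
plt.title("Entropy with respect to Boolean classification")
plt.show()                                        # maximum of 1.0 at P+ = 0.5
```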
If all events are equally likely, entropy is maximal.
For S = {s1, s2, …, sq}, we expect E(S) to be maximal when each Pi = 1/q.
Theorem: E(S) is maximum when each Pi = 1/q, where S = {s1, s2, …, sq}.
Lemma: ln(x) = loge(x) <= x - 1 for x > 0.
Proof of lemma: consider f(x) = (x - 1) - ln x; note f(1) = 0.
df(x)/dx = 1 - 1/x; equating this to 0 gives x = 1, so f(x) has an extremum at x = 1.
d2f(x)/dx2 = 1/x2 > 0 for x > 0, so at x = 1, f(x) has a minimum.
f(x) = (x - 1) - ln x has its minimum at x = 1, and f(1) = 0, so f(x) >= 0 for all x > 0.
Hence ln x <= x - 1 for all x > 0.
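The lemma's argument, collected in one place (the same steps as above), written in LaTeX:

```latex
\begin{align*}
f(x) &= (x - 1) - \ln x, \qquad f(1) = 0,\\
f'(x) &= 1 - \frac{1}{x} = 0 \;\Rightarrow\; x = 1,\\
f''(x) &= \frac{1}{x^{2}} > 0 \quad (x > 0) \;\Rightarrow\; \text{minimum at } x = 1,\\
\therefore\; f(x) &\ge f(1) = 0 \;\Rightarrow\; \ln x \le x - 1 \quad \text{for all } x > 0.
\end{align*}
```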
Corollary
Let Σ_{i=1..m} xi = 1 and Σ_{i=1..m} yi = 1, with xi >= 0 and yi > 0, i.e., the xi and yi are probability distributions. Then
Σ_{i=1..m} xi ln(1/xi) <= Σ_{i=1..m} xi ln(1/yi).
Proof
Σ_{i=1..m} xi ln(1/xi) - Σ_{i=1..m} xi ln(1/yi) = Σ_{i=1..m} xi ln(yi/xi)
<= Σ_{i=1..m} xi (yi/xi - 1)      [by the lemma, ln z <= z - 1]
= Σ_{i=1..m} yi - Σ_{i=1..m} xi = 1 - 1 = 0.
This proves that Σ_{i=1..m} xi ln(1/xi) <= Σ_{i=1..m} xi ln(1/yi).
The proof of the theorem follows from this corollary.
Proof of the theorem: let S = {s1, s2, …, sq} with P(si) = Pi, i = 1, …, q.
In the corollary, choose xi = Pi and yi = 1/q. Then
Σ_{i=1..q} Pi ln(1/Pi) <= Σ_{i=1..q} Pi ln(1/yi) = Σ_{i=1..q} Pi ln q.
So E(S) <= Σ_{i=1..q} Pi ln q = ln q · Σ_{i=1..q} Pi = ln q.
E(S) is upper bounded by ln q, and this bound is attained when each Pi = 1/q.
This establishes the maximum value of the entropy.
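A quick numerical check of the theorem (entropy in nats, as in the proof; the non-uniform distribution here is made up for illustration):

```python
from math import log
import random

def entropy_nats(probs):
    """Entropy in nats: sum of p * ln(1/p), with 0 * ln(0) taken as 0."""
    return sum(-p * log(p) for p in probs if p > 0)

q = 4
uniform = [1 / q] * q
print(entropy_nats(uniform), log(q))     # both ~1.386 = ln 4

random.seed(0)                           # an arbitrary non-uniform distribution
w = [random.random() for _ in range(q)]
other = [x / sum(w) for x in w]
print(entropy_nats(other) <= log(q))     # True: E(S) <= ln q
```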
Summary / Review
Established the intuition for the information function I(si): related to "surprise".
Average information is called entropy.
Minimum value of E is 0. Maximum of E?
Lemma: ln x <= x - 1.
Corollary: Σ xi ln(1/xi) <= Σ xi ln(1/yi), where Σ xi = Σ yi = 1.
Max E is k · ln q (k is the constant fixed by the choice of logarithm base, e.g. k = 1/ln 2 for bits) and is reached when Pi = 1/q for each i.
Shannon asked: what is the "entropy of the English language"?
S = {a, b, c, …, ',', ':', …}
P(a) = relative frequency of 'a' in a large corpus, P(b) = …; this gives the Pi's.
E(English) = Σ Pi log2(1/Pi) = 4.08 bits per character.
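A small sketch of the corresponding computation; on a large English corpus the unigram character entropy comes out near the figure quoted above, while the short sample string here only roughly approximates it:

```python
from collections import Counter
from math import log2

def char_entropy(text):
    """Zeroth-order (unigram) character entropy, in bits per character."""
    counts = Counter(text)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

# A large corpus gives the Pi's used on the slide; this sample is just a demo.
sample = "the quick brown fox jumps over the lazy dog"
print(round(char_entropy(sample), 2))
```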
The maximum entropy for the toss of a coin is 1.0, reached when the coin is unbiased.
Interest of the reader/listener: a novel is more interesting than a scientific paper for some people.
Origin of Information Theory
C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, 1948.
T. Cover and J. Thomas, "Elements of Information Theory", 1991.
Summary
A haphazard presentation of data is not acceptable to the mind.
Focusing attention on an attribute automatically leads to information gain.
Defined entropy and, in parallel, information gain.
Related these to message communication.