CS344: Introduction to Artificial Intelligence


CS344: Introduction to Artificial Intelligence. Pushpak Bhattacharyya, CSE Dept., IIT Bombay. Lecture 29 - Decision Tree Learning; ID3; Entropy

Information in the data

To-play-or-not-to-play-tennis data vs. climatic conditions, from Ross Quinlan's papers on ID3 (1986) and C4.5 (1993): a table whose attributes are Outlook (O), Temp (T), Humidity (H), Windy (W) and whose class is Decision to play (D); attribute values include Sunny/Cloudy/Rain, High/Med/Cold/Low and T/F, with class labels Y/N.

The same play-tennis table shown again, with the first attribute labelled Weather (O) instead of Outlook; the remaining columns are Temp (T), Humidity (H), Windy (W), and Decision (D).

The decision tree learned from the data: the root tests Outlook. Outlook = Sunny leads to a Humidity test (High → No, Low → Yes); Outlook = Cloudy leads directly to Yes; Outlook = Rain leads to a Windy test (T → No, F → Yes).

Rule Base. R1: If outlook is sunny and humidity is high, then the decision is No. R2: If outlook is sunny and humidity is low, then the decision is Yes. R3: If outlook is cloudy, then the decision is Yes.
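
As an illustrative sketch, the rule base above can be written directly as a small Python function; the function name decide, the exact string values, and the Rain branch (taken from the tree drawn above) are assumptions of this example, not code from the lecture.

```python
def decide(outlook, humidity, windy):
    """Play/not-play decision following rules R1-R3 plus the Rain branch of the tree."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"   # R1 / R2
    if outlook == "Cloudy":
        return "Yes"                                   # R3
    if outlook == "Rain":
        return "No" if windy == "T" else "Yes"         # Windy subtree of the drawn tree
    raise ValueError("unknown outlook value")

print(decide("Sunny", "High", "F"))   # prints: No
```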

Making Sense of Information: classification, clustering, and giving a short, clear description of the data.

Short Description: Occam's Razor principle (the shortest/simplest description is the best for generalization).

Representation Language Decision tree. Neural network. Rule base. Boolean expression.

Information & Entropy. The example data, presented as rows of labeled records, carries less ordered/structured information than the succinct descriptions (the decision tree and the rule base). Lack of structure in information is measured by "entropy".

Define the entropy of S (labeled data): E(S) = - (P+ log2 P+ + P- log2 P-), where P+ = proportion of positively labeled data and P- = proportion of negatively labeled data.

Example: P+ = 9/14, P- = 5/14. E(S) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.940.
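
A minimal Python sketch of this entropy computation; the function name entropy is an illustrative choice, and the counts 9 and 5 are the ones from the example above.

```python
import math

def entropy(pos, neg):
    """Entropy (in bits) of a two-class sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:                      # 0 log 0 is taken as 0
            p = count / total
            e -= p * math.log2(p)
    return e

print(round(entropy(9, 5), 3))         # prints 0.94, matching the 0.940 above
```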

Partitioning the Data. Take "Windy" as the attribute, with Windy ∈ {T, F}; the data are split into the rows with Windy = T and the rows with Windy = F.

Partitioning the Data (contd.). Partitioning by focusing on a particular attribute produces an "information gain", i.e., a reduction in entropy.

Partitioning the Data (contd.). Information gain when we choose Windy ∈ {T, F}: Windy = T has #+ = 6, #- = 2; Windy = F has #+ = 3, #- = 3 (# denotes "number of").

Partitioning the Data (contd.). The Windy = T branch contains 6 positive and 2 negative examples; the Windy = F branch contains 3 positive and 3 negative examples.

Partitioning the Data (contd.). Gain(S, A) = E(S) - Σ_{v ∈ values(A)} (|Sv| / |S|) E(Sv). E(S) = 0.940. E(Windy = T) = - 6/8 log2(6/8) - 2/8 log2(2/8) = 0.811.

Partitioning the Data (contd.). E(Windy = F) = - 3/6 log2(3/6) - 3/6 log2(3/6) = 1.0.

Partitioning the Data (contd.). Gain(S, Windy) = 0.940 - (8/14 × 0.811 + 6/14 × 1.0) = 0.048. Exercise: find the information gain for each attribute: Outlook, Temp, Humidity, and Windy.
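
The same gain computation as a short, self-contained Python sketch; the helper names entropy2 and information_gain are illustrative, and the split counts are the ones listed on the slides.

```python
import math

def entropy2(pos, neg):
    """Two-class entropy in bits, with 0 log 0 taken as 0."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def information_gain(parent, partitions):
    """Gain(S, A) = E(S) - sum over values v of (|Sv| / |S|) * E(Sv)."""
    total = sum(parent)
    gain = entropy2(*parent)
    for pos, neg in partitions:
        gain -= (pos + neg) / total * entropy2(pos, neg)
    return gain

# S has 9 positive and 5 negative rows; Windy = T gives (6+, 2-), Windy = F gives (3+, 3-)
print(round(information_gain((9, 5), [(6, 2), (3, 3)]), 3))   # prints 0.048
```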

ID3 Algorithm. Computing the gain for every attribute, splitting on the attribute with maximum gain, and repeating this on each partition until a decision tree is obtained is the "ID3" algorithm for building a classifier.
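
A compact sketch of ID3 in Python, assuming the examples are stored as dictionaries keyed by attribute name (e.g. "Outlook", "Temp", "Humidity", "Windy") with the class label under "D"; it illustrates the idea rather than reproducing the lecture's exact formulation.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy (in bits) of a list of class labels; works for any number of classes."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, attributes, target="D"):
    """Grow a decision tree as nested dicts: {attribute: {value: subtree_or_label}}."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                    # pure node: return the single label
        return labels[0]
    if not attributes:                           # no attributes left: return majority label
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        g = entropy_of(labels)
        for v in {e[attr] for e in examples}:
            sub = [e[target] for e in examples if e[attr] == v]
            g -= len(sub) / len(examples) * entropy_of(sub)
        return g

    best = max(attributes, key=gain)             # attribute with maximum information gain
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        branch = [e for e in examples if e[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = id3(branch, remaining, target)
    return tree
```

Calling id3(rows, ["Outlook", "Temp", "Humidity", "Windy"]) on the play-tennis rows returns a nested-dict tree of the kind drawn earlier.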

Neural Computation

Some observations on the brain (Ray Kurzweil, The Singularity is Near, 2005). Machines will be able to out-think people within a few decades. But the brain arose through natural selection: it contains layers of systems that arose for one function and were then adopted for another, even if they do not work perfectly.

Differences between the brain and computers: highly efficient use of energy in the brain; high adaptability; tremendous compression, since space is at a premium inside the cranium. One cubic centimeter of human brain tissue contains 50 million neurons, several hundred miles of axons (the "wires" for transmitting signals), and close to a trillion synapses, the connections between neurons.

Immense memory capacity: 1 cc contains 1 terabyte of information, and about 1000 cc makes up the whole brain, so about 1 million gigabytes, or 1 petabyte, of information. The entire archived content of the internet is about 3 petabytes.

Moore's law: storage capacity doubles every year, so a single computer the size of the brain will contain a petabyte of information by 2030. Question mark: power consumption?

Power issues. By 2025, the memory of an artificial brain will use nearly a gigawatt of power, the amount currently consumed by all of Washington, DC. In contrast, the brain uses only 12 watts of power, less than the energy used by a typical refrigerator light.

Brain vs. computer processing: associative memory vs. addressable memory; parallel distributed processing (PDP) vs. serial computation; fast responses to complex situations vs. precisely repeatable steps; preference for approximations and "good enough" solutions vs. exact solutions; mistakes and biases vs. cold logic.

Brain vs. computers (contd.): excellent pattern recognition vs. excellent number crunching; emotion, the brain's steersman, assigning values to experiences and future possibilities vs. the computer's insensitivity to emotions; evaluating potential outcomes efficiently and rapidly when information is uncertain vs. the "garbage in, garbage out" situation.

Properties of Entropy

Example: tossing a coin. S = {s1, s2} with P1 + P2 = 1. E(S) = - [P1 log2 P1 + P2 log2 P2]; with P1 = P2 = 0.5, E(S) = 1.0.

Entropy with respect to Boolean classification: plot of the entropy E(S) (vertical axis, 0 to 1.0) against P+ (horizontal axis, 0 to 1).

If all events are equally likely, then entropy is maximum. For S = {s1, s2, ..., sq}, we expect E(S) to be maximum when each Pi = 1/q.

Theorem: for S = {s1, s2, ..., sq}, E(S) is maximum when each Pi = 1/q. Lemma: ln(x) = loge(x) <= x - 1 for x > 0. Consider f(x) = (x - 1) - ln x; note f(1) = 0.

df(x)/dx = 1 - 1/x; equating this to 0 gives x = 1, so f(x) has an extremum at x = 1. d2f(x)/dx2 = 1/x2 > 0 for x > 0, so at x = 1, f(x) has a minimum.

f(x) has its minimum at x = 1, and f(1) = 0, so f(x) = x - 1 - ln x >= 0 for all x > 0; hence ln x <= x - 1 for x > 0.

Corollary: let Σ_{i=1..m} xi = 1 and Σ_{i=1..m} yi = 1, with xi >= 0 and yi > 0, so that the xi and yi are probability distributions. Then Σ_{i=1..m} xi ln(1/xi) <= Σ_{i=1..m} xi ln(1/yi).

Proof: Σ xi ln(1/xi) - Σ xi ln(1/yi) = Σ xi ln(yi/xi) <= Σ xi (yi/xi - 1) = Σ yi - Σ xi = 0, where all sums run over i = 1 to m and the inequality uses the lemma ln z <= z - 1 with z = yi/xi.

This proves that Σ_{i=1..m} xi ln(1/xi) <= Σ_{i=1..m} xi ln(1/yi). The proof of the theorem follows from this corollary.

For S = {s1, s2, ..., sq} with P(si) = Pi, i = 1..q, choose xi = Pi and yi = 1/q. The corollary Σ xi ln(1/xi) <= Σ xi ln(1/yi) then gives Σ_{i=1..q} Pi ln(1/Pi) <= Σ_{i=1..q} Pi ln q.

So E(S) <= Σ_{i=1..q} Pi ln q = ln q · Σ Pi = ln q; that is, E(S) is upper bounded by ln q, and this value is reached when each Pi = 1/q. This establishes the maximum value of the entropy.
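
A quick numerical check of this bound, using natural logarithms to match the ln q above; the value q = 4 and the skewed distribution are illustrative choices.

```python
import math

def entropy_nats(probs):
    """Entropy in nats: -sum p ln p, treating 0 ln 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

q = 4
print(round(entropy_nats([1 / q] * q), 4))            # 1.3863, i.e. ln 4, the maximum
print(round(math.log(q), 4))                          # ln q, for comparison
print(round(entropy_nats([0.7, 0.1, 0.1, 0.1]), 4))   # 0.9404, strictly smaller
```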

Summary/Review. Established the intuition for the information function I(si), related to "surprise". Average information is called entropy; the minimum value of E is 0. Maximum of E? Lemma: ln x <= x - 1. Corollary: Σ xi ln(1/xi) <= Σ xi ln(1/yi) when Σ xi = Σ yi = 1. The maximum of E is ln q (times a constant k that depends on the base of the logarithm) and is reached when Pi = 1/q for each i.

Shannon asked: what is the "entropy of the English language"? S = {a, b, c, ..., ',', ':', ...}. P(a) = relative frequency of 'a' in a large corpus, P(b) = ..., and so on; this gives the Pi's. E(English) = Σ Pi log2(1/Pi) = 4.08.
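
A small Python sketch of this kind of estimate; the text variable here merely stands in for a large English corpus, so the printed value will not match the 4.08 figure.

```python
import math
from collections import Counter

def char_entropy(text):
    """Single-character entropy in bits per symbol, from relative frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

text = "to be or not to be that is the question"   # stand-in for a large corpus
print(round(char_entropy(text), 2))
```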

The maximum entropy for tossing a coin is 1.0, reached when the coin is unbiased. Interest of the reader/listener: a novel is more interesting than a scientific paper for some people.

Origin of Information Theory: Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, 1948; Cover and Thomas, Elements of Information Theory, 1991.

Summary. A haphazard presentation of data is not acceptable to the mind. Focusing attention on an attribute automatically leads to information gain. We defined entropy and, in parallel, information gain, and related these to message communication.