CS 4100 Artificial Intelligence Prof. C. Hafner Class Notes March 27, 2012

Bayes net: another example
What are the conditional independence assumptions embodied in this model?
[Network diagram: Ulcer → Infection → Fever; Ulcer → Stomach Ache]

Bayes net: another example
What are the conditional independence assumptions embodied in this model? (And how are they useful?)
– Fever is conditionally independent of Ulcer and Stomach Ache, given Infection.
– Stomach Ache is conditionally independent of Infection and Fever, given Ulcer.
[Network diagram: Ulcer → Infection → Fever; Ulcer → Stomach Ache]

Bayes net: another example (continued)
[Network diagram: Ulcer → Infection → Fever; Ulcer → Stomach Ache]
P(Ulcer | Fever) = α P(Fever | Ulcer) P(Ulcer)
P(Ulcer | Fever) = α [P(Fever, Inf, SA | Ulc) + P(Fever, ~Inf, SA | Ulc) + P(Fever, Inf, ~SA | Ulc) + P(Fever, ~Inf, ~SA | Ulc)] P(Ulc)
Simplifications (from the conditional independence assumptions):
P(Fever, Inf, SA | Ulc) = P(Fever | Inf) P(Inf | Ulc) P(SA | Ulc)
P(Fever, ~Inf, SA | Ulc) = P(Fever | ~Inf) P(~Inf | Ulc) P(SA | Ulc)
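The same computation can be written as a short enumeration loop. Below is a minimal Python sketch, assuming the network structure shown above; the numeric probabilities are made-up placeholders, since the notes do not give numbers.

```python
# Inference by enumeration for P(Ulcer | Fever) in the network
# Ulcer -> Infection -> Fever, Ulcer -> Stomach Ache.
# The probability values below are illustrative placeholders, not from the notes.

P_ULCER = 0.05                                  # P(Ulcer = true)
P_INF_GIVEN_ULC = {True: 0.6, False: 0.1}       # P(Infection = true | Ulcer)
P_FEVER_GIVEN_INF = {True: 0.8, False: 0.05}    # P(Fever = true | Infection)
P_SA_GIVEN_ULC = {True: 0.7, False: 0.2}        # P(StomachAche = true | Ulcer)

def p_fever_and_ulcer(ulcer):
    """P(Fever = true, Ulcer = ulcer), summing out Infection and StomachAche."""
    prior = P_ULCER if ulcer else 1 - P_ULCER
    total = 0.0
    for inf in (True, False):
        p_inf = P_INF_GIVEN_ULC[ulcer] if inf else 1 - P_INF_GIVEN_ULC[ulcer]
        for sa in (True, False):
            p_sa = P_SA_GIVEN_ULC[ulcer] if sa else 1 - P_SA_GIVEN_ULC[ulcer]
            total += P_FEVER_GIVEN_INF[inf] * p_inf * p_sa
    return prior * total

# alpha normalizes over both values of Ulcer
numerator = p_fever_and_ulcer(True)
denominator = numerator + p_fever_and_ulcer(False)
print("P(Ulcer | Fever) =", numerator / denominator)
```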

Test your understanding: design a Bayes net with plausible numbers

Information Theory
Information is about categories and classification. We measure the quantity of information by the resources needed to represent/store/transmit the information.
Messages are sequences of 0's and 1's (or dots/dashes), which we call "bits" (for binary digits).
You need to send a message containing the identity of a spy:
– It is known to be Mr. Brown or Mr. Smith.
– You can send the message with 1 bit; therefore the event "the spy is Smith" has 1 bit of information.

Calculating quantity of information
Def: A uniform distribution over a set of possible outcomes (X1 ... Xn) means the outcomes are equally probable; that is, each has probability 1/n.
Suppose there are 8 people who could be the spy. Then the message requires 3 bits. If there are 64 possible spies the message requires 6 bits, etc. (assuming a uniform distribution).
Def: The information quantity of a message whose (uniform) probability is p:
I = -log p bits (all logs here are base 2)

Intuition and Examples
Intuitively, the more "surprising" a message is, the more information it contains. If there are 64 equally probable spies, we are more surprised by the identity of the spy than if there are only two equally probable spies.
There are 26 letters in the alphabet. Assuming they are equally probable, how much information is in each letter?
I = -log(1/26) = log 26 = 4.7 bits
Now assume the digits 0 to 9 are equally probable. Will the information in each digit be more or less than the information in each letter?
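A quick way to check these numbers (a throwaway sketch, not part of the notes), using I = -log2(1/n) for n equally probable outcomes:

```python
import math

def info_bits(n):
    """Information (in bits) in identifying one of n equally probable outcomes."""
    return -math.log2(1 / n)   # equivalently, math.log2(n)

print(info_bits(2))    # 1.0 bit   (two possible spies)
print(info_bits(8))    # 3.0 bits  (eight possible spies)
print(info_bits(26))   # ~4.70 bits per letter
print(info_bits(10))   # ~3.32 bits per digit -- less than a letter
```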

Sequences of messages
Things get interesting when we look beyond a single message to a long sequence of messages.
Consider a 4-sided die, with symbols A, B, C, D:
– Let 00 = A, 01 = B, 10 = C, 11 = D
– Each message is 2 bits. If you throw the die 800 times, you get a message 1600 bits long.
That's the best you can do if A, B, C, D are equally probable.

Non-uniform distributions (cont.)
Consider a 4-sided die, with symbols A, B, C, D:
– But assume P(A) = 7/8 and P(B) = P(C) = P(D) = 1/24 (so the probabilities sum to 1).
– We can take advantage of that with a different code: 0 = A, 10 = B, 110 = C, 111 = D
– If we throw the die 800 times, what is the expected length of the message? What is the entropy?
ENTROPY is the average information (in bits) of events in a long repeated sequence.
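A short sketch of the arithmetic behind these questions, assuming (as corrected above) that B, C, and D each have probability 1/24 and that the variable-length code is 0, 10, 110, 111:

```python
import math

probs   = {'A': 7/8, 'B': 1/24, 'C': 1/24, 'D': 1/24}   # assumed distribution
lengths = {'A': 1,   'B': 2,    'C': 3,    'D': 3}      # code 0, 10, 110, 111

bits_per_throw = sum(probs[s] * lengths[s] for s in probs)   # expected code length
entropy = -sum(p * math.log2(p) for p in probs.values())     # average information

print("Expected message length for 800 throws:", 800 * bits_per_throw)   # ~967 bits
print("Entropy:", entropy, "bits per throw")                             # ~0.74 bits
```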

Entropy
Formula for entropy with outcomes x1 ... xn:
H = - Σ P(xi) * log P(xi) bits
For a uniform distribution this is the same as -log P(x1), since all the P(xi) are the same.
What does it mean? Consider a 6-sided die, outcomes equally probable:
-log(1/6) = 2.58 tells us a long sequence of die throws can be transmitted using 2.58 bits per throw on average, and this is the theoretical best.
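A small helper function for this formula (a sketch, not from the notes):

```python
import math

def entropy(probabilities):
    """H = -sum of p * log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1/6] * 6))     # ~2.585 bits: fair 6-sided die
print(entropy([0.5, 0.5]))    # 1.0 bit: fair coin
print(entropy([0.75, 0.25]))  # ~0.811 bits: the biased coin used later in these notes
```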

Review/Explain Entropy
Let the possible outcomes be x1 ... xn, with probabilities p1 ... pn that add up to 1.
Ex: an unfair coin where n = 2, x1 = H (3/4), x2 = T (1/4).
In a long sequence of events E = e1 ... ek, we assume that outcome xi will occur k * pi times, etc.
E = HHTHTHHHTTHHHHHTHHHHTTHTHTHHHH ...
If k = 10000, we can assume H occurs 7500 times and T occurs 2500 times.
Note the concept of TYPES vs. TOKENS: there are two types (H and T) and k = 10000 tokens in this scenario.

Review/Explain Entropy
The entropy of E, H(E), is the average information of the events in the sequence e1 ... ek:
H(E) = (1/k) * Σ_{j=1..k} I(ej)
     = (1/k) * Σ_{i=1..n} I(xi) * (k * pi)   [switching from a sum over events to a sum over outcomes]
     = (k/k) * Σ_{i=1..n} I(xi) * pi
     = Σ_{i=1..n} -log(pi) * pi bits

Review/Explain Entropy
Entropy is sometimes called "disorder" – it represents the lack of predictability of the outcome for any element of a sequence (or set).
If a set has just one outcome, entropy = 1 * -log(1) = 0.
If there are 2 outcomes, then a 50/50 probability gives the maximum entropy – complete unpredictability. This generalizes to any uniform distribution over n outcomes.
- (0.5 * log(.5) + 0.5 * log(.5)) = 1 bit
Note: log(1/2) = -log(2) = -1

Calculating Entropy
Consider a biased coin: P(heads) = ¾; P(tails) = ¼. What is the entropy of a coin toss outcome?
H = ¼ * -log(1/4) + ¾ * -log(3/4) = 0.811 bits
Using the Information Theory Log Table:
H = 0.25 * 2 + 0.75 * 0.415 = 0.5 + 0.311 = 0.811
A fair coin toss has more "information" (1 bit). The more unbalanced the probabilities, the more predictable the outcome, and the less you learn from each message.

[Figure: plot of H (entropy in bits), ranging 0 to 1, against the probability of x1, ranging 0 to 1, for a set containing 2 possible outcomes (x1, x2); maximum disorder occurs at probability ½.]
What if there are 3 possible outcomes? For the equal-probability case: H = -log(1/3) = about 1.58 bits.

Define classification tree and ID3 algorithm
Def: Given a table with one result attribute and several designated predictor attributes, a classification tree for that table is a tree such that:
– Each leaf node is labeled with a value of the result attribute
– Each non-leaf node is labeled with the name of a predictor attribute
– Each link is labeled with one value of the parent's predictor
Def: the ID3 algorithm takes a table as input and "learns" a classification tree that efficiently maps predictor value sets into their results from the table.

A trivial example of a classification tree

Record#  Color   Shape   Fruit
1        red     round   apple
2        yellow  round   lemon
3        yellow  oblong  banana

Classification tree:
Color
– red → apple
– yellow → Shape
    – round → lemon
    – oblong → banana

The goal is to create an "efficient" classification tree which always gives the same answer as the table.

A well-known "toy" example: sunburn data
Predictor attributes: hair, height, weight, lotion

Name   Hair    Height   Weight   Lotion  Sunburned
Sarah  Blonde  Average  Light    No      Yes
Dana   Blonde  Tall     Average  Yes     No
Alex   Brown   Short    Average  Yes     No
Annie  Blonde  Short    Average  No      Yes
Emily  Red     Average  Heavy    No      Yes
Pete   Brown   Tall     Heavy    No      No
John   Brown   Average  Heavy    No      No
Katie  Blonde  Short    Light    Yes     No

Decision tree for the sunburn data:
Hair
– Blonde → Lotion
    – Y → Not Sunburned
    – N → Sunburned
– Red → Sunburned
– Brown → Not Sunburned

Outline of the algorithm
1. Create the root, and make its COLLECTION the entire table.
2. Select any non-singular leaf node N to SPLIT:
   1. Choose the best attribute A for splitting N (use info theory).
   2. For each value of A (a1, a2, ...) create a child of N, N_ai.
   3. Label the links from N to its children: "A = ai".
   4. SPLIT the collection of N among its children according to their values of A.
3. When no more non-singular leaf nodes exist, the tree is finished.
4. Def: a singular node is one whose COLLECTION includes just one value for the result attribute (therefore its entropy = 0).
(A Python sketch of this outline follows below.)
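The following sketch follows this outline, under the assumption that the table is represented as a list of dicts and the result attribute is passed by name (these representation choices are mine, not the notes').

```python
import math
from collections import Counter

def entropy(rows, result):
    """Entropy of the result attribute over a collection of rows."""
    counts = Counter(row[result] for row in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, result):
    """Parent entropy minus the weighted entropy of the children of a split on attr."""
    n = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        child = [row for row in rows if row[attr] == value]
        remainder += len(child) / n * entropy(child, result)
    return entropy(rows, result) - remainder

def id3(rows, attributes, result):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    labels = set(row[result] for row in rows)
    if len(labels) == 1 or not attributes:      # singular node (or nothing left to split on)
        return labels.pop()
    best = max(attributes, key=lambda a: info_gain(rows, a, result))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        child = [row for row in rows if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(child, remaining, result)
    return tree
```

On the sunburn table above, id3(rows, ['Hair', 'Height', 'Weight', 'Lotion'], 'Sunburned') should reproduce the tree shown earlier: split first on Hair, then on Lotion under Blonde.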

Choosing the best attribute to SPLIT: the one that is MOST INFORMATIVE, i.e., the one that reduces the entropy (DISORDER) the most.
Assume there are k attributes we can choose from. For each one, we compute how much less entropy exists in the resulting children than we had in the parent:
IG = H(N) – weighted sum of H(children of N)
Each child's entropy is weighted by the "probability" of that child (estimated by the proportion of the parent's collection that would be transferred to the child in the split).

S1: _______    C(S1) = {S,D,X,A,E,P,J,K}(3,5)/.954
Calculate entropy: -[3/8 log 3/8 + 5/8 log 5/8] = .531 + .424 = .954
Find information gain (IG) for all 4 predictors: hair, height, weight, lotion.

Start with lotion: values (yes, no)
Child 1 (yes) = {D,X,K}(0,3)/0
Child 2 (no) = {S,A,E,P,J}(3,2)/ -[3/5 log 3/5 + 2/5 log 2/5] = .971
Child set entropy = 3/8 * 0 + 5/8 * .971 = .607
IG(Lotion) = .954 – .607 = .347

Then try hair color: values (blond, brown, red)
Child 1 (blond) = {S,D,A,K}(2,2)/1
Child 2 (brown) = {X,P,J}(0,3)/0
Child 3 (red) = {E}(1,0)/0
Child set entropy = 4/8 * 1 + 3/8 * 0 + 1/8 * 0 = 0.5
IG(Hair color) = .954 – 0.5 = .454
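These gains can be checked with a few lines of Python (a sketch working directly from the yes/no counts above):

```python
import math

def H(yes, no):
    """Entropy of a collection with the given (sunburned, not sunburned) counts."""
    total = yes + no
    return -sum(c / total * math.log2(c / total) for c in (yes, no) if c > 0)

root = H(3, 5)                                                      # ~0.954
ig_lotion = root - (3/8 * H(0, 3) + 5/8 * H(3, 2))                  # ~0.347
ig_hair   = root - (4/8 * H(2, 2) + 3/8 * H(0, 3) + 1/8 * H(1, 0))  # ~0.454
print(root, ig_lotion, ig_hair)
```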

Next try Height: values (average, tall, short)
Child 1 (average) = {S,E,J}(2,1)/ -[2/3 log 2/3 + 1/3 log 1/3] = 0.92
Child 2 (tall) = {D,P}(0,2)/0
Child 3 (short) = {X,A,K}(1,2)/0.92
Child set entropy = 3/8 * 0.92 + 2/8 * 0 + 3/8 * 0.92 = 0.69
IG(Height) = .954 – 0.69 = 0.26

Next try Weight ... IG(Weight) = .954 – 0.94 = 0.014

So Hair color wins. Draw the first split and assign the collections:
S1: Hair Color
– Red → Sunburned (yes)
– Brown → Not sunburned (no)
– Blond → S2: _______    C = {S,D,A,K}(2,2)/1

C(S2) = {S,D,A,K}(2,2)/1
Start with lotion: values (yes, no)
Child 1 (yes) = {D,K}(0,2)/0
Child 2 (no) = {S,A}(2,0)/0
Child set entropy = 0
IG(Lotion) = 1 – 0 = 1. No reason to go any further.

Final tree:
S1: Hair Color
– Red → Sunburned (yes)
– Brown → Not sunburned (no)
– Blond → S2: Lotion
    – Yes → Not sunburned (no)
    – No → Sunburned (yes)

Discuss assignment 5

Perceptrons and Neural Networks: Another Supervised Learning Approach

Perceptron Learning (Supervised)
Assign random weights (or set all to 0).
Cycle through the input data until the change < target.
Let α be the "learning coefficient".
For each input:
– If the perceptron gives the correct answer, do nothing.
– If the perceptron says yes when the answer should be no, decrease the weights on all units that "fired" by α.
– If the perceptron says no when the answer should be yes, increase the weights on all units that "fired" by α.
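A minimal Python sketch of this procedure. The function name, the 0/1 input encoding, and the stopping rule ("no weight changes in a full pass, or a maximum number of epochs") are my simplifying assumptions, not part of the notes.

```python
def train_perceptron(examples, n_inputs, alpha=0.1, max_epochs=100):
    """examples: list of (inputs, label) pairs with 0/1 inputs and a 0/1 label."""
    weights = [0.0] * n_inputs                 # start with all weights at 0
    threshold = 0.0
    for _ in range(max_epochs):
        changed = False
        for inputs, label in examples:
            fired = sum(w * x for w, x in zip(weights, inputs)) > threshold
            if fired and label == 0:           # said yes, should be no: decrease fired units
                weights = [w - alpha * x for w, x in zip(weights, inputs)]
                changed = True
            elif not fired and label == 1:     # said no, should be yes: increase fired units
                weights = [w + alpha * x for w, x in zip(weights, inputs)]
                changed = True
        if not changed:                        # a full pass with no corrections
            break
    return weights

# Example: learn the boolean OR function
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
print(train_perceptron(data, n_inputs=2))      # e.g. [0.1, 0.1]
```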