Machine Learning in Practice Lecture 5 Carolyn Penstein Rosé Language Technologies Institute / Human-Computer Interaction Institute

Plan for the Day Announcements  Assignment 3  Project update Quiz 2 Naïve Bayes

Decision Tables vs Decision Trees Decision Trees take a divide and conquer approach under an Open World Assumption  Only examine some attributes in particular contexts  Use the majority class within a context to eliminate the closed world requirement Decision Tables operate under a Closed World Assumption  Every case is enumerated  No generalization except by limiting the number of attributes The 1R algorithm produces the simplest possible Decision Table

Weights Versus Probabilities: A Historical Perspective Artificial intelligence is about separating declarative and procedural knowledge Algorithms can reason using knowledge in the form of rules  E.g., expert systems, some cognitive models This can be used for planning, diagnosing, inferring, etc.

Weights Versus Probabilities: A Historical Perspective But what about reasoning under uncertainty?  Incomplete knowledge  Errors  Knowledge with exceptions  A changing world

Rules with Confidence Values Will Carolyn eat the chocolate?  Positive evidence Carolyn usually eats what she likes. (.85) Carolyn likes chocolate. (.98)  Negative Evidence Carolyn doesn’t normally eat more than one dessert per day. (.75) Carolyn already drank hot chocolate. (.95) Hot chocolate is sort of like a dessert. (.5) How do you combine positive and negative evidence?

What is a probability? You have a notion of an event  Tossing a coin How many things can happen?  Heads, tails How likely are you to get heads on a random toss?  50% Probabilities give you a principled way of combining predictions  How likely are you to get heads twice in a row? 0.5 * 0.5 = 0.25

Statistical Modeling Basics Rule and tree based methods use contingencies between patterns of attribute values as a basis for decision making Statistical models treat attributes as independent pieces of evidence that the decision should go one way or another Most of the time in real data sets the values of the different attributes are not independent of each other

Statistical Modeling Pros and Cons Statistical modeling people argue that statistical models are more elegant than other types of learned models because of their formal properties  You can combine probabilities in a principled way You can also combine the “weights” that other approaches assign But it is more ad hoc

Statistical Modeling Pros and Cons Statistical approach depends on assumptions that are not in general true In practice statistical approaches don’t work better than “ad-hoc” methods

Statistical Modeling Basics Even without features you can make a prediction about a class based on prior probabilities  You would always predict the majority class

Statistical Modeling Basics Statistical approaches balance evidence from features with prior probabilities  Thousand feet view: Can I beat performance based on priors with performance including evidence from features?  On very skewed data sets it can be hard to beat your priors (evaluation done based on percent correct)

Basic Probability If you roll a pair of dice, what is the probability that you will get a 4 and a 5?

Basic Probability If you roll a pair of dice, what is the probability that you will get a 4 and a 5? 1/18

Basic Probability If you roll a pair of dice, what is the probability that you will get a 4 and a 5? 1/18 How did you figure it out?

Basic Probability If you roll a pair of dice, what is the probability that you will get a 4 and a 5? 1/18 How did you figure it out? How many ways can the dice land?

Basic Probability If you roll a pair of dice, what is the probability that you will get a 4 and a 5? 1/18 How did you figure it out? How many ways can the dice land? How many of these satisfy our constraints?

Basic Probability If you roll a pair of dice, what is the probability that you will get a 4 and a 5? 1/18 How did you figure it out? How many ways can the dice land? How many of these satisfy our constraints? Divide ways to satisfy constraints by number of things that can happen

Basic Probability What if you want the first die to be 5 and the second die to be 4?

Basic Probability What if you want the first die to be 5 and the second die to be 4? What if you know the first die landed on 5?
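A quick way to check these answers is to enumerate the 36 equally likely outcomes directly. The sketch below is not from the original slides; it is a minimal Python illustration of "divide the ways to satisfy the constraint by the number of things that can happen."

    from fractions import Fraction

    # All 36 equally likely ways a pair of dice can land.
    outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

    def prob(event):
        """Ways to satisfy the constraint divided by ways the dice can land."""
        favorable = [o for o in outcomes if event(o)]
        return Fraction(len(favorable), len(outcomes))

    # A 4 and a 5 in either order: 2/36 = 1/18
    print(prob(lambda o: set(o) == {4, 5}))

    # First die is 5 AND second die is 4: 1/36
    print(prob(lambda o: o == (5, 4)))

    # Second die is 4 GIVEN that the first die landed on 5:
    # restrict the sample space to outcomes where the first die is 5.
    given_first_is_5 = [o for o in outcomes if o[0] == 5]
    favorable = [o for o in given_first_is_5 if o[1] == 4]
    print(Fraction(len(favorable), len(given_first_is_5)))  # 1/6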

Computing Conditional Probabilities

What is the probability of high humidity?

Computing Conditional Probabilities What is the probability of high humidity? What is the probability of high humidity given that the temperature is cool?
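As a concrete illustration, the sketch below counts rows to answer both questions. It assumes the standard 14-instance weather dataset (outlook, temperature, humidity, windy, play), which is consistent with the 2/9, 3/9, and 9/14 counts used on the later slides but is not shown verbatim in this transcript.

    from fractions import Fraction

    # Assumed: the standard 14-instance weather dataset
    # (outlook, temperature, humidity, windy, play).
    data = [
        ("sunny", "hot", "high", False, "no"),
        ("sunny", "hot", "high", True, "no"),
        ("overcast", "hot", "high", False, "yes"),
        ("rainy", "mild", "high", False, "yes"),
        ("rainy", "cool", "normal", False, "yes"),
        ("rainy", "cool", "normal", True, "no"),
        ("overcast", "cool", "normal", True, "yes"),
        ("sunny", "mild", "high", False, "no"),
        ("sunny", "cool", "normal", False, "yes"),
        ("rainy", "mild", "normal", False, "yes"),
        ("sunny", "mild", "normal", True, "yes"),
        ("overcast", "mild", "high", True, "yes"),
        ("overcast", "hot", "normal", False, "yes"),
        ("rainy", "mild", "high", True, "no"),
    ]

    TEMP, HUMIDITY = 1, 2

    # P(humidity = high): matching rows over all rows.
    high = [row for row in data if row[HUMIDITY] == "high"]
    print(Fraction(len(high), len(data)))             # 7/14 = 1/2

    # P(humidity = high | temperature = cool): restrict to the cool rows first.
    cool = [row for row in data if row[TEMP] == "cool"]
    high_given_cool = [row for row in cool if row[HUMIDITY] == "high"]
    print(Fraction(len(high_given_cool), len(cool)))  # 0 out of 4 cool days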

How do we train a model?

For every value of every feature, store a count. How many times do you see Outlook = rainy?

How do we train a model? For every value of every feature, store a count. How many times do you see Outlook = rainy? What is P(Outlook = rainy)?

How do we train a model? We also need to know what evidence each value of every feature gives of each possible prediction (or how typical it would be for instances of that class) What is P(Outlook = rainy | Class = yes)?

How do we train a model? We also need to know what evidence each value of every feature gives of each possible prediction (or how typical it would be for instances of that class) What is P(Outlook = rainy | Class = yes)? Store counts on (class value, feature value) pairs How many times is Outlook = rainy when class = yes?

How do we train a model? We also need to know what evidence each value of every feature gives of each possible prediction (or how typical it would be for instances of that class) What is P(Outlook = rainy | Class = yes)? Store counts on (class value, feature value) pairs How many times is Outlook = rainy when class = yes? Likelihood that play = yes if Outlook = rainy = Count(yes & rainy)/ Count(yes) * Count(yes)/Count(yes or no)
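For example, assuming the same standard weather counts as in the earlier sketch (Outlook = rainy occurs 3 times among the 9 play = yes days), this likelihood works out to 3/9 * 9/14 = 3/14, or about 0.21.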

How do we train a model? Now try to compute likelihood play = yes for Outlook = overcast, Temperature = hot, Humidity = high, Windy = FALSE
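One possible worked answer, again assuming the standard weather counts (4 overcast, 2 hot, 3 high-humidity, and 6 non-windy days among the 9 yes days): the likelihood of yes is 4/9 * 2/9 * 3/9 * 6/9 * 9/14 ≈ 0.0141.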

Combinations of features? E.g., P(play = yes | Outlook = rainy & Temperature = hot) Multiply conditional probabilities for each predictor and prior probability of predicted class together before you normalize  P(play = yes | Outlook = rainy & Temperature = hot)  Likelihood of yes = Count(yes & rainy)/ Count(yes) * Count(yes & hot)/ Count(yes) * Count(yes)/Count(yes or no)  After you compute the likelihood of yes and likelihood of no, you will normalize to get probability of yes and probability of no
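The whole recipe on this slide (count, multiply the conditionals by the prior, then normalize) fits in a few lines of code. The following is a minimal sketch of an un-smoothed Naïve Bayes over nominal attributes, not code from the course; the data format (a feature dictionary plus a class label per instance) and all names are illustrative assumptions.

    from collections import Counter, defaultdict

    def train(instances):
        """instances: list of (feature_dict, class_value) pairs. Returns raw counts."""
        class_counts = Counter()
        # pair_counts[class][feature_name][feature_value] -> count
        pair_counts = defaultdict(lambda: defaultdict(Counter))
        for features, label in instances:
            class_counts[label] += 1
            for name, value in features.items():
                pair_counts[label][name][value] += 1
        return class_counts, pair_counts

    def predict(features, class_counts, pair_counts):
        total = sum(class_counts.values())
        likelihoods = {}
        for label, n in class_counts.items():
            score = n / total                                  # prior: Count(class) / Count(all)
            for name, value in features.items():
                score *= pair_counts[label][name][value] / n   # Count(class & value) / Count(class)
            likelihoods[label] = score
        z = sum(likelihoods.values()) or 1.0                   # normalize likelihoods into probabilities
        return {label: score / z for label, score in likelihoods.items()}

The normalization step at the end is what turns the raw likelihood of yes and likelihood of no into the probability of yes and probability of no described on the slide.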

Unknown Values Not a problem for Naïve Bayes Probabilities computed using only the specified values Likelihood that play = yes when Outlook = sunny, Temperature = cool, Humidity = high, Windy = true  2/9 * 3/9 * 3/9 * 3/9 * 9/14  If Outlook is unknown, 3/9 * 3/9 * 3/9 * 9/14 Likelihoods will be higher when there are unknown values  Factored out during normalization
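Plugging in the numbers: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 ≈ 0.0053, while the version with Outlook left out is 3/9 * 3/9 * 3/9 * 9/14 ≈ 0.0238. The second raw likelihood is larger, but since the same factor is missing from the likelihood of no, the difference disappears when the two are normalized.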

Numeric Values List the values of the numeric feature separately for each class value  Values for play = yes: 83, 70, 68, 64, 69, 75, 75, 72, 81 Compute mean and standard deviation  Values for play = yes: 83, 70, 68, 64, 69, 75, 75, 72, 81; μ = 73, σ = 6.16  Values for play = no: 85, 80, 65, 72, 71; μ = 74.6, σ = 7.89 Compute likelihoods  f(x) = (1 / (sqrt(2π) σ)) e^( −(x − μ)² / (2σ²) ) Normalize using the proportion of the predicted class as before
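A small sketch of that density, using the means and standard deviations from this slide; the temperature value 66 is an arbitrary example value chosen here for illustration, not one from the slides.

    import math

    def normal_density(x, mu, sigma):
        """f(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))"""
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # Likelihood contribution of temperature = 66 (an assumed example value)
    # under each class, with the slide's means and standard deviations.
    print(normal_density(66, 73.0, 6.16))   # ~0.034 for play = yes
    print(normal_density(66, 74.6, 7.89))   # ~0.028 for play = no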

Bayes Theorem How would you compute the likelihood that a person was a bagpipe major given that they had red hair?

Bayes Theorem How would you compute the likelihood that a person was a bagpipe major given that they had red hair? Could you compute the likelihood that a person has red hair given that they were a bagpipe major?
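The rule being hinted at here is Bayes' theorem, which turns the second (easier) quantity into the first: P(bagpipe major | red hair) = P(red hair | bagpipe major) * P(bagpipe major) / P(red hair). The conditional on the right is easy to estimate from a roster of bagpipe majors, and combining it with the two priors gives the probability we actually want.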

Another Example
Attributes: ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}, cake {chocolate, vanilla}, is-yummy {yum, good, ok}
Training data (ice-cream, cake, is-yummy):
chocolate, chocolate, yum
vanilla, chocolate, good
coffee, chocolate, yum
coffee, vanilla, ok
rocky-road, chocolate, yum
strawberry, vanilla, yum
Compute conditional probabilities for each attribute value/class pair  P(B|A) = Count(B & A) / Count(A)  P(coffee ice-cream | yum) = .25  P(vanilla ice-cream | yum) = 0
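A quick check of these numbers in plain Python. Note that the sixth training row (strawberry ice cream, vanilla cake, yum) is not legible in this transcript; it is an assumption inferred from the probabilities quoted on this and the following slides.

    from fractions import Fraction

    # (ice-cream, cake, is-yummy); the strawberry row is an assumed reconstruction.
    data = [
        ("chocolate", "chocolate", "yum"),
        ("vanilla", "chocolate", "good"),
        ("coffee", "chocolate", "yum"),
        ("coffee", "vanilla", "ok"),
        ("rocky-road", "chocolate", "yum"),
        ("strawberry", "vanilla", "yum"),
    ]

    def cond_prob(column, value, label):
        """P(attribute = value | class = label) = Count(value & label) / Count(label)."""
        in_class = [row for row in data if row[2] == label]
        matching = [row for row in in_class if row[column] == value]
        return Fraction(len(matching), len(in_class))

    print(cond_prob(0, "coffee", "yum"))    # 1/4 = .25
    print(cond_prob(0, "vanilla", "yum"))   # 0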

Another Example (same ice-cream / cake training data as above) What class would you assign to strawberry ice cream with chocolate cake?  Compute likelihoods and then normalize  Note: this model cannot take into account that the class might depend on how well the cake and ice cream "go together"
Likelihood that the answer is yum: P(strawberry | yum) = .25, P(chocolate cake | yum) = .75, so .25 * .75 * .66 = .124
Likelihood that the answer is good: P(strawberry | good) = 0, P(chocolate cake | good) = 1, so 0 * 1 * .17 = 0
Likelihood that the answer is ok: P(strawberry | ok) = 0, P(chocolate cake | ok) = 0, so 0 * 0 * .17 = 0

Another Example (same ice-cream / cake training data as above) What about vanilla ice cream and vanilla cake? Intuitively, there is more evidence that the selected category should be good, yet every likelihood comes out 0.
Likelihood that the answer is yum: P(vanilla | yum) = 0, P(vanilla cake | yum) = .25, so 0 * .25 * .66 = 0
Likelihood that the answer is good: P(vanilla | good) = 1, P(vanilla cake | good) = 0, so 1 * 0 * .17 = 0
Likelihood that the answer is ok: P(vanilla | ok) = 0, P(vanilla cake | ok) = 1, so 0 * 1 * .17 = 0

Statistical Modeling with Small Datasets When you train your model, how many probabilities are you trying to estimate? This statistical modeling approach has problems with small datasets where not every class is observed in combination with every attribute value  What potential problem occurs when you never observe coffee ice-cream with class ok?  When is this not a problem?

Smoothing One way to compensate for 0 counts is to add 1 to every count Then you never have 0 probabilities But what might be the problem you still have on small data sets?
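A minimal sketch of that add-one (Laplace) idea over the ice-cream data above. The denominator grows by the number of possible values of the attribute; the counts of 5 ice-cream values and 2 cake values are assumptions read off the reconstructed attribute lists, and they reproduce the .11 and .33 figures on the next slide.

    from collections import Counter

    # Same six (ice-cream, cake, is-yummy) rows as in the earlier sketch.
    data = [
        ("chocolate", "chocolate", "yum"),
        ("vanilla", "chocolate", "good"),
        ("coffee", "chocolate", "yum"),
        ("coffee", "vanilla", "ok"),
        ("rocky-road", "chocolate", "yum"),
        ("strawberry", "vanilla", "yum"),
    ]

    NUM_VALUES = {0: 5, 1: 2}   # assumed: 5 distinct ice-cream values, 2 distinct cake values

    def smoothed_cond_prob(column, value, label):
        """(Count(value & label) + 1) / (Count(label) + number of possible values)."""
        in_class = [row for row in data if row[2] == label]
        count = sum(1 for row in in_class if row[column] == value)
        return (count + 1) / (len(in_class) + NUM_VALUES[column])

    print(round(smoothed_cond_prob(0, "vanilla", "yum"), 2))   # 0.11 instead of 0
    print(round(smoothed_cond_prob(1, "vanilla", "yum"), 2))   # 0.33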

Naïve Bayes with Smoothing (same ice-cream / cake training data as above, with 1 added to every count)
Likelihood that the answer is yum: P(vanilla | yum) = .11, P(vanilla cake | yum) = .33, so .11 * .33 * .66 ≈ .024
Likelihood that the answer is good: P(vanilla | good) = .33, P(vanilla cake | good) = .33, so .33 * .33 * .17 ≈ .019
Likelihood that the answer is ok: P(vanilla | ok) = .17, P(vanilla cake | ok) = .66, so .17 * .66 * .17 ≈ .019

Take Home Message Naïve Bayes is a simple form of statistical machine learning It's naïve in that it assumes all attributes are independent of one another given the class In the training process, counts are kept that indicate the connection between attribute values and predicted class values 0 counts interfere with making predictions, but smoothing can help address this difficulty