
Machine Learning in Practice Lecture 6


1 Machine Learning in Practice Lecture 6
Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute

2 Plan for the Day
Announcements
  Clarification on Naïve Bayes with missing information
  Feedback on Quiz and Assignment
Finish Naïve Bayes
Start Linear Models

3 Clarification on Unknown Values
Not a problem for Naïve Bayes: probabilities are computed using only the specified values.
Likelihood that play = yes when Outlook = sunny, Temperature = cool, Humidity = high, Windy = true: 2/9 * 3/9 * 3/9 * 3/9 * 9/14
If Outlook is unknown: 3/9 * 3/9 * 3/9 * 9/14
Likelihoods will be higher when there are unknown values.
But the effect is the same on the likelihoods of all possible outcomes, so it is factored out during normalization.
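For concreteness, here is a minimal Python sketch of that computation; the play = yes numbers are the ones above, while the play = no conditionals and prior are assumed to follow the standard weather-data counts:

```python
# Naive Bayes likelihoods for the weather data, with and without the Outlook value.
# play = yes conditionals and prior come from the slide; the play = no values are
# assumed to follow the standard weather-data counts.
p_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "windy=true": 3/9}
p_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "windy=true": 3/5}
prior_yes, prior_no = 9/14, 5/14

def likelihood(cond, prior, observed):
    # Multiply the prior by the conditionals of the observed attributes only;
    # an unknown attribute is simply left out of the product.
    result = prior
    for attr in observed:
        result *= cond[attr]
    return result

full_instance = ["outlook=sunny", "temp=cool", "humidity=high", "windy=true"]
outlook_unknown = ["temp=cool", "humidity=high", "windy=true"]

for observed in (full_instance, outlook_unknown):
    l_yes = likelihood(p_yes, prior_yes, observed)
    l_no = likelihood(p_no, prior_no, observed)
    # Both likelihoods are larger when a value is missing, but normalizing
    # cancels that common factor out.
    print(len(observed), "attributes observed -> P(yes) =", round(l_yes / (l_yes + l_no), 3))
```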

4 Quiz Notes Most people did great!
The most frequent issue was the last question, where we compared likelihoods and probabilities.
The main difference is scaling: the sum of the probabilities of all possible events should come out to 1.
That's what gives statistical models their nice formal properties.
Also, a comment about technical versus common usages of terms like likelihood, concept, etc.
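As a quick illustration of the scaling point (toy numbers, not from the quiz), dividing a set of likelihoods by their total turns them into probabilities that sum to 1:

```python
# Likelihoods are unnormalized scores; dividing by their total rescales them
# into probabilities that sum to 1 over all possible outcomes.
likelihoods = {"yes": 0.0053, "no": 0.0206}          # toy values, not from the quiz
total = sum(likelihoods.values())
probabilities = {outcome: score / total for outcome, score in likelihoods.items()}
print(probabilities, "sum =", sum(probabilities.values()))
```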

5 Assignment 2 Notes
What is the output from the machine learning model?
Carolyn says: the model. Kishore says: the prediction.
Some people had trouble identifying the impact of the noise.
There were different strategies for finding where the impact of the noise was (e.g., error analysis).
Some didn't notice the effect on which information was taken into account.

6 Finishing Naïve Bayes

7 Scenario
[Figure: a math story problem linked to Math Skill 1, Math Skill 2, and Math Skill 3]

8 Scenario
[Figure: a math story problem linked to Math Skill 1, Math Skill 2, and Math Skill 3]
Each problem may be associated with more than one skill

9 Scenario
[Figure: a math story problem linked to Math Skill 1, Math Skill 2, and Math Skill 3]
Each skill may be associated with more than one problem

10 How to address the problem?
In reality there is a many-to-many mapping between math problems and skills

11 How to address the problem?
In reality there is a many-to-many mapping between math problems and skills.
Ideally, we should be able to assign any subset of the full set of skills to any problem.
But can we do that accurately?

12 How to address the problem?
In reality there is a many-to-many mapping between math problems and skills.
Ideally, we should be able to assign any subset of the full set of skills to any problem.
But can we do that accurately?
If we can't do that, it may be good enough to assign the single most important skill.

13 How to address the problem?
In reality there is a many-to-many mapping between math problems and skills.
Ideally, we should be able to assign any subset of the full set of skills to any problem.
But can we do that accurately?
If we can't do that, it may be good enough to assign the single most important skill.
In that case, we will not accomplish the whole task.

14 How to address the problem?
But if we can do that part of the task more accurately, then we might accomplish more overall than if we try to achieve the more ambitious goal

15 Low resolution gives more information if the accuracy is higher
Remember this discussion from lecture 2?

16 Which of these approaches is better?
You have a corpus of math problem texts and you are trying to learn models that assign skill labels.
Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text.
Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels.

17 Approach 1
[Figure: a math story problem linked to Math Skill 1, Math Skill 2, and Math Skill 3]
Each skill corresponds to a separate binary predictor.
Each of the 91 binary predictors is applied to each text, so 91 separate predictions are made for each text.

18 Approach 2
[Figure: a math story problem linked to Math Skill 1, Math Skill 2, and Math Skill 3]
Each skill corresponds to a separate class value.
A single multi-class predictor is applied to each text, so only one prediction is made for each text.

19 Which of these approaches is better?
You have a corpus of math problem texts and you are trying to learn models that assign skill labels.
Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text.
Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels.
Approach one gives more power, but more opportunity for error.

20 Which of these approaches is better?
You have a corpus of math problem texts and you are trying to learn models that assign skill labels.
Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text.
Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels.
Approach two gives less power, but fewer opportunities for error.
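To make the contrast concrete, here is a hedged sketch of the two setups in scikit-learn (a library not used in this lecture; the texts and skill labels below are placeholders):

```python
# Sketch of the two architectures for assigning skill labels to math problem texts.
# `texts`, `skill_sets` (any subset of skills per text), and `single_skills`
# (one skill per text) are placeholder training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["Ann has 3 apples and buys 2 more ...", "A train leaves the station at 9:00 ..."]
skill_sets = [{"skill5", "skill12"}, {"skill3"}]   # many-to-many labeling
single_skills = ["skill5", "skill3"]               # single most important skill

# Approach one: one binary predictor per skill (one-vs-rest).
# Each predictor makes an independent yes/no decision for each text.
y_multilabel = MultiLabelBinarizer().fit_transform(skill_sets)
approach_one = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
approach_one.fit(texts, y_multilabel)

# Approach two: a single multi-class predictor that picks exactly one skill per text.
approach_two = make_pipeline(CountVectorizer(), MultinomialNB())
approach_two.fit(texts, single_skills)

print(approach_one.predict(["A bus leaves at 9:00 ..."]))   # one 0/1 decision per skill
print(approach_two.predict(["A bus leaves at 9:00 ..."]))   # a single label
```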

21 Approach 1: One versus all
Assume you have 40 example texts, and 4 of them have skill5 associated with them.
Assume you are using some form of smoothing: add one to every count, so 0 counts become 1.
Let's say WordX occurs with skill5 75% of the time and only once with any other class (it's the best predictor for skill5).
After smoothing, P(WordX|Skill5) = 2/3 and P(WordX|Majority) = 2/38.

22 Counts Without Smoothing
40 math problem texts; 4 of them are skill5.
WordX occurs with skill5 75% of the time and occurs only once with any other class (it's the best predictor for skill5).

                 WordX   WordY
Skill5             3       -
Majority Class     1       -

23 Counts With Smoothing
40 math problem texts; 4 of them are skill5.
WordX occurs with skill5 75% of the time and occurs only once with any other class (it's the best predictor for skill5).

                 WordX   WordY
Skill5             4       -
Majority Class     2       -

24 Approach 1
Assume you have 40 example texts, and 4 of them have skill5 associated with them.
Assume you are using some form of smoothing: add one to every count, so 0 counts become 1.
Let's say WordX occurs with skill5 75% of the time and only once with any other class (it's the best predictor for skill5).
After smoothing, P(WordX|Skill5) = 2/3 and P(WordX|Majority) = 2/38.
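Spelled out (a quick check, assuming WordX is treated as a binary present/absent feature, which is what makes the denominators 6 and 38 work out):

```python
# Add-one smoothing: (count + 1) / (number of texts in the class + 2),
# where the +2 covers the two possible feature values (WordX present / absent).
skill5_texts, majority_texts = 4, 36
wordx_with_skill5, wordx_with_majority = 3, 1

p_wordx_given_skill5 = (wordx_with_skill5 + 1) / (skill5_texts + 2)        # 4/6  = 2/3
p_wordx_given_majority = (wordx_with_majority + 1) / (majority_texts + 2)  # 2/38 ~ 0.05
print(p_wordx_given_skill5, p_wordx_given_majority)
```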

25 Approach 1
Let's say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor).
In reality, that is 6 counts of WordY with the majority class and 1 with Skill5.
With smoothing, we get 7 counts of WordY with the majority class and 2 with Skill5.
P(WordY|Skill5) = 1/3 and P(WordY|Majority) = 7/38.
Because you multiply the conditional probabilities and the prior probabilities together, it's nearly impossible to predict the minority class when the data is this skewed.
For "WordX WordY" you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority.
What would you predict without smoothing?

26 Counts Without Smoothing
40 math problem texts; 4 of them are skill5.
WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor).

                 WordX   WordY
Skill5             3       1
Majority Class     1       6

27 Counts With Smoothing
40 math problem texts; 4 of them are skill5.
WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor).

                 WordX   WordY
Skill5             4       2
Majority Class     2       7

28 Approach 1
Let's say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor).
In reality, that is 6 counts of WordY with the majority class and 1 with Skill5.
With smoothing, we get 7 counts of WordY with the majority class and 2 with Skill5.
P(WordY|Skill5) = 1/3 and P(WordY|Majority) = 7/38.
Because you multiply the conditional probabilities and the prior probabilities together, it's nearly impossible to predict the minority class when the data is this skewed.
For "WordX WordY" you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority.
What would you predict without smoothing?

29 Approach 1
Let's say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor).
In reality, that is 6 counts of WordY with the majority class and 1 with Skill5.
With smoothing, we get 7 counts of WordY with the majority class and 2 with Skill5.
P(WordY|Skill5) = 1/3 and P(WordY|Majority) = 7/38.
Because you multiply the conditional probabilities and the prior probabilities together, it's nearly impossible to predict the minority class when the data is this skewed.
For "WordX WordY" you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority.
What would you predict without smoothing?
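A small sketch of the whole comparison for the text "WordX WordY", using the counts from the tables above and the priors quoted on the slide (.04 for skill5, .96 for the majority class); running it with smoothing turned off also answers the closing question:

```python
# One-vs-all Naive Bayes score for "WordX WordY" against skill5 vs. the majority class.
counts = {                       # texts containing each word, per class
    "skill5":   {"WordX": 3, "WordY": 1, "n_texts": 4},
    "majority": {"WordX": 1, "WordY": 6, "n_texts": 36},
}
priors = {"skill5": 0.04, "majority": 0.96}   # priors as given on the slide

def score(label, words, smooth):
    s = priors[label]
    for w in words:
        # add-one smoothing: +1 to the count, +2 to the denominator (present/absent)
        num = counts[label][w] + (1 if smooth else 0)
        den = counts[label]["n_texts"] + (2 if smooth else 0)
        s *= num / den
    return s

for smooth in (True, False):
    scores = {label: score(label, ["WordX", "WordY"], smooth) for label in counts}
    winner = max(scores, key=scores.get)
    print("with smoothing" if smooth else "without smoothing", scores, "->", winner)
```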

30 Linear Models

31 Remember this: What do concepts look like?

32 Remember this: What do concepts look like?

33 Review: Concepts as Lines
[Figure: data points labeled B, S, T, C, and X plotted in a two-dimensional feature space]

34 Review: Concepts as Lines
[Figure: data points labeled B, S, T, C, and X plotted in a two-dimensional feature space]

35 Review: Concepts as Lines
[Figure: data points labeled B, S, T, C, and X plotted in a two-dimensional feature space]

36 Review: Concepts as Lines
[Figure: data points labeled B, S, T, C, and X plotted in a two-dimensional feature space]

37 Review: Concepts as Lines
[Figure: the same labeled data points with a new point marked X]
What will be the prediction for this new data point?

38 What are we learning?
We're learning to draw a line through a multidimensional space (really a "hyperplane").
Each function we learn is like a single split in a decision tree, but it can take many features into account at one time rather than just one.
F(X) = X0 + C1X1 + C2X2 + C3X3 (X0 here is the intercept, or bias, term)
X1-Xn are our attributes; C1-Cn are coefficients.
We're learning the coefficients, which are weights.
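As a toy illustration (the weights below are made up, not learned from any course data), evaluating such a function and thresholding it at 0 gives the two sides of the hyperplane:

```python
# A linear "concept": intercept plus a weighted sum of attribute values.
# The sign of F(x) says which side of the learned hyperplane an instance falls on.
def linear_score(x, coefficients, intercept):
    return intercept + sum(c * xi for c, xi in zip(coefficients, x))

coefficients, intercept = [0.8, -1.5, 0.3], -0.2   # made-up weights C1..C3 and intercept
x_new = [1.0, 0.4, 2.0]                            # attribute values X1..X3

prediction = "positive" if linear_score(x_new, coefficients, intercept) >= 0 else "negative"
print(prediction)   # 0.8 - 0.6 + 0.6 - 0.2 = 0.6, so "positive"
```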

39 Taking a Step Back
We started out with learning algorithms that learn symbolic rules, with the goal of achieving the highest accuracy: 0R, 1R, Decision Trees (J48).
Then we talked about statistical models that make decisions based on probability: Naïve Bayes.
Its "rules" look different; we just store counts.
There is no explicit focus on accuracy during learning.
What are the implications of the contrast between an accuracy focus and a probability focus?

40 Performing well with skewed class distributions
Naïve Bayes has trouble with skewed class distributions because of the contribution of prior probabilities (remember our math problem case).
Linear models can compensate for this: they don't have any notion of prior probability per se.
If a good split exists in the data, they will find it wherever it is.
This becomes a problem when there is no good split.
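A minimal sketch of the "finds the split wherever it is" point, using scikit-learn's perceptron as a stand-in linear learner on toy 1-D data (none of this is from the lecture): the class sizes are badly skewed, but because a clean split exists, a linear model with no prior term still separates the classes.

```python
# 95 majority examples below 2.0, 5 minority examples above 5.0: heavily skewed,
# but cleanly separable. A perceptron has no notion of prior probability, so the
# skew does not stop it from finding the split.
import numpy as np
from sklearn.linear_model import Perceptron

X = np.concatenate([np.linspace(0.0, 2.0, 95), np.linspace(5.0, 6.0, 5)]).reshape(-1, 1)
y = np.array([0] * 95 + [1] * 5)

model = Perceptron(max_iter=1000, tol=None).fit(X, y)
print("training accuracy:", model.score(X, y))   # 1.0: the rare class is fully recovered
```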

41 Skewed but clean separation

42 Skewed but clean separation

43 Skewed but no clean separation

44 Skewed but no clean separation

45 Taking a Step Back
The models we will look at now have rules composed of numbers, so they "look" more like Naïve Bayes than like Decision Trees.
But the numbers are obtained through a focus on achieving accuracy, so the learning process is more like Decision Trees.
Given these two properties, what can you say about the assumptions that are made about the form of the solution and about the world?

