
1 Machine Learning in Practice Lecture 2 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

2 Plan for the Day Any questions? Announcements:  First homework assigned Machine Learning process overview Learn how to use Weka Introduce assignment Introduction to Cross-Validation

3 Overview of Machine Learning Process Skills

4 Naïve Approach: When all you have is a hammer… Target Representation Data

5 Naïve Approach: When all you have is a hammer… Target Representation Problem: there isn’t one universally best approach!!!!! Data

6 Slightly less naïve approach: Aimless wandering… Target Representation Data

7 Slightly less naïve approach: Aimless wandering… Target Representation Problem 1: It takes too long!!! Data

8 Slightly less naïve approach: Aimless wandering… Target Representation Problem 2: You might not realize all of the options that are available to you! Data

9 Expert Approach: Hypothesis driven Target Representation Data

10 Expert Approach: Hypothesis driven Target Representation You might end up with the same solution in the end, but you’ll get there faster. Data

11 Expert Approach: Hypothesis driven Target Representation Today we’ll start to learn how! Data

12 Warm Up Exercise

13 Every combination of feature values is represented. Warm Up Exercise

14 Every combination of feature values is represented. What will happen if you try to predict HairColor from the other features? Warm Up Exercise

15 If you don’t have good features, even the most powerful algorithm won’t be able to learn an accurate prediction rule.

16 Every combination of feature values is represented. What will happen if you try to predict HairColor from the other features? Warm Up Exercise If you don’t have good features, even the most powerful algorithm won’t be able to learn an accurate prediction rule. But that doesn’t mean this data set is a hopeless case! For example, maybe the people who like red and have brown hair like a different shade of red than the ones who have blond hair. So ask yourself: what information might be hidden or implicit that might allow me to learn a rule?
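The point of the warm-up can be sketched in code. Below is a hypothetical toy dataset (not the one on the slide) in which every combination of feature values occurs once with every hair color; a per-feature rule like 1R then can do no better than the majority baseline, because each feature value is equally compatible with every class.

```python
from itertools import product
from collections import Counter

# Hypothetical toy data: every combination of EyeColor x Height appears
# exactly once with each HairColor, so no feature is informative about it.
eye_colors = ["blue", "brown"]
heights = ["short", "tall"]
hair_colors = ["blond", "brown", "red"]

rows = [{"EyeColor": e, "Height": h, "HairColor": hair}
        for e, h, hair in product(eye_colors, heights, hair_colors)]

def one_r_accuracy(rows, feature, target):
    """1R: for each value of `feature`, predict the majority `target`."""
    by_value = {}
    for r in rows:
        by_value.setdefault(r[feature], Counter())[r[target]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)

# Both features score 1/3 -- exactly the majority (0R) baseline,
# since each HairColor covers a third of the data.
for f in ["EyeColor", "Height"]:
    print(f, one_r_accuracy(rows, f, "HairColor"))
```

This is why the slide asks about hidden or implicit information: with these features as given, no algorithm can learn an accurate rule.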

17 Getting a bit more sophisticated…

18 Example Data Set

19 We’re going to consider a new algorithm

20 Example Data Set We’re going to consider a new algorithm We’re also going to consider data representation issues

21 More Complex Algorithm… Two simple algorithms last time  0R – Predict the majority class  1R – Use the most predictive single feature Today – Intro to Decision Trees  Today we will stay at a high level  We’ll investigate more details of the algorithm next time * Only makes 2 mistakes!

23 More Complex Algorithm… Two simple algorithms last time  0R – Predict the majority class  1R – Use the most predictive single feature Today – Intro to Decision Trees  Today we will stay at a high level  We’ll investigate more details of the algorithm next time * Only makes 2 mistakes! What will it do with this example?
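The two simple algorithms from last time can be sketched in a few lines. This illustration runs them on the standard Weka weather.nominal ("play tennis") data, which also appears later in the lecture; it is a sketch of the general idea, not the slide's exact example.

```python
from collections import Counter

# The classic weather.nominal data shipped with Weka:
# (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
features = ["outlook", "temperature", "humidity", "windy"]
labels = [row[-1] for row in data]

# 0R: always predict the majority class ("yes", 9 of 14).
zero_r = Counter(labels).most_common(1)[0][0]

# 1R: pick the single feature whose per-value majority rule errs least.
def one_r_errors(i):
    rules, errors = {}, 0
    for value in {row[i] for row in data}:
        counts = Counter(row[-1] for row in data if row[i] == value)
        rules[value] = counts.most_common(1)[0][0]
        errors += sum(counts.values()) - counts.most_common(1)[0][1]
    return rules, errors

best = min(range(len(features)), key=lambda i: one_r_errors(i)[1])
```

On this data 1R selects outlook (4 errors, tied with humidity), while a decision tree can go further by splitting again inside each outlook branch.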

27 Why is it better? Not because it is more complex  Sometimes more complexity makes performance worse What is different in what the three rule representations assume about your data?  0R  1R  Trees The best algorithm for your data will give you exactly the power you need

28 Why is it better? Not because it is more complex  Sometimes more complexity makes performance worse What is different in what the three rule representations assume about your data?  0R  1R  Trees The best algorithm for your data will give you exactly the power you need Let’s say you know the rule you are trying to learn is a circle and you have these points. What rule would you learn?

30 Why is it better? Not because it is more complex  Sometimes more complexity makes performance worse What is different in what the three rule representations assume about your data?  0R  1R  Trees The best algorithm for your data will give you exactly the power you need Let’s say you know the rule you are trying to learn is a circle and you have these points. What rule would you learn? Now let’s say you don’t know the shape – now what would you learn?

32 Why is it better? Not because it is more complex  Sometimes more complexity makes performance worse What is different in what the three rule representations assume about your data?  0R  1R  Trees The best algorithm for your data will give you exactly the power you need Let’s say you know the rule you are trying to learn is a circle and you have these points. What rule would you learn? Now let’s say you don’t know the shape – now what would you learn? If you know the shape, you have fewer degrees of freedom – less room to make a mistake.

37 Back to the Opinion Poll Data Set From http://www.swivel.com/ Example of the kind of data set you could use for your course project  Better to find a larger data set

38–44 (The same slide, with each column annotated in turn:) Who ran the opinion poll; when the poll was conducted; who the Democratic candidate would be; who the Republican candidate would be; who is running against whom; which party will win – this last one is what we want to predict

45 Do you see any redundant information?

46 Do you see any missing or hidden information?

47 How could you expand on what’s here? Add features that describe the source

48 How could you expand on what’s here? Add features that describe things that were going on during the time when the poll was taken

49 How could you expand on what’s here? Add features that describe personal characteristics of the candidates

50 What do you think would be the best rule?

51 What would Weka do with this data?

52 Using Weka Start Weka Open up the Explorer interface

53 Using Weka Click on Open File  Open OpinionPoll.csv from the Lectures folder You can save it as a .arff file

54 Using Weka Click on Open File  Open OpinionPoll.csv from the Lectures folder You can save it as a .arff file Summary stats for selected attributes are displayed

55 Using Weka Observe interaction between attributes by selecting on interface Select one attribute here Select another attribute here

56 Using Weka Observe interaction between attributes by selecting on interface Select one attribute here Select another attribute here Based on what you see, do you think the sources of the opinion polls were biased?

57 Using Weka Go to Classify Panel Select a classifier

58 Using Weka Select a classifier Select the predicted value

59 Using Weka Select a classifier Select the predicted value Start the evaluation

60 Using Weka Select a classifier Select the predicted value Start the evaluation Observe the results

61 Looking at the Results Percent correct Percent correct, controlling for correct by chance Performance on individual categories Confusion matrix * Right click in Result list and select Save Result Buffer to save performance stats.

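The statistics Weka reports can be computed directly from the confusion matrix. The sketch below uses a hypothetical two-class matrix (not the opinion-poll run); "percent correct, controlling for correct by chance" is the kappa statistic, where chance agreement is estimated from the row and column marginals.

```python
# Hypothetical confusion matrix: rows = actual class, columns = predicted.
confusion = [[40, 10],
             [5, 45]]

total = sum(sum(row) for row in confusion)

# Percent correct: the diagonal cells are the correctly classified examples.
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

# Expected chance agreement: for each class, the product of its
# row marginal (actual frequency) and column marginal (predicted frequency).
expected = sum(
    sum(confusion[i]) * sum(r[i] for r in confusion)
    for i in range(len(confusion))
) / total**2

# Kappa: how far above chance the observed agreement is.
kappa = (accuracy - expected) / (1 - expected)
print(accuracy, kappa)  # 0.85 and 0.7 for this matrix
```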

63 Notice the shape of the tree (although the text is too small to read!)

64 It’s making its decision based only on who the Republican candidate is.

65 Why did it do that?

66 Where will it make mistakes?

67 Notice the more complex rule we get if we force binary splits… and note that this more complex rule performs worse!

68 More representation issues… “Gyre” by Eric Rosé

69 Low resolution image gives some information

70 Higher resolution image gives more information

71 But not if the accuracy is bad

72 Question: When might that happen?

73 Low resolution gives more information if the accuracy is higher

74 Assignment 1

75 Make sure Weka is set up properly on your machine Know the basics of using Weka Information about you…

76 Information about You Learning goals Priority on learning activities Project goals Programming competence

77 Cross-Validation


80 If Outlook = sunny, no else if Outlook = overcast, yes else if Outlook = rainy and Windy = TRUE, no else yes Performance on training data?

81 If Outlook = sunny, no else if Outlook = overcast, yes else if Outlook = rainy and Windy = TRUE, no else yes Performance on training data? Performance on testing data?

82 If Outlook = sunny, no else if Outlook = overcast, yes else if Outlook = rainy and Windy = TRUE, no else yes IMPORTANT! If you evaluate the performance of your rule on the same data you trained on, you won’t get an accurate estimate of how well it will do on new data.
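The danger of evaluating on the training data can be illustrated with a deliberately overfit "classifier" that simply memorizes its training set. This is a synthetic sketch (random labels, not the weather data): training accuracy is perfect by construction, while held-out accuracy stays near chance.

```python
import random

random.seed(0)
# Hypothetical data: the label is pure noise, so no rule can truly beat 50%.
data = [(i, random.choice(["yes", "no"])) for i in range(200)]
train, test = data[:100], data[100:]

# A "classifier" that memorizes the training set and falls back to "yes".
memory = dict(train)
def predict(x):
    return memory.get(x, "yes")

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)

# train_acc is 1.0 by construction; test_acc hovers around chance.
# Training-set performance is an inflated estimate of real performance.
print(train_acc, test_acc)
```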

83 What is cross-validation?

84 Notice that cross-validation is for testing only! Not for building the rule!

85 But then….. If we are satisfied with the performance estimate we get Then we build the model with the WHOLE SET Now let’s see how it works…

86 But then….. If we are satisfied with the performance estimate we get Then we build the model with the WHOLE SET Now let’s see how it works… If you are not satisfied with the performance you get, then you should try to determine what went wrong, and then evaluate a different model that compensates.

87 Simple Cross Validation Let’s say your data has attributes A, B, and C You want to train a rule to predict D First train on 2, 3, 4, 5, 6, 7 and apply the trained model to 1 The result is Accuracy1 1 2 3 4 5 6 7 TEST TRAIN Fold: 1

88 Simple Cross Validation Let’s say your data has attributes A, B, and C You want to train a rule to predict D Next train on 1, 3, 4, 5, 6, 7 and apply the trained model to 2 The result is Accuracy2 1 2 3 4 5 6 7 TRAIN TEST Fold: 2

89 Simple Cross Validation Let’s say your data has attributes A, B, and C You want to train a rule to predict D Next train on 1, 2, 4, 5, 6, 7 and apply the trained model to 3 The result is Accuracy3 1 2 3 4 5 6 7 TRAIN TEST TRAIN Fold: 3

90 Simple Cross Validation Let’s say your data has attributes A, B, and C You want to train a rule to predict D Next train on 1, 2, 3, 5, 6, 7 and apply the trained model to 4 The result is Accuracy4 1 2 3 4 5 6 7 TRAIN TEST TRAIN Fold: 4

91 Simple Cross Validation Let’s say your data has attributes A, B, and C You want to train a rule to predict D Next train on 1, 2, 3, 4, 6, 7 and apply the trained model to 5 The result is Accuracy5 1 2 3 4 5 6 7 TRAIN TEST TRAIN Fold: 5

92 Simple Cross Validation Let’s say your data has attributes A, B, and C You want to train a rule to predict D Next train on 1, 2, 3, 4, 5, 7 and apply the trained model to 6 The result is Accuracy6 1 2 3 4 5 6 7 TRAIN TEST TRAIN Fold: 6

93 Simple Cross Validation Let’s say your data has attributes A, B, and C You want to train a rule to predict D Finally train on 1, 2, 3, 4, 5, 6 and apply the trained model to 7 The result is Accuracy7 Finally: Average Accuracy1 through Accuracy7 1 2 3 4 5 6 7 TRAIN TEST TRAIN Fold: 7
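The fold-by-fold procedure above can be sketched as a short function. The learner and helpers here are illustrative (a 0R-style majority learner on a tiny hypothetical dataset), and with seven examples and seven folds this reduces to leave-one-out.

```python
from collections import Counter

def cross_validate(examples, k, train_fn, test_fn):
    """Average accuracy across k folds. This is for estimating performance
    only -- the final model is built afterwards on the whole set."""
    folds = [examples[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        # Train on every fold except fold i, then test on fold i.
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = train_fn(train)
        accuracies.append(test_fn(model, test))
    return sum(accuracies) / k

# Toy use with a 0R-style learner on (feature, label) pairs:
examples = [("a", "yes"), ("b", "yes"), ("c", "no"), ("d", "yes"),
            ("e", "no"), ("f", "yes"), ("g", "yes")]
majority = lambda train: Counter(y for _, y in train).most_common(1)[0][0]
accuracy = lambda model, test: sum(model == y for _, y in test) / len(test)

estimate = cross_validate(examples, 7, majority, accuracy)
print(estimate)  # 5/7: the majority rule is right on "yes" folds only
```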

94 Remember! If we are satisfied with the performance estimate we get using cross-validation Then we build the model with the WHOLE SET We don’t use cross-validation to build the model

95 Why do we do cross validation? Use cross-validation when you do not have enough data to have completely independent train and test sets We are trying to estimate what performance you would get if you trained over your whole set and applied that model to an independent set of the same size We compute that estimate by averaging over folds

96 Do we have to do all of the folds? Yes! The test set on each fold is too small to give you an accurate estimate of performance alone Variation across folds Evaluation over part of the data is likely to be misleading

97 Why do we do cross validation? Makes the most of your data – large portion used for training Avoids testing on training data  Testing on training data will overestimate your performance!!! But if you do multiple iterations of cross-validation, in some ways you are using insights from your testing data in building your model

98 Questions about cross-validation from in-person students… How do you decide how many folds? How is data divided between folds? Don’t you need to have a hold-out set to be totally sure you have a good estimate of performance?

99 Other questions from in-person students… Do our class projects have to be classification problems per se?  Clustering of pen stroke data Will we learn to work with time series data in this course?

100 Questions?

