1 Data Mining I Karl Young Center for Imaging of Neurodegenerative Diseases, UCSF
2 The “Issues”
Data Explosion Problem
–Automated data collection tools + widely used database systems + computerized society + Internet lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, WWW, and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data Warehousing and Data Mining
–Data warehousing and on-line analytical processing (OLAP)
–Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
3 Data Warehousing + Data Mining (one of many schematic views)
Efficient and robust data storage and retrieval
Efficient and robust data summary and visualization
Contributing fields: database technology, statistics, computer science, high-performance computing, machine learning, visualization, …
4 Machine learning and statistics
Historical difference (grossly oversimplified):
–Statistics: testing hypotheses
–Machine learning: finding the right hypothesis
But: huge overlap
–Decision trees (C4.5 and CART)
–Nearest-neighbor methods
Today: perspectives have converged
–Most ML algorithms employ statistical techniques
5 Schematically
Data Cleaning → Data Integration → Data Warehouse → Task-relevant Data Selection → Data Mining → Pattern Evaluation
–Data warehouse: core of efficient data organization
–Data mining: core of the knowledge discovery process
8 Data mining
Needed: programs that detect patterns and regularities in the data
Strong patterns = good predictions
–Problem 1: most patterns are not interesting
–Problem 2: patterns may be inexact (or spurious)
–Problem 3: data may be garbled or missing
Want to learn a “concept”, i.e. a rule or set of rules that characterizes observed patterns in the data
9 Types of Learning
Supervised – Classification
–Know classes for examples
–Induction rules
–Decision trees
–Bayesian classification: naïve, networks
Numeric Prediction
–Linear regression
–Neural nets
–Support vector machines
Unsupervised – Learn Natural Groupings
–Clustering: partitioning methods, hierarchical methods, density-based methods, model-based methods
Learn Association Rules – In Principle Learn All Attributes
10 Algorithms: The basic methods
Simplicity first: 1R
Use all attributes: Naïve Bayes
Decision trees: ID3
Covering algorithms: decision rules: PRISM
Association rules
Linear models
Instance-based learning
12 Simplicity first
Simple algorithms often work very well!
There are many kinds of simple structure, e.g.:
–One attribute does all the work
–All attributes contribute equally & independently
–A weighted linear combination might do
–Instance-based: use a few prototypes
–Use simple logical rules
Success of method depends on the domain
13 The weather problem (used for illustration)
Conditions for playing a certain game

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
…         …            …         …      …

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
14 Weather data with mixed attributes
Some attributes have numeric values

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
15 Inferring rudimentary rules
1R: learns a 1-level decision tree
–I.e., rules that all test one particular attribute
Basic version
–One branch for each value
–Each branch assigns most frequent class
–Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch
–Choose attribute with lowest error rate (assumes nominal attributes)
16 Pseudo-code for 1R
For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
Note: “missing” is treated as a separate attribute value
17 Evaluating the weather attributes

Attribute  Rules           Errors  Total errors
Outlook    Sunny → No      2/5     4/14
           Overcast → Yes  0/4
           Rainy → Yes     2/5
Temp       Hot → No*       2/4     5/14
           Mild → Yes      2/6
           Cool → Yes      1/4
Humidity   High → No       3/7     4/14
           Normal → Yes    1/7
Windy      False → Yes     2/8     5/14
           True → No*      3/6
* indicates a tie

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
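The 1R procedure can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the names `DATA`, `NAMES` and `one_r` are made up for this example, and ties between attributes are broken by taking the first attribute with the lowest error count, which is an assumption.

```python
from collections import Counter, defaultdict

# The 14-instance nominal weather data from the slide.
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]

def one_r(data, target):
    """Return (attribute, rule, errors) for the best single attribute."""
    best = None
    for attr in data[0]:
        if attr == target:
            continue
        # Class counts for every value of this attribute.
        counts = defaultdict(Counter)
        for row in data:
            counts[row[attr]][row[target]] += 1
        # Each value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors: instances outside the majority class of their branch.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

attribute, rule, errors = one_r(DATA, "play")
print(attribute, rule, f"{errors}/{len(DATA)}")
```

Outlook and humidity both make 4 errors out of 14; the sketch keeps outlook only because it is examined first.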
18 Dealing with numeric attributes
Discretize numeric attributes
Divide each attribute’s range into intervals
–Sort instances according to attribute’s values
–Place breakpoints where the class changes (the majority class)
–This minimizes the total error
Example: temperature from weather data

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …
19 Dealing with numeric attributes
Example: temperature from weather data (class values after sorting the instances by temperature; | marks a breakpoint)
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
20 The problem of overfitting
This procedure is very sensitive to noise
–One instance with an incorrect class label will probably produce a separate interval
Also: time stamp attribute will have zero errors
Simple solution: enforce minimum number of instances in majority class per interval
Example (with min = 3):
Before: Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
After:  Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
21 With overfitting avoidance
Resulting rule set:

Attribute    Rules                    Errors  Total errors
Outlook      Sunny → No               2/5     4/14
             Overcast → Yes           0/4
             Rainy → Yes              2/5
Temperature  ≤ 77.5 → Yes             3/10    5/14
             > 77.5 → No*             2/4
Humidity     ≤ 82.5 → Yes             1/7     3/14
             > 82.5 and ≤ 95.5 → No   2/6
             > 95.5 → Yes             0/1
Windy        False → Yes              2/8     5/14
             True → No*               3/6
22 Discussion of 1R
1R was described in a paper by Holte (1993)
–Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
–Minimum number of instances was set to 6 after some experimentation
–1R’s simple rules performed not much worse than much more complex decision trees
Simplicity first pays off!
Reference: Robert C. Holte, “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”, Computer Science Department, University of Ottawa
24 Statistical modeling
“Opposite” of 1R: use all the attributes
Two assumptions: attributes are
–equally important
–statistically independent (given the class value), i.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
Independence assumption is never correct!
But … this scheme works well in practice
26 Probabilities for weather data
Counts (top) and relative frequencies (bottom) per class:

Outlook   Yes  No   Temperature  Yes  No   Humidity  Yes  No   Windy  Yes  No   Play  Yes   No
Sunny     2    3    Hot          2    2    High      3    4    False  6    2          9     5
Overcast  4    0    Mild         4    2    Normal    6    1    True   3    3
Rainy     3    2    Cool         3    1

Sunny     2/9  3/5  Hot          2/9  2/5  High      3/9  4/5  False  6/9  2/5        9/14  5/14
Overcast  4/9  0/5  Mild         4/9  2/5  Normal    6/9  1/5  True   3/9  3/5
Rainy     3/9  2/5  Cool         3/9  1/5
27 Probabilities for weather data
A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
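The likelihood calculation for the new day can be checked with a short Python sketch. The helper name `likelihoods` and the `DATA` layout are assumptions made for this illustration; exact fractions are used so the intermediate values match the hand calculation.

```python
from collections import Counter
from fractions import Fraction

ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]

def likelihoods(data, target, instance):
    """Class prior times the per-attribute relative frequencies."""
    class_counts = Counter(row[target] for row in data)
    result = {}
    for cls, n_cls in class_counts.items():
        like = Fraction(n_cls, len(data))            # prior, e.g. 9/14
        for attr, value in instance.items():
            matches = sum(1 for r in data
                          if r[target] == cls and r[attr] == value)
            like *= Fraction(matches, n_cls)         # e.g. 2/9 for sunny given yes
        result[cls] = like
    return result

new_day = {"outlook": "sunny", "temperature": "cool",
           "humidity": "high", "windy": "true"}
like = likelihoods(DATA, "play", new_day)
total = sum(like.values())
posterior = {cls: float(value / total) for cls, value in like.items()}
print({cls: float(v) for cls, v in like.items()}, posterior)
```

An attribute whose value is unknown can simply be left out of the `instance` dictionary; the function then skips it in the product.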
28 Bayes’s rule
Probability of event H given evidence E:
$\Pr[H \mid E] = \dfrac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]}$
Prior probability of H: $\Pr[H]$
–Probability of event before evidence is seen
Posterior probability of H: $\Pr[H \mid E]$
–Probability of event after evidence is seen
Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England
29 Naïve Bayes for classification
Classification learning: what’s the probability of the class given an instance?
–Evidence E = instance
–Event H = class value for instance
Naïve assumption: evidence splits into parts (i.e. attributes) that are independent
$\Pr[H \mid E] = \dfrac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H] \cdots \Pr[E_n \mid H]\,\Pr[H]}{\Pr[E]}$
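30 Weather data example
Evidence E:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Probability of class “yes”:
$\Pr[\text{yes} \mid E] = \Pr[\text{Outlook} = \text{Sunny} \mid \text{yes}] \times \Pr[\text{Temp.} = \text{Cool} \mid \text{yes}] \times \Pr[\text{Humidity} = \text{High} \mid \text{yes}] \times \Pr[\text{Windy} = \text{True} \mid \text{yes}] \times \dfrac{\Pr[\text{yes}]}{\Pr[E]} = \dfrac{\frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14}}{\Pr[E]}$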
31 The “zero-frequency problem”
What if an attribute value doesn’t occur with every class value? (e.g. “Humidity = high” for class “yes”)
–Probability will be zero!
–A posteriori probability will also be zero! (No matter how likely the other values are!)
Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
Result: probabilities will never be zero! (also: stabilizes probability estimates)
32 Modified probability estimates
In some cases adding a constant different from 1 might be more appropriate
Example: attribute outlook for class yes, with a constant $\mu$ split equally over the three values
Sunny: $\dfrac{2 + \mu/3}{9 + \mu}$   Overcast: $\dfrac{4 + \mu/3}{9 + \mu}$   Rainy: $\dfrac{3 + \mu/3}{9 + \mu}$
Weights don’t need to be equal (but they must sum to 1)
Sunny: $\dfrac{2 + \mu p_1}{9 + \mu}$   Overcast: $\dfrac{4 + \mu p_2}{9 + \mu}$   Rainy: $\dfrac{3 + \mu p_3}{9 + \mu}$
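The smoothed estimates can be written as one small function. This is a sketch under the assumption that the constant (called `mu` here) is spread over the attribute values according to prior weights that sum to 1; the function name `smoothed_estimate` is hypothetical. The Laplace estimator is the special case where `mu` equals the number of attribute values and the weights are equal.

```python
from fractions import Fraction

def smoothed_estimate(count, class_total, mu, weight):
    """(count + mu * weight) / (class_total + mu)."""
    return Fraction(count) + mu * weight, Fraction(class_total) + mu

def probability(count, class_total, mu, weight):
    numerator, denominator = smoothed_estimate(count, class_total, mu, weight)
    return numerator / denominator

# Outlook for class "yes": counts 2, 4, 3 out of 9; Laplace = mu 3, weights 1/3.
third = Fraction(1, 3)
outlook_yes = [probability(c, 9, 3, third) for c in (2, 4, 3)]

# Outlook = overcast for class "no" has count 0 out of 5:
# the raw estimate 0/5 becomes (0 + 1) / (5 + 3).
overcast_no = probability(0, 5, 3, third)
print(outlook_yes, overcast_no)
```

Because the weights sum to 1, the smoothed estimates for one attribute and one class still sum to 1.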
33 Missing values
Training: instance is not included in frequency count for attribute value-class combination
Classification: attribute will be omitted from calculation
Example:

Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%
34 Numeric attributes
Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
The probability density function for the normal distribution is defined by two parameters:
–Sample mean: $\mu = \dfrac{1}{n} \sum_{i=1}^{n} x_i$
–Standard deviation: $\sigma = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}$
–Density function: $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
35 Statistics for weather data

Outlook   Yes  No   Temperature  Yes      No       Humidity  Yes      No       Windy  Yes  No   Play  Yes   No
Sunny     2    3                 64, 68,  65, 71,            65, 70,  70, 85,  False  6    2          9     5
Overcast  4    0                 69, 70,  72, 80,            70, 75,  90, 91,  True   3    3
Rainy     3    2                 72, …    85, …              80, …    95, …

Sunny     2/9  3/5               μ = 73   μ = 75             μ = 79   μ = 86   False  6/9  2/5        9/14  5/14
Overcast  4/9  0/5               σ = 6.2  σ = 7.9            σ = 10.2 σ = 9.7  True   3/9  3/5
Rainy     3/9  2/5

Example density value:
$f(\text{temperature} = 66 \mid \text{yes}) = \dfrac{1}{\sqrt{2\pi} \cdot 6.2}\, e^{-\frac{(66 - 73)^2}{2 \cdot 6.2^2}} = 0.0340$
36 Classifying a new day
A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    66     90        true   ?

Likelihood of “yes” = 2/9 × f(temperature = 66 | yes) × f(humidity = 90 | yes) × 3/9 × 9/14
Likelihood of “no” = 3/5 × f(temperature = 66 | no) × f(humidity = 90 | no) × 3/5 × 5/14
P(“yes”) = likelihood of “yes” / (likelihood of “yes” + likelihood of “no”) = 20.9%
P(“no”) = likelihood of “no” / (likelihood of “yes” + likelihood of “no”) = 79.1%
Missing values during training are not included in calculation of mean and standard deviation
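The numeric case can be sketched in Python as follows. The means and standard deviations are the rounded values from the statistics table (73/6.2, 75/7.9, 79/10.2, 86/9.7); because of that rounding the resulting posterior comes out slightly above 21% for “yes”, close to but not identical to the 20.9% quoted, which presumably rests on unrounded statistics. The function name `gaussian_pdf` is chosen for this example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# New day: outlook = sunny, temperature = 66, humidity = 90, windy = true.
like_yes = (2 / 9
            * gaussian_pdf(66, 73, 6.2)     # temperature given yes
            * gaussian_pdf(90, 79, 10.2)    # humidity given yes
            * 3 / 9 * 9 / 14)
like_no = (3 / 5
           * gaussian_pdf(66, 75, 7.9)      # temperature given no
           * gaussian_pdf(90, 86, 9.7)      # humidity given no
           * 3 / 5 * 5 / 14)

p_yes = like_yes / (like_yes + like_no)
p_no = like_no / (like_yes + like_no)
print(f"P(yes) = {p_yes:.3f}, P(no) = {p_no:.3f}")
```

Nominal attributes contribute relative frequencies and numeric attributes contribute density values; the normalization step is the same in both cases.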
37 Probability densities
Relationship between probability and density:
$\Pr\!\left[c - \frac{\varepsilon}{2} < x < c + \frac{\varepsilon}{2}\right] \approx \varepsilon \cdot f(c)$
But: this doesn’t change calculation of a posteriori probabilities because $\varepsilon$ cancels out
Exact relationship:
$\Pr[a \le x \le b] = \int_a^b f(t)\, dt$
38 Naïve Bayes: discussion
Naïve Bayes works surprisingly well (even if independence assumption is clearly violated)
Why? Because classification doesn’t require accurate probability estimates as long as maximum probability is assigned to correct class
However: adding too many redundant attributes will cause problems (e.g. identical attributes)
Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
40 Constructing decision trees
Strategy: top down, in recursive divide-and-conquer fashion
–First: select attribute for root node; create branch for each possible attribute value
–Then: split instances into subsets, one for each branch extending from the node
–Finally: repeat recursively for each branch, using only instances that reach the branch
Stop if all instances have the same class
41 Which attribute to select?
42 Criterion for attribute selection
Which is the best attribute?
–Want to get the smallest tree
–Heuristic: choose the attribute that produces the “purest” nodes
Popular impurity criterion: information gain
–Information gain increases with the average purity of the subsets
Strategy: choose attribute that gives greatest information gain
43 Computing information
Measure information in bits
–Given a probability distribution, the info required to predict an event is the distribution’s entropy
–Entropy gives the information required in bits (can involve fractions of bits!)
Recall, formula for entropy:
$\text{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$
44 Claude Shannon. Born: 30 April 1916. Died: 23 February 2001. “Father of information theory”
Claude Shannon, who has died aged 84, perhaps more than anyone laid the groundwork for today’s digital revolution. His exposition of information theory, stating that all information could be represented mathematically as a succession of noughts and ones, facilitated the digital manipulation of data without which today’s information society would be unthinkable.
Shannon’s master’s thesis, obtained in 1940 at MIT, demonstrated that problem solving could be achieved by manipulating the symbols 0 and 1 in a process that could be carried out automatically with electrical circuitry. That dissertation has been hailed as one of the most significant master’s theses of the 20th century. Eight years later, Shannon published another landmark paper, A Mathematical Theory of Communication, generally taken as his most important scientific contribution.
Shannon applied the same radical approach to cryptography research, in which he later became a consultant to the US government. Many of Shannon’s pioneering insights were developed before they could be applied in practical form. He was truly a remarkable man, yet unknown to most of the world.
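45 Example: attribute Outlook
Outlook = Sunny: $\text{info}([2,3]) = \text{entropy}(2/5, 3/5) = -\frac{2}{5}\log\frac{2}{5} - \frac{3}{5}\log\frac{3}{5} = 0.971$ bits
Outlook = Overcast: $\text{info}([4,0]) = \text{entropy}(1, 0) = -1\log 1 - 0\log 0 = 0$ bits
Note: $0 \log 0$ is normally undefined; it is taken to be 0 here.
Outlook = Rainy: $\text{info}([3,2]) = \text{entropy}(3/5, 2/5) = -\frac{3}{5}\log\frac{3}{5} - \frac{2}{5}\log\frac{2}{5} = 0.971$ bits
Expected information for attribute:
$\text{info}([2,3],[4,0],[3,2]) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.693$ bits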
46 Computing information gain
Information gain: information before splitting – information after splitting
gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2]) = 0.940 – 0.693 = 0.247 bits
Information gain for attributes from weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
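The gain figures can be reproduced with a short Python sketch. The function names `info` and `gain` mirror the notation on the slides but are otherwise choices made for this example; logarithms are base 2 so that results are in bits, and empty classes are skipped so that the 0 log 0 case contributes 0.

```python
import math
from collections import Counter, defaultdict

ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]

def info(counts):
    """Entropy in bits of a list of class counts, e.g. info([9, 5])."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(data, attr, target):
    """Information before splitting minus expected information after."""
    before = info(list(Counter(r[target] for r in data).values()))
    groups = defaultdict(Counter)
    for row in data:
        groups[row[attr]][row[target]] += 1
    after = sum(sum(g.values()) / len(data) * info(list(g.values()))
                for g in groups.values())
    return before - after

gains = {a: gain(DATA, a, "play") for a in NAMES if a != "play"}
for attr, value in gains.items():
    print(f"gain({attr}) = {value:.3f} bits")
```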
47 Continuing to split
Within the Outlook = Sunny branch (5 instances):
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
48 Final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when data can’t be split any further
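The recursive construction can be sketched as below. This is a simplified ID3-style learner for nominal attributes only, written for illustration: the tuple-based tree representation, the names `id3` and `classify`, and the tie-breaking (first attribute with the highest gain) are all assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict

ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(data, attr, target):
    before = info(list(Counter(r[target] for r in data).values()))
    groups = defaultdict(Counter)
    for row in data:
        groups[row[attr]][row[target]] += 1
    after = sum(sum(g.values()) / len(data) * info(list(g.values()))
                for g in groups.values())
    return before - after

def id3(data, attrs, target):
    """Return a class label (leaf) or (attribute, {value: subtree})."""
    classes = Counter(r[target] for r in data)
    # Stop when the node is pure or the data can't be split any further.
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(data, a, target))
    rest = [a for a in attrs if a != best]
    return (best, {v: id3([r for r in data if r[best] == v], rest, target)
                   for v in sorted({r[best] for r in data})})

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

tree = id3(DATA, NAMES[:-1], "play")
print(tree)
```

On the weather data the root test is outlook, the sunny branch tests humidity and the rainy branch tests windy, while the overcast branch is already pure.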
49 Wishlist for a purity measure
Properties we require from a purity measure:
–When node is pure, measure should be zero
–When impurity is maximal (i.e. all classes equally likely), measure should be maximal
–Measure should obey multistage property (i.e. decisions can be made in several stages)
Entropy is the only function that satisfies all three properties!
50 Properties of the entropy
The multistage property:
$\text{entropy}(p, q, r) = \text{entropy}(p, q + r) + (q + r) \cdot \text{entropy}\!\left(\dfrac{q}{q + r}, \dfrac{r}{q + r}\right)$
Simplification of computation, for class counts $a$, $b$, $c$ with $n = a + b + c$:
$\text{info}([a, b, c]) = \dfrac{-a \log a - b \log b - c \log c + n \log n}{n}$
Note: instead of maximizing info gain we could just minimize information
51 Highly-branching attributes
Problematic: attributes with a large number of values (extreme case: ID code)
Subsets are more likely to be pure if there is a large number of values
–Information gain is biased towards choosing attributes with a large number of values
–This may result in overfitting (selection of an attribute that is non-optimal for prediction)
Another problem: fragmentation
52 Weather data with ID code

ID code  Outlook   Temp.  Humidity  Windy  Play
A        Sunny     Hot    High      False  No
B        Sunny     Hot    High      True   No
C        Overcast  Hot    High      False  Yes
D        Rainy     Mild   High      False  Yes
E        Rainy     Cool   Normal    False  Yes
F        Rainy     Cool   Normal    True   No
G        Overcast  Cool   Normal    True   Yes
H        Sunny     Mild   High      False  No
I        Sunny     Cool   Normal    False  Yes
J        Rainy     Mild   Normal    False  Yes
K        Sunny     Mild   Normal    True   Yes
L        Overcast  Mild   High      True   Yes
M        Overcast  Hot    Normal    False  Yes
N        Rainy     Mild   High      True   No
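53 Tree stump for ID code attribute
Entropy of split (each of the 14 branches holds a single instance and is therefore pure):
$\text{info}(\text{ID code}) = \text{info}([0,1]) + \text{info}([0,1]) + \cdots + \text{info}([0,1]) = 0$ bits
Information gain is maximal for ID code (namely 0.940 bits)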
54 Gain ratio
Gain ratio: a modification of the information gain that reduces its bias
Gain ratio takes number and size of branches into account when choosing an attribute
–It corrects the information gain by taking the intrinsic information of a split into account
Intrinsic information: entropy of distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)
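55 Computing the gain ratio
Example: intrinsic information for ID code
$\text{info}([1, 1, \ldots, 1]) = 14 \times \left(-\frac{1}{14} \log \frac{1}{14}\right) = 3.807$ bits
Value of attribute decreases as intrinsic information gets larger
Definition of gain ratio:
$\text{gain\_ratio}(\text{Attribute}) = \dfrac{\text{gain}(\text{Attribute})}{\text{intrinsic\_info}(\text{Attribute})}$
Example:
$\text{gain\_ratio}(\text{ID code}) = \dfrac{0.940 \text{ bits}}{3.807 \text{ bits}} \approx 0.247$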
56 Gain ratios for weather data

             Outlook                Temperature            Humidity             Windy
Info         0.693                  0.911                  0.788                0.892
Gain         0.940 – 0.693 = 0.247  0.940 – 0.911 = 0.029  0.940 – 0.788 = 0.152  0.940 – 0.892 = 0.048
Split info   info([5,4,5]) = 1.577  info([4,6,4]) = 1.557  info([7,7]) = 1.000  info([8,6]) = 0.985
Gain ratio   0.247/1.577 = 0.157    0.029/1.557 = 0.019    0.152/1.000 = 0.152  0.048/0.985 = 0.049
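A small Python sketch of the gain ratio follows, assuming the definition gain divided by the entropy of the branch sizes. The function names and the synthetic `id` attribute (one distinct letter per instance) are choices made for this illustration; an attribute with a single value would give a split info of zero and is not handled here.

```python
import math
from collections import Counter, defaultdict

ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]
# Same data with an ID code attribute, one distinct value per instance.
DATA_ID = [dict(row, id=chr(ord("a") + i)) for i, row in enumerate(DATA)]

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(data, attr, target):
    groups = defaultdict(Counter)
    for row in data:
        groups[row[attr]][row[target]] += 1
    n = len(data)
    before = info(list(Counter(r[target] for r in data).values()))
    after = sum(sum(g.values()) / n * info(list(g.values()))
                for g in groups.values())
    # Intrinsic information: entropy of the branch sizes, e.g. info([5, 4, 5]).
    split_info = info([sum(g.values()) for g in groups.values()])
    return (before - after) / split_info

for attr in ["outlook", "temperature", "humidity", "windy"]:
    print(attr, round(gain_ratio(DATA, attr, "play"), 3))
print("id", round(gain_ratio(DATA_ID, "id", "play"), 3))
```

Working from unrounded intermediate values, the last digit can differ slightly from ratios computed by hand from the rounded gain and split info.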
57 More on the gain ratio
“Outlook” still comes out top
However: “ID code” has greater gain ratio
–Standard fix: ad hoc test to prevent splitting on that type of attribute
Problem with gain ratio: it may overcompensate
–May choose an attribute just because its intrinsic information is very low
–Standard fix: only consider attributes with greater than average information gain
58 Discussion
Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
–Gain ratio just one modification of this basic algorithm
–C4.5: deals with numeric attributes, missing values, noisy data
Similar approach: CART
There are many other attribute selection criteria! (But little difference in accuracy of result)
60 Covering algorithms
Convert decision tree into a rule set
–Straightforward, but rule set overly complex
–More effective conversions are not trivial
Instead, can generate rule set directly
–For each class in turn, find rule set that covers all instances in it (excluding instances not in the class)
Called a covering approach:
–At each stage a rule is identified that “covers” some of the instances
61 Example: generating a rule
Rule for class “a”, refined step by step:
If true then class = a
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a
Possible rule set for class “b”:
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b
Could add more rules, get “perfect” rule set
62 Rules vs. trees
Corresponding decision tree: (produces exactly the same predictions)
But: rule sets can be more perspicuous when decision trees suffer from replicated subtrees
Also: in multiclass situations, covering algorithm concentrates on one class at a time whereas decision tree learner takes all classes into account
63 Simple covering algorithm
Generates a rule by adding tests that maximize rule’s accuracy
Similar to situation in decision trees: problem of selecting an attribute to split on
–But: decision tree inducer maximizes overall purity
Each new test reduces rule’s coverage
64 Selecting a test
Goal: maximize accuracy
–t: total number of instances covered by rule
–p: positive examples of the class covered by rule
–t – p: number of errors made by rule
Select test that maximizes the ratio p/t
We are finished when p/t = 1 or the set of instances can’t be split any further
65 Rules vs. decision lists
PRISM with outer loop removed generates a decision list for one class
–Subsequent rules are designed for instances that are not covered by previous rules
–But: order doesn’t matter because all rules predict the same class
Outer loop considers all classes separately
–No order dependence implied
Problems: overlapping rules, default rule required
66 Pseudo-code for PRISM
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        Consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add A = v to R
    Remove the instances covered by R from E
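The pseudo-code translates fairly directly into Python. The following is a sketch for nominal attributes, not a reference implementation: the names `prism` and `covers` are made up here, rules are stored as (conditions, class) pairs, and remaining ties after comparing p/t and p are broken by attribute order and sorted value order, which is an assumption.

```python
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]

def covers(conds, row):
    """True if the row satisfies every condition A = v of the rule."""
    return all(row[a] == v for a, v in conds.items())

def prism(data, target):
    attrs = [a for a in data[0] if a != target]
    rules = []
    for cls in sorted({row[target] for row in data}):
        remaining = list(data)                       # E in the pseudo-code
        while any(row[target] == cls for row in remaining):
            conds, covered = {}, remaining
            # Grow the rule until it is perfect or attributes run out.
            while len(conds) < len(attrs) and any(r[target] != cls for r in covered):
                best = None
                for a in attrs:
                    if a in conds:
                        continue
                    for v in sorted({r[a] for r in covered}):
                        subset = [r for r in covered if r[a] == v]
                        p = sum(1 for r in subset if r[target] == cls)
                        score = (p / len(subset), p)  # accuracy p/t, then largest p
                        if best is None or score > best[0]:
                            best = (score, a, v, subset)
                _, a, v, covered = best
                conds[a] = v
            rules.append((conds, cls))
            remaining = [r for r in remaining if not covers(conds, r)]
    return rules

rules = prism(DATA, "play")
for conds, cls in rules:
    tests = " and ".join(f"{a} = {v}" for a, v in conds.items()) or "true"
    print(f"If {tests} then play = {cls}")
```

Because the weather data contains no identical instances with different classes, every rule produced here is perfect on the training data.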
67 Separate and conquer
Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
–First, identify a useful rule
–Then, separate out all the instances it covers
–Finally, “conquer” the remaining instances
Difference to divide-and-conquer methods:
–Subset covered by rule doesn’t need to be explored any further
69 Association rules
Association rules…
–… can predict any attribute and combinations of attributes
–… are not intended to be used together as a set
Problem: immense number of possible associations
–Output needs to be restricted to show only the most predictive associations: only those with high support and high confidence
71 Support and confidence of a rule
Support: number of instances predicted correctly
Confidence: number of correct predictions, as proportion of all instances the rule applies to
Example: 4 cool days with normal humidity
If temperature = cool then humidity = normal
Support = 4, confidence = 100%
Normally: minimum support and confidence pre-specified (e.g. 58 rules with support ≥ 2 and confidence ≥ 95% for weather data)
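Support and confidence of a single rule can be computed directly from their definitions. In this sketch the rule is given as two dictionaries (antecedent and consequent); the function name `support_confidence` and that representation are assumptions of the example.

```python
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]

def support_confidence(data, antecedent, consequent):
    """Return (support, confidence) of 'if antecedent then consequent'."""
    # Instances the rule applies to: all antecedent tests hold.
    applies = [r for r in data
               if all(r[a] == v for a, v in antecedent.items())]
    # Instances predicted correctly: the consequent holds as well.
    correct = [r for r in applies
               if all(r[a] == v for a, v in consequent.items())]
    if not applies:
        return 0, 0.0
    return len(correct), len(correct) / len(applies)

support, confidence = support_confidence(
    DATA, {"temperature": "cool"}, {"humidity": "normal"})
print(support, confidence)
```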
72 Interpreting association rules
Interpretation is not obvious:
If windy = false and play = no then outlook = sunny and humidity = high
is not the same as
If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high
However, it means that the following also holds:
If humidity = high and windy = false and play = no then outlook = sunny
73 Mining association rules
Naïve method for finding association rules:
–Use separate-and-conquer method
–Treat every possible combination of attribute values as a separate class
Two problems:
–Computational complexity
–Resulting number of rules (which would have to be pruned on the basis of support and confidence)
But: we can look for high support rules directly!
74 Item sets
Support: number of instances correctly covered by association rule
–The same as the number of instances covered by all tests in the rule (LHS and RHS!)
Item: one test/attribute-value pair
Item set: all items occurring in a rule
Goal: only rules that exceed pre-defined support
–Do it by finding all item sets with the given minimum support and generating rules from them!
75 Item Sets For Weather Data
Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
76 Item sets for weather data
One-item sets:
– Outlook = Sunny (5)
– Temperature = Cool (4)
– …
Two-item sets:
– Outlook = Sunny, Temperature = Hot (2)
– Outlook = Sunny, Humidity = High (3)
– …
Three-item sets:
– Outlook = Sunny, Temperature = Hot, Humidity = High (2)
– Outlook = Sunny, Humidity = High, Windy = False (2)
– …
Four-item sets:
– Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2)
– Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2)
– …
In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets (with minimum support of two)
77 Generating rules from an item set
Once all item sets with minimum support have been generated, we can turn them into rules
Example:
Humidity = Normal, Windy = False, Play = Yes (4)
Seven ($2^N - 1$) potential rules, each shown with its confidence:
If Humidity = Normal and Windy = False then Play = Yes (4/4)
If Humidity = Normal and Play = Yes then Windy = False (4/6)
If Windy = False and Play = Yes then Humidity = Normal (4/6)
If Humidity = Normal then Windy = False and Play = Yes (4/7)
If Windy = False then Humidity = Normal and Play = Yes (4/8)
If Play = Yes then Humidity = Normal and Windy = False (4/9)
If True then Humidity = Normal and Windy = False and Play = Yes (4/14)
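As a rough illustration of the step above, the following Python sketch enumerates every rule that can be formed from one item set and reports how many instances each rule covers and applies to. The weather table is copied from the slides; the names `INSTANCES`, `support`, and `rules_from_item_set` are illustrative choices, not part of the original material.

```python
from itertools import combinations

# Weather data from the slides: one tuple per instance.
ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Windy", "Play")
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]
# Each instance becomes a set of (attribute, value) items.
INSTANCES = [frozenset(zip(ATTRIBUTES, row)) for row in ROWS]


def support(items):
    """Number of instances that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for instance in INSTANCES if items <= instance)


def rules_from_item_set(item_set):
    """Yield (antecedent, consequent, covered, applies) for all 2^N - 1 rules."""
    item_set = tuple(item_set)
    covered = support(item_set)
    # Antecedents shrink from N - 1 items down to the empty ("True") antecedent.
    for size in range(len(item_set) - 1, -1, -1):
        for antecedent in combinations(item_set, size):
            consequent = tuple(i for i in item_set if i not in antecedent)
            yield antecedent, consequent, covered, support(antecedent)


item_set = (("Humidity", "Normal"), ("Windy", "False"), ("Play", "Yes"))
for ante, cons, covered, applies in rules_from_item_set(item_set):
    lhs = " and ".join(f"{a} = {v}" for a, v in ante) or "True"
    rhs = " and ".join(f"{a} = {v}" for a, v in cons)
    print(f"If {lhs} then {rhs}: {covered}/{applies}")
```

The confidence of each rule is `covered / applies`. The empty antecedent applies to all 14 instances of the weather data, which is why the last rule has the lowest confidence.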
78 Rules for weather data
Rules with support > 1 and confidence = 100%:
     Association rule                                  Sup.  Conf.
1    Humidity=Normal Windy=False ⇒ Play=Yes            4     100%
2    Temperature=Cool ⇒ Humidity=Normal                4     100%
3    Outlook=Overcast ⇒ Play=Yes                       4     100%
4    Temperature=Cool Play=Yes ⇒ Humidity=Normal       3     100%
…    …                                                 …     …
58   Outlook=Sunny Temperature=Hot ⇒ Humidity=High     2     100%
In total: 3 rules with support four, 5 with support three, 50 with support two
79 Example rules from the same set
Item set:
Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)
Resulting rules (all with 100% confidence):
Temperature = Cool, Windy = False ⇒ Humidity = Normal, Play = Yes
Temperature = Cool, Windy = False, Humidity = Normal ⇒ Play = Yes
Temperature = Cool, Windy = False, Play = Yes ⇒ Humidity = Normal
due to the following “frequent” item sets:
Temperature = Cool, Windy = False (2)
Temperature = Cool, Humidity = Normal, Windy = False (2)
Temperature = Cool, Windy = False, Play = Yes (2)
80 Generating item sets efficiently
How can we efficiently find all frequent item sets?
Finding one-item sets is easy
Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …
– If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
– In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
– Compute k-item sets by merging (k-1)-item sets
81 Example
Given: five three-item sets
(A B C), (A B D), (A C D), (A C E), (B C D)
Lexicographically ordered!
Candidate four-item sets:
(A B C D) OK because of (B C D)
(A C D E) Not OK because of (C D E)
Final check by counting instances in dataset!
(k–1)-item sets are stored in hash table
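The merge-and-check step in this example can be sketched in Python as follows. This is a minimal illustration under the assumption that item sets are represented as sorted tuples; the function name `candidate_item_sets` is made up for the sketch, and a Python `set` stands in for the hash table mentioned on the slide.

```python
from itertools import combinations


def candidate_item_sets(frequent):
    """Merge frequent (k-1)-item sets into candidate k-item sets.

    Two sets are merged when they agree on everything but their last item.
    A candidate is kept only if all of its (k-1)-item subsets are frequent.
    """
    frequent = sorted(tuple(sorted(s)) for s in frequent)
    known = set(frequent)  # stands in for the hash table of (k-1)-item sets
    k = len(frequent[0]) + 1
    candidates = []
    for i, first in enumerate(frequent):
        for second in frequent[i + 1:]:
            # Lexicographic order: once the prefix differs, no later set matches.
            if first[:-1] != second[:-1]:
                break
            merged = first + (second[-1],)
            if all(subset in known for subset in combinations(merged, k - 1)):
                candidates.append(merged)
    return candidates


three_item_sets = [
    ("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
    ("A", "C", "E"), ("B", "C", "D"),
]
print(candidate_item_sets(three_item_sets))
```

For the five three-item sets above, (A C D) and (A C E) merge into (A C D E), but that candidate is discarded because (C D E) is not among the frequent sets. Surviving candidates still need the final count against the dataset.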
82 Generating rules efficiently
We are looking for all high-confidence rules
– Support of antecedent obtained from hash table
– But: brute-force method is ($2^N - 1$)
Better way: building (c + 1)-consequent rules from c-consequent ones
– Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold
Resulting algorithm similar to procedure for large item sets
83 Example
1-consequent rules:
If Outlook = Sunny and Windy = False and Play = No then Humidity = High (2/2)
If Humidity = High and Windy = False and Play = No then Outlook = Sunny (2/2)
Corresponding 2-consequent rule:
If Windy = False and Play = No then Outlook = Sunny and Humidity = High (2/2)
Final check of antecedent against hash table!
84 Association rules: discussion
Above method makes one pass through the data for each different size of item set
– Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated
– Result: more (k+2)-item sets than necessary will be considered but fewer passes through the data
– Makes sense if data is too large for main memory
Practical issue: generating a certain number of rules (e.g. by incrementally reducing min. support)
85 Other issues
Standard ARFF format is very inefficient for typical market basket data
– Attributes represent items in a basket and most items are usually missing
– Need way of representing sparse data
Instances are also called transactions
Confidence is not necessarily the best measure
– Example: milk occurs in almost every supermarket transaction
– Other measures have been devised (e.g. lift)
86 Algorithms: The basic methods
Simplicity first: 1R
Use all attributes: Naïve Bayes
Decision trees: ID3
Covering algorithms: decision rules: PRISM
Association rules
Linear models
Instance-based learning
87 Linear models
Work most naturally with numeric attributes
Standard technique for numeric prediction: linear regression
– Outcome is a linear combination of attributes: $x = w_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k$
Weights are calculated from the training data
Predicted value for first training instance $a^{(1)}$ (with $a_0 = 1$ for every instance):
$$w_0 a_0^{(1)} + w_1 a_1^{(1)} + \dots + w_k a_k^{(1)} = \sum_{j=0}^{k} w_j a_j^{(1)}$$
88 Minimizing the squared error
Choose $k + 1$ coefficients to minimize the squared error on the training data
Squared error over $n$ training instances with target values $x^{(i)}$:
$$\sum_{i=1}^{n} \left( x^{(i)} - \sum_{j=0}^{k} w_j a_j^{(i)} \right)^2$$
Derive coefficients using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking)
Minimizing the absolute error is more difficult
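One way to carry out the "standard matrix operations" is to solve the normal equations. The sketch below does this in plain Python with Gaussian elimination; it is an illustration only, the function names are made up for it, and the tiny training set is invented so that the exact weights are known in advance.

```python
def fit_linear_regression(instances, targets):
    """Return weights [w0, w1, ..., wk] that minimize the squared training error."""
    # Prepend the constant attribute a0 = 1 to every instance.
    rows = [[1.0] + [float(v) for v in inst] for inst in instances]
    m = len(rows[0])
    # Normal equations (A^T A) w = A^T x, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(m)]
        + [sum(r[i] * x for r, x in zip(rows, targets))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few or collinear instances")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def predict(weights, instance):
    """Linear combination w0 + w1*a1 + ... + wk*ak."""
    return weights[0] + sum(w * a for w, a in zip(weights[1:], instance))


# Invented data generated from x = 1 + 2*a1 - 3*a2 (no noise).
instances = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [1, 3, -2, 0, 2]
weights = fit_linear_regression(instances, targets)
print(weights)
print(predict(weights, (3, 2)))
```

The `ValueError` branch corresponds to the slide's caveat: with too few instances (or linearly dependent attributes) the system has no unique solution.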
89 Classification
Any regression technique can be used for classification
– Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don’t
– Prediction: predict class corresponding to model with largest output value (membership value)
For linear regression this is known as multi-response linear regression
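90 Theoretical justification
For a model $f$, an instance $x$, and an observed target value $Y$ (either 0 or 1):
$$E_Y\left[(f(X) - Y)^2 \mid X = x\right] = \left(f(x) - P(Y = 1 \mid X = x)\right)^2 + E_Y\left[(P(Y = 1 \mid X = x) - Y)^2 \mid X = x\right]$$
– $P(Y = 1 \mid X = x)$ is the true class probability
– The second term on the right is a constant (it does not depend on the model)
– We want to minimize $\left(f(x) - P(Y = 1 \mid X = x)\right)^2$
– The scheme minimizes the left-hand side, which differs from it only by that constant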
91 Pairwise regression
Another way of using regression for classification:
– A regression function for every pair of classes, using only instances from these two classes
– Assign output of +1 to one member of the pair, –1 to the other
Prediction is done by voting
– Class that receives most votes is predicted
– Alternative: “don’t know” if there is no agreement
More likely to be accurate but more expensive
92 Logistic regression
Problem: some assumptions are violated when linear regression is applied to classification problems
Logistic regression: alternative to linear regression
– Designed for classification problems
– Tries to estimate class probabilities directly
– Does this using the maximum likelihood method
– Uses this linear model, where $P$ is the class probability:
$$\log\left(\frac{P}{1 - P}\right) = w_0 a_0 + w_1 a_1 + \dots + w_k a_k$$
93 Discussion of linear models
Not appropriate if data exhibits non-linear dependencies
But: can serve as building blocks for more complex schemes (e.g. model trees)
Example: multi-response linear regression defines a hyperplane for any two given classes:
$$\left(w_0^{(1)} - w_0^{(2)}\right) a_0 + \left(w_1^{(1)} - w_1^{(2)}\right) a_1 + \dots + \left(w_k^{(1)} - w_k^{(2)}\right) a_k = 0$$
95 Instance-based representation
Simplest form of learning: rote learning
– Training instances are searched for the instance that most closely resembles the new instance
– The instances themselves represent the knowledge
– Also called instance-based learning
Similarity function defines what’s “learned”
Instance-based learning is lazy learning
Methods:
– nearest-neighbor
– k-nearest-neighbor
– …
96 The distance function
Simplest case: one numeric attribute
– Distance is the difference between the two attribute values involved (or a function thereof)
Several numeric attributes: normally, Euclidean distance is used and attributes are normalized
Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
Are all attributes equally important?
– Weighting the attributes might be necessary
97 Instance-based learning
Distance function defines what’s learned
Most instance-based schemes use Euclidean distance:
$$\sqrt{\left(a_1^{(1)} - a_1^{(2)}\right)^2 + \left(a_2^{(1)} - a_2^{(2)}\right)^2 + \dots + \left(a_k^{(1)} - a_k^{(2)}\right)^2}$$
$a^{(1)}$ and $a^{(2)}$: two instances with $k$ attributes
Taking the square root is not required when comparing distances
Other popular metric: city-block metric
– Adds differences without squaring them
98 Normalization and other issues
Different attributes are measured on different scales, so they need to be normalized:
$$a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}$$
$v_i$: the actual value of attribute $i$
Nominal attributes: distance either 0 or 1
Common policy for missing values: assumed to be maximally distant (given normalized attributes)
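The distance rules above (min-max normalization for numeric attributes, 0/1 for nominal ones, maximal distance for missing values) can be combined into a small nearest-neighbor sketch. The helper names and the five training instances are illustrative assumptions, not taken from the slides; `None` is used here to mark a missing value.

```python
import math


def min_max_ranges(instances, numeric):
    """(min, max) of each numeric attribute index over the training data."""
    ranges = {}
    for i in numeric:
        values = [inst[i] for inst in instances if inst[i] is not None]
        ranges[i] = (min(values), max(values))
    return ranges


def distance(a, b, ranges):
    """Euclidean distance over normalized numeric and 0/1 nominal differences."""
    total = 0.0
    for i, (u, v) in enumerate(zip(a, b)):
        if u is None or v is None:
            diff = 1.0  # missing value: assumed maximally distant
        elif i in ranges:
            lo, hi = ranges[i]
            span = (hi - lo) or 1.0  # guard against constant attributes
            diff = (u - lo) / span - (v - lo) / span
        else:
            diff = 0.0 if u == v else 1.0
        total += diff * diff
    return math.sqrt(total)


def nearest_neighbor(query, instances, labels, ranges):
    """1-NN: return the label of the closest training instance."""
    best = min(range(len(instances)),
               key=lambda i: distance(query, instances[i], ranges))
    return labels[best]


# Illustrative instances: (temperature, humidity, windy).
train = [(85, 85, "False"), (80, 90, "True"), (83, 86, "False"),
         (70, 96, "False"), (68, 80, "False")]
labels = ["no", "no", "yes", "yes", "yes"]
ranges = min_max_ranges(train, numeric=(0, 1))
print(nearest_neighbor((69, 82, "False"), train, labels, ranges))
```

Because only the ordering of distances matters for the prediction, the `math.sqrt` call could be dropped inside `nearest_neighbor`, as the slide notes.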
99 Discussion of 1-NN
Often very accurate
… but slow:
– simple version scans entire training data to derive a prediction
Assumes all attributes are equally important
– Remedy: attribute selection or weights
Possible remedies against noisy instances:
– Take a majority vote over the k nearest neighbors
– Remove noisy instances from the dataset (difficult!)
Statisticians have used k-NN since the early 1950s
– If $n \to \infty$ and $k/n \to 0$, error approaches the minimum
100 Comments on basic methods
Bayes’ rule stems from his “Essay towards solving a problem in the doctrine of chances” (1763)
– Difficult bit: estimating prior probabilities
Extension of Naïve Bayes: Bayesian Networks
Algorithm for association rules is called APRIORI
Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. can’t learn XOR
– But: combinations of them can (→ neural nets)
101 Credibility: Evaluating what’s been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
102 Evaluation: the key to success
How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data
– Otherwise 1-NN would be the optimum classifier!
Simple solution that can be used if lots of (labeled) data is available:
– Split data into training and test set
However: (labeled) data is usually limited
– More sophisticated techniques need to be used
103 Issues in evaluation
Statistical reliability of estimated differences in performance (→ significance tests)
Choice of performance measure:
– Number of correct classifications
– Accuracy of probability estimates
– Error in numeric predictions
Costs assigned to different types of errors
– Many practical applications involve costs
105 Training and testing I
Natural performance measure for classification problems: error rate
– Success: instance’s class is predicted correctly
– Error: instance’s class is predicted incorrectly
– Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate obtained from training data
Resubstitution error is (hopelessly) optimistic!
106 Training and testing II
Test set: independent instances that have played no part in formation of classifier
– Assumption: both training data and test data are representative samples of the underlying problem
Test and training data may differ in nature
– Example: classifiers built using subject data with two different diagnoses A and B
– To estimate how a classifier built on subjects with diagnosis A performs on subjects with diagnosis B, test it on data from subjects with diagnosis B
107 Note on parameter tuning
It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
– Stage 1: build the basic structure
– Stage 2: optimize parameter settings
The test data can’t be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data
– Validation data is used to optimize parameters
108 Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier
Generally, the larger the training data the better the classifier (but returns diminish)
The larger the test data the more accurate the error estimate
Holdout procedure: method of splitting original data into training and test set
– Dilemma: ideally both training set and test set should be large!
110 Predicting performance
Assume the estimated error rate is 25%. How close is this to the true error rate?
– Depends on the amount of test data
Prediction is just like tossing a (biased!) coin
– “Head” is a “success”, “tail” is an “error”
In statistics, a succession of independent events like this is called a Bernoulli process
– Statistical theory provides us with confidence intervals for the true underlying proportion
111 Confidence intervals
We can say: $p$ lies within a certain specified interval with a certain specified confidence
Example: S = 750 successes in N = 1000 trials
– Estimated success rate: 75%
– How close is this to the true success rate $p$?
– Answer: with 80% confidence $p \in [73.2\%, 76.7\%]$
Another example: S = 75 and N = 100
– Estimated success rate: 75%
– With 80% confidence $p \in [69.1\%, 80.1\%]$
112 Mean and variance
Mean and variance for a Bernoulli trial: $p$, $p(1 - p)$
Expected success rate $f = S/N$
Mean and variance for $f$: $p$, $p(1 - p)/N$
For large enough $N$, $f$ follows a Normal distribution
The c% confidence interval $[-z \le X \le z]$ for a random variable with 0 mean is given by:
$$\Pr[-z \le X \le z] = c$$
With a symmetric distribution:
$$\Pr[-z \le X \le z] = 1 - 2 \cdot \Pr[X \ge z]$$
113 Confidence limits
Confidence limits for the normal distribution with 0 mean and a variance of 1:
Pr[X ≥ z]   z
0.1%        3.09
0.5%        2.58
1%          2.33
5%          1.65
10%         1.28
20%         0.84
40%         0.25
Thus:
$$\Pr[-1.65 \le X \le 1.65] = 90\%$$
To use this we have to reduce our random variable $f$ to have 0 mean and unit variance
114 Transforming f
Transformed value for $f$ (i.e. subtract the mean and divide by the standard deviation):
$$\frac{f - p}{\sqrt{p(1 - p)/N}}$$
Resulting equation:
$$\Pr\left[-z \le \frac{f - p}{\sqrt{p(1 - p)/N}} \le z\right] = c$$
Solving for $p$:
$$p = \left( f + \frac{z^2}{2N} \pm z \sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}} \right) \Bigg/ \left( 1 + \frac{z^2}{N} \right)$$
115 Examples
$f$ = 75%, $N$ = 1000, $c$ = 80% (so that $z$ = 1.28): $p \in [0.732, 0.767]$
$f$ = 75%, $N$ = 100, $c$ = 80% (so that $z$ = 1.28): $p \in [0.691, 0.801]$
Note that the normal distribution assumption is only valid for large $N$ (i.e. $N$ > 100)
$f$ = 75%, $N$ = 10, $c$ = 80% (so that $z$ = 1.28): $p \in [0.549, 0.881]$ (should be taken with a grain of salt)
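The interval for the true success rate can be evaluated directly from the observed rate f, the number of trials N, and the confidence limit z. The sketch below is a minimal illustration of that calculation; the function name `success_rate_interval` is an assumption made for the example.

```python
import math


def success_rate_interval(f, n, z):
    """Confidence interval for the true success rate p.

    f: observed success rate, n: number of trials,
    z: confidence limit of the standard normal distribution.
    """
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator


# z = 1.28 corresponds to 80% confidence (10% in each tail).
for n in (1000, 100, 10):
    low, high = success_rate_interval(0.75, n, 1.28)
    print(f"N = {n:4d}: p in [{low:.3f}, {high:.3f}]")
```

For N = 1000 and N = 100 this gives the intervals quoted earlier for 750/1000 and 75/100 successes. The interval widens sharply for N = 10, where the normal approximation is no longer trustworthy.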
117 Holdout estimation
What to do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training
– Usually: one third for testing, the rest for training
Problem: the samples might not be representative
– Example: class might be missing in the test data
Advanced version uses stratification
– Ensures that each class is represented with approximately equal proportions in both subsets
118 Repeated holdout method
Holdout estimate can be made more reliable by repeating the process with different subsamples
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap
– Can we prevent overlapping?
119 Cross-validation
Cross-validation avoids overlapping test sets
– First step: split data into k subsets of equal size
– Second step: use each subset in turn for testing, the remainder for training
Called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
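The two steps of k-fold cross-validation, with stratification, might be sketched as follows. This is one simple way to stratify (dealing each class out to the folds in turn), not necessarily the procedure used by any particular tool; `train_and_test` is a hypothetical callback that trains on the given indices and returns the error rate on the test indices.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign instance indices to k disjoint folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the instances of this class out to the folds in turn.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validate(labels, k, train_and_test, seed=0):
    """Use each fold once for testing and average the k error estimates."""
    folds = stratified_folds(labels, k, seed)
    errors = []
    for test_indices in folds:
        held_out = set(test_indices)
        train_indices = [i for i in range(len(labels)) if i not in held_out]
        errors.append(train_and_test(train_indices, test_indices))
    return sum(errors) / k


# Class distribution of the weather data: 9 "yes" and 5 "no".
labels = ["yes"] * 9 + ["no"] * 5
for fold in stratified_folds(labels, 3):
    print(sorted(fold), [labels[i] for i in sorted(fold)].count("yes"))
```

Every instance appears in exactly one test fold, so the test sets cannot overlap, in contrast to repeated holdout.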
120 More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten?
– Extensive experiments have shown that this is the best choice to get an accurate estimate
– There is also some theoretical evidence for this
Stratification reduces the estimate’s variance
Even better: repeated stratified cross-validation
– E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
121 Leave-One-Out cross-validation
Leave-One-Out: a particular form of cross-validation:
– Set number of folds to number of training instances
– I.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
– (exception: NN)
122 Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
– It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: random dataset split equally into two classes
– Best inducer predicts majority class
– 50% accuracy on fresh data
– Leave-One-Out-CV estimate is 100% error!
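The extreme example can be reproduced in a few lines. In this sketch the "inducer" simply predicts the majority class of whatever training data it sees; the dataset of 50 instances per class is an arbitrary illustrative choice.

```python
from collections import Counter


def majority_class(labels):
    """The 'inducer': always predict the most frequent training class."""
    return Counter(labels).most_common(1)[0][0]


def leave_one_out_error(labels):
    """Hold out each instance in turn and test the majority-class prediction."""
    errors = 0
    for i, actual in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        if majority_class(training) != actual:
            errors += 1
    return errors / len(labels)


labels = ["a"] * 50 + ["b"] * 50
print(leave_one_out_error(labels))
```

Removing one instance always makes the other class the majority in the remaining 99 instances, so the held-out instance is misclassified every time.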
123 The bootstrap
CV uses sampling without replacement
– The same instance, once selected, cannot be selected again for a particular training/test set
The bootstrap uses sampling with replacement to form the training set
– Sample a dataset of n instances n times with replacement to form a new dataset of n instances
– Use this data as the training set
– Use the instances from the original dataset that don’t occur in the new training set for testing
124 The 0.632 bootstrap
Also called the 0.632 bootstrap
– A particular instance has a probability of $1 - 1/n$ of not being picked in a single draw
– Thus its probability of ending up in the test data (never being picked in $n$ draws) is:
$$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} = 0.368$$
– This means the training data will contain approximately 63.2% of the instances
125 Estimating error with the bootstrap
The error estimate on the test data will be very pessimistic
– Trained on just ~63% of the instances
Therefore, combine it with the resubstitution error:
$$err = 0.632 \cdot e_{\text{test instances}} + 0.368 \cdot e_{\text{training instances}}$$
The resubstitution error gets less weight than the error on the test data
Repeat process several times with different replacement samples; average the results
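A minimal sketch of the sampling step and of the weighted error estimate is given below. The names `bootstrap_split` and `bootstrap_632` are illustrative, and the weights 0.632 and 0.368 are the usual rounding of $1 - e^{-1}$ and $e^{-1}$.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    picked = set(train)
    test = [i for i in range(n) if i not in picked]
    return train, test


def bootstrap_632(test_error, resubstitution_error):
    """Weighted combination of test error and resubstitution error."""
    return 0.632 * test_error + 0.368 * resubstitution_error


rng = random.Random(0)
train, test = bootstrap_split(1000, rng)
print(len(set(train)) / 1000)  # fraction of distinct instances used for training
print((1 - 1 / 1000) ** 1000)  # chance that a given instance is never picked
print(bootstrap_632(0.5, 0.0))
```

In practice the split and the estimate would be repeated with several different random samples and the results averaged.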
126 More on the bootstrap
Probably the best way of estimating performance for very small datasets
However, it has some problems
– Consider the random dataset from above
– A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data
– Bootstrap estimate for this classifier: $err = 0.632 \cdot 50\% + 0.368 \cdot 0\% = 31.6\%$
– True expected error: 50%
128 Comparing data mining schemes
Frequent question: which of two learning schemes performs better?
Note: this is domain dependent!
Obvious way: compare 10-fold CV estimates
Problem: variance in estimate
Variance can be reduced using repeated CV
However, we still don’t know whether the results are reliable
129 Significance tests
Significance tests tell us how confident we can be that there really is a difference
Null hypothesis: there is no “real” difference
Alternative hypothesis: there is a difference
A significance test measures how much evidence there is in favor of rejecting the null hypothesis
Let’s say we are using 10-fold CV
Question: do the two means of the 10 CV estimates differ significantly?
130 Paired t-test
Student’s t-test tells whether the means of two samples are significantly different
Take individual samples using cross-validation
Use a paired t-test because the individual samples are paired
– The same CV is applied twice
William Gosset. Born: 1876 in Canterbury; died: 1937 in Beaconsfield, England. Obtained a post as a chemist in the Guinness brewery in Dublin. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".
131 Student’s distribution
With small samples (k < 100) the mean follows Student’s distribution with k–1 degrees of freedom
Confidence limits:
Pr[X ≥ z]   z (9 degrees of freedom)   z (normal distribution)
0.1%        4.30                       3.09
0.5%        3.25                       2.58
1%          2.82                       2.33
5%          1.83                       1.65
10%         1.38                       1.28
20%         0.88                       0.84
132 Distribution of the means x 1 x 2 … x k and y 1 y 2 … y k are the 2k samples for a k- fold CV x 1 x 2 … x k and y 1 y 2 … y k are the 2k samples for a k- fold CV m x and m y are the means m x and m y are the means With enough samples, the mean of a set of independent samples is normally distributed With enough samples, the mean of a set of independent samples is normally distributed Estimated variances of the means are x 2 /k and y 2 /k Estimated variances of the means are x 2 /k and y 2 /k If x and y are the true means then are approximately normally distributed with mean 0, variance 1 If x and y are the true means then are approximately normally distributed with mean 0, variance 1
133 Distribution of the differences Let m d = m x – m y Let m d = m x – m y The difference of the means (m d ) also has a Student’s distribution with k–1 degrees of freedom The difference of the means (m d ) also has a Student’s distribution with k–1 degrees of freedom Let d 2 be the variance of the difference Let d 2 be the variance of the difference The standardized version of m d is called the t- statistic: The standardized version of m d is called the t- statistic: We use t to perform the t-test We use t to perform the t-test
134 Performing the test
Fix a significance level $\alpha$
If a difference is significant at the $\alpha$% level, there is a $(100 - \alpha)$% chance that there really is a difference
Divide the significance level by two because the test is two-tailed
–I.e. the true difference can be positive or negative
Look up the value for z that corresponds to $\alpha/2$
If $t \le -z$ or $t \ge z$ then the difference is significant
–I.e. the null hypothesis can be rejected
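
The paired test described above can be sketched in a few lines of Python using only the standard library. The function name `paired_t_statistic` and the per-fold success rates are hypothetical illustration values, and the limit z = 3.25 assumes k = 10 folds (9 degrees of freedom) with a 1% two-tailed significance level, i.e. 0.5% in each tail.

```python
import math
import statistics

def paired_t_statistic(x, y):
    """t-statistic for paired samples, e.g. per-fold CV results of two schemes."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    k = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]  # per-fold differences
    m_d = statistics.mean(d)               # mean difference m_d
    var_d = statistics.variance(d)         # sample variance of the differences
    return m_d / math.sqrt(var_d / k)

# Hypothetical success rates (%) of two schemes on the same 10 folds
scheme_a = [81.0, 79.5, 83.0, 80.0, 82.5, 78.0, 84.0, 80.5, 81.5, 79.0]
scheme_b = [78.0, 77.5, 80.0, 79.0, 79.5, 76.0, 81.0, 78.5, 79.5, 77.0]

t = paired_t_statistic(scheme_a, scheme_b)
z = 3.25  # assumed limit: 9 degrees of freedom, 0.5% in each tail
significant = t <= -z or t >= z
print(f"t = {t:.3f}, significant at the 1% level: {significant}")
```

Because the same folds are used for both schemes, only the per-fold differences enter the statistic.
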
135 Unpaired observations
If the CV estimates are from different randomizations, they are no longer paired
(or maybe we used k-fold CV for one scheme, and j-fold CV for the other one)
Then we have to use an unpaired t-test with min(k, j) – 1 degrees of freedom
The t-statistic becomes: $t = \frac{m_x - m_y}{\sqrt{\sigma_x^2/k + \sigma_y^2/j}}$
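
As a rough illustration of the unpaired case, the following Python sketch computes the t-statistic and the degrees of freedom for two samples of different sizes. The function name `unpaired_t_statistic` is hypothetical, and sample variances (k–1 and j–1 denominators) are assumed as the variance estimates.

```python
import math
import statistics

def unpaired_t_statistic(x, y):
    """Return (t, degrees of freedom) for two unpaired samples."""
    k, j = len(x), len(y)
    m_x, m_y = statistics.mean(x), statistics.mean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    t = (m_x - m_y) / math.sqrt(var_x / k + var_y / j)
    dof = min(k, j) - 1  # degrees of freedom as stated on the slide
    return t, dof
```
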
136 Interpreting the result
All our cross-validation estimates are based on the same dataset
Samples are not independent
Should really use a different dataset sample for each of the k estimates used in the test to judge performance across different training sets
Or use a heuristic test, e.g. the corrected resampled t-test
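
The slides name the corrected resampled t-test without giving its formula. One commonly cited form (an assumption here, not taken from the slides) replaces the factor 1/k in the variance of the mean difference with 1/k + n_test/n_train, which makes the test more conservative. The function name and arguments below are hypothetical.

```python
import math
import statistics

def corrected_resampled_t(differences, n_train, n_test):
    """Corrected resampled t-statistic (assumed form: 1/k replaced by 1/k + n_test/n_train)."""
    k = len(differences)
    m_d = statistics.mean(differences)
    var_d = statistics.variance(differences)
    return m_d / math.sqrt((1.0 / k + n_test / n_train) * var_d)
```

With the same differences, the corrected statistic is always smaller in magnitude than the standard paired t-statistic, so fewer differences are declared significant.
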
137 Credibility: Evaluating what’s been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
138 Predicting probabilities
Performance measure so far: success rate
Also called 0-1 loss function: $\sum_i \begin{cases} 0 & \text{if prediction is correct} \\ 1 & \text{if prediction is incorrect} \end{cases}$
Most classifiers produce class probabilities
Depending on the application, we might want to check the accuracy of the probability estimates
0-1 loss is not the right thing to use in those cases
139 Quadratic loss function
$p_1, \ldots, p_k$ are probability estimates for an instance
c is the index of the instance’s actual class
$a_1, \ldots, a_k = 0$, except for $a_c$ which is 1
Quadratic loss is: $\sum_j (p_j - a_j)^2$
Want to minimize $E\left[\sum_j (p_j - a_j)^2\right]$
Can show that this is minimized when $p_j = p_j^*$, the true probabilities
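
A minimal Python sketch of the quadratic loss for a single instance follows. The function name `quadratic_loss` and the example probabilities are hypothetical.

```python
def quadratic_loss(probs, actual_index):
    """Sum over classes of (p_j - a_j)^2, where a_j is 1 for the actual class and 0 otherwise."""
    return sum(
        (p - (1.0 if j == actual_index else 0.0)) ** 2
        for j, p in enumerate(probs)
    )

# Hypothetical three-class prediction where the actual class is class 0
loss = quadratic_loss([0.7, 0.2, 0.1], actual_index=0)
```

The worst case assigns probability 1 to a wrong class, which gives a loss of 2.
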
140 Informational loss function
The informational loss function is $-\log_2 p_c$, where c is the index of the instance’s actual class
Number of bits required to communicate the actual class
Let $p_1^*, \ldots, p_k^*$ be the true class probabilities
Then the expected value for the loss function is: $-p_1^* \log_2 p_1 - p_2^* \log_2 p_2 - \ldots - p_k^* \log_2 p_k$
Justification: minimized when $p_j = p_j^*$
Difficulty: zero-frequency problem
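
The informational loss can be sketched the same way. The function name `informational_loss` is hypothetical, and base-2 logarithms are assumed so that the result is in bits. Returning infinity when the actual class was given probability zero illustrates the zero-frequency problem.

```python
import math

def informational_loss(probs, actual_index):
    """-log2 of the probability assigned to the actual class, in bits."""
    p_c = probs[actual_index]
    if p_c == 0.0:
        return math.inf  # zero-frequency problem: the loss is unbounded
    return -math.log2(p_c)
```
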
141 Discussion
Which loss function to choose?
–Both encourage honesty
–Quadratic loss function takes into account all class probability estimates for an instance
–Informational loss focuses only on the probability estimate for the actual class
–Quadratic loss is bounded: it can never exceed 2
–Informational loss can be infinite
Informational loss is related to MDL principle [later]
143 Counting the cost
In practice, different types of classification errors often incur different costs
Examples:
–Disease diagnosis
–Terrorist profiling (“Not a terrorist” is correct 99.99% of the time)
–Loan decisions
–Oil-slick detection
–Fault diagnosis
–Promotional mailing
144 Counting the cost
The confusion matrix:

                    Predicted class
                    Yes               No
Actual class  Yes   True positive     False negative
              No    False positive    True negative

There are many other types of cost!
–E.g.: cost of collecting training data
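
As an illustration of how the confusion matrix feeds into a cost calculation, the following Python sketch counts the four cells and weights the two error types. The function names, the label strings and the cost values are all hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FN, FP and TN for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts

def total_cost(counts, cost_fp, cost_fn):
    """Total cost when only the two kinds of error incur a cost."""
    return counts["FP"] * cost_fp + counts["FN"] * cost_fn

actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]
counts = confusion_counts(actual, predicted)
cost = total_cost(counts, cost_fp=1, cost_fn=10)  # assumed: a miss costs 10x a false alarm
```
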
145 Lift charts
In practice, costs are rarely known
Decisions are usually made by comparing possible scenarios
Example: promotional mailout to 1,000,000 households
–Mail to all; 0.1% respond (1000)
–Data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400): 40% of responses for 10% of cost may pay off
–Identify subset of 400,000 most promising, 0.2% respond (800)
A lift chart allows a visual comparison
146 Generating a lift chart
Sort instances according to predicted probability of being positive:

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …

x axis is sample size
y axis is number of true positives
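
The procedure above can be sketched in Python as follows. The function name `lift_points` is hypothetical; the four instances are the ones listed in the table, with "Yes" encoded as `True`.

```python
def lift_points(scored):
    """scored: list of (predicted probability, is_positive) pairs.

    Returns (sample size, number of true positives) points for a lift chart.
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    points = []
    true_positives = 0
    for size, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            true_positives += 1
        points.append((size, true_positives))
    return points

instances = [(0.95, True), (0.93, True), (0.93, False), (0.88, True)]
points = lift_points(instances)
```
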
147 A hypothetical lift chart
[Figure: lift chart with two annotated points: 40% of responses for 10% of cost; 80% of responses for 40% of cost]
148 ROC curves
ROC curves are similar to lift charts
–Stands for “receiver operating characteristic”
–Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
Differences from lift chart:
–y axis shows percentage of true positives in sample rather than absolute number
–x axis shows percentage of false positives in sample rather than sample size
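
A small Python sketch of how ROC points could be generated from ranked predictions is given below. The function name `roc_points` is hypothetical, rates are returned as fractions rather than percentages, and ties in predicted probability are not treated specially.

```python
def roc_points(scored):
    """scored: list of (predicted probability, is_positive) pairs.

    Returns (FP rate, TP rate) points, starting at (0, 0).
    """
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    total_pos = sum(1 for _, is_positive in ranked if is_positive)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_neg, tp / total_pos))
    return points

roc = roc_points([(0.95, True), (0.93, True), (0.93, False), (0.88, True)])
```
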
149 A sample ROC curve
[Figure: sample ROC curve]
Jagged curve: one set of test data
Smooth curve: use cross-validation
150 Cross-validation and ROC curves
Simple method of getting an ROC curve using cross-validation:
–Collect probabilities for instances in test folds
–Sort instances according to probabilities
This method is implemented in WEKA
However, this is just one possibility
–The method described in the WEKA book generates an ROC curve for each fold and averages them
151 ROC curves for two schemes
For a small, focused sample, use method A
For a larger one, use method B
In between, choose between A and B with appropriate probabilities
152 The convex hull
Given two learning schemes we can achieve any point on the convex hull!
TP and FP rates for scheme 1: $t_1$ and $f_1$
TP and FP rates for scheme 2: $t_2$ and $f_2$
If scheme 1 is used to predict $100 \times q$% of the cases and scheme 2 for the rest, then
–TP rate for combined scheme: $q \cdot t_1 + (1-q) \cdot t_2$
–FP rate for combined scheme: $q \cdot f_1 + (1-q) \cdot f_2$
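
The interpolation between two schemes amounts to a weighted average of their rates, with TP rate q·t1 + (1−q)·t2 and FP rate q·f1 + (1−q)·f2. A minimal Python sketch, with a hypothetical function name and made-up rates:

```python
def combined_rates(q, t1, f1, t2, f2):
    """TP and FP rates when scheme 1 handles a fraction q of cases and scheme 2 the rest."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    tp_rate = q * t1 + (1 - q) * t2
    fp_rate = q * f1 + (1 - q) * f2
    return tp_rate, fp_rate

# Hypothetical schemes: (TP rate, FP rate) of (0.6, 0.1) and (0.9, 0.5)
tp_rate, fp_rate = combined_rates(0.5, 0.6, 0.1, 0.9, 0.5)
```

Varying q from 0 to 1 traces the straight line between the two operating points.
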
153 Cost-sensitive learning
Most learning schemes do not perform cost-sensitive learning
–They generate the same classifier no matter what costs are assigned to the different classes
–Example: standard decision tree learner
Simple methods for cost-sensitive learning:
–Resampling of instances according to costs
–Weighting of instances according to costs
Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
154 Measures in information retrieval
Percentage of retrieved documents that are relevant: precision = TP/(TP+FP)
Percentage of relevant documents that are returned: recall = TP/(TP+FN)
Precision/recall curves have hyperbolic shape
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
F-measure = (2 × recall × precision)/(recall + precision)
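
These three measures follow directly from the confusion-matrix counts. A short Python sketch, with a hypothetical function name and made-up counts:

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# Hypothetical retrieval result: 40 relevant retrieved, 10 irrelevant retrieved, 20 relevant missed
precision, recall, f_measure = precision_recall_f(tp=40, fp=10, fn=20)
```
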
155 Summary of measures

Measure                  Domain                   Plot                                 Explanation
Lift chart               Marketing                TP vs. subset size                   TP; (TP+FP)/(TP+FP+TN+FN)
ROC curve                Disease classification   TP rate (sensitivity) vs. FP rate    TP/(TP+FN); FP/(FP+TN)
Recall-precision curve   Information retrieval    Recall vs. precision                 TP/(TP+FN); TP/(TP+FP)
157 Evaluating numeric prediction
Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: error measures
Actual target values: $a_1, a_2, \ldots, a_n$
Predicted target values: $p_1, p_2, \ldots, p_n$
Most popular measure: mean-squared error $\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{n}$
–Easy to manipulate mathematically
158 Other measures
The root mean-squared error: $\sqrt{\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{n}}$
The mean absolute error is less sensitive to outliers than the mean-squared error: $\frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{n}$
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
159 Improvement on the mean
How much does the scheme improve on simply predicting the average?
The relative squared error is ($\bar{a}$ is the average of the actual values): $\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{(\bar{a} - a_1)^2 + \ldots + (\bar{a} - a_n)^2}$
The relative absolute error is: $\frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{|\bar{a} - a_1| + \ldots + |\bar{a} - a_n|}$
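
The error measures for numeric prediction can be computed together, as in the following Python sketch. The function name `error_measures`, the dictionary keys and the example values are hypothetical; the average used for the relative measures is assumed to be the mean of the actual values passed in.

```python
import math

def error_measures(predicted, actual):
    """Mean-squared, root mean-squared, mean absolute, relative squared and relative absolute error."""
    n = len(actual)
    a_bar = sum(actual) / n
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    absolute = sum(abs(p - a) for p, a in zip(predicted, actual))
    return {
        "mse": squared / n,
        "rmse": math.sqrt(squared / n),
        "mae": absolute / n,
        "rse": squared / sum((a_bar - a) ** 2 for a in actual),
        "rae": absolute / sum(abs(a_bar - a) for a in actual),
    }

measures = error_measures(predicted=[450, 320, 210, 380], actual=[500, 300, 200, 400])
```

A relative error below 1 means the scheme does better than always predicting the average.
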
160 Correlation coefficient
Measures the statistical correlation between the predicted values and the actual values
Scale independent, between –1 and +1
Good performance leads to large values!
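
A sketch of the sample correlation coefficient in plain Python follows. The function name is hypothetical, and the code assumes that neither list is constant (otherwise the denominator is zero).

```python
import math

def correlation_coefficient(predicted, actual):
    """Sample (Pearson) correlation between predicted and actual values."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    p_bar = sum(predicted) / n
    a_bar = sum(actual) / n
    s_pa = sum((p - p_bar) * (a - a_bar) for p, a in zip(predicted, actual)) / (n - 1)
    s_p = sum((p - p_bar) ** 2 for p in predicted) / (n - 1)
    s_a = sum((a - a_bar) ** 2 for a in actual) / (n - 1)
    return s_pa / math.sqrt(s_p * s_a)
```

Scale independence shows up directly: predictions that are any increasing linear function of the actual values give a coefficient of +1, regardless of how large the errors are.
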
161 Which measure?
Best to look at all of them
Often it doesn’t matter
Example:

                              A       B       C       D
Root mean-squared error       …       …       …       …
Mean absolute error           …       …       …       …
Root rel squared error        42.2%   57.2%   39.4%   35.8%
Relative absolute error       43.1%   40.1%   34.8%   30.4%
Correlation coefficient       …       …       …       …

D best
C second-best
A, B arguable
163 The MDL principle
MDL stands for minimum description length
The description length is defined as: space required to describe a theory + space required to describe the theory’s mistakes
In our case the theory is the classifier and the mistakes are the errors on the training data
Aim: we seek a classifier with minimal DL
MDL principle is a model selection criterion
164 Model selection criteria
Model selection criteria attempt to find a good compromise between:
–The complexity of a model
–Its prediction accuracy on the training data
Reasoning: a good model is a simple model that achieves high accuracy on the given data
Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts
William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
165 Elegance vs. errors
Theory 1: very simple, elegant theory that explains the data almost perfectly
Theory 2: significantly more complex theory that reproduces the data without mistakes
Theory 1 is probably preferable
Classical example: Kepler’s three laws on planetary motion
–Less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles
166 Elegance vs. errors
Kepler: “I have cleared the Augean stables of astronomy of cycles and spirals, and left behind me only a single cartload of dung”
167 MDL and compression
MDL principle relates to data compression:
–The best theory is the one that compresses the data the most
–I.e. to compress a dataset we generate a model and then store the model and its mistakes
We need to compute (a) size of the model, and (b) space needed to encode the errors
(b) easy: use the informational loss function
(a) need a method to encode the model
168 MDL and Bayes’s theorem
L[T] = “length” of the theory
L[E|T] = training set encoded with respect to the theory (“dung”)
Description length = L[T] + L[E|T]
Bayes’s theorem gives a posteriori probability of a theory given the data: $\Pr[T|E] = \frac{\Pr[E|T]\,\Pr[T]}{\Pr[E]}$
Equivalent to: $-\log \Pr[T|E] = -\log \Pr[E|T] - \log \Pr[T] + \log \Pr[E]$, where $\log \Pr[E]$ is a constant
169 MDL and MAP
MAP stands for maximum a posteriori probability
Finding the MAP theory corresponds to finding the MDL theory
Difficult bit in applying the MAP principle: determining the prior probability Pr[T] of the theory
Corresponds to difficult part in applying the MDL principle: coding scheme for the theory
I.e. if we know a priori that a particular theory is more likely we need fewer bits to encode it
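
The correspondence between MAP and MDL can be checked numerically if code lengths are taken to be negative base-2 logarithms of probabilities, which is an assumption of this sketch. The three theories, their priors Pr[T] and likelihoods Pr[E|T] are made-up illustration values.

```python
import math

def description_length(prior, likelihood):
    """L[T] + L[E|T] in bits, assuming L[T] = -log2 Pr[T] and L[E|T] = -log2 Pr[E|T]."""
    return -math.log2(prior) - math.log2(likelihood)

# Hypothetical theories: name -> (Pr[T], Pr[E|T])
theories = {
    "simple": (0.6, 0.02),
    "medium": (0.3, 0.10),
    "complex": (0.1, 0.15),
}

# MAP: maximize Pr[E|T] * Pr[T] (Pr[E] is the same constant for every theory)
map_theory = max(theories, key=lambda name: theories[name][0] * theories[name][1])
# MDL: minimize the total description length
mdl_theory = min(theories, key=lambda name: description_length(*theories[name]))
```

Both criteria select the same theory because the negative logarithm is a decreasing function, so maximizing the product is the same as minimizing the sum of code lengths.
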
170 Discussion of MDL principle
Advantage: makes full use of the training data when selecting a model
Disadvantage 1: appropriate coding scheme/prior probabilities for theories are crucial
Disadvantage 2: no guarantee that the MDL theory is the one which minimizes the expected error
Note: Occam’s Razor is an axiom!
Epicurus’s principle of multiple explanations: keep all theories that are consistent with the data
171 Bayesian model averaging
Reflects Epicurus’s principle: all theories are used for prediction, weighted according to Pr[T|E]
Let I be a new instance whose class we must predict
Let C be the random variable denoting the class
Then BMA gives the probability of C given
–I
–training data E
–possible theories $T_j$
$\Pr[C \mid I, E] = \sum_j \Pr[C \mid I, T_j]\,\Pr[T_j \mid E]$
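
Bayesian model averaging is a posterior-weighted sum of each theory's class probabilities, Pr[C | I, E] = Σ_j Pr[C | I, T_j] Pr[T_j | E]. A minimal Python sketch follows; the function name, the three theories and all probabilities are hypothetical.

```python
def bma_class_probabilities(posteriors, predictions):
    """posteriors: Pr[T_j | E] for each theory.
    predictions: for each theory, a dict mapping class -> Pr[C | I, T_j].
    Returns a dict mapping class -> averaged probability.
    """
    classes = predictions[0].keys()
    return {
        c: sum(w * pred[c] for w, pred in zip(posteriors, predictions))
        for c in classes
    }

posteriors = [0.5, 0.3, 0.2]
predictions = [
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.4, "no": 0.6},
    {"yes": 0.2, "no": 0.8},
]
averaged = bma_class_probabilities(posteriors, predictions)
```
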
172 MDL and clustering
Description length of theory: bits needed to encode the clusters
–e.g. cluster centers
Description length of data given theory: encode cluster membership and position relative to cluster
–e.g. distance to cluster center
Works if coding scheme uses less code space for small numbers than for large ones
With nominal attributes, must communicate probability distributions for each cluster
173 Main References
Han, J., & Kamber, M. (2011). Data mining: Concepts and techniques (2nd ed.). New York: Morgan Kaufmann.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). New York: Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. H. The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.