1 Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
2 Some Fundamental Problems Which algorithm is “best”? Are there any reasons to favor one algorithm over another? Is “Occam’s razor” really so evident? Do simpler or “smoother” classifiers generalize better? If so, why? Are there fundamental “conservation” or “constraint” laws other than the Bayes error rate?
3 Meaning of “Algorithm-Independent” Mathematical foundations that do not depend upon the particular classifier or learning algorithm used –e.g., the bias and variance concept Techniques that can be used in conjunction with different learning algorithms, or provide guidance in their use –e.g., cross-validation and resampling techniques
4 Roadmap No pattern classification method is inherently superior to any other Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses Estimation of accuracies and comparison of different classifiers under certain assumptions Methods for integrating component classifiers
5 Generalization Performance by Off-Training-Set Error Consider a two-category problem Training set D consists of patterns x_i with labels y_i = 1 or -1 for i = 1, ..., n, generated by an unknown target function F(x) to be learned F(x) often has a random component –The same input could lead to different categories –Giving a non-zero Bayes error
6 Generalization Performance by Off-Training-Set Error Let H be the (discrete) set of hypotheses, or sets of parameters, to be learned A particular h belongs to H –Quantized weights in a neural network –Parameters θ in a functional model –Sets of decisions in a tree P(h): prior probability that the algorithm will produce hypothesis h after training
7 Generalization Performance by Off-Training-Set Error P(h|D): probability that the algorithm will yield h when trained on data D –Nearest-neighbor and decision tree: non-zero only for a single hypothesis –Neural network: can be a broad distribution E: error for the zero-one or other loss function
8 Generalization Performance by Off-Training-Set Error A natural measure is the expected off-training-set classification error for the k-th candidate learning algorithm
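The equation itself is not in the text; a reconstruction in the usual notation (P(x) is the sampling distribution, δ(·,·) the Kronecker delta, and P_k(h(x)|D) the probability that algorithm k produces hypothesis h from D) is assumed to be of this form:

```latex
\mathcal{E}_k(E \mid F, n) \;=\; \sum_{\mathbf{x} \notin D} P(\mathbf{x})\,\bigl[\,1 - \delta\bigl(F(\mathbf{x}), h(\mathbf{x})\bigr)\bigr]\; P_k\bigl(h(\mathbf{x}) \mid D\bigr)
```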
9 No Free Lunch Theorem
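The theorem was shown as an equation slide; a reconstruction of its usual statement (following Duda, Hart & Stork), with E_k the expected off-training-set error of algorithm k:

```latex
\begin{aligned}
&\text{For any two algorithms } P_1(h \mid D) \text{ and } P_2(h \mid D),\ \text{independently of } P(\mathbf{x}) \text{ and } n:\\
&\text{1. uniformly averaged over all targets } F:\quad \textstyle\sum_F \bigl[\mathcal{E}_1(E \mid F, n) - \mathcal{E}_2(E \mid F, n)\bigr] = 0,\\
&\text{2. for any fixed } D,\ \text{uniformly averaged over } F:\quad \textstyle\sum_F \bigl[\mathcal{E}_1(E \mid F, D) - \mathcal{E}_2(E \mid F, D)\bigr] = 0.
\end{aligned}
```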
10 No Free Lunch Theorem For any two algorithms: no matter how clever we are in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one There is at least one target function for which random guessing is a better algorithm
11 No Free Lunch Theorem Even if we know D, averaged over all target functions no algorithm yields an off-training-set error that is superior to that of any other
12 Example 1: No Free Lunch for Binary Data [Table: the eight three-bit patterns x = 000, …, 111 with target values F(x) and two hypotheses h1(x), h2(x); the first three patterns form the training set D, on which both hypotheses agree with F]
13 No Free Lunch Theorem
14 Conservation in Generalization We cannot achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems We can trade performance on problems we do not expect to encounter for performance on those we do expect to encounter It is the assumptions about the learning domains that are relevant
15 Ugly Duckling Theorem In the absence of assumptions there is no privileged or “best” feature representation Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct
16 Venn Diagram Representation of Features as Predicates
17 Rank of a Predicate The rank of a predicate is the number of the simplest or indivisible elements it contains Example: rank r = 1 –x1: f1 AND NOT f2 –x2: f1 AND f2 –x3: f2 AND NOT f1 –x4: NOT(f1 OR f2) –C(4,1) = 4 predicates
18 Examples of Rank of a Predicate Rank r = 2 –x1 OR x2: f1 –x1 OR x3: f1 XOR f2 –x1 OR x4: NOT f2 –x2 OR x3: f2 –x2 OR x4: (f1 AND f2) OR NOT(f1 OR f2) –x3 OR x4: NOT f1 –C(4,2) = 6 predicates
19 Examples of Rank of a Predicate Rank r = 3 –x1 OR x2 OR x3: f1 OR f2 –x1 OR x2 OR x4: f1 OR NOT f2 –x1 OR x3 OR x4: NOT(f1 AND f2) –x2 OR x3 OR x4: f2 OR NOT f1 –C(4,3) = 4 predicates
20 Total Number of Predicates in Absence of Constraints Let d be the number of regions in the Venn diagram (i.e., the number of distinct patterns, or the number of possible values determined by combinations of the features)
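With no constraints, every non-empty subset of the d regions is a predicate, so the total count (presumably the formula the slide displayed) is:

```latex
\sum_{r=1}^{d} \binom{d}{r} \;=\; 2^{d} - 1
```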
21 A Measure of Similarity in Absence of Prior Information Number of features or attributes shared by two patterns –Conceptual difficulties: e.g., with features blind_in_right_eye and blind_in_left_eye, (1,0) is more similar to (1,1) and (0,0) than to (0,1) –There are always multiple ways to represent vectors of attributes, e.g., blind_in_right_eye and same_in_both_eyes –No principled reason to prefer one of these representations over another
22 A Plausible Measure of Similarity in Absence of Prior Information Number of predicates the patterns share Consider two distinct patterns –No predicate of rank 1 is shared –1 predicate of rank 2 is shared –C(d-2, 1) predicates of rank 3 are shared –C(d-2, r-2) predicates of rank r are shared –Total number of predicates shared is given below
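Summing these counts over all ranks gives a total that does not depend on which two patterns were chosen (presumably the sum shown on the slide):

```latex
\sum_{r=2}^{d} \binom{d-2}{r-2} \;=\; 2^{d-2}
```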
23 Ugly Duckling Theorem Given a finite set of predicates that enables us to distinguish any two patterns, the number of predicates shared by any two such patterns is constant and independent of the choice of those patterns Furthermore, if pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar”
24 Ugly Duckling Theorem There is no problem-independent or privileged or “best” set of features or feature attributes The theorem also applies to continuous feature spaces
25 Minimum Description Length (MDL) Find some irreducible, smallest representation (“signal”) of all members of a category All variation among the individual patterns is then “noise” By simplifying recognizers appropriately, the signal can be retained while the noise is ignored
26 Algorithm Complexity (Kolmogorov Complexity) The Kolmogorov complexity of a binary string x is defined –On an abstract computer (Turing machine) U –As the length of the shortest program (binary string) y –That, without additional data, computes the string x and halts
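A compact way to write this definition, assuming the usual notation with |y| the length of program y and U(y) the output of U run on y:

```latex
K(x) \;=\; \min_{\{y \,:\, U(y) = x\}} |y|
```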
27 Algorithm Complexity Example Suppose x consists solely of n 1s Use some fixed number of bits k to specify a loop for printing a string of 1s Need log2 n more bits to specify the iteration count n, the condition for halting Thus K(x) = O(log2 n)
28 Algorithm Complexity Example The constant π = 11.001001000011111110110101010001… (binary) The shortest program is the one that can produce any arbitrarily large number of consecutive digits of π Thus K(x) = O(1)
29 Algorithm Complexity Example Assume that x is a “truly” random binary string It cannot be expressed as a shorter string Thus K(x) = O(|x|)
30 Minimum Description Length (MDL) Principle Minimize the sum of the model’s algorithmic complexity and the description length of the training data with respect to that model
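In symbols, with K(·) denoting algorithmic complexity, the MDL principle selects (a sketch of the usual formulation; the slide’s equation is assumed to be equivalent):

```latex
h^{*} \;=\; \arg\min_{h}\; \bigl[\, K(h) \;+\; K(D \text{ using } h) \,\bigr]
```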
31 An Application of MDL Principle For decision-tree classifiers, a model h specifies the tree and the decisions at the nodes Algorithmic complexity of h is proportional to the number of nodes Complexity of data is expressed in terms of the entropy (in bits) of the data Tree-pruning based on entropy is equivalent to the MDL principle
32 Convergence of MDL Classifiers MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data We cannot prove that the MDL principle leads to superior performance in the finite-data case This is still consistent with the no-free-lunch principle
33 Bayesian Perspective of MDL Principle
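The connection presumably shown on this slide: choosing the maximum-a-posteriori model is the same as minimizing a two-part description length, since

```latex
\arg\max_{h} P(h)\,P(D \mid h) \;=\; \arg\min_{h} \bigl[\, -\log_2 P(h) \;-\; \log_2 P(D \mid h) \,\bigr],
```

where -log2 P(h) is the description length (in bits) of the model and -log2 P(D|h) that of the data encoded with the model.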
34 Overfitting Avoidance Avoiding overfitting or minimizing description length is not inherently beneficial These techniques amount to a preference over the forms or parameters of classifiers They are beneficial only if they match their problems There are problems for which overfitting avoidance leads to worse performance
35 Explanation of Success of Occam’s Razor Through evolution and strong selection pressure on our neurons, we are likely to ignore problems for which Occam’s razor does not hold Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology Principle of satisficing: creating an adequate though possibly non-optimal solution
36 Bias and Variance Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem Bias –Accuracy or quality of the match –High bias implies a poor match Variance –Precision or specificity of the match –High variance implies a weak match
37 Bias and Variance for Regression
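The equations on this slide are not in the text; the standard decomposition of the mean-squared estimation error into bias and variance terms, which is assumed to be what was shown, is:

```latex
\mathcal{E}_D\!\left[\bigl(g(\mathbf{x};D) - F(\mathbf{x})\bigr)^2\right]
\;=\;
\underbrace{\bigl(\mathcal{E}_D[g(\mathbf{x};D)] - F(\mathbf{x})\bigr)^2}_{\text{bias}^2}
\;+\;
\underbrace{\mathcal{E}_D\!\left[\bigl(g(\mathbf{x};D) - \mathcal{E}_D[g(\mathbf{x};D)]\bigr)^2\right]}_{\text{variance}}
```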
38 Bias-Variance Dilemma
39 Bias-Variance Dilemma For a given target function: A model with many parameters –Generally has low bias –Fits the data well –Yields high variance A model with few parameters –Generally has high bias –May not fit the data well –The fit does not change much for different data sets (low variance)
40 Bias-Variance Dilemma The best way to get both low bias and low variance is to have prior information about the target function We virtually never get zero bias and zero variance –That would require having only one learning problem to solve, with the answer already known A large amount of training data will yield improved performance as long as the model is sufficiently general
41 Bias and Variance for Classification Reference –J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,” Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 55-77, 1997.
42 Bias and Variance for Classification Two-category problem Target function F(x) = Pr[y=1|x] = 1 - Pr[y=0|x] Discriminant function y_d = F(x) + ε ε: zero-mean, centered binomially distributed random variable F(x) = E[y_d|x]
43 Bias and Variance for Classification Find an estimate g(x;D) to minimize E_D[(g(x;D) - y_d)^2] The estimate g(x;D) can be used to classify x The bias and variance concepts for regression can be applied to g(x;D) as an estimate of F(x) However, this is not related directly to classification error
44 Bias and Variance for Classification
45 Bias and Variance for Classification
46 Bias and Variance for Classification
47 Bias and Variance for Classification
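These derivation slides were equation images; the key relation in Friedman’s analysis, which the surrounding slides discuss, can be written as follows (y_B denotes the Bayes decision at x; the intermediate steps shown on the slides are not reproduced here):

```latex
\Pr\bigl[g(\mathbf{x};D) \neq y\bigr]
\;=\;
\bigl|\,2F(\mathbf{x}) - 1\,\bigr|\;\Pr\bigl[g(\mathbf{x};D) \neq y_B\bigr]
\;+\;
\Pr\bigl[y_B \neq y\bigr]
```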
48 Bias and Variance for Classification The sign of the boundary bias affects the role of variance in the error If the sign is positive, low variance is generally important for accurate classification Low boundary bias need not result in a lower error rate Simple methods often have lower variance, and need not be inferior to more flexible methods
49 Error rates and optimal K vs. N for d = 20 in KNN
50 Estimation and classification error vs. d for N = 12800 in KNN
51 Boundary Bias-Variance Trade-Off
52 Leave-One-Out Method (Jackknife)
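The jackknife formulas presumably shown here, in the usual notation (θ̂ is the estimate from all n points and θ̂_(i) the leave-one-out estimate computed with point i removed):

```latex
\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n} \hat{\theta}_{(i)},
\qquad
\widehat{\mathrm{Var}}_{\text{jack}}\bigl[\hat{\theta}\bigr]
= \frac{n-1}{n}\sum_{i=1}^{n} \bigl(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}\bigr)^2
```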
53 Generalization to Estimates of Other Statistics Estimator of other statistics –median, 25th percentile, mode, etc. Leave-one-out estimate Jackknife estimate and its related variance
54 Jackknife Bias Estimate
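A sketch of the standard jackknife bias estimate and the corresponding bias-corrected estimator, in the same notation as above:

```latex
\widehat{\mathrm{bias}}_{\text{jack}} = (n-1)\bigl(\hat{\theta}_{(\cdot)} - \hat{\theta}\bigr),
\qquad
\tilde{\theta} = \hat{\theta} - \widehat{\mathrm{bias}}_{\text{jack}} = n\,\hat{\theta} - (n-1)\,\hat{\theta}_{(\cdot)}
```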
55 Example 2: Jackknife for Mode
56 Bootstrap Randomly select n points from the training data set D, with replacement Repeat this process independently B times to yield B bootstrap data sets, treated as independent sets Bootstrap estimate of a statistic: compute the statistic on each bootstrap set and combine the results (see the formulas below)
57 Bootstrap Bias and Variance Estimate
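The bootstrap formulas presumably displayed on these two slides, with θ̂*(b) the estimate computed from the b-th bootstrap set and θ̂ the estimate from the original data:

```latex
\hat{\theta}^{*(\cdot)} = \frac{1}{B}\sum_{b=1}^{B} \hat{\theta}^{*(b)},
\qquad
\widehat{\mathrm{bias}}_{\text{boot}} = \hat{\theta}^{*(\cdot)} - \hat{\theta},
\qquad
\widehat{\mathrm{Var}}_{\text{boot}}\bigl[\hat{\theta}\bigr] = \frac{1}{B}\sum_{b=1}^{B}\bigl(\hat{\theta}^{*(b)} - \hat{\theta}^{*(\cdot)}\bigr)^2
```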
58 Properties of Bootstrap Estimates The larger the number B of bootstrap samples, the more satisfactory the estimate of a statistic and its variance B can be adjusted to the available computational resources –The jackknife estimate, in contrast, requires exactly n repetitions
59 Bagging Arcing –Adaptive reweighting and combining –Reusing or selecting data in order to improve classification, e.g., AdaBoost Bagging –Bootstrap aggregation –Multiple versions of D, obtained by drawing n' < n samples from D with replacement –Each set trains a component classifier –The final decision is based on a vote of the component classifiers (see the sketch below)
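A minimal Python sketch of bagging, assuming a generic train(X, y) routine that returns a classifier with a predict method (both are placeholders, not part of the slides):

```python
import numpy as np

def bagging_fit(train, X, y, B=25, n_prime=None, seed=0):
    """Train B component classifiers, each on a bootstrap replicate of D."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_prime = n_prime or n                      # size of each bootstrap sample
    # sample indices with replacement and train one classifier per replicate
    return [train(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n_prime) for _ in range(B))]

def bagging_predict(classifiers, X):
    """Final decision: majority vote of the component classifiers."""
    votes = np.stack([clf.predict(X) for clf in classifiers])   # shape (B, n_samples)
    # majority vote per test point; class labels assumed to be non-negative integers
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```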
60 Unstable Algorithm and Bagging Unstable algorithm –“Small” changes in the training data lead to significantly different classifiers and relatively “large” changes in accuracy In general bagging improves recognition for unstable classifiers –It effectively averages over such discontinuities –There are no convincing theoretical derivations or simulation studies showing that bagging helps for all unstable classifiers
61 Boosting Create the first classifier –With accuracy on the training set greater than average (a weak learner) Add a new classifier –To form an ensemble –The joint decision rule has much higher accuracy on the training set The classification performance has been “boosted”
62 Boosting Procedure Train successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers
63 Training Data and Weak Learner
64 “Most Informative” Set Given C1 Flip a fair coin Heads –Select remaining samples from D –Present them to C1 one by one until C1 misclassifies a pattern –Add the misclassified pattern to D2 Tails –Present remaining samples until C1 classifies a pattern correctly, and add that pattern to D2
65 Third Data Set and Classifier C3 Randomly select a remaining training pattern Add the pattern to D3 if C1 and C2 disagree, otherwise ignore it
66 Classification of a Test Pattern If C1 and C2 agree, use their label If they disagree, trust C3
67 Choosing n1 For the final vote, n1 ~ n2 ~ n3 ~ n/3 is desired Reasonable guess: n1 = n/3 Simple problem: n2 << n1 Difficult problem: n2 too large In practice we need to run the whole boosting procedure a few times –To use the full training set –To get roughly equal partitions of the training set
68 AdaBoost Adaptive boosting The most popular version of basic boosting Continue adding weak learners until some desired low training error has been achieved
69 AdaBoost Algorithm
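The algorithm itself was an image; below is a minimal Python sketch of discrete AdaBoost under the usual formulation. train_weak is a placeholder for any weak learner that accepts sample weights, and labels are assumed to be ±1:

```python
import numpy as np

def adaboost(X, y, train_weak, k_max):
    """Sketch of discrete AdaBoost. train_weak(X, y, W) is assumed to return a
    classifier h with an h.predict(X) method (placeholder interface)."""
    y = np.asarray(y)
    n = len(y)
    W = np.full(n, 1.0 / n)               # start with uniform weights
    hypotheses, alphas = [], []
    for _ in range(k_max):
        h = train_weak(X, y, W)           # train weak learner on weighted data
        pred = h.predict(X)
        err = np.sum(W[pred != y])        # weighted training error
        if err >= 0.5:                    # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        W *= np.exp(-alpha * y * pred)    # up-weight misclassified points
        W /= W.sum()                      # renormalize to a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(X, hypotheses, alphas):
    # final discriminant g(x) = sum_k alpha_k h_k(x); decide by its sign
    g = sum(a * h.predict(X) for a, h in zip(alphas, hypotheses))
    return np.sign(g)
```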
70 Final Decision Discriminant function
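The discriminant presumably shown here is the weighted sum of the component hypotheses, with the class given by its sign:

```latex
g(\mathbf{x}) \;=\; \sum_{k=1}^{k_{\max}} \alpha_k\, h_k(\mathbf{x}),
\qquad
\hat{y} = \operatorname{sgn}\bigl(g(\mathbf{x})\bigr)
```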
71 Ensemble Training Error
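The slide’s equation is not in the text; the standard bound on the AdaBoost ensemble training error in terms of the weighted errors ε_k of the component classifiers is assumed to be what was shown:

```latex
E_{\text{train}} \;\le\; \prod_{k=1}^{k_{\max}} 2\sqrt{\varepsilon_k\,(1-\varepsilon_k)}
\;=\; \prod_{k=1}^{k_{\max}} \sqrt{1 - 4G_k^{2}},
\qquad G_k = \tfrac{1}{2} - \varepsilon_k
```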
72 AdaBoost vs. No Free Lunch Theorem Boosting only improves classification if the component classifiers perform better than chance –This cannot be guaranteed a priori An exponential reduction in error on the training set does not ensure a reduction of the off-training-set (generalization) error Boosting has proven effective in many real-world applications
73 Learning with Queries Start with a set of unlabeled patterns There exists some (possibly costly) way of labeling any pattern The goal is to determine which unlabeled patterns would be most informative if they were presented as a query to an oracle Also called active learning or interactive learning Can be refined further as cost-based learning
74 Application Example Design a classifier for handwritten numerals Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern Human as the oracle
75 Learning with Queries Begin with a preliminary, weak classifier developed with a small set of labeled samples Two related methods for selecting an informative pattern –Confidence-based query selection –Voting-based or committee-based query selection
76 Selecting the Most Informative Patterns Confidence-based query selection –Choose a pattern for which the two largest discriminant functions have nearly the same value –i.e., patterns that lie near the current decision boundaries Voting-based query selection –Choose the pattern that yields the greatest disagreement among the k resulting category labels
77 Active Learning Example
78 Arcing and Active Learning vs. IID Sampling If we take a model of the true distribution and train it with a highly skewed distribution produced by active learning, the final classifier accuracy might be low Resampling methods generally do not attempt to model or fit the full category distributions –They do not fit parameters in a model –Instead they seek decision boundaries directly
79 Arcing and Active Learning As the number of component classifiers is increased, resampling, boosting and related methods effectively broaden the class of implementable functions They allow us to try to “match” the final classifier to the problem by indirectly adjusting the bias and variance They can be used with arbitrary classification techniques
80 Estimating the Generalization Rate See if the classifier performs well enough to be useful Compare its performance with that of a competing design Requires making assumptions about the classifier or the problem or both All the methods given here are heuristic
81 Parametric Models Compute the error rate from the assumed parametric model Example: two-class multivariate normal case –Bhattacharyya or Chernoff bounds using the estimated means and covariance matrices Problems –Overly optimistic –Always suspect the model –The error rate may be difficult to compute
82 Simple Cross-Validation Randomly split the set of labeled training samples D into a training set and a validation set
83 m-Fold Cross-Validation The training set is randomly divided into m disjoint sets of equal size n/m The classifier is trained m times –Each time with a different set held out as a validation set The estimated performance is the mean of these m errors When m = n, it is in effect the leave-one-out approach (see the sketch below)
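A minimal Python sketch of m-fold cross-validation, assuming generic train(X, y) and error_rate(clf, X, y) routines (placeholders, not from the slides); setting m = n gives leave-one-out:

```python
import numpy as np

def m_fold_cv_error(train, error_rate, X, y, m=10, seed=0):
    """Estimate generalization error by m-fold cross-validation."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, m)             # m disjoint validation sets
    errors = []
    for k in range(m):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(m) if j != k])
        clf = train(X[trn], y[trn])             # train on the other m-1 folds
        errors.append(error_rate(clf, X[val], y[val]))
    return np.mean(errors)                      # mean of the m validation errors
```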
84 Forms of Learning for Cross-Validation Neural networks of fixed topology –Number of epochs or presentations of the training set –Number of hidden units Width of the Gaussian window in Parzen windows Optimal k in the k-nearest-neighbor classifier
85 Portion of D as a Validation Set Should be small –The validation set is used merely to know when to stop adjusting parameters –The training set is used to set the large number of parameters or degrees of freedom Traditional default –Set the validation fraction γ = 0.1 –Proven effective in many applications
86 Anti-Cross-Validation Cross-validation need not work on every problem Anti-cross-validation –Halt when the validation error reaches its first local maximum –Must explore different values of γ –Possibly abandon the use of cross-validation if performance cannot be improved
87 Estimation of Error Rate Let p be the true and unknown error rate of the classifier Assume k of the n' independent, randomly drawn test samples are misclassified; then k has a binomial distribution The maximum-likelihood estimate for p follows directly (see below)
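The distribution of k and the maximum-likelihood estimate of p, assumed to be the equations displayed on the slide:

```latex
P(k) \;=\; \binom{n'}{k}\, p^{k}\,(1-p)^{\,n'-k},
\qquad
\hat{p} \;=\; \frac{k}{n'}
```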
88 95% Confidence Intervals for a Given Estimated p
89 Jackknife Estimation of Classification Accuracy Use the leave-one-out approach Obtain the jackknife estimate for the mean and variance of the leave-one-out accuracies Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance
90 Jackknife Estimation of Classification Accuracy
91 Bootstrap Estimation of Classification Accuracy Train B classifiers, each with a different bootstrap data set Test on the other bootstrap data sets The bootstrap estimate is the mean of these bootstrap accuracies
92 Maximum-Likelihood Comparison (ML-II) Also called maximum-likelihood selection Find the maximum-likelihood parameters for each of the candidate models Calculate the resulting likelihoods (evidences) Choose the model with the largest likelihood
93 Maximum-Likelihood Comparison
94 Scientific Process *D. J. C. MacKay, “Bayesian interpolation,” Neural Computation, 4(3), 415-447, 1992
95 Bayesian Model Comparison
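The formulas on this and the next two slides were images; a reconstruction of the usual presentation (following MacKay), where θ are the parameters of model h_i, Δθ is the posterior width, and Δ⁰θ the prior width:

```latex
P(h_i \mid D) \;\propto\; P(D \mid h_i)\,P(h_i),
\qquad
P(D \mid h_i) \;=\; \int P(D \mid \boldsymbol{\theta}, h_i)\, p(\boldsymbol{\theta} \mid h_i)\, d\boldsymbol{\theta}
\;\approx\;
\underbrace{P(D \mid \hat{\boldsymbol{\theta}}, h_i)}_{\text{best-fit likelihood}}
\;\cdot\;
\underbrace{\frac{\Delta\boldsymbol{\theta}}{\Delta^{0}\boldsymbol{\theta}}}_{\text{Occam factor}}
```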
96 Concept of Occam Factor
97 Concept of Occam Factor There is an inherent bias toward simple models (small prior parameter range Δ⁰θ) Models that are overly complex (large Δ⁰θ) are automatically self-penalizing
98 Evidence for Gaussian Parameters
99 Bayesian Model Selection vs. No Free Lunch Theorem Bayesian model selection –Ignores the prior over the space of models –Effectively assumes that it is uniform –Does not take into account how models correspond to underlying target functions –Usually corresponds to a non-uniform prior over target functions No Free Lunch Theorem –Allows that for some particular non-uniform prior there may be an algorithm that gives better-than-chance, or even optimal, results
100 Error Rate as a Function of the Number n of Training Samples Classifiers trained with a small number of samples will not perform well on new data Typical steps –Estimate the unknown parameters from samples –Use these estimates to determine the classifier –Calculate the error rate for the resulting classifier
101 Analytical Analysis Case of two categories having equal prior probabilities Partition the feature space into some m disjoint cells, C1, ..., Cm The conditional probabilities p(x|ω1) and p(x|ω2) do not vary appreciably within any cell We need only know which cell x falls in
102 Analytical Analysis
103 Analytical Analysis
104 Analytical Analysis
105 Results of Simulation Experiments
106 Discussions on Error Rate for Given n For every curve involving finite n there is an optimal number of cells At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q If the number of cells becomes too large, there will not be enough training patterns to fill them –Eventually the number of patterns in most cells will be zero
107 Discussions on Error Rate for Given n For n = 500, the minimal error rate occurs somewhere around m = 20 Form the cells by dividing each feature axis into l intervals With d features, m = l^d If l = 2, using more than four or five binary features will lead to worse rather than better performance
108 Test Errors vs. Number of Training Patterns
109 Test and Training Error
110 Power Law
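The equation on this slide is not in the text; the empirical power law usually quoted in this context, which the slide presumably showed in some form, expresses the test error as a decreasing power of the training-set size n (a, b and α > 0 are problem-dependent constants):

```latex
E_{\text{test}}(n) \;\approx\; a \;+\; \frac{b}{n^{\alpha}}
```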
111 Sum and Difference of Test and Training Error
112 Fraction of Dichotomies of n Points in d Dimensions That are Linear
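The formula itself was an image; the standard counting result (Cover’s function counting theorem), which matches the f(n, d) values used on the next slide, is:

```latex
f(n, d) =
\begin{cases}
1 & n \le d + 1,\\[4pt]
\dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i} & n > d + 1.
\end{cases}
```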
113 One-Dimensional Case: f(n = 4, d = 1) = 0.5 Of the 16 possible labelings of four points on a line, 8 are linearly separable: 0000, 0001, 0011, 0111, 1111, 1110, 1100, 1000 The other 8 are not: 0010, 0100, 0101, 0110, 1001, 1010, 1011, 1101
114 Capacity of a Separating Plane Not until n is a sizable fraction of 2(d+1) does the problem begin to become difficult Capacity of a hyperplane –At n = 2(d+1), half of the possible dichotomies are still linear We cannot expect a linear classifier to “match” a problem, on average, if the dimension of the feature space is greater than n/2 - 1
115 Mixture-of-Expert Models Classifiers whose decision is based on the outputs of component classifiers Also called –Ensemble classifiers –Modular classifiers –Pooled classifiers Useful if each component classifier is highly trained (an “expert”) in a different region of the feature space
116 Mixture Model for Producing Patterns
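The mixture model on this slide was an equation image; a sketch of the usual mixture-of-experts form, in which a gating distribution selects among k component densities (θ₀ are the gating parameters and θ_r the parameters of the r-th expert):

```latex
P(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\Theta})
\;=\;
\sum_{r=1}^{k} P\bigl(r \mid \mathbf{x}, \boldsymbol{\theta}_0\bigr)\;
P\bigl(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}_r\bigr)
```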
117 Mixture-of-Experts Architecture
118 Ensemble Classifiers
119 Maximum-Likelihood Estimation
120 Final Decision Rule Choose the category corresponding to the maximum discriminant value after the pooling system Winner-take-all method –Use the decision of the single component classifier that is the “most confident”, i.e., has the largest g_rj –Suboptimal but simple –Works well if the component classifiers are experts in separate regions
121 Component Classifiers without Discriminant Functions Examples –A KNN classifier (rank order) –A decision tree (label) –A neural network (analog value) –A rule-based system (label)
122 Heuristics to Convert Outputs to Discriminant Values
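The heuristics themselves were shown as equations; a reconstruction consistent with the worked table on the next slide (ã_i is the analog output for category i, r_i its rank among the c categories, and a softmax is assumed for the analog case):

```latex
\text{analog:}\quad g_i = \frac{e^{\tilde{a}_i}}{\sum_{j=1}^{c} e^{\tilde{a}_j}},
\qquad
\text{rank order:}\quad g_i = \frac{c + 1 - r_i}{\sum_{j=1}^{c} j},
\qquad
\text{one-of-}c:\quad g_i \in \{0, 1\},\ \ \textstyle\sum_i g_i = 1
```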
123 Illustration Examples (outputs of three component classifiers for six categories, each converted to discriminant values g_i)
Analog  g_i    | Rank   g_i         | One-of-c  g_i
0.4     0.158  | 3rd    4/21=0.194  | 0     0.0
0.6     0.193  | 6th    1/21=0.048  | 1     1.0
0.9     0.260  | 5th    2/21=0.095  | 0     0.0
0.3     0.143  | 1st    6/21=0.286  | 0     0.0
0.2     0.129  | 2nd    5/21=0.238  | 0     0.0
0.1     0.111  | 4th    3/21=0.143  | 0     0.0