1 Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
2 Some Fundamental Problems Which algorithm is “best”? Are there any reasons to favor one algorithm over another? Is “Occam’s razor” really so evident? Do simpler or “smoother” classifiers generalize better? If so, why? Are there fundamental “conservation” or “constraint” laws other than the Bayes error rate?
3 Meaning of “Algorithm-Independent” Mathematical foundations that do not depend upon the particular classifier or learning algorithm used –e.g., the bias and variance concept Techniques that can be used in conjunction with different learning algorithms, or provide guidance in their use –e.g., cross-validation and resampling techniques
4 Roadmap No pattern classification method is inherently superior to any other Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses Estimation of accuracies and comparison of different classifiers under certain assumptions Methods for integrating component classifiers
5 Generalization Performance by Off-Training-Set Error Consider a two-category problem Training set D consists of patterns x_i with labels y_i = 1 or -1 for i = 1, ..., n, generated by an unknown target function F(x) to be learned F(x) often has a random component –The same input could lead to different categories –Giving a non-zero Bayes error
6 Generalization Performance by Off-Training-Set Error Let H be the (discrete) set of hypotheses, or sets of parameters, to be learned A particular h belongs to H –Quantized weights in a neural network –Parameters θ in a functional model –Sets of decisions in a tree P(h): prior probability that the algorithm will produce hypothesis h after training
7 Generalization Performance by Off-Training-Set Error P(h|D): probability that the algorithm will yield h when trained on data D –Nearest-neighbor and decision tree: non-zero only for a single hypothesis –Neural network: can be a broad distribution E: error for the zero-one or other loss function
8 Generalization Performance by Off-Training-Set Error A natural measure is the expected off-training-set classification error for the k-th candidate learning algorithm
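The equation itself is not in the text; a reconstruction in the usual notation (P(x) is the sampling distribution, δ(·,·) the Kronecker delta, and P_k(h(x)|D) the probability that algorithm k produces hypothesis h from D) is assumed to be of this form:

```latex
\mathcal{E}_k(E \mid F, n) \;=\; \sum_{\mathbf{x} \notin D} P(\mathbf{x})\,\bigl[\,1 - \delta\bigl(F(\mathbf{x}), h(\mathbf{x})\bigr)\bigr]\; P_k\bigl(h(\mathbf{x}) \mid D\bigr)
```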
9 No Free Lunch Theorem
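The theorem was shown as an equation slide; a reconstruction of its usual statement (following Duda, Hart & Stork), with E_k the expected off-training-set error of algorithm k:

```latex
\begin{aligned}
&\text{For any two algorithms } P_1(h \mid D) \text{ and } P_2(h \mid D),\ \text{independently of } P(\mathbf{x}) \text{ and } n:\\
&\text{1. uniformly averaged over all targets } F:\quad \textstyle\sum_F \bigl[\mathcal{E}_1(E \mid F, n) - \mathcal{E}_2(E \mid F, n)\bigr] = 0,\\
&\text{2. for any fixed } D,\ \text{uniformly averaged over } F:\quad \textstyle\sum_F \bigl[\mathcal{E}_1(E \mid F, D) - \mathcal{E}_2(E \mid F, D)\bigr] = 0.
\end{aligned}
```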
10 No Free Lunch Theorem For any two algorithms: no matter how clever we are in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one There is at least one target function for which random guessing is a better algorithm
11 No Free Lunch Theorem Even if we know D, averaged over all target functions no algorithm yields an off-training-set error that is superior to that of any other
12 Example 1: No Free Lunch for Binary Data [Table: the eight three-bit patterns x = 000, …, 111 with target values F(x) and two hypotheses h1(x), h2(x); the first three patterns form the training set D, on which both hypotheses agree with F]
13 No Free Lunch Theorem
14 Conservation in Generalization We cannot achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems We can trade performance on problems we do not expect to encounter for performance on those we do expect to encounter It is the assumptions about the learning domains that are relevant
15 Ugly Duckling Theorem In the absence of assumptions there is no privileged or “best” feature representation Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct
16 Venn Diagram Representation of Features as Predicates
17 Rank of a Predicate The rank of a predicate is the number of the simplest or indivisible elements it contains Example: rank r = 1 –x1: f1 AND NOT f2 –x2: f1 AND f2 –x3: f2 AND NOT f1 –x4: NOT(f1 OR f2) –C(4,1) = 4 predicates
18 Examples of Rank of a Predicate Rank r = 2 –x1 OR x2: f1 –x1 OR x3: f1 XOR f2 –x1 OR x4: NOT f2 –x2 OR x3: f2 –x2 OR x4: (f1 AND f2) OR NOT(f1 OR f2) –x3 OR x4: NOT f1 –C(4,2) = 6 predicates
19 Examples of Rank of a Predicate Rank r = 3 –x1 OR x2 OR x3: f1 OR f2 –x1 OR x2 OR x4: f1 OR NOT f2 –x1 OR x3 OR x4: NOT(f1 AND f2) –x2 OR x3 OR x4: f2 OR NOT f1 –C(4,3) = 4 predicates
20 Total Number of Predicates in Absence of Constraints Let d be the number of regions in the Venn diagram (i.e., the number of distinct patterns, or the number of possible values determined by combinations of the features)
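With no constraints, every non-empty subset of the d regions is a predicate, so the total count (presumably the formula the slide displayed) is:

```latex
\sum_{r=1}^{d} \binom{d}{r} \;=\; 2^{d} - 1
```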
21 A Measure of Similarity in Absence of Prior Information Number of features or attributes shared by two patterns –Conceptual difficulties: e.g., with features blind_in_right_eye and blind_in_left_eye, (1,0) is more similar to (1,1) and (0,0) than to (0,1) –There are always multiple ways to represent vectors of attributes, e.g., blind_in_right_eye and same_in_both_eyes –No principled reason to prefer one of these representations over another
22 A Plausible Measure of Similarity in Absence of Prior Information Number of predicates the patterns share Consider two distinct patterns –No predicate of rank 1 is shared –1 predicate of rank 2 is shared –C(d-2, 1) predicates of rank 3 are shared –C(d-2, r-2) predicates of rank r are shared –Total number of predicates shared is given below
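Summing these counts over all ranks gives a total that does not depend on which two patterns were chosen (presumably the sum shown on the slide):

```latex
\sum_{r=2}^{d} \binom{d-2}{r-2} \;=\; 2^{d-2}
```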
23 Ugly Duckling Theorem Given a finite set of predicates that enables us to distinguish any two patterns, the number of predicates shared by any two such patterns is constant and independent of the choice of those patterns Furthermore, if pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar”
24 Ugly Duckling Theorem There is no problem-independent or privileged or “best” set of features or feature attributes The theorem also applies to continuous feature spaces
25 Minimum Description Length (MDL) Find some irreducible, smallest representation (“signal”) of all members of a category All variation among the individual patterns is then “noise” By simplifying recognizers appropriately, the signal can be retained while the noise is ignored
26 Algorithm Complexity (Kolmogorov Complexity) The Kolmogorov complexity of a binary string x is defined –On an abstract computer (Turing machine) U –As the length of the shortest program (binary string) y –That, without additional data, computes the string x and halts
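A compact way to write this definition, assuming the usual notation with |y| the length of program y and U(y) the output of U run on y:

```latex
K(x) \;=\; \min_{\{y \,:\, U(y) = x\}} |y|
```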
27 Algorithm Complexity Example Suppose x consists solely of n 1s Use some fixed number of bits k to specify a loop for printing a string of 1s Need log2 n more bits to specify the iteration count n, the condition for halting Thus K(x) = O(log2 n)
28 Algorithm Complexity Example The constant π = 11.001001000011111110110101010001… (binary) The shortest program is the one that can produce any arbitrarily large number of consecutive digits of π Thus K(x) = O(1)
29 Algorithm Complexity Example Assume that x is a “truly” random binary string It cannot be expressed as a shorter string Thus K(x) = O(|x|)
30 Minimum Description Length (MDL) Principle Minimize the sum of the model’s algorithmic complexity and the description length of the training data with respect to that model
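In symbols, with K(·) denoting algorithmic complexity, the MDL principle selects (a sketch of the usual formulation; the slide’s equation is assumed to be equivalent):

```latex
h^{*} \;=\; \arg\min_{h}\; \bigl[\, K(h) \;+\; K(D \text{ using } h) \,\bigr]
```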
31 An Application of MDL Principle For decision-tree classifiers, a model h specifies the tree and the decisions at the nodes Algorithmic complexity of h is proportional to the number of nodes Complexity of data is expressed in terms of the entropy (in bits) of the data Tree-pruning based on entropy is equivalent to the MDL principle
32 Convergence of MDL Classifiers MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data We cannot prove that the MDL principle leads to superior performance in the finite-data case This is still consistent with the no-free-lunch principle
33 Bayesian Perspective of MDL Principle
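The connection presumably shown on this slide: choosing the maximum-a-posteriori model is the same as minimizing a two-part description length, since

```latex
\arg\max_{h} P(h)\,P(D \mid h) \;=\; \arg\min_{h} \bigl[\, -\log_2 P(h) \;-\; \log_2 P(D \mid h) \,\bigr],
```

where -log2 P(h) is the description length (in bits) of the model and -log2 P(D|h) that of the data encoded with the model.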
34 Overfitting Avoidance Avoiding overfitting or minimizing description length is not inherently beneficial These techniques amount to a preference over the forms or parameters of classifiers They are beneficial only if they match their problems There are problems for which overfitting avoidance leads to worse performance
35 Explanation of Success of Occam’s Razor Through evolution and strong selection pressure on our neurons, we are likely to ignore problems for which Occam’s razor does not hold Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology Principle of satisficing: creating an adequate though possibly non-optimal solution
36 Bias and Variance Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem Bias –Accuracy or quality of the match –High bias implies a poor match Variance –Precision or specificity of the match –High variance implies a weak match
37 Bias and Variance for Regression
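The equations on this slide are not in the text; the standard decomposition of the mean-squared estimation error into bias and variance terms, which is assumed to be what was shown, is:

```latex
\mathcal{E}_D\!\left[\bigl(g(\mathbf{x};D) - F(\mathbf{x})\bigr)^2\right]
\;=\;
\underbrace{\bigl(\mathcal{E}_D[g(\mathbf{x};D)] - F(\mathbf{x})\bigr)^2}_{\text{bias}^2}
\;+\;
\underbrace{\mathcal{E}_D\!\left[\bigl(g(\mathbf{x};D) - \mathcal{E}_D[g(\mathbf{x};D)]\bigr)^2\right]}_{\text{variance}}
```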
38 Bias-Variance Dilemma
39 Bias-Variance Dilemma For a given target function: A model with many parameters –Generally has low bias –Fits the data well –Yields high variance A model with few parameters –Generally has high bias –May not fit the data well –The fit does not change much for different data sets (low variance)
40 Bias-Variance Dilemma The best way to get both low bias and low variance is to have prior information about the target function We virtually never get zero bias and zero variance –That would require having only one learning problem to solve, with the answer already known A large amount of training data will yield improved performance as long as the model is sufficiently general
41 Bias and Variance for Classification Reference –J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,” Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 55-77, 1997.
42 Bias and Variance for Classification Two-category problem Target function F(x) = Pr[y=1|x] = 1 - Pr[y=0|x] Discriminant function y_d = F(x) + ε ε: zero-mean, centered binomially distributed random variable F(x) = E[y_d|x]
43 Bias and Variance for Classification Find an estimate g(x;D) to minimize E_D[(g(x;D) - y_d)^2] The estimate g(x;D) can be used to classify x The bias and variance concepts for regression can be applied to g(x;D) as an estimate of F(x) However, this is not related directly to classification error
44 Bias and Variance for Classification
45 Bias and Variance for Classification
46 Bias and Variance for Classification
47 Bias and Variance for Classification
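These derivation slides were equation images; the key relation in Friedman’s analysis, which the surrounding slides discuss, can be written as follows (y_B denotes the Bayes decision at x; the intermediate steps shown on the slides are not reproduced here):

```latex
\Pr\bigl[g(\mathbf{x};D) \neq y\bigr]
\;=\;
\bigl|\,2F(\mathbf{x}) - 1\,\bigr|\;\Pr\bigl[g(\mathbf{x};D) \neq y_B\bigr]
\;+\;
\Pr\bigl[y_B \neq y\bigr]
```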
48 Bias and Variance for Classification The sign of the boundary bias affects the role of variance in the error If the sign is positive, low variance is generally important for accurate classification Low boundary bias need not result in a lower error rate Simple methods often have lower variance, and need not be inferior to more flexible methods
49 Error rates and optimal K vs. N for d = 20 in KNN
50 Estimation and classification error vs. d for N = 12800 in KNN
51 Boundary Bias-Variance Trade-Off
52 Leave-One-Out Method (Jackknife)
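The jackknife formulas presumably shown here, in the usual notation (θ̂ is the estimate from all n points and θ̂_(i) the leave-one-out estimate computed with point i removed):

```latex
\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n} \hat{\theta}_{(i)},
\qquad
\widehat{\mathrm{Var}}_{\text{jack}}\bigl[\hat{\theta}\bigr]
= \frac{n-1}{n}\sum_{i=1}^{n} \bigl(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}\bigr)^2
```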
53 Generalization to Estimates of Other Statistics Estimator of other statistics –median, 25th percentile, mode, etc. Leave-one-out estimate Jackknife estimate and its related variance
54 Jackknife Bias Estimate
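A sketch of the standard jackknife bias estimate and the corresponding bias-corrected estimator, in the same notation as above:

```latex
\widehat{\mathrm{bias}}_{\text{jack}} = (n-1)\bigl(\hat{\theta}_{(\cdot)} - \hat{\theta}\bigr),
\qquad
\tilde{\theta} = \hat{\theta} - \widehat{\mathrm{bias}}_{\text{jack}} = n\,\hat{\theta} - (n-1)\,\hat{\theta}_{(\cdot)}
```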
55 Example 2: Jackknife for Mode
56 Bootstrap Randomly select n points from the training data set D, with replacement Repeat this process independently B times to yield B bootstrap data sets, treated as independent sets Bootstrap estimate of a statistic: compute the statistic on each bootstrap set and combine the results (see the formulas below)
57 Bootstrap Bias and Variance Estimate
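The bootstrap formulas presumably displayed on these two slides, with θ̂*(b) the estimate computed from the b-th bootstrap set and θ̂ the estimate from the original data:

```latex
\hat{\theta}^{*(\cdot)} = \frac{1}{B}\sum_{b=1}^{B} \hat{\theta}^{*(b)},
\qquad
\widehat{\mathrm{bias}}_{\text{boot}} = \hat{\theta}^{*(\cdot)} - \hat{\theta},
\qquad
\widehat{\mathrm{Var}}_{\text{boot}}\bigl[\hat{\theta}\bigr] = \frac{1}{B}\sum_{b=1}^{B}\bigl(\hat{\theta}^{*(b)} - \hat{\theta}^{*(\cdot)}\bigr)^2
```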
58 Properties of Bootstrap Estimates The larger the number B of bootstrap samples, the more satisfactory the estimate of a statistic and its variance B can be adjusted to the available computational resources –The jackknife estimate, in contrast, requires exactly n repetitions
59 Bagging Arcing –Adaptive reweighting and combining –Reusing or selecting data in order to improve classification, e.g., AdaBoost Bagging –Bootstrap aggregation –Multiple versions of D, obtained by drawing n' < n samples from D with replacement –Each set trains a component classifier –The final decision is based on a vote of the component classifiers (see the sketch below)
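A minimal Python sketch of bagging, assuming a generic train(X, y) routine that returns a classifier with a predict method (both are placeholders, not part of the slides):

```python
import numpy as np

def bagging_fit(train, X, y, B=25, n_prime=None, seed=0):
    """Train B component classifiers, each on a bootstrap replicate of D."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_prime = n_prime or n                      # size of each bootstrap sample
    # sample indices with replacement and train one classifier per replicate
    return [train(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n_prime) for _ in range(B))]

def bagging_predict(classifiers, X):
    """Final decision: majority vote of the component classifiers."""
    votes = np.stack([clf.predict(X) for clf in classifiers])   # shape (B, n_samples)
    # majority vote per test point; class labels assumed to be non-negative integers
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```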
60 Unstable Algorithm and Bagging Unstable algorithm –“Small” changes in the training data lead to significantly different classifiers and relatively “large” changes in accuracy In general bagging improves recognition for unstable classifiers –It effectively averages over such discontinuities –There are no convincing theoretical derivations or simulation studies showing that bagging helps for all unstable classifiers
61 Boosting Create the first classifier –With accuracy on the training set greater than average (a weak learner) Add a new classifier –To form an ensemble –The joint decision rule has much higher accuracy on the training set The classification performance has been “boosted”
62 Boosting Procedure Train successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers
63 Training Data and Weak Learner
64 “Most Informative” Set Given C1 Flip a fair coin Heads –Select remaining samples from D –Present them to C1 one by one until C1 misclassifies a pattern –Add the misclassified pattern to D2 Tails –Present remaining samples until C1 classifies a pattern correctly, and add that pattern to D2
65 Third Data Set and Classifier C3 Randomly select a remaining training pattern Add the pattern to D3 if C1 and C2 disagree, otherwise ignore it
66 Classification of a Test Pattern If C1 and C2 agree, use their label If they disagree, trust C3
67 Choosing n1 For the final vote, n1 ~ n2 ~ n3 ~ n/3 is desired Reasonable guess: n1 = n/3 Simple problem: n2 << n1 Difficult problem: n2 too large In practice we need to run the whole boosting procedure a few times –To use the full training set –To get roughly equal partitions of the training set
68 AdaBoost Adaptive boosting The most popular version of basic boosting Continue adding weak learners until some desired low training error has been achieved
69 AdaBoost Algorithm
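The algorithm itself was an image; below is a minimal Python sketch of discrete AdaBoost under the usual formulation. train_weak is a placeholder for any weak learner that accepts sample weights, and labels are assumed to be ±1:

```python
import numpy as np

def adaboost(X, y, train_weak, k_max):
    """Sketch of discrete AdaBoost. train_weak(X, y, W) is assumed to return a
    classifier h with an h.predict(X) method (placeholder interface)."""
    y = np.asarray(y)
    n = len(y)
    W = np.full(n, 1.0 / n)               # start with uniform weights
    hypotheses, alphas = [], []
    for _ in range(k_max):
        h = train_weak(X, y, W)           # train weak learner on weighted data
        pred = h.predict(X)
        err = np.sum(W[pred != y])        # weighted training error
        if err >= 0.5:                    # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        W *= np.exp(-alpha * y * pred)    # up-weight misclassified points
        W /= W.sum()                      # renormalize to a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(X, hypotheses, alphas):
    # final discriminant g(x) = sum_k alpha_k h_k(x); decide by its sign
    g = sum(a * h.predict(X) for a, h in zip(alphas, hypotheses))
    return np.sign(g)
```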
70 Final Decision Discriminant function
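The discriminant presumably shown here is the weighted sum of the component hypotheses, with the class given by its sign:

```latex
g(\mathbf{x}) \;=\; \sum_{k=1}^{k_{\max}} \alpha_k\, h_k(\mathbf{x}),
\qquad
\hat{y} = \operatorname{sgn}\bigl(g(\mathbf{x})\bigr)
```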
71 Ensemble Training Error
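The slide’s equation is not in the text; the standard bound on the AdaBoost ensemble training error in terms of the weighted errors ε_k of the component classifiers is assumed to be what was shown:

```latex
E_{\text{train}} \;\le\; \prod_{k=1}^{k_{\max}} 2\sqrt{\varepsilon_k\,(1-\varepsilon_k)}
\;=\; \prod_{k=1}^{k_{\max}} \sqrt{1 - 4G_k^{2}},
\qquad G_k = \tfrac{1}{2} - \varepsilon_k
```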
72 AdaBoost vs. No Free Lunch Theorem Boosting only improves classification if the component classifiers perform better than chance –This cannot be guaranteed a priori An exponential reduction in error on the training set does not ensure a reduction of the off-training-set (generalization) error Boosting has proven effective in many real-world applications
73 Learning with Queries Start with a set of unlabeled patterns There exists some (possibly costly) way of labeling any pattern The goal is to determine which unlabeled patterns would be most informative if they were presented as a query to an oracle Also called active learning or interactive learning Can be refined further as cost-based learning
74 Application Example Design a classifier for handwritten numerals Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern Human as the oracle
75 Learning with Queries Begin with a preliminary, weak classifier developed with a small set of labeled samples Two related methods for selecting an informative pattern –Confidence-based query selection –Voting-based or committee-based query selection
76 Selecting the Most Informative Patterns Confidence-based query selection –Choose a pattern for which the two largest discriminant functions have nearly the same value –i.e., patterns that lie near the current decision boundaries Voting-based query selection –Choose the pattern that yields the greatest disagreement among the k resulting category labels
77 Active Learning Example
78 Arcing and Active Learning vs. IID Sampling If we take a model of the true distribution and train it with a highly skewed distribution produced by active learning, the final classifier accuracy might be low Resampling methods generally do not attempt to model or fit the full category distributions –They do not fit parameters in a model –Instead they seek decision boundaries directly
79 Arcing and Active Learning As the number of component classifiers is increased, resampling, boosting and related methods effectively broaden the class of implementable functions They allow us to try to “match” the final classifier to the problem by indirectly adjusting the bias and variance They can be used with arbitrary classification techniques
80 Estimating the Generalization Rate See if the classifier performs well enough to be useful Compare its performance with that of a competing design Requires making assumptions about the classifier or the problem or both All the methods given here are heuristic
81 Parametric Models Compute the error rate from the assumed parametric model Example: two-class multivariate normal case –Bhattacharyya or Chernoff bounds using the estimated means and covariance matrices Problems –Overly optimistic –Always suspect the model –The error rate may be difficult to compute
82 Simple Cross-Validation Randomly split the set of labeled training samples D into a training set and a validation set
83 m-Fold Cross-Validation The training set is randomly divided into m disjoint sets of equal size n/m The classifier is trained m times –Each time with a different set held out as a validation set The estimated performance is the mean of these m errors When m = n, it is in effect the leave-one-out approach (see the sketch below)
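A minimal Python sketch of m-fold cross-validation, assuming generic train(X, y) and error_rate(clf, X, y) routines (placeholders, not from the slides); setting m = n gives leave-one-out:

```python
import numpy as np

def m_fold_cv_error(train, error_rate, X, y, m=10, seed=0):
    """Estimate generalization error by m-fold cross-validation."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, m)             # m disjoint validation sets
    errors = []
    for k in range(m):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(m) if j != k])
        clf = train(X[trn], y[trn])             # train on the other m-1 folds
        errors.append(error_rate(clf, X[val], y[val]))
    return np.mean(errors)                      # mean of the m validation errors
```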
84 Forms of Learning for Cross-Validation Neural networks of fixed topology –Number of epochs or presentations of the training set –Number of hidden units Width of the Gaussian window in Parzen windows Optimal k in the k-nearest-neighbor classifier
85 Portion of D as a Validation Set Should be small –The validation set is used merely to know when to stop adjusting parameters –The training set is used to set the large number of parameters or degrees of freedom Traditional default –Set the validation fraction γ = 0.1 –Proven effective in many applications
86 Anti-Cross-Validation Cross-validation need not work on every problem Anti-cross-validation –Halt when the validation error reaches its first local maximum –Must explore different values of γ –Possibly abandon the use of cross-validation if performance cannot be improved
87 Estimation of Error Rate Let p be the true and unknown error rate of the classifier Assume k of the n' independent, randomly drawn test samples are misclassified; then k has a binomial distribution The maximum-likelihood estimate for p follows directly (see below)
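The distribution of k and the maximum-likelihood estimate of p, assumed to be the equations displayed on the slide:

```latex
P(k) \;=\; \binom{n'}{k}\, p^{k}\,(1-p)^{\,n'-k},
\qquad
\hat{p} \;=\; \frac{k}{n'}
```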
88 95% Confidence Intervals for a Given Estimated p
89 Jackknife Estimation of Classification Accuracy Use the leave-one-out approach Obtain the jackknife estimate for the mean and variance of the leave-one-out accuracies Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance
90 Jackknife Estimation of Classification Accuracy
91 Bootstrap Estimation of Classification Accuracy Train B classifiers, each with a different bootstrap data set Test on the other bootstrap data sets The bootstrap estimate is the mean of these bootstrap accuracies
92 Maximum-Likelihood Comparison (ML-II) Also called maximum-likelihood selection Find the maximum-likelihood parameters for each of the candidate models Calculate the resulting likelihoods (evidences) Choose the model with the largest likelihood
93 Maximum-Likelihood Comparison
94 Scientific Process *D. J. C. MacKay, “Bayesian interpolation,” Neural Computation, 4(3), 415-447, 1992
95 Bayesian Model Comparison
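The formulas on this and the next two slides were images; a reconstruction of the usual presentation (following MacKay), where θ are the parameters of model h_i, Δθ is the posterior width, and Δ⁰θ the prior width:

```latex
P(h_i \mid D) \;\propto\; P(D \mid h_i)\,P(h_i),
\qquad
P(D \mid h_i) \;=\; \int P(D \mid \boldsymbol{\theta}, h_i)\, p(\boldsymbol{\theta} \mid h_i)\, d\boldsymbol{\theta}
\;\approx\;
\underbrace{P(D \mid \hat{\boldsymbol{\theta}}, h_i)}_{\text{best-fit likelihood}}
\;\cdot\;
\underbrace{\frac{\Delta\boldsymbol{\theta}}{\Delta^{0}\boldsymbol{\theta}}}_{\text{Occam factor}}
```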
96 Concept of Occam Factor
97 Concept of Occam Factor There is an inherent bias toward simple models (small prior parameter range Δ⁰θ) Models that are overly complex (large Δ⁰θ) are automatically self-penalizing
98 Evidence for Gaussian Parameters
99 Bayesian Model Selection vs. No Free Lunch Theorem Bayesian model selection –Ignores the prior over the space of models –Effectively assumes that it is uniform –Does not take into account how models correspond to underlying target functions –Usually corresponds to a non-uniform prior over target functions No Free Lunch Theorem –Allows that for some particular non-uniform prior there may be an algorithm that gives better-than-chance, or even optimal, results
100 Error Rate as a Function of the Number n of Training Samples Classifiers trained with a small number of samples will not perform well on new data Typical steps –Estimate the unknown parameters from samples –Use these estimates to determine the classifier –Calculate the error rate for the resulting classifier
101 Analytical Analysis Case of two categories having equal prior probabilities Partition the feature space into some m disjoint cells, C1, ..., Cm The conditional probabilities p(x|ω1) and p(x|ω2) do not vary appreciably within any cell We need only know which cell x falls in
102 Analytical Analysis
103 Analytical Analysis
104 Analytical Analysis
105 Results of Simulation Experiments
106 Discussions on Error Rate for Given n For every curve involving finite n there is an optimal number of cells At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q If the number of cells becomes too large, there will not be enough training patterns to fill them –Eventually the number of patterns in most cells will be zero
107 Discussions on Error Rate for Given n For n = 500, the minimal error rate occurs somewhere around m = 20 Form the cells by dividing each feature axis into l intervals With d features, m = l^d If l = 2, using more than four or five binary features will lead to worse rather than better performance
108 Test Errors vs. Number of Training Patterns
109 Test and Training Error
110 Power Law
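The equation on this slide is not in the text; the empirical power law usually quoted in this context, which the slide presumably showed in some form, expresses the test error as a decreasing power of the training-set size n (a, b and α > 0 are problem-dependent constants):

```latex
E_{\text{test}}(n) \;\approx\; a \;+\; \frac{b}{n^{\alpha}}
```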
111 Sum and Difference of Test and Training Error
112 Fraction of Dichotomies of n Points in d Dimensions That are Linear
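The formula itself was an image; the standard counting result (Cover’s function counting theorem), which matches the f(n, d) values used on the next slide, is:

```latex
f(n, d) =
\begin{cases}
1 & n \le d + 1,\\[4pt]
\dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i} & n > d + 1.
\end{cases}
```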
113 One-Dimensional Case: f(n = 4, d = 1) = 0.5 Of the 16 possible labelings of four points on a line, 8 are linearly separable: 0000, 0001, 0011, 0111, 1111, 1110, 1100, 1000 The other 8 are not: 0010, 0100, 0101, 0110, 1001, 1010, 1011, 1101
114 Capacity of a Separating Plane Not until n is a sizable fraction of 2(d+1) does the problem begin to become difficult Capacity of a hyperplane –At n = 2(d+1), half of the possible dichotomies are still linear We cannot expect a linear classifier to “match” a problem, on average, if the dimension of the feature space is greater than n/2 - 1
115 Mixture-of-Expert Models Classifiers whose decision is based on the outputs of component classifiers Also called –Ensemble classifiers –Modular classifiers –Pooled classifiers Useful if each component classifier is highly trained (an “expert”) in a different region of the feature space
116 Mixture Model for Producing Patterns
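The mixture model on this slide was an equation image; a sketch of the usual mixture-of-experts form, in which a gating distribution selects among k component densities (θ₀ are the gating parameters and θ_r the parameters of the r-th expert):

```latex
P(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\Theta})
\;=\;
\sum_{r=1}^{k} P\bigl(r \mid \mathbf{x}, \boldsymbol{\theta}_0\bigr)\;
P\bigl(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}_r\bigr)
```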
117 Mixture-of-Experts Architecture
118 Ensemble Classifiers
119 Maximum-Likelihood Estimation
120 Final Decision Rule Choose the category corresponding to the maximum discriminant value after the pooling system Winner-take-all method –Use the decision of the single component classifier that is the “most confident”, i.e., has the largest g_rj –Suboptimal but simple –Works well if the component classifiers are experts in separate regions
121 Component Classifiers without Discriminant Functions Examples –A KNN classifier (rank order) –A decision tree (label) –A neural network (analog value) –A rule-based system (label)
122 Heuristics to Convert Outputs to Discriminant Values
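The heuristics themselves were shown as equations; a reconstruction consistent with the worked table on the next slide (ã_i is the analog output for category i, r_i its rank among the c categories, and a softmax is assumed for the analog case):

```latex
\text{analog:}\quad g_i = \frac{e^{\tilde{a}_i}}{\sum_{j=1}^{c} e^{\tilde{a}_j}},
\qquad
\text{rank order:}\quad g_i = \frac{c + 1 - r_i}{\sum_{j=1}^{c} j},
\qquad
\text{one-of-}c:\quad g_i \in \{0, 1\},\ \ \textstyle\sum_i g_i = 1
```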
123 Illustration Examples (outputs of three component classifiers for six categories, each converted to discriminant values g_i)
Analog  g_i    | Rank   g_i         | One-of-c  g_i
0.4     0.158  | 3rd    4/21=0.194  | 0     0.0
0.6     0.193  | 6th    1/21=0.048  | 1     1.0
0.9     0.260  | 5th    2/21=0.095  | 0     0.0
0.3     0.143  | 1st    6/21=0.286  | 0     0.0
0.2     0.129  | 2nd    5/21=0.238  | 0     0.0
0.1     0.111  | 4th    3/21=0.143  | 0     0.0