A classification learning example


A classification learning example: predicting when Russell will wait for a table. Similar problems: learning book preferences, predicting credit card fraud, predicting when people are likely to respond to junk mail.

Uses different biases in predicting Russell's waiting habits:
--Decision trees: examples are used to learn the topology and the order of questions
--K-nearest neighbors
--Association rules: examples are used to learn the support and confidence of rules such as "If patrons=full and day=Friday then wait (0.3/0.7)" or "If wait>60 and reservation=no then wait (0.4/0.9)"
--SVMs
--Neural nets: examples are used to learn the topology and the edge weights
--Naïve Bayes (Bayes net learning): examples are used to learn the topology and the CPTs

Inductive Learning (Classification Learning)
Given a set of labeled examples and a space of hypotheses, find the rule that underlies the labeling (so you can use it to predict future unlabeled examples). Tabula rasa, fully supervised.
Idea:
--Loop through all hypotheses
--Rank each hypothesis in terms of its match to the data
--Pick the best hypothesis
The main problem is that the space of hypotheses is too large. Given examples described in terms of n boolean variables, there are 2^(2^n) different hypotheses. For 6 features, that is 2^64 = 18,446,744,073,709,551,616 hypotheses.
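A quick sanity check on that count (a minimal Python sketch; the number follows from the fact that each of the 2^n input rows can be labeled +ve or -ve independently):

```python
# Number of distinct boolean hypotheses over n boolean attributes: 2^(2^n).
def num_hypotheses(n: int) -> int:
    return 2 ** (2 ** n)

print(num_hypotheses(6))  # 18446744073709551616, i.e., 2^64
```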

Ranking hypotheses
A good hypothesis will have the fewest false positives (Fh+) and fewest false negatives (Fh-). [Ideally, we want them to be zero.] On training or testing data??
Rank(h) = f(Fh+, Fh-) (a loss function)
--f depends on the domain; by default f = sum, but we can give different weights to different errors (cost-based learning)
False +ve: the learner classifies the example as +ve, but it is actually -ve.
Example:
--H1: Russell waits only in Italian restaurants. False +ves: X10; false -ves: X1, X3, X4, X8, X12
--H2: Russell waits only in cheap French restaurants. False +ves: none; false -ves: X1, X3, X4, X6, X8, X12
Domain-dependent costs:
--Medical domain: higher cost for F-, but also high cost for F+
--Spam mailer: very low cost for F+, higher cost for F-
--Terrorist/criminal identification: high cost for F+ (for the individual), high cost for F- (for society)
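A minimal sketch of such a cost-weighted ranking (the weights and the function name are illustrative, not from the slides):

```python
# rank(h) = w_fp * (false positives) + w_fn * (false negatives); lower is better.
def rank(false_pos: int, false_neg: int, w_fp: float = 1.0, w_fn: float = 1.0) -> float:
    return w_fp * false_pos + w_fn * false_neg

# H1 above: 1 false positive, 5 false negatives; H2: 0 and 6.
print(rank(1, 5), rank(0, 6))                        # 6.0 6.0: tied when f = sum
print(rank(1, 5, w_fp=10.0), rank(0, 6, w_fp=10.0))  # 15.0 6.0: H2 wins if F+ is costly
```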

What is a reasonable goal in designing a learner?
(Idea) The learner must classify all new instances (test cases) correctly, always.
--Any test cases? Test cases drawn from the same distribution as the training cases.
--Always? Maybe the training samples are not completely representative of the test samples, so we go with "probably".
--Correctly? May be impossible if the training data has noise (the teacher may make mistakes too), so we go with "approximately".
The goal of a learner then is to produce a probably approximately correct (PAC) hypothesis, for a given approximation (error rate) ε and probability δ.
When is a learner A better than learner B? For the same ε, δ bounds, A needs fewer training samples than B to reach PAC (learning curves). Complexity is measured in the number of samples required to PAC-learn.

Deriving Sample Complexity for PAC Learning
IDEA: We want to compute the probability that a "bad" hypothesis (one that makes more than ε error on the test cases) is chosen for being consistent with the training examples, and constrain it to be less than δ.
--The probability that a bad hypothesis hb ∈ Hbad is consistent with a single training example is ≤ (1 - ε), since the error rate of hb is > ε. This holds ONLY because we assume training and test instances are drawn from the same distribution.
--The probability that hb is consistent with all N training examples is ≤ (1 - ε)^N.
--The probability that at least one bad hypothesis does this is ≤ |Hbad| (1 - ε)^N ≤ |H| (1 - ε)^N (since |Hbad| ≤ |H|).
--We want this probability to be less than δ. That is, |H| (1 - ε)^N ≤ δ.
--Since (1 - ε) ≤ e^(-ε), it suffices that |H| e^(-εN) ≤ δ, i.e., N ≥ (1/ε)(ln(1/δ) + ln |H|).
[Figure: a bad hypothesis hb drawn over a scatter of +ve and -ve examples]
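The final bound is easy to evaluate; a minimal sketch (the numbers follow directly from the formula above):

```python
import math

# N >= (1/epsilon) * (ln(1/delta) + ln |H|), from the PAC derivation above.
def pac_sample_bound(epsilon: float, delta: float, h_size: float) -> int:
    return math.ceil((1.0 / epsilon) * (math.log(1.0 / delta) + math.log(h_size)))

# For n=6 boolean features, |H| = 2^(2^6) = 2^64; with epsilon=0.1, delta=0.05:
print(pac_sample_bound(0.1, 0.05, 2.0 ** 64))  # 474 examples suffice
```

Note that the bound is logarithmic in |H|: even the astronomically large unrestricted hypothesis space needs only hundreds of samples at these ε, δ settings, though finding a consistent hypothesis may still be computationally hard.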

Inductive Learning (Classification Learning), continued
Given a set of labeled examples and a space of hypotheses, find the rule that underlies the labeling. Tabula rasa, fully supervised.
Idea: loop through all hypotheses, rank each in terms of its match to the data, and pick the best. But the space of hypotheses is too large: with n boolean variables there are 2^(2^n) different hypotheses (for 6 features, 2^64 = 18,446,744,073,709,551,616).
Main variations:
Bias: what "sort" of rule are you looking for? If you are looking for only conjunctive hypotheses, there are just 3^n.
Search:
--Greedy search (decision tree learner)
--Systematic search (version space learner)
--Iterative search (neural net learner)

Bias & Learning Accuracy: why is simpler better?
Having a weak bias (a large hypothesis space):
--allows us to capture more concepts
--..but increases learning cost
--and may lead to over-fitting
Also: the goal of a compression algorithm is to drive down the training error, but the goal of a learning algorithm is to drive down the test error.

Uses different biases in predicting Russell's waiting habits:
--Decision trees: examples are used to learn the topology and the order of questions
--K-nearest neighbors
--Association rules: examples are used to learn the support and confidence of rules such as "If patrons=full and day=Friday then wait (0.3/0.7)" or "If wait>60 and reservation=no then wait (0.4/0.9)"
--SVMs
--Neural nets: examples are used to learn the topology and the edge weights
--Naïve Bayes (Bayes net learning): examples are used to learn the topology and the CPTs

20 Questions: AI Style. Learning decision trees: how?
Basic idea:
--Pick an attribute
--Split the examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are -ve, label No. Terminate
--If some are +ve and some are -ve, continue splitting recursively
(Special case: decision stumps. If you don't feel like splitting any further, return the majority label.)
Which attribute to pick? (A sketch of the recursion follows below.)
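A minimal sketch of that recursion (illustrative names; the attribute chooser is left as a parameter, to be filled in by the information-gain heuristic a few slides later):

```python
# examples: list of (attribute_dict, label) pairs with labels '+' or '-'.
def build_tree(examples, attributes, choose_attribute):
    labels = [label for _, label in examples]
    if all(l == '+' for l in labels):
        return 'Yes'
    if all(l == '-' for l in labels):
        return 'No'
    if not attributes:  # no attributes left: return the majority label
        return 'Yes' if labels.count('+') >= labels.count('-') else 'No'
    a = choose_attribute(examples, attributes)
    tree = {'split_on': a, 'branches': {}}
    for v in {ex[a] for ex, _ in examples}:       # one branch per observed value
        subset = [(ex, l) for ex, l in examples if ex[a] == v]
        rest = [b for b in attributes if b != a]
        tree['branches'][v] = build_tree(subset, rest, choose_attribute)
    return tree
```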

Depending on the order we pick, we can get smaller or bigger trees Which tree is better? Why do you think so??

Decision Trees & Sample Complexity
Decision trees can represent any boolean function. ..So PAC-learning decision trees should be exponentially hard (since there are 2^(2^n) hypotheses). ..However, decision tree learning algorithms use greedy approaches for learning a good (rather than the optimal) decision tree. Thus, using greedy rather than exhaustive search of the hypothesis space is another way of keeping complexity low (at the expense of losing PAC guarantees).

Would you split on Patrons or Type?
Basic idea (as before):
--Pick an attribute and split the examples in terms of it
--If all examples are +ve, label Yes; if all are -ve, label No
--If some are +ve and some are -ve, continue splitting recursively
--If no attributes are left to split on, label with the majority element

The Information Gain Computation
Given k mutually exclusive and exhaustive events E1…Ek whose probabilities are p1…pk, the "information" content (entropy) is defined as Σi -pi log2 pi.
Before the split, with N+ positive and N- negative examples: P+ = N+/(N+ + N-), P- = N-/(N+ + N-), and
I(P+, P-) = -P+ log(P+) - P- log(P-)
is the expected number of comparisons needed to tell whether a given example is +ve or -ve.
Splitting on feature fk partitions the examples into k children with counts (N1+, N1-), (N2+, N2-), …, (Nk+, Nk-). The residual information is
Σi=1..k [(Ni+ + Ni-)/(N+ + N-)] · I(Pi+, Pi-)
The difference is the information gain. So, pick the feature with the largest information gain, i.e., the smallest residual information. A split is good if it reduces the entropy.
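A minimal sketch of this computation (function names are illustrative):

```python
import math

def entropy(pos: int, neg: int) -> float:
    """I(P+, P-) = -P+ log2 P+ - P- log2 P-, with 0 log 0 taken as 0."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(pos, neg, children):
    """children: per-branch (Ni+, Ni-) counts after splitting on a feature."""
    total = pos + neg
    residual = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(pos, neg) - residual
```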


A simple example
I(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1/2 + 1/2 = 1
I(1, 0) = -1 log2(1) - 0 log2(0) = 0
V(M) = 2/4 · I(1/2, 1/2) + 2/4 · I(1/2, 1/2) = 1
V(A) = 2/4 · I(1, 0) + 2/4 · I(0, 1) = 0
V(N) = 2/4 · I(1/2, 1/2) + 2/4 · I(1/2, 1/2) = 1
So Anxious is the best attribute to split on. Once you split on Anxious, the problem is solved.
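These values can be checked with the information_gain sketch above (the counts, 2 +ve and 2 -ve examples split two per branch, are read off the V(...) terms):

```python
# V(M): both branches are 50/50, so residual info is 1 and the gain is 0.
print(information_gain(2, 2, [(1, 1), (1, 1)]))  # 0.0
# V(A): both branches are pure, so residual info is 0 and the gain is 1.
print(information_gain(2, 2, [(2, 0), (0, 2)]))  # 1.0
```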

Evaluating the Decision Trees
m-fold cross-validation: split the N examples into m equal-sized parts; for i = 1..m, train with all parts except the i-th and test with the i-th part.
Learning curves: given N examples, partition them into a training set Ntr and test instances Ntest. Loop for i = 1 to |Ntr|: for subsets Ns of Ntr of size i, train the learner over Ns, test the learned pattern over Ntest, and compute the accuracy (% correct).
Russell domain vs. the "majority" function (say yes if the majority of attributes are yes).
Lesson: every bias makes some concepts easier to learn and others harder to learn…
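A minimal sketch of m-fold cross-validation (the train/test callables are placeholders for whatever learner is being evaluated):

```python
import random

def cross_validate(examples, m, train, test):
    """train(training_set) -> model; test(model, held_out) -> accuracy in [0, 1]."""
    examples = examples[:]          # shuffle a copy so the folds are unbiased
    random.shuffle(examples)
    folds = [examples[i::m] for i in range(m)]
    scores = []
    for i in range(m):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(test(train(training), held_out))
    return sum(scores) / m          # average accuracy over the m folds
```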

Decision Trees vs. Naïve Bayes (for the Russell restaurant scenario)
Decision trees are better if there is a "succinct" explanation in terms of a few features. NBC is better if all features wind up playing a role, e.g., spam mails.

Problems with the Info. Gain Heuristic
--Feature correlation: we are splitting on one feature at a time (the Costanza party problem). No obvious easy solution…
--Overfitting: we may look too hard for patterns where there are none, e.g., coin tosses classified by the day of the week, the shirt I was wearing, the time of day, etc. Solution: don't consider splitting if the information gain given by the best feature is below a minimum threshold. Can use the χ² test for statistical significance. Will also help when we have noisy samples…
--We may prefer features with very high branching, e.g., branching on the "universal time string" in the Russell restaurant example, or branching on social security number to look for patterns in who will get an A. Solution: the "gain ratio", the ratio of the information gain from attribute A to the information content of answering the question "What is the value of A?" The denominator is smaller for attributes with smaller domains. (A sketch follows below.)
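Continuing the entropy/information_gain sketch from above, the gain ratio divides by the split information, i.e., the entropy of the attribute's own value distribution:

```python
def gain_ratio(pos, neg, children):
    """children: per-branch (Ni+, Ni-) counts; uses entropy/information_gain above."""
    total = pos + neg
    split_info = 0.0                      # info content of "what is A's value?"
    for p, n in children:
        w = (p + n) / total
        if w > 0:
            split_info -= w * math.log2(w)
    gain = information_gain(pos, neg, children)
    return gain / split_info if split_info > 0 else 0.0
```

A many-valued attribute like a social security number has a huge split_info, so its spuriously high raw gain gets divided down.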

Decision Stumps
Decision stumps are decision trees where the leaf nodes do not necessarily have all +ve or all -ve training examples. This can happen either because the examples are noisy and mis-classified, or because you want to stop before reaching pure leaves.
When you reach such a node, you return the majority label as the decision. We can associate a confidence with that decision using the leaf proportions, e.g., P+ = N1+/(N1+ + N1-).
Sometimes, the best decision tree for a problem could be a decision stump (see the coin toss example next).

Mirror, mirror, on the wall, which learning bias is the best of all?
Well, there is no such thing, silly! Each bias makes it easier to learn some patterns and harder (or impossible) to learn others:
--A line-fitter can fit the best line to the data very fast, but won't know what to do if the data doesn't fall on a line
--A curve-fitter can fit lines as well as curves, but takes longer to fit lines than a line-fitter
--Different bias classes (decision trees, NNs, etc.) provide different ways of naturally carving up the space of all possible hypotheses:
--Decision trees can capture all boolean functions, but are faster at capturing conjunctive boolean functions
--Neural nets can capture all boolean or real-valued functions, but are faster at capturing linearly separable functions
--Bayesian learning can capture all probabilistic dependencies, but is faster at capturing single-level dependencies (the naïve Bayes classifier)
So a more reasonable question is:
--What is the bias class that has a specialization corresponding to the type of patterns that underlie my data?
--In this bias class, what is the most restrictive bias that can still capture the true pattern in the data?
Bias can be seen as a sneaky way of letting background knowledge in..

Uses different biases in predicting Russell's waiting habits:
--Decision trees: examples are used to learn the topology and the order of questions
--K-nearest neighbors
--Association rules: examples are used to learn the support and confidence of rules such as "If patrons=full and day=Friday then wait (0.3/0.7)" or "If wait>60 and reservation=no then wait (0.4/0.9)"
--SVMs
--Neural nets: examples are used to learn the topology and the edge weights
--Naïve Bayes (Bayes net learning): examples are used to learn the topology and the CPTs

Decision Surface Learning (aka Neural Network Learning)
Idea: since classification is really a question of finding a surface that separates the +ve examples from the -ve examples, why not directly search in the space of possible surfaces? Mathematically, a surface is a function, so we need a way of learning functions: "threshold units".

The "Brain" Connection
A threshold unit …is sort of like a neuron. [Figure: a threshold unit, with step and differentiable (sigmoid) threshold functions]

Perceptron Networks
What happened to the "threshold"? It can be modeled as an extra weight with a static input: a unit with inputs I1, I2, weights w1, w2, and threshold t = k is equivalent to one with an additional weight w0 = k on a fixed input I0 = -1 and threshold t = 0.

Perceptron Learning as Gradient Descent Search in the Weight Space
The optimal perceptron has the lowest error on the training data. The weight update follows the error gradient with respect to each weight (and hence each input Ij); often a constant learning rate parameter is used instead of an exact gradient step.

Perceptron learning algorithm
Loop through the training examples:
--If the activation level of the output unit is 1 when it should be 0, reduce the weight on the link from the j-th input unit by a·Ij, where Ij is the j-th input value and a is the learning rate
--If the activation level of the output unit is 0 when it should be 1, increase the weight on the link from the j-th input unit by a·Ij
--Otherwise, do nothing
Until "convergence"
(So, we are assuming g'(.) is a constant.. which it really is not..)
This is an iterative search! --node -> network weights --goodness -> error. Actually a "gradient descent" search.
A nice applet at: http://neuron.eng.wayne.edu/java/Perceptron/New38.html
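A minimal sketch of this update rule (illustrative names; the threshold is folded in as w0 on a fixed input of -1, as on the earlier slide):

```python
def train_perceptron(examples, n_inputs, rate=0.1, epochs=100):
    """examples: list of (inputs, target) pairs with target in {0, 1}."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        converged = True
        for inputs, target in examples:
            x = [-1.0] + list(inputs)         # I0 = -1 models the threshold
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if out != target:                 # wrong: adjust each weight by a*Ij
                converged = False
                sign = 1 if target == 1 else -1
                w = [wi + sign * rate * xi for wi, xi in zip(w, x)]
        if converged:                         # a full pass with no errors
            break
    return w

# Learns AND, which is linearly separable:
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(data, 2))
```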

Perceptron Training in Action A nice applet at: http://neuron.eng.wayne.edu/java/Perceptron/New38.html Any line that separates the +ve & –ve examples is a solution --may want to get the line that is in some sense equidistant from the nearest +ve/-ve --Need “support vector machines” for that

Comparing Perceptrons and Decision Trees on the Majority Function and the Russell Domain
[Learning-curve plots: decision trees vs. perceptron on each domain]
The majority function is linearly separable.. the Russell domain apparently is not....
Encoding: one input unit per attribute; the unit takes as many distinct real values as the size of the attribute's domain.

Can Perceptrons Learn All Boolean Functions? --Are all boolean functions linearly separable?

Max-Margin Classification & Support Vector Machines
Any line that separates the +ve & -ve examples is a solution, and perceptron learning finds one of them. But could we have a preference among these? We may want the line that provides the maximum margin, i.e., is equidistant from the nearest +ve and -ve examples. The nearest +ve and -ve examples holding up the line are called support vectors. This changes the optimization objective; quadratic programming can be used to directly find such a line.
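As a concrete illustration (using scikit-learn, which is an assumption; the slides name no library), a linear-kernel SVM exposes exactly these pieces:

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [3, 0], [4, 1], [5, 2]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)       # quadratic programming under the hood
print(clf.support_vectors_)                # the points "holding up" the margin
print(clf.predict([[2.5, 1.0]]))
```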

Lagrangian Dual

Two ways to learn non-linear decision surfaces
--First transform the data into a higher-dimensional space, find a linear surface (which is guaranteed to exist), and transform it back to the original space. The TRICK is to do this without explicitly doing the transformation.
--Learn non-linear surfaces directly (as multi-layer neural nets). The trick is to do training efficiently. Back propagation to the rescue..

A "neural net" is a collection of threshold units with interconnections. A threshold unit outputs 1 if w1I1 + w2I2 > k and 0 otherwise (differentiable versions exist).
--Recurrent (bi-directional connections): can act as associative memory
--Feed forward (uni-directional connections):
--Single layer: any linear decision surface can be represented by a single-layer neural net
--Multi-layer: any "continuous" decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net

Linear Separability in High Dimensions “Kernels” allow us to consider separating surfaces in high-D without first converting all points to high-D

Kernelized Support Vector Machines
It turns out that it is not always necessary to first map the data into high-D and then do linear separation. The quadratic programming formulation for SVMs winds up using only the pair-wise dot products of training vectors, and the dot product is a form of similarity metric between points. If you replace that dot product by any non-linear function, you will, in essence, be transforming the data into some high-dimensional space and then finding the max-margin linear classifier in that space, which will correspond to some wiggly surface in the original dimensions. The trick is to find the RIGHT similarity function, which is a form of prior knowledge.
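Continuing the (assumed) scikit-learn illustration, swapping the kernel is a one-argument change, and a custom similarity function can be plugged in directly:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = [0, 1, 0, 1, 0]                    # not linearly separable in 1-D

rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # wiggly surface in original space

def my_kernel(A, B):
    """Any positive-semidefinite similarity can stand in for the dot product."""
    return (A @ B.T + 1.0) ** 2        # a simple polynomial kernel

poly = SVC(kernel=my_kernel).fit(X, y)
print(rbf.predict(X), poly.predict(X))
```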


Domain-knowledge & Learning
"Those who ignore easily available domain knowledge are doomed to re-learn it…" (Santayana's brother)
Classification learning is a problem addressed both by people from AI (machine learning) and by statisticians. Statistics folks tend to "distrust" domain-specific bias: let the data speak for itself… but this is often futile, since the very act of "describing" the data points introduces bias (in terms of the features you decided to use to describe them). …And much human learning occurs because of strong domain-specific bias. Machine learning is torn by these competing influences.
In most current state-of-the-art algorithms, domain knowledge is allowed to influence learning only through relatively narrow avenues/formats (e.g., through "kernels"). This is okay in domains where there is very little (if any) prior knowledge (e.g., which parts of proteins are doing what cellular function), but restrictive in domains where human expertise already exists.

Multi-layer Neural Nets
How come back-prop doesn't get stuck in local minima? One answer: it is actually hard for local minima to form in high-D, as the "trough" has to be closed in all dimensions.

Multi-layer networks can learn the Russell domain… but do so slowly…
[Learning-curve plot: decision trees vs. perceptron vs. multi-layer networks on the Russell domain]

Practical Issues in Multi-layer Network Learning
For multi-layer networks, we need to learn both the weights and the network topology (topology is fixed for perceptrons). If we go with too many layers and connections, we can get over-fitting as well as sloooow convergence.
"Optimal brain damage": start with more hidden layers and connections than needed; after a network is learned, remove the nodes and connections that have very low weights; retrain.

Other impressive applications:
--No-hands across America
--Learning to speak
(Humans make 0.2% errors; Neumans (postmen) make 2%.)
K-Nearest-Neighbor
The test example's class is determined by the class of the majority of its k nearest neighbors. Need to define an appropriate distance measure: sort of easy for real-valued vectors, harder for categorical attributes.
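A minimal k-NN sketch (illustrative names; Euclidean distance is assumed, which is the "easy" real-valued-vector case mentioned above):

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """examples: list of (vector, label); returns the majority label of the k nearest."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [((0, 0), '-'), ((0, 1), '-'), ((5, 5), '+'), ((6, 5), '+'), ((5, 6), '+')]
print(knn_classify((4, 4), data))  # '+': all three nearest neighbors are +ve
```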

Decision Trees vs. Neural Nets
Neural nets:
--Can handle real-valued attributes; can learn any non-linear decision surface
--Incremental: as new examples arrive, the network can adapt
--Good at handling noise
--Convergence is quite slow (faster at learning linear surfaces)
--The learned concept is represented by the weights and topology of the network, so it is hard to understand. (Consider understanding Einstein by dissecting his brain.) This is a double-edged argument: there are many learning tasks for which we do not know how to articulate what we have learned, e.g., face recognition, word recognition
Decision trees:
--Work well for discrete attributes; converge fast for conjunctive concepts
--Non-incremental (look at all the examples at once)
--Not very good at handling noise
--Generally good at avoiding irrelevant attributes
--Easy to understand the learned concept
Why is it important to understand what is learned? --The military "hidden tank photos" example

Learning: improving the performance of the agent w.r.t. the external performance measure
Dimensions:
What can be learned? --Any of the boxes representing the agent's knowledge: action descriptions, effect probabilities, causal relations in the world (and the probabilities of causation), utility models (sort of, through credit assignment), sensor data interpretation models
What feedback is available? --Supervised, unsupervised, "reinforcement" learning --the credit assignment problem
What prior knowledge is available? --"Tabula rasa" (the agent's head is a blank slate) or pre-existing knowledge

Dimensions of Learning
"Representation" of the knowledge
Degree of guidance:
--Supervised: the teacher provides training examples & solutions (e.g., classification)
--Unsupervised: no assistance from the teacher (e.g., clustering; inducing hidden variables)
--In-between: either feedback is given only for some of the examples (semi-supervised learning), or feedback is provided only after a sequence of decisions is made (reinforcement learning)
Degree of background knowledge:
--Tabula rasa: no background knowledge other than the training examples
--Knowledge-based learning: examples are interpreted in the context of existing knowledge
Knowledge-level vs. speedup learning: if you do have background knowledge, a question is whether the learned knowledge is "entailed" by the background knowledge or not (entailment can be logical or probabilistic). If it is entailed, it is called "deductive" learning; if it is not, it is called inductive learning.

Inductive Learning (Classification Learning)
How are learners tested? Performance on the test data (not the training data), measured in terms of accuracy (% correctly classified).
Can learning work?
--If training and test examples are the same?
--If training and test examples have no connection?
--Training and test examples drawn from the same distribution!