
START OF DAY 3 Reading: Chap. 7

Instance-based Learning

Introduction Instance-based learning is often termed lazy learning, as there is typically no "transformation" of training instances into more general "statements". Instead, the presented training data is simply stored and, when a new query instance is encountered, a set of similar, related instances is retrieved from memory and used to classify it. Hence, instance-based learners never form an explicit general hypothesis of the target function; they simply compute the classification of each new query instance as needed.

k-NN Approach The simplest and most widely used instance-based learning algorithm is the k-NN algorithm. k-NN assumes that all instances are points in some n-dimensional space and defines neighbors in terms of distance (usually Euclidean distance in R^n). k is the number of neighbors considered.

k-NN Algorithm For each training instance t = (x, f(x)) – Add t to the set Tr_instances Given a query instance q to be classified – Let x_1, …, x_k be the k training instances in Tr_instances nearest to q – Return argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(x_i)), where V is the finite set of target class values and δ(a,b) = 1 if a = b, 0 otherwise (the Kronecker delta)
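To make the procedure concrete, here is a minimal Python sketch of the k-NN classification rule; the function and variable names (euclidean, knn_classify, the list-of-pairs representation of Tr_instances) are illustrative, not from the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, training_instances, k=3):
    """Return the majority class among the k training instances nearest to query.

    training_instances is a list of (feature_vector, class_label) pairs.
    """
    neighbors = sorted(training_instances, key=lambda t: euclidean(query, t[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Example usage with toy 2-D data
train = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((4.0, 4.2), '-'), ((3.8, 4.0), '-')]
print(knn_classify((1.1, 1.0), train, k=3))  # '+'
```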

Bias Intuitively, the k-NN algorithm assigns to each new query instance the majority class among its k nearest neighbors – Things that look the same ought to be labeled the same In practice, k is usually chosen to be odd, so as to avoid ties The k = 1 rule is generally called the nearest-neighbor classification rule

Decision Surface (1-NN) [Figures: 1-NN decision surfaces (Voronoi diagrams) under Euclidean distance and under Manhattan distance] Properties: 1) All points within a sample's Voronoi cell have that sample as their nearest neighbor 2) For any sample, the nearest sample is determined by the closest Voronoi cell edge

Impact of the Value of k q is + under 1-NN, but – under 5-NN

Distance-weighted k-NN Replace the unweighted vote Σ_{i=1..k} δ(v, f(x_i)) by the weighted vote Σ_{i=1..k} w_i δ(v, f(x_i)), where w_i = 1 / d(x_q, x_i)²
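A sketch of the distance-weighted vote, assuming the 1/d² weights above and the same (feature_vector, label) training representation as in the earlier k-NN sketch; names are illustrative.

```python
import math
from collections import defaultdict

def weighted_knn_classify(query, training_instances, k=3):
    """Distance-weighted k-NN: each of the k nearest neighbors votes with weight 1/d^2."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(training_instances, key=lambda t: dist(query, t[0]))[:k]
    scores = defaultdict(float)
    for x, label in neighbors:
        d = dist(query, x)
        if d == 0:
            return label            # an exact match decides the vote outright
        scores[label] += 1.0 / d ** 2
    return max(scores, key=scores.get)
```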

Scale Effects Different features may have different measurement scales – E.g., patient weight in kg (range [50,200]) vs. blood protein values in ng/dL (range [-3,3]) Consequences – Patient weight will have a much greater influence on the distance between samples – May bias the performance of the classifier Use normalization or standardization
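A minimal sketch of per-feature standardization (z-scoring) over a training set; the helper name and the guard for constant features are assumptions, not from the slides.

```python
def standardize(data):
    """Rescale each feature to zero mean and unit variance (z-score).

    data is a list of equal-length numeric feature vectors.
    Returns the rescaled data plus the (mean, std) pairs needed to
    transform future query instances the same way.
    """
    cols = list(zip(*data))
    stats = []
    for col in cols:
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        std = var ** 0.5 or 1.0      # guard against constant features
        stats.append((mean, std))
    scaled = [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in data]
    return scaled, stats
```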

Predicting Continuous Values Replace the vote by the weighted average f(x_q) = Σ_{i=1..k} w_i f(x_i) / Σ_{i=1..k} w_i, where f(x) is the output value of instance x and w_i = 1 / d(x_q, x_i)²
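The corresponding regression sketch, again assuming 1/d² weights and (feature_vector, target_value) training pairs; names are illustrative.

```python
import math

def weighted_knn_regress(query, training_instances, k=3):
    """Predict a continuous value as the 1/d^2-weighted average of the
    k nearest neighbors' target values."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(training_instances, key=lambda t: dist(query, t[0]))[:k]
    num = den = 0.0
    for x, fx in neighbors:
        d = dist(query, x)
        if d == 0:
            return fx                # exact match: return its value directly
        w = 1.0 / d ** 2
        num += w * fx
        den += w
    return num / den
```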

Regression Example What is the value of the new instance x_q? The three nearest neighbors have output values f(n_8) = 8, f(n_5) = 5, f(n_3) = 3. Assume dist(x_q, n_8) = 2, dist(x_q, n_5) = 3, dist(x_q, n_3) = 4. Then f(x_q) = (8/2² + 5/3² + 3/4²) / (1/2² + 1/3² + 1/4²) = 2.74 / 0.42 ≈ 6.5. The denominator renormalizes the value.
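A quick check of this arithmetic, with the distances and target values taken from the slide:

```python
targets = [8, 5, 3]          # f(n_8), f(n_5), f(n_3)
dists = [2, 3, 4]            # dist(x_q, n_8), dist(x_q, n_5), dist(x_q, n_3)
weights = [1 / d ** 2 for d in dists]
prediction = sum(w * f for w, f in zip(weights, targets)) / sum(weights)
print(round(prediction, 1))  # 6.5
```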

Some Remarks k-NN works well on many practical problems and is fairly noise tolerant (depending on the value of k) k-NN is subject to the curse of dimensionality and to the presence of many irrelevant attributes k-NN relies on efficient indexing – Could also reduce the number of stored instances (e.g., drop an instance if it would still be classified correctly without it) What distance should be used?

Distances These work great for continuous attributes. How about nominal attributes? How about mixtures of continuous and nominal attributes?

Distance for Nominal Attributes

Distance for Heterogeneous Data Wilson, D. R. and Martinez, T. R., Improved Heterogeneous Distance Functions, Journal of Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997

Incremental Learning A learning algorithm is incremental if and only if, for any given training samples e_1, …, e_n, it produces a sequence of hypotheses (or models) h_0, h_1, …, h_n such that h_i depends only on h_{i-1} and the current example e_i

How is k-NN Incremental? All training instances are stored Model consists of the set of training instances Adding a new training instance only affects the computation of neighbors, which is done at execution time (i.e., lazily) Note that the storing of training instances is a violation of the strict definition of incremental learning.

Naïve Bayes

Bayesian Learning A powerful and growing approach in machine learning We use Bayesian reasoning in our own decision-making all the time – You hear a word which could equally be “Thanks” or “Tanks”, which would you go with? Combine data likelihood and your prior knowledge – Texting suggestions on phone – Spell checkers, etc.

Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of interest are governed by probability distributions (priors) and that optimal decisions can be made by reasoning about these probabilities together with observed data (likelihood)

Example (I) Suppose I wish to know whether someone is telling the truth or lying about some issue X – The available data is from a lie detector with two possible outcomes: truthful or liar – I also have prior knowledge that over the entire population, 21% of people lie about X – Finally, I know that the lie detector is imperfect: it returns truthful in only 94% of the cases where people actually told the truth and liar in only 87% of the cases where people were actually lying

Example (II) P(lies about X) = 0.21, P(tells the truth about X) = 0.79, P(truthful | tells the truth about X) = 0.94, P(liar | tells the truth about X) = 0.06, P(liar | lies about X) = 0.87, P(truthful | lies about X) = 0.13

Example (III) Suppose a new person is asked about X and the lie detector returns liar. Should we conclude the person is indeed lying about X or not? What we need is to compare: – P(lies about X | liar) – P(tells the truth about X | liar)

Example (IV) P(A|B) = P(AB) / P(B) (by definition) P(B|A) = P(AB) / P(A) (by definition) Combining the above two: P(B|A) = P(A|B)P(B) / P(A) In our case: – P(lies about X | liar) = [P(liar | lies about X) · P(lies about X)] / P(liar) – P(tells the truth about X | liar) = [P(liar | tells the truth about X) · P(tells the truth about X)] / P(liar)

Example (V) We have all of the above probabilities, except for P(liar) However, by the Law of Total Probability, P(A) = P(A|B)P(B) + P(A|~B)P(~B) In our case: – P(liar) = P(liar | lies about X) · P(lies about X) + P(liar | tells the truth about X) · P(tells the truth about X) And again, we know all of these probabilities

Example (VI) Computing, we get: – P(liar) = 0.87 × 0.21 + 0.06 × 0.79 = 0.23 – P(lies about X | liar) = [0.87 × 0.21] / 0.23 = 0.79 – P(tells the truth about X | liar) = [0.06 × 0.79] / 0.23 = 0.21 And we would conclude that the person was indeed lying about X
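A short Python sketch reproducing this computation, with the numbers taken from the example:

```python
p_lie = 0.21                      # P(lies about X)
p_truth = 0.79                    # P(tells the truth about X)
p_liar_given_lie = 0.87           # P(liar | lies about X)
p_liar_given_truth = 0.06         # P(liar | tells the truth about X)

# Law of Total Probability
p_liar = p_liar_given_lie * p_lie + p_liar_given_truth * p_truth

# Bayes theorem
p_lie_given_liar = p_liar_given_lie * p_lie / p_liar
p_truth_given_liar = p_liar_given_truth * p_truth / p_liar

print(round(p_liar, 2), round(p_lie_given_liar, 2), round(p_truth_given_liar, 2))
# 0.23 0.79 0.21
```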

Bayesian Learning Assume we have two (random) variables, X and C X represents multi-dimensional objects and C is their class label – C = PlayTennis = {Yes, No} – X = Outlook x Temperature x Humidity x Wind We can define the following quantities: P(C) – Prior probability What we believe is true about C before we look at any specific instance of X In this case, what would P(C=Yes) and P(C=No) be? P(X|C) – Likelihood How likely a particular instance X is given that we know its label C In this case, what would P(X= |C=No) be? P(C|X) – Posterior probability What we believe is true about C for a specific instance X In this case, what would P(C=Yes|X= ) be? This is what we are interested in! How can we find it?

Finding P(C|X) We use Bayes Theorem P(C|X) = P(X|C)P(C) / P(X) Allows us to talk about probabilities/beliefs even when there is little data, because we can use the prior – What is the probability of a nuclear plant meltdown? – What is the probability that BYU will win the national championship? As the amount of data increases, Bayes shifts confidence from the prior to the likelihood Requires reasonable priors in order to be helpful We use priors all the time in our decision making Remember bias: one form of prior

Probabilistic Learning In ML, we are often interested in determining the best hypothesis from some space H, given the observed training data D. One way to specify what is meant by the best hypothesis is to say that we demand the most probable hypothesis, given the data D together with any initial knowledge about the prior probabilities of the various hypotheses in H.

Bayes Theorem Bayes theorem is the cornerstone of Bayesian learning methods It provides a way of calculating the posterior probability P(h | D) from the prior probability P(h), the data probability P(D), and the likelihood P(D | h), as follows: P(h | D) = P(D | h) P(h) / P(D)

MAP Learning How do we make our decision? – We choose the/a maximally probable or maximum a posteriori (MAP) hypothesis, namely: h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h) (P(D) is the same for all hypotheses and can be dropped)
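A minimal sketch of the MAP choice over a finite hypothesis set, assuming the caller supplies the likelihood and prior as callables (hypothetical names):

```python
def map_hypothesis(hypotheses, likelihood, prior):
    """Return the hypothesis h maximizing P(D|h) * P(h).

    likelihood(h) and prior(h) are caller-supplied functions; the evidence
    P(D) is constant across hypotheses and can be ignored in the argmax.
    """
    return max(hypotheses, key=lambda h: likelihood(h) * prior(h))
```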

Example of MAP Hypothesis Assume only 3 possible hypotheses h_1, h_2, h_3 in H, each with a likelihood P(D|h), a prior P(h), and a relative posterior P(D|h)P(h) (not normalized by P(D)); the relative posteriors work out to .18 for h_1, .18 for h_2, and .35 for h_3 (the same hypotheses are reused in the Bayes optimal example below). Given a data set D, which h do we choose? Non-Bayesian? Probably select h_2, the hypothesis with the highest likelihood. Bayesian? Select h_3 in a principled way. Note the interaction between prior and likelihood.

Remarks Brute-Force MAP learning algorithm: – Often impractical (H too large) – Answers the question: which is the most probable hypothesis given the training data? Often, more significant question: – Which is the most probable classification of the new query instance given the training data? The most probable classification of a new instance can be obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities

Bayes Optimal Classification (I) If the possible classification of the new instance can take on any value c_j from some set C, then the probability P(c_j | D) that the correct classification for the new instance is c_j is just: P(c_j | D) = Σ_{h_i ∈ H} P(c_j | h_i) P(h_i | D) (Law of Total Probability) Clearly, the optimal classification of the new instance is the value c_j for which P(c_j | D) is maximum, which gives rise to the following algorithm to classify query instances.

Bayes Optimal Classification (II) Return argmax_{c_j ∈ C} Σ_{h_i ∈ H} P(c_j | h_i) P(h_i | D) No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average: it maximizes the probability that the new instance is classified correctly, given the available data, hypothesis space, and prior probabilities over the hypotheses

Example of Bayes Optimal Classification (I) Assume the same 3 hypotheses, with priors and posteriors as before, for a data set D with 2 possible output classes (A and B) Assume a novel input instance x for which h_1 and h_2 output B and h_3 outputs A

H     Posterior P(D|h)P(h)    P(A|D)           P(B|D)
h_1   .18                     0 x .18 = 0      1 x .18 = .18
h_2   .18                     0 x .18 = 0      1 x .18 = .18
h_3   .35                     1 x .35 = .35    0 x .35 = 0
Sum                           .35              .36

Example of Bayes Optimal Classification (II) Assume probabilistic outputs from the hypotheses

H     P(A|h)    P(B|h)
h_1   .3        .7
h_2   .4        .6
h_3   .9        .1

H     Posterior P(D|h)P(h)    P(A|D)              P(B|D)
h_1   .18                     .3 x .18 = .054     .7 x .18 = .126
h_2   .18                     .4 x .18 = .072     .6 x .18 = .108
h_3   .35                     .9 x .35 = .315     .1 x .35 = .035
Sum                           .441                .269
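A sketch of this combination in Python; the posteriors and per-hypothesis class probabilities are taken from the tables above, while the sums are simply recomputed:

```python
# Unnormalized posteriors P(D|h)P(h) and per-hypothesis class probabilities P(class|h)
posteriors = {'h1': 0.18, 'h2': 0.18, 'h3': 0.35}
class_probs = {'h1': {'A': 0.3, 'B': 0.7},
               'h2': {'A': 0.4, 'B': 0.6},
               'h3': {'A': 0.9, 'B': 0.1}}

scores = {c: round(sum(class_probs[h][c] * posteriors[h] for h in posteriors), 3)
          for c in ('A', 'B')}
print(scores)                       # {'A': 0.441, 'B': 0.269}
print(max(scores, key=scores.get))  # A
```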

Naïve Bayes Learning (I) Large or infinite H still make the above algorithm impractical Naive Bayes learning is a practical Bayesian learning method: – It applies directly to learning tasks where instances are conjunctions of attribute values and the target function takes its values from some finite set C – The Bayesian approach consists in assigning to a new query instance the most probable target value, c_MAP, given the attribute values a_1, …, a_n that describe the instance, i.e., c_MAP = argmax_{c_j ∈ C} P(c_j | a_1, …, a_n)

Naïve Bayes Learning (II) Using Bayes theorem, this can be reformulated as: c_MAP = argmax_{c_j ∈ C} P(a_1, …, a_n | c_j) P(c_j) Finally, we make the further simplifying assumption that attribute values are conditionally independent given the target value Hence, one can write the conjunctive conditional probability as a product of simple conditional probabilities: P(a_1, …, a_n | c_j) = Π_i P(a_i | c_j)

Naïve Bayes Learning (III) Return c_NB = argmax_{c_j ∈ C} P(c_j) Π_i P(a_i | c_j) The naive Bayes learning method involves a learning step in which the various P(c_j) and P(a_i | c_j) terms are estimated, based on their frequencies over the training data These estimates are then used in the above formula to classify each new query instance Whenever the assumption of conditional independence is satisfied, the naive Bayes classification is identical to the MAP classification
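Putting the pieces together, a minimal sketch of training and applying a naive Bayes classifier over nominal attributes; the function names are illustrative, and no smoothing is applied yet (the estimation slides below address that).

```python
from collections import Counter, defaultdict

def train_nb(instances):
    """instances: list of (attribute_tuple, class_label) pairs.
    Returns class priors P(c) and the counts needed for P(a_i | c)."""
    class_counts = Counter(c for _, c in instances)
    cond_counts = defaultdict(Counter)      # (attribute index, class) -> value counts
    for attrs, c in instances:
        for i, a in enumerate(attrs):
            cond_counts[(i, c)][a] += 1
    n = len(instances)
    priors = {c: class_counts[c] / n for c in class_counts}
    return priors, cond_counts, class_counts

def classify_nb(query, priors, cond_counts, class_counts):
    """Return argmax_c P(c) * prod_i P(a_i | c), using raw frequency estimates."""
    best, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, a in enumerate(query):
            score *= cond_counts[(i, c)][a] / class_counts[c]
        if score > best_score:
            best, best_score = c, score
    return best
```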

NB Example (I)

NB Example (II)

Continuous Attributes Can discretize into bins, thus changing the attribute into a nominal feature, and then gather statistics normally – How many bins? More bins is good, but sufficient data is needed to make the bins statistically significant; thus, base the number of bins on the data available Could also assume the data is Gaussian and compute the mean and variance of each feature given the output class; then each P(a_i | c_j) becomes N(a_i; μ_{ij}, σ²_{ij})
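A sketch of the Gaussian option, with assumed helper names; the per-class mean and variance of each feature would be estimated from the training data.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x, used as P(a_i | c_j) for a continuous attribute."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_mean_var(values):
    """Mean and variance of one feature's values within one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-9)          # avoid a zero variance
```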

Estimating Probabilities We have so far estimated P(X=x | Y=y) by the fraction n_{x|y} / n_y, where n_y is the number of instances for which Y=y and n_{x|y} is the number of these for which X=x This is a problem when n_{x|y} is small – E.g., assume P(X=x | Y=y) = 0.05 and the training set is such that n_y = 5. Then it is highly probable that n_{x|y} = 0 – The fraction is thus an underestimate of the actual probability – It will dominate the Bayes classifier for all new queries with X=x (i.e., drive probabilities to 0)

Laplacian or m-estimate Replace n_{x|y} / n_y by: (n_{x|y} + m·p) / (n_y + m), where p is our prior estimate of the probability we wish to determine and m is a constant Typically, p = 1/k (where k is the number of possible values of X, i.e., attribute values in our case) m acts as a weight (similar to adding m virtual instances distributed according to p) Laplacian: m = 1/p
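A sketch of the m-estimate as a drop-in replacement for the raw frequency ratio; p and m are supplied by the caller.

```python
def m_estimate(n_xy, n_y, p, m):
    """Smoothed estimate of P(X=x | Y=y): (n_xy + m*p) / (n_y + m).

    n_xy: count of instances with X=x among those with Y=y
    n_y:  count of instances with Y=y
    p:    prior estimate of the probability (e.g., 1/k for k attribute values)
    m:    equivalent sample size (m = 1/p gives the Laplacian estimate)
    """
    return (n_xy + m * p) / (n_y + m)

# Laplacian correction for an attribute with k=3 possible values
print(m_estimate(0, 5, p=1/3, m=3))   # 0.125 instead of 0
```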

How is NB Incremental? No training instances are stored Model consists of summary statistics that are sufficient to compute prediction Adding a new training instance only affects summary statistics, which may be updated incrementally

Revisiting Conditional Independence Definition: X is conditionally independent of Y given Z iff P(X | Y, Z) = P(X | Z) NB assumes that all attributes are conditionally independent, given the class. Hence, P(a_1, …, a_n | c_j) = Π_i P(a_i | c_j)

What if the Conditional Independence Assumption Does Not Hold? In many cases, the NB assumption is overly restrictive What we need is a way of handling independence or dependence over subsets of attributes – Joint probability distribution Defined over Y_1 × Y_2 × … × Y_n Specifies the probability of each variable binding

Bayesian Belief Network Directed acyclic graph: – Nodes represent variables in the joint space – Arcs represent the assertion that a variable is conditionally independent of its non-descendants in the network given its immediate predecessors in the network – A conditional probability table is also given for each variable: P(V | immediate predecessors)
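A tiny sketch of how such a network can be represented in code, using a hypothetical two-node fragment (Storm -> Lightning) with one conditional probability table per node; the structure and numbers are illustrative only.

```python
# Each node stores its parents and a CPT keyed by the parents' value assignment.
network = {
    'Storm':     {'parents': [], 'cpt': {(): 0.3}},                 # P(Storm)
    'Lightning': {'parents': ['Storm'],
                  'cpt': {(True,): 0.8, (False,): 0.05}},           # P(Lightning | Storm)
}

def prob_true(node, assignment):
    """P(node=True | its parents' values in assignment)."""
    parents = tuple(assignment[p] for p in network[node]['parents'])
    return network[node]['cpt'][parents]

print(prob_true('Lightning', {'Storm': True}))   # 0.8
```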

BN Examples

END OF DAY 3 Homework: Weka & Incremental Learning