CS5112: Algorithms and Data Structures for Applications
Lecture 11: Density estimation with nearest neighbors
Ramin Zabih
Some figures from Wikipedia/Google image search

Administrivia
- Reminder: HW comments and minor corrections on Slack
- HW3 coming soon
- Anonymous survey coming re: speed of course, etc.

Today
- NN for machine learning problems
- Density estimation to classify cats versus dogs
- Continuous versus discrete random variables

NN for machine learning
- Nearest neighbors is a fundamental ML technique
  - Used for classification, regression, etc.
- We will study (and implement!) NN algorithms
  - Exact algorithms on Tuesday next week
  - Approximate algorithms starting Thursday
- Today we will focus on understanding the use of NN

Classification and NN algorithms
- If the height/weight of your query animal is very similar to the cats you have seen, and unlike the dogs you have seen, then it's probably a cat
- There's actually a lot going on in that sentence:
  - "very similar"
  - "the cats you have seen"
  - "probably"
- You can classify directly from NN

NN and k-NN classification
- Find the animal you've seen that's most similar to your query
  - NN classification
- What can go wrong?
- More robustly, look at the k most similar animals and take the mode (most common label)
  - k-NN classification (sketched in the example below)
- Choice of k?
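
A minimal sketch of k-NN classification on (height, weight) pairs. The data, distance function, and choice of k are invented for illustration, not taken from the lecture.

```python
# Hypothetical illustration of k-NN classification on (height, weight) pairs.
from collections import Counter
import math

def knn_classify(query, examples, k=3):
    """Label a query point by majority vote among its k nearest examples.

    examples: list of ((height, weight), label) pairs.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Sort labeled examples by distance to the query and keep the k closest.
    neighbors = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    labels = [label for _, label in neighbors]
    # Take the mode (most common label) among the k neighbors.
    return Counter(labels).most_common(1)[0][0]

# Toy data: (height in cm, weight in lbs) -> species
data = [((25, 9), "cat"), ((23, 8), "cat"), ((60, 50), "dog"), ((55, 45), "dog")]
print(knn_classify((24, 10), data, k=3))  # likely "cat"
```

With k = 1 this reduces to plain NN classification; larger k is more robust to a single mislabeled or unusual neighbor.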

k-NN classifier small example
Slide credit: https://www.cs.rit.edu/~rlaz/PatternRecognition/, Richard Zanibbi

(k-)NN classifier example
Figure: By Agor153 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24350614

Classification and NN algorithms
- For many applications it's way more useful to have the density
  - Tells you a lot more about your data
- To classify we need to estimate the density
  - A good way to do this is from nearest neighbors
  - Lots of other algorithms also, but NN is very popular
- Basic intuition: most of the cats are where the cat density is high, and vice-versa
  - "When you hear hoofbeats, think of horses not zebras"

Cats versus dogs (simple version)
- Suppose we want to classify based on a single number
  - Such as weight
- Simple case: cats and dogs have their weights described by a Gaussian (normal) distribution
- Cats occur about as frequently as dogs in our data
- We just need to estimate the density for cats vs dogs
- This allows us to build our classifier

Cat vs dog classification from weight
[Figure: a "simple" case and a not so simple case]

NN density estimation
- Lots of practical questions boil down to density estimation
  - Even if you don't explicitly say you're doing it!
  - "How much do typical cats weigh?" Google says: 7.9 - 9.9 lbs
- Your classes generally have some density in feature space
  - Hopefully they are compact and well-separated
- Given a new data point, which class does it belong to?
  - We just maximize P(data | class), called the likelihood (see the sketch below)
  - Formalizes what we did on the previous slide
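
A minimal sketch of likelihood-based classification, assuming 1-D Gaussian class-conditional densities; the class means and standard deviations below are invented for illustration, not values from the lecture.

```python
# Pick the class whose assumed density (likelihood) at this weight is largest.
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Assumed class-conditional densities p(weight | class): (mean, std dev) in lbs.
classes = {"cat": (9.0, 1.5), "dog": (40.0, 15.0)}

def classify_by_likelihood(weight):
    # Maximize P(data | class); with equal priors this is also the Bayes decision.
    return max(classes, key=lambda c: gaussian_pdf(weight, *classes[c]))

print(classify_by_likelihood(10))   # -> "cat"
print(classify_by_likelihood(35))   # -> "dog"
```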

What is a density?
- Consider an arbitrary function p where p(x) ≥ 0 for all x and ∫ p(x) dx = 1
- Can view it as a probability density function (PDF)
  - The PDF for a real-valued random variable
- If we weighed ∞ cats, what frequency of weights would we get?

A pitfall of interpreting densities
- The value of the PDF at x is not the probability we would observe the weight x
  - Which is always zero (think about it!)
- Instead, it gives the probability of getting a weight in a given interval (see the example below):
  ∫_α^β p(x) dx = P[α ≤ weight ≤ β]
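
To make the interval interpretation concrete, here is a small sketch that numerically integrates an assumed Gaussian cat-weight PDF over [α, β]; the mean and standard deviation are made up for the example.

```python
# P[a <= weight <= b] is the integral of the PDF p over [a, b].
import math

MU, SIGMA = 9.0, 1.5   # assumed mean and std dev of cat weights, in lbs

def cat_pdf(x):
    return math.exp(-((x - MU) ** 2) / (2 * SIGMA ** 2)) / (SIGMA * math.sqrt(2 * math.pi))

def prob_interval(a, b, steps=10_000):
    """Approximate the integral of cat_pdf over [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (cat_pdf(a) + cat_pdf(b))
    total += sum(cat_pdf(a + i * h) for i in range(1, steps))
    return total * h

# Probability a cat weighs between 7.9 and 9.9 lbs under this model.
print(prob_interval(7.9, 9.9))   # roughly 0.49
# The probability of observing exactly 9.0 lbs is zero:
print(prob_interval(9.0, 9.0))   # 0.0
```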

Discrete case is easier
- Sometimes the values of the random variable are discrete
- Instead of a PDF you have a probability mass function (PMF)
  - I.e., a histogram whose entries sum to 1
  - No bucket has a value greater than 1
- The PMF gives the true relative frequencies (a small example follows)
  - I.e., what we would get in the limit as we weigh more and more cats
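
As a tiny illustration (not from the slides), the sketch below turns made-up counts of rounded cat weights into a PMF: relative frequencies that sum to 1.

```python
# Normalize observed counts so they sum to 1, giving an empirical PMF.
from collections import Counter

observed_weights = [8, 9, 9, 10, 9, 8, 11, 9, 10, 8]   # rounded to whole lbs
counts = Counter(observed_weights)
n = len(observed_weights)

pmf = {w: c / n for w, c in counts.items()}
print(pmf)                    # {8: 0.3, 9: 0.4, 10: 0.2, 11: 0.1}
print(sum(pmf.values()))      # 1.0 (up to floating-point rounding); no entry exceeds 1
```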

Sampling from a PDF
- Suppose we weigh a bunch of cats
  - This generates our sample (data set)
- How does this relate to the true PDF? (see the sketch below)
- It simplifies life considerably to assume:
  - All cats have their weights from the same PDF (identical distributions)
  - No effect between weighing one cat and another (independence)
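
A brief sketch of what the i.i.d. assumption means in practice: every weight is drawn independently from one assumed Gaussian PDF. The distribution parameters here are invented.

```python
# Generate an i.i.d. sample of "cat weights" from one assumed Gaussian PDF.
import random

random.seed(0)                      # reproducible example
MU, SIGMA = 9.0, 1.5                # assumed "true" cat-weight distribution (lbs)

sample = [random.gauss(MU, SIGMA) for _ in range(1000)]   # 1000 i.i.d. weights
print(sample[:5])
print(sum(sample) / len(sample))    # sample mean, close to MU
```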

Welford's online mean algorithm
- Suppose we know that cats come from a Gaussian distribution, but we have too many cats to store all their weights
- Can we estimate the mean (and variance) online? (see the sketch below)
  - Online algorithms are very important for modern applications
- For the mean we have
  μ_n = (1/n) Σ_{i=1}^{n} x_i = μ_{n-1} + (1/n)(x_n − μ_{n-1})
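
A runnable sketch of the online mean update above, extended with the standard Welford variance accumulator; the variance part is the usual textbook formulation, not something stated on the slide.

```python
# Online mean and variance: only three numbers are stored, never the full sample.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0        # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # mu_n = mu_{n-1} + (x_n - mu_{n-1}) / n
        self.m2 += delta * (x - self.mean)   # uses deviations before and after the update

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Feed weights one at a time.
stats = RunningStats()
for w in [8.2, 9.1, 10.3, 8.8, 9.5]:
    stats.update(w)
print(stats.mean, stats.variance())
```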

Non-parametric approach
- So far we assumed we knew the underlying distribution
  - All we needed was to estimate the parameters
- Obviously, this gives bad results when the true distribution isn't what we think it is
- Non-Gaussian distributions are in general rare, and hard to handle
  - But they occur a lot in some areas
- Box's law: "All models are wrong but some are useful"

Is there a free lunch?
- Suppose we simply plot the data points
  - Assume 1-D for the moment (cat weights)
- How to compute density from data?
[Figure: the true density]

Histogram representation

Bin-size tradeoffs
- Fewer, larger bins give a smoother but less accurate answer
  - They tend to avoid "gaps", i.e., places where the density is declared to be 0
- More, smaller bins have the opposite property (illustrated in the sketch below)
- If you don't know anything in advance, there's no way to predict the right bin size
- Also, note that this method isn't a great idea in high dimensions
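
To make the tradeoff concrete, here is a small sketch of a histogram density estimate on a synthetic 1-D sample; the data and bin counts are made up, and the printout shows how smaller bins create empty "gaps".

```python
# Histogram density estimate: counts normalized so the total area under the bars is 1.
import random

random.seed(0)
sample = [random.gauss(9.0, 1.5) for _ in range(200)]   # synthetic 1-D "cat weights"

def histogram_density(data, num_bins):
    """Return (bin_edges, densities) normalized so the total area is 1."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        i = min(int((x - lo) / width), num_bins - 1)   # clamp the maximum point
        counts[i] += 1
    densities = [c / (len(data) * width) for c in counts]
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, densities

# Few large bins: smooth but coarse. Many small bins: detailed but gappy.
for bins in (5, 50):
    _, dens = histogram_density(sample, bins)
    print(bins, "bins ->", sum(1 for d in dens if d == 0.0), "empty bins")
```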

Histogram-based estimates
- You can use a variety of fitting techniques to produce a curve from a histogram
  - Lines, polynomials, splines, etc.
  - Also called regression/function approximation
- Normalize to make this a density
- If you know quite a bit about the underlying density you can compute a good bin size
  - But that's rarely realistic
  - And it defeats the whole purpose of the non-parametric approach!

Nearest-neighbor estimate
- To estimate the density, count the number of nearby data points
- Like histogramming with sliding bins
  - Avoids bin-placement artifacts
- Can fix epsilon and compute this quantity, or fix the quantity and compute epsilon (both sketched below)
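
A hedged sketch of both variants in 1-D, using the generic form p(x) ≈ k / (n·V), where V is the length of the interval that captures k of the n sample points around x. The sample and parameters are invented for illustration.

```python
# Two nearest-neighbor density estimates in 1-D: fixed window vs. fixed k.
import random

random.seed(0)
sample = sorted(random.gauss(9.0, 1.5) for _ in range(500))
n = len(sample)

def density_fixed_eps(x, eps):
    """Fix the window half-width eps, count how many points fall inside."""
    k = sum(1 for s in sample if abs(s - x) <= eps)
    return k / (n * 2 * eps)

def density_fixed_k(x, k):
    """Fix k, grow the window until it captures the k nearest points."""
    dists = sorted(abs(s - x) for s in sample)
    eps = dists[k - 1]                 # distance to the k-th nearest neighbor
    return k / (n * 2 * eps)

print(density_fixed_eps(9.0, 0.5))     # near the mode: roughly 0.25-0.3
print(density_fixed_k(9.0, 25))        # similar value from the k-NN version
```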

NN density estimation
Slide credit: https://www.cs.rit.edu/~rlaz/PatternRecognition/, Richard Zanibbi

Sliding sums
- Suppose we want to "smooth" a histogram, i.e., replace the values by the average over a window
- How can we do this efficiently? (a running-sum sketch follows)
- https://www.geeksforgeeks.org/window-sliding-technique/
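
One common way to do this efficiently is the sliding-window technique from the linked article: maintain a running sum so each window average costs O(1) instead of O(w). The function name and histogram values below are illustrative.

```python
# Smooth histogram counts with a sliding window average, in O(n) total time.

def sliding_average(values, w):
    """Average of each length-w window of values."""
    if w > len(values):
        return []
    window_sum = sum(values[:w])           # sum of the first window
    averages = [window_sum / w]
    for i in range(w, len(values)):
        # Slide the window: add the entering value, drop the leaving one.
        window_sum += values[i] - values[i - w]
        averages.append(window_sum / w)
    return averages

histogram = [0, 2, 5, 9, 12, 9, 4, 2, 1, 0]    # made-up bin counts
print(sliding_average(histogram, 3))           # smoothed version, 8 values
```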