Machine Learning CMPT 726 Simon Fraser University


Machine Learning CMPT 726 Simon Fraser University CHAPTER 1: INTRODUCTION

Outline
Comments on general approach.
Probability theory: joint, conditional, and marginal probabilities.
Random variables; functions of random variables.
Bernoulli distribution (coin tosses).
Maximum likelihood estimation.
Bayesian learning with a conjugate prior.
The Gaussian distribution.
More probability theory: entropy, KL divergence.

Our Approach The course generally follows a statistical approach, though machine learning is very interdisciplinary. Emphasis on predictive models: guess the value(s) of the target variable(s) (“pattern recognition”). Generally a Bayesian approach, as in the text. Compared to standard Bayesian statistics: more complex models (neural nets, Bayes nets), more discrete variables, and more emphasis on algorithms and efficiency.

Things Not Covered Within statistics: hypothesis testing, frequentist theory, learning theory. Other types of data (not random samples): relational data, scientific data (automated scientific discovery). Action + learning = reinforcement learning. Could be optional – what do you think?

Probability Theory Running example: apples and oranges drawn from boxes (Bishop’s fruit-and-boxes example).

Probability Theory Marginal, conditional, and joint probability. Key point to remember: from the joint probability you can derive everything else (marginals via the sum rule, conditionals via the product rule).

Probability Theory Sum rule: p(X = x_i) = Σ_j p(X = x_i, Y = y_j). Product rule: p(X, Y) = p(Y | X) p(X). Proof of the product rule from counts: with N total trials, c_i of them having X = x_i, and n_ij having both X = x_i and Y = y_j, p(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i)(c_i / N) = p(Y = y_j | X = x_i) p(X = x_i). Intuition for the sum rule: in logical terms, the event X = x_i is equivalent to OR_j (X = x_i AND Y = y_j), where the OR ranges over all y_j; these joint events are mutually exclusive, so their probabilities add.

The Rules of Probability Sum rule: p(X) = Σ_Y p(X, Y). Product rule: p(X, Y) = p(Y | X) p(X).
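To make the two rules concrete, here is a minimal numeric sketch in Python (the 2×3 joint table is invented for illustration): the sum rule recovers the marginals from the joint, and the product rule factors the joint into a conditional times a marginal.

```python
import numpy as np

# Hypothetical joint distribution p(X, Y) with X in {0, 1} and Y in {0, 1, 2}.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
assert np.isclose(joint.sum(), 1.0)

# Sum rule: marginals are obtained by summing the joint over the other variable.
p_x = joint.sum(axis=1)            # p(X)
p_y = joint.sum(axis=0)            # p(Y)

# Product rule: p(X, Y) = p(Y | X) p(X), so p(Y | X) = p(X, Y) / p(X).
p_y_given_x = joint / p_x[:, None]

# Rebuilding the joint from the factors checks the product rule numerically.
assert np.allclose(p_y_given_x * p_x[:, None], joint)
print("p(X) =", p_x, " p(Y) =", p_y)
```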

Bayes’ Theorem p(Y | X) = p(X | Y) p(Y) / p(X), where p(X) = Σ_Y p(X | Y) p(Y). In words: posterior ∝ likelihood × prior. Exercise: prove this from the product rule.

Bayes’ Theorem: Model Version Let M be the model (or hypothesis) and E the evidence (data). P(M | E) ∝ P(M) × P(E | M). Intuition: the prior P(M) says how plausible the model (theory) is a priori, before seeing any evidence; the likelihood P(E | M) says how well the model explains the data. Exercise: prove this.
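A small hedged sketch of this update for two hypothetical models of a coin (the models, prior, and data below are invented for illustration): multiply prior by likelihood, then normalize.

```python
import numpy as np

prior = np.array([0.5, 0.5])        # P(M) for models "fair" and "biased toward heads"
p_heads = np.array([0.5, 0.8])      # P(heads | M) under each model

heads, tails = 8, 2                 # evidence E: 8 heads in 10 flips
# P(E | M) up to a constant (the binomial coefficient cancels when normalizing).
likelihood = p_heads**heads * (1 - p_heads)**tails

unnormalized = prior * likelihood   # P(M | E) is proportional to this
posterior = unnormalized / unnormalized.sum()
print("P(fair | E), P(biased | E) =", posterior)
```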

Probability Densities For a continuous variable, p(x ∈ (a, b)) = ∫_a^b p(x) dx, with p(x) ≥ 0 and ∫ p(x) dx = 1; the cumulative distribution is P(z) = ∫_{-∞}^z p(x) dx.

Transformed Densities Important concept: a function of a random variable. If x = g(y) is a non-linear invertible transformation, the densities are related by p_y(y) = p_x(g(y)) |g′(y)|, so a density does not simply transform like an ordinary function.
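A quick sampling check of the change-of-variables rule (assumptions: x is uniform on (0, 1) and x = g(y) = exp(y), chosen only as a convenient invertible example), comparing a histogram of y against the predicted density exp(y):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=200_000)   # x ~ Uniform(0, 1), so p_x(x) = 1 on (0, 1)
y = np.log(x)                   # equivalently x = g(y) = exp(y), invertible for y < 0

# Change of variables predicts p_y(y) = p_x(g(y)) |g'(y)| = 1 * exp(y) for y < 0.
edges = np.linspace(-4.0, 0.0, 41)
counts, _ = np.histogram(y, bins=edges)
est_density = counts / (y.size * np.diff(edges))       # empirical density estimate
centers = 0.5 * (edges[:-1] + edges[1:])

# The maximum gap between the histogram and exp(y) should be small (around 0.01).
print(np.max(np.abs(est_density - np.exp(centers))))
```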

Expectations E[f] = Σ_x p(x) f(x), or ∫ p(x) f(x) dx for densities. Conditional expectation (discrete): E_x[f | y] = Σ_x p(x | y) f(x). Approximate expectation (discrete and continuous): E[f] ≈ (1/N) Σ_{n=1}^N f(x_n) for samples x_n drawn from p(x).
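A short sketch of the approximate (Monte Carlo) expectation, with an arbitrary choice of f and p for which the exact answer is known:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # samples from p(x) = N(0, 1)

f = lambda s: s**2
approx = f(x).mean()      # E[f] is approximately (1/N) * sum_n f(x_n)
exact = 1.0               # E[x^2] = var[x] = 1 for a standard normal

print(f"Monte Carlo estimate {approx:.4f} vs exact {exact}")
```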

Variances and Covariances var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]². cov[x, y] = E_{x,y}[xy] − E[x] E[y]. Exercise: prove the variance formula (expand the square).

The Gaussian Distribution N(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)).

Gaussian Mean and Variance E[x] = μ and var[x] = E[x²] − E[x]² = σ².

The Multivariate Gaussian N(x | μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−½ (x − μ)ᵀ Σ^{−1} (x − μ)) for x ∈ R^D, with mean vector μ and covariance matrix Σ.

Reading exponential probability formulas In an infinite space you cannot give every outcome the same probability: the sum Σ_x p(x) would grow to infinity. Instead, use an exponentially decaying form, e.g. p(n) = (1/2)^n. More generally, if there is a relevant feature f(x) and we want to express that “the greater f(x) is, the less probable x is”, use p(x) ∝ exp(−f(x)).

Example: exponential form and sample size Fair coin: any particular sequence of n flips has probability p(n) = 2^{−n}, so the longer the sample, the less likely any specific outcome. On a log scale, ln p(n) = −n ln 2 decreases linearly with the sample size n.

Exponential Form: Gaussian mean The further x is from the mean, the less likely it is: ln p(x) = −(x − μ)² / (2σ²) + const, so the log-probability falls off quadratically with the distance from μ.

Smaller variance decreases probability The smaller the variance σ², the less likely a point x away from the mean is: the term −(x − μ)² / (2σ²) becomes more strongly negative as σ² shrinks.

Minimal energy = max probability In an energy-based model p(x) ∝ exp(−E(x)), the greater the energy E(x) of the joint state, the less probable the state is, since ln p(x) = −E(x) + const.
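A hedged sketch putting the three examples side by side (the energy function is made up purely for illustration): in each case the log-probability is read directly off the exponent.

```python
import numpy as np

# Fair coin: a particular sequence of n flips has p(n) = 2**(-n),
# so ln p(n) = -n * ln 2 decreases linearly with the sample size.
n = np.arange(1, 11)
log_p_sequence = -n * np.log(2.0)

# Gaussian: ln p(x) = -(x - mu)**2 / (2 * sigma2) + const,
# so log-probability falls off quadratically with distance from the mean.
mu, sigma2 = 0.0, 1.0
x = np.linspace(-3.0, 3.0, 7)
log_p_gauss = -(x - mu)**2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)

# Energy-based model: p(x) proportional to exp(-E(x)), so ln p(x) = -E(x) + const.
energy = lambda s: np.abs(s) + 0.1 * s**2    # arbitrary illustrative energy
log_p_energy_unnormalized = -energy(x)

print(log_p_sequence)
print(log_p_gauss)
print(log_p_energy_unnormalized)
```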

Gaussian Parameter Estimation Likelihood function for independent, identically distributed data points x_1, …, x_N: p(x | μ, σ²) = Π_{n=1}^N N(x_n | μ, σ²).

Maximum (Log) Likelihood ln p(x | μ, σ²) = −(1 / 2σ²) Σ_{n=1}^N (x_n − μ)² − (N/2) ln σ² − (N/2) ln(2π). Maximizing with respect to μ and σ² gives μ_ML = (1/N) Σ_n x_n and σ²_ML = (1/N) Σ_n (x_n − μ_ML)².
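A sketch checking the closed-form ML estimates on synthetic data (the true parameters below are arbitrary); it also applies the (N−1)/N bias correction discussed on the next slide.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu, true_sigma2 = 3.0, 4.0
x = rng.normal(true_mu, np.sqrt(true_sigma2), size=10_000)

# ML estimates obtained by setting the gradient of the log-likelihood to zero.
mu_ml = x.mean()                                      # (1/N) sum_n x_n
sigma2_ml = ((x - mu_ml)**2).mean()                   # (1/N) sum_n (x_n - mu_ml)^2, biased
sigma2_unbiased = sigma2_ml * x.size / (x.size - 1)   # undo the (N-1)/N factor

print(mu_ml, sigma2_ml, sigma2_unbiased)
```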

Properties of μ_ML and σ²_ML The sample mean μ_ML is an unbiased estimator of μ. The sample variance σ²_ML is not: E[σ²_ML] = ((N − 1)/N) σ², so it systematically underestimates the true variance.

Curve Fitting Re-visited Model the target as Gaussian noise around the curve: p(t | x, w, β) = N(t | y(x, w), β^{−1}), where β is the precision (inverse variance) of the noise.

Maximum Likelihood Determine w_ML by minimizing the sum-of-squares error E(w) = ½ Σ_{n=1}^N {y(x_n, w) − t_n}²; maximizing the likelihood with respect to β then gives 1/β_ML = (1/N) Σ_n {y(x_n, w_ML) − t_n}².

Predictive Distribution p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^{−1}).

Frequentism vs. Bayesianism Frequentists: probabilities are measured as the frequencies of repeatable events, e.g. coin flips or snowfalls in January. Bayesians: in addition, allow probabilities to be attached to parameter values (e.g. P(μ = 0)). Frequentist model selection: give performance guarantees (e.g. 95% of the time the method is right). Bayesian model selection: choose a prior distribution over parameters and maximize the resulting cost function (the posterior).

MAP: A Step towards Bayes Key point: a Bayesian prior with a hyperparameter leads to regularization. Determine w_MAP by minimizing the regularized sum-of-squares error Ẽ(w) = ½ Σ_n {y(x_n, w) − t_n}² + (λ/2) ‖w‖².
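A minimal sketch of how this quadratic penalty shows up in practice as ridge (regularized least-squares) polynomial fitting; the toy data, degree M, and λ are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy targets

M, lam = 9, 1e-3                               # polynomial degree and regularizer
Phi = np.vander(x, M + 1, increasing=True)     # design matrix of powers of x

# MAP / ridge solution: minimize 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2.
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

# Plain maximum likelihood (lam = 0) for comparison; for M = 9 it can overfit badly.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("||w_map|| =", np.linalg.norm(w_map), " ||w_ml|| =", np.linalg.norm(w_ml))
```

The regularizer keeps the weight vector small, which is exactly the effect of a Gaussian prior on w.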

Bayesian Curve Fitting Probably skip

Bayesian Predictive Distribution p(t | x, X, T) = ∫ p(t | x, w) p(w | X, T) dw: average the predictions over the posterior on w rather than plugging in a single estimate.

Model Selection Cross-validation: split the data into S folds, train on S − 1 folds, evaluate on the held-out fold, rotate through all folds, and average the validation errors; choose the model (or hyperparameter setting) with the lowest average error.
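A hedged sketch of S-fold cross-validation used to compare polynomial degrees on toy data (all specifics, including the degrees tried, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def cv_error(degree, S=5):
    """Average held-out squared error over S folds for a polynomial of this degree."""
    idx = rng.permutation(x.size)
    folds = np.array_split(idx, S)
    errors = []
    for k in range(S):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(S) if j != k])
        w = np.polyfit(x[train], t[train], degree)   # fit on S - 1 folds
        pred = np.polyval(w, x[val])                 # evaluate on the held-out fold
        errors.append(np.mean((pred - t[val])**2))
    return np.mean(errors)

for degree in (1, 3, 9):     # too simple, about right, prone to overfitting
    print(degree, cv_error(degree))
```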

Curse of Dimensionality Rule of Thumb: 10 datapoints per parameter.

Curse of Dimensionality (Figures: polynomial curve fitting with M = 3; Gaussian densities in higher dimensions.)

Decision Theory Inference step: determine either the joint p(x, t) or the posterior p(t | x). Decision step: for a given x, determine the optimal t.

Minimum Misclassification Rate To minimize the probability of misclassification, assign each x to the class with the largest posterior probability p(C_k | x); the decision boundaries lie where the posteriors intersect.

Minimum Expected Loss Example: classify medical images as ‘cancer’ or ‘normal’. The loss matrix L_{kj} has rows indexed by the truth and columns by the decision; misclassifying a cancer as normal is far more costly than a false alarm.

Minimum Expected Loss The decision regions R_j are chosen to minimize the expected loss E[L] = Σ_k Σ_j ∫_{R_j} L_{kj} p(x, C_k) dx, which amounts to assigning each x to the decision j minimizing Σ_k L_{kj} p(C_k | x).
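A small sketch of the resulting decision rule: given the posterior p(C_k | x) and a loss matrix (the 1000-vs-1 numbers below are hypothetical, in the spirit of the cancer/normal example), pick the column with the smallest expected loss.

```python
import numpy as np

# Rows: true class (cancer, normal).  Columns: decision (cancer, normal).
# Hypothetical losses: missing a cancer is far more costly than a false alarm.
L = np.array([[0.0, 1000.0],
              [1.0, 0.0]])

posterior = np.array([0.01, 0.99])     # p(cancer | x), p(normal | x) for some x

expected_loss = posterior @ L          # sum_k L[k, j] * p(C_k | x) for each decision j
decision = np.argmin(expected_loss)
print(expected_loss, "-> decide", ["cancer", "normal"][decision])
```

Even with only a 1% posterior probability of cancer, the large loss for a miss makes ‘cancer’ the minimum-expected-loss decision here.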

Why Separate Inference and Decision? Minimizing risk (loss matrix may change over time) Unbalanced class priors Combining models

Decision Theory for Regression Inference step: determine p(t | x). Decision step: for a given x, make the optimal prediction y(x) for t under a loss function L(t, y(x)), with expected loss E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt. Probably skip

The Squared Loss Function If L(t, y(x)) = {y(x) − t}², the expected loss is minimized by the conditional mean, y(x) = E[t | x].

Generative vs Discriminative Generative approach: model the class-conditional densities p(x | C_k) and priors p(C_k) (or the joint p(x, C_k)), then use Bayes’ theorem to obtain p(C_k | x). Discriminative approach: model p(C_k | x) directly.

Entropy An important quantity in coding theory, statistical physics, and machine learning.

Entropy H[x] = −Σ_x p(x) log₂ p(x) (in bits), or with the natural logarithm (in nats): the average information gained on observing x.

Entropy Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x? If all states are equally likely, H[x] = −8 × (1/8) log₂(1/8) = 3 bits.

Entropy A general compression principle, widely used and intuitive: assign shorter codes to more probable states and longer codes to less probable ones; the average code length cannot beat the entropy (noiseless coding theorem).
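A sketch computing entropy in bits, reproducing the 3-bit answer for 8 equally likely states and showing that a non-uniform distribution over the same 8 states has lower entropy (the non-uniform probabilities are one illustrative choice):

```python
import numpy as np

def entropy_bits(p):
    """H[x] = -sum_x p(x) log2 p(x), skipping zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

uniform8 = np.full(8, 1/8)
skewed = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])

print(entropy_bits(uniform8))   # 3.0 bits
print(entropy_bits(skewed))     # 2.0 bits: fewer bits needed on average
```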

The Maximum Entropy Principle A commonly used principle for model selection: among distributions satisfying the constraints, choose the one that maximizes entropy. Example: in how many ways can N identical objects be allocated to M bins? The entropy is maximized when the objects are spread evenly across the bins, p_i = 1/M for all i, giving H = ln M.

Differential Entropy and the Gaussian Put bins of width Δ along the real line; the discrete entropy is then −Σ_i p(x_i) Δ ln p(x_i) − ln Δ, and dropping the −ln Δ term (which diverges as Δ → 0) gives the differential entropy H[x] = −∫ p(x) ln p(x) dx. For fixed variance σ², the differential entropy is maximized when p(x) = N(x | μ, σ²), in which case H[x] = ½ {1 + ln(2πσ²)}.

Conditional Entropy H[y | x] = −∫∫ p(x, y) ln p(y | x) dx dy, and H[x, y] = H[y | x] + H[x]: the information needed to describe x and y is the information for x plus the extra information for y given x.

The Kullback-Leibler Divergence KL(p ‖ q) = −∫ p(x) ln {q(x) / p(x)} dx ≥ 0, with equality iff p = q. Used in many ML applications as a measure of predictive quality: the extra coding cost incurred by modelling data from p with q.
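A sketch of KL divergence between two small discrete distributions (the numbers are invented), illustrating that it is non-negative, zero only for identical distributions, and not symmetric:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) ln(p(x) / q(x)), assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q), kl(q, p))   # both non-negative and generally unequal
print(kl(p, p))             # 0.0: the divergence vanishes iff p = q
```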

Mutual Information I[x, y] = KL(p(x, y) ‖ p(x) p(y)) = H[x] − H[x | y] = H[y] − H[y | x]: how much knowing y reduces the uncertainty about x (zero iff x and y are independent).
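A sketch computing mutual information from a small joint table (values invented) and checking the identity I[x, y] = H[x] − H[x | y]:

```python
import numpy as np

joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])            # hypothetical p(x, y)
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# I[x, y] = KL( p(x, y) || p(x) p(y) )
I = np.sum(joint * np.log(joint / np.outer(p_x, p_y)))

# Check against I = H[x] - H[x | y].
H_x = -np.sum(p_x * np.log(p_x))
H_x_given_y = -np.sum(joint * np.log(joint / p_y[None, :]))   # -E[ln p(x | y)]
print(I, H_x - H_x_given_y)                 # the two values agree
```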