CS6720 --- Pattern Recognition. Review of Prerequisites in Math and Statistics. Prepared by Li Yang, based on appendix chapters of Pattern Recognition, 4th Ed. by S. Theodoridis and K. Koutroumbas.


CS6720 Pattern Recognition: Review of Prerequisites in Math and Statistics. Prepared by Li Yang. Based on appendix chapters of Pattern Recognition, 4th Ed. by S. Theodoridis and K. Koutroumbas, and figures from Wikipedia.org.

Probability and Statistics
- Probability P(A) of an event A: a real number between 0 and 1.
- Joint probability P(A ∩ B): the probability that both A and B occur in a single experiment. P(A ∩ B) = P(A)P(B) if A and B are independent.
- Probability P(A ∪ B) of the union of A and B: either A or B occurs in a single experiment. P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive.
- Conditional probability: P(A|B) = P(A ∩ B) / P(B).
- Therefore, the Bayes rule: P(A|B) = P(B|A) P(A) / P(B).
- Total probability: let A₁, ..., Aₙ be mutually exclusive events whose union is the whole sample space; then P(B) = Σᵢ P(B|Aᵢ) P(Aᵢ).
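As a quick illustration (not part of the original slides), here is a small Python sketch with made-up numbers that checks the total probability and Bayes rule formulas above for a two-event partition.

```python
# Hypothetical two-class example: verify total probability and Bayes' rule numerically.
# All numbers below are invented for illustration.
p_A = {"w1": 0.6, "w2": 0.4}            # priors P(A_i) over a partition {w1, w2}
p_B_given_A = {"w1": 0.9, "w2": 0.2}    # likelihoods P(B | A_i)

# Total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_B = sum(p_B_given_A[a] * p_A[a] for a in p_A)

# Bayes' rule: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
p_A_given_B = {a: p_B_given_A[a] * p_A[a] / p_B for a in p_A}

print(p_B)            # 0.62
print(p_A_given_B)    # posteriors, which sum to 1
```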

- Probability density function (pdf): p(x) for a continuous random variable x. Total and conditional probabilities can also be extended to pdfs.
- Mean and variance: let p(x) be the pdf of a random variable x; then E[x] = ∫ x p(x) dx and σ² = ∫ (x - E[x])² p(x) dx.
- Statistical independence: p(x, y) = p(x) p(y).
- Kullback-Leibler divergence ("distance"?) of pdfs: KL(p‖q) = ∫ p(x) ln( p(x) / q(x) ) dx. Pay attention that KL(p‖q) ≠ KL(q‖p) in general, so it is not a true distance metric.
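A minimal sketch (NumPy assumed, with arbitrary discrete pmfs) showing that the KL divergence is not symmetric:

```python
import numpy as np

# Made-up pmfs p and q over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

def kl(p, q):
    # KL(p||q) = sum_i p_i * log(p_i / q_i)
    return np.sum(p * np.log(p / q))

print(kl(p, q), kl(q, p))   # the two values differ, so KL is not symmetric
```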

- Characteristic function of a pdf: Φ(ω) = E[exp(jωx)] = ∫ p(x) exp(jωx) dx.
- 2nd characteristic function: Ψ(ω) = ln Φ(ω).
- n-th order moment: E[xⁿ] = (1/jⁿ) dⁿΦ(ω)/dωⁿ evaluated at ω = 0.
- Cumulants: κₙ = (1/jⁿ) dⁿΨ(ω)/dωⁿ evaluated at ω = 0 (e.g., κ₁ is the mean and κ₂ is the variance).
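An illustrative, sample-based sketch (the exponential distribution and its parameters are arbitrary choices) relating empirical moments to the first few cumulants:

```python
import numpy as np

# Estimate low-order moments and cumulants from samples of an arbitrary distribution.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)

m1 = np.mean(x)                   # 1st moment  E[x]
m2 = np.mean(x**2)                # 2nd moment  E[x^2]
m3 = np.mean(x**3)                # 3rd moment  E[x^3]

k1 = m1                           # 1st cumulant = mean
k2 = m2 - m1**2                   # 2nd cumulant = variance
k3 = m3 - 3*m1*m2 + 2*m1**3       # 3rd cumulant

print(k1, k2, k3)                 # for this exponential: roughly 2, 4, 16
```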

Discrete Distributions
- Binomial distribution B(n, p): repeatedly grab n balls, each with probability p of being black. The probability of getting k black balls is P(k) = (n choose k) pᵏ (1 - p)ⁿ⁻ᵏ.
- Poisson distribution: the probability of the number of events occurring in a fixed period of time, if these events occur with a known average rate λ: P(k) = λᵏ exp(-λ) / k!.
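A hedged example using scipy.stats, with made-up parameters n, p and λ, to evaluate the two pmfs:

```python
from scipy.stats import binom, poisson

# Probability of k = 3 black balls under B(n = 10, p = 0.3).
n, p = 10, 0.3
print(binom.pmf(3, n, p))

# Probability of exactly 2 events when the average rate is lam = 4 per interval.
lam = 4.0
print(poisson.pmf(2, lam))
```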

Normal (Gaussian) Distribution
- Univariate N(μ, σ²): p(x) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²)).
- Multivariate N(μ, Σ) in l dimensions: p(x) = (1 / ((2π)^(l/2) |Σ|^(1/2))) exp(-(1/2)(x - μ)ᵀ Σ⁻¹ (x - μ)).
- Central limit theorem: the sum (or average) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed.
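A minimal numeric check of the central limit theorem, using sums of uniform variables; the choice of distribution, n, and sample count is arbitrary:

```python
import numpy as np

# Sums of n i.i.d. uniform(0,1) variables, standardized, should look like N(0, 1).
rng = np.random.default_rng(0)
n = 50
sums = rng.uniform(size=(100_000, n)).sum(axis=1)

# uniform(0,1) has mean 1/2 and variance 1/12
z = (sums - n * 0.5) / np.sqrt(n / 12.0)
print(z.mean(), z.var())   # close to 0 and 1
```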

Other Continuous Distributions
- Chi-square (χ²) distribution with k degrees of freedom: the distribution of a sum of squares of k independent standard normal random variables, that is, x = z₁² + z₂² + ... + zₖ² with each zᵢ ~ N(0, 1).
- Mean: k, variance: 2k.
- For large k, (x - k) / √(2k) is approximately N(0, 1) by the central limit theorem.
- Also, √(2x) is approximately normally distributed with mean √(2k - 1) and unit variance.
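A small simulation sketch (NumPy and SciPy assumed, arbitrary k and sample size) confirming the mean k and variance 2k of the chi-square distribution:

```python
import numpy as np
from scipy.stats import chi2

# A sum of squares of k independent standard normals follows chi-square(k).
rng = np.random.default_rng(0)
k = 5
samples = (rng.standard_normal(size=(100_000, k)) ** 2).sum(axis=1)

print(samples.mean(), samples.var())   # close to k and 2k
print(chi2.mean(k), chi2.var(k))       # exactly k and 2k
```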

Other Continuous Distributions
- t-distribution: used for estimating the mean of a normal distribution when the sample size is small. A t-distributed variable with k degrees of freedom can be written as t = z / √(x/k), where z ~ N(0, 1) and x is an independent χ² variable with k degrees of freedom. Mean: 0 for k > 1; variance: k/(k-2) for k > 2.
- β-distribution Beta(α, β): the posterior distribution of the parameter p of a binomial distribution after observing α - 1 events with probability p and β - 1 events with probability 1 - p. Its pdf is p(x) = x^(α-1) (1 - x)^(β-1) / B(α, β) for 0 ≤ x ≤ 1.
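A short Beta posterior sketch with invented counts of successes and failures, illustrating the Beta-Binomial update described above:

```python
from scipy.stats import beta

# Start from a uniform prior Beta(1, 1); after observing (made-up) 7 successes
# and 3 failures, the posterior over p is Beta(1 + 7, 1 + 3).
a0, b0 = 1, 1
successes, failures = 7, 3
a, b = a0 + successes, b0 + failures

posterior = beta(a, b)
print(posterior.mean())   # posterior mean of p: a / (a + b) = 0.667
```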

Linear Algebra
- Eigenvalues and eigenvectors: for a square matrix A, a scalar λ and a nonzero vector v such that Av = λv.
- A real matrix A is called positive semidefinite if xᵀAx ≥ 0 for every nonzero vector x; A is called positive definite if xᵀAx > 0.
- Positive definite matrices behave like positive numbers: all of their eigenvalues are positive.
- If A is symmetric, Aᵀ = A, then eigenvectors belonging to distinct eigenvalues are orthogonal, vᵢᵀvⱼ = 0, and a full orthonormal set of eigenvectors can be chosen.
- Therefore, a symmetric A can be diagonalized as A = ΦΛΦᵀ, where Φ = [v₁, ..., v_l] has the orthonormal eigenvectors as columns (ΦᵀΦ = I) and Λ = diag(λ₁, ..., λ_l).
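A minimal NumPy sketch, with a made-up symmetric matrix, illustrating the diagonalization A = ΦΛΦᵀ:

```python
import numpy as np

# Symmetric (and, as it turns out, positive definite) example matrix.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, Phi = np.linalg.eigh(A)     # eigh is for symmetric/Hermitian matrices
Lambda = np.diag(eigvals)

print(eigvals)                               # all positive -> A is positive definite
print(np.allclose(Phi @ Lambda @ Phi.T, A))  # A = Phi Lambda Phi^T holds
print(np.allclose(Phi.T @ Phi, np.eye(2)))   # eigenvectors are orthonormal
```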

Correlation Matrix and Inner Product Matrix
Principal component analysis (PCA)
- Let x be a random variable in Rˡ; its correlation matrix Σ = E[xxᵀ] is positive semidefinite and thus can be diagonalized as Σ = ΦΛΦᵀ.
- Assign x' = Φᵀx; then E[x'x'ᵀ] = Λ, i.e., the components of x' are uncorrelated.
- Further assign x'' = Λ^(-1/2)Φᵀx; then E[x''x''ᵀ] = I (whitening).
Classical multidimensional scaling (classical MDS)
- Given a distance matrix D = {d_ij}, the inner product matrix G = {xᵢᵀxⱼ} can be computed by a bidirectional (double) centering process: G = -(1/2) J D⁽²⁾ J, where D⁽²⁾ contains the squared distances d_ij² and J = I - (1/n)11ᵀ.
- G can be diagonalized as G = QΛ'Qᵀ.
- For centered data, nΛ and Λ' share the same set of nonzero eigenvalues, and the embedding coordinates can be recovered as the rows of QΛ'^(1/2).
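The following sketch, on synthetic data, runs both constructions: PCA via eigendecomposition of the estimated correlation matrix, and classical MDS via double centering of the squared distance matrix. The data, dimensions, and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic zero-mean data: n samples in R^3 with very different variances per axis.
X = rng.standard_normal((200, 3)) @ np.diag([2.0, 1.0, 0.2])

# PCA: diagonalize Sigma = E[x x^T], here estimated from the sample.
Sigma = X.T @ X / len(X)
eigvals, Phi = np.linalg.eigh(Sigma)
X_pca = X @ Phi                       # decorrelated coordinates x' = Phi^T x

# Classical MDS: double-center the squared distance matrix to get G = {x_i^T x_j}.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
G = -0.5 * J @ D2 @ J
vals, Q = np.linalg.eigh(G)
X_mds = Q[:, -3:] * np.sqrt(np.maximum(vals[-3:], 0))  # embedding coordinates

print(vals[-3:] / n)   # top eigenvalues of G divided by n ...
print(eigvals)         # ... approximately match the PCA eigenvalues for (near-)centered data
```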

Cost Function Optimization
- Find θ so that a differentiable function J(θ) is minimized.
- Gradient descent method:
  - Start with an initial estimate θ(0).
  - Adjust θ iteratively by θ(t+1) = θ(t) - μ ∂J(θ)/∂θ, where μ > 0 is the step size (learning rate).
  - Taylor expansion of J(θ) at a stationary point θ₀, ignoring higher-order terms within a neighborhood of θ₀: J(θ) ≈ J(θ₀) + (1/2)(θ - θ₀)ᵀ H (θ - θ₀), where H is the Hessian at θ₀; the linear term vanishes because the gradient is zero at a stationary point. This expansion is used to analyze the convergence of the iteration near θ₀.
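A minimal gradient-descent sketch on a made-up quadratic cost; the matrix A, vector b, step size μ = 0.1, and iteration count are arbitrary choices:

```python
import numpy as np

# Quadratic cost J(theta) = 1/2 theta^T A theta - b^T theta, with minimizer A^{-1} b.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])      # positive definite
b = np.array([1.0, -2.0])

theta = np.zeros(2)             # initial estimate theta(0)
mu = 0.1                        # step size

for _ in range(200):
    grad = A @ theta - b        # dJ/dtheta
    theta = theta - mu * grad   # theta(t+1) = theta(t) - mu * gradient

print(theta, np.linalg.solve(A, b))   # both approach the minimizer A^{-1} b
```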

- Newton's method:
  - Adjust θ iteratively by θ(t+1) = θ(t) - H⁻¹ ∂J(θ)/∂θ, where H is the Hessian matrix of J.
  - Converges much faster than gradient descent. In fact, for the quadratic Taylor expansion above, ∂J/∂θ = H(θ - θ₀), so θ(t+1) = θ(t) - H⁻¹H(θ(t) - θ₀) = θ₀: the minimum is found in one iteration.
- Conjugate gradient method: builds successive search directions that are mutually conjugate with respect to H, improving on steepest descent without explicitly inverting the Hessian.
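Continuing the same made-up quadratic, a sketch showing that a single Newton step reaches the minimizer, because the second-order Taylor expansion of a quadratic is exact:

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

theta = np.zeros(2)
grad = A @ theta - b                       # gradient at theta(0)
H = A                                      # Hessian of the quadratic
theta = theta - np.linalg.solve(H, grad)   # theta(1) = theta(0) - H^{-1} grad

print(np.allclose(theta, np.linalg.solve(A, b)))   # True: minimum in one iteration
```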

Constrained Optimization with Equality Constraints
Minimize J(θ) subject to fᵢ(θ) = 0 for i = 1, 2, ..., m.
- Minimization happens at a point where the gradient of J is a linear combination of the constraint gradients: ∂J(θ)/∂θ = Σᵢ λᵢ ∂fᵢ(θ)/∂θ for some multipliers λᵢ.
- Lagrange multipliers: construct the Lagrangian L(θ, λ) = J(θ) - Σᵢ λᵢ fᵢ(θ) and require ∂L/∂θ = 0 and ∂L/∂λᵢ = 0 (the latter recovers the constraints fᵢ(θ) = 0).
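An illustrative constrained problem, not taken from the slides: minimize J(θ) = θ₁² + θ₂² subject to θ₁ + θ₂ - 1 = 0. The Lagrangian conditions give θ₁ = θ₂ = 1/2 analytically; the sketch below just confirms this numerically with SciPy (which handles the constraint internally):

```python
import numpy as np
from scipy.optimize import minimize

J = lambda th: th[0]**2 + th[1]**2
con = {"type": "eq", "fun": lambda th: th[0] + th[1] - 1.0}   # f(theta) = 0

res = minimize(J, x0=np.zeros(2), constraints=[con])
print(res.x)   # approximately [0.5, 0.5]
```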

Constrained Optimization with Inequality Constraints
Minimize J(θ) subject to fᵢ(θ) ≥ 0 for i = 1, 2, ..., m.
- The constraints fᵢ(θ) ≥ 0, i = 1, 2, ..., m, define a feasible region in which the answer lies.
- Karush–Kuhn–Tucker (KKT) conditions: a set of necessary conditions that a local optimizer θ* has to satisfy. There exists a vector λ of Lagrange multipliers such that
  (1) ∂L(θ*, λ)/∂θ = 0, where L(θ, λ) = J(θ) - Σᵢ λᵢ fᵢ(θ); this is the most natural condition.
  (2) λᵢ ≥ 0 for i = 1, 2, ..., m.
  (3) λᵢ fᵢ(θ*) = 0 for i = 1, 2, ..., m (complementary slackness): the constraint fᵢ is inactive if λᵢ = 0, and λᵢ can be strictly positive only if the minimum lies on the boundary fᵢ(θ*) = 0.
- If all λᵢ = 0, the (unconstrained) minimum lies in the interior of the feasible region.
- For convex J(θ) and a convex feasible region, a local minimum is also the global minimum.
- The KKT conditions are still difficult to solve directly; a common strategy is to assume that some of the fᵢ(θ*) are active, solve the resulting equations, and then check that λᵢ ≥ 0.
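A hedged one-dimensional KKT illustration (the problem is invented for this example): minimize J(θ) = (θ - 2)² subject to f(θ) = 1 - θ ≥ 0. The unconstrained minimum θ = 2 is infeasible, so the constraint is active: θ* = 1, with multiplier λ = 2 > 0 and λ f(θ*) = 0 (complementary slackness).

```python
from scipy.optimize import minimize

J = lambda th: (th[0] - 2.0) ** 2
con = {"type": "ineq", "fun": lambda th: 1.0 - th[0]}   # scipy convention: fun(theta) >= 0

res = minimize(J, x0=[0.0], constraints=[con])
print(res.x)   # approximately [1.0], i.e., on the boundary of the feasible region
```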

- Convex function: f(λx + (1 - λ)y) ≤ λf(x) + (1 - λ)f(y) for all x, y and all λ in [0, 1].
- Concave function: the same inequality holds with ≥ instead of ≤.
- Convex set: for any two points x, y in the set, the whole line segment λx + (1 - λ)y, λ in [0, 1], also lies in the set.
- A local minimum of a convex function over a convex set is also a global minimum.
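A tiny numeric spot check (illustrative only) of the convexity inequality for the convex function f(x) = x²:

```python
import numpy as np

# Verify f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) on a grid of lam values.
f = lambda x: x ** 2
x, y = -1.0, 3.0
for lam in np.linspace(0, 1, 11):
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-12
print("convexity inequality holds at all sampled lambda")
```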

- Min-Max duality: for a function F(x, y), max over y of min over x of F(x, y) ≤ min over x of max over y of F(x, y); equality holds when F has a saddle point, and the common value is then attained at that saddle point.

- Lagrange duality: recall the optimization problem: minimize J(θ) subject to fᵢ(θ) ≥ 0, i = 1, 2, ..., m, with Lagrangian L(θ, λ) = J(θ) - Σᵢ λᵢ fᵢ(θ). The dual problem maximizes, over λ ≥ 0, the dual function g(λ) = min over θ of L(θ, λ). By min-max duality, the dual optimum is a lower bound on the primal optimum (weak duality).
- Convex programming: if J(θ) is convex and the fᵢ(θ) are concave (so the feasible region is convex), the duality gap vanishes and the primal and dual problems have the same optimal value (strong duality).
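A concrete, made-up instance of Lagrange duality: for minimize θ² subject to θ - 1 ≥ 0, the dual function has a closed form, and its maximum equals the primal optimum (this problem is convex, so strong duality holds):

```python
import numpy as np

# Lagrangian: L(theta, lam) = theta^2 - lam * (theta - 1), lam >= 0.
# Minimizing over theta gives theta = lam / 2, so the dual function is
#   g(lam) = -lam^2 / 4 + lam.
lambdas = np.linspace(0, 4, 401)
g = -lambdas**2 / 4 + lambdas

best = lambdas[np.argmax(g)]
print(best, g.max())   # lam* = 2, g(lam*) = 1 = J(theta*) with theta* = 1
```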

Mercer's Theorem and the Kernel Method
- Mercer's theorem: a symmetric, positive semidefinite kernel function K(x, y) can be written as an inner product K(x, y) = φ(x)ᵀφ(y) for some mapping φ into a (possibly infinite-dimensional) feature space.
- The kernel method can transform any algorithm that depends solely on dot products between vectors into a kernelized version, by replacing each dot product with the kernel function. The kernelized version is equivalent to the algorithm operating in the range space of φ. Because kernels are used, however, φ is never explicitly computed.
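A minimal kernel-trick sketch, assuming an RBF kernel with an arbitrary γ: the squared distance between feature-space images φ(x) and φ(y) is computed from kernel evaluations alone, without ever forming φ (which is infinite-dimensional for the RBF kernel):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([0.0, -1.0])

# ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y)
dist2 = rbf(x, x) - 2 * rbf(x, y) + rbf(y, y)
print(dist2)
```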