Preliminaries Prof. Navneet Goyal CS & IS BITS, Pilani

Topics
- Probability Theory
- Decision Theory
- Information Theory

Probability Theory
- Key concept: dealing with uncertainty, due to noise and finite data sets
- Probability densities
- Bayesian probabilities
- Gaussian (normal) distribution
- Curve fitting revisited
- Bayesian curve fitting
- Maximum likelihood estimation

Probability Theory
- Frequentist or classical approach
- Population parameters are fixed constants whose values are unknown
- Experiments are imagined to be repeated an indefinitely large number of times
- Toss a fair coin 10 times and it would not be unusual to observe 80% heads
- Toss a coin 10 trillion times and we can be fairly certain that the proportion of heads will be close to 50%
- Long-run behavior defines probability!
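A minimal simulation (not from the slides) of the long-run-frequency idea: short runs of coin tosses can deviate a lot, while long runs settle near the true probability of 0.5.

```python
import random

def heads_proportion(n_tosses, p_heads=0.5, seed=0):
    """Simulate n_tosses of a coin and return the observed proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(n_tosses))
    return heads / n_tosses

# Short runs may be far from 0.5; very long runs concentrate close to it.
for n in (10, 1_000, 1_000_000):
    print(n, heads_proportion(n))
```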

Probability Theory
- Frequentist or classical approach
- What is the probability that a terrorist will strike an Indian city using an AK-47?
- Difficult to conceive of the long-run behavior of such an event
- In the frequentist approach, the parameters are fixed and the randomness lies in the data
- Data are viewed as a random sample from a given distribution with unknown but fixed parameters

Probability Theory
- Bayesian approach: turn the assumptions around
- Parameters are considered to be random variables
- Data are considered to be known
- Parameters come from a distribution of possible values
- Bayesians look to the observed data to provide information on likely parameter values
- Let θ represent the parameters of the unknown distribution
- The Bayesian approach requires eliciting a distribution for θ before the data are seen, called the prior distribution p(θ)

Probability Theory
- Bayesian approach
- p(θ) can model extant expert (domain) knowledge, if any, regarding the distribution of θ
- For example, churn-modeling experts in telcos may be aware that a customer exceeding a certain threshold number of calls to customer service may indicate a likelihood to churn
- Combine this with prior information about the distribution of customer-service calls, including its mean and standard deviation
- Non-informative prior: assigns equal probabilities to all values of the parameter
- e.g., prior probability of both churners and non-churners = 0.5 (the telco in question is doomed!)

Probability Theory
- Bayesian approach
- The prior distribution is generally dominated by the overwhelming amount of information found in the data
- p(θ|X) is the posterior probability, where X represents the entire array of data
- This updating of knowledge about θ was first performed by Reverend Thomas Bayes
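A hedged sketch of the update from prior p(θ) to posterior p(θ|X) in the churn setting above; the Beta prior, the conjugate Beta-Binomial model, and the counts are illustrative assumptions, not from the slides.

```python
# Beta-Binomial conjugate update: prior Beta(a, b) on the churn probability theta,
# observe k churners out of n customers -> posterior Beta(a + k, b + n - k).
def beta_binomial_update(a, b, k, n):
    return a + k, b + (n - k)

a0, b0 = 1.0, 1.0          # non-informative (uniform) prior over theta
k, n = 30, 200             # hypothetical data: 30 churners among 200 customers
a1, b1 = beta_binomial_update(a0, b0, k, n)
posterior_mean = a1 / (a1 + b1)
print(f"posterior Beta({a1}, {b1}), mean = {posterior_mean:.3f}")
```

With even a modest amount of data, the posterior mean (about 0.153 here) is driven almost entirely by the observed churn rate rather than by the uniform prior, which is the point made on the slide.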

Probability Theory Apples and Oranges

Probability Theory
- Marginal probability
- Conditional probability
- Joint probability

Probability Theory
- Sum rule
- Product rule

The Rules of Probability
- Sum rule
- Product rule
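The formulas on these two slides do not survive in the transcript; the standard statements (as in Bishop, Ch. 1) are:

```latex
\text{sum rule:}\quad p(X) = \sum_{Y} p(X, Y)
\qquad\qquad
\text{product rule:}\quad p(X, Y) = p(Y \mid X)\, p(X)
```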

Bayes' Theorem
posterior ∝ likelihood × prior
Bayes' theorem plays a central role in ML!
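A numerical sketch of Bayes' theorem in the spirit of the apples-and-oranges slide above: two boxes of fruit, pick a box and then a fruit at random, and ask which box the fruit probably came from. The box contents and the prior below are illustrative assumptions, not taken from the slides.

```python
# Prior over boxes and likelihood of drawing an orange from each box (illustrative).
p_box = {"red": 0.4, "blue": 0.6}                       # prior p(B)
p_orange_given_box = {"red": 6 / 8, "blue": 1 / 4}      # likelihood p(F=orange | B)

# Sum rule: marginal probability of drawing an orange.
p_orange = sum(p_box[b] * p_orange_given_box[b] for b in p_box)

# Bayes' theorem: posterior probability that the box was red, given an orange.
p_red_given_orange = p_box["red"] * p_orange_given_box["red"] / p_orange
print(f"p(orange) = {p_orange:.3f}, p(red | orange) = {p_red_given_orange:.3f}")
```

Observing the orange raises the probability of the red box from the prior 0.4 to roughly 0.67: posterior ∝ likelihood × prior in action.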

Joint Distribution over 2 variables

Probability Densities
If the probability of a real-valued variable x falling in the interval (x, x + δx) is given by p(x) δx as δx → 0, then p(x) is called the probability density over x.
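Written out explicitly (a standard restatement of the slide's definition):

```latex
p(x) = \lim_{\delta x \to 0} \frac{P\bigl(x < X \le x + \delta x\bigr)}{\delta x},
\qquad
P\bigl(X \in (a, b)\bigr) = \int_a^b p(x)\, \mathrm{d}x,
\qquad
\int_{-\infty}^{\infty} p(x)\, \mathrm{d}x = 1 .
```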

The Gaussian Distribution
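The density formula on this slide was an image; for reference, the univariate Gaussian is:

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\},
\qquad
\mathbb{E}[x] = \mu, \quad \operatorname{var}[x] = \sigma^2 .
```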

Decision Theory
- Probability theory provides us with a consistent mathematical framework for quantifying and manipulating uncertainty
- Decision theory combined with probability theory enables us to make optimal decisions in uncertain situations
- Input vector x, target variable t
- The joint probability distribution p(x, t) provides a complete summary of the uncertainty associated with the variables x and t
- Determining p(x, t) from a set of training data is an example of inference, and is a very difficult problem
- In practical applications, we must make a specific prediction for the value of t and take a specific action based on our understanding of the values t is likely to take
- This is decision theory

Decision Theory
- The decision stage is generally very simple, even trivial, once we have solved the inference problem
- Role of probabilities in decision making
- When we receive an X-ray image of a patient, we need to decide its class
- We are interested in the probabilities of the two classes given the image
- Use Bayes' theorem
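A hedged numerical sketch of that last step, computing the class posteriors p(C_k|x) for the X-ray example; the prevalence and likelihood values below are made up purely for illustration.

```python
# Classes: C1 = cancer, C2 = normal; x = the observed X-ray image.
p_class = {"cancer": 0.01, "normal": 0.99}        # priors p(C_k)      (illustrative)
p_x_given_class = {"cancer": 0.8, "normal": 0.1}  # likelihoods p(x|C_k) (illustrative)

# Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x).
p_x = sum(p_class[c] * p_x_given_class[c] for c in p_class)
posterior = {c: p_class[c] * p_x_given_class[c] / p_x for c in p_class}
print(posterior)  # assign x to the class with the larger posterior p(C_k | x)
```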

Decision Theory Errors

Decision Theory
- Optimal decision boundary?
- Equivalent to the minimum-misclassification-rate decision rule: assign each value of x to the class having the higher posterior probability p(C_k|x)

Decision Theory
- Minimizing expected loss
- Simply minimizing the number of misclassifications does not suffice in all cases
- Examples: spam filtering, intrusion detection, disease diagnosis, etc.
- Attach a very high cost to the type of misclassification you want to minimize or eliminate
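A minimal sketch of choosing the decision that minimizes expected loss rather than the misclassification count; the loss-matrix values and posteriors are assumptions chosen so that missing a disease is penalized far more heavily than a false alarm.

```python
# loss[true_class][decision]: cost of making `decision` when the truth is `true_class`.
decisions = ("cancer", "normal")
loss = {
    "cancer": {"cancer": 0, "normal": 1000},  # missing cancer is very costly
    "normal": {"cancer": 1, "normal": 0},
}
posterior = {"cancer": 0.3, "normal": 0.7}    # p(C_k | x) from the inference stage

def expected_loss(decision):
    """Expected loss of a decision, averaged over the class posterior."""
    return sum(posterior[true_c] * loss[true_c][decision] for true_c in posterior)

print({d: expected_loss(d) for d in decisions})
print("decide:", min(decisions, key=expected_loss))
```

Even though "normal" is the more probable class, the asymmetric loss makes "cancer" the decision with the smaller expected loss.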

Information Theory
- How much information is received when we observe a specific value of a discrete random variable x?
- The amount of information is the degree of surprise
- A certain event carries no information; more information is received when the event is unlikely
- Entropy: a measure of disorder/unpredictability, or a measure of surprise
- Tossing a coin:
  - Fair coin: maximum entropy, as there is no way to predict the outcome of the next toss
  - Biased coin: less entropy, as uncertainty is lower and we can bet preferentially on the most frequent result
  - Two-headed coin: zero entropy, as the coin will always turn up heads
- Most collections of data in the real world lie somewhere in between

Information Theory
- How to measure entropy?
- Information content depends upon the probability distribution of x
- We look for a function h(x) that is a monotonic function of the probability p(x)
- If two events x and y are unrelated, then h(x, y) = h(x) + h(y)
- Two unrelated events are statistically independent: p(x, y) = p(x) p(y)
- So h(x) must be the logarithm of p(x): h(x) = -log2 p(x); the negative sign ensures that the information is positive or zero

Information Theory
- h(x) = -log2 p(x); the negative sign ensures that the information is positive or zero
- The choice of base for the logarithm is arbitrary
- Information theory conventionally uses base 2, so the units of h(x) are 'bits'
- Suppose a sender wishes to transmit the value of a random variable to a receiver
- The average amount of information transmitted is obtained by taking the expectation with respect to the distribution p(x)
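Taking that expectation gives the entropy; the formula did not survive in the transcript, so the standard form is reconstructed here:

```latex
H[x] = \mathbb{E}\bigl[h(x)\bigr] = -\sum_{x} p(x)\, \log_2 p(x)
```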

Entropy
An important quantity in:
- coding theory
- statistical physics
- machine learning (e.g., classification using decision trees)

Entropy
- Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x?
- If all states are equally likely, H = -8 × (1/8) log2(1/8) = 3 bits, i.e., we need to transmit a message of length 3 bits
- Now consider a random variable x with 8 possible states (a, b, ..., h) whose respective probabilities are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
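A quick check of the two cases on this slide (not part of the original deck): the uniform distribution over 8 states has entropy 3 bits, while the skewed distribution has entropy 2 bits.

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits, ignoring zero-probability states."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]
print(entropy_bits(uniform))  # 3.0 bits -> a 3-bit message is needed on average
print(entropy_bits(skewed))   # 2.0 bits -> a shorter average code length is possible
```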

Entropy
- For this distribution H = 2 bits: the non-uniform distribution has a smaller entropy than the uniform one!
- This has an interpretation in terms of disorder
- Use shorter codes for more probable events and longer codes for less probable events, in the hope of getting a shorter average code length

Entropy
- Noiseless coding theorem of Shannon: entropy is a lower bound on the number of bits needed to transmit a random variable
- Natural logarithms are used when relating entropy to other topics; the units are then nats instead of bits

Linear Basis Function Models
- Polynomial basis functions: these are global; a small change in x affects all basis functions.

Linear Basis Function Models (4)
- Gaussian basis functions: these are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).

Linear Basis Function Models (5)
- Sigmoidal basis functions: φ_j(x) = σ((x − μ_j)/s), where σ(a) = 1/(1 + exp(−a)) is the logistic sigmoid. These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).
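The formulas on these three slides were images; the sketch below implements the usual forms (as in Bishop, Ch. 3), with μ_j and s as assumed location/scale parameters.

```python
import numpy as np

def polynomial_basis(x, j):
    """Global basis: phi_j(x) = x**j; a change anywhere in x affects every phi_j."""
    return x ** j

def gaussian_basis(x, mu_j, s):
    """Local basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-((x - mu_j) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, mu_j, s):
    """Local basis: phi_j(x) = sigma((x - mu_j) / s), sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))

x = np.linspace(-1, 1, 5)
print(polynomial_basis(x, 3))
print(gaussian_basis(x, 0.0, 0.2))
print(sigmoidal_basis(x, 0.0, 0.1))
```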

Homework
- Read about Gaussian, sigmoidal, and Fourier basis functions
- Sequential learning and online algorithms
- Will discuss in the next class!

The Bias-Variance Decomposition
- The bias-variance decomposition is a formal method for analyzing the prediction error of a predictive model
- In the projectile analogy: bias = average distance between the target and the location where the projectile hits the ground (depends on the angle)
- Variance = deviation between a hit and the average position where the projectile hits the ground (depends on the force)
- Noise: if the target is not stationary, then the observed distance is also affected by changes in the location of the target

The Bias-Variance Decomposition
- A low-degree polynomial has high bias (fits poorly) but low variance across different data sets
- A high-degree polynomial has low bias (fits well) but high variance across different data sets
- Interactive demo: Mod/e_gm_bias_variance.htm
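A hedged simulation of that trade-off (not from the slides): fit low- and high-degree polynomials to many noisy data sets drawn from an assumed sinusoidal target, then measure squared bias and variance of the fits.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0, 1, 100)
f_true = np.sin(2 * np.pi * x_test)          # assumed true function at test points

def fit_many(degree, n_datasets=200, noise=0.3):
    """Fit `degree`-order polynomials to many noisy data sets; return all predictions."""
    preds = []
    for _ in range(n_datasets):
        t = np.sin(2 * np.pi * x_train) + noise * rng.standard_normal(x_train.size)
        coeffs = np.polyfit(x_train, t, degree)
        preds.append(np.polyval(coeffs, x_test))
    return np.array(preds)

for degree in (1, 9):
    preds = fit_many(degree)
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)  # (average fit - truth)^2
    variance = np.mean(preds.var(axis=0))                  # spread across data sets
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

The degree-1 fit shows large squared bias and small variance; the degree-9 fit shows the reverse, matching the bullet points above.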

The Bias-Variance Decomposition
- True height of the Chinese emperor: 200 cm, about 6'6"
- Poll a random American and ask: "How tall is the emperor?"
- We want to determine how wrong the answers are, on average

The Bias-Variance Decomposition
- Each polling scenario has an expected value of 180 cm (i.e., a bias error of 20 cm), but increasing variance in the estimate
- Squared error = (bias error)² + variance
- As variance increases, the error increases
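Written out in full, the standard decomposition also includes an irreducible noise term that the slide omits; here ŷ(x) is the estimator, f(x) the true function, and t = f(x) + ε with noise variance σ².

```latex
\mathbb{E}\bigl[(\hat{y}(x) - t)^2\bigr]
  = \underbrace{\bigl(\mathbb{E}[\hat{y}(x)] - f(x)\bigr)^2}_{\text{(bias)}^2}
  + \underbrace{\mathbb{E}\bigl[(\hat{y}(x) - \mathbb{E}[\hat{y}(x)])^2\bigr]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```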