Lecture 4: Probability and what it has to do with data analysis


Please read Doug Martinson's Chapter 2, 'Probability Theory', available on Courseworks.

Abstraction: a random variable, x, has no set value until you 'realize' it; its properties are described by a distribution, p(x).

Probability density distribution: when you realize x, the probability that the value you get lies between x and x+dx is p(x) dx.

The probability, P, that the value you get lies between $x_1$ and $x_2$ is $P = \int_{x_1}^{x_2} p(x)\, dx$. Note that it is written with a capital P and is a fraction between 0 (= never) and 1 (= always).

[Figure: p(x) versus x, with the area under the curve between $x_1$ and $x_2$ shaded.] The probability P that x lies between $x_1$ and $x_2$ is proportional to this area.

The probability that the value you get is something is unity: $\int_{-\infty}^{+\infty} p(x)\, dx = 1$ (or whatever the allowable range of x is). [Figure: p(x) versus x; the probability that x lies between $-\infty$ and $+\infty$ is unity, so the total area under the curve is 1.]
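
As a quick numerical check of both statements, here is a minimal Python sketch (my own illustration, not from the lecture) using a hypothetical density p(x) = e^(-x) on [0, infinity):

```python
import numpy as np

# Hypothetical example density: p(x) = exp(-x) for x >= 0
x = np.linspace(0.0, 50.0, 200001)     # grid wide enough that the tail is negligible
p = np.exp(-x)

# The value you get is always *something*: total area under p(x) is 1
print(np.trapz(p, x))                  # ~1.0

# The probability that x falls between x1 and x2 is the area over that interval
x1, x2 = 0.5, 1.5
inside = (x >= x1) & (x <= x2)
print(np.trapz(p[inside], x[inside]))  # ~0.383, i.e. exp(-0.5) - exp(-1.5)
```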

Why all this is relevant: any measurement that contains noise is treated as a random variable, x. The distribution p(x) embodies both the 'true value' of the quantity being measured and the measurement noise. All quantities derived from a random variable are themselves random variables, so the algebra of random variables allows you to understand how measurement noise affects inferences made from the data.

Basic Description of Distributions

Mode: the value of x at which the distribution has its peak, i.e. the most likely value of x. [Figure: p(x) versus x, with $x_{mode}$ at the peak.]

But modes can be deceptive... [Figure: a distribution p(x) with its peak near the left edge, alongside a histogram of 100 realizations of x.] Sure, the 1-2 range has the most counts, but most of the measurements are bigger than 2!

Median: 50% chance that x is smaller than $x_{median}$ and 50% chance that x is bigger than $x_{median}$. There is no special reason the median needs to coincide with the peak. [Figure: p(x) with half of the area on either side of $x_{median}$.]

Expected value, or 'mean': the value of x you would get if you took the mean of lots of realizations of x. Let's examine a discrete distribution, for simplicity...

Hypothetical table of 140 realizations of x: the value x = 1 occurs 20 times, x = 2 occurs 80 times, and x = 3 occurs 40 times (total 140). mean $= [20 \times 1 + 80 \times 2 + 40 \times 3] / 140 = (20/140) \times 1 + (80/140) \times 2 + (40/140) \times 3 = p(1) \times 1 + p(2) \times 2 + p(3) \times 3 = \sum_i p(x_i)\, x_i \approx 2.14$
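
The same computation as a short Python sketch, using the counts from the table above:

```python
import numpy as np

x_values = np.array([1, 2, 3])
counts   = np.array([20, 80, 40])   # the 140 hypothetical realizations

p = counts / counts.sum()           # p(1), p(2), p(3)
mean = np.sum(p * x_values)         # sum over i of p(x_i) * x_i
print(mean)                         # ~2.14
```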

By analogy, for a smooth distribution the expected value of x is $E(x) = \int_{-\infty}^{+\infty} x\, p(x)\, dx$.

By the way, you can compute the expected ('mean') value of any function of x this way: $E(x) = \int_{-\infty}^{+\infty} x\, p(x)\, dx$, $E(x^2) = \int_{-\infty}^{+\infty} x^2\, p(x)\, dx$, $E(\sqrt{x}) = \int_{-\infty}^{+\infty} \sqrt{x}\, p(x)\, dx$, etc.

Beware: $E(x^2) \neq [E(x)]^2$, $E(x) \neq [E(\sqrt{x})]^2$, and so forth...
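
A quick numerical illustration of the warning, reusing the discrete distribution from the table above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
p = np.array([20, 80, 40]) / 140.0

E_x  = np.sum(p * x)       # E(x)   ~ 2.14
E_x2 = np.sum(p * x**2)    # E(x^2) = 5.0
print(E_x2, E_x**2)        # 5.0 vs ~4.59, so E(x^2) != E(x)^2
```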

Width of a distribution. Here's a perfectly sensible way to define the width: $W_{50}$, the width of the central interval that contains 50% of the probability (leaving 25% in each tail). It's not used much, though. [Figure: p(x) with the central 50% region of width $W_{50}$ marked.]

Here's another way to measure width: multiply p(x) by the parabola $[x - E(x)]^2$ and integrate. [Figure: p(x) and the parabola $[x - E(x)]^2$, both centered on E(x).]

Variance: $\sigma^2 = \int_{-\infty}^{+\infty} [x - E(x)]^2\, p(x)\, dx$. [Figure: p(x), the parabola $[x - E(x)]^2$, and their product $[x - E(x)]^2\, p(x)$; compute this total area.] The idea is that if the distribution is narrow, most of the probability lines up with the low spot of the parabola, but if it is wide, some of the probability lines up with the high parts of the parabola.

The square root of the variance, $\sigma$, is a measure of width... though we don't immediately know its relationship to area. [Figure: p(x) with $\sigma$ marked about E(x).]
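
Here is a sketch of the variance computed directly from this definition, for the same hypothetical exponential density used earlier (its true mean and variance are both 1):

```python
import numpy as np

x = np.linspace(0.0, 50.0, 200001)
p = np.exp(-x)                           # hypothetical p(x), as before

E_x   = np.trapz(x * p, x)               # expected value
var   = np.trapz((x - E_x)**2 * p, x)    # integral of the parabola times p(x)
sigma = np.sqrt(var)
print(E_x, var, sigma)                   # all ~1.0 for this density
```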

The Gaussian or normal distribution: $p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(x - \bar{x})^2}{2\sigma^2} \right\}$, where $\bar{x}$ is the expected value and $\sigma^2$ is the variance. Memorize me!

Examples of normal distributions. [Figure: two normal curves, one with $\bar{x} = 1$, $\sigma = 1$ and one with $\bar{x} = 3$, $\sigma = 0.5$.]

Properties of the normal distribution: Expectation = Median = Mode = $\bar{x}$, and 95% of the probability lies within $2\sigma$ of the expected value. [Figure: normal curve with the interval from $\bar{x} - 2\sigma$ to $\bar{x} + 2\sigma$ shaded, containing 95% of the area.]
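
A minimal sketch that evaluates the normal density from the formula above and checks the 2-sigma property, using the $\bar{x} = 3$, $\sigma = 0.5$ values from the example plot:

```python
import numpy as np

def normal_pdf(x, xbar, sigma):
    """p(x) = exp(-(x - xbar)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))"""
    return np.exp(-(x - xbar)**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

xbar, sigma = 3.0, 0.5
x = np.linspace(xbar - 10*sigma, xbar + 10*sigma, 100001)
p = normal_pdf(x, xbar, sigma)

inside = (x >= xbar - 2*sigma) & (x <= xbar + 2*sigma)
print(np.trapz(p, x))                   # total area ~1
print(np.trapz(p[inside], x[inside]))   # ~0.954: about 95% within 2 sigma
```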

Functions of a random variable: any function of a random variable is itself a random variable.

If x has distribution p(x), then y(x) has distribution $p(y) = p[x(y)]\, |dx/dy|$.

This follows from the rule for transforming integrals: $1 = \int_{x_1}^{x_2} p(x)\, dx = \int_{y_1}^{y_2} p[x(y)]\, \frac{dx}{dy}\, dy$, with the limits chosen so that $y_1 = y(x_1)$, etc.

Example: let x have a uniform ('white') distribution on [0,1], so p(x) = 1 there: uniform probability that x is anywhere between 0 and 1. [Figure: p(x) = 1 on the interval $0 \le x \le 1$.]

Let $y = x^2$; then $x = y^{1/2}$, $y(x{=}0) = 0$, $y(x{=}1) = 1$, $dx/dy = \tfrac{1}{2} y^{-1/2}$, and $p[x(y)] = 1$. So $p(y) = \tfrac{1}{2} y^{-1/2}$ on the interval [0,1].

Numerical test: histograms of 1000 random numbers. [Figure, left: histogram of x generated with Excel's rand() function, which claims to be based on a uniform distribution; plausibly uniform. Right: histogram of $x^2$, generated by squaring the x's from above; plausibly proportional to $1/\sqrt{y}$.]
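
The same numerical test can be reproduced in Python instead of Excel; this is a sketch of the idea, not the original spreadsheet:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 1000)   # 1000 realizations of a uniform x on [0,1]
y = x**2                          # the transformed variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)              # should look roughly flat (uniform)
ax1.set_title("x, uniform on [0,1]")
ax2.hist(y, bins=20)              # should pile up near 0, like (1/2) y^(-1/2)
ax2.set_title("y = x^2")
plt.tight_layout()
plt.show()
```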

multivariate distributions

Example: Liberty Island is inhabited by both pigeons and seagulls. 40% of the birds are pigeons and 60% of the birds are gulls; 50% of pigeons are white and 50% are tan; 100% of gulls are white.

Two variables: species s takes two values, pigeon p and gull g; color c takes two values, white w and tan t. Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, and 0 are tan gulls.

What is the probability that a (random) bird has species s and color c?
P(s,c):     c = w   c = t
s = p        20%     20%
s = g        60%      0%
Note: the sum of all boxes is 100%.

This is called the Joint Probability and is written P(s,c)

Two continuous variables, say $x_1$ and $x_2$, have a joint probability distribution written $p(x_1, x_2)$, with $\int\!\!\int p(x_1, x_2)\, dx_1\, dx_2 = 1$.

The probability that $x_1$ is between $x_1$ and $x_1 + dx_1$ and $x_2$ is between $x_2$ and $x_2 + dx_2$ is $p(x_1, x_2)\, dx_1\, dx_2$, so $\int\!\!\int p(x_1, x_2)\, dx_1\, dx_2 = 1$.

You would contour a joint probability distribution; it would look something like a set of nested contours in the $(x_1, x_2)$ plane. [Figure: contour plot of $p(x_1, x_2)$.]

What is the probability that a bird has color c? Start with P(s,c) and sum the columns to get P(c): P(w) = 80%, P(t) = 20%.

What is the probability that a bird has species s? Start with P(s,c) and sum the rows to get P(s): P(p) = 40%, P(g) = 60%.
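
A minimal sketch of these two marginalizations for the bird table, with P(s,c) stored as a 2x2 numpy array (rows = species p, g; columns = colors w, t):

```python
import numpy as np

# Joint probability P(s,c): rows = species (pigeon, gull), columns = color (white, tan)
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

P_c = P_sc.sum(axis=0)   # sum down each column -> P(c) = [0.80, 0.20]
P_s = P_sc.sum(axis=1)   # sum across each row  -> P(s) = [0.40, 0.60]
print(P_c, P_s)
```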

These operations make sense with distributions, too: $p(x_1) = \int p(x_1, x_2)\, dx_2$ is the distribution of $x_1$ (irrespective of $x_2$), and $p(x_2) = \int p(x_1, x_2)\, dx_1$ is the distribution of $x_2$ (irrespective of $x_1$). [Figure: contours of $p(x_1, x_2)$ with the two marginal curves $p(x_1)$ and $p(x_2)$ drawn along the axes.]

Given that a bird is species s, what is the probability that it has color c?
P(c|s):     c = w   c = t
s = p        50%     50%
s = g       100%      0%
Note: all rows sum to 100%.

This is called the Conditional Probability of c given s and is written P(c|s) similarly …

Given that a bird is color c, what is the probability that it has species s?
P(s|c):     c = w   c = t
s = p        25%    100%
s = g        75%      0%
Note: all columns sum to 100%. So 25% of white birds are pigeons.

This is called the Conditional Probability of s given c and is written P(s|c)

Beware! $P(c|s) \neq P(s|c)$. Compare the two tables above: for example, P(w|p) = 50% but P(p|w) = 25%.

Note that $P(s,c) = P(s|c)\, P(c)$. For example, the white-pigeon entry: 25% of 80% is 20%.

And $P(s,c) = P(c|s)\, P(s)$. For example, the white-pigeon entry again: 50% of 40% is 20%.

And if $P(s,c) = P(s|c)\, P(c) = P(c|s)\, P(s)$, then $P(s|c) = P(c|s)\, P(s) / P(c)$ and $P(c|s) = P(s|c)\, P(c) / P(s)$, which is called Bayes' theorem.
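
A quick numerical check of Bayes' theorem using the bird numbers from earlier (P(w|p) = 0.5, P(p) = 0.4, P(w) = 0.8):

```python
# Bayes' theorem with the bird example
P_w_given_p = 0.5    # half of the pigeons are white
P_p = 0.4            # 40% of the birds are pigeons
P_w = 0.8            # 80% of the birds are white

P_p_given_w = P_w_given_p * P_p / P_w
print(P_p_given_w)   # 0.25, i.e. 25% of white birds are pigeons, as in the table
```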

Why Bayes' theorem is important: consider the problem of fitting a straight line to data d, where the intercept and slope are given by the vector m. If we guess m and use it to predict d, we are doing something like P(d|m). But if we observe d and use it to estimate m, then we are doing something like P(m|d). Bayes' theorem provides a framework for relating what we do to get P(d|m) to what we do to get P(m|d).

Expectation, variance, and covariance of a multivariate distribution.

The expected values of $x_1$ and $x_2$ are calculated in a fashion analogous to the one-variable case: $E(x_1) = \int\!\!\int x_1\, p(x_1, x_2)\, dx_1\, dx_2$ and $E(x_2) = \int\!\!\int x_2\, p(x_1, x_2)\, dx_1\, dx_2$. Note that $E(x_1) = \int\!\!\int x_1\, p(x_1, x_2)\, dx_1\, dx_2 = \int x_1 \left[ \int p(x_1, x_2)\, dx_2 \right] dx_1 = \int x_1\, p(x_1)\, dx_1$, so the formula really is just the expectation of a one-variable distribution.

The variances of $x_1$ and $x_2$ are calculated in a fashion analogous to the one-variable case, too: $\sigma_{x_1}^2 = \int\!\!\int (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1\, dx_2$, with $\bar{x}_1 = E(x_1)$, and similarly for $\sigma_{x_2}^2$. Note, once again, that $\sigma_{x_1}^2 = \int\!\!\int (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1\, dx_2 = \int (x_1 - \bar{x}_1)^2 \left[ \int p(x_1, x_2)\, dx_2 \right] dx_1 = \int (x_1 - \bar{x}_1)^2\, p(x_1)\, dx_1$, so the formula really is just the variance of a one-variable distribution.

Note that in this distribution, if $x_1$ is bigger than $\bar{x}_1$ then $x_2$ tends to be bigger than $\bar{x}_2$, and if $x_1$ is smaller than $\bar{x}_1$ then $x_2$ tends to be smaller than $\bar{x}_2$. This is a positive correlation. [Figure: contours of $p(x_1, x_2)$ elongated along an upward-sloping diagonal, with the expected value $(\bar{x}_1, \bar{x}_2)$ marked.]

Conversely, in this distribution, if $x_1$ is bigger than $\bar{x}_1$ then $x_2$ tends to be smaller than $\bar{x}_2$, and if $x_1$ is smaller than $\bar{x}_1$ then $x_2$ tends to be bigger than $\bar{x}_2$. This is a negative correlation. [Figure: contours of $p(x_1, x_2)$ elongated along a downward-sloping diagonal, with the expected value marked.]

This correlation can be quantified by multiplying the distribution by a four-quadrant function and then integrating. The function $(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)$ works fine: $\mathrm{cov}(x_1, x_2) = \int\!\!\int (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\, p(x_1, x_2)\, dx_1\, dx_2$, called the "covariance". [Figure: the four-quadrant function $(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)$, positive in two opposite quadrants and negative in the other two, overlaid on the distribution.]

Note that the vector $\bar{x}$ with elements $\bar{x}_i = E(x_i) = \int\!\!\int x_i\, p(x_1, x_2)\, dx_1\, dx_2$ is the expectation of x, and the matrix $C_x$ with elements $[C_x]_{ij} = \int\!\!\int (x_i - \bar{x}_i)(x_j - \bar{x}_j)\, p(x_1, x_2)\, dx_1\, dx_2$ has diagonal elements equal to the variances of the $x_i$, $[C_x]_{ii} = \sigma_{x_i}^2$, and off-diagonal elements equal to the covariances of $x_i$ and $x_j$, $[C_x]_{ij} = \mathrm{cov}(x_i, x_j)$.
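
In practice $\bar{x}$ and $C_x$ are often estimated from realizations rather than from the integral definitions; here is a sketch using a made-up correlated 2-D normal as the source of samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up example: samples from a correlated bivariate normal
true_mean = np.array([1.0, 2.0])
true_cov  = np.array([[1.0, 0.8],
                      [0.8, 2.0]])
samples = rng.multivariate_normal(true_mean, true_cov, size=100_000)  # shape (N, 2)

x_bar = samples.mean(axis=0)           # estimate of the expectation vector
C_x   = np.cov(samples, rowvar=False)  # estimate of the covariance matrix
print(x_bar)   # ~[1.0, 2.0]
print(C_x)     # diagonal ~ variances, off-diagonal ~ 0.8 (the covariance)
```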

The expectation $\bar{x}$ gives the "center" of a multivariate distribution, and the covariance matrix $C_x$ gives its "width" and "correlatedness". Together they summarize a lot, but not everything, about a multivariate distribution.

Functions of a set of random variables: consider N random variables collected into a vector, x.

Given y(x), do you remember how to transform the integral? $\int \cdots \int p(\mathbf{x})\, d^N x = \int \cdots \int\ ?\ \, d^N y$

Given y(x), then $\int \cdots \int p(\mathbf{x})\, d^N x = \int \cdots \int p[\mathbf{x}(\mathbf{y})]\, \left| \frac{d\mathbf{x}}{d\mathbf{y}} \right|\, d^N y$, where $\left| \frac{d\mathbf{x}}{d\mathbf{y}} \right|$ is the Jacobian determinant, that is, the determinant of the matrix J whose elements are $J_{ij} = \partial x_i / \partial y_j$.

But here's something that's easier. Suppose y(x) is a linear function, $\mathbf{y} = M\mathbf{x}$. Then we can easily calculate the expectation of y: $\bar{y}_i = E(y_i) = \int \cdots \int y_i\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \int \cdots \int \sum_j M_{ij} x_j\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \sum_j M_{ij} \int \cdots \int x_j\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \sum_j M_{ij} E(x_j) = \sum_j M_{ij} \bar{x}_j$. So $\bar{\mathbf{y}} = M\bar{\mathbf{x}}$.

And we can easily calculate the covariance: $[C_y]_{ij} = \int \cdots \int (y_i - \bar{y}_i)(y_j - \bar{y}_j)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \int \cdots \int \sum_p M_{ip}(x_p - \bar{x}_p) \sum_q M_{jq}(x_q - \bar{x}_q)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \sum_p M_{ip} \sum_q M_{jq} \int \cdots \int (x_p - \bar{x}_p)(x_q - \bar{x}_q)\, p(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = \sum_p M_{ip} \sum_q M_{jq} [C_x]_{pq}$. So $C_y = M C_x M^T$. Memorize!

Note that these rules work regardless of the distribution of x: if y is linearly related to x, $\mathbf{y} = M\mathbf{x}$, then $\bar{\mathbf{y}} = M\bar{\mathbf{x}}$ (rule for means) and $C_y = M C_x M^T$ (rule for propagating error).
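
A Monte Carlo sanity check of the two rules; M, $\bar{x}$, and $C_x$ below are arbitrary example values, and the samples happen to be normal, but the rules themselves do not require that:

```python
import numpy as np

rng = np.random.default_rng(2)

M = np.array([[1.0,  2.0],
              [0.0,  1.0],
              [3.0, -1.0]])            # arbitrary 3x2 linear map, y = M x
x_bar = np.array([0.5, -1.0])
C_x   = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

# Predictions from the rules
y_bar_rule = M @ x_bar
C_y_rule   = M @ C_x @ M.T

# Compare with sample estimates
x = rng.multivariate_normal(x_bar, C_x, size=200_000)   # shape (N, 2)
y = x @ M.T                                             # apply y = M x to every sample
print(y.mean(axis=0), "vs", y_bar_rule)                 # nearly equal
print(np.cov(y, rowvar=False))                          # nearly equal to C_y_rule
print(C_y_rule)
```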