
1 Lecture 2 Probability and what it has to do with data analysis

2 Abstraction A random variable, x, has no set value until you ‘realize’ it; its properties are described by a probability distribution, P.

3 One way to think about it: a pot containing an infinite number of x’s. Drawing one x from the pot “realizes” x. [Figure: the pot of x’s, and the distribution p(x) vs. x]

4 Describing P If x can take on only discrete values, say (1, 2, 3, 4, or 5), then a table works:

x    1    2    3    4    5
P   10%  30%  40%  15%   5%

Probabilities should sum to 100%. For example, there is a 15% probability that x = 4.

5 Sometimes you see probabilities written as fractions instead of percentages:

x     1     2     3     4     5
P   0.10  0.30  0.40  0.15  0.05

Probabilities should sum to 1. For example, there is a 0.15 probability that x = 4. And sometimes you see probabilities plotted as a histogram. [Figure: histogram of P(x) vs. x, with the 0.15 bar at x = 4]
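As a quick sketch (not part of the original slide), the table above can be written in Python; the dictionary name is just an illustration:

```python
# The discrete distribution from the table, as a dict mapping x to P(x).
p = {1: 0.10, 2: 0.30, 3: 0.40, 4: 0.15, 5: 0.05}

assert abs(sum(p.values()) - 1.0) < 1e-12  # probabilities must sum to 1
print(p[4])                                # 0.15 probability that x = 4
```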

6 If x can take on any value, then use a smooth function (or “distribution”) p(x) instead of a table. The probability that x is between x₁ and x₂ is the area under p(x) there; mathematically, P(x₁ < x < x₂) = ∫_{x₁}^{x₂} p(x) dx. [Figure: p(x) with the area between x₁ and x₂ shaded]
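A minimal numerical sketch of this integral, assuming numpy and using a standard normal density as a stand-in for p(x) (the slide leaves p(x) generic):

```python
import numpy as np

# Approximate P(x1 < x < x2) = integral of p(x) dx by the trapezoid rule.
def p(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density

x1, x2 = -1.0, 1.0
x = np.linspace(x1, x2, 10001)
prob = np.trapz(p(x), x)   # area under p(x) between x1 and x2
print(prob)                # ~0.6827 for the standard normal
```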

7 The probability that x is between −∞ and +∞ is 100%, so the total area under p(x) is 1. Mathematically, ∫_{−∞}^{+∞} p(x) dx = 1. [Figure: p(x) with its total area shaded]

8 One reason why all this is relevant … Any measurement that contains noise is treated as a random variable, d, and …

9 The distribution p(d) embodies both the ‘true value’ of the datum being measured and the measurement noise and …

10 All quantities derived from a random variable are themselves random variables, so …

11 The algebra of random variables allows you to understand how measurement noise affects inferences made from the data.

12 Basic Description of Distributions We want two basic numbers: 1) something that describes what values of x commonly occur, and 2) something that describes the variability of the x’s.

13 1) something that describes what values of x commonly occur, that is, where the distribution is centered

14 Mode: the value of x at which the distribution has its peak; the most likely value of x. [Figure: p(x) with its peak at x_mode]

15 The most popular car in the US is the Honda CR-V. But the next car you see on the highway will probably not be a Honda CR-V. [Photos: highway traffic (where’s a CR-V?) and a Honda CR-V]

16 But modes can be deceptive … Here are 100 realizations of x, binned:

bin   0-1  1-2  2-3  3-4  4-5  5-6  6-7  7-8  8-9  9-10
N      3   18   11    8   11   14    8    7   11    9

Sure, the 1-2 range has the most counts, but most of the measurements are bigger than 2! [Figure: p(x) with its peak, x_mode, near the 1-2 bin]
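A short Python sketch reproducing the point from the binned counts above (the variable names are illustrative):

```python
# Bin counts from the slide: (low edge, high edge) -> count.
counts = {(0, 1): 3, (1, 2): 18, (2, 3): 11, (3, 4): 8, (4, 5): 11,
          (5, 6): 14, (6, 7): 8, (7, 8): 7, (8, 9): 11, (9, 10): 9}

modal_bin = max(counts, key=counts.get)  # bin with the most counts
frac_above_2 = sum(n for (lo, hi), n in counts.items() if lo >= 2) / sum(counts.values())
print(modal_bin)      # (1, 2): the mode is in the 1-2 range
print(frac_above_2)   # 0.79: yet most measurements are bigger than 2
```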

17 Median: 50% chance that x is smaller than x_median, and 50% chance that x is bigger than x_median. There is no special reason the median needs to coincide with the peak. [Figure: p(x) with 50% of the area on each side of x_median]

18 Expected value or ‘mean’: the value you would get if you took the mean of lots of realizations of x. Let’s examine a discrete distribution, for simplicity... [Figure: discrete distribution P(x) for x = 1, 2, 3]

19 Hypothetical table of 140 realizations of x:

x      N
1     20
2     80
3     40
Total 140

mean = [20×1 + 80×2 + 40×3] / 140
     = (20/140)×1 + (80/140)×2 + (40/140)×3
     = p(1)×1 + p(2)×2 + p(3)×3
     = Σ_i p(x_i) x_i
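The same calculation as a minimal Python sketch, showing that the sample mean equals Σ_i p(x_i) x_i when p is estimated from counts:

```python
# Counts of realizations from the table: x -> N.
counts = {1: 20, 2: 80, 3: 40}
total = sum(counts.values())  # 140

mean = sum(n * x for x, n in counts.items()) / total            # direct mean
same = sum((n / total) * x for x, n in counts.items())          # sum_i p(x_i) x_i
print(mean, same)  # both 2.142857...
```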

20 By analogy, for a smooth distribution, the expected (or mean) value of x is E(x) = ∫_{−∞}^{+∞} x p(x) dx

21 2) something that describes the variability of the x’s, that is, the width of the distribution

22 Here’s a perfectly sensible way to define the width of a distribution: W₅₀, the width of the region containing the central 50% of the probability (25% on either side of the center). It’s not used much, though. [Figure: p(x) with the central 50% of its area spanning a width W₅₀]

23 Width of a distribution Here’s another way: multiply p(x) by the parabola [x − E(x)]² and integrate. [Figure: p(x) and the parabola [x − E(x)]², both centered at E(x)]

24 Variance = σ² = ∫_{−∞}^{+∞} [x − E(x)]² p(x) dx. The idea is that if the distribution is narrow, most of the probability lines up with the low spot of the parabola; but if it is wide, some of the probability lines up with the high parts of the parabola. [Figure: p(x), the parabola [x − E(x)]², and their product [x − E(x)]² p(x), whose total area is the variance]
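A numerical sketch of this integral, again with a standard normal standing in for p(x) (an assumption; any density would do):

```python
import numpy as np

# Variance as the integral of [x - E(x)]^2 p(x) dx, done numerically.
x = np.linspace(-10, 10, 200001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density

Ex = np.trapz(x * p, x)               # expected value, ~0
var = np.trapz((x - Ex)**2 * p, x)    # variance, ~1
print(Ex, var)
```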

25 σ = √variance is a measure of width … we don’t immediately know its relationship to area, though. [Figure: p(x) with a bar of width σ about E(x)]

26 The Gaussian or normal distribution: p(x) = (1 / (√(2π) σ)) exp{ −(x − x̄)² / (2σ²) }, where x̄ is the expected value and σ² is the variance. Memorize me!
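The formula on this slide, written out as a small Python function (the function name is illustrative):

```python
import numpy as np

def normal_pdf(x, xbar, sigma):
    """Gaussian density with expected value xbar and variance sigma**2."""
    return np.exp(-(x - xbar)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

print(normal_pdf(1.0, xbar=1.0, sigma=1.0))  # peak height = 1/sqrt(2*pi) ~ 0.3989
```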

27 Examples of normal distributions: one with x̄ = 1, σ = 1 and one with x̄ = 3, σ = 0.5. [Figure: the two density curves p(x)]

28 Properties of the normal distribution: Expectation = Median = Mode = x̄, and 95% of the probability lies within 2σ of the expected value. [Figure: p(x) with the region from x̄ − 2σ to x̄ + 2σ shaded, 95%]

29 Again, why all this is relevant … Inference depends on data. You use a measurement, d, to deduce the value of some underlying parameter of interest, m; e.g., use measurements of travel time, d, to deduce the seismic velocity, m, of the earth.

30 The model parameter, m, depends on the measurement, d; so m is a function of d, m(d), so …

31 If data, d, is a random variable then so is model parameter, m All inferences made from uncertain data are themselves uncertain Model parameters are described by a distribution, p(m)

32 Functions of a random variable any function of a random variable is itself a random variable

33 Special case of a linear relationship and a normal distribution: if p(d) is normal with mean d̄ and variance σ_d², and the relationship is linear, m = a d + b, then p(m) is normal with mean a d̄ + b and variance a² σ_d².
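A Monte Carlo sketch of this slide's claim; the numerical values of d̄, σ_d, a, and b below are arbitrary illustrations:

```python
import numpy as np

# If d ~ Normal(dbar, sigma_d**2) and m = a*d + b, then
# m ~ Normal(a*dbar + b, a**2 * sigma_d**2).
rng = np.random.default_rng(0)
dbar, sigma_d, a, b = 5.0, 2.0, 3.0, 1.0

d = rng.normal(dbar, sigma_d, size=1_000_000)  # realizations of d
m = a * d + b                                  # derived random variable

print(m.mean(), a * dbar + b)        # ~16.0 in both cases
print(m.var(), a**2 * sigma_d**2)    # ~36.0 in both cases
```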

34 multivariate distributions

35 Example: Liberty Island is inhabited by both pigeons and seagulls. 40% of the birds are pigeons and 60% of the birds are gulls; 50% of pigeons are white and 50% are tan; 100% of gulls are white.

36 Two variables: species, s, takes two values, pigeon p and gull g; color, c, takes two values, white w and tan t. Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, and 0 are tan gulls.

37 What is the probability that a (random) bird has species s and color c?

        c = w   c = t
s = p    20%     20%
s = g    60%      0%

Note: the sum of all boxes is 100%.

38 This is called the Joint Probability and is written P(s,c)

39 Two continuous variables, say x₁ and x₂, have a joint probability distribution, written p(x₁, x₂), with ∫∫ p(x₁, x₂) dx₁ dx₂ = 1

40 You would contour a joint probability distribution, and it would look something like: [Figure: contour plot of p(x₁, x₂) in the (x₁, x₂) plane]

41 What is the probability that a bird has color c? Start with P(s,c) and sum the columns to get P(c). (Recall the 100 birds: 20 white pigeons, 20 tan pigeons, 60 white gulls, 0 tan gulls.)

        c = w   c = t
s = p    20%     20%
s = g    60%      0%
P(c)     80%     20%

42 What is the probability that a bird has species s? Start with P(s,c) and sum the rows to get P(s). (Recall the 100 birds: 20 white pigeons, 20 tan pigeons, 60 white gulls, 0 tan gulls.)

        c = w   c = t   P(s)
s = p    20%     20%    40%
s = g    60%      0%    60%
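A minimal numpy sketch of these two marginalizations, with the joint table stored as an array (rows: species p, g; columns: color w, t):

```python
import numpy as np

# Joint distribution P(s,c) from the slides.
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

P_c = P_sc.sum(axis=0)  # sum columns over species -> [0.8, 0.2]
P_s = P_sc.sum(axis=1)  # sum rows over color      -> [0.4, 0.6]
print(P_c, P_s)
```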

43 These operations make sense with distributions, too: p(x₁) = ∫ p(x₁, x₂) dx₂ is the distribution of x₁ (irrespective of x₂), and p(x₂) = ∫ p(x₁, x₂) dx₁ is the distribution of x₂ (irrespective of x₁). [Figure: joint p(x₁, x₂) and its two marginal curves]

44 Given that a bird is species s, what is the probability that it has color c? (Recall the 100 birds: 20 white pigeons, 20 tan pigeons, 60 white gulls, 0 tan gulls.)

        c = w   c = t
s = p    50%     50%
s = g   100%      0%

Note: all rows sum to 100%.

45 This is called the Conditional Probability of c given s and is written P(c|s) similarly …

46 Given that a bird is color c, what is the probability that it has species s? (Recall the 100 birds: 20 white pigeons, 20 tan pigeons, 60 white gulls, 0 tan gulls.)

        c = w   c = t
s = p    25%    100%
s = g    75%      0%

Note: all columns sum to 100%. So 25% of white birds are pigeons.
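Both conditional tables can be computed from the joint table; a sketch, normalizing rows for P(c|s) and columns for P(s|c):

```python
import numpy as np

# Joint P(s,c); rows: pigeon, gull; columns: white, tan.
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

P_c_given_s = P_sc / P_sc.sum(axis=1, keepdims=True)  # rows sum to 1
P_s_given_c = P_sc / P_sc.sum(axis=0, keepdims=True)  # columns sum to 1
print(P_c_given_s)  # [[0.5, 0.5], [1.0, 0.0]]
print(P_s_given_c)  # [[0.25, 1.0], [0.75, 0.0]]
```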

47 This is called the Conditional Probability of s given c and is written P(s|c)

48 Beware! P(c|s) ≠ P(s|c)

P(c|s):  c = w   c = t        P(s|c):  c = w   c = t
s = p     50%     50%         s = p     25%    100%
s = g    100%      0%         s = g     75%      0%

49 Lots of errors occur from confusing the two. The probability that, if you have pancreatic cancer, you will die from it: 90%. The probability that, if you die, you will have died of pancreatic cancer: 1.4%. [Photo: actor Patrick Swayze, who died of pancreatic cancer]

50 Note P(s,c) = P(s|c) P(c):

P(s,c):  c = w   c = t        P(s|c):  c = w   c = t
s = p     20%     20%    =    s = p     25%    100%    ×    P(c):  w 80%, t 20%
s = g     60%      0%         s = g     75%      0%

25% of 80% is 20%.

51 And P(s,c) = P(c|s) P(s):

P(s,c):  c = w   c = t        P(c|s):  c = w   c = t
s = p     20%     20%    =    s = p     50%     50%    ×    P(s):  p 40%, g 60%
s = g     60%      0%         s = g    100%      0%

50% of 40% is 20%.

52 Note that since P(s,c) = P(s|c) P(c) = P(c|s) P(s), then P(s) = Σ_c P(s,c) = Σ_c P(s|c) P(c) and P(c) = Σ_s P(s,c) = Σ_s P(c|s) P(s). Continuous versions: p(s) = ∫ p(s,c) dc = ∫ p(s|c) p(c) dc and p(c) = ∫ p(s,c) ds = ∫ p(c|s) p(s) ds

53 Also, since P(s,c) = P(s|c) P(c) = P(c|s) P(s) then P(s|c) = P(c|s) P(s) / P(c) and P(c|s) = P(s|c) P(c) / P(s) … which is called Bayes Theorem

54 In this example, bird color is the observable, the “data”, d, and bird species is the “model parameter”, m. P(c|s), “color given species”, or P(d|m), is making a prediction based on the model: given a pigeon, what’s the probability that it’s tan? P(s|c), “species given color”, or P(m|d), is making an inference from the data: given a tan bird, what’s the probability that it’s a pigeon?

55 Bayes Theorem with data d and model m: P(m|d) = P(d|m) P(m) / P(d) = P(d|m) P(m) / Σ_i P(d|m_i) P(m_i). Bayesian inference: interpret P(m) as our knowledge of m before measuring d. Then P(m|d) is our updated state of knowledge after measuring d.

56 Example of Bayesian inference. Scenario: A body of a man is brought to the morgue. The coroner wants to know, “did the man die of pancreatic cancer?”. Thus there is one model parameter, m, which takes one of two values: Y (he died of pancreatic cancer) and N (he didn’t). Before examining the body, the best estimate of P(m) that can be made is P(Y) = 0.014 and P(N) = 0.986, the rate of death by pancreatic cancer in the general population. Now the coroner performs a test for pancreatic cancer, giving one datum, d, and it is positive, + (as contrasted to negative, −). But the test is not perfect. It has a non-zero rate of both false positives (didn’t have cancer but tested +) and false negatives (did have cancer but tested −), as quantified by the conditional distribution (numbers as used in the worked formula below):

P(d|m):   m = Y   m = N
d = +     0.995   0.005
d = −     0.005   0.995

P(Y|+) = P(+|Y) P(Y) / [P(+|Y) P(Y) + P(+|N) P(N)] = 0.995 × 0.014 / [0.995 × 0.014 + 0.005 × 0.986] = 0.74, or 74%. A 74% chance that the person died of pancreatic cancer is not all that conclusive!
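A sketch of the coroner's calculation, using the numbers as printed in the worked formula on this slide:

```python
# Bayes Theorem: P(Y|+) = P(+|Y) P(Y) / [P(+|Y) P(Y) + P(+|N) P(N)]
P_Y, P_N = 0.014, 0.986          # prior: rate of pancreatic-cancer death
P_pos_Y, P_pos_N = 0.995, 0.005  # test behavior, P(+|Y) and P(+|N)

P_Y_pos = (P_pos_Y * P_Y) / (P_pos_Y * P_Y + P_pos_N * P_N)
print(P_Y_pos)  # ~0.74: a positive test is far from conclusive
```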

57 Why Bayes Theorem is important: it provides a framework relating making a prediction from the model, P(d|m), to making an inference from the data, P(m|d).

58 Bayes Theorem also implies that the joint distribution of data and model parameters, p(d, m), is the fundamental quantity. If you know p(d, m), you know everything there is to know …

59 Expectation, variance, and covariance of a multivariate distribution

60 The expectation is computed by first reducing the distribution to one dimension: take the expectation of p(x₁) to get x̄₁, and take the expectation of p(x₂) to get x̄₂. [Figure: joint p(x₁, x₂) reduced to the marginals p(x₁) and p(x₂)]

61 The variance is also computed by first reducing the distribution to one dimension: take the variance of p(x₁) to get σ₁², and take the variance of p(x₂) to get σ₂². [Figure: the marginals p(x₁) and p(x₂) with widths σ₁ and σ₂]

62 Note that in this distribution, if x₁ is bigger than x̄₁, then x₂ tends to be bigger than x̄₂; and if x₁ is smaller than x̄₁, then x₂ tends to be smaller than x̄₂. This is a positive correlation. [Figure: elongated contours of p(x₁, x₂) tilted upward, with the expected value (x̄₁, x̄₂) marked]

63 Conversely, in this distribution, if x₁ is bigger than x̄₁, then x₂ tends to be smaller than x̄₂; and if x₁ is smaller than x̄₁, then x₂ tends to be bigger than x̄₂. This is a negative correlation. [Figure: elongated contours of p(x₁, x₂) tilted downward, with the expected value (x̄₁, x̄₂) marked]

64 This correlation can be quantified by multiplying the distribution by the four-quadrant function (x₁ − x̄₁)(x₂ − x̄₂) and then integrating: C = ∫∫ (x₁ − x̄₁)(x₂ − x̄₂) p(x₁, x₂) dx₁ dx₂. This is called the “covariance”. [Figure: the four quadrants around (x̄₁, x̄₂), signed + and −]

65 Note that the matrix C with elements C_ij = ∫ (x_i − x̄_i)(x_j − x̄_j) p(x) dᴺx has diagonal elements σ_i², the variance of x_i, and off-diagonal elements cov(x_i, x_j), the covariance of x_i and x_j:

C = [ σ₁²            cov(x₁,x₂)   cov(x₁,x₃)
      cov(x₁,x₂)     σ₂²          cov(x₂,x₃)
      cov(x₁,x₃)     cov(x₂,x₃)   σ₃²        ]
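A numpy sketch estimating such a matrix from realizations; the `true_C` values and sample sizes below are arbitrary illustrations:

```python
import numpy as np

# Each row of `samples` is one realization of the vector x (synthetic data).
rng = np.random.default_rng(0)
true_C = np.array([[2.0, 0.8, 0.0],
                   [0.8, 1.0, -0.3],
                   [0.0, -0.3, 0.5]])
samples = rng.multivariate_normal(mean=[1.0, 2.0, 3.0], cov=true_C, size=100_000)

C = np.cov(samples, rowvar=False)  # diagonal: variances; off-diagonal: covariances
print(C.round(2))                  # close to true_C
```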

66 The “vector of means”, x̄, and the “covariance matrix”, C_x, of a multivariate distribution summarize a lot – but not everything – about the distribution.

67 Functions of a set of random variables: collect N random variables in a vector, x.

68 Special case: the linear function y = M x. The expectation of y is ȳ = M x̄. Memorize!

69 The covariance of y is C_y = M C_x Mᵀ. Memorize!

70 Note that these rules work regardless of the distribution of x: if y is linearly related to x by y = M x, then ȳ = M x̄ (rule for means) and C_y = M C_x Mᵀ (rule for propagating error). Memorize!
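A sketch checking both rules on sample statistics; the matrix M and sizes are arbitrary, and x is deliberately drawn from a uniform (non-normal) distribution to illustrate that the rules do not require normality:

```python
import numpy as np

# For y = M x:  ybar = M xbar  and  C_y = M C_x M^T.
rng = np.random.default_rng(0)
M = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])

x = rng.uniform(0.0, 1.0, size=(100_000, 2))  # realizations of x (not normal)
y = x @ M.T                                   # apply y = M x row by row

xbar, Cx = x.mean(axis=0), np.cov(x, rowvar=False)
print(np.allclose(y.mean(axis=0), M @ xbar))               # True: rule for means
print(np.allclose(np.cov(y, rowvar=False), M @ Cx @ M.T))  # True: error propagation
```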

