1
Lecture 4: Probability and what it has to do with data analysis
2
Please read Doug Martinson’s Chapter 2, ‘Probability Theory’, available on Courseworks.
3
Abstraction: a random variable, x, has no set value until you ‘realize’ it; its properties are described by a distribution, p(x).
4
When you realize x, the probability that the value you get lies between x and x+dx is p(x) dx. p(x) is the probability density distribution.
5
The probability, P, that the value you get lies between $x_1$ and $x_2$ is $P = \int_{x_1}^{x_2} p(x)\, dx$. Note that it is written with a capital P, and is a fraction between 0 (never) and 1 (always).
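As a concrete illustration, here is a minimal numpy sketch that approximates this integral with a Riemann sum; the bell-shaped density used here is just an assumed example (any p(x) works), and all variable names are illustrative:

```python
import numpy as np

# Approximate P = integral of p(x) dx from x1 to x2 with a Riemann sum.
# The density below is an assumed example (a standard bell curve).
x1, x2 = -1.0, 1.0
x = np.linspace(x1, x2, 10001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

dx = x[1] - x[0]
P = np.sum(p) * dx          # area under p(x) between x1 and x2
print(P)                    # ~0.68, a fraction between 0 (never) and 1 (always)
```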
6
[Figure: p(x) versus x; the probability P that x is between $x_1$ and $x_2$ is the area under p(x) between $x_1$ and $x_2$.]
7
The probability that the value you get is something is unity: $\int_{-\infty}^{+\infty} p(x)\, dx = 1$, or whatever the allowable range of x is. [Figure: the probability that x is between $-\infty$ and $+\infty$ is unity, so the total area under p(x) is 1.]
8
Why all this is relevant … Any measurement that contains noise is treated as a random variable, x. The distribution p(x) embodies both the ‘true value’ of the quantity being measured and the measurement noise. All quantities derived from a random variable are themselves random variables, so the algebra of random variables lets you understand how measurement noise affects inferences made from the data.
9
Basic Description of Distributions
10
Mode: the value $x_{mode}$ at which the distribution has its peak; the most likely value of x. [Figure: p(x) versus x, with the peak marked at $x_{mode}$.]
11
But modes can be deceptive … [Figure: p(x) versus x on 0–10, with the peak near $x_{mode}$.] Histogram of 100 realizations of x:

x range   N
0–1        3
1–2       18
2–3       11
3–4        8
4–5       11
5–6       14
6–7        8
7–8        7
8–9       11
9–10       9

Sure, the 1–2 range has the most counts, but most of the measurements are bigger than 2!
12
Median: there is a 50% chance x is smaller than $x_{median}$ and a 50% chance x is bigger than $x_{median}$. There is no special reason the median needs to coincide with the peak. [Figure: p(x) with the area split 50%/50% at $x_{median}$.]
13
Expected value or ‘mean’: the x you would get if you took the mean of lots of realizations of x. Let’s examine a discrete distribution, for simplicity... [Figure: bar chart of a discrete distribution over the values x = 1, 2, 3.]
14
Hypothetical table of 140 realizations of x:

x   N
1   20
2   80
3   40
Total: 140

mean = [20·1 + 80·2 + 40·3] / 140 = (20/140)·1 + (80/140)·2 + (40/140)·3 = p(1)·1 + p(2)·2 + p(3)·3 = $\sum_i p(x_i)\, x_i$
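The same arithmetic in code; a minimal numpy sketch of the table above (the variable names are mine, not the lecture’s):

```python
import numpy as np

# Counts from the table of 140 realizations
x = np.array([1, 2, 3])
N = np.array([20, 80, 40])

p = N / N.sum()              # p(x_i) = N_i / 140
mean = np.sum(p * x)         # sum_i p(x_i) x_i
print(mean)                  # 2.142857... = (20*1 + 80*2 + 40*3) / 140
```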
15
By analogy, for a smooth distribution, the expected value of x is $E(x) = \int_{-\infty}^{+\infty} x\, p(x)\, dx$.
16
By the way … you can compute the expected (“mean”) value of any function of x this way: $E(x) = \int_{-\infty}^{+\infty} x\, p(x)\, dx$, $E(x^2) = \int_{-\infty}^{+\infty} x^2\, p(x)\, dx$, $E(\sqrt{x}) = \int_{-\infty}^{+\infty} \sqrt{x}\, p(x)\, dx$, etc.
17
Beware! $E(x^2) \neq E(x)^2$, $E(x) \neq E(\sqrt{x})^2$, and so forth …
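A quick numerical illustration of the first inequality, using a uniform random variable as an assumed example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)   # many realizations of x

print(np.mean(x**2))    # E(x^2): ~1/3 for a uniform [0,1] variable
print(np.mean(x)**2)    # E(x)^2: ~1/4 -- not the same quantity!
```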
18
Width of a distribution. Here’s a perfectly sensible way to define the width of a distribution: the interval $W_{50}$ that contains the central 50% of the probability (25% on either side of the peak) … it’s not used much, though. [Figure: p(x) with the central 50% of the area marked and labeled $W_{50}$.]
19
Width of a distribution, another way: multiply p(x) by the parabola $[x - E(x)]^2$ and integrate. [Figure: p(x) together with the parabola $[x - E(x)]^2$, which has its minimum at E(x).]
20
Variance: $\sigma^2 = \int_{-\infty}^{+\infty} [x - E(x)]^2\, p(x)\, dx$. [Figure: p(x), the parabola $[x - E(x)]^2$, and their product $[x - E(x)]^2\, p(x)$; compute the total area under the product.] The idea is that if the distribution is narrow, most of the probability lines up with the low spot of the parabola; but if it is wide, some of the probability lines up with the high parts of the parabola.
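Evaluating the variance integral numerically; a sketch that assumes a unit-variance bell-shaped p(x) as the example density:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 20001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # example density p(x)
dx = x[1] - x[0]

Ex = np.sum(x * p) * dx                        # E(x) = integral of x p(x) dx
var = np.sum((x - Ex)**2 * p) * dx             # integral of [x-E(x)]^2 p(x) dx
print(Ex, var)                                 # ~0.0, ~1.0
```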
21
The variance $\sigma^2$ is a measure of width … we don’t immediately know its relationship to area, though. [Figure: p(x) with E(x) marked and a width of order $\sigma$ indicated.]
22
The Gaussian or normal distribution: $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{-(x - \bar{x})^2 / 2\sigma^2\}$, with expected value $\bar{x}$ and variance $\sigma^2$. Memorize me!
23
Examples of normal distributions. [Figure: two curves p(x), one with $\bar{x} = 1$, $\sigma = 1$ and one with $\bar{x} = 3$, $\sigma = 0.5$.]
24
Properties of the normal distribution: Expectation = Median = Mode = $\bar{x}$, and 95% of the probability lies within $2\sigma$ of the expected value. [Figure: p(x) with the interval from $\bar{x} - 2\sigma$ to $\bar{x} + 2\sigma$ containing 95% of the area.]
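The 2σ property is easy to check by simulation; a minimal sketch using the parameters $\bar{x} = 3$, $\sigma = 0.5$ from the second example above:

```python
import numpy as np

rng = np.random.default_rng(0)
xbar, sigma = 3.0, 0.5
x = rng.normal(xbar, sigma, size=100_000)    # realizations of a normal x

frac = np.mean(np.abs(x - xbar) < 2 * sigma)
print(frac)    # ~0.954: about 95% within 2 sigma of the expected value
```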
25
Functions of a random variable: any function of a random variable is itself a random variable.
26
If x has distribution p(x), then y(x) has distribution $p(y) = p[x(y)]\, |dx/dy|$.
27
This follows from the rule for transforming integrals: $1 = \int_{x_1}^{x_2} p(x)\, dx = \int_{y_1}^{y_2} p[x(y)]\, \frac{dx}{dy}\, dy$, with the limits chosen so that $y_1 = y(x_1)$, etc.
28
Example: let x have a uniform (white) distribution on [0,1], so p(x) = 1 there. [Figure: flat p(x) = 1 for 0 ≤ x ≤ 1; uniform probability that x is anywhere between 0 and 1.]
29
Let $y = x^2$. Then $x = y^{1/2}$, $y(x{=}0) = 0$, $y(x{=}1) = 1$, $dx/dy = \frac{1}{2} y^{-1/2}$, and $p[x(y)] = 1$. So $p(y) = \frac{1}{2} y^{-1/2}$ on the interval [0,1].
30
Numerical test with histograms of 1000 random numbers. [Figure, top: histogram of x, generated with Excel’s rand() function, which claims to be based upon a uniform distribution; plausible that it’s uniform. Figure, bottom: histogram of $x^2$, generated by squaring the x’s from above; plausible that it’s proportional to $1/\sqrt{y}$.]
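The same numerical test can be repeated with numpy standing in for Excel’s rand(); the sample size of 1000 follows the slide, the bin count is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)   # stand-in for Excel's rand()
y = x**2

counts_x, _ = np.histogram(x, bins=10, range=(0.0, 1.0))
counts_y, _ = np.histogram(y, bins=10, range=(0.0, 1.0))
print(counts_x)   # roughly flat: plausible that x is uniform
print(counts_y)   # piled up near 0, thinning out like 1/sqrt(y)
```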
31
Multivariate distributions
32
Example: Liberty Island is inhabited by both pigeons and seagulls. 40% of the birds are pigeons and 60% of the birds are gulls. 50% of pigeons are white and 50% are tan. 100% of gulls are white.
33
Two variables: species s takes two values, pigeon (p) and gull (g); color c takes two values, white (w) and tan (t). Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, and 0 are tan gulls.
34
What is the probability that a (random) bird has species s and color c?

         c=w    c=t
s=p      20%    20%
s=g      60%     0%

Note: the sum of all boxes is 100%.
35
This is called the Joint Probability and is written P(s,c)
36
Two continuous variables, say $x_1$ and $x_2$, have a joint probability distribution, written $p(x_1, x_2)$, with $\iint p(x_1, x_2)\, dx_1\, dx_2 = 1$.
37
The probability that $x_1$ is between $x_1$ and $x_1 + dx_1$ and $x_2$ is between $x_2$ and $x_2 + dx_2$ is $p(x_1, x_2)\, dx_1\, dx_2$, so $\iint p(x_1, x_2)\, dx_1\, dx_2 = 1$.
38
You would contour a joint probability distribution, and it would look something like this: [Figure: contour plot of $p(x_1, x_2)$ in the $(x_1, x_2)$ plane.]
39
What is the probability that a bird has color c? Start with P(s,c):

         c=w    c=t
s=p      20%    20%
s=g      60%     0%

and sum the columns to get P(c): P(w) = 80%, P(t) = 20%.
40
What is the probability that a bird has species s? Start with P(s,c):

         c=w    c=t
s=p      20%    20%
s=g      60%     0%

and sum the rows to get P(s): P(p) = 40%, P(g) = 60%.
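In code, both marginals are just row and column sums of the joint table; a sketch with the bird-example numbers:

```python
import numpy as np

# P(s,c): rows are species (pigeon, gull), columns are color (white, tan)
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

P_c = P_sc.sum(axis=0)   # sum columns: P(c) = [0.80, 0.20]
P_s = P_sc.sum(axis=1)   # sum rows:    P(s) = [0.40, 0.60]
print(P_c, P_s)
```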
41
These operations make sense with continuous distributions, too: $p(x_1) = \int p(x_1, x_2)\, dx_2$ is the distribution of $x_1$ (irrespective of $x_2$), and $p(x_2) = \int p(x_1, x_2)\, dx_1$ is the distribution of $x_2$ (irrespective of $x_1$). [Figure: contour plot of $p(x_1, x_2)$ alongside the two marginal curves $p(x_1)$ and $p(x_2)$.]
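For a continuous distribution tabulated on a grid, the integral becomes a sum times the grid spacing; a sketch that assumes an uncorrelated 2-D Gaussian as the example joint density:

```python
import numpy as np

x1 = np.linspace(-5.0, 5.0, 401)
x2 = np.linspace(-5.0, 5.0, 401)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
p = np.exp(-(X1**2 + X2**2) / 2) / (2 * np.pi)   # example p(x1, x2)

d2 = x2[1] - x2[0]
p_x1 = p.sum(axis=1) * d2     # p(x1) = integral of p(x1, x2) dx2
print(p_x1.max())             # ~0.3989 = 1/sqrt(2*pi): a 1-D Gaussian peak
```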
42
Given that a bird is species s, what is the probability that it has color c?

         c=w    c=t
s=p      50%    50%
s=g     100%     0%

Note: all rows sum to 100%.
43
This is called the Conditional Probability of c given s and is written P(c|s). Similarly …
44
Given that a bird is color c, what is the probability that it has species s?

         c=w    c=t
s=p      25%   100%
s=g      75%     0%

Note: all columns sum to 100%. So 25% of white birds are pigeons.
45
This is called the Conditional Probability of s given c and is written P(s|c)
46
Beware! $P(c|s) \neq P(s|c)$. Compare:

P(c|s):
         c=w    c=t
s=p      50%    50%
s=g     100%     0%

P(s|c):
         c=w    c=t
s=p      25%   100%
s=g      75%     0%
47
Note that P(s,c) = P(s|c) P(c). With counts per 100 birds:

P(s,c)               P(s|c)                P(c)
      c=w  c=t             c=w   c=t
s=p    20   20   =   s=p   25%  100%   ×   c=w: 80, c=t: 20
s=g    60    0       s=g   75%    0%

For example, 25% of 80 is 20.
48
And P(s,c) = P(c|s) P(s). With counts per 100 birds:

P(s,c)               P(c|s)                P(s)
      c=w  c=t             c=w   c=t
s=p    20   20   =   s=p   50%   50%   ×   s=p: 40, s=g: 60
s=g    60    0       s=g  100%    0%

For example, 50% of 40 is 20.
49
And if P(s,c) = P(s|c) P(c) = P(c|s) P(s), then P(s|c) = P(c|s) P(s) / P(c) and P(c|s) = P(s|c) P(c) / P(s) … which is called Bayes’ Theorem.
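Bayes’ Theorem in action on the bird example; a minimal sketch that recovers P(s|c) from P(c|s) and P(s):

```python
import numpy as np

P_c_given_s = np.array([[0.50, 0.50],    # rows: pigeon, gull
                        [1.00, 0.00]])   # columns: white, tan
P_s = np.array([0.40, 0.60])             # P(pigeon), P(gull)

P_sc = P_c_given_s * P_s[:, None]        # P(s,c) = P(c|s) P(s)
P_c = P_sc.sum(axis=0)                   # P(c), by summing over species

P_s_given_c = P_sc / P_c                 # Bayes: P(s|c) = P(c|s) P(s) / P(c)
print(P_s_given_c[:, 0])                 # [0.25, 0.75]: 25% of white birds are pigeons
```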
50
Why Bayes’ Theorem is important: consider the problem of fitting a straight line to data, d, where the intercept and slope are given by the vector m. If we guess m and use it to predict d, we are doing something like P(d|m). But if we observe d and use it to estimate m, then we are doing something like P(m|d). Bayes’ Theorem provides a framework for relating what we do to get P(d|m) to what we do to get P(m|d).
51
Expectation, variance, and covariance of a multivariate distribution
52
The expected values of $x_1$ and $x_2$ are calculated in a fashion analogous to the one-variable case: $E(x_1) = \iint x_1\, p(x_1, x_2)\, dx_1\, dx_2$ and $E(x_2) = \iint x_2\, p(x_1, x_2)\, dx_1\, dx_2$. Note that $E(x_1) = \iint x_1\, p(x_1, x_2)\, dx_1\, dx_2 = \int x_1 \left[ \int p(x_1, x_2)\, dx_2 \right] dx_1 = \int x_1\, p(x_1)\, dx_1$, so the formula really is just the expectation of a one-variable distribution.
53
The variances of $x_1$ and $x_2$ are calculated in a fashion analogous to the one-variable case, too: $\sigma_{x_1}^2 = \iint (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1\, dx_2$, with $\bar{x}_1 = E(x_1)$, and similarly for $\sigma_{x_2}^2$. Note, once again, that $\sigma_{x_1}^2 = \iint (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1\, dx_2 = \int (x_1 - \bar{x}_1)^2 \left[ \int p(x_1, x_2)\, dx_2 \right] dx_1 = \int (x_1 - \bar{x}_1)^2\, p(x_1)\, dx_1$, so the formula really is just the variance of a one-variable distribution.
54
Note that in this distribution, if $x_1$ is bigger than $\bar{x}_1$, then $x_2$ tends to be bigger than $\bar{x}_2$; and if $x_1$ is smaller than $\bar{x}_1$, then $x_2$ tends to be smaller than $\bar{x}_2$. This is a positive correlation. [Figure: contour plot elongated along an upward-sloping diagonal, with the expected value $(\bar{x}_1, \bar{x}_2)$ marked.]
55
Conversely, in this distribution, if $x_1$ is bigger than $\bar{x}_1$, then $x_2$ tends to be smaller than $\bar{x}_2$; and if $x_1$ is smaller than $\bar{x}_1$, then $x_2$ tends to be bigger than $\bar{x}_2$. This is a negative correlation. [Figure: contour plot elongated along a downward-sloping diagonal, with the expected value $(\bar{x}_1, \bar{x}_2)$ marked.]
56
This correlation can be quantified by multiplying the distribution by a four-quadrant function, positive in the quadrants where the two deviations have the same sign and negative where they differ, and then integrating. The function $(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)$ works fine: $\mathrm{cov}(x_1, x_2) = \iint (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\, p(x_1, x_2)\, dx_1\, dx_2$, called the “covariance”. [Figure: the $(x_1, x_2)$ plane divided into four quadrants about $(\bar{x}_1, \bar{x}_2)$, marked +, −, −, +.]
57
Note that the vector $\bar{x}$ with elements $\bar{x}_i = E(x_i) = \iint x_i\, p(x_1, x_2)\, dx_1\, dx_2$ is the expectation of x, and that the matrix $C_x$ with elements $C_{ij} = \iint (x_i - \bar{x}_i)(x_j - \bar{x}_j)\, p(x_1, x_2)\, dx_1\, dx_2$ has diagonal elements equal to the variance of $x_i$, $[C_x]_{ii} = \sigma_{x_i}^2$, and off-diagonal elements equal to the covariance of $x_i$ and $x_j$, $[C_x]_{ij} = \mathrm{cov}(x_i, x_j)$.
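Estimated from realizations rather than from $p(x_1, x_2)$, the same objects are the sample mean vector and sample covariance matrix; the correlated pair below is an assumed, illustrative construction:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=50_000)
x2 = 0.8 * x1 + rng.normal(0.0, 0.6, size=50_000)   # positively correlated with x1

X = np.vstack([x1, x2])       # rows are variables, columns are realizations
print(X.mean(axis=1))         # expectation vector, ~[0, 0]
print(np.cov(X))              # C_x: variances on the diagonal, cov(x1,x2) off it
```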
58
The “center” of a multivariate distribution is $\bar{x}$; its “width” and “correlatedness” are captured by $C_x$. Together they summarize a lot – but not everything – about a multivariate distribution.
59
Functions of a set of random variables: N random variables collected in a vector, x.
60
Given y(x), do you remember how to transform the integral $\int \cdots \int p(x)\, d^N x$ into an integral over $d^N y$?
61
Given y(x), then $\int \cdots \int p(x)\, d^N x = \int \cdots \int p[x(y)]\, |dx/dy|\, d^N y$, where $|dx/dy|$ is the Jacobian determinant, that is, the determinant of the matrix J whose elements are $J_{ij} = dx_i / dy_j$.
62
But here’s something that’s EASIER … Suppose y(x) is a linear function, y = Mx. Then we can easily calculate the expectation of y: $\bar{y}_i = E(y_i) = \int \cdots \int y_i\, p(x_1 \ldots x_N)\, dx_1 \ldots dx_N = \int \cdots \int \sum_j M_{ij} x_j\, p(x_1 \ldots x_N)\, dx_1 \ldots dx_N = \sum_j M_{ij} \int \cdots \int x_j\, p(x_1 \ldots x_N)\, dx_1 \ldots dx_N = \sum_j M_{ij} E(x_j) = \sum_j M_{ij} \bar{x}_j$. So $\bar{y} = M\bar{x}$.
63
And we can easily calculate the covariance: $[C_y]_{ij} = \int \cdots \int (y_i - \bar{y}_i)(y_j - \bar{y}_j)\, p(x_1 \ldots x_N)\, dx_1 \ldots dx_N = \int \cdots \int \sum_p M_{ip}(x_p - \bar{x}_p) \sum_q M_{jq}(x_q - \bar{x}_q)\, p(x_1 \ldots x_N)\, dx_1 \ldots dx_N = \sum_p M_{ip} \sum_q M_{jq} \int \cdots \int (x_p - \bar{x}_p)(x_q - \bar{x}_q)\, p(x_1 \ldots x_N)\, dx_1 \ldots dx_N = \sum_p M_{ip} \sum_q M_{jq} [C_x]_{pq}$. So $C_y = M C_x M^T$. Memorize!
64
Note that these rules work regardless of the distribution of x: if y is linearly related to x by y = Mx, then $\bar{y} = M\bar{x}$ (rule for means) and $C_y = M C_x M^T$ (rule for propagating error).
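A Monte Carlo check of both rules; M, $\bar{x}$, and $C_x$ below are arbitrary test values, and the Gaussian sampler is just a convenient way to generate x’s with the prescribed mean and covariance (the rules themselves hold for any distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[1.0, 2.0],
              [0.0, 1.0]])
x_bar = np.array([1.0, -1.0])
C_x = np.array([[1.0, 0.5],
                [0.5, 2.0]])

x = rng.multivariate_normal(x_bar, C_x, size=200_000)  # realizations, shape (N, 2)
y = x @ M.T                                            # y = M x, one row at a time

print(y.mean(axis=0), M @ x_bar)       # agree: rule for means, y_bar = M x_bar
print(np.cov(y.T), M @ C_x @ M.T)      # agree: rule for propagating error
```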