
1 Lecture 4 Probability and what it has to do with data analysis

2 Please read Doug Martinson’s Chapter 2, ‘Probability Theory’, available on Courseworks.

3 Abstraction: a random variable, x, has no set value until you ‘realize’ it; its properties are described by a distribution, p(x).

4 When you realize x, the probability that the value you get is between x and x+dx is p(x) dx. p(x) is the probability density distribution.

5 The probability, P, that the value you get is between x_1 and x_2 is P = \int_{x_1}^{x_2} p(x)\, dx. Note that it is written with a capital P and is a fraction between 0 (= never) and 1 (= always).
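As a concrete illustration of this definition, here is a minimal Python/NumPy sketch that approximates P as the area under p(x) between x_1 and x_2 by summing p(x) dx over a fine grid. The choice of a unit Gaussian for p(x) and all variable names are mine, not from the lecture.

```python
import numpy as np

def p(x):
    # Example density only: a Gaussian with zero mean and unit variance
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

x1, x2 = -1.0, 1.0
x = np.linspace(x1, x2, 10001)
dx = x[1] - x[0]

P = np.sum(p(x) * dx)   # Riemann-sum approximation of the integral of p(x) from x1 to x2
print(P)                # about 0.68 for this particular p(x)
```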

6 [Figure: p(x) vs. x, with the area under the curve between x_1 and x_2 shaded. The probability P that x is between x_1 and x_2 is given by this area.]

7 The probability that the value you get is something is unity: \int_{-\infty}^{+\infty} p(x)\, dx = 1 (or whatever the allowable range of x is). [Figure: p(x) vs. x; the probability that x is between -\infty and +\infty is unity, so the total area under p(x) is 1.]

8 Why all this is relevant … Any measurement that contains noise is treated as a random variable, x. The distribution p(x) embodies both the ‘true value’ of the quantity being measured and the measurement noise. All quantities derived from a random variable are themselves random variables, so the algebra of random variables allows you to understand how measurement noise affects inferences made from the data.

9 Basic Description of Distributions

10 [Figure: p(x) vs. x, with the peak marked at x_mode.] Mode: the x at which the distribution has its peak; the most likely value of x.

11 But modes can be deceptive … [Figure: histogram of 100 realizations of x, peaking in the 1-2 bin.]

  x range    N
  0-1        3
  1-2       18
  2-3       11
  3-4        8
  4-5       11
  5-6       14
  6-7        8
  7-8        7
  8-9       11
  9-10       9

Sure, the 1-2 range has the most counts, but most of the measurements are bigger than 2!

12 [Figure: p(x) vs. x, with x_median marked; 50% of the area lies on either side.] Median: 50% chance x is smaller than x_median, 50% chance x is bigger than x_median. There is no special reason the median needs to coincide with the peak.

13 Expected value or ‘mean’: the x you would get if you took the mean of lots of realizations of x. Let’s examine a discrete distribution, for simplicity... [Figure: a discrete distribution with bars at x = 1, 2, 3.]

14 Hypothetical table of 140 realizations of x:

  x      N
  1     20
  2     80
  3     40
  Total 140

mean = (20·1 + 80·2 + 40·3) / 140 = (20/140)·1 + (80/140)·2 + (40/140)·3 = p(1)·1 + p(2)·2 + p(3)·3 = \sum_i p(x_i)\, x_i
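A quick NumPy check of this arithmetic (a sketch; the array names are my own):

```python
import numpy as np

x = np.array([1, 2, 3])        # values
N = np.array([20, 80, 40])     # counts from the hypothetical table (140 realizations)

p = N / N.sum()                # empirical probabilities p(x_i)
mean = np.sum(p * x)           # sum_i p(x_i) x_i
print(mean)                    # 2.1428..., the same as (20*1 + 80*2 + 40*3) / 140
```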

15 By analogy, for a smooth distribution, the expected value of x is E(x) = \int_{-\infty}^{+\infty} x\, p(x)\, dx.

16 By the way … you can compute the expected (“mean”) value of any function of x this way: E(x) = \int_{-\infty}^{+\infty} x\, p(x)\, dx, E(x^2) = \int_{-\infty}^{+\infty} x^2\, p(x)\, dx, E(\sqrt{x}) = \int_{-\infty}^{+\infty} \sqrt{x}\, p(x)\, dx, etc.

17 Beware: E(x^2) \neq E(x)^2, E(x) \neq E(\sqrt{x})^2, and so forth …

18 Width of a distribution. Here’s a perfectly sensible way to define the width of a distribution: W_50, the width of the range that contains the central 50% of the probability (with 25% of the probability on either side of it) … it’s not used much, though. [Figure: p(x) vs. x with W_50 marked.]

19 Width of a distribution, another way: multiply p(x) by the parabola [x - E(x)]^2 and integrate. [Figure: p(x) vs. x with the parabola [x - E(x)]^2 centered at E(x).]

20 Variance: \sigma^2 = \int_{-\infty}^{+\infty} [x - E(x)]^2\, p(x)\, dx. [Figure: p(x), the parabola [x - E(x)]^2, and their product [x - E(x)]^2 p(x); compute the total area under the product.] The idea is that if the distribution is narrow, then most of the probability lines up with the low spot of the parabola; but if it is wide, then some of the probability lines up with the high parts of the parabola.
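The expectation and variance integrals can be approximated the same way as the probability integral above. Here is a sketch using a discretized p(x) on a grid (the skewed example density is invented purely for illustration); it also makes the "Beware" point from slide 17 concrete, since E(x^2) and E(x)^2 differ by exactly the variance:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 2001)
dx = x[1] - x[0]
p = x * np.exp(-x)                  # an arbitrary skewed density shape
p /= np.sum(p * dx)                 # normalize so that the total area is 1

Ex  = np.sum(x * p * dx)            # E(x)
Ex2 = np.sum(x**2 * p * dx)         # E(x^2)
var = np.sum((x - Ex)**2 * p * dx)  # variance = integral of [x - E(x)]^2 p(x) dx

print(Ex, Ex2, var)
print(Ex2 - Ex**2)                  # equals var, so E(x^2) != E(x)^2 whenever var > 0
```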

21 \sqrt{variance} = \sigma is a measure of width … we don’t immediately know its relationship to area, though. [Figure: p(x) vs. x with a width of about \sigma marked around E(x).]

22 The Gaussian or normal distribution: p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{ -(x - \bar{x})^2 / 2\sigma^2 \}, where \bar{x} is the expected value and \sigma^2 is the variance. Memorize me!
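A minimal NumPy sketch (function and variable names are mine) that codes up this formula and checks its normalization, expected value, and variance on a grid, along with the "95% within 2σ" property quoted on slide 24:

```python
import numpy as np

def normal_pdf(x, xbar, sigma):
    # The Gaussian density from the slide
    return np.exp(-(x - xbar)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

xbar, sigma = 3.0, 0.5
x = np.linspace(xbar - 8 * sigma, xbar + 8 * sigma, 4001)
dx = x[1] - x[0]
p = normal_pdf(x, xbar, sigma)

print(np.sum(p * dx))                   # ~1: total area
print(np.sum(x * p * dx))               # ~3.0: expected value
print(np.sum((x - xbar)**2 * p * dx))   # ~0.25: sigma^2
inside = np.abs(x - xbar) <= 2 * sigma
print(np.sum(p[inside] * dx))           # ~0.95: probability within 2 sigma of xbar
```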

23 Examples of normal distributions. [Figure: p(x) for \bar{x} = 1, \sigma = 1 and for \bar{x} = 3, \sigma = 0.5.]

24 Properties of the normal distribution: Expectation = Median = Mode = \bar{x}; 95% of the probability lies within 2\sigma of the expected value. [Figure: p(x) with the interval from \bar{x} - 2\sigma to \bar{x} + 2\sigma shaded, containing 95% of the area.]

25 Functions of a random variable: any function of a random variable is itself a random variable.

26 If x has distribution p(x), then y(x) has distribution p(y) = p[x(y)] |dx/dy|.

27 This follows from the rule for transforming integrals: 1 = \int_{x_1}^{x_2} p(x)\, dx = \int_{y_1}^{y_2} p[x(y)]\, (dx/dy)\, dy, with the limits chosen so that y_1 = y(x_1), etc.

28 Example: let x have a uniform (white) distribution on [0,1], so p(x) = 1 for 0 \le x \le 1: uniform probability that x is anywhere between 0 and 1. [Figure: p(x) = 1 on the interval from 0 to 1.]

29 Let y = x^2. Then x = y^{1/2}, with y(x=0) = 0 and y(x=1) = 1, dx/dy = \frac{1}{2} y^{-1/2}, and p[x(y)] = 1, so p(y) = \frac{1}{2} y^{-1/2} on the interval [0,1].

30 Numerical test: histograms of 1000 random numbers. [Figure, left: histogram of x, generated with Excel’s rand() function, which claims to be based upon a uniform distribution; plausible that it’s uniform. Right: histogram of x^2, generated by squaring the x’s from above; plausible that it’s proportional to 1/\sqrt{y}.]
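The same numerical test can be run in Python rather than Excel. A sketch (the seed, sample size, and bin count are arbitrary choices of mine) that compares the histogram of y = x^2 with the predicted density \frac{1}{2} y^{-1/2}:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)                 # uniform on [0, 1], standing in for Excel's rand()
y = x**2

counts, edges = np.histogram(y, bins=10, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = 0.5 / np.sqrt(centers)   # p(y) = (1/2) y^(-1/2), from slide 29

for c, obs, pred in zip(centers, counts, predicted):
    print(f"y = {c:.2f}   observed = {obs:.2f}   predicted = {pred:.2f}")
```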

31 Multivariate distributions

32 Example: Liberty Island is inhabited by both pigeons and seagulls. 40% of the birds are pigeons and 60% of the birds are gulls. 50% of pigeons are white and 50% are tan. 100% of gulls are white.

33 Two variables: species s takes two values, pigeon p and gull g; color c takes two values, white w and tan t. Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, and 0 are tan gulls.

34 What is the probability that a (random) bird has species s and color c? Note: the sum of all boxes is 100%.

  P(s,c)    c = w   c = t
  s = p      20%     20%
  s = g      60%      0%

35 This is called the Joint Probability and is written P(s,c)

36 Two continuous variables, say x_1 and x_2, have a joint probability distribution, written p(x_1, x_2), with \int\int p(x_1, x_2)\, dx_1 dx_2 = 1.

37 The probability that x_1 is between x_1 and x_1 + dx_1 and x_2 is between x_2 and x_2 + dx_2 is p(x_1, x_2)\, dx_1 dx_2, so \int\int p(x_1, x_2)\, dx_1 dx_2 = 1.

38 You would contour a joint probability distribution, and it would look something like this: [Figure: contour plot of p(x_1, x_2) in the (x_1, x_2) plane.]

39 What is the probability that a bird has color c? Start with P(s,c) and sum the columns to get P(c). (Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

  P(s,c)    c = w   c = t
  s = p      20%     20%
  s = g      60%      0%
  P(c)       80%     20%

40 What is the probability that a bird has species s? Start with P(s,c) and sum the rows to get P(s).

  P(s,c)    c = w   c = t   P(s)
  s = p      20%     20%     40%
  s = g      60%      0%     60%
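The column and row sums are easy to express in NumPy; a sketch with the joint probabilities stored as a 2x2 array (the layout, rows = species and columns = color, is my own choice):

```python
import numpy as np

P_sc = np.array([[0.20, 0.20],    # pigeon: white, tan
                 [0.60, 0.00]])   # gull:   white, tan

P_c = P_sc.sum(axis=0)   # sum the columns -> P(c) = [0.80, 0.20]
P_s = P_sc.sum(axis=1)   # sum the rows    -> P(s) = [0.40, 0.60]
print(P_c, P_s)
```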

41 These operations make sense with distributions, too: p(x_1) = \int p(x_1, x_2)\, dx_2 is the distribution of x_1 (irrespective of x_2), and p(x_2) = \int p(x_1, x_2)\, dx_1 is the distribution of x_2 (irrespective of x_1). [Figure: the joint distribution p(x_1, x_2) and its two marginal distributions p(x_1) and p(x_2).]

42 Given that a bird is species s, what is the probability that it has color c? Note: all rows sum to 100%. (Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, 0 are tan gulls.)

  P(c|s)    c = w   c = t
  s = p      50%     50%
  s = g     100%      0%

43 This is called the Conditional Probability of c given s and is written P(c|s) similarly …

44 Given that a bird is color c, what is the probability that it has species s? Note: all columns sum to 100%. So 25% of white birds are pigeons.

  P(s|c)    c = w   c = t
  s = p      25%    100%
  s = g      75%      0%

45 This is called the Conditional Probability of s given c and is written P(s|c)

46 Beware! P(c|s) \neq P(s|c):

  P(c|s)    c = w   c = t
  s = p      50%     50%
  s = g     100%      0%

  P(s|c)    c = w   c = t
  s = p      25%    100%
  s = g      75%      0%

47 Note P(s,c) = P(s|c) P(c): each column of the P(s|c) table is multiplied by the corresponding entry of P(c) = (80%, 20%). For example, 25% of 80% is 20%, the white-pigeon entry of P(s,c).

48 And P(s,c) = P(c|s) P(s): each row of the P(c|s) table is multiplied by the corresponding entry of P(s) = (40%, 60%). For example, 50% of 40% is 20%, again the white-pigeon entry of P(s,c).

49 and if P(s,c) = P(s|c) P(c) = P(c|s) P(s) then P(s|c) = P(c|s) P(s) / P(c) and P(c|s) = P(s|c) P(c) / P(s) … which is called Bayes Theorem
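A quick numerical check of these relations on the bird table (a sketch; the array layout is the same one used above):

```python
import numpy as np

P_sc = np.array([[0.20, 0.20],    # joint P(s,c): rows = (pigeon, gull), cols = (white, tan)
                 [0.60, 0.00]])

P_c = P_sc.sum(axis=0)            # P(c)
P_s = P_sc.sum(axis=1)            # P(s)

P_s_given_c = P_sc / P_c          # P(s|c): divide each column by P(c)
P_c_given_s = P_sc / P_s[:, None] # P(c|s): divide each row by P(s)

# Bayes Theorem: P(s|c) = P(c|s) P(s) / P(c)
bayes = P_c_given_s * P_s[:, None] / P_c
print(np.allclose(bayes, P_s_given_c))   # True
```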

50 Why Bayes Theorem is important: consider the problem of fitting a straight line to data, d, where the intercept and slope are given by the vector m. If we guess m and use it to predict d, we are doing something like P(d|m). But if we observe d and use it to estimate m, then we are doing something like P(m|d). Bayes Theorem provides a framework for relating what we do to get P(d|m) to what we do to get P(m|d).

51 Expectation, Variance, and Covariance of a multivariate distribution

52 The expected values of x_1 and x_2 are calculated in a fashion analogous to the one-variable case: E(x_1) = \int\int x_1\, p(x_1, x_2)\, dx_1 dx_2 and E(x_2) = \int\int x_2\, p(x_1, x_2)\, dx_1 dx_2. Note that E(x_1) = \int\int x_1\, p(x_1, x_2)\, dx_1 dx_2 = \int x_1 [\int p(x_1, x_2)\, dx_2]\, dx_1 = \int x_1\, p(x_1)\, dx_1, so the formula really is just the expectation of a one-variable distribution. [Figure: contour plot of p(x_1, x_2).]

53 The variances of x_1 and x_2 are calculated in a fashion analogous to the one-variable case, too: \sigma_{x_1}^2 = \int\int (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1 dx_2, with \bar{x}_1 = E(x_1), and similarly for \sigma_{x_2}^2. Note, once again, \sigma_{x_1}^2 = \int\int (x_1 - \bar{x}_1)^2\, p(x_1, x_2)\, dx_1 dx_2 = \int (x_1 - \bar{x}_1)^2 [\int p(x_1, x_2)\, dx_2]\, dx_1 = \int (x_1 - \bar{x}_1)^2\, p(x_1)\, dx_1, so the formula really is just the variance of a one-variable distribution. [Figure: contour plot of p(x_1, x_2).]

54 Note that in this distribution, if x_1 is bigger than \bar{x}_1, then x_2 is bigger than \bar{x}_2, and if x_1 is smaller than \bar{x}_1, then x_2 is smaller than \bar{x}_2. This is a positive correlation. [Figure: contours of p(x_1, x_2) elongated along a line of positive slope, with the expected value (\bar{x}_1, \bar{x}_2) marked.]

55 Conversely, in this distribution, if x_1 is bigger than \bar{x}_1, then x_2 is smaller than \bar{x}_2, and if x_1 is smaller than \bar{x}_1, then x_2 is bigger than \bar{x}_2. This is a negative correlation. [Figure: contours of p(x_1, x_2) elongated along a line of negative slope, with the expected value (\bar{x}_1, \bar{x}_2) marked.]

56 This correlation can be quantified by multiplying the distribution by a four-quadrant function and then integrating. The function (x_1 - \bar{x}_1)(x_2 - \bar{x}_2), which is positive in two quadrants and negative in the other two, works fine: cov(x_1, x_2) = \int\int (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\, p(x_1, x_2)\, dx_1 dx_2. This is called the “covariance”. [Figure: the four quadrants about (\bar{x}_1, \bar{x}_2), labeled + and -.]

57 Note that the vector \bar{x} with elements \bar{x}_i = E(x_i) = \int\int x_i\, p(x_1, x_2)\, dx_1 dx_2 is the expectation of x, and the matrix C_x with elements C^x_{ij} = \int\int (x_i - \bar{x}_i)(x_j - \bar{x}_j)\, p(x_1, x_2)\, dx_1 dx_2 has diagonal elements equal to the variance of x_i, C^x_{ii} = \sigma_{x_i}^2, and off-diagonal elements equal to the covariance of x_i and x_j, C^x_{ij} = cov(x_i, x_j).
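In practice \bar{x} and C_x are usually estimated from many realizations of x. A sketch using synthetic correlated data (the particular mean, covariance, and sample size are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = np.array([1.0, 3.0])
true_cov  = np.array([[1.0, 0.8],
                      [0.8, 2.0]])

x = rng.multivariate_normal(true_mean, true_cov, size=10000)   # realizations, shape (10000, 2)

xbar = x.mean(axis=0)            # estimate of the expectation vector
Cx   = np.cov(x, rowvar=False)   # estimate of the covariance matrix C_x
print(xbar)                      # close to [1, 3]
print(Cx)                        # diagonal ~ variances, off-diagonal ~ covariance 0.8
```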

58 \bar{x} gives the “center” of the multivariate distribution, and C_x summarizes its “width” and “correlatedness”. Together they summarize a lot, but not everything, about a multivariate distribution.

59 Functions of a set of random variables: a set of N random variables arranged in a vector, x.

60 Given y(x), do you remember how to transform the integral \int \cdots \int p(x)\, d^N x = \int \cdots \int\ ?\ d^N y

61 Given y(x): \int \cdots \int p(x)\, d^N x = \int \cdots \int p[x(y)]\, |dx/dy|\, d^N y, where |dx/dy| is the Jacobian determinant, that is, the determinant of the matrix J_{ij} whose elements are dx_i/dy_j.

62 But here’s something that’s EASIER … Suppose y(x) is a linear function, y = Mx. Then we can easily calculate the expectation of y: \bar{y}_i = E(y_i) = \int \cdots \int y_i\, p(x_1 … x_N)\, dx_1 … dx_N = \int \cdots \int \sum_j M_{ij} x_j\, p(x_1 … x_N)\, dx_1 … dx_N = \sum_j M_{ij} \int \cdots \int x_j\, p(x_1 … x_N)\, dx_1 … dx_N = \sum_j M_{ij} E(x_j) = \sum_j M_{ij} \bar{x}_j. So \bar{y} = M\bar{x}.

63 And we can easily calculate the covariance: C^y_{ij} = \int \cdots \int (y_i - \bar{y}_i)(y_j - \bar{y}_j)\, p(x_1 … x_N)\, dx_1 … dx_N = \int \cdots \int \sum_p M_{ip}(x_p - \bar{x}_p) \sum_q M_{jq}(x_q - \bar{x}_q)\, p(x_1 … x_N)\, dx_1 … dx_N = \sum_p M_{ip} \sum_q M_{jq} \int \cdots \int (x_p - \bar{x}_p)(x_q - \bar{x}_q)\, p(x_1 … x_N)\, dx_1 … dx_N = \sum_p M_{ip} \sum_q M_{jq} C^x_{pq}. So C_y = M C_x M^T. Memorize!
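A Monte Carlo sketch of both rules (the numbers in xbar, Cx, and M are arbitrary choices of mine; as slide 64 notes, the rules do not depend on x being Gaussian, that choice is just a convenient way to generate samples):

```python
import numpy as np

rng = np.random.default_rng(2)
xbar = np.array([1.0, 2.0])
Cx   = np.array([[2.0, 0.5],
                 [0.5, 1.0]])
M    = np.array([[1.0,  1.0],
                 [2.0, -1.0]])

x = rng.multivariate_normal(xbar, Cx, size=200000)   # realizations of x, shape (N, 2)
y = x @ M.T                                          # y = M x, applied to every realization

print(y.mean(axis=0), M @ xbar)          # rule for means: ybar = M xbar
print(np.cov(y, rowvar=False))           # sample covariance of y ...
print(M @ Cx @ M.T)                      # ... agrees with M Cx M^T (error propagation)
```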

64 Note that these rules work regardless of the distribution of x: if y is linearly related to x, y = Mx, then \bar{y} = M\bar{x} (rule for means) and C_y = M C_x M^T (rule for propagating error).

