Introduction to Distributions and Probability Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research.

1 Introduction to Distributions and Probability Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research

2 Overview: Distributions; History of probability; Definitions of probability; Random variable; Probability density function; Normal, Binomial and Poisson distributions

3 Introduction to Probability Density Functions: Normal distribություն placeholder

4 Early use of Normal Distribution: Gauss was a German mathematician who solved the mystery of where Ceres would appear after it disappeared behind the Sun. He assumed the errors formed a Normal distribution and managed to accurately predict the orbit of Ceres.

5 What is the relationship between the Normal or Gaussian distribution and probability?

6 Probability “The probable is what usually happens” Aristotle “I cannot believe that God plays dice with the cosmos” Albert Einstein

7 Origins of Probability: Early interest in permutations in Vedic literature (400 BC). Distinguished origins in betting and gambling! Pascal and Fermat studied division of stakes in gambling (1654). Enlightenment: seen as helping public policy, social equity. Astronomy: Gauss (1801). Social and genetic: Galton (1885). Experimental design: Fisher (1936).

8 Types of Probability: Two basic definitions. 1) Frequentist (classical): proportion of times an event occurs in a long series of ‘trials’. 2) Subjectivist (Bayesian): strength of belief in an event happening.

9 Frequentists vs. Bayesians: Two entrenched camps. Scientists tend to use the frequentist approach. Bayesians gaining ground. Most scientists use frequentist methods but incorrectly interpret results in a Bayesian way!

10 Frequentists: Consider tossing a fair coin. In any trial, the event may be a ‘head’ or ‘tail’, i.e. binary. Repeated tossing gives a series of ‘events’. In the long run, prob of heads = 0.5. Example series: THTTHHHHTHHHTHHHTTHTTTHHTTHTTHHHTTTHHTHHHTTTTTHHH (proportions of heads marked on the slide: 0.6, 0.56, 0.52).
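
The long-run idea can be illustrated with a small simulation. This is only a sketch: the function name, the seed and the number of tosses are arbitrary choices made for illustration, not part of the original slides.

```python
import random


def running_proportion_of_heads(n_tosses, seed=0):
    """Simulate n_tosses fair-coin tosses; return the proportion of heads after each toss."""
    rng = random.Random(seed)  # arbitrary seed, used only so the run is reproducible
    heads = 0
    proportions = []
    for toss in range(1, n_tosses + 1):
        heads += rng.random() < 0.5  # True counts as 1 head, False as 0
        proportions.append(heads / toss)
    return proportions


props = running_proportion_of_heads(100_000)
for n in (10, 100, 1_000, 100_000):
    print(n, round(props[n - 1], 3))
```

Early proportions can wander well away from 0.5; the proportion settles close to 0.5 only as the series gets long.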

11 Frequentist Probability: Note the difference between ‘long run’ probability and an individual trial. In an individual trial a head either occurs (X=1) or does not occur (X=0). A patient either survives or dies following an MI. Prob of dying after MI ≈ 30%, based on a previous long series from a population of individuals who experienced MI.

12 Subjective Probability: Based on strength of belief. But more akin to the thinking of a clinician making a diagnosis. Faced with a patient with chest pain, based on past experience, the clinician believes prob of heart disease is 20%. Person tossing a coin believes prob of head is 1/2.

13 Comparison of definitions of Probability: Problems of subjective probability. Probability for the same patient can vary even with the same clinician. A person can believe prob of head is 0.1 even if it is a fair coin. Subjectivists argue they are more realistic. This course sticks to ‘frequentist’ and ‘model-based’ methods of probability.

14 Random Variable: Consider rolling 2 dice, where we want to summarise the probabilities of all possible outcomes. We call the outcome a random variable X, which in this case can have any whole-number value from 2 to 12. Enumerate all probabilities in sample space S: P(2) = 1/6 x 1/6 = 1/36, P(3) = 2/36, P(4) = 3/36, etc.

15 Probability Density Function for rolling two dice
Total:        2     3     4     5     6     7     8     9     10    11    12
Probability:  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

16 Probability Density Function for rolling two dice (chart: totals 2 to 12 on the horizontal axis, probabilities 1/36 to 6/36 on the vertical axis). What is the probability of getting 12? Answer: 1/36. What is the probability of getting more than 8? Answer: 10/36.
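
The two answers above can be checked by enumerating the sample space directly. The sketch below assumes two fair six-sided dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely (die 1, die 2) pairs and tally each total.
counts = {}
for d1, d2 in product(range(1, 7), repeat=2):
    counts[d1 + d2] = counts.get(d1 + d2, 0) + 1

# Probability of each total from 2 to 12, kept as exact fractions.
pmf = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

p_12 = pmf[12]
p_more_than_8 = sum(p for total, p in pmf.items() if total > 8)
print(p_12, p_more_than_8)
```

Note that `Fraction` reduces to lowest terms, so 10/36 is printed as 5/18.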

17 Probability Density Function for continuous variable (chart: the two-dice distribution again, totals 2 to 12 against probabilities 1/36 to 6/36)

18 Consider distribution of weight in kg; all values are possible, not just discrete ones (chart: probability on the vertical axis against weight in kilograms, 20 to 120, on the horizontal axis)

19 Probability Density Function in SPSS: Use Analyze / Descriptive Statistics / Frequencies, select no frequency table, and use the Charts box as shown on the slide.

20 Probability Density Function in SPSS Data from ‘LDL Data.sav’ of baseline LDL cholesterol

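21 Normal Distribution: Note that a Normal or Gaussian curve is defined by two parameters: mean µ and standard deviation σ. It is often written as N(µ, σ²), with the variance in the second position, as in the later slides. Hence any Normal distribution has the mathematical form $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. This has no closed-form integral, so the area under the curve is obtained by numerical integration and tabulated!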
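
As a rough illustration of obtaining the area by numerical integration, the sketch below applies the trapezoidal rule to the standard Normal density formula. The function names, the step count and the choice of µ = 0, σ = 1 are assumptions made for the example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_under_pdf(a, b, mu, sigma, steps=10_000):
    """Trapezoidal-rule approximation of the area under the curve between a and b."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * width, mu, sigma)
    return total * width


mu, sigma = 0.0, 1.0  # illustrative parameters
total_area = area_under_pdf(mu - 8 * sigma, mu + 8 * sigma, mu, sigma)
print(round(total_area, 6))
```

Integrating over 8 standard deviations either side of the mean captures essentially all of the area, which should come out very close to 1.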

22 Normal Distribution: As noted earlier, the curve is symmetrical about the mean, so p(x > mean) = 0.5 or 50%, and p(x < mean) = 0.5 or 50%. And p(a < x < b) = p(x < b) - p(x < a).

23 Normal Distribution and Probabilities: So we now have a way of working out the probability of any range of values of a variable IF a Normal distribution is a reasonable fit to the data: p(a < x < b) = p(x < b) - p(x < a), which is the area under the curve between a and b.
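
A minimal sketch of this calculation using Python's `statistics.NormalDist`. The mean, standard deviation and the limits a and b are hypothetical values chosen for illustration; they are not taken from 'LDL Data.sav'.

```python
from statistics import NormalDist

# Hypothetical Normal model (mean 3.5, SD 1.0); not estimated from real data.
model = NormalDist(mu=3.5, sigma=1.0)

a, b = 3.0, 4.5
# Area under the curve between a and b = P(x < b) - P(x < a).
p_between = model.cdf(b) - model.cdf(a)
print(round(p_between, 3))
```

The same two-step subtraction works for any a and b, provided the Normal model is a reasonable fit.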

24 Normal Distribution: Most of the area lies between +1 and -1 SD (68%). The large majority lies between +2 and -2 SDs (95%).
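
The areas within 1 and 2 standard deviations can be checked numerically. This sketch uses the standard Normal (mean 0, SD 1); the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal: mean 0, SD 1

within_1_sd = z.cdf(1) - z.cdf(-1)
within_2_sd = z.cdf(2) - z.cdf(-2)
within_1_96_sd = z.cdf(1.96) - z.cdf(-1.96)

print(round(within_1_sd, 4), round(within_2_sd, 4), round(within_1_96_sd, 4))
```

The area within 1 SD is about 68%. The familiar 95% figure corresponds more exactly to 1.96 SDs; 2 SDs covers slightly more.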

25 Probability Density Function (PDF) = Normal Distribution

26 How well does my data fit a Normal Distribution? Note median and mean are virtually the same. Skewness = 0.039, close to zero. Skewness is a measure of symmetry (0 = perfect symmetry). Eyeball test: fitted Normal curve looks good!
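
For readers who want to see what a skewness figure measures, here is a sketch of the simple moment-based coefficient. The two small data sets are invented for illustration, and statistical packages may apply a small-sample adjustment, so their reported values can differ slightly from this formula.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (0 for a perfectly symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5, 6, 7]        # invented, symmetric about 4
right_tailed = [1, 1, 2, 2, 3, 4, 20]    # invented, one large value in the right tail

for name, data in (("symmetric", symmetric), ("right_tailed", right_tailed)):
    print(name, statistics.mean(data), statistics.median(data),
          round(sample_skewness(data), 3))
```

In the symmetric sample the mean equals the median and the skewness is 0; in the right-tailed sample the mean is pulled above the median and the skewness is clearly positive.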

27 Try Q-Q plot in Analyze / Descriptive Statistics / Q-Q plot. The plot compares the expected Normal distribution with the real data, and if the data lie on the line y = x then the Normal distribution is a good fit. Note: still an eyeball test! Is this a good fit?
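
The points behind a Q-Q plot can be sketched as follows. The data here are simulated from a Normal distribution purely for illustration, and the plotting positions (i - 0.5) / n are one common convention; software packages may use a slightly different one.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (expected Normal value, observed value) pairs for a Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))  # Normal with the sample mean and SD
    # Assumed plotting positions: (i - 0.5) / n for the i-th smallest observation.
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(ordered, start=1)]


rng = random.Random(1)  # arbitrary seed for reproducibility
sample = [rng.gauss(3.5, 1.0) for _ in range(500)]  # simulated, not real data
points = qq_points(sample)

for expected, observed in points[::100]:
    print(round(expected, 2), round(observed, 2))
```

If the Normal distribution fits well, the expected and observed values in each pair are close, so the plotted points fall near the line y = x.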

28 I used to be Normal until I discovered Kolmogorov-Smirnov! The eyeball test indicates the distribution is approximately Normal, but the K-S test is significant, indicating a discrepancy compared to Normal. WARNING: DO NOT RELY ON THIS TEST

29 Consider the distribution of survival times following surgery for colorectal cancer. Note median = 835 days and mean = 848. Skewness = 2.081, very skewed (> 1.0). Strong tail to right! Approximately Normal?

30 Try a log transformation for right (positively) skewed data? Better, but now slightly skewed to the left!
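
The effect of a log transformation on right-skewed data can be sketched with simulated values. The "survival times" below are drawn from a log-normal distribution with arbitrary parameters, so their logs are Normal by construction; real data will rarely behave this neatly.

```python
import math
import random


def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (0 for a perfectly symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(2)  # arbitrary seed for reproducibility
# Simulated right-skewed times in days (hypothetical parameters, not real patients).
times = [rng.lognormvariate(6.5, 0.8) for _ in range(2000)]
log_times = [math.log(t) for t in times]

skew_raw = sample_skewness(times)
skew_log = sample_skewness(log_times)
print(round(skew_raw, 2), round(skew_log, 2))
```

The raw values have a strong right tail, while the log-transformed values have skewness close to zero.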

31 Examples of skewed distributions in Health Research: Discrete random variables: hospital admissions, cigarettes smoked, alcohol consumption, costs. Continuous random variables: BMI, cholesterol, BP.

32 The Binomial Distribution: ‘Binomial’ means ‘two numbers’. Outcomes of health research are often measured by whether they have occurred or not. For example, recovered from disease, admitted to hospital, died, etc. May be modelled by assuming that the number of events in n trials has a binomial distribution with a fixed probability of event p.

33 The Binomial Distribution: Based on the work of Jakob Bernoulli, a Swiss mathematician. Refused a church appointment and instead studied mathematics. Early use was for games of chance but now used in every human endeavour. When n = 1 this is called a Bernoulli trial. Binomial distribution is the distribution for a series of Bernoulli trials.

34 The Binomial Distribution: Binomial distribution is written as B(n, p), where n is the total number of trials and p = prob of an event in each trial. The distribution shown on the slide is a Binomial distribution with p = 0.25 and n = 20.
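
The probabilities for the p = 0.25, n = 20 example can be computed from the standard binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The sketch below is illustrative; the function and variable names are not from the slides.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials, each with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.25
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

most_likely = max(range(n + 1), key=lambda k: pmf[k])
print(most_likely, round(pmf[most_likely], 4))
for k in range(0, 11):
    print(k, round(pmf[k], 4))
```

The most likely count is 5 events, which matches the mean np = 5, and the probabilities fall away on either side.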

35 The Binomial Distribution: Binomial distributions are used for binary factors and so are used to assess percentages or proportions. Utilised in cross-tabulation and logistic regression. Note that as n gets larger, and especially when p is close to 0.5, the Binomial is approximately equal to the Normal distribution: B(n, p) ~ N(np, np(1-p)).
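
A quick numerical check of the Normal approximation, using illustrative values n = 100 and p = 0.5 and a continuity correction of 0.5 (a standard refinement that is an addition here, not something stated on the slide). Note that `NormalDist` takes the standard deviation, so the variance np(1-p) is square-rooted.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.5  # illustrative values

# Exact binomial probability of between 40 and 60 events inclusive.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(40, 61))

# Normal approximation with mean np and variance np(1-p).
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
approx = approx_dist.cdf(60.5) - approx_dist.cdf(39.5)  # continuity correction of 0.5

print(round(exact, 4), round(approx, 4))
```

With these values the two probabilities agree closely; the agreement is poorer when n is small or p is near 0 or 1.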

36 The Poisson Distribution: Poisson distribution (1838), named after its inventor Siméon Poisson, who was a French mathematician. He found that if we have a rare event (i.e. p is small) and we know the expected or mean (λ or µ) number of occurrences, the probabilities of k = 0, 1, 2... events are given by: $P(X = k) = \frac{\mu^{k} e^{-\mu}}{k!}$
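
The standard Poisson formula, P(X = k) = µ^k e^(-µ) / k!, can be evaluated directly. In this sketch the mean of 2 events is a hypothetical value chosen only for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, mu):
    """Probability of exactly k events when the mean number of events is mu."""
    return mu**k * exp(-mu) / factorial(k)


mu = 2.0  # hypothetical mean number of events
for k in range(6):
    print(k, round(poisson_pmf(k, mu), 4))
```

Only the mean is needed; there is no separate n or p in the formula.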

37 The Poisson Distribution: Note similarity to Binomial. In fact, when p is small and n is large, B(n, p) ~ P(µ = np). Also, for large values of µ: P(µ) ~ N(µ, µ). Hence if n and p are not known, could use Poisson instead.
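
The closeness of the Binomial and Poisson for a rare event can be sketched numerically. The values n = 1000 and p = 0.003 are arbitrary illustrations of "n large, p small".

```python
from math import comb, exp, factorial

n, p = 1000, 0.003  # illustrative: many trials, rare event
mu = n * p          # Poisson mean matched to the binomial mean


def binomial_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k):
    return mu**k * exp(-mu) / factorial(k)


# Largest difference between the two sets of probabilities for 0 to 20 events.
max_diff = max(abs(binomial_pmf(k) - poisson_pmf(k)) for k in range(21))

for k in range(7):
    print(k, round(binomial_pmf(k), 4), round(poisson_pmf(k), 4))
print(max_diff)
```

The two columns of probabilities are nearly identical, which is why the Poisson can stand in when only the mean number of events is known.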

38 The Poisson Distribution In health research often used to model the number of events assumed to be random: Number of hip replacement failures, Number of cases of C. diff infection, Diagnoses of leukaemia around nuclear power stations, Number of H1N1 cases in Scotland, Etc.

39 Summary: Many of the variables measured in Health Research form distributions which approximate to common distributions with known mathematical properties: Normal, Poisson, Binomial, etc. Note a relationship for all, centred around the exponential function (where e ≈ 2.718). All belong to the Exponential Family of distributions. These probability distributions are critical to applying statistical methods.

40 SPSS Practical: Read in data file ‘LDL Data.sav’. Consider adherence to statins, baseline LDL, min Chol achieved, BMI, duration of statin use. Assess distributions for normality. If non-normal, consider a transformation. Try to carry out Q-Q plots.

