1
Copyright (c) Bani Mallick
Lecture 4, Stat 651
2
Topics in Lecture #4
Probability
The bell-shaped (normal) curve
Normal probability plots (the q-q plot) to check for normality of continuous data
Use of Table 1 in the back of the book
3
Topics in Lecture #4
Normal probability calculations
Data transformations
Sampling distributions: sample means are random variables!
Standard error of the sample mean
Central Limit Theorem
A simple confidence interval
4
Book Sections Covered in Lecture #4
Chapter 4.10, in detail
Chapter 4.11 (read on your own)
Chapter 4.12, in detail
Chapter 5.1
Chapter 5.2
5
Lecture 3 Review
Box plots are probably the best way to compare populations graphically.
You can detect shifts and changes in variation.
They also identify outliers.
6
Lecture 3 Review
q-q plots are a simple way to understand whether the data are approximately bell-shaped.
7
Lecture 3 Review
q-q plots are a simple way to understand whether the data are approximately bell-shaped.
If the points fall roughly on a straight line, then treating the population relative frequency histogram as normal is not too badly off.
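As a rough illustration of what a q-q plot compares, the sketch below pairs each sorted observation with the standard normal quantile at the same rank. The plotting position (i + 0.5)/n and the `ages` values are assumptions made for illustration; they are not the lecture's data, and statistical packages may use a slightly different plotting position.

```python
from statistics import NormalDist


def qq_points(sample):
    """Pair each sorted observation with a standard normal quantile.

    Uses the plotting position (i + 0.5) / n for i = 0..n-1, which is
    one common convention (an assumption here, not the book's rule).
    """
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i + 0.5) / n), x)
        for i, x in enumerate(ordered)
    ]


# Made-up ages, for illustration only.
ages = [31, 34, 36, 38, 39, 40, 41, 43, 45, 49]
for theoretical, observed in qq_points(ages):
    print(f"{theoretical:6.2f}  {observed}")
```

If the printed pairs would fall close to a straight line when plotted, the normal model is a reasonable working assumption for the data.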
8
q-q plot for the healthy women
9
Lecture 3 Review
For bell-shaped populations, we have empirical rules.
Approximately 68% of the population lies within 1 population standard deviation of the population mean, 90% within 1.645 standard deviations, and 95% within 1.96 standard deviations.
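The empirical rules can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist` in place of Table 1; the function name `coverage_within` is introduced here only for illustration.

```python
from statistics import NormalDist


def coverage_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1.0, 1.645, 1.96):
    print(f"within {k} standard deviations: {coverage_within(k):.4f}")
```

The three printed fractions should be close to 0.68, 0.90 and 0.95.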
10
Lecture 3 Review
In many of our examples, we have seen that there look to be differences among populations.
How can we tell if the differences are real?
We will say that populations are different if the differences we observe are more than can be expected by sample-to-sample variability.
11
Lecture 3 Review
A random variable is any outcome (qualitative or numerical) of an experiment involving random sampling from a population.
The idea of a model is to write down a formula for the population histogram as a function of one or two parameters, which are estimated from the data.
If you know the parameters of the model, then you know everything about probabilities in that population.
12
Using the Normal Model
The entire point of the normal model is to make probability statements.
In practice, we estimate the population mean by the sample mean.
We estimate the population standard deviation by the sample standard deviation.
Then we estimate probabilities by pretending the sample quantities equal the population ones.
13
Various Cases
Suppose we want to know what % of a population lies below a specified value, c.
We write this by asking: what is Pr(X < c)?
The value c is any arbitrary value, e.g., 6.
X is any random variable with a population mean $\mu$ and a population standard deviation $\sigma$.
14
Pr(X < c) for Normal Populations
Compute the z-score, $z = (c - \mu)/\sigma$.
Look up the value for z in Table 1, page 1091 (white board explanation).
15
Mechanics
NHANES: suppose healthy women’s ages are normally distributed with mean = 40 and standard deviation = 6.
What is the chance that a randomly selected person from this population is aged c = 43.3 or less?
We write this in symbols as Pr(X < 43.3).
16
Mechanics
$\mu = 40$, $\sigma = 6$
Pr(X < 43.3) is what we want.
$z = (43.3 - \mu)/\sigma = 0.55$ is the z-score.
Look up in Table 1: the value 0.55 is on page 1092. Find 0.5 in the first column and 0.05 in the first row (they add to 0.55), and read the entry where that row and column meet.
Pr(X < 43.3) = 0.7088
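The same lookup can be sketched in code, with the standard normal CDF from Python's `statistics.NormalDist` standing in for Table 1. The helper names `z_score` and `prob_below` are hypothetical, introduced only for this example.

```python
from statistics import NormalDist


def z_score(c, mu, sigma):
    """Standardize c: how many standard deviations c lies from the mean."""
    return (c - mu) / sigma


def prob_below(c, mu, sigma):
    """Pr(X < c) for a normal population with mean mu and standard deviation sigma."""
    return NormalDist().cdf(z_score(c, mu, sigma))


z = z_score(43.3, 40, 6)
p = prob_below(43.3, 40, 6)
print(f"z = {z:.2f}, Pr(X < 43.3) = {p:.4f}")
```

With the slide's values, z rounds to 0.55 and the probability agrees with the table entry 0.7088 to four decimals.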
17
Various Cases
Suppose we want to know what % of a population lies above a specified value, c.
We write this by asking: what is Pr(X > c)?
The value c is any arbitrary value, e.g., 6.
X is any random variable with a population mean $\mu$ and a population standard deviation $\sigma$.
18
Pr(X > c) for Normal Populations
This is simply 1 - Pr(X <= c).
Compute the z-score $(c - \mu)/\sigma$.
Look up the value for z in Table 1.
Subtract this value from 1.0.
19
Mechanics
$\mu = 40$, $\sigma = 6$
Chance that a randomly selected person from this population is aged 46 or more: Pr(X > 46)
$z = (46 - \mu)/\sigma = 1$
Look up in Table 1 for 1.00: get 0.8413.
Because you are asking for > 46, subtract from 1 to get Pr(X > 46) = 1 - 0.8413 = 0.1587.
20
Mechanics
$\mu = 40$, $\sigma = 6$
Chance that a randomly selected person from this population is aged 46 or less: Pr(X <= 46)
$z = (46 - \mu)/\sigma = 1$
Look up in Table 1: chance is 0.8413 = 84.13%.
21
Mechanics
$\mu = 40$, $\sigma = 6$
Chance that a randomly selected person from this population is aged 34 or less: Pr(X <= 34)
$z = (34 - \mu)/\sigma = -1$
Look up in Table 1: chance is 0.1587 = 15.87%.
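The upper-tail and lower-tail cases for ages 46 and 34 can be sketched together. This is only an illustration with hypothetical helper names; it also shows that, by the symmetry of the normal curve, Pr(X > 46) and Pr(X <= 34) coincide because 46 and 34 are each one standard deviation from the mean.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # plays the role of Table 1


def prob_at_most(c, mu, sigma):
    """Pr(X <= c): standardize, then read the standard normal CDF."""
    return STD_NORMAL.cdf((c - mu) / sigma)


def prob_above(c, mu, sigma):
    """Pr(X > c) = 1 - Pr(X <= c)."""
    return 1.0 - prob_at_most(c, mu, sigma)


mu, sigma = 40, 6
print(f"Pr(X > 46)  = {prob_above(46, mu, sigma):.4f}")
print(f"Pr(X <= 46) = {prob_at_most(46, mu, sigma):.4f}")
print(f"Pr(X <= 34) = {prob_at_most(34, mu, sigma):.4f}")
```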
22
Aortic Stenosis Data
Two populations: healthy kids and kids with aortic stenosis
Two outcomes: body surface area (BSA) and aortic valve area (AVA)
Size-adjusted aortic valve area is the ratio of aortic valve area to body surface area.
23
Stenosis Data, AVA to BSA Ratio
Note the huge outlier in the stenotic kids. He/she has a huge aortic valve area relative to his/her body size.
24
Aortic Stenosis Data
Healthy kids and AVA/BSA ratio
Sample mean = 1.38, s = 0.51
Let’s pretend the population has $\mu = 1.4$, $\sigma = 0.5$.
As it turns out, the sample mean of stenotic kids is 0.7.
So, let’s ask: for healthy kids, what is Pr(X < 0.7)?
25
Aortic Stenosis Data
Healthy kids and AVA/BSA ratio
$\mu = 1.4$, $\sigma = 0.5$
For healthy kids, what is Pr(X <= 0.7)?
$z = (0.7 - \mu)/\sigma = -1.4$; look up in Table 1.
You should get 0.0808.
26
Aortic Stenosis Data
For healthy kids, Pr(X <= 0.7) = 0.0808.
Stenotic kids have a mean AVA/BSA ratio of 0.7.
Thus, the average stenotic kid has a lower AVA/BSA ratio than 91.92% of healthy kids.
91.92% = 100% - 8.08%
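A short sketch of this comparison, assuming the pretend population values 1.4 and 0.5 for healthy kids: `NormalDist` can be given the mean and standard deviation directly, so the standardization step is done internally rather than by hand.

```python
from statistics import NormalDist

# Pretend population for healthy kids' AVA/BSA ratio (values from the slides).
healthy = NormalDist(mu=1.4, sigma=0.5)

stenotic_mean = 0.7
below = healthy.cdf(stenotic_mean)  # Pr(X <= 0.7) for healthy kids
above = 1.0 - below                 # share of healthy kids with a higher ratio

print(f"Pr(X <= 0.7) = {below:.4f}")
print(f"Share of healthy kids above 0.7 = {above:.2%}")
```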
27
Not all Data are Normally Distributed
“Time to an event”, e.g., time to a heart attack
Number of things that happen, e.g., number of heart attacks
These typically have a skewed shape.
28
Not all Data are Normally Distributed
These typically have a skewed shape.
Statisticians have special models to handle this (Gamma, Poisson).
You will usually try to eliminate some of the skewness by data transformation.
29
Not all Data are Normally Distributed
The standard data transformations are:
Square root
Logarithm: but if you have zeros in the data set, you have to add a small constant first, since $\log(0) = -\infty$
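The two transformations might be sketched as follows. The `offset` parameter (the "small constant") and the `counts` values are assumptions for illustration; the lecture does not say which constant to add, and the right choice depends on the scale of the data.

```python
import math


def sqrt_transform(values):
    """Square-root transform; defined for values >= 0."""
    return [math.sqrt(v) for v in values]


def log_transform(values, offset=0.0):
    """Natural-log transform after adding `offset` to every value.

    An offset is needed when the data contain zeros, because log(0)
    is undefined (it tends to minus infinity).
    """
    shifted = [v + offset for v in values]
    if min(shifted) <= 0:
        raise ValueError("all values must be positive after adding the offset")
    return [math.log(s) for s in shifted]


# Made-up right-skewed counts, including a zero.
counts = [0, 1, 1, 2, 3, 5, 12, 40]
print(sqrt_transform(counts))
print(log_transform(counts, offset=1.0))
```

Both transformations pull in the long right tail, which is why they often make skewed data look closer to bell-shaped.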
30
Inference
The basic building blocks for inference are statistics.
Let’s start with the population mean $\mu$, the sample mean $\bar{X}$ and the sample standard deviation $s$.
Standard error (of the mean) is $s/\sqrt{n}$, where $n$ is the sample size.
31
Inference
The sample mean is a random variable.
This means that it varies from sample to sample.
Of course, if we were able to “sample” the entire population, the sample mean would equal the population mean.
32
Inference
The sample mean is a random variable.
Its own “population” mean is $\mu$.
Its standard deviation is $\sigma/\sqrt{n}$.
Note how the standard deviation of the sample mean becomes smaller as the sample size becomes larger.
Why does this make sense?
33
Central Limit Theorem
The sample mean is a random variable.
Its own “population” mean is $\mu$.
Its standard deviation is $\sigma/\sqrt{n}$.
In “large enough” samples, the sample mean is very nearly normally distributed, i.e., has a bell-shaped histogram.
What does this mean?
34
Warning
It is incredibly easy to have difficulty understanding that the sample mean is itself a random variable.
But it is the crucial concept.
If I take repeated samples and compute the sample mean each time, I will not get the same number.
Thus, the sample mean is a random variable.
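A small simulation can make the idea of repeated sampling concrete. The sketch below is an illustration under assumptions not in the lecture: the population is taken to be exponential with mean 1 and standard deviation 1 (a skewed shape), and the sample size, number of repetitions and seed are arbitrary choices.

```python
import math
import random
import statistics


def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size `n` and return the sample mean of each.

    The population is exponential with rate 1 (mean 1, standard
    deviation 1), chosen only because it is clearly not bell-shaped.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]


n, reps = 50, 2000
means = simulate_sample_means(n, reps)

print("first five sample means:", [round(m, 3) for m in means[:5]])
print("mean of the sample means:", round(statistics.fmean(means), 3))
print("sd of the sample means:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):         ", round(1.0 / math.sqrt(n), 3))
```

The sample means differ from run to run of the sampling, their average sits near the population mean, and their standard deviation sits near sigma divided by the square root of n. A histogram of `means` would look roughly bell-shaped even though the population is skewed.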
35
Women’s Interview Survey of Health
Funny case-control study
Seemed to indicate that those women who ate a lot of non-chocolate sweets were at higher risk of breast cancer
271 women controls were interviewed about their diets.
They completed six 24-hour recalls.
36
Women’s Interview Survey of Health
271 women controls were interviewed about their diets and completed six 24-hour recalls.
Hawthorne effect: the more you ask people about their lives, the more they will change.
Does this happen here?
If so, we’d expect that their caloric intake decreased the more they were asked about their diet.
37
Women’s Interview Survey of Health
To test the Hawthorne effect, we took the average caloric intake from the last two interviews and subtracted from it the average caloric intake from the first two interviews.
X = (average of 5 & 6) - (average of 1 & 2)
Do you think the population mean of X is positive or negative?
38
Women’s Interview Survey of Health (WISH)
My guess was that because of various factors (societal pressure, awareness of diet, Hawthorne effect), they would report fewer calories at the second time period.
My hypothesis is that the population mean of X is < 0.
39
WISH: Change in Caloric Intake
Does it look like a big change?
40
WISH: Change in Calories
Does this look straight enough to be happy thinking that X is approximately normally distributed?
41
WISH
What does an IQR of 838 mean?
42
WISH
The sample size is n = 271.
The sample mean change = -180 calories!
The sample standard deviation = 612.
The sample standard error = 612 / sqrt(271) = 37.
Empirical rule: the chance is 95% that the population mean is within about 2 standard errors, 2 * 37 = 74, of -180, i.e., between -254 and -106. (The exact multiplier 1.96 gives a margin of about 73.)
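The interval can be reproduced from the summary statistics alone. This is a minimal sketch with hypothetical function names; it uses the normal multiplier for the requested level (about 1.96 for 95%) and the unrounded standard error, so its endpoints differ by roughly one calorie from the rounded figures -254 and -106.

```python
import math
from statistics import NormalDist


def standard_error(s, n):
    """Standard error of the sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)


def mean_confidence_interval(xbar, s, n, level=0.95):
    """Normal-approximation interval: xbar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 when level = 0.95
    margin = z * standard_error(s, n)
    return xbar - margin, xbar + margin


# WISH summary statistics from the slides.
n, xbar, s = 271, -180.0, 612.0
low, high = mean_confidence_interval(xbar, s, n)
print(f"standard error = {standard_error(s, n):.1f}")
print(f"95% interval   = ({low:.0f}, {high:.0f})")
```

Because the whole interval lies below zero, the data are consistent with a drop in reported calories between the first and last pairs of interviews.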
43
WISH
Empirical rule: the chance is 95% that the population mean is between -254 and -106.
What does this mean?
Is there a Hawthorne effect going on?
Can you attach a probability to this?