Statistical Analysis I Mosuk Chow, PhD Senior Scientist and Professor Department of Statistics December 8, 2015 CTSI BERD Research Methods Seminar Series.


1 Statistical Analysis I Mosuk Chow, PhD Senior Scientist and Professor Department of Statistics December 8, 2015 CTSI BERD Research Methods Seminar Series

2 Biostatistics, Epidemiology, Research Design (BERD)
BERD Goals:
- Match the needs of investigators to the appropriate biostatisticians/epidemiologists/methodologists
- Provide BERD support to investigators
- Offer BERD education to students and investigators via in-person, videoconferenced, and online classes
http://ctsi.psu.edu/ctsi-programs/biostatisticsepidemiologyresearch-design/

3 Statistics Encompasses
- Study design
  - Selection of efficient design (cohort study/case-control study)
  - Sample size
  - Randomization
- Data collection
- Summarizing data
  - Important first step in understanding the data collected
- Analyzing data to draw conclusions
- Communicating the results of analyses

4 Keys to Successful Collaboration Between Statistician and Investigator: A Two-Way Street
- Involve the statistician at the beginning of the project (planning/design phase)
- Specific objectives
- Communication
  - Avoid jargon
  - Willingness to explain details

5 Keys to Successful Collaboration: A Two-Way Street
- Respect
  - Knowledge
  - Skills
  - Experience
  - Time
- Embrace the statistician as a member of the research team
- Fund the statistician on the grant application for best collaboration
  - Most statisticians are supported by grants, not by institutional funds

6 Statistical Analysis
- Describing data
  - Numeric or graphic
- Statistical inference
  - Estimation of parameters of interest
  - Hypothesis testing
  - Regression modeling
- Interpretation and presentation of the results

7 Describing data: Basic Terms
- Measurement – assignment of a number to a characteristic of an object or event
- Data – collection of measurements
- Sample – collected data
- Population – all possible data
- Variable – a property or characteristic of the population/sample, e.g., gender, weight, blood pressure

8 Example of data set/sample
Data on albumin and bilirubin levels before and after treatment with a study drug

9 Describing Data
- Types of data
- Summary measures (numeric)
- Visually describing data (graphical)

10 Types of Variables
- Qualitative or Categorical
  - Binary (or dichotomous): True/False, Yes/No
  - Nominal – no natural ordering: Ethnicity
  - Ordinal – categories have natural ranks
    - Degree of agreement (strong, modest, weak)
    - Size of tumor (small, medium, large)
- Quantitative
  - Ratio – ordered, constant scale, natural zero (age, weight)
  - Interval – ordered, constant scale, no natural zero
    - Differences make sense, but ratios do not
    - Temperature in Celsius (30° − 20° = 20° − 10°, but 20° is not twice as hot as 10°)

11 Types of Measurements for Quantitative Variables
- Continuous: Weight, Height, Age
- Discrete: a countable number of values
  - The number of births, Age in years
- Likert scale: “agree”, “strongly agree”, etc. Somewhere between ordinal and discrete
  - Scales with <= 4 possibilities are usually considered to be ordinal.
  - Scales with >= 7 possibilities are usually considered to be discrete.

12 Descriptive Statistics
Quantitative variable
- Measure(s) of central location/tendency
  - Mean
  - Median
  - Mode
- Measure(s) of variability (dispersion)
  - Describe the spread of the distribution

13 Descriptive Statistics (cont.)
- Summary measures of dispersion/variation
  - Minimum and maximum
  - Range = Maximum – Minimum
  - Sample variance (abbreviated s²) and standard deviation (s or SD), with denominator n − 1

14 Other Measures of Variation
- Interquartile range (IQR): 75th percentile – 25th percentile
- MAD: median absolute deviation
- CV: coefficient of variation
  - Ratio of SD to sample mean
  - Measures relative variability
  - Independent of measurement units
  - Useful for comparing two or more sets of data
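As a rough illustration of the summary measures listed on the preceding slides, the sketch below computes them with Python's standard `statistics` module. The readings in `sbp` are hypothetical values made up for this example, not data from the seminar, and the quartiles use the module's default interpolation method, which can differ slightly from other software.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg); NOT seminar data.
sbp = [110, 118, 121, 125, 125, 130, 134, 142]

# Measures of central location
mean = statistics.mean(sbp)
median = statistics.median(sbp)
mode = statistics.mode(sbp)

# Measures of dispersion
data_range = max(sbp) - min(sbp)
variance = statistics.variance(sbp)  # sample variance, denominator n - 1
sd = statistics.stdev(sbp)           # square root of the sample variance

# Quartiles (default "exclusive" method) and interquartile range
q1, q2, q3 = statistics.quantiles(sbp, n=4)
iqr = q3 - q1

# Median absolute deviation: median of |x - median|
mad = statistics.median([abs(x - median) for x in sbp])

# Coefficient of variation: SD relative to the mean (unit-free)
cv = sd / mean

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance:.2f}, SD={sd:.2f}")
print(f"IQR={iqr}, MAD={mad}, CV={cv:.3f}")
```

Because the CV divides the SD by the mean, it only makes sense for ratio-scale variables with a positive mean.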

15 Describing data graphically
Tell the whole story of the data, detect outliers
- Histogram
- Stem and Leaf Plot
- Box Plot

16 Histogram
- Divide the range of data into intervals (bins) of equal width.
- Count the number of observations in each class.
Example (113 men): each bar spans a width of 5 mmHg. The height represents the number of individuals in that range of SBP.

17 Histogram of SBP
[Figure: two histograms of the same SBP data, one with bin width = 20 mmHg and one with bin width = 1 mmHg]
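The binning procedure, and the effect of the bin width, can be sketched in a few lines. The helper `histogram_counts` and the readings in `sbp` are hypothetical and introduced only for illustration; the sketch assumes integer data and left-closed bins starting at `start`.

```python
def histogram_counts(values, bin_width, start):
    """Count observations in equal-width bins [start + k*w, start + (k+1)*w)."""
    n_bins = (max(values) - start) // bin_width + 1
    counts = [0] * n_bins
    for v in values:
        counts[(v - start) // bin_width] += 1
    return counts


# Hypothetical SBP readings (mmHg); NOT seminar data.
sbp = [110, 118, 121, 125, 125, 130, 134, 142]

narrow = histogram_counts(sbp, bin_width=5, start=110)
wide = histogram_counts(sbp, bin_width=20, start=110)

# Text rendering of the narrow-bin histogram: one asterisk per observation.
for k, count in enumerate(narrow):
    lower = 110 + 5 * k
    print(f"{lower}-{lower + 4}: {'*' * count}")

print("5 mmHg bins: ", narrow)
print("20 mmHg bins:", wide)
```

Wide bins hide the shape of the distribution, while very narrow bins leave mostly empty or single-count bars, which is the contrast the two panels on the slide illustrate.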

18 Stem and Leaf Plot
- Provides a good summary of data structure
- Easy to construct and much less prone to error than the tally method of finding a histogram
2 | 889
3 | 01112334455556667777899
4 | 001111122333444455567789
5 | 011234
“stem”: the first digit or digits of the number.
“leaf”: the trailing digit.
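A minimal sketch of how such a plot can be built, assuming non-negative integers where the last digit is the leaf and the remaining leading digits form the stem. The function name `stem_and_leaf` and the values in `ages` are hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group sorted values by stem (leading digits) and list leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return {
        stem: "".join(str(leaf) for leaf in leaves)
        for stem, leaves in sorted(stems.items())
    }


# Hypothetical two-digit measurements; NOT seminar data.
ages = [45, 28, 31, 50, 29, 30, 42, 28, 51, 31]

plot = stem_and_leaf(ages)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```

Each printed row keeps every individual value recoverable, which a histogram with the same bins would not.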

19 Box Plot: SBP for 113 Males
[Figure: box plot of blood pressure, labeled from top to bottom with the largest observation, 75th percentile, sample median, 25th percentile, and smallest observation]

20 Descriptive Statistics (cont.)
Categorical variable
- Frequency (counts) distribution
- Relative frequency (percentages)
- Pie chart
- Bar graph

21 Describe relationship between two variables
One quantitative and one categorical
- Descriptive statistics within each category
- Side-by-side boxplots/histograms
Both quantitative
- Scatter plot
Both categorical
- Contingency table
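For the case where both variables are categorical, a contingency table is a cross-tabulation of counts. The sketch below builds one with `collections.Counter`; the `records` list (treatment group, response) is hypothetical and used only to show the mechanics.

```python
from collections import Counter

# Hypothetical (group, response) pairs; NOT seminar data.
records = [
    ("drug", "yes"), ("drug", "yes"), ("drug", "no"),
    ("placebo", "yes"), ("placebo", "no"), ("placebo", "no"),
]

# Count each (row category, column category) combination.
table = Counter(records)
rows = sorted({group for group, _ in records})
cols = sorted({response for _, response in records})

# Print the table with row totals.
print("group    " + "  ".join(cols) + "  total")
for r in rows:
    counts = [table[(r, c)] for c in cols]
    print(f"{r:<8} " + "   ".join(str(n) for n in counts) + f"    {sum(counts)}")
```

Dividing each cell by its row total gives the relative frequencies within each group.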

22 Statistical Inference
A process of making inference (an estimate, prediction, or decision) about a population (parameters) based on a sample (statistics) drawn from that population.
- Population: parameters (fixed, unknown)
- Sample: statistics (vary from sample to sample)

23 Statistical Inference
Questions to ask in selecting appropriate methods
- Are observation units independent?
- How many variables are of interest?
- Type and distribution of variable(s)?
- One-sample or two-sample problem?
- Are samples independent?
- Parameters of interest (mean, variance, proportion)?
- Sample size sufficient for the chosen method?
(see decision making flow chart in the handout)

24 Estimation of population mean
- We don’t know the population mean μ but would like to estimate it.
- We draw a sample from the population.
- We calculate the sample mean X̄.
- How close is X̄ to μ?
- Statistical theory will tell us how close X̄ is to μ.
- Statistical inference is the process of trying to draw conclusions about the population from the sample.

25 Key Statistical Concept
- Question: How close is the sample mean to the population mean?
- Statistical inference for the sample mean
  - The sample mean will change from sample to sample
  - We need a statistical model to quantify the distribution of sample means (sampling distribution)
  - Sometimes, we need a “normal distribution” for the population data

26 Normal Distribution
- The normal distribution, denoted by N(µ, σ²), is characterized by two parameters:
  - µ: The mean is the center.
  - σ: The standard deviation measures the spread (variability).
[Figure: probability density function, annotated with the mean and the standard deviation]

27 Distribution of Blood Pressure in Men (population)
Y: Blood pressure, Y ~ N(µ, σ²)
Parameters: Mean, µ = 125 mmHg; SD, σ = 14 mmHg
[Figure: normal curve with the central 68%, 95%, and 99.7% regions marked] The 68-95-99.7 rule for the normal distribution applied to the distribution of systolic blood pressure in men.
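The 68-95-99.7 rule for this population can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library with the slide's parameters (µ = 125 mmHg, σ = 14 mmHg); the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 125, 14  # population parameters from the slide (mmHg)
sbp = NormalDist(mu=mu, sigma=sigma)

intervals = {}
coverage = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    intervals[k] = (low, high)
    # Probability that Y falls within k standard deviations of the mean
    coverage[k] = sbp.cdf(high) - sbp.cdf(low)
    print(f"within {k} SD: {low} to {high} mmHg, probability {coverage[k]:.4f}")
```

So roughly 68% of men in this population have SBP between 111 and 139 mmHg, 95% between 97 and 153 mmHg, and 99.7% between 83 and 167 mmHg.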

28 Sampling Distribution
- The sampling distribution refers to the distribution of the sample statistics (e.g., sample means) over all possible samples of size n that could have been selected from the study population.
- If the population data follow a normal distribution N(µ, σ²), then the sample means follow a normal distribution N(µ, σ²/n).
- What if the population data do not come from a normal distribution?

29 Central Limit Theorem (CLT)
- If the sample size is large, the distribution of sample means approximates a normal distribution: X̄ ~ N(µ, σ²/n)
- The Central Limit Theorem works even when the population is not normally distributed (or even not continuous).
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
For sample means, the standard rule is n > 60 for the Central Limit Theorem to kick in, depending on how “abnormal” the population distribution is. 60 is a worst-case scenario.
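A small simulation can make the theorem concrete. The sketch below is an illustration under stated assumptions, not part of the seminar: the population is taken to be exponential with mean 1 and SD 1 (strongly skewed), the sample size is 60, and the seed and number of replicates are arbitrary choices.

```python
import random
import statistics

random.seed(2015)  # fixed seed so the run is reproducible

n = 60             # sample size
replicates = 5000  # number of samples drawn

# Skewed, non-normal population: exponential with mean 1 and SD 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(replicates)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / n ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean 1.0)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

Even though individual observations are far from normal, the sample means cluster around µ with spread close to σ/√n; plotting a histogram of `sample_means` would show the approximately bell-shaped curve.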

30 Sampling Distribution
- By the CLT, about 95% of the time, the sample mean will be within two standard errors of the population mean.
  - This tells us how “close” the sample statistic should be to the population parameter.
- Standard errors (SE) measure the precision of your sample statistic.
- A small SE means it is more precise.
- The SE is the standard deviation of the sampling distribution of the statistic.
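The "within two standard errors about 95% of the time" statement can be checked by simulation. This sketch assumes a uniform population on [0, 1] (mean 0.5, SD √(1/12)); the population, sample size, seed, and number of replicates are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(8)  # fixed seed so the run is reproducible

n = 50
replicates = 4000
mu = 0.5                   # mean of the uniform(0, 1) population
sigma = math.sqrt(1 / 12)  # SD of the uniform(0, 1) population
se = sigma / math.sqrt(n)  # standard error of the sample mean

within = 0
for _ in range(replicates):
    xbar = statistics.fmean([random.random() for _ in range(n)])
    if abs(xbar - mu) <= 2 * se:
        within += 1

fraction = within / replicates
print(f"fraction of sample means within 2 SE of mu: {fraction:.3f}")
```

The observed fraction should land near 0.95, with some run-to-run variation if the seed is changed.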

31 Standard Error of Sample Mean
- The standard error of the sample mean (SEM) is a measure of the precision of the sample mean.
  - σ: standard deviation (SD) of the population distribution.
  - n: sample size.
SEM = σ/√n
The standard deviation is not the standard error of a statistic!

32 Example
- Measure systolic blood pressure on a random sample of 100 students
  Sample size: n = 100
  Sample mean: x̄ = 125 mm Hg
  Sample SD: s = 14.0 mm Hg
- The population SD (σ) can be replaced by the sample SD for a large sample
SEM = s/√n = 14.0/√100 = 1.4 mm Hg
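The arithmetic for this example, using the formula SEM = s/√n with the sample SD standing in for σ, can be written out as a short check. The variable names are illustrative.

```python
import math

n = 100   # sample size
s = 14.0  # sample SD in mm Hg, used in place of the population SD

sem = s / math.sqrt(n)
print(f"SEM = {sem:.1f} mm Hg")  # SEM = 1.4 mm Hg
```

Note how the SD (14.0 mm Hg) describes the spread of individual students, while the SEM (1.4 mm Hg) describes the precision of the sample mean; quadrupling the sample size would halve the SEM.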

33 Confidence Interval for population mean
- An approximate 95% confidence interval for the population mean µ is X̄ ± 2×SEM, or more precisely X̄ ± 1.96×SEM.
- X̄ is a random variable (it varies from sample to sample), so the confidence interval is random and has a 95% chance of covering µ before a sample is selected.
- Once a sample is taken, we observe x̄; then either µ is within the calculated interval or it is not.
- The confidence interval gives the range of plausible values for µ.
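Applying this to the blood pressure example (n = 100, sample mean 125 mm Hg, sample SD 14.0 mm Hg) gives a concrete interval. The sketch assumes the usual large-sample normal critical value of about 1.96 for the precise version, obtained here from `statistics.NormalDist`; variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 100       # sample size
xbar = 125.0  # observed sample mean (mm Hg)
s = 14.0      # sample SD (mm Hg)

sem = s / math.sqrt(n)

# Rule-of-thumb interval: mean plus or minus 2 standard errors
approx_ci = (xbar - 2 * sem, xbar + 2 * sem)

# More precise interval: 97.5th percentile of the standard normal (about 1.96)
z = NormalDist().inv_cdf(0.975)
precise_ci = (xbar - z * sem, xbar + z * sem)

print(f"approximate 95% CI: {approx_ci[0]:.1f} to {approx_ci[1]:.1f} mm Hg")
print(f"precise 95% CI:     {precise_ci[0]:.2f} to {precise_ci[1]:.2f} mm Hg")
```

The two intervals differ only in the second decimal place here, which is why the "± 2×SEM" rule is a common shortcut for large samples.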

