
 Assumptions are an essential part of statistics and the process of building and testing models.  There are many different assumptions across the range of different statistical tests and procedures, but for now we are focusing on several of the most common assumptions and how to check them.

 When assumptions are violated, we cannot draw accurate conclusions. Or worse, we draw completely false conclusions.  Example: bringing dessert to a potluck dinner. You assume others will bring main dishes and sides. What if everyone just brings dessert?

 Many of the statistical tests that we use in this course are called “parametric” tests.  Many parametric tests are based on the normal distribution. If we use such a test when our data are not, in fact, normal, we will probably get inaccurate estimates.  Parametric statistics: we assume that the data come from a specific class of probability distributions, indexed by a set of parameters. We make inferences about these parameters, selecting a unique distribution as a model for the population.

 Normally distributed data: this can mean different things depending on the context. Sometimes we need a population variable to be normally distributed; other times we need the errors in our model to be normally distributed.
 Homogeneity of variance: variances should be the same in the data we are using for a given test. So, if we are comparing means between 2, 3, or more groups, we want to know whether the variances for the groups differ from one another.
 Interval data: that is, equal intervals along the entire range of a given variable. For example, age measured in years is an interval variable. But age measured in three categories (young, middle aged, old) would not be interval (it is ordinal).
 Independence: this can also mean many different things, but usually it means that the data from different cases are independent (the behavior or action of one case does not influence the behavior or action of another case).

 In many tests we assume that the sampling distribution is normally distributed, but remember that we do not actually know whether the sampling distribution is normally distributed or not.  We do know that if the sample data are approximately normal, then the sampling distribution should also be normal.  Also, by the Central Limit Theorem, we know that with large sample sizes we can be even more confident that the sampling distribution is normally distributed.
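The slides contain no code, but the Central Limit Theorem effect described above can be sketched with Python's standard library alone. The population here is deliberately skewed (exponential), and all names and sizes are illustrative choices, not from the slides:

```python
# Illustrative sketch: draw many samples from a skewed exponential
# population and look at the distribution of their means.
import random
import statistics

random.seed(42)

def sample_means(n, trials=2000):
    """Means of `trials` samples, each of size n, from Exp(rate=1)."""
    return [statistics.mean(random.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

small = sample_means(5)    # small samples: means still skewed, spread out
large = sample_means(100)  # large samples: means approximately normal

# The spread of the sample means shrinks roughly as 1/sqrt(n).
print(statistics.stdev(small), statistics.stdev(large))
```

Even though the individual observations are far from normal, the means of the larger samples cluster tightly and symmetrically around the population mean.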

 Histograms plotted with the normal curve.  In this example I took some data on cars, plotted a histogram of the maximum price of the various cars, and then overlaid the normal curve on it as well.

 Quantile-quantile plot, or Q-Q plot.  Quantiles are values of a variable that split the dataset into equal-sized groups.  So, we rank our observations and divide them into equal fractions, say tenths. Consider the value that separates the first tenth from the second, then the second from the third, and so on.  We plot these points on the y-axis, then repeat the procedure on the x-axis for a normal curve.  If our data are normal, we will get a straight diagonal line. Values falling further from the diagonal indicate deviations from normality.
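The quantile-pairing procedure above can be sketched with the standard library: pair each sorted observation with the normal quantile at the same rank. The data are simulated and the (i + 0.5)/n plotting positions are one common convention, not something specified in the slides:

```python
# Sketch of the Q-Q idea: for normal data, sample quantiles plotted
# against matching normal quantiles fall near a straight line.
import random
import statistics

random.seed(1)
data = sorted(random.gauss(50, 10) for _ in range(200))
n = len(data)
norm = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))

# Pair the theoretical quantile at rank (i + 0.5)/n with each observation.
pairs = [(norm.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]

# For roughly normal data, the two columns track each other closely.
for t, s in pairs[::50]:
    print(round(t, 1), round(s, 1))
```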

[Q-Q plots: Max Price of Cars in Dataset; Length of Cars in Dataset]

 First, we can always just look at descriptive statistics: means, medians, SDs, skewness, kurtosis, etc.
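As a quick sketch of computing one of these descriptives by hand, here is a simple population-form skewness calculation; the example values are made up for illustration:

```python
# Sketch: sample skewness as the mean cubed deviation divided by the
# cubed (population) standard deviation.
import statistics

def skewness(xs):
    """Population-form skewness of a sequence of numbers."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]   # long right tail -> positive skew

print(skewness(symmetric))     # exactly 0: deviations cancel
print(skewness(right_skewed))  # clearly positive
```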

 We can also conduct a Shapiro-Wilk test, which compares the scores in the sample to a normally distributed set of scores.  Null hypothesis: the distribution of the sample is the same as a normal distribution.  Alternative hypothesis: the distribution of the sample is not the same as a normal distribution. ▪ Thus, if we find a significant p-value (e.g., p < .05), this means we have a non-normal distribution.
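Assuming SciPy is available, the Shapiro-Wilk test might be run like this; the data are simulated here, not the car data from the slides:

```python
# Sketch (assumes SciPy): Shapiro-Wilk on a normal and a skewed sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=50, scale=10, size=200)
skewed_data = rng.exponential(scale=10, size=200)

stat_n, p_n = stats.shapiro(normal_data)
stat_s, p_s = stats.shapiro(skewed_data)

# A small p-value leads us to reject the null of normality; truly
# normal data typically give p-values well above .05.
print(f"normal sample: p = {p_n:.3f}")
print(f"skewed sample: p = {p_s:.2g}")
```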

[Shapiro-Wilk results for Max Price of Cars in Dataset] Our test is highly significant: the p-value is < .0001. Thus, we reject the null hypothesis and accept our alternative hypothesis: our distribution is not normal.

 Interpreting the Shapiro-Wilk p-value depends on sample size as well as on the data:
 High p-value: either we have a normal distribution, or we just do not have enough data to reject the hypothesis of normality.
 Low p-value: either our distribution is very non-normal, or it is only slightly non-normal but we have so much data that we can reject the normal hypothesis anyway.
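This sample-size effect can be demonstrated directly. The sketch below assumes SciPy and uses simulated, modestly skewed data (the seed and mixture are arbitrary choices):

```python
# Sketch (assumes SciPy): the same modest skew that a small sample
# cannot distinguish from normality is decisively rejected at n=5000.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Normal noise plus a scaled exponential component -> modest right skew.
skewed = rng.normal(size=5000) + 0.8 * rng.exponential(size=5000)

_, p_small = stats.shapiro(skewed[:20])   # often non-significant
_, p_large = stats.shapiro(skewed)        # decisively significant

print(f"n = 20:   p = {p_small:.3f}")
print(f"n = 5000: p = {p_large:.2g}")
```

The underlying distribution is identical in both tests; only the amount of data changes.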

 Homogeneity of variance simply means that two or more variances of interest are equal.  If your data are divided into groups, the variance of a variable of interest should not differ across those groups.  Importantly: we are only concerned here with the variance (spread). It is not problematic if the means of the groups are different. [Figure: two different distributions for two different groups; they have the same mean, but different variances.]

 Levene’s Test: allows us to examine whether the variances between groups are different or not.  Null hypothesis: the variances for the groups are equal.  Alternative hypothesis: the variances for the groups are not equal. ▪ Thus, if we find a significant p-value (e.g., p < .05), this means we have unequal variances between groups.
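Assuming SciPy, Levene's test might be run as follows. The three groups are simulated, with one deliberately given a much larger spread:

```python
# Sketch (assumes SciPy): Levene's test across three groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.normal(50, 5, size=100)    # small spread
g2 = rng.normal(50, 5, size=100)    # small spread
g3 = rng.normal(50, 25, size=100)   # much larger spread

# center='median' is the robust (Brown-Forsythe) variant.
stat, p = stats.levene(g1, g2, g3, center='median')

# A significant p-value => the variances are not all equal.
print(f"W = {stat:.2f}, p = {p:.2g}")
```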

 Levene’s test is often significant when you have a very large sample of data.  Hartley’s F_max (or variance ratio) is the ratio of the largest group variance to the smallest group variance.
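The variance ratio needs no special library. This sketch uses made-up groups, one with a visibly larger spread:

```python
# Sketch: Hartley's F_max as the largest group variance divided by the
# smallest, using only the standard library.
import statistics

groups = {
    "A": [12, 14, 15, 13, 16, 14],
    "B": [10, 18, 9, 21, 8, 19],   # same rough center, much wider spread
    "C": [13, 15, 14, 16, 13, 15],
}

variances = {name: statistics.variance(g) for name, g in groups.items()}
f_max = max(variances.values()) / min(variances.values())

# Ratios near 1 suggest homogeneity; large ratios (judged against
# Hartley's tables) suggest unequal variances.
print(variances)
print(round(f_max, 2))
```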

Source: Wikipedia.

 It is useful to look at both the variance ratio and Levene’s test when you are working with very large data.  We will return to using and applying tests of homogeneity of variance as we work with statistical tests that rely on this assumption.

 There are two procedures that we can use to directly correct problems in a given dataset: handling outliers and transforming the data.  Outliers ▪ An outlier is an observation that is numerically distant from the rest of the data. ▪ Example: a survey of which undergraduate majors lead to the highest incomes.  Transforming data ▪ Used when we need to meet the assumption of normality in order to conduct a particular statistical test, but the data have failed a check of that assumption (e.g., the Shapiro-Wilk test).

 Outliers wreak havoc with an otherwise normal distribution.  A common cause of outliers is when we actually have two distinct sub-groups in the same sample. [Figure from Hedges and Shah, BMC Bioinformatics]

 Removing the case(s). This is also called “trimming”.  Quite literally, deleting the case (or hiding it) so that it is not part of any analysis that you conduct.  You have to be able to justify this decision: we cannot delete cases just because they do not ‘fit’ what we expect. The key issue is whether the outliers are actually very different from the rest of the sample.  Common reasons for removing/trimming: ▪ Data out of range, perhaps due to a data-entry error (scale is 1-100, value is 200). ▪ Measurement error (something in our measurement appears to produce incorrect values for some cases).
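A minimal sketch of trimming out-of-range values; the scores and the 1-100 scale are hypothetical, echoing the data-entry example above:

```python
# Sketch: drop values outside a known valid range (a 1-100 scale).
scores = [72, 85, 200, 64, 91, -5, 88]   # 200 and -5 are out of range

valid = [s for s in scores if 1 <= s <= 100]
removed = [s for s in scores if not 1 <= s <= 100]

print(valid)    # [72, 85, 64, 91, 88]
print(removed)  # [200, -5]
```

Keeping a record of what was removed (and why) is what makes the trimming defensible.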

 Changing the value of the case  Essentially we replace the outlier cases with some other value.  For example, assign the next highest or lowest value in the sample that is not an outlier.  For fairly obvious reasons, this is usually the least preferred option.
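A sketch of the replacement approach (a hand-rolled form of what is sometimes called winsorizing); the data and the choice of outlier are hypothetical:

```python
# Sketch: replace an outlier with the next highest value in the sample
# that is not an outlier.
data = [3, 5, 6, 7, 8, 9, 41]

# Treat 41 as the outlier; the highest non-outlier value is the cap.
cap = max(x for x in data if x < 41)
adjusted = [min(x, cap) for x in data]

print(adjusted)  # [3, 5, 6, 7, 8, 9, 9]
```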

[Worked example not preserved in the transcript: a small set of data points is shown, followed by the same set after removing cases and after changing cases.]

 Transforming the data  As we will discuss in a moment, transformations can correct certain types of problems, such as a few cases at the far end of one tail.  For example, we can apply a power transformation (such as the log, square, etc.) to reduce the skew in a distribution.

 Helps to correct for skew and non-symmetric distributions.  Commonly used when the dependent variable in a regression is highly skewed.  Power transformations will not ‘fix’ distributions that are bimodal, uniform, etc.  Y^q reduces negative skew (e.g., squares, cubes, etc.).  log(Y) or −(Y^−q) reduces positive skew.
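A quick sketch of a log transformation reducing positive skew, using made-up income-like values and a population-form skewness calculation:

```python
# Sketch: a right-skewed variable becomes much more symmetric on the
# log scale, as measured by a simple skewness statistic.
import math
import statistics

def skewness(xs):
    """Population-form skewness of a sequence of numbers."""
    m, s, n = statistics.mean(xs), statistics.pstdev(xs), len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

incomes = [10, 15, 20, 25, 30, 40, 55, 75, 110, 200]  # long right tail
logged = [math.log(x) for x in incomes]

print(round(skewness(incomes), 2))  # strongly positive
print(round(skewness(logged), 2))   # much closer to 0
```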

 Transformations have consequences:  You are changing the hypothesis slightly, e.g., examining the log of income rather than actual income in dollars.  Transformations can get complicated, so it may be better to use robust tests rather than relying too heavily on transformations. [Figure: histogram of the log-transformed variable]