
Quantitative Data Analysis: Statistics – Part 2

Overview Part 1  Picturing the Data  Pitfalls of Surveys  Averages  Variance and Standard Deviation Part 2  The Normal Distribution  Z-Tests  Confidence Intervals  T-Tests

The Normal Distribution

Abraham de Moivre, an 18th-century statistician and consultant to gamblers, was often called upon to make lengthy computations about coin flips. De Moivre noted that as the number of events (coin flips) increased, the shape of the binomial distribution approached a very smooth curve. In 1809 Carl Gauss developed the formula for the normal distribution and showed that the distribution of many natural phenomena is at least approximately normal.

Abraham de Moivre Born 26 May 1667 Died 27 November 1754 Born in Champagne, France Wrote a textbook on probability theory, "The Doctrine of Chances: a method of calculating the probabilities of events in play". The book came out in four editions: 1711 in Latin, and 1718, 1738 and 1756 in English. In the later editions, de Moivre gives the first statement of the formula for the normal distribution curve.

Carl Friedrich Gauss Born 30 April 1777 Died 23 February 1855 Born in Lower Saxony, Germany In 1809 Gauss published the monograph “Theoria motus corporum coelestium in sectionibus conicis solem ambientium” where among other things he introduces and describes several important statistical concepts, such as the method of least squares, the method of maximum likelihood, and the normal distribution.

The Normal Distribution

Age of students in a class Body temperature Pulse rate Shoe size IQ score Diameter of trees Height?

The Normal Distribution

Density Curves: Properties

The Normal Distribution  The graph has a single peak at the center; this peak occurs at the mean  The graph is symmetrical about the mean  The graph never touches the horizontal axis  The area under the graph is equal to 1

Characterization A normal distribution is bell-shaped and symmetric. The distribution is determined by the mean, μ, and the standard deviation, σ. The mean μ controls the center and σ controls the spread.

Same Mean, Different Standard Deviation

Different Mean, Different Standard Deviation

Different Mean, Same Standard Deviation

The Normal Distribution If a variable is normally distributed, then:  within one standard deviation of the mean there will be approximately 68% of the data  within two standard deviations of the mean there will be approximately 95% of the data  within three standard deviations of the mean there will be approximately 99.7% of the data
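The 68–95–99.7 rule can be checked numerically. A minimal sketch using Python's standard library (`statistics.NormalDist`, not part of the slides):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
z = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean
for k in (1, 2, 3):
    p = z.cdf(k) - z.cdf(-k)
    print(f"within {k} sd: {p:.2%}")
```

Running this prints 68.27%, 95.45% and 99.73%, the exact values behind the rule of thumb.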

The Normal Distribution

Why? One reason the normal distribution is important is that many psychological and organisational variables are distributed approximately normally. Measures of reading ability, introversion, job satisfaction, and memory are among the many psychological variables that are approximately normally distributed. Although the distributions are only approximately normal, they are usually quite close.

Why? A second reason the normal distribution is so important is that it is easy for mathematical statisticians to work with. This means that many kinds of statistical tests can be derived for normal distributions. Almost all statistical tests discussed in this text assume normal distributions. Fortunately, these tests work very well even if the distribution is only approximately normally distributed. Some tests work well even with very wide deviations from normality.

So what? Imagine we undertook an experiment where we measured staff productivity before and after we introduced a computer system to help record solutions to common issues at work  Average productivity before = 6.4  Average productivity after = 9.2

So what? Before = 6.4 After = 9.2 [a sequence of slides pictured the two means as normal distributions, marking the distance between them in standard deviations, σ]

One Tail / Two Tail One-Tailed  H0: μ1 ≥ μ2  HA: μ1 < μ2 Two-Tailed  H0: μ1 = μ2  HA: μ1 ≠ μ2

STANDARD NORMAL DISTRIBUTION A normal distribution is defined as N(μ, σ²), i.e. N(mean, (std dev)²). The standard normal distribution is defined as N(0, 1²).

STANDARD NORMAL DISTRIBUTION Using the following formula: z = (x − μ) / σ will convert a normal table into a standard normal table.

Exercise If the average IQ in a given population is 100, and the standard deviation is 15, what percentage of the population has an IQ of 145 or higher ?

Answer P(X ≥ 145) = P(Z ≥ (145 − 100)/15) = P(Z ≥ 3) From tables: 99.87% are less than 3 ⇒ 0.13% of the population
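The same answer can be obtained without tables. A sketch using the standard library (not part of the original exercise):

```python
from statistics import NormalDist

mu, sigma = 100, 15        # population IQ parameters from the exercise
z = (145 - mu) / sigma     # standardise: z = (x - mu) / sigma -> 3.0
p_above = 1 - NormalDist(mu, sigma).cdf(145)   # P(X >= 145)
print(f"z = {z:.1f}, P(X >= 145) = {p_above:.2%}")
```

This gives z = 3.0 and a tail probability of about 0.13%, agreeing with the table lookup.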

Trends in Statistical Tests used in Research Papers Historically, results were reported as Accept/Reject; then as a p-Value; currently, as an approximate mean with an interval.

Confidence Intervals A confidence interval is used to express the uncertainty in a quantity being estimated. There is uncertainty because inferences are based on a random sample of finite size from a population or process of interest. To judge the statistical procedure we can ask what would happen if we were to repeat the same study, over and over, getting different data (and thus different confidence intervals) each time.

Confidence Intervals

Jerzy Neyman Born April 16, 1894 Died August 5, 1981 Born in Bessarabia, Imperial Russia Statistician who spent most of his professional career at the University of California, Berkeley. Developed modern scientific sampling (random samples) in 1934, the Neyman-Pearson lemma in 1933, and the confidence interval in 1937.

Egon Pearson Born 11 August 1895 Died 12 June 1980 Born in Hampstead, London Son of Karl Pearson Leading British statistician Developed the Neyman-Pearson lemma in 1933 with Jerzy Neyman.

Neyman and Pearson's joint work formally started in the late 1920s. From 1928 to 1934, they published several important papers on the theory of testing statistical hypotheses. In developing their theory, Neyman and Pearson recognized the need to include alternative hypotheses, and they perceived the errors in testing hypotheses concerning unknown population values based on sample observations that are subject to variation. They called the error of rejecting a true hypothesis the first kind of error and the error of accepting a false hypothesis the second kind of error. They called a hypothesis that completely specifies a probability distribution a simple hypothesis; a hypothesis that is not simple is a composite hypothesis. Their joint work led to Neyman developing the idea of confidence interval estimation, published in 1937.

Confidence Intervals Neyman, J. (1937) "Outline of a theory of statistical estimation based on the classical theory of probability" Philos. Trans. Roy. Soc. London. Ser. A., Vol. 236 pp. 333–380.

Confidence Intervals If we know the true population mean μ and standard deviation σ, and we sample n individuals from a normally distributed population, then the mean of those n observations has a 95% chance of falling into the interval μ ± 1.96 × SE,

Confidence Intervals where the standard error for a 95% CI may be calculated as follows;

Example 1

Did FF have more of the popular vote than FG-L ?  In a random sample of 721 respondents : 382 FF 339 FG-L Can we conclude that FF had more than 50% of the popular vote ?

Example 1 - Solution  Sample proportion = p = 382/721 = 0.53  Sample size = n = 721  Standard Error = √(p(1 − p)/n) = 0.02  95% Confidence Interval: 0.53 ± 1.96 × (0.02) = [0.49, 0.57] Thus, we cannot conclude that FF had more of the popular vote, since this interval spans 50%. So, we say: "the data are consistent with the hypothesis that there is no difference"
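The arithmetic above can be reproduced in a few lines. A sketch using only the standard library (the helper name `proportion_ci` is my own, not from the slides):

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)                  # standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)   # 1.96 for a 95% CI
    return p - z * se, p + z * se

lo, hi = proportion_ci(382, 721)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")   # the interval spans 0.50
```

Because the interval straddles 0.50, the sample cannot establish a majority, matching the conclusion above.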

Example 2

Did Obama have more of the popular vote than McCain ?  In a random sample of 1000 respondents 532 Obama 468 McCain Can we conclude that Obama had more than 50% of the popular vote ?

Example 2 – 95% CI  Sample proportion = p = 532/1000 = 0.532  Sample size = n = 1000  Standard Error = √(p(1 − p)/n) = 0.016  95% Confidence Interval: 0.532 ± 1.96 × (0.016) = [0.5006, 0.5634] Thus, we can conclude that Obama had more of the popular vote, since this interval does not span 50%. So, we say: "the data are consistent with the hypothesis that there is a difference at a 95% CI"

Example 2 – 99% CI  Sample proportion = p = 532/1000 = 0.532  Sample size = n = 1000  Standard Error = √(p(1 − p)/n) = 0.016  99% Confidence Interval: 0.532 ± 2.576 × (0.016) = [0.491, 0.573] Thus, we cannot conclude that Obama had more of the popular vote, since this interval does span 50%. So, we say: "the data are consistent with the hypothesis that there is no difference at a 99% CI"

Example 2 – 99.99% CI  Sample proportion = p = 532/1000 = 0.532  Sample size = n = 1000  Standard Error = √(p(1 − p)/n) = 0.016  99.99% Confidence Interval: [0.472, 0.592] Thus, we cannot conclude that Obama had more of the popular vote, since this interval does span 50%. So, we say: "the data are consistent with the hypothesis that there is no difference at a 99.99% CI"
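The three confidence levels above can be computed in one loop. A sketch, standard library only; endpoints differ slightly from the slides because the standard error is not rounded to 0.016 here:

```python
from math import sqrt
from statistics import NormalDist

p, n = 532 / 1000, 1000
se = sqrt(p * (1 - p) / n)   # standard error, about 0.0158

for level in (0.95, 0.99, 0.9999):
    z = NormalDist().inv_cdf(0.5 + level / 2)   # critical value for this level
    lo, hi = p - z * se, p + z * se
    print(f"{level:.2%} CI: [{lo:.3f}, {hi:.3f}]  spans 50%? {lo < 0.5 < hi}")
```

Only the 95% interval excludes 0.50: demanding more confidence widens the interval until the data can no longer rule out a tie.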

T-Tests

William Sealy Gosset Born June 13, 1876 Died October 16, 1937 Born in Canterbury, England On graduating from Oxford in 1899, he joined the Dublin brewery of Arthur Guinness & Son. Published a significant paper in 1908 concerning the t-distribution

Gosset acquired his statistical knowledge by study, and he also spent two terms in 1906–1907 in the biometric laboratory of Karl Pearson. Gosset applied his knowledge for Guinness both in the brewery and on the farm - to the selection of the best-yielding varieties of barley, and to compare the different brewing processes for changing raw materials into beer. Gosset and Pearson had a good relationship, and Pearson helped Gosset with the mathematics of his papers. Pearson helped with the 1908 paper but had little appreciation of its importance. The papers addressed the brewer's concern with small samples, while the biometrician typically had hundreds of observations and saw no urgency in developing small-sample methods.

T-Tests Student (1908), “The Probable Error of a Mean” Biometrika, Vol. 6, No. 1, pp.1-25.

T-Tests Guinness did not allow its employees to publish results, but the management decided to allow Gosset to publish under a pseudonym - Student. Hence we have the Student's t-test.

T-Tests  a powerful parametric test for calculating the significance of a small sample mean  necessary for small samples, because for them the sampling distribution of the mean is not well approximated by the normal  one first has to calculate the "degrees of freedom"

~ THE GOLDEN RULE ~ Use the t-Test when your sample size is less than 30

T-Tests The t-test is appropriate:  if the underlying population is normal  if the underlying population is not skewed and reasonably close to normal (n < 15)  if the underlying population is skewed but there are no major outliers (n > 15)  if the underlying population is skewed with some outliers (n > 24)

T-Tests Form of Confidence Interval with a t-value: mean ± t-value × SE, where the mean and SE are calculated as before.

Two Sample T-Test: Unpaired Sample Consider a questionnaire on computer use given to final-year undergraduates in 2007, and the same questionnaire given to undergraduates in a later year. As there is no direct one-to-one correspondence between individual students (in fact, there may be a different number of students in different classes), you have to sum up all the responses for a given year, obtain an average from that, do the same for the other year, and compare the averages.

Two Sample T-Test: Paired Sample If you are doing a questionnaire that is testing the BEFORE/AFTER effect of a parameter on the same population, then we can individually calculate the difference for each participant and then average the differences. The paired test is a much stronger (more powerful) statistical test.
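To make the paired idea concrete, here is a sketch with invented before/after scores for eight staff (the numbers are made up to echo the 6.4 → 9.2 productivity example earlier; they are not from the slides):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical productivity scores for the same 8 staff, before and after
before = [6.1, 5.9, 6.8, 6.4, 6.0, 6.7, 6.5, 6.8]
after  = [9.0, 8.7, 9.6, 9.4, 8.9, 9.5, 9.1, 9.4]

# The paired t-test works on the per-person differences
diffs = [a - b for a, b in zip(after, before)]
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
df = len(diffs) - 1
print(f"t = {t:.1f} with {df} degrees of freedom")
```

Comparing t against the critical value for the given degrees of freedom (from a t-table) then decides significance; pairing keeps person-to-person variation out of the denominator, which is why the paired test is the more powerful one.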

Choosing the right test [the closing slides presented decision charts for choosing an appropriate statistical test]