Introduction to Statistics for Engineers

Introduction to Statistics for Engineers
Summer 2016
Ismor Fischer, UW Dept of Statistics
1227 Medical Science Center
ifischer@wisc.edu

STATISTICS IN A NUTSHELL

Supplemental Lecture Notes
1 - Introduction
2 - Exploratory Data Analysis
3 - Probability Theory
4 - Classical Probability Distributions
5 - Sampling Distributions / Central Limit Theorem
6 - Statistical Inference
7 - Correlation and Regression
(8 - Survival Analysis)

What is “random variation” in the distribution of a population?
Examples: Toasting time, temperature settings, etc. of a population of toasters.
POPULATION 1: Little to no variation (e.g., product manufacturing)
In engineering situations such as this, we try to maintain “quality control,” i.e., tight tolerance levels, high precision, low variability.
But what about a population of, say, people?

What is “random variation” in the distribution of a population? Example: Body Temperature (F) POPULATION 1: Little to no variation (e.g., clones) Density Most individual values ≈ population mean value Very little variation about the mean! 98.6 F

What is “random variation” in the distribution of a population? Example: Body Temperature (F) Examples: Gender, Race, Age, Height, Annual Income,… POPULATION 2: Much variation (more common) Density Much more variation about the mean!

What are “statistics,” and how can they be applied to real issues?
HYPOTHESIS: Suppose a certain company (GLOBAL OPERATION DYNAMICS, INC.) insists that it complies with “gender equality” regulations among its employee population, i.e., approx. 50% male and 50% female.
EXPERIMENT: To test this claim, let us select a random sample of n = 100 employees, and count X = the number of males. (If the claim is true, then we expect X ≈ 50.)
OBSERVATIONS: X = 64 males (+ 36 females).
Questions: If the claim is true, how likely is this experimental result? (“p-value”) Could the difference (14 males) be due to random chance variation, or is it statistically significant?

PROBABILITY THEORY: The experiment in this problem can be modeled by a random sequence of n = 100 independent coin tosses (Heads = Male, Tails = Female). It can be mathematically proved that, if the coin is “fair” (“unbiased”), then in 100 tosses the probability of obtaining at least k Heads away from 50 is:
  k = 0    1.0000  (“certainty”)
  k = 1    0.9204
  k = 2    0.7644
  k = 3    0.6173
  k = 4    0.4841
  k = 5    0.3682
  k = 6    0.2713
  k = 7    0.1933
  k = 8    0.1332
  k = 9    0.0886
  k = 10   0.0569
  k = 11   0.0352
  k = 12   0.0210
  k = 13   0.0120
  k = 14   0.0066
  etc., decreasing toward 0
ANALYSIS: The α = .05 cutoff is called the significance level. 0.0066 is called the p-value of the sample.
CONCLUSION: Because our p-value (.0066) is less than the significance level (.05), our data suggest that the coin is indeed biased, in favor of Heads. Likewise, our evidence suggests that employee gender in this company is biased, in favor of Males.
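
The tail probabilities listed on this slide can be reproduced exactly from the binomial distribution. The following is a minimal Python sketch, assuming X ~ Binomial(100, 1/2) under the “fair coin” claim; the function name two_sided_tail is illustrative and does not come from the slides.

```python
from math import comb

def two_sided_tail(k, n=100):
    """P(|X - n/2| >= k) when X counts Heads in n tosses of a fair coin (n even)."""
    center = n // 2
    # Count the equally likely toss sequences whose Heads count is
    # at least k away from the center, then divide by all 2**n sequences.
    favorable = sum(comb(n, x) for x in range(n + 1) if abs(x - center) >= k)
    return favorable / 2 ** n

# Rebuild the table for 0 through 14 Heads away from 50.
table = {k: two_sided_tail(k) for k in range(15)}
for k, prob in table.items():
    print(f"at least {k:2d} Heads away from 50: {prob:.4f}")

alpha = 0.05          # significance level from the slide
p_value = table[14]   # observed X = 64 is 14 Heads away from 50
print("reject the fair-coin claim:", p_value < alpha)
```

Counting outcomes with integer arithmetic avoids floating-point rounding until the final division, which is why no statistics library is needed here.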

“Classical Scientific Method”
Hypothesis - Define the study population... What’s the question?
Experiment - Designed to test hypothesis
Observations - Collect sample measurements
Analysis - Do the data formally tend to support or refute the hypothesis, and with what strength? (Lots of juicy formulas...)
Conclusion - Reject or retain hypothesis; is the result statistically significant?
Interpretation - Translate findings in context!
Statistics is implemented in each step of the classical scientific method!

Study Question: How can we estimate “mean age at first birth” of women in the U.S.?
POPULATION: Women in U.S. who have given birth
“Random Variable” X = Age at first birth
Suppose we know that X follows a “normal distribution” (a.k.a. “bell curve”) in the population. That is, the Population Distribution of X ~ N(μ, σ), with mean μ = ??? and standard deviation σ.
μ and σ are “population characteristics,” i.e., “parameters” (fixed, unknown).
HOWEVER… Without knowing every value in the population, it is not possible to determine the exact value of μ with 100% “certainty.”
Random Sample: {x1, x2, x3, x4, …, x400}
FORMULA: sample mean x̄ = (x1 + x2 + … + x400) / 400

Study Question: How can we estimate “mean age at first birth” of women in the U.S.?
The sample mean x̄ is an example of a “sample characteristic” = “statistic” (numerical info culled from a sample). This is called a “point estimate” of μ from the one sample.
Can it be improved, and if so, how?
- Choose a bigger sample, which should reduce “variability.”
- Average the sample means of many samples, not just one. (This introduces “sampling variability.”)
“Sampling Distribution” of x̄ ~ ???
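
Both suggestions can be explored by simulation. The sketch below is only an illustration: it assumes a normal population with μ = 25.4 and σ = 1.5 (the values used later in this deck), and the sample sizes, the 1000 replications, the seed and the function name are arbitrary choices, not part of the slides. It compares the spread of many sample means with σ/√n, the standard result for the standard deviation of a sample mean.

```python
import random
import statistics

def sd_of_sample_means(n, mu=25.4, sigma=1.5, replications=1000, seed=0):
    """Draw many samples of size n from N(mu, sigma); return the SD of their means."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(replications)
    ]
    return statistics.stdev(means)

for n in (25, 400):
    simulated = sd_of_sample_means(n)
    theoretical = 1.5 / n ** 0.5   # sigma / sqrt(n)
    print(f"n = {n:3d}: simulated SD = {simulated:.4f}, sigma/sqrt(n) = {theoretical:.4f}")
```

The larger sample size should give sample means that cluster much more tightly around μ, which is the sense in which a bigger sample “reduces variability.”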

Statistical Inference and Hypothesis Testing
Study Question: How can we estimate “mean age at first birth” of women in the U.S.?
POPULATION: Women in U.S. who have given birth
“Random Variable” X = Age at first birth
Year 2010: Suppose we know that X follows a “normal distribution” (a.k.a. “bell curve”) in the population, with mean μ = 25.4 and standard deviation σ = 1.5. That is, X ~ N(25.4, 1.5).
Present: mean μ = ???, standard deviation σ. Is μ = 25.4 still true? This is the “Null Hypothesis” H0: μ = 25.4. (Possible influences: public education, awareness programs, socioeconomic conditions, etc.)
Or, is the “alternative hypothesis” HA: μ ≠ 25.4 true? I.e., either μ < 25.4 or μ > 25.4? (2-sided)
Random Sample: {x1, x2, x3, x4, …, x400}
FORMULA: sample mean x̄
Does the sample statistic x̄ tend to support H0, or refute H0 in favor of HA?

In order to answer this question, we must account for the amount of variability of different x̄ values, from one random sample of n = 400 individuals to another. We will see three things:
1. 95% CONFIDENCE INTERVAL FOR µ: 25.453 to 25.747. BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.453 and 25.747, with 95% “confidence” (…akin to “probability”).
2. 95% ACCEPTANCE REGION FOR H0: 25.253 to 25.547. IF H0 is true, then we would expect a random sample mean to lie between 25.253 and 25.547, with 95% probability.
3. “P-VALUE” of our sample: IF H0 is true, then we would expect a random sample mean that is at least 0.2 away from 25.4 (as ours was) to occur with probability .00383 in each tail, i.e., about .0077 (= 0.77%) for the two tails combined… VERY RARELY!
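
These three quantities can be recomputed from the numbers on the slides. The sketch below assumes the sample mean was x̄ = 25.6 (inferred from “0.2 away from 25.4” and from the midpoint of the confidence interval; the slide does not show it), a known σ = 1.5, and a normal sampling distribution for x̄; the variable names are illustrative.

```python
from statistics import NormalDist

mu0, sigma, n = 25.4, 1.5, 400   # null value, population SD, sample size (from the slides)
xbar = 25.6                      # ASSUMED sample mean: 0.2 above the null value
alpha = 0.05                     # significance level

se = sigma / n ** 0.5                          # standard error of the sample mean
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for a 95% level
margin = z_crit * se

conf_interval = (xbar - margin, xbar + margin)   # centered at the sample mean
accept_region = (mu0 - margin, mu0 + margin)     # centered at the null value

z = (xbar - mu0) / se
one_tail = 1 - NormalDist().cdf(abs(z))   # P(sample mean >= 25.6) if H0 is true
p_two_sided = 2 * one_tail                # at least 0.2 away in either direction

print(f"95% confidence interval: {conf_interval[0]:.3f} to {conf_interval[1]:.3f}")
print(f"95% acceptance region:   {accept_region[0]:.3f} to {accept_region[1]:.3f}")
print(f"one-tail probability: {one_tail:.5f}, two-sided p-value: {p_two_sided:.5f}")
print("reject H0:", p_two_sided < alpha)
```

Under these assumptions the one-tail probability matches the .00383 printed on the slide, and doubling it gives the two-sided p-value for HA: μ ≠ 25.4. Either figure is below α = .05, so the decision is the same.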

HOW CAN WE USE ANY OR ALL OF THESE THREE OBJECTS TO TEST THE NULL HYPOTHESIS H0: µ = 25.4?

Our data value lies in the 5% REJECTION REGION, i.e., outside the 95% acceptance region for H0.
“P-VALUE” of our sample < SIGNIFICANCE LEVEL (α): the p-value is less than .05.

FORMAL CONCLUSIONS:
- The 95% confidence interval corresponding to our sample mean does not contain the “null value” of the population mean, μ = 25.4.
- The 95% acceptance region for the null hypothesis does not contain the value of our sample mean, x̄ = 25.6.
- The p-value of our sample (.00383 in the upper tail, about .0077 two-sided) is less than the predetermined α = .05 significance level.
Based on our sample data, we may reject the null hypothesis H0: μ = 25.4 in favor of the two-sided alternative hypothesis HA: μ ≠ 25.4, at the α = .05 significance level.
INTERPRETATION: According to the results of this study, there exists a statistically significant difference between the mean ages at first birth in 2010 (25.4 years old) and today, at the 5% significance level. Moreover, the evidence from the sample data suggests that the population mean age today is older than in 2010, rather than younger, by about 0.2 years.

SUMMARY: Why are these methods so important?
They help to distinguish whether or not differences between populations are statistically significant, i.e., genuine, beyond the effects of random chance.
Computationally intensive techniques that were previously intractable are now easily obtainable with modern PCs, etc.
If your particular field of study involves the collection of quantitative data, then eventually you will either:
1 - need to conduct a statistical analysis of your own, or
2 - read another investigator’s methods, results, and conclusions in a book or professional research journal.
Moral: You can run, but you can’t hide….

Study Question: How can we estimate “mean age at first birth” of women in the U.S.?
POPULATION: Women in U.S. who have given birth
“Random Variable” X = Age at first birth
Suppose we know that X follows a “normal distribution” (a.k.a. “bell curve”) in the population. That is, the Population Distribution of X ~ N(μ, σ), with mean μ = ??? and standard deviation σ. μ and σ are “population characteristics,” i.e., “parameters” (fixed, unknown).
Random sample of size n: {x1, x2, x3, x4, …, xn}
FORMULA:
Arithmetic Mean = (x1 + x2 + … + xn) / n
Geometric Mean = (x1 · x2 · … · xn)^(1/n)
Harmonic Mean = n / (1/x1 + 1/x2 + … + 1/xn)
Each of these gives an estimate of μ for a particular sample. Any general sample estimator for μ is denoted by the symbol μ̂. Likewise for other parameters, such as σ.
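
As a small illustration of the three estimators, the Python sketch below applies them to a made-up sample of four ages. The data values are hypothetical and are not taken from the study; the functions are from Python's standard statistics module.

```python
import statistics

ages = [22.0, 25.0, 27.5, 31.0]   # HYPOTHETICAL sample, not data from the study

arithmetic = statistics.fmean(ages)            # (x1 + ... + xn) / n
geometric = statistics.geometric_mean(ages)    # nth root of the product
harmonic = statistics.harmonic_mean(ages)      # n / (sum of reciprocals)

print(f"arithmetic mean: {arithmetic:.3f}")
print(f"geometric mean:  {geometric:.3f}")
print(f"harmonic mean:   {harmonic:.3f}")
```

For positive data the three always satisfy harmonic ≤ geometric ≤ arithmetic, so they generally give different point estimates from the same sample.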

Study Question: How can we estimate “mean age at first birth” of women in the U.S.?
Other possible parameters: standard deviation, median, minimum, maximum.
The sample mean x̄ is a “point estimate” of μ from the one sample. Can it be improved, and if so, how?
- Choose a bigger sample, which should reduce “variability.” How big???
- Average the sample means of many samples, not just one. (This introduces “sampling variability.”) “Sampling Distribution” ~ ???