Outliers and Influential Points


Outliers and Influential Points Erik Johnson AP Statistics 5/25/04 erik.PPT

Definitions
Outlier: A value in a data set that does not fit with the rest of the data.
Influential point: A point in a data set that has leverage on the regression line, so that removing it changes the slope or intercept substantially.
Leverage: A point's ability to pull the regression line toward itself. Points whose X values lie far from the rest of the data have the most leverage.
Q1, Q3: The first and third quartiles. Approximately the middle half of the data lies between them.
Interquartile range (IQR): Q3 - Q1.

Outliers
As a rule of thumb, data points more than 2 standard deviations away from the mean of the data set.
Data points that do not fit the pattern set by the rest of the data.
In regression, any data point that has an unusually large residual.
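
The 2-standard-deviation rule of thumb can be sketched in a few lines of Python. This is only an illustration: the data values are hypothetical, the function name `flag_by_two_sd` is made up for this example, and the sample standard deviation (n - 1 denominator) is an assumption, since the slide does not say which version to use.

```python
from statistics import mean, stdev

def flag_by_two_sd(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical data, chosen only for illustration.
data = [10, 12, 11, 13, 12, 11, 30]
flagged = flag_by_two_sd(data)
print(flagged)
```

Note that an extreme value inflates both the mean and the standard deviation, so this rule can miss outliers in small data sets.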

How can I tell if a point in my data set is an outlier?
Take the IQR (interquartile range) of your data set and multiply it by 1.5. Subtract that number from Quartile 1 and add it to Quartile 3. Any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR can be considered an outlier. Now you try a sample problem on outliers!
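
A minimal Python sketch of the 1.5 × IQR rule, with the lower fence at Q1 - 1.5 × IQR and the upper fence at Q3 + 1.5 × IQR. The data set and the name `iqr_outliers` are hypothetical. Quartile conventions also differ between textbooks and calculators, so the choice of `method="inclusive"` in `statistics.quantiles` is an assumption that may not match every hand calculation.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, outliers) using the k * IQR rule."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    step = k * (q3 - q1)
    lower, upper = q1 - step, q3 + step
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical data, chosen only for illustration.
data = [2, 14, 15, 16, 17, 18, 19, 20, 35]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```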

Sample Problem on IQR
In a data set with five-number summary [12, 18, 19, 21, 25], how many values can be considered outliers?
A) None
B) Exactly 1
C) At least 1
D) Exactly 2
E) At least 2

IF YOU ANSWERED C... YOU'RE RIGHT!
The interquartile range for this data set is 21 - 18 = 3, and multiplying it by 1.5 gives 4.5. Adding 4.5 to Q3 = 21 gives 25.5, which is larger than the maximum value of 25, so there are no outliers on the upper side of the data. Subtracting 4.5 from Q1 = 18 gives 13.5. The minimum value of 12 lies below 13.5, so there is at least 1 outlier in the data set. The five-number summary does not show whether any other values also fall below 13.5, which is why the answer is "at least 1" rather than "exactly 1".
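
The arithmetic for this sample problem can be checked with a short Python sketch. Only the five-number summary comes from the slide; the variable names are chosen for this illustration.

```python
# Five-number summary from the sample problem: min, Q1, median, Q3, max.
minimum, q1, median, q3, maximum = 12, 18, 19, 21, 25

iqr = q3 - q1                    # 21 - 18
lower_fence = q1 - 1.5 * iqr     # values below this are outliers
upper_fence = q3 + 1.5 * iqr     # values above this are outliers

# Only the extremes are known, so the summary can confirm that an
# outlier exists on a side, but not how many there are.
low_outlier_exists = minimum < lower_fence
high_outlier_exists = maximum > upper_fence
print(lower_fence, upper_fence, low_outlier_exists, high_outlier_exists)
```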

Influential Points
Influential points are normally outliers in the X direction, but they are not always outliers in terms of regression: they need not have large residuals.
A point is said to be influential if it is responsible for substantial changes to the least-squares regression (LSR) line.
A point with an extreme X value has leverage on the line and is treated here as an influential point.

To the right is a chart of a data set with a perfect linear fit: r^2 = 1 and regression equation Y = X. There are no outliers in either the X or the Y direction.

Now look at this graph. The point whose X value was previously 5 has been moved to X = 8. The regression equation has changed, and the r^2 value has decreased noticeably.

Watch how the regression line changes as the point (8,5) is added. The point (8,5) is an influential point in this data set.
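
The effect of moving a point to (8,5) can be reproduced numerically. The slide's actual data set is not included in this transcript, so the sketch below assumes the original points were (1,1) through (5,5), which is consistent with Y = X and r^2 = 1. The helper name `least_squares` is also just for this illustration.

```python
def least_squares(points):
    """Return (slope, intercept, r_squared) of the least-squares line."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Assumed data: five points on the line Y = X.
before = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
# The point at X = 5 is moved to X = 8, keeping Y = 5.
after = before[:-1] + [(8, 5)]

slope_before, intercept_before, r2_before = least_squares(before)
slope_after, intercept_after, r2_after = least_squares(after)
print(slope_before, intercept_before, r2_before)
print(slope_after, intercept_after, r2_after)
```

Under these assumed points, the slope falls from 1 to roughly 0.55 and r^2 falls from 1 to roughly 0.88, which shows how much a single point far out in the X direction can move the line.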

Sample Problem on Influential Points
Given the plot below, which of the following can you conclude about the data point in the upper right-hand corner?
A) It is an outlier in the regression
B) It is an influential point
C) It does not fit the pattern of the data
D) It has a large residual
E) All of the above

The correct answer is... B

Explanation for Sample Question
Since the data point in question seems to fit the general pattern of the other observations in the data set, there is no evidence to call it an outlier in terms of regression. Likewise, it will not have a large residual when an LSR line is fit to the data. This data point IS an influential point, because its X value differs greatly from the others in the set, which gives it leverage on the LSR line.
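
A rough numerical illustration of this explanation: a point far to the upper right that follows the trend has a small residual but a large leverage. The plot's data are not in this transcript, so the values below are hypothetical. The leverage formula h = 1/n + (x - x_bar)^2 / Sxx is the standard one for simple linear regression and is not taken from the slides.

```python
def residuals_and_leverages(xs, ys):
    """Fit the least-squares line and return (residuals, leverages)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Leverage grows with the squared distance of x from the mean of the x values.
    leverages = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    return residuals, leverages

# Hypothetical data: five clustered points plus one far to the upper right
# that still follows the same linear pattern.
xs = [1, 2, 3, 4, 5, 15]
ys = [1.2, 1.9, 3.1, 3.9, 5.1, 15.0]
residuals, leverages = residuals_and_leverages(xs, ys)
print(residuals[-1], leverages[-1])
```

With these made-up values, the last point has the smallest residual in absolute value and by far the largest leverage, matching the reasoning behind answer B.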

THE END