Statistical Data Analysis 2011/2012 M. de Gunst Lecture 5

Statistical Data Analysis 2
Statistical Data Analysis: Introduction
Topics:
- Summarizing data
- Exploring distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression

Statistical Data Analysis 3
Today's topic: Robust methods (Chapter 5)
5.1. Robust estimators for location: L-estimators, M-estimators; asymptotic influence function (quality of estimator)
5.2. Asymptotic efficiency (quality of estimator)
5.3. Adaptive estimation
5.4. Robust estimators for spread: MAD, M-estimators

Statistical Data Analysis 4
Robust methods: Introduction
Situation: x_1, ..., x_n are realizations of X_1, ..., X_n, independent, with unknown distribution F.
Which assumptions on F?
Robust methods are insensitive to small deviations from the model assumptions:
- large deviations in some observations → outliers
- small deviations in all observations, e.g. rounding
- small deviations in the assumed (in)dependence structure
We consider estimation methods: different types of estimators.
First: estimators for the location of F.

Statistical Data Analysis 5
Robust estimators for location (1)
Situation: x_1, ..., x_n are realizations of X_1, ..., X_n, independent, with unknown distribution F.
First: estimators for the location of F.
Which ones do we know and what do they estimate? (Blackboard)
L-estimators: linear combinations of order statistics: the mean, the median; new: trimmed means.
MLEs; extension: M-estimators: the solution M_n of Σ_{i=1}^n ψ(X_i − M_n) = 0 for a chosen function ψ.
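
As a concrete illustration (my own sketch, not part of the slides), these location estimators in R on a contaminated toy sample; the trimming fraction 0.10 and the Huber constant k = 1.5 are arbitrary choices, and huber() comes from the MASS package:

# Location estimators on a toy sample with two gross outliers
library(MASS)                                  # provides huber(): Huber M-estimator of location
set.seed(1)
x <- c(rnorm(48, mean = 27, sd = 5), -44, -2)  # contaminated sample

mean(x)                   # L-estimator: sample mean (0% trimming)
median(x)                 # L-estimator: sample median
mean(x, trim = 0.10)      # L-estimator: 10% trimmed mean
huber(x, k = 1.5)$mu      # M-estimator with Huber psi, tuning constant k = 1.5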

Statistical Data Analysis 6
Robust estimators for location (2)
Example
Data: Newcomb's measurements to determine the speed of light
> newcomb
How do we estimate the location of these data?

Statistical Data Analysis 7
Robust estimators for location (3)
Example (Newcomb's data continued)
How do we estimate the location of these data?
> mean(newcomb)
[1]    #(rounded)
or
> median(newcomb)
[1] 27
Can we trust these estimates? Which one is better? How do we judge?
I. Influence of outliers – measure of robustness
II. Efficiency – measure of accuracy

Statistical Data Analysis 8
Influence of extreme values on estimator (1)
Example (Newcomb's data continued)
I. What is the influence of extreme values on these estimators? How do we determine this?

#All data:
> mean(newcomb); median(newcomb)
[1]
[1] 27
> sd(newcomb)
[1]

#Without the two smallest values -44 and -2, which are outliers in the boxplot:
> nn = sort(newcomb)[-c(1,2)]; sd(nn)    #check: indeed sd is now smaller
[1] 5.08
> mean(nn); median(nn)
[1]         #rather different
[1] 27.5    #less different

#With one additional value 17, in the range of the data:
> ncplus1 = c(newcomb, 17); mean(ncplus1); median(ncplus1)
[1]         #almost the same as original
[1] 27      #same as original

#With two additional values 17 and 26 intended, but one very extreme value 1726 due to a typo:
> ncplus2 = c(newcomb, 1726); mean(ncplus2); median(ncplus2)
[1]         #very different
[1] 27      #same as original

What do we see? Different answers with and without outliers.
The median is more robust against extreme values than the mean.

Statistical Data Analysis 9
Influence of extreme values on estimator (2)
In the example: X̄_n and Med_n(X_i) differ with and without the outliers. In this case Med_n(X_i) is least influenced.
This does not hold for every underlying F: we need a measure.
For a general measure of the influence on an estimator, consider the influence for large n.
Example on blackboard.

Statistical Data Analysis 10
Asymptotic influence function (1)
For a general measure of the influence on an estimator, consider the influence for large n. It depends on F.
Definition: IF(y) = IF(y, F) = y − E_F X_1 is the asymptotic influence function of the estimator X̄_n for F.
Asymptotic influence function large for large |y|: bad.
The shape of the asymptotic influence function is important: bounded is good.
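
A finite-sample analogue of the IF is the sensitivity curve SC_n(y) = n (T(x_1, ..., x_{n-1}, y) − T(x_1, ..., x_{n-1})); the following R sketch (my own illustration, on simulated data) plots it for the mean and the median, showing the unbounded versus bounded shapes meant here:

# Sensitivity curves: empirical, finite-sample version of the influence function
set.seed(1)
x     <- rnorm(99)                          # fixed sample of size n - 1
ygrid <- seq(-10, 10, length.out = 201)     # positions of one added observation y

sc <- function(est, y) {
  n <- length(x) + 1
  n * (est(c(x, y)) - est(x))               # SC_n(y) = n * (T_n(x, y) - T_{n-1}(x))
}
sc_mean   <- sapply(ygrid, function(y) sc(mean, y))
sc_median <- sapply(ygrid, function(y) sc(median, y))

plot(ygrid, sc_mean, type = "l", xlab = "y", ylab = "sensitivity curve")
lines(ygrid, sc_median, lty = 2)            # bounded, unlike the mean's straight line
legend("topleft", legend = c("mean", "median"), lty = c(1, 2))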

Statistical Data Analysis 11
Asymptotic influence function (2)
The shape of the asymptotic influence function is important: bounded is good.
For M-estimators: IF(y, F) = ψ(y − T(F)) / E_F ψ′(X_1 − T(F)).
(Plots on the slide: IFs of MLEs; IFs of the mean, the median and the α-trimmed mean.)
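
Since this IF is proportional to ψ evaluated at y − T(F), boundedness of the IF comes down to boundedness of ψ; a tiny R sketch (my own, with an arbitrary Huber constant k = 1.5) contrasting the unbounded ψ of the mean with the bounded Huber ψ:

# psi functions determine the shape of the M-estimator's influence function
psi_mean  <- function(u) u                              # mean: psi(u) = u, unbounded
psi_huber <- function(u, k = 1.5) pmax(-k, pmin(u, k))  # Huber: clipped at +/- k

u <- seq(-6, 6, length.out = 201)
plot(u, psi_mean(u), type = "l", ylab = "psi(u)")
lines(u, psi_huber(u), lty = 2)
legend("topleft", legend = c("mean (unbounded)", "Huber, k = 1.5 (bounded)"), lty = c(1, 2))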

Statistical Data Analysis 12
Accuracy of estimator (1)
Example (Newcomb's data continued)
II. How accurate are the estimates? How do we determine this?
What is the sd of the estimator's distribution? How do we determine it?
Derive a bootstrap estimate of the distribution of the estimator; with that, determine an estimate of the sd of the estimator's distribution.

Mean:
> bootmean = bootstrap(newcomb, mean, B = 1000)
> sd(bootmean)
[1] 1.34

Median:
> bootmedian = bootstrap(newcomb, median, B = 1000)
> sd(bootmedian)
[1] 0.68

What do we see? The median has a smaller estimated sd than the mean; the median seems more accurate.
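
The bootstrap() function used above is provided by the course; a minimal hand-rolled stand-in (my assumption of what it does: draw B nonparametric resamples and return the B estimator values) might look like this:

# Nonparametric bootstrap of an estimator: B resamples drawn with replacement,
# the estimator applied to each resample.
boot_estimates <- function(x, estimator, B = 1000) {
  replicate(B, estimator(sample(x, size = length(x), replace = TRUE)))
}

# Usage, mirroring the slide:
#   sd(boot_estimates(newcomb, mean,   B = 1000))   # bootstrap sd of the mean
#   sd(boot_estimates(newcomb, median, B = 1000))   # bootstrap sd of the median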

Statistical Data Analysis 13
Accuracy of estimator (2)
In the example: the estimate of the sd of X̄_n is 1.34; the estimate of the sd of Med_n(X_i) is 0.68. In this case Med_n(X_i) is more accurate.
This does not hold for every underlying F: we need a measure.
For a general measure of the accuracy of an estimator, consider its variance for large n.
Small variance for large n: the estimator is asymptotically efficient.

Statistical Data Analysis 14
Asymptotic variance (1)
For a general measure of the accuracy of an estimator, consider its variance for large n; small variance: the estimator is asymptotically efficient. This depends on F:
often T_n → T(F) as n → ∞ and √n (T_n − T(F)) → N(0, A(F)) in distribution, so Var(T_n) ≈ A(F)/n ← always becomes very small for very large n!
Definition: A(F) is the (normalized) asymptotic variance of T_n for F.
A(F) is a proper general measure for accuracy.

Statistical Data Analysis 15
Asymptotic variance (2)
A(F) is a general measure for the accuracy of an estimator: Var(T_n) ≈ A(F)/n.
Relationship with efficiency: two estimators T1 and T2.
How many observations does T2 need with respect to T1 for the same variance?
For n observations T1 has variance Var(T1_n) ≈ A1(F)/n.
For m observations T2 has variance Var(T2_m) ≈ A2(F)/m.
Asymptotically the variances are equal if A1(F)/n = A2(F)/m, i.e. m = (A2(F)/A1(F)) n.
Definition: A2(F)/A1(F) is the asymptotic relative efficiency of T1 w.r.t. T2.
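
For example (my own illustration, not from the slides): under a normal F, A(F) is σ² for the mean and πσ²/2 for the median, so the asymptotic relative efficiency of the mean w.r.t. the median is π/2 ≈ 1.57; a quick Monte Carlo check in R:

# Monte Carlo check of the asymptotic relative efficiency at the normal distribution:
# Var(median)/Var(mean) over many simulated samples should be close to pi/2.
set.seed(42)
n    <- 200       # sample size (chosen arbitrarily for the illustration)
nsim <- 5000      # number of simulated samples
means   <- replicate(nsim, mean(rnorm(n)))
medians <- replicate(nsim, median(rnorm(n)))
var(medians) / var(means)   # estimated A_median(F) / A_mean(F)
pi / 2                      # theoretical value, about 1.57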

Statistical Data Analysis 16
Asymptotic variance (3)
A(F) is a general measure for the accuracy of an estimator: Var(T_n) ≈ A(F)/n.
Lower bound: A(F) ≥ 1/I_F, with I_F the Fisher information for location under F.
An estimator is most efficient if the lower bound is attained.
This depends on F: A(F) for trimmed means as a function of α (plot on the slide).
An estimator that is a (version of the) MLE for F_0 attains the lower bound if the underlying F belongs to the location family of F_0.
But then we have to know the underlying F to choose the proper estimator!
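
A rough simulation sketch (my own; the sample size, the grid of α values and the t(3) alternative are arbitrary choices) of how n·Var of the α-trimmed mean behaves as a function of α, for a normal F and for a heavy-tailed F:

# Simulated n * Var of the alpha-trimmed mean as a function of alpha,
# for a normal F and a heavy-tailed F (t with 3 degrees of freedom).
set.seed(7)
n      <- 100
nsim   <- 2000
alphas <- c(0, 0.05, 0.10, 0.20, 0.30, 0.50)   # 0.50 gives essentially the median

nvar_trimmed <- function(rdist) {
  sapply(alphas, function(a)
    n * var(replicate(nsim, mean(rdist(n), trim = a))))
}

rbind(normal   = nvar_trimmed(rnorm),
      t3_heavy = nvar_trimmed(function(m) rt(m, df = 3)))
# Under the normal F little trimming is best; under the heavy-tailed t(3),
# moderate trimming clearly reduces the variance.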

Statistical Data Analysis 17
Adaptive estimation
In practice F is unknown:
- investigate the distribution of the data
- choose an estimator that is both robust and has minimal variance for the distribution that you “see”
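
A minimal sketch of this adaptive idea in R (my own illustration; using a Shapiro-Wilk test as the "look at the data" step and a 20% trimmed mean as the robust fallback are assumptions, not the course's recipe):

# Adaptive location estimate: look at the data first, then pick an estimator.
adaptive_location <- function(x) {
  # Diagnostics one would inspect by eye: boxplot(x); qqnorm(x); qqline(x)
  if (shapiro.test(x)$p.value > 0.10) {
    mean(x)                 # data look roughly normal: the mean is efficient
  } else {
    mean(x, trim = 0.20)    # signs of outliers/heavy tails: a trimmed mean is robust
  }
}

# Example: adaptive_location(rnorm(50)) versus adaptive_location(rt(50, df = 2))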

Statistical Data Analysis 18
Robust estimators for spread
Situation: x_1, ..., x_n are realizations of X_1, ..., X_n, independent, with unknown distribution F.
Next: estimators for the spread of F.
Known: the sample sd, the sample variance, the interquartile range.
New: MAD_n = Med_i |X_i − Med_j X_j| (median absolute deviation), and M-estimators for spread: S_n is the solution of an estimating equation of the form Σ_{i=1}^n χ((X_i − M_n)/S_n) = 0.
Which ones are robust?
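
A small R comparison (my own illustration) of these spread estimators before and after one gross outlier is added; note that R's mad() includes the consistency factor 1.4826 so that it estimates σ at the normal:

# Spread estimators with and without a single gross outlier
set.seed(3)
x     <- rnorm(50, mean = 27, sd = 5)
x_out <- c(x, 1726)                       # one wild value, e.g. a typo

spread <- function(z) c(sd = sd(z), IQR = IQR(z), MAD = mad(z))
rbind(clean = spread(x), with_outlier = spread(x_out))
# The sd explodes; the IQR and the MAD barely move.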

Statistical Data Analysis 19
Recap: Robust estimators
5.1. Robust estimators for location: L-estimators, M-estimators; asymptotic influence function – measure of robustness (quality of estimator)
5.2. Asymptotic efficiency – measure of accuracy (quality of estimator)
5.3. Adaptive estimation
5.4. Robust estimators for spread: MAD, M-estimators

Statistical Data Analysis 20
Robust estimators
The end