Statistical Data Analysis 2011/2012
M. de Gunst
Lecture 5
Statistical Data Analysis 2
Statistical Data Analysis: Introduction

Topics
- Summarizing data
- Exploring distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression
Statistical Data Analysis 3
Today's topic: Robust methods (Chapter 5)

5.1. Robust estimators for location
     - L-estimators
     - M-estimators
     - Asymptotic influence function (quality of estimator)
5.2. Asymptotic efficiency (quality of estimator)
5.3. Adaptive estimation
5.4. Robust estimators for spread
     - MAD
     - M-estimators
Statistical Data Analysis 4
Robust methods: Introduction

Situation: x_1, ..., x_n are realizations of X_1, ..., X_n, independent, with unknown distribution F.
Assumptions on F?

Robust methods: insensitive to small deviations from model assumptions:
- large deviations in some observations → outliers
- small deviations in all observations, e.g. rounding
- small deviations in the assumed (in)dependence structure

We consider estimation methods: different types of estimators.
First: estimators for the location of F.
Statistical Data Analysis 5
5.1. Robust estimators for location (1)

Situation: x_1, ..., x_n are realizations of X_1, ..., X_n, independent, with unknown distribution F.
First: estimators for the location of F.
Which ones do we know and what do they estimate? (Blackboard)

L-estimators: linear combinations of order statistics
- mean
- median
- new: trimmed means

MLEs
Extension: M-estimators: M_n is the solution of Σ_i ψ(X_i - M_n) = 0
(An R sketch of these estimators follows below.)
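A minimal R sketch of these location estimators (an illustration, not from the slides); the helper huber_location is hypothetical and assumes Huber's ψ with a fixed MAD scale, solving the M-estimating equation by fixed-point iteration:

# L- and M-estimators for location on a small simulated sample with one outlier.
set.seed(1)
x <- c(rnorm(20), 50)

mean(x)                      # sample mean (L-estimator with equal weights)
median(x)                    # sample median
mean(x, trim = 0.1)          # 10%-trimmed mean: drops the smallest and largest 10%

# Hypothetical helper: Huber M-estimator, i.e. the solution m of
# sum(psi((x_i - m)/s)) = 0 with a fixed robust scale s.
huber_location <- function(x, k = 1.345, tol = 1e-8) {
  s <- mad(x)                                  # fixed robust scale (see section 5.4)
  m <- median(x)                               # robust starting value
  repeat {
    psi   <- pmin(pmax((x - m) / s, -k), k)    # Huber's psi at the scaled residuals
    m_new <- m + s * mean(psi)
    if (abs(m_new - m) < tol) return(m_new)
    m <- m_new
  }
}
huber_location(x)            # close to median and trimmed mean, much less affected by the outlier than the mean

(MASS::huber() gives a comparable Huber location estimate.)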
Statistical Data Analysis 6
Robust estimators for location (2)

Example
Data: Newcomb's measurements to determine the speed of light
> newcomb
 28 26 33 24 34 -44 27 16 40 -2 29 22 24 21 25 30 23 29 31 19 24 20 36 32 36
 28 25 21 28 29 37 25 28 26 30 32 36 26 30 22 36 23 27 27 28 27 31 27 26 33
 26 32 32 24 39 28 24 25 32 25 29 27 28 29 16 23

How to estimate the location of these data?
Statistical Data Analysis 7
Robust estimators for location (3)

Example (Newcomb's data continued)
How to estimate the location of these data?

> mean(newcomb)
[1] 26.21    #(rounded)
or
> median(newcomb)
[1] 27

Can we trust these estimates? Which one is better? How to judge?
I.  Influence of outliers – measure of robustness
II. Efficiency – measure of accuracy
Statistical Data Analysis 8
Influence of extreme values on estimator (1)

Example (Newcomb's data continued)
I. What is the influence of extreme values on these estimators? How to determine?

#All data:
> mean(newcomb); median(newcomb)
[1] 26.21
[1] 27
> sd(newcomb)
[1] 10.75

#Without the two smallest values -44 and -2, which are outliers in the boxplot:
> nn=sort(newcomb)[-c(1,2)]; sd(nn)    #check: indeed sd is now smaller
[1] 5.08
> mean(nn); median(nn)
[1] 27.75    #rather different
[1] 27.5     #less different

#With one additional value 17, within the range of the data:
> ncplus1=c(newcomb,17); mean(ncplus1); median(ncplus1)
[1] 26.07    #almost the same as the original
[1] 27       #same as the original

#With one additional very extreme value 1723, e.g. due to a typo:
> ncplus2=c(newcomb,1723); mean(ncplus2); median(ncplus2)
[1] 51.58    #very different: difference 25.37
[1] 27       #same as the original

What do we see?
- Different answers with and without outliers
- Median more robust against extreme values than mean
Statistical Data Analysis 9
Influence of extreme values on estimator (2)

In the example: the sample mean X̄_n and Med_n(X_i) differ with and without outliers.
In this case Med_n(X_i) is least influenced.
This does not hold for every underlying F: we need a measure.

For a general measure of the influence on an estimator, consider the influence for large n.
Example on blackboard
Statistical Data Analysis 10
Asymptotic influence function (1)

For a general measure of the influence on an estimator, consider the influence for large n.
This depends on F.

Definition: IF(y) = IF(y, F) = y - E_F X_1 is the asymptotic influence function of the estimator X̄_n of the location of F.

Asymptotic influence function large for |y| large: bad.
The shape of the asymptotic influence function is important: bounded is good.
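A finite-sample way to see this (an illustration, not from the slides) is the sensitivity curve, which measures the standardized effect on the estimate of one extra observation placed at y:

# Sensitivity curve: SC_n(y) = (n + 1) * (T(x_1, ..., x_n, y) - T(x_1, ..., x_n)),
# a finite-sample analogue of IF(y, F).
sensitivity_curve <- function(x, stat, ys) {
  n <- length(x)
  sapply(ys, function(y) (n + 1) * (stat(c(x, y)) - stat(x)))
}
set.seed(1)
x  <- rnorm(50)
ys <- seq(-10, 10, by = 0.1)
plot(ys, sensitivity_curve(x, mean, ys), type = "l",
     xlab = "y", ylab = "sensitivity")                # mean: roughly y - mean(x), unbounded
lines(ys, sensitivity_curve(x, median, ys), lty = 2)  # median: bounded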
Statistical Data Analysis 11
Asymptotic influence function (2)

The shape of the asymptotic influence function is important: bounded is good.

For M-estimators with function ψ: IF(y, F) = ψ(y - T(F)) / E_F ψ'(X_1 - T(F))

[Plots on the slide: IFs of MLEs; IFs of the mean, the median and the α-trimmed mean]
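For intuition (a sketch, not part of the slide), the ψ-functions, to which the IF of an M-estimator is proportional, can be drawn directly:

# psi-functions of three location M-estimators; IF(y, F) is proportional to psi(y - T(F)).
y <- seq(-4, 4, by = 0.01)
plot(y, y, type = "l", xlab = "y", ylab = "psi(y)")    # mean: psi(y) = y, unbounded
lines(y, sign(y), lty = 2)                             # median: psi(y) = sign(y), bounded
lines(y, pmin(pmax(y, -1.345), 1.345), lty = 3)        # Huber: bounded and continuous
legend("topleft", legend = c("mean", "median", "Huber"), lty = 1:3)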
Statistical Data Analysis 12
5.2. Accuracy of estimator (1)

Example (Newcomb's data continued)
II. How accurate are the estimates? How to determine?
What is the sd of the estimator's distribution? How to determine?
- Derive the bootstrap estimate of the distribution of the estimator.
- With that, determine an estimate of the sd of the estimator's distribution:

Mean:
> bootmean=bootstrap(newcomb,mean,B=1000)
> sd(bootmean)
[1] 1.34
Median:
> bootmedian=bootstrap(newcomb,median,B=1000)
> sd(bootmedian)
[1] 0.68

What do we see?
Median has smaller estimated sd than mean; the median seems more accurate.
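The bootstrap() function used above is not part of base R and is presumably provided with the course material; a minimal sketch of what such a helper could look like (the course version may differ in details):

# Hypothetical stand-in for the course's bootstrap() function: resample with
# replacement B times and return the B bootstrap replicates of the statistic.
bootstrap <- function(x, statistic, B = 1000) {
  replicate(B, statistic(sample(x, replace = TRUE)))
}
# sd(bootstrap(newcomb, median, B = 1000)) then estimates the sd of the sample median.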
Statistical Data Analysis 13
Accuracy of estimator (2)

In the example:
- estimate of the sd of X̄_n: 1.34
- estimate of the sd of Med_n(X_i): 0.68
In this case Med_n(X_i) is more accurate.
This does not hold for every underlying F: we need a measure.

For a general measure of the accuracy of an estimator, consider its variance for large n.
Small variance for large n: the estimator is asymptotically efficient.
Statistical Data Analysis 14
Asymptotic variance (1)

For a general measure of the accuracy of an estimator, consider its variance for large n.
Small variance: estimator asymptotically efficient.
This depends on F:

Often T_n → T(F) as n → ∞, and √n (T_n - T(F)) → N(0, A(F)) in distribution,
so Var(T_n) ≈ A(F)/n   ← always becomes very small for very large n!

Definition: A(F) is the (normalized) asymptotic variance of T_n for F.
A(F) is a proper general measure for accuracy.
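A quick simulation check of Var(T_n) ≈ A(F)/n (an illustration, not on the slide): for the sample median under the standard normal, A(F) = 1/(4 f(0)^2) = π/2 ≈ 1.57, so n·Var(Med_n) should stabilize around that value.

# n * Var(median) under the standard normal, for increasing n:
set.seed(1)
n_times_var <- function(n, B = 2000) n * var(replicate(B, median(rnorm(n))))
sapply(c(25, 100, 400), n_times_var)   # all roughly pi/2 = 1.57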
Statistical Data Analysis 15
Asymptotic variance (2)

A(F) is a general measure for the accuracy of an estimator: Var(T_n) ≈ A(F)/n.

Relationship with efficiency:
Two estimators T1 and T2.
How many observations does T2 need relative to T1 for the same variance?
- For n observations T1 has variance Var(T1_n) ≈ A1(F)/n
- For m observations T2 has variance Var(T2_m) ≈ A2(F)/m
Asymptotically the variances are equal if A1(F)/n = A2(F)/m, i.e. m = (A2(F)/A1(F)) · n

Definition: A2(F)/A1(F) is the asymptotic relative efficiency of T1 w.r.t. T2.
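A numerical example (not on the slide): take T1 the median and T2 the mean under the standard normal. Then A1(F) = π/2 and A2(F) = 1, so the asymptotic relative efficiency is 2/π ≈ 0.64 and the mean needs only about 64% as many observations; under heavier-tailed distributions the ordering can reverse.

# ARE of the median (T1) w.r.t. the mean (T2) under the standard normal:
A_median <- 1 / (4 * dnorm(0)^2)   # pi/2: asymptotic variance of the median
A_mean   <- 1                      # asymptotic variance of the mean (variance of N(0,1))
A_mean / A_median                  # 2/pi = 0.637, so m = 0.637 * n observations suffice for the mean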
Statistical Data Analysis 16
Asymptotic variance (3)

A(F) is a general measure for the accuracy of an estimator: Var(T_n) ≈ A(F)/n.

Lower bound: A(F) ≥ 1/I_F, where I_F is the Fisher information for location under F.
An estimator is most efficient if the lower bound is attained.
This depends on F. [Plot on the slide: A(F) for trimmed means as a function of α]

An estimator that is a (version of the) MLE for F_0 attains the lower bound if the underlying F belongs to the location family of F_0.
But then one has to know the underlying F to choose the proper estimator!
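A simulation sketch (not on the slide) of how A(F) for trimmed means depends on α and on F: under the normal the untrimmed mean (α = 0) attains the lower bound 1/I_F = 1, whereas under a contaminated normal some trimming pays off.

# n * Var of alpha-trimmed means, under N(0,1) and under a 5%-contaminated normal:
set.seed(1)
sim_A   <- function(rdist, alpha, n = 200, B = 2000)
  n * var(replicate(B, mean(rdist(n), trim = alpha)))
rcontam <- function(n) ifelse(runif(n) < 0.05, rnorm(n, sd = 10), rnorm(n))
sapply(c(0, 0.1, 0.25), function(a)
  c(normal = sim_A(rnorm, a), contaminated = sim_A(rcontam, a)))
# Under the normal, alpha = 0 is best (close to 1); under contamination it is by far the worst.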
Statistical Data Analysis 17
5.3. Adaptive estimation

In practice: F is unknown
- investigate the distribution of the data
- choose an estimator that is robust and has minimal variance for the distribution that you "see"
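One crude illustration of this idea (purely a sketch, not the course's recipe): inspect the data first, e.g. with a normal QQ-plot, and then pick an estimator that suits what you see.

# Inspect the distribution of the Newcomb data, then choose the estimator:
qqnorm(newcomb); qqline(newcomb)      # the two outliers / heavy left tail are clearly visible
# tails heavier than normal -> prefer a robust estimator:
median(newcomb); mean(newcomb, trim = 0.1)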
Statistical Data Analysis 18
5.4. Robust estimators for spread

Situation: x_1, ..., x_n are realizations of X_1, ..., X_n, independent, with unknown distribution F.
Next: estimators for spread of F.

Known: sample sd, sample variance, interquartile range
New:
- MAD_n = Med_n(|X_i - Med_n(X_j)|), the median absolute deviation
- M-estimators for spread: the estimate is the solution of an estimating equation, analogous to the location case

Which ones are robust? (An R illustration follows below.)
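In R (an illustration, not on the slide), the MAD and the other spread estimators for the Newcomb data:

# Spread estimates for the Newcomb data:
mad(newcomb, constant = 1)   # MAD_n = Med_n(|X_i - Med_n(X_j)|)
mad(newcomb)                 # default: rescaled by 1.4826 to be consistent for the sd under normality
sd(newcomb); IQR(newcomb)    # the non-robust sample sd, and the interquartile range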
Statistical Data Analysis 19
Recap: Robust estimators

5.1. Robust estimators for location
     - L-estimators
     - M-estimators
     - Asymptotic influence function – measure of robustness (quality of estimator)
5.2. Asymptotic efficiency – measure of accuracy (quality of estimator)
5.3. Adaptive estimation
5.4. Robust estimators for spread
     - MAD
     - M-estimators
Statistical Data Analysis 20
Robust estimators
The end