Chap 10: Summarizing Data


10.1: INTRO: Univariate or multivariate data (random samples or batches) can be described with procedures that reveal their structure. Graphical displays (empirical CDFs, histograms, …) are to data what PMFs and PDFs are to random variables. This chapter also investigates numerical summaries (measures of location and spread), the effect of outliers on these measures, and graphical summaries (boxplots).

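10.2: CDF-based methods 10.2.1: The Empirical CDF (ecdf) The ECDF is the data analogue of the CDF of a random variable. For a batch $X_1, \dots, X_n$ it is $F_n(x) = \frac{1}{n}\,\#\{i : X_i \le x\}$, the proportion of observations less than or equal to $x$. Its graph is a display that conveniently summarizes a data set.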

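The Empirical CDF (cont’d) Write $F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,x]}(X_i)$ for a random sample $X_1, \dots, X_n$ from $F$. The random variables $I_{(-\infty,x]}(X_i)$ are independent Bernoulli random variables: each equals 1 with probability $F(x)$ and 0 with probability $1 - F(x)$. Hence $nF_n(x)$ is binomial with parameters $n$ and $F(x)$, so $E[F_n(x)] = F(x)$ and $\mathrm{Var}[F_n(x)] = \frac{1}{n}F(x)[1 - F(x)]$.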
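As a minimal sketch of the ECDF described above, the following Python computes the proportion of observations less than or equal to x. The function name `ecdf` and the five-value batch are illustrative assumptions, not part of the slides.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return F_n, where F_n(x) = (number of observations <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def F_n(x):
        # bisect_right gives the count of ordered values that are <= x
        return bisect_right(ordered, x) / n

    return F_n

# Hypothetical batch of five observations (made up for illustration)
F = ecdf([3.1, 1.2, 4.8, 1.2, 2.5])
print(F(1.2), F(3.0), F(5.0))
```

The returned function is a step function that jumps by 1/n at each observation (by 2/n at the tied value 1.2 here).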

10.2.2: The Survival Function In medical or reliability studies, data sometimes consist of times to failure or death, and it is then more convenient to use the survival function $S(t) = 1 - F(t)$ than the CDF. The empirical survival function (ESF) gives the proportion of the data greater than $t$: $S_n(t) = 1 - F_n(t)$. Survival plots (plots of the ESF) can give information about the hazard function, which may be thought of as the instantaneous rate of mortality for an individual alive at time $t$: $h(t) = \frac{f(t)}{1 - F(t)} = -\frac{d}{dt}\log S(t)$.

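The Survival Function (cont’d) From the first-order delta method (page 149), using $\mathrm{Var}[F_n(t)] = \frac{1}{n}F(t)[1 - F(t)]$: $\mathrm{Var}\{\log[1 - F_n(t)]\} \approx \frac{\mathrm{Var}[1 - F_n(t)]}{[1 - F(t)]^2} = \frac{1}{n}\,\frac{F(t)}{1 - F(t)}$. This shows how unreliable the empirical log-survival function is for large values of $t$: the variance grows without bound as $F(t) \to 1$.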
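A small sketch of the empirical survival function and of the first-order variance approximation F(t) / (n (1 - F(t))) for the empirical log-survival function. The function names and the numbers used are illustrative assumptions.

```python
def empirical_survival(sample):
    """Return S_n, where S_n(t) = proportion of observations greater than t."""
    n = len(sample)

    def S_n(t):
        return sum(1 for x in sample if x > t) / n

    return S_n

def approx_var_log_survival(F_t, n):
    """First-order (delta method) approximation to Var{log[1 - F_n(t)]}.

    F_t is the true CDF value F(t), assumed known here for illustration.
    """
    return F_t / (n * (1.0 - F_t))

S = empirical_survival([1, 2, 3, 4])   # hypothetical failure times
print(S(2))

# The approximate variance grows rapidly as F(t) approaches 1
for F_t in (0.5, 0.9, 0.99):
    print(F_t, approx_var_log_survival(F_t, n=100))
```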

10.2.3: Q-Q Plots (quantile-quantile plots) Useful for comparing CDFs by plotting the quantiles of one dist’n against the quantiles of another dist’n.

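10.2.3: Q-Q Plots (cont’d) Let $x_p$ and $y_p$ denote the $p$th quantiles of the dist’ns $F$ and $G$. Additive model: $y_p = x_p + h$, that is, $G(y) = F(y - h)$; the Q-Q plot is a straight line with slope 1 and intercept $h$. Multiplicative model: $y_p = c\,x_p$ with $c > 0$, that is, $G(y) = F(y/c)$; the Q-Q plot is a straight line through the origin with slope $c$.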
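The additive and multiplicative models can be illustrated with a short sketch. It assumes two batches of equal size, so that the Q-Q plot reduces to plotting the ordered values of one batch against the ordered values of the other. The helper `qq_pairs` and the data are hypothetical.

```python
def qq_pairs(xs, ys):
    """Pair the order statistics of two equal-sized batches (Q-Q plot points)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes batches of equal size")
    return list(zip(sorted(xs), sorted(ys)))

control = [2.0, 5.0, 3.0, 9.0]               # hypothetical batch
shifted = [x + 1.5 for x in control]         # additive model with h = 1.5
scaled = [2.0 * x for x in control]          # multiplicative model with c = 2

# Additive model: every point lies on the line y = x + 1.5
print(qq_pairs(control, shifted))
# Multiplicative model: every point lies on the line y = 2x
print(qq_pairs(control, scaled))
```

For batches of unequal size one would instead compare interpolated quantiles; that case is not covered by this sketch.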

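10.3: Histograms, Density Curves & Stem-and-Leaf Plots Kernel PDF estimate: $f_h(x) = \frac{1}{nh}\sum_{i=1}^{n} w\!\left(\frac{x - X_i}{h}\right)$, where the kernel $w$ is a smooth PDF symmetric about 0 (for example, the standard normal density) and the bandwidth $h$ controls the amount of smoothing.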
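A minimal sketch of a kernel PDF estimate, assuming a standard normal kernel and a hand-picked bandwidth h. Both choices, the function names, and the data are assumptions made for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel w."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(sample, h):
    """Return f_h, where f_h(x) = (1 / (n h)) * sum of w((x - X_i) / h)."""
    n = len(sample)

    def f_h(x):
        return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)

    return f_h

f = kde([1.0, 2.0, 2.5, 4.0], h=0.5)   # hypothetical batch and bandwidth
print(f(2.0), f(8.0))
```

A larger h gives a smoother curve and a smaller h a rougher one; in either case the estimate integrates to 1.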

10.4: Location Measures 10.4.1: The Arithmetic Mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is sensitive to outliers (not robust). 10.4.2: The Median is a robust measure of location. 10.4.3: The Trimmed Mean is another robust location measure.
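The sensitivity of the mean, and the robustness of the median and trimmed mean, can be seen in a small sketch. The trimming rule used here (drop the floor(n·alpha) smallest and largest values) is one common convention, and the data are made up.

```python
import statistics

def trimmed_mean(sample, alpha):
    """Mean after discarding the floor(n * alpha) smallest and largest values.

    Assumes 0 <= alpha < 0.5 so that at least one observation is kept.
    """
    ordered = sorted(sample)
    k = int(len(ordered) * alpha)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

clean = [1, 2, 3, 4, 5]
contaminated = [1, 2, 3, 4, 100]   # one gross outlier (hypothetical)

for batch in (clean, contaminated):
    print(statistics.mean(batch), statistics.median(batch), trimmed_mean(batch, 0.2))
```

Replacing 5 by 100 moves the mean from 3 to 22, while the median and the 20% trimmed mean stay at 3.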

Location Measures (cont’d) The trimmed mean (discard only a certain number of the observations) is a natural compromise between the mean (discard no observations) and the median (discard all but 1 or 2 observations). Another compromise between the mean and the median was proposed by Huber (1981), who suggested minimizing $\sum_{i=1}^{n}\rho\!\left(\frac{X_i - \mu}{\sigma}\right)$ over $\mu$, or equivalently solving $\sum_{i=1}^{n}\psi\!\left(\frac{X_i - \mu}{\sigma}\right) = 0$, where $\psi = \rho'$. The solution is called an M-estimate.

10.4.4: M-Estimates (Huber, 1964)

10.4.4: M-Estimates (cont’d) M-estimates coincide with MLEs when $\rho = -\log f$, where $f$ is the density of the standardized observations, because minimizing $\sum \rho$ is then the same as maximizing the log-likelihood. Computing an M-estimate is a nonlinear minimization problem that must be solved with an iterative method (such as Newton-Raphson, …). The minimizer is unique when $\rho$ is strictly convex. Here we assume that $\sigma$ is known; in practice, a robust estimate of $\sigma$ (see Section 10.5) should be used instead.
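A hedged sketch of computing an M-estimate of location with σ treated as known. It assumes Huber's ψ (the identity clipped at ±k, with k = 1.5 chosen for illustration) and uses iteratively reweighted averaging rather than Newton-Raphson; the function names and data are hypothetical.

```python
import statistics

def huber_psi(r, k=1.5):
    """Huber's psi: the identity on [-k, k], clipped to -k or k outside."""
    return max(-k, min(k, r))

def huber_m_estimate(sample, sigma, k=1.5, tol=1e-10, max_iter=200):
    """Solve sum psi((x_i - mu) / sigma) = 0 by iteratively reweighted averaging."""
    mu = statistics.median(sample)           # robust starting value
    for _ in range(max_iter):
        weights = []
        for x in sample:
            r = (x - mu) / sigma
            # weight psi(r) / r: 1 for small residuals, k / |r| for large ones
            weights.append(1.0 if abs(r) <= k else k / abs(r))
        new_mu = sum(w * x for w, x in zip(weights, sample)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

batch = [0.0, 1.0, 2.0, 3.0, 50.0]           # hypothetical data with one outlier
mu_hat = huber_m_estimate(batch, sigma=2.0)  # sigma assumed known
print(mu_hat)
```

At a fixed point the weighted residuals sum to zero, which is the same condition as Σψ = 0, so the outlier at 50 contributes only a bounded amount to the estimate.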

10.4.5: Comparison of Location Estimates Which of the location estimates introduced in this section is best? There is no easy answer. For a symmetric underlying dist’n, all 4 statistics (sample mean, sample median, alpha-trimmed mean, and M-estimate) estimate the center of symmetry. For a nonsymmetric underlying dist’n, these 4 statistics estimate 4 different pop’n parameters: the pop’n mean, the pop’n median, the pop’n trimmed mean, and a functional of the CDF defined through the weight function $\psi$. Idea: run some simulations, compute more than one estimate of location, and pick the winner.

10.4.6: Estimating Variability of Location Estimates by the Bootstrap Using a computer, we can generate (simulate) a large number $B$ of samples of size $n$ from a common known dist’n $F$. From each sample, we compute the value of the location estimate $\hat{\theta}$. The empirical dist’n of the resulting values $\theta^*_1, \dots, \theta^*_B$ is a good approximation (for large $B$) to the dist’n function of $\hat{\theta}$. Unfortunately, $F$ is NOT known in general. The bootstrap plugs in the empirical CDF $F_n$ for $F$ and resamples from $F_n$.

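10.4.6: Bootstrap (cont’d) A sample of size $n$ from $F_n$ is a sample of size $n$ drawn with replacement from the observed data that produced $F_n$. Thus, the standard deviation of the estimate $\hat{\theta}$ is estimated by $s_{\hat{\theta}} = \sqrt{\frac{1}{B}\sum_{i=1}^{B}(\theta^*_i - \bar{\theta}^*)^2}$, where $\theta^*_1, \dots, \theta^*_B$ are the values of the estimate computed from the $B$ resamples and $\bar{\theta}^*$ is their mean. Read Example A on page 368. The bootstrap dist’n can be used to form an approximate CI and to test hypotheses.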
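The resampling scheme can be sketched as follows: draw B resamples of size n with replacement from the observed data, apply the location estimate to each, and take the standard deviation of the B values. The function names, the seed, and the data are illustrative assumptions.

```python
import random
import statistics

def bootstrap_values(sample, estimator, B=1000, seed=0):
    """Apply `estimator` to B resamples of size n drawn with replacement."""
    rng = random.Random(seed)            # fixed seed so the sketch is repeatable
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(B)]

def bootstrap_se(sample, estimator, B=1000, seed=0):
    """Standard deviation (divisor B) of the bootstrap values of the estimate."""
    return statistics.pstdev(bootstrap_values(sample, estimator, B, seed))

# Hypothetical batch (made up for illustration)
batch = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.0]
print(bootstrap_se(batch, statistics.mean))
print(bootstrap_se(batch, statistics.median))
```

Passing a different estimator (a trimmed mean or an M-estimate, for instance) gives its bootstrap standard error in the same way.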

10.5: Measures of Dispersion A measure of dispersion (scale) gives a numerical indication of the “scatteredness” of a batch of numbers. The most common measure of dispersion is the sample standard deviation $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$. Like the sample mean, the sample standard deviation is NOT robust (it is sensitive to outliers). Two simple robust measures of dispersion are the IQR (interquartile range, the difference between the upper and lower quartiles) and the MAD (median absolute deviation from the median).
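A short sketch comparing the three dispersion measures on a batch with and without an outlier. Quartile conventions vary between textbooks and packages; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, and the data are hypothetical.

```python
import statistics

def iqr(batch):
    """Interquartile range, using the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(batch, n=4, method="inclusive")
    return q3 - q1

def mad(batch):
    """Median absolute deviation from the median."""
    center = statistics.median(batch)
    return statistics.median(abs(x - center) for x in batch)

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
contaminated = [1, 2, 3, 4, 5, 6, 7, 8, 900]   # one gross outlier

for batch in (clean, contaminated):
    print(statistics.stdev(batch), iqr(batch), mad(batch))
```

The outlier inflates the standard deviation by two orders of magnitude while leaving the IQR and the MAD unchanged.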

10.6: Box Plots Tukey invented a graphical display (the boxplot) that indicates the center of a data set (median), the spread of the data (IQR), and the presence of possible outliers. The boxplot also gives an indication of the symmetry or asymmetry (skewness) of the dist’n of data values. Later, we will see how boxplots can be used effectively to compare batches of numbers.
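The quantities a boxplot displays can be computed with a sketch like the one below. The slides do not state the outlier rule; this sketch assumes the common convention of flagging points more than 1.5 IQR beyond the quartiles, and the "inclusive" quartile method. The function name and data are hypothetical.

```python
import statistics

def boxplot_summary(batch, fence=1.5):
    """Median, quartiles, whisker ends, and flagged outliers for a boxplot.

    Assumes the 1.5 * IQR fence convention (an assumption, not from the slides).
    """
    ordered = sorted(batch)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    spread = q3 - q1
    low_fence, high_fence = q1 - fence * spread, q3 + fence * spread
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whiskers": (inside[0], inside[-1]),   # most extreme non-outlying values
        "outliers": outliers,
    }

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

Unequal distances from the median to the two quartiles, or whiskers of unequal length, suggest skewness.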

10.7: Conclusion Several graphical tools were introduced in this chapter as methods of presenting and summarizing data. Some aspects of the sampling dist’ns of these summaries were discussed, assuming a stochastic model for the data. Bootstrap methods (approximating a sampling dist’n and its functionals) were also revisited.

Parametric Bootstrap: Example: Estimating a pop’n mean. It is known that explosives used in mining leave a circular crater whose diameter follows an exponential dist’n. Suppose a new form of explosive is tested. The sample crater diameters (cm) are as follows: (sample values not shown). It would be inappropriate to use $\bar{x} \pm t_{0.05,\,19}\, s/\sqrt{20}$ as a 90% CI for the pop’n mean via the t-curve (df=19)

Parametric Bootstrap: (cont’d) because such a CI is based on the normality assumption for the parent pop’n. The parametric bootstrap replaces the exponential pop’n dist’n F with unknown mean $\mu$ by the known exponential dist’n F* with mean $\bar{x}$, the sample mean. Then resamples of size n=20 are drawn from this surrogate pop’n. Using Minitab, we can generate B=1000 such samples of size n=20 and compute the sample mean of each of these B samples. A bootstrap CI can be obtained by trimming off 5% from each tail. Thus, a parametric bootstrap 90% CI runs from the 50th smallest to the 951st smallest of the B=1000 bootstrap sample means (numerical values not shown).
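The parametric bootstrap procedure above can be sketched in Python instead of Minitab. The 20 crater diameters below are made up for illustration and are not the values from the slides; the function name and the seed are likewise assumptions.

```python
import random

def parametric_bootstrap_ci(sample, B=1000, seed=0):
    """90% CI for an exponential mean from B parametric bootstrap sample means."""
    rng = random.Random(seed)
    n = len(sample)
    xbar = sum(sample) / n
    means = []
    for _ in range(B):
        # resample from the surrogate exponential dist'n with mean xbar
        resample = [rng.expovariate(1.0 / xbar) for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    k = (5 * B) // 100                   # trim 5% from each tail
    return means[k - 1], means[B - k]    # 50th and 951st smallest when B = 1000

# Hypothetical crater diameters in cm (NOT the data from the slides)
diameters = [12.4, 3.1, 25.7, 8.9, 1.6, 17.3, 6.2, 30.5, 4.4, 9.8,
             14.1, 2.7, 21.9, 7.5, 11.2, 5.3, 40.8, 0.9, 16.6, 10.0]
low, high = parametric_bootstrap_ci(diameters)
print(low, high)
```

`random.expovariate` takes the rate, so the mean $\bar{x}$ is passed as `1.0 / xbar`.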

Non-Parametric Bootstrap: If we do not assume that we are sampling from a normal pop’n or a pop’n of some other specified shape, then we must extract all the information about the pop’n from the sample itself. Nonparametric bootstrapping builds a sampling dist’n for our estimate by drawing samples with replacement from the original (raw) data. Thus, a nonparametric bootstrap 90% CI for the pop’n mean $\mu$ is obtained by taking the 5th and 95th percentiles of the resample means $\bar{x}^*$ among these resamples.
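A matching sketch of the nonparametric version: resample the raw data with replacement and read off the 5th and 95th percentiles of the resample means. Percentile conventions differ slightly between packages; this sketch simply takes ordered values, and the data and names are hypothetical.

```python
import random

def percentile_bootstrap_ci(sample, B=1000, seed=0):
    """90% percentile CI for the mean from B resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(B))
    k = (5 * B) // 100                   # 5% of the B resample means
    return means[k - 1], means[B - k]    # approximate 5th and 95th percentiles

# Hypothetical raw data (made up for illustration)
raw = [12.4, 3.1, 25.7, 8.9, 1.6, 17.3, 6.2, 30.5, 4.4, 9.8,
       14.1, 2.7, 21.9, 7.5, 11.2, 5.3, 40.8, 0.9, 16.6, 10.0]
low_np, high_np = percentile_bootstrap_ci(raw)
print(low_np, high_np)
```

No dist’n shape is assumed here: every resampled value is one of the observed values.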