
Use of Quantile Functions in Data Analysis

In general, quantile functions (sometimes referred to as inverse cumulative distribution functions or percent point functions) return the value $x$ at which $P(X \le x)$ equals a specified cumulative probability $p$, for a given distribution.

Recall the normal distribution

So for the specified cumulative probability of 0.025, the value of $x$ turns out to be $-1.96$. In other words, $\Phi(-1.96) = 0.025$, so $\Phi^{-1}(0.025) = -1.96$. Notice, then, that $\Phi^{-1}$ is the quantile function, $Q$, of the standard normal distribution.
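The relationship between $\Phi$ and $\Phi^{-1}$ can be checked numerically. The following is a minimal sketch using the base R functions pnorm (the N(0, 1) distribution function) and qnorm (its quantile function); the variable names are illustrative only.

```r
# Phi(-1.96): cumulative probability below -1.96 under N(0, 1)
p <- pnorm(-1.96)

# Phi^{-1}(0.025): the quantile function evaluated at 0.025
x <- qnorm(0.025)

print(round(p, 3))  # 0.025
print(round(x, 2))  # -1.96
```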

Idealised Samples. These are produced by first defining a sequence of $n$ probability points using $p_i = \frac{i - \frac{1}{2}}{n}$, $i = 1, 2, \ldots, n$. Note that these are evenly spaced between 0 and 1.

For example, for $n = 20$, we have $p_1, p_2, \ldots, p_{20}$ equal to $0.025, 0.075, \ldots, 0.975$. Probability points may be produced with the R function ppoints.
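A short sketch of how the probability points might be generated, comparing ppoints with the formula $p_i = (i - \frac{1}{2})/n$ directly. Note that ppoints uses this offset of 1/2 only for $n > 10$; for smaller $n$ its default offset is 3/8, so the two would then differ.

```r
n <- 20

# Probability points from the definition p_i = (i - 1/2) / n
p_formula <- (seq_len(n) - 0.5) / n

# The same points from the built-in function (default offset is 1/2 when n > 10)
p <- ppoints(n)

print(p[1:3])
print(isTRUE(all.equal(p, p_formula)))
```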

Now suppose that a distribution has quantile function $Q$. Then, if the random variable $Z$ has qf $Q$, from the original definition, $P(Z \le Q(p_i)) = p_i$, $i = 1, \ldots, n$. Hence $Q(p_1), \ldots, Q(p_n)$ may be regarded as an idealised sample of size $n$.

The Uniform Distribution. [Figure: density of the U(0, 1) distribution, with the area 0.2 to the left of 0.2 shaded, and in general the area $p$ to the left of $p$.] The quantile function is easy here, e.g. $Q(0.2) = 0.2$. In general, $Q(p) = p$.

Hence $p_1, \ldots, p_n$ may be regarded as an idealised sample of size $n$ from the U(0, 1) distribution.

The Normal Distribution. We have already seen that the normal distribution has quantile function given by $\Phi^{-1}$. Thus an idealised N(0, 1) sample of size 20 will be $\Phi^{-1}(0.025), \Phi^{-1}(0.075), \Phi^{-1}(0.125), \ldots$

In R the result is found with qnorm

A histogram can be plotted using hist(z)
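The two steps above might look as follows in R. This is a sketch only: the name z follows the hist(z) call in the text, and the plot title is an illustrative choice.

```r
# Idealised N(0, 1) sample of size 20: the normal quantile function
# applied to the probability points 0.025, 0.075, ..., 0.975
z <- qnorm(ppoints(20))
print(round(z, 3))

# Histogram of the idealised sample
hist(z, main = "Idealised N(0, 1) sample of size 20")
```

Because the probability points are symmetric about 1/2, the idealised sample is symmetric about 0.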

Exponential Distribution. The Exp(1) distribution has quantile function $Q$ given by $Q(p) = -\ln(1 - p)$ (exercise!). In R it is given by the function qexp. Thus an idealised Exp(1) sample of size 20 is $Q(0.025), Q(0.075), \ldots, Q(0.975)$, and its histogram can be plotted with hist as before.
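A sketch of the idealised Exp(1) sample, which also checks numerically that qexp agrees with the formula $Q(p) = -\ln(1 - p)$ (without giving away the derivation). Variable names and the plot title are illustrative.

```r
p <- ppoints(20)

# Idealised Exp(1) sample (qexp uses rate = 1 by default)
x <- qexp(p)

# The same values from the closed-form quantile function
x_formula <- -log(1 - p)
print(isTRUE(all.equal(x, x_formula)))

hist(x, main = "Idealised Exp(1) sample of size 20")
```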

Empirical Quantile Functions. Suppose that we have a set of $n$ observations $y_1, \ldots, y_n$ of a variable $y$. Let $y_{(1)} \le \cdots \le y_{(n)}$ be their order statistics. Analogously to earlier work we define the empirical quantile function $Q_e$ of these data by $Q_e(p_i) = y_{(i)}$, with linear interpolation for other values of $p$ in the interval (0, 1).

Thus, for any $p$, $Q_e(p)$ is essentially the value of the variable $y$ below which a fraction $p$ of the observations lie. Example: the three quartiles of the distribution of the data are given by $Q_e(1/4)$, $Q_e(1/2)$ and $Q_e(3/4)$.
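The definition can be sketched directly in R. The helper empirical_qf below is hypothetical (it is not from the slides), and holding the function constant at the smallest and largest observation for $p$ outside $[p_1, p_n]$ is an assumption, since the slides do not say how such $p$ are handled.

```r
# Hypothetical helper: Q_e(p_i) = y_(i) at p_i = (i - 1/2)/n,
# with linear interpolation in between
empirical_qf <- function(y, p) {
  n <- length(y)
  p_i <- (seq_len(n) - 0.5) / n
  # rule = 2: outside [p_1, p_n], return the nearest order statistic (assumption)
  approx(x = p_i, y = sort(y), xout = p, rule = 2)$y
}

y <- c(4, 1, 10, 3)          # order statistics: 1, 3, 4, 10
quartiles <- empirical_qf(y, c(0.25, 0.5, 0.75))
print(quartiles)             # 2.0 3.5 7.0
```

This definition appears to correspond to quantile(y, p, type = 5) in R; the default for quantile is type 7, which interpolates slightly differently.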

Example Maximum daily ozone concentrations. The R data frame ozone is made available on the module web pages. It gives maximum daily ozone concentrations, in parts per billion, measured at Stamford, Connecticut and Yonkers, New York, during the period May - September one year. Data are given only for the 132 days (out of a total of 152) on which readings were successfully obtained at both locations.

The sorted Stamford data are given by

The R code
> hist(stamford, xlab="ozone concentration", main="Stamford ozone concentrations")
> plot(density(stamford), main="density estimate for Stamford ozone concentrations")
produces the histogram and (kernel) density estimate shown in the next slides. (For details of the latter, see Section 5.6 of Venables and Ripley (2002).)

A simple summary is given by
> summary(stamford)
Min. 1st Qu. Median Mean 3rd Qu. Max.
while much the same is achieved with the R (empirical) quantile function, which by default evaluates this at p = (0.00, 0.25, 0.50, 0.75, 1.00):
> quantile(stamford)
0% 25% 50% 75% 100%

Both the histogram and the summaries show the data to be quite positively skewed. This is quite typical of data constrained to be positive. We can specify any vector of points at which to evaluate the quantile function:
> p = seq(0,1,0.1)
> p
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
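To illustrate evaluating the quantile function at such a vector, here is a sketch using simulated data. The vector y is a hypothetical, positively skewed stand-in of the same length (132) and is NOT the Stamford data, which are not reproduced in this transcript.

```r
set.seed(1)
# Hypothetical stand-in for the ozone readings (NOT the real Stamford data)
y <- rgamma(132, shape = 2, scale = 20)

p <- seq(0, 1, 0.1)

# Empirical quantiles (deciles) at the chosen probabilities
q <- quantile(y, probs = p)
print(q)
```

The first and last entries are the minimum and maximum of the data, as with the default call.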

Q-Q Plots. We wish to investigate whether observations $y_1, \ldots, y_n$ may reasonably be regarded as a random sample from some theoretical distribution. A Q-Q plot compares the empirical quantile function $Q_e$ of the data with the theoretical quantile function $Q$ of the distribution.

We plot $y_{(i)}$ $(= Q_e(p_i))$ against $Q(p_i)$ for the probability points $p_1, p_2, \ldots, p_n$ as defined earlier. If the plotted points lie approximately on a straight line with slope 1 and intercept 0, i.e. $y_{(i)} \approx Q(p_i)$, the observed values are said to be compatible with the theoretical distribution.

Refinement. If the plotted points do not lie on the specific line just mentioned, but instead lie on a line with slope $b$ and intercept $a$, i.e. $y_{(i)} \approx a + bQ(p_i)$, this indicates compatibility with the appropriate distribution scaled by $b$ and shifted by $a$.

For example, if $Q$ is the quantile function of the N(0, 1) distribution, then the relation suggests compatibility with the $N(a, b^2)$ distribution.
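The scale-and-shift interpretation can be illustrated with a simulated sample. In this sketch the values a = 10 and b = 2 are arbitrary, and the least-squares line is an illustrative choice of fit rather than the method used in the slides.

```r
set.seed(1)
a <- 10   # shift (mean), chosen arbitrarily for illustration
b <- 2    # scale (standard deviation), chosen arbitrarily
y <- rnorm(200, mean = a, sd = b)

# Theoretical N(0, 1) quantiles at the probability points
q <- qnorm(ppoints(length(y)))

# Q-Q plot from first principles: order statistics against Q(p_i)
plot(q, sort(y), xlab = "N(0, 1) quantiles", ylab = "Sorted data")

# Intercept and slope of a least-squares line through the plotted points
fit <- coef(lm(sort(y) ~ q))
print(fit)
```

The fitted intercept and slope should be close to a and b respectively, in line with compatibility with $N(a, b^2)$.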

Stamford Ozone Concentrations. We check for normality. A normal Q-Q plot compares the empirical qf of the data with the qf of the N(0, 1) distribution. The R function qqnorm constructs the plot.

The function qqline adds the straight line $y = a + bx$ corresponding to the normal distribution $N(a, b^2)$ whose lower and upper quartiles match those of the data. (This is a resistant fit, unaffected by one or two extreme observations.)
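A sketch of the two calls, again on simulated data: y is a hypothetical skewed stand-in, NOT the Stamford readings, and the plot title is illustrative.

```r
set.seed(1)
# Hypothetical stand-in for the ozone readings (NOT the real Stamford data)
y <- rgamma(132, shape = 2, scale = 20)

# Normal Q-Q plot; qqnorm invisibly returns the plotted coordinates
qq <- qqnorm(y, main = "Normal Q-Q plot (stand-in data)")

# Line through the lower and upper quartile points
qqline(y)
```

The returned x-coordinates are the N(0, 1) quantiles at the probability points, and the y-coordinates are the data, so the plot is the same as one built from sort(y) and qnorm(ppoints(length(y))).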

The lack of linearity in the plotted data suggests that they are not well modelled by a normal distribution. Rather, the convex shape indicates positive skewness (as was seen in the histogram). We investigate whether a transformation would help. A check is carried out as to whether the square roots of the Stamford ozone concentrations can be modelled by a normal distribution.

(Notice that, when using R, x^y gives the yth power of x, and, of course, a power of 0.5 gives a square root.)

The plotted data are reasonably well fitted by the straight line which has slope and intercept 9.0. Thus the data are reasonably well modelled by the N(9, ) distribution (whose lower and upper quartiles match those of the data). Note that the sample mean and standard deviation are 9.1 and respectively.

Note: The equation of the straight line can be obtained by inspection, or more accurately by using the two points that it is defined by, i.e. $(Q(1/4), Q_e(1/4))$ and $(Q(3/4), Q_e(3/4))$. Appropriate R commands give these as (-0.675, 7.106) and (0.675, )
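One way the line through the two quartile points might be computed is sketched below. The data are a hypothetical stand-in (NOT the Stamford readings), so the numbers produced will not match those quoted above. The quartiles come from quantile with its default settings, which may differ slightly from $Q_e$ as defined earlier.

```r
set.seed(1)
# Hypothetical stand-in for the ozone readings (NOT the real Stamford data)
y <- rgamma(132, shape = 2, scale = 20)
r <- y^0.5                     # square-root transform

# Theoretical quartiles Q(1/4), Q(3/4) of N(0, 1): about -0.675 and 0.675
x_q <- qnorm(c(0.25, 0.75))

# Empirical quartiles of the transformed data
y_q <- quantile(r, c(0.25, 0.75), names = FALSE)

# Slope b and intercept a of the line through the two points
b <- diff(y_q) / diff(x_q)
a <- y_q[1] - b * x_q[1]
print(c(intercept = a, slope = b))

qqnorm(r)
abline(a, b)                   # should coincide with qqline(r)
```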

Example. The R vector abbey in the package MASS gives 31 determinations of nickel content ($\mu$g g$^{-1}$) in a Canadian syenite rock.

We check whether the data are reasonably modelled by an exponential distribution. There is no predefined function in R to construct an exponential Q-Q plot, so we have to work from first principles.

qexp(ppoints(31)) gives the theoretical quantile values at 31 probability points.
sort(abbey) gives the sorted experimental values.

The following command produces a Q-Q plot with axes labelled accordingly.
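The original command is not reproduced in this transcript. A sketch of what such a plot might look like follows; the axis labels and variable names are assumptions.

```r
library(MASS)  # provides the abbey vector

# Theoretical Exp(1) quantiles at 31 probability points
theo <- qexp(ppoints(31))

# Sorted observations (order statistics)
obs <- sort(abbey)

# Exponential Q-Q plot from first principles
plot(theo, obs,
     xlab = "Exp(1) quantiles",
     ylab = "Sorted nickel content")
```

Under an exponential model the points should lie close to a line through the origin, whose slope estimates the mean.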

If we ignore the highest observation, the data appear to be reasonably compatible with an exponential distribution with mean around 12.5, i.e. the Exp(0.08) distribution with rate 1/12.5 = 0.08.

However, the probability that the highest of 31 independent observations from an exponential distribution of mean 12.5 is as great as 125 is only , so there must be considerable doubt about the suitability of an exponential model here. A further possibility is that some transformation of the data may be modelled by a normal distribution. Try this out in the practicals.
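The probability referred to can be computed from the exponential distribution function. For independent observations, $P(\max \ge 125) = 1 - P(Y \le 125)^{31}$. A minimal sketch, with illustrative variable names:

```r
rate <- 1 / 12.5   # exponential with mean 12.5, i.e. rate 0.08
n <- 31

# P(a single observation is at most 125)
p_single <- pexp(125, rate = rate)

# P(the largest of n independent observations is at least 125)
p_max <- 1 - p_single^n
print(p_max)       # approximately 0.0014
```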

Boxplots These are useful for comparing distributions of the same variable. The R code: produces the following boxplot

Recall that the lower and upper ends of each box give the first and third quartiles of the corresponding distribution and the centre line indicates the median.
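The original boxplot code is not reproduced in this transcript. A sketch on simulated data follows; the two samples A and B are hypothetical stand-ins, not the ozone readings.

```r
set.seed(1)
# Two hypothetical samples of the same variable (NOT the ozone data)
samples <- list(
  A = rgamma(132, shape = 2, scale = 20),
  B = rgamma(132, shape = 2, scale = 25)
)

# Side-by-side boxplots; the statistics used are returned invisibly
bp <- boxplot(samples, ylab = "value")

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)
```

Strictly, R draws the box ends at the hinges, which are close to, but not always identical to, the quartiles returned by quantile.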