Proportion Data Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.

Resources
–Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
–Freund, RJ, and WJ Wilson (1998) Regression Analysis. Academic Press.
–Gentle, JE (2002) Elements of Computational Statistics. Springer.
–Gonick, L, and W Smith (1993) The Cartoon Guide to Statistics. HarperResource (for fun).

Introduction
The four demonstration sessions of this class address special types of data:
–Counts
–Proportions (this lecture)
–Survival analysis
–Binary responses

Frequencies and Proportions
With frequency data, we know how often something happened, but not how often it didn’t happen. With proportion data, we know both.
Applied to:
–Mortality and infection rates
–Response to clinical treatment
–Voting
–Sex ratios
–Proportional response to experimental treatments

Working With Proportions
Traditionally, proportion data were modelled by using the percentage as the response variable. This is bad for four reasons:
1. Errors are not normally distributed.
2. The variance is not constant.
3. The response is bounded by 0.0 and 1.0.
4. The size of the sample, n, is lost.

General Approach
Use a generalized linear model (glm) with family = binomial (i.e., an unfair coin flip).
The response is built from two vectors, one of the success counts and the other of the failure counts:
number of successes + number of failures = binomial denominator, n
y <- cbind(successes, failures)
model <- glm(y ~ whatever, binomial)
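
The recipe above can be sketched in R as follows. The dose and mortality figures are invented purely for illustration (they are not from the lecture or the book), and log(dose) is just one plausible choice of explanatory variable.

```r
# Hypothetical data: 40 insects exposed at each of five doses
dose  <- c(1, 2, 4, 8, 16)
dead  <- c(3, 9, 18, 30, 38)   # successes
alive <- 40 - dead             # failures

# Two-column response: successes first, failures second
y <- cbind(dead, alive)

# Binomial GLM; the logit link is the default for this family
model <- glm(y ~ log(dose), family = binomial)
summary(model)

# Fitted proportions back on the 0-1 scale
fitted_p <- predict(model, type = "response")
```

Because both columns are supplied, the binomial denominator n is carried into the fit rather than being lost as it would be with a bare percentage.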

How R Handles Proportions
For binomial proportion data:
–Weighted regression (weighted by the individual sample sizes).
–logit link to ensure linearity.
For percentage cover data:
–Do an arcsine transformation, followed by conventional modelling (normal errors, constant variance).
For percentage change in a continuous measurement:
–ANCOVA with final weight as the response and initial weight as a covariate, or
–Use the relative growth rate (log(final/initial)) as the response.
–Both produce normal errors.
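
A minimal sketch of the two non-binomial cases, assuming made-up cover percentages and plant weights. All variable names and values here are hypothetical.

```r
# Percentage cover: arcsine (angular) transformation of the proportion
cover      <- c(5, 20, 45, 70, 95)        # hypothetical percentages
cover_asin <- asin(sqrt(cover / 100))     # radians, between 0 and pi/2

# Percentage change in a continuous measurement (hypothetical weights)
initial   <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.0)
final     <- c(15.3, 16.1, 14.9, 19.0, 17.2, 17.8)
treatment <- factor(rep(c("control", "treated"), each = 3))

# Option 1: ANCOVA with final weight as response, initial weight as covariate
ancova <- lm(final ~ initial + treatment)

# Option 2: relative growth rate as the response
rgr       <- log(final / initial)
rgr_model <- lm(rgr ~ treatment)
```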

Tests
To compare a single binomial proportion to a constant, use binom.test.
To compare two samples, use prop.test.
The glm methods of this lecture are needed only for more complex models:
–Regression
–Contingency tables
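
Both simple tests are one-liners in base R. The counts below are invented for illustration only.

```r
# One sample: 7 successes in 20 trials, tested against p = 0.5
single <- binom.test(x = 7, n = 20, p = 0.5)
single$p.value

# Two samples: 15 successes out of 50 versus 28 out of 60
two <- prop.test(x = c(15, 28), n = c(50, 60))
two$p.value
```

binom.test is an exact test, while prop.test relies on a chi-squared approximation (with a continuity correction by default).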

Count Data on Proportions
R supports the usual arcsine and probit transformations:
–arcsine makes the error distribution normal
–probit linearises the relationship between percentage mortality and log(dose)
However, it is usually better to use the logit transformation and assume you have binomial data.

Odds
The logistic model for p as a function of x is:
p = exp(a + bx) / (1 + exp(a + bx))
The book notes that this is obviously non-linear. To linearise it, consider instead the odds p/q (as in gambling, where q = 1 - p):
p/q = exp(a + bx)
Or, taking logs:
ln(p/q) = a + bx
ln(p/q) is called the logit transformation of p.
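
The algebra can be checked numerically. This sketch uses arbitrary values for a and b (assumptions, chosen only for illustration) and compares the hand-written formulas with the base R functions plogis and qlogis.

```r
a <- -2      # hypothetical intercept
b <- 0.5     # hypothetical slope
x <- 0:8

# Logistic model: p as a non-linear function of x
p <- exp(a + b * x) / (1 + exp(a + b * x))
q <- 1 - p

# The log-odds (logit) recovers the straight line a + b*x
logit_p <- log(p / q)
all.equal(logit_p, a + b * x)

# Base R equivalents: plogis is the inverse logit, qlogis is the logit
all.equal(p, plogis(a + b * x))
all.equal(logit_p, qlogis(p))
```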

R and logit
R does not simply do a linear regression of ln(p/q) against x. It also handles:
–non-constant binomial variance
–logit(p) going to -∞ and +∞
–differences between sample sizes, using weighted regression
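
One way to see the weighting by sample size is to fit the same model twice: once with the two-column count response, and once with the observed proportions plus the sample sizes as weights. The data are hypothetical, with deliberately unequal sample sizes.

```r
# Hypothetical data with unequal binomial denominators
x         <- 1:5
n         <- c(20, 35, 50, 15, 40)
successes <- c(2, 9, 24, 10, 35)

# Form 1: two-column response of successes and failures
fit_counts <- glm(cbind(successes, n - successes) ~ x, family = binomial)

# Form 2: proportions as the response, sample sizes as weights
fit_weighted <- glm(successes / n ~ x, weights = n, family = binomial)

# The two forms give the same coefficient estimates
all.equal(coef(fit_counts), coef(fit_weighted))
```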

Over-dispersion and Hypothesis Testing
Everything addressed earlier is still available for proportion data. This includes ANOVA, ANCOVA, and regression analysis.
Significance is assessed using χ² tests.
Hypothesis testing with binomial errors is less clear-cut than with normal errors:
–Large samples (>30) are necessary.
–The degree to which the approximation is satisfactory is unknown.
–p will not be exactly known.
Over-dispersion must usually be addressed. The residual scaled deviance should be roughly equal to the residual degrees of freedom; use family = quasibinomial when it is much larger.
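
A rough sketch of the over-dispersion check, using invented data with more scatter than a binomial model would predict. The F test used for the quasibinomial fit is a common convention when the dispersion is estimated, not something stated on the slide.

```r
# Hypothetical, deliberately noisy data: 30 trials at each of eight x values
x         <- 1:8
n         <- rep(30, 8)
successes <- c(2, 11, 5, 18, 9, 25, 14, 29)
y         <- cbind(successes, n - successes)

# Binomial fit, with a chi-squared analysis of deviance
fit <- glm(y ~ x, family = binomial)
anova(fit, test = "Chisq")

# Residual deviance divided by residual df; values well above 1
# suggest over-dispersion
ratio <- deviance(fit) / df.residual(fit)

# Refit with an estimated dispersion parameter
fit_q <- glm(y ~ x, family = quasibinomial)
summary(fit_q)$dispersion
anova(fit_q, test = "F")
```

The quasibinomial fit leaves the coefficient estimates unchanged; only the standard errors and test statistics are rescaled by the estimated dispersion.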

Book Examples
See the discussion of how to model with binomial errors:
–Logistic regression example
–Categorical explanatory variables example
–ANCOVA example