Modeling Wim Buysse RUFORUM 1 December 2006 Research Methods Group.



Part 1. General Linear Models

General Linear Models Dataset from Research Methods Group

General Linear Models Dataset from p Research Methods Group

General Linear Models: effects of three levels of sorbic acid (Sorbic) and six levels of water activity (Water) on the survival of Salmonella typhimurium (Density). Density = log(density/ml).

General Linear Models: ANOVA approach.

General Linear Models: results.

General Linear Models: the same data, but with each treatment presented as a ‘dummy variable’. (Warning: for educational purposes only.)

General Linear Models: regression with a first independent variable.

General Linear Models: we add a second independent variable.

General Linear Models: we add a third one.

General Linear Models: we add a fourth one.

General Linear Models: we continue to construct the model.

General Linear Models: finally, the results.

General Linear Models: comparison of the two approaches:
- They give the same results (in terms of sums of squares, SS).
- Which approach to choose depends on what you want to know.
- The regression approach still works when the ANOVA approach is no longer possible (for instance, when there are missing values).
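
The equality of the two approaches can be checked on a small example. The sketch below uses made-up numbers (not the sorbic acid dataset) for a one-way layout with three hypothetical treatments: it computes the treatment SS from the group means (ANOVA approach) and the regression SS from a least-squares fit on dummy variables. The helper `solve` and all variable names are illustrative assumptions.

```python
# One-way layout with three hypothetical treatments (made-up numbers).
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [1.0, 2.0, 3.0],
}

y = [v for vals in groups.values() for v in vals]
grand_mean = sum(y) / len(y)

# ANOVA approach: treatment SS computed from the group means.
ss_anova = sum(
    len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
    for vals in groups.values()
)

# Regression approach: an intercept plus one dummy (0/1) column per
# treatment except the first, which acts as the reference level.
labels = list(groups)
X = [
    [1.0] + [1.0 if name == other else 0.0 for other in labels[1:]]
    for name, vals in groups.items()
    for _ in vals
]


def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Normal equations of least squares: (X'X) beta = X'y.
p = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
beta = solve(xtx, xty)

fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
ss_regression = sum((f - grand_mean) ** 2 for f in fitted)

print(ss_anova, ss_regression)
```

The fitted values of the dummy-variable regression are simply the group means, which is why both sums of squares coincide.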

Example: modelling approach with normally distributed data. Protocol and dataset.

Example: modelling approach with normally distributed data. Data: screening of suitable species for three-year fallow; file = Fallow N.xls. Protocol: p. 13.

Example: modelling approach with normally distributed data. The analysis approach is described in chapter 19 of ‘Good statistical practice for natural resources research’.

Modelling approach: general. Five steps:
1. (Visual) exploration to discover trends and relationships.
2. Choose a possible model, based on:
   - the trend you see;
   - knowledge of the experimental design;
   - biological/scientific knowledge of the process.
3. Fitting = estimation of the parameters.
4. Check = assessing the ‘fit’.
5. Interpretation to answer the objectives.

Expanding the model. ANOVA and regression:
- Same calculations: data = pattern + noise = systematic component + random component.
- Same assumptions:
  - The systematic components are additive.
  - The variability of the groups is similar.
  - The random component is (approximately) normally distributed.
  - The random variability of “y” around the systematic component is not affected by this systematic component.

GENERAL LINEAR MODELS. Data = pattern + noise.
- Pattern: explained by a linear combination of the independent variables (data ≈ N(m, v), and the variance is roughly constant across the different groups).
- Noise: N(0, v), and the variance is roughly constant across the different groups.
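
Written as a formula, "data = pattern + noise" could look as follows. The symbols are assumptions introduced here for illustration: y_i is observation i, x_{ij} are the independent variables (dummy or continuous), the β's are the parameters to estimate, and σ² is the common variance.

```latex
y_i = \underbrace{\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}}_{\text{pattern}}
      + \underbrace{\varepsilon_i}_{\text{noise}},
\qquad \varepsilon_i \sim N(0, \sigma^2)
```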

Expanding the model. If the data are not normally distributed, or if the variance of the different groups is not similar, a possible approach is to transform the data (= ‘linearising’ the model). Problems:
- You no longer work on a scale that has a biological meaning.
- Back-transforming the standard errors to the original scale is no longer possible.

Expanding the model. Better solution: GENERAL LINEAR MODELS => GENERALIZED LINEAR MODELS. Fewer restrictions; two essential differences:
1. The data can follow a distribution from the exponential family: Normal, Binomial, Poisson, Gamma, Negative binomial.
2. Link function: E(Y) itself no longer has to be a linear combination of the independent variables. Instead, a function of E(Y) (the link function) is equal to a linear combination of the independent variables. (We do not transform the dependent variable, but include the transformation in the model.)
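
In formula form, the link function g connects E(Y) to the linear combination of the independent variables. The notation below (g, η, β, x) is an assumption added for illustration; the three links listed are the usual default choices for the Normal, Binomial and Poisson distributions.

```latex
g\bigl(E(Y)\bigr) = \eta = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k

\text{identity (Normal): } g(\mu) = \mu
\qquad
\text{logit (Binomial): } g(\mu) = \ln\frac{\mu}{1-\mu}
\qquad
\text{log (Poisson): } g(\mu) = \ln \mu
```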

Expanding the model. Better solution: GENERAL LINEAR MODELS => GENERALIZED LINEAR MODELS. Also:
- The systematic component (the linear combination of independent variables) can include both continuous and categorical variables, and even polynomials.
But still:
- The variance is constant across the different groups (or has become constant because of the transformation through the link function).

Generalised linear models. The statistical theory is more difficult, but the menus in GenStat and the way you interpret the output are very similar to what we know from ANOVA and regression.


Example 1. Logistic regression. Example: cardiovascular disease according to age (file: age and chd.xls).

Example 1. Logistic regression. Example: the same data, but according to age group.

Example 1. Logistic regression. Example: linear regression is not an appropriate model, and the predictions at the extremes will not be correct.

Example 1. Logistic regression. Example: χ² test: limited information.

Example 1. Logistic regression. Bernoulli process: an (independent) event that can have two possible outcomes (1 or 0, success or failure, ...), with a given probability of success:
- Tossing a coin: head or tail; p = 0.5.
- Throwing a 6 with a die (success) compared with throwing any other number; p = 1/6.
- Conducting a survey: is the head of the household male or female? Calculate p from the proportion found in the collected data.
- Screening for cardiovascular disease: p(disease) = 43 out of 100 individuals = 0.43.

Example 1. Logistic regression: in GenStat.

Example 1. Logistic regression: logistic function.
- Sigmoid form.
- Linear in the middle.
- The probability is restricted to between 0 and 1.
- Small values: flattens towards 0; large values: flattens towards 1.
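
The properties of the logistic function listed above can be seen numerically in a minimal sketch. The function names `logistic` and `logit` are chosen here for illustration; they are not GenStat commands.

```python
import math


def logistic(x: float) -> float:
    """Logistic (inverse logit) function: maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Logit link: the log odds of a probability p (inverse of logistic)."""
    return math.log(p / (1.0 - p))


# Small values flatten towards 0, large values towards 1,
# and the curve is close to linear around x = 0.
for x in (-6, -2, 0, 2, 6):
    print(x, round(logistic(x), 4))
```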

Example 1. Logistic regression: GenStat output. Similar to regression output, but with ‘deviance’ instead of ‘variance’ and a χ² test instead of an F test.

Example 1. Logistic regression: GenStat output, model: Logit(CHD) = -5.31 + 0.1109 AGE
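
To read the fitted model on the probability scale, the logit has to be back-transformed. The sketch below uses the two coefficients reported in the GenStat output (-5.31 and 0.1109); the ages in the loop are illustrative values, not taken from the dataset, and the function name is hypothetical.

```python
import math

# Coefficients as reported on the slide: Logit(CHD) = -5.31 + 0.1109 AGE.
INTERCEPT = -5.31
SLOPE_AGE = 0.1109


def predicted_probability(age: float) -> float:
    """Back-transform the linear predictor to a probability of CHD."""
    linear_predictor = INTERCEPT + SLOPE_AGE * age
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Multiplicative change in the odds of CHD for one additional year of age.
odds_ratio_per_year = math.exp(SLOPE_AGE)

for age in (30, 50, 70):  # illustrative ages only
    print(age, round(predicted_probability(age), 3))
print("odds ratio per year:", round(odds_ratio_per_year, 3))
```

Unlike a straight-line fit, the predicted values stay between 0 and 1 at both extremes of age.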

Example 1. Logistic regression. Binomial distribution: when we repeat the Bernoulli process, the order of successes and failures can change. Example: head of household in a survey.

Example 1. Logistic regression. Calculation of probabilities if success = female-headed household, with p = 0.2.

Example 1. Logistic regression. Calculated probabilities of obtaining success. We can now construct a frequency distribution of the number of successes. Probability = long-run frequency = frequency with very many data = binomial distribution.
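
A calculation of this kind can be sketched as follows, with p = 0.2 as on the slide. The number of households (3) is a hypothetical choice, since the value used on the original slide is not preserved in the transcript.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


P_FEMALE_HEAD = 0.2  # probability of success, from the slide
N_HOUSEHOLDS = 3     # hypothetical number of households surveyed

# Probability distribution of the number of female-headed households.
distribution = {
    k: binomial_pmf(k, N_HOUSEHOLDS, P_FEMALE_HEAD)
    for k in range(N_HOUSEHOLDS + 1)
}

for k, prob in distribution.items():
    print(k, round(prob, 3))
```

The factor `comb(n, k)` counts the different orders in which the k successes can occur.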

Example 1. Logistic regression. Binomial distribution: counts of a categorical variable. Example: experiment on the survival of trees from different provenances (file: survival trees.xls).

Example 1. Logistic regression: several approaches are possible (approaches 1, 2 and 3, two slides each).

Example 1. Logistic regression. The Bernoulli distribution is a special case of the binomial distribution. There exist ‘families of distributions’.

Example 1. Logistic regression. There is of course a difference in the variability that is explained.

Example 2. Modelling counts. We used logistic regression to analyse counts:
- Bernoulli distribution: distribution of success of events that follow a Bernoulli process (1 or 0, yes or no).
- Binomial distribution: distribution of possible (and independent) combinations of Bernoulli events.
So that was more like an analysis of proportions. Next: Poisson distribution: distribution of counts of Bernoulli events.

Example 2. Modelling counts. Poisson distribution: distribution of counts of Bernoulli events, BUT:
- p is very small;
- n is very big;
- p*n < 5;
- events happen randomly and independently of each other.
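
Under these conditions the binomial probabilities are very close to Poisson probabilities with mean n*p. A minimal sketch, with hypothetical values n = 1000 and p = 0.002 (so that p*n = 2 < 5):

```python
from math import comb, exp, factorial


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of a count of k under a Poisson distribution with mean lam."""
    return lam**k * exp(-lam) / factorial(k)


n, p = 1000, 0.002  # hypothetical: many trials, rare event
lam = n * p         # Poisson mean

# Largest difference between the two distributions for counts 0..10.
max_gap = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11)
)

for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```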

Example 2. Modelling counts. Poisson distribution = distribution of rare events:
- Number of civil airplane crashes (when there is no war) in the whole world over several years.
- Number of infected seeds in seed lots that are certified by a controlling agency.
- Number of individuals of a rare tree species in a square kilometre in the same Agro Ecological Zone.

Example 2. Modelling counts. THUS: the distribution that best describes counts is not automatically a Poisson distribution. It depends on the context.

Example 2. Modelling counts. Some mathematical statistics: for a Poisson distribution the ratio variance/mean must be 1. Poisson index in GenStat: (s² - m)/m, which is then close to 0.
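
The index is easy to compute by hand. The sketch below uses made-up counts and assumes that s² is the sample variance (n - 1 in the denominator) and m the sample mean; how GenStat computes s² exactly is not stated on the slide.

```python
from statistics import mean, variance


def poisson_index(counts):
    """Poisson index (s^2 - m) / m; close to 0 when variance and mean agree."""
    m = mean(counts)
    s2 = variance(counts)  # sample variance, n - 1 denominator (assumption)
    return (s2 - m) / m


counts = [0, 1, 2, 1, 0, 3, 1, 2, 0, 1]  # made-up counts per sampling unit

print(round(poisson_index(counts), 3))
```

A clearly positive index means the variance exceeds the mean (more variable than Poisson); a clearly negative one means the counts are more regular than Poisson.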

Example 2. Modelling counts. We have already briefly seen other counts: the χ² test. χ² test: is there evidence of an association between two discrete variables? H0: no association. H1: association.
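
As a reminder of how the χ² statistic is obtained from a two-way table of counts, here is a minimal sketch. The 2 x 2 table is made up, and the function name is hypothetical; expected counts under H0 are row total x column total / grand total.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables are not associated (H0).
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


table = [[10, 20], [20, 10]]  # made-up counts
statistic, df = chi_square_statistic(table)
print(round(statistic, 3), df)
```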

Example 2. Modelling counts. We could use another kind of probability to calculate the test statistic.

Example 2. Modelling counts. But now we look at the table in another way. If we consider the counts in the table as a variable, we could construct a frequency distribution.

Example 2. Modelling counts. Long-run frequency distribution = probability distribution. We have just expanded the binomial distribution into the multinomial distribution.
- Binomial distribution: independent observations; p(success) is the same everywhere. The probability that an individual observation falls into a specific cell of the table is the same for all cells.
- Multinomial distribution: in addition, the total number of observations is fixed.

Example 2. Modelling counts. If the total number of observations is not fixed => Poisson distribution. BUT, thanks to a lot of difficult statistical theory, we can also use the Poisson distribution even if the total number of observations is fixed.

Example 2. Modelling counts. CONCLUSION: the context remains important to decide whether we can use the Poisson distribution to analyse counts (‘distribution of rare events’). But generally: analysis of ‘multiway contingency tables’ => Poisson distribution + logarithm link = LOGLINEAR MODELLING.

Example 2. Modelling counts. Analysis of counts: often we can use the Poisson distribution, but not always.

Example 2. Loglinear modelling.

Example 2. Loglinear modelling: adding interactions.

Example 2. Loglinear modelling: χ² test = loglinear modelling.

Example 2. Loglinear modelling. Modelling of complex datasets:
- Add or drop terms and interactions in the model, and change their order.
- Good model (‘good fit’) when the ‘residual deviance’ becomes almost equal to the number of degrees of freedom (or ‘mean deviance’ ≈ 1).
- At that moment we can assume that the remaining residual variability is caused by random variability (noise).
- Adding too many terms: ‘residual deviance’ => 0.
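
The comparison of residual deviance with degrees of freedom can be sketched for the simplest loglinear model, the one without interaction (independence of rows and columns). The table is made up, and the function names are hypothetical; the deviance formula is the standard one for the Poisson distribution.

```python
import math


def fitted_without_interaction(table):
    """Fitted counts of the loglinear model with main effects only."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def poisson_deviance(observed, fitted):
    """Residual deviance: 2 * sum(o * ln(o / e) - (o - e)) over all cells."""
    deviance = 0.0
    for obs_row, fit_row in zip(observed, fitted):
        for o, e in zip(obs_row, fit_row):
            if o > 0:  # a zero count contributes nothing to the log term
                deviance += o * math.log(o / e)
            deviance -= o - e
    return 2.0 * deviance


table = [[10, 20], [20, 10]]  # made-up counts
fitted = fitted_without_interaction(table)

residual_deviance = poisson_deviance(table, fitted)
residual_df = (len(table) - 1) * (len(table[0]) - 1)
mean_deviance = residual_deviance / residual_df

# Model with the interaction added: it reproduces the table exactly.
saturated_deviance = poisson_deviance(table, table)

print(round(residual_deviance, 3), residual_df, round(mean_deviance, 3))
print(saturated_deviance)
```

Here the mean deviance is far above 1, so the model without interaction fits poorly; adding the interaction uses up all degrees of freedom and drives the residual deviance to 0.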

Example 2. Loglinear modelling. Example: lambs.xls