Some statistical basics Marian Scott
Why bother with statistics? We need statistical skills to: make sense of numerical information, summarise data, present results (graphically), test hypotheses, and construct models.
Variables - number and type. Univariate: there is one variable of interest measured on the individuals in the sample. We may ask: What is the distribution of results? This may be further resolved into questions concerning the mean or average value of the variable and the scatter or variability in the results.
Bivariate: two variables of interest are measured on each member of the sample. We may ask: How are the two variables related? If one variable is time, how does the other variable change? How can we model the dependence of one variable on the other?
Multivariate: many variables of interest are measured on the individuals in the sample. We might ask: What relationships exist between the variables? Is it possible to reduce the number of variables but still retain 'all' the information? Can we identify any grouping of the individuals on the basis of the variables?
Data types Numerical: a variable may be either continuous or discrete. For a discrete variable, the values taken are whole numbers (e.g. number of chromosome abnormalities, numbers of eggs). For a continuous variable, values taken are real numbers (positive or negative and including fractional parts) (e.g. blood lead level, alkalinity, weight, temperature).
Categorical: a limited number of categories or classes exist; each member of the sample belongs to one and only one of the classes, e.g. sex is categorical. Sex is a nominal categorical variable since the categories are unordered. Dose of a drug or level of diluent (e.g. recorded as low, medium, high) would be an ordinal categorical variable since the different classes are ordered.
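As a rough illustration of these data types, the sketch below builds a small, made-up dataset in pandas with a discrete count, a continuous measurement, a nominal category and an ordered category; the column names and values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data illustrating the data types above
df = pd.DataFrame({
    "eggs": [3, 5, 2, 4],                      # discrete numerical (whole numbers)
    "blood_lead": [1.2, 0.8, 1.5, 1.1],        # continuous numerical (real numbers)
    "sex": ["F", "M", "F", "M"],               # nominal categorical (unordered)
    "dose": ["low", "high", "medium", "low"],  # ordinal categorical (ordered)
})

# Sex: unordered categories
df["sex"] = pd.Categorical(df["sex"])

# Dose: ordered categories, low < medium < high
df["dose"] = pd.Categorical(df["dose"],
                            categories=["low", "medium", "high"],
                            ordered=True)

print(df.dtypes)
print(df["dose"].min(), "<", df["dose"].max())  # the ordering is respected
```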
Inference and Statistical Significance: inference runs from the sample to the population. Is the sample representative? Is the population homogeneous? Since only a sample has been taken from the population, we cannot be 100% certain, and this is why we carry out significance testing.
Hypothesis Testing II. Null hypothesis: usually no effect. Alternative hypothesis: an effect. We make a decision based on the evidence (the data), and there is a risk of getting it wrong! Two types of error: rejecting the null when we shouldn't (Type I); not rejecting the null when we should (Type II).
Significance Levels. We cannot reduce the probabilities of both Type I and Type II errors to zero, so we control the probability of a Type I error. This is referred to as the significance level or p-value. Generally a p-value of <0.05 is considered a reasonable risk of a Type I error (beyond reasonable doubt).
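A minimal sketch of this decision rule, using a two-sample t-test from scipy on simulated data (the group names, means and sample sizes are assumptions made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical measurements for two groups (e.g. treatment vs control)
control = rng.normal(loc=10.0, scale=2.0, size=30)
treatment = rng.normal(loc=11.5, scale=2.0, size=30)

# Null hypothesis: no difference in means; alternative: a difference exists
t_stat, p_value = stats.ttest_ind(treatment, control)

alpha = 0.05  # significance level: the accepted risk of a Type I error
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis at the 5% level")
else:
    print("Do not reject the null hypothesis")
```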
Statistical Significance vs. Practical Importance Statistical significance is concerned with the ability to discriminate between treatments given the background variation. Practical importance relates to the scientific domain and is concerned with scientific discovery and explanation.
Power. Power is related to the Type II error: power = 1 - probability of making a Type II error. Aim: to keep power as high as possible.
Sample size calculations. What is the objective of the experiment? How much of a difference is it important to be able to detect (the effect size)? At what significance level do you want to conduct the test? (Decreasing the significance level reduces power.) What is the power of the experiment, i.e. the probability that you will detect such a difference when it actually exists? How variable is the population? Greater variation needs a larger sample size to achieve the same power.
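These considerations can be turned into a sample-size calculation; the sketch below uses statsmodels' TTestIndPower for a two-sample t-test, with an assumed effect size, significance level and target power chosen purely for illustration.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Hypothetical design choices
effect_size = 0.5   # difference worth detecting, in standard-deviation units
alpha = 0.05        # significance level (Type I error risk)
power = 0.8         # probability of detecting the effect if it exists

# Solve for the number of observations needed in each group
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=alpha, power=power)
print(f"Required sample size per group: {n_per_group:.1f}")
```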
Power Curves
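A power curve plots power against sample size for a range of effect sizes; a minimal sketch using statsmodels (the effect sizes and sample-size range shown are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

# Power curves: power as a function of sample size, one curve per effect size
analysis = TTestIndPower()
analysis.plot_power(dep_var="nobs",
                    nobs=np.arange(5, 150),
                    effect_size=np.array([0.2, 0.5, 0.8]),
                    alpha=0.05)
plt.show()
```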
Modelling continuous variables - checking Normality. Compare the Normal density function with a histogram of the data and check for symmetry. Another possibility: the Normal probability plot.
Modelling continuous variables - checking Normality. The Normal probability plot should show a straight line; the p-value of a test is also reported (null hypothesis: the data are Normally distributed).
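A minimal sketch of these checks in Python on simulated data: a histogram for symmetry, a Normal probability (Q-Q) plot, and a Shapiro-Wilk test whose null hypothesis is that the data are Normally distributed. The sample itself is made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=50, scale=5, size=100)  # hypothetical continuous measurements

fig, axes = plt.subplots(1, 2, figsize=(9, 4))

# Histogram: check for symmetry against the Normal shape
axes[0].hist(x, bins=15)
axes[0].set_title("Histogram")

# Normal probability (Q-Q) plot: points should lie close to a straight line
stats.probplot(x, dist="norm", plot=axes[1])

# Shapiro-Wilk test; null hypothesis: data are Normally distributed
stat, p_value = stats.shapiro(x)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
plt.show()
```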
Statistical inference: hypothesis testing and the p-value; statistical significance vs real-world importance; confidence intervals.
Confidence intervals - an alternative to hypothesis testing. A confidence interval is a range of credible values for the population parameter. The confidence coefficient is the percentage of times that the method will, in the long run, capture the true population parameter. A common form is: sample estimate ± 2 × estimated standard error.
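A small sketch of this on a simulated sample: the rough "estimate ± 2 standard errors" interval alongside the exact t-based 95% confidence interval for the mean (the data are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=10, scale=2, size=25)  # hypothetical sample

xbar = x.mean()
se = stats.sem(x)  # estimated standard error of the mean

# Rough "estimate +/- 2 * standard error" interval from the slide
print(f"Approximate 95% CI: ({xbar - 2 * se:.2f}, {xbar + 2 * se:.2f})")

# Exact t-based 95% confidence interval for the population mean
lo, hi = stats.t.interval(0.95, len(x) - 1, loc=xbar, scale=se)
print(f"t-based 95% CI:     ({lo:.2f}, {hi:.2f})")
```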
Statistical models. Outcomes or responses: these are the results of the practical work and are sometimes referred to as dependent variables. Causes or explanations: these are the conditions or environment within which the outcomes or responses have been observed; they are sometimes referred to as independent variables, but more commonly known as covariates.
Statistical models. In experiments, many of the covariates are determined by the experimenter, but some may be aspects the experimenter has no control over that are nevertheless relevant to the outcomes or responses. In observational studies, the covariates are usually not under the control of the experimenter but are recorded as possible explanations of the outcomes or responses.
Specifying a statistical model. Models specify the way in which outcomes and causes link together, e.g. Metabolite = Temperature. The = sign does not indicate equality in a mathematical sense, and there should be an additional item on the right-hand side, giving the formula: Metabolite = Temperature + Error.
Statistical model interpretation. Metabolite = Temperature + Error: the outcome, Metabolite, is explained by Temperature and by other things that we have not recorded, which we call Error. The task in data analysis is then to find out whether the effect of Temperature is large in comparison to that of Error, so that we can say whether or not the Metabolite we observe is explained by Temperature.
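One way to fit this word equation is with a model formula; the sketch below simulates Metabolite and Temperature values (the intercept, slope and error spread are made up) and fits Metabolite ~ Temperature by least squares with statsmodels, the Error term being implicit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Simulated data for the word equation Metabolite = Temperature + Error
temperature = rng.uniform(5, 30, size=50)
metabolite = 2.0 + 0.3 * temperature + rng.normal(scale=1.0, size=50)
df = pd.DataFrame({"Metabolite": metabolite, "Temperature": temperature})

# The model formula mirrors the word equation; the Error term is implicit
model = smf.ols("Metabolite ~ Temperature", data=df).fit()
print(model.summary())
```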
Correlations and linear relationships. The correlation coefficient measures the strength of a linear relationship; it is a simple indicator lying between -1 and +1. Check your plots for linearity.
Gene correlations
Interpreting correlations. The correlation coefficient is a measure of the strength of the linear association between two variables. If the relationship is non-linear, the coefficient can still be evaluated and may appear sensible, so beware: plot the data first.
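A minimal sketch: plot the data first, then compute the correlation coefficient with scipy (the two variables are simulated with a roughly linear relationship, purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical pair of variables with a roughly linear relationship
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

# Always plot the data first to check for linearity
plt.scatter(x, y)
plt.show()

r, p_value = stats.pearsonr(x, y)  # correlation coefficient, between -1 and +1
print(f"r = {r:.2f}, p = {p_value:.3f}")
```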
Simple regression model. The basic regression model assumes: the average value of the response, x, is linearly related to the explanatory variable, t; the spread of the response, x, about the average is the SAME for all values of t; the VARIABILITY of the response, x, about the average follows a NORMAL distribution for each value of t.
Simple regression model. The model is typically fitted using least squares. Goodness of fit is assessed using the residual sum of squares and R². Assumptions are checked using residual plots. Inference about the model parameters is carried out using hypothesis tests or confidence intervals.
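A sketch of this workflow on simulated data, keeping the slide's notation of a response x and explanatory variable t: a least-squares fit with statsmodels, R², confidence intervals and p-values for the parameters, and a residuals-versus-fitted plot to check the constant-spread assumption. The simulated intercept, slope and error spread are arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({"t": rng.uniform(0, 10, 60)})
df["x"] = 1.0 + 0.5 * df["t"] + rng.normal(scale=0.8, size=60)  # response x, explanatory t

# Least-squares fit of the simple regression model
fit = smf.ols("x ~ t", data=df).fit()
print(f"R-squared: {fit.rsquared:.2f}")
print(fit.conf_int())   # confidence intervals for intercept and slope
print(fit.pvalues)      # hypothesis tests for the parameters

# Residuals vs fitted values: checks the constant-spread assumption
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```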
Statistical model interpretation. The traditional statistical tests such as t-tests, ANOVA, ANCOVA and regression are each special cases of a more general type of model, each making a number of assumptions: t-tests work where there are two groups; ANOVA works with categorical explanatory variables; regression assumes that the explanatory variables are continuous. Our explanatory variables are not like this; they are mixtures of continuous and categorical, so we need a more flexible approach: the G(eneral) L(inear) M(odel).
General linear models. General Linear Models (GLMs) are a comprehensive set of techniques that cover a wide range of analyses. Problems that make use of a number of specific techniques may be specified as GLM problems using a unified specification called a model syntax. The form of the model syntax varies a little from statistics package to statistics package, but it is essentially just a way of unambiguously specifying the relationship between variables (categorical or continuous).
Examples (traditional test and the corresponding GLM word equation):
Comparing the effect of burning and clipping on bracken: two-sample t-test; SHOOTS = MANAGEMENT.
Comparing the effect of two different drugs with a placebo: one-way analysis of variance; EFFECT = DRUG.
Comparing the yield between fertilisers, conducting the experiment in several fields: one-way analysis of variance with blocking; YIELD = FIELD + FERTILISER.
Investigating the relationship between height and weight in people: regression; WEIGHT = HEIGHT.
Investigating the relationship between oxygen consumption and weight in scampi, taking level of activity into account: analysis of covariance, with emphasis on regression; OXYGEN = WEIGHT + ACTIVITY, or under different assumptions (an interaction between the terms) OXYGEN = WEIGHT | ACTIVITY.
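As a sketch of how such word equations translate into a statistics package, the example below simulates data for the scampi case and fits both the additive model (OXYGEN = WEIGHT + ACTIVITY) and the interaction model (OXYGEN = WEIGHT | ACTIVITY) using statsmodels formulas; the simulated values and coefficients are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Simulated data: a continuous covariate (WEIGHT) and a categorical one (ACTIVITY)
n = 90
df = pd.DataFrame({
    "WEIGHT": rng.uniform(10, 40, n),
    "ACTIVITY": rng.choice(["low", "medium", "high"], n),
})
effect = df["ACTIVITY"].map({"low": 0.0, "medium": 1.0, "high": 2.0})
df["OXYGEN"] = 5 + 0.2 * df["WEIGHT"] + effect + rng.normal(scale=0.5, size=n)

# OXYGEN = WEIGHT + ACTIVITY: parallel lines, one intercept per activity level
additive = smf.ols("OXYGEN ~ WEIGHT + C(ACTIVITY)", data=df).fit()

# OXYGEN = WEIGHT | ACTIVITY: separate slopes, via an interaction term
interaction = smf.ols("OXYGEN ~ WEIGHT * C(ACTIVITY)", data=df).fit()

print(additive.params)
print(interaction.params)
```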
Summary: hypothesis tests and confidence intervals are used to make inferences; we build statistical models to explore relationships and explain variation; the modelling framework is a general one (general linear models, generalised additive models); assumptions should be checked.