Introduction to Statistics
Alastair Kerr, PhD
Overview
– Understanding samples and distributions
  – Binomial and normal distributions
  – Describing data
  – Visualising data
– How to test hypotheses
  – Asking the right question
  – Statistical power
  – Choosing the correct test
– How to interpret results and improve experimental design
  – Bayesian analysis
  – Multiple testing correction
  – Confounding factors and how to avoid them
  – What are correct replicates
  – Avoiding common errors
Think about these statements (discuss at end). Paraphrased from real conversations:
– “We used a t-test to compare our samples”
– “These genes are the most highly expressed in my experiment: this must be significant”
– “No significant difference between these samples therefore the samples are the same”
– “Yes I have replicates, I ran the same sample 3 times”
– “We ignored those points, they are obviously wrong!”
– “X and Y are related as the p-value is 1e-168!”
– “I need you to show this data is significant”
Basic Probability
Which of these sequences of numbers is random? (outcomes 0 or 1, unsorted data)
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
1 1 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Binomial Distribution: a thought experiment
– Everyone flips a coin 10 times and counts the number of ‘heads’
– What is the most frequent observation? The least frequent? What is the pattern of observations between these?
– How would these factors affect the shape of the graph?
  – Using a die instead of a coin and looking for the number 6
  – Increasing the number of times the coin is flipped
  – Decreasing the number of people flipping coins
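A minimal simulation of this thought experiment (not from the original slides; the sample sizes and seed are arbitrary):

```python
# Simulate many people each flipping a fair coin 10 times and tally how often
# each count of heads occurs, to see the binomial shape emerge.
import numpy as np

rng = np.random.default_rng(42)

n_people = 1000    # number of people flipping; decrease for a noisier shape
n_flips = 10       # flips per person; increase to widen the distribution
p_success = 0.5    # 0.5 for a coin; use 1/6 for "rolling a 6" on a die

heads = rng.binomial(n_flips, p_success, size=n_people)
counts = np.bincount(heads, minlength=n_flips + 1)

for k, c in enumerate(counts):
    bar = "#" * (60 * c // counts.max())
    print(f"{k:2d} heads: {bar} ({c})")
```

The most frequent observation clusters around n_flips × p_success (5 heads), the extremes (0 or 10 heads) are rare, and changing n_flips, p_success, or n_people changes the centre, width, and noisiness of the histogram.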
Binomial Distributions
Types of Data
– Discrete or continuous
  – Discrete: the data can take only a countable set of values
  – Continuous: the values come from a continuous range (an effectively infinite population of possible values)
– Parametric or non-parametric
  – Parametric data fit a known distribution with specific properties
  – Certain tests are available only if the data are parametric
Normal Distribution
– the curve has a single peak
– the mean (average) lies at the centre of the distribution
– the distribution is symmetrical around the mean
– the two tails of the distribution extend indefinitely and never touch the horizontal axis (it is a continuous distribution)
– the shape of the distribution is determined by its mean (µ) and standard deviation (σ)
Variance and standard deviation
– Variance measures how dispersed your data are around the mean. Formalised: “the average of the squared distance of each data point from the mean”
– Standard deviation is the square root of the variance, aka the RMS (root mean square) deviation
– It is really just the distance from the mean of an ‘average’ sample
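A minimal sketch (not from the slides) computing these quantities directly from their definitions, on made-up data:

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # invented example data

mean = sum(data) / len(data)

# Variance: the average of the squared distance of each point from the mean.
# (This is the population variance; divide by len(data) - 1 for the sample variance.)
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: the square root of the variance (the RMS deviation).
std_dev = math.sqrt(variance)

print(f"mean = {mean}, variance = {variance}, standard deviation = {std_dev}")
# mean = 5.0, variance = 4.0, standard deviation = 2.0
```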
Normal distribution
– About 95% of the data lie within 2σ (standard deviations) of the mean (more precisely, within 1.96σ)
– For a sample of size N, the standard error of the mean is σ/√N, so an approximate 95% confidence interval for the mean is the sample mean ± 2σ/√N
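A minimal sketch (not from the slides) of an approximate 95% confidence interval for a sample mean, using the normal approximation on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=50)   # invented example data

mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))     # standard error of the mean, s/sqrt(N)

z = stats.norm.ppf(0.975)                           # ~1.96 for a 95% interval
lower, upper = mean - z * sem, mean + z * sem

print(f"mean = {mean:.2f}, 95% CI approximately ({lower:.2f}, {upper:.2f})")
```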
Anscombe's quartet
Understanding 'average'
– When talking about the average or mean, we usually refer to the arithmetic mean: sum of samples / number of samples
– Other Pythagorean means:
  – Geometric mean – the average of factors: log(data) → arithmetic mean → anti-log
  – Harmonic mean – the average of rates: 1/data → arithmetic mean → 1/mean
– Other ways to describe the centre of the data:
  – Mode – the most common value
  – Median – the central value in an ordered list of numbers
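A minimal sketch (not from the slides) computing each of these 'averages' on a small made-up dataset with Python's standard library:

```python
import statistics

data = [1.0, 2.0, 2.0, 4.0, 8.0]   # invented example data

arithmetic = statistics.mean(data)            # sum / count
geometric = statistics.geometric_mean(data)   # anti-log of the mean of the logs
harmonic = statistics.harmonic_mean(data)     # 1 / (mean of 1/x)
mode = statistics.mode(data)                  # most common value
median = statistics.median(data)              # central value of the sorted data

print(arithmetic, geometric, harmonic, mode, median)
# 3.4  2.639...  2.105...  2.0  2.0
```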
Quartiles and Quantiles
– Quantiles are points taken at regular intervals on a ranked list of data
  – The 100-quantiles are called percentiles
  – The 10-quantiles are called deciles
  – The 5-quantiles are called quintiles
  – The 4-quantiles are called quartiles
– Quartiles
  – The 'middle 50', or inter-quartile range (IQR), runs from the 1st to the 3rd quartile
  – First quartile (lower quartile): cuts off the lowest 25% of the data = 25th percentile
  – Second quartile (median): cuts the data set in half = 50th percentile
  – Third quartile (upper quartile): cuts off the highest 25% of the data, or the lowest 75% = 75th percentile
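A minimal sketch (not from the slides) of quartiles and the inter-quartile range on made-up data:

```python
import numpy as np

data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])   # invented example data

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1   # the 'middle 50%' of the data

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
```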
Visualisation: Tukey's boxplot (aka box-and-whisker or candlestick plot)
– box = the middle 50% of the data (the IQR), with a line at the median
– whiskers = lines extending to the most extreme points within 1.5 × IQR of the box
– dots = outliers lying beyond the whiskers
– An easy way to visualise the properties of multiple distributions beside each other
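A minimal plotting sketch (not from the slides), drawing side-by-side boxplots of two simulated groups with matplotlib's default Tukey-style whiskers:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=200)   # invented example data
group_b = rng.normal(loc=6.0, scale=2.0, size=200)

plt.boxplot([group_a, group_b])      # whiskers reach the last point within 1.5 * IQR
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("measurement")
plt.show()
```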
Visualisation: Cumulative Distribution Function
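A minimal sketch (not from the slides) of an empirical cumulative distribution function (ECDF), the kind of curve such a figure shows, on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.0, scale=1.0, size=100)   # invented example data

x = np.sort(sample)
y = np.arange(1, len(x) + 1) / len(x)   # fraction of data <= each value

plt.step(x, y, where="post")
plt.xlabel("value")
plt.ylabel("cumulative fraction of data")
plt.show()
```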
How does this CDF differ?
Hypothesis testing: define your question
– Bad: “Is this significant?”
– You need to compare against a model; usually that model is random chance
– Good: “Does this data set differ from this other set by more than random chance would explain?”
Hypothesis testing: test a hypothesis, NOT a result
– Bad: “Gene XYZ is the most expressed in our data set, is it significant?”
– It is OK to get a hypothesis to test from eye-balling the data, but define it on a biological concept, not a cherry-picked data point
– OK as a hypothesis to build and test: “the cold shock protein cspC is the most expressed gene; does this experiment enrich for cold shock proteins?”
– OK only if there are enough REPLICATES
Null Hypothesis
Statistical Power: plot of sampling distributions
Hypothesis testing: 'Bayesian' analysis
– The model you test against is not random chance
– Instead, 'priors' exist: prior knowledge of the system
– Examples: the 3-envelope puzzle; odds at racing
Understanding Results
What is the likelihood that someone who has tested positive in a breast cancer screen actually has the disease? Given:
– 1% prevalence in the population
– 90% sensitivity (90% chance of detection for a true cancer patient)
– 9% false positive rate (chance of a positive result in someone without the disease)
Is it: nine in 10? eight in 10? one in 10? one in 100?
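A worked calculation (not from the slides) applying Bayes' theorem to the numbers above, assuming the 9% figure is the false positive rate:

```python
prevalence = 0.01       # P(disease)
sensitivity = 0.90      # P(positive | disease)
false_positive = 0.09   # P(positive | no disease)

# Total probability of a positive test:
# P(positive) = P(positive | disease) * P(disease) + P(positive | no disease) * P(no disease)
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")   # ~0.092, about one in 10
```

Despite the 90% sensitivity, a positive screen corresponds to only about a one-in-ten chance of disease, because the disease is rare and false positives dominate.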
Hypothesis testing
– Test whether the data are parametric by testing them against the normal distribution, e.g. with a Shapiro-Wilk or Anderson-Darling test
– Question: are samples A and B different?
– Null hypothesis: the differences between A and B arise from random chance; the test gives the likelihood of the observed differences under that hypothesis
– You are testing ONE hypothesis. If it does not pass, the inverse question is not necessarily true
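A minimal sketch (not from the slides) of a normality check with the Shapiro-Wilk test on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample_a = rng.normal(loc=5.0, scale=1.0, size=30)   # invented example data

w_stat, p_value = stats.shapiro(sample_a)

# A small p-value suggests the data deviate from a normal distribution,
# in which case a non-parametric test is the safer choice.
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")
```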
Testing 2 groups
– If normally distributed (parametric):
  – t-test, or analysis of variance (ANOVA) for more than two groups
  – The most powerful tests to use, but the data MUST resemble the parametric distribution
– If non-parametric:
  – KS (Kolmogorov-Smirnov) test (Q-Q testing)
  – Mann-Whitney (rank sum) test
  – Chi-squared test
  – Fisher's exact test if the numbers are small
– Test whether the data are parametric by testing them against the normal distribution, e.g. with a Shapiro-Wilk or Anderson-Darling test
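A minimal sketch (not from the slides) comparing two simulated groups with a parametric test (t-test) and a non-parametric alternative (Mann-Whitney rank sum):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)   # invented example data
group_b = rng.normal(loc=5.8, scale=1.0, size=30)

t_stat, t_p = stats.ttest_ind(group_a, group_b)       # parametric: assumes normality
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)    # non-parametric: rank based

print(f"t-test:       p = {t_p:.4f}")
print(f"Mann-Whitney: p = {u_p:.4f}")
```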
P-values: multiple testing
P-values: Correlation & Causation
Confounding Factor
[Diagram: an untested confounding factor Z influences both X (the mutagen) and Y (the sample response), producing an apparent relationship between X and Y]
Replicates
– Your statement about your data is limited by what you tested by replication
  – It may be significant, but for different reasons than you think
– Replicates show the noise in the system: but which system?
  – Technical replicates: repeats of each experimental unit, capturing variation in machine, pipetting, temperature, etc.
  – Biological replicates: changes in what you are examining, from person to person, cell to cell, growth condition to growth condition
Blocked Experimental Design
– Record as much metadata as possible
– Ensure variations are randomised between treatments as far as possible
– Important sources of variation should be blocked separately: arrange experimental units into groups that are similar to each other
– e.g. rather than 2 groups (WT vs mutant mice), have 4 groups (2 × 2) by blocking males and females separately
Define the Number of Biological Repeats
Proper Controls and Regression towards the mean
What is wrong with the following?
– Hypothesis: drug X reduces cell growth
– Take the 10% fastest-growing cells in a culture
– Measure the growth rate before adding the drug (control)
– Add drug X
– Measure the growth rate after adding the drug (sample)
– Compare the growth rates
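A minimal simulation (not from the slides) of the problem: because the fastest-growing 10% were selected partly on measurement noise, their growth rate drops on re-measurement even with no drug at all (regression towards the mean):

```python
import numpy as np

rng = np.random.default_rng(5)

true_rate = rng.normal(loc=1.0, scale=0.1, size=10_000)               # underlying growth rates
measured_1 = true_rate + rng.normal(scale=0.2, size=true_rate.size)   # first noisy measurement
measured_2 = true_rate + rng.normal(scale=0.2, size=true_rate.size)   # second noisy measurement

top10 = measured_1 >= np.quantile(measured_1, 0.9)   # select the 'fastest' 10% from measurement 1

print(f"before (selected top 10%): {measured_1[top10].mean():.3f}")
print(f"after  (same cells):       {measured_2[top10].mean():.3f}")
# The 'after' mean is lower even though nothing was added: an untreated control group is needed.
```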
Discuss the problems with each of these
– “We used a t-test to compare our samples”
– “These genes are the most highly expressed in my experiment: this must be significant”
– “No significant difference between these samples therefore the samples are the same”
– “Yes I have replicates, I ran the same sample 3 times”
– “We ignored those points, they are obviously wrong!”
– “X and Y are related as the p-value is 1e-168!”
– “I need you to show this data is significant”