Canadian Bioinformatics Workshops www.bioinformatics.ca.

Slides:



Advertisements
Similar presentations
Chapter 3, Numerical Descriptive Measures
Advertisements

Statistics 100 Lecture Set 6. Re-cap Last day, looked at a variety of plots For categorical variables, most useful plots were bar charts and pie charts.
Jan Shapes of distributions… “Statistics” for one quantitative variable… Mean and median Percentiles Standard deviations Transforming data… Rescale:
Chapter 6 The Normal Distribution and Other Continuous Distributions
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 6-1 Chapter 6 The Normal Distribution and Other Continuous Distributions.
Correlation and Regression Analysis
Business Statistics: A First Course, 5e © 2009 Prentice-Hall, Inc. Chap 6-1 Chapter 6 The Normal Distribution Business Statistics: A First Course 5 th.
Use of Quantile Functions in Data Analysis. In general, Quantile Functions (sometimes referred to as Inverse Density Functions or Percent Point Functions)
Inference for regression - Simple linear regression
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 12 Describing Data.
Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.
Chap 6-1 Copyright ©2013 Pearson Education, Inc. publishing as Prentice Hall Chapter 6 The Normal Distribution Business Statistics: A First Course 6 th.
B AD 6243: Applied Univariate Statistics Understanding Data and Data Distributions Professor Laku Chidambaram Price College of Business University of Oklahoma.
Summary statistics Using a single value to summarize some characteristic of a dataset. For example, the arithmetic mean (or average) is a summary statistic.
Essential Statistics in Biology: Getting the Numbers Right
June 11, 2008Stat Lecture 10 - Review1 Midterm review Chapters 1-5 Statistics Lecture 10.
MMSI – SATURDAY SESSION with Mr. Flynn. Describing patterns and departures from patterns (20%–30% of exam) Exploratory analysis of data makes use of graphical.
Chapter 3, Part B Descriptive Statistics: Numerical Measures n Measures of Distribution Shape, Relative Location, and Detecting Outliers n Exploratory.
Basic Business Statistics
Edpsy 511 Exploratory Data Analysis Homework 1: Due 9/19.
Sampling and estimation Petter Mostad
1 Probability: Introduction Definitions,Definitions, Laws of ProbabilityLaws of Probability Random VariablesRandom Variables DistributionsDistributions.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
24 Nov 2007Data Management and Exploratory Data Analysis 1 Exploratory Data Analysis Exploratory Data Analysis (EDA) is an Approach that Employs a Variety.
AP Statistics Review Day 1 Chapters 1-4. AP Exam Exploring Data accounts for 20%-30% of the material covered on the AP Exam. “Exploratory analysis of.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Statistics and probability Dr. Khaled Ismael Almghari Phone No:
Canadian Bioinformatics Workshops
Exploratory Data Analysis
Chapter 6 The Normal Distribution and Other Continuous Distributions
Descriptive Statistics ( )
Thursday, May 12, 2016 Report at 11:30 to Prairieview
Canadian Bioinformatics Workshops
Parameter, Statistic and Random Samples
Exploratory Data Analysis
Continuous Distributions
BAE 6520 Applied Environmental Statistics
Probability and Statistics for Computer Scientists Second Edition, By: Michael Baron Chapter 8: Introduction to Statistics CIS Computational Probability.
Normal Distribution and Parameter Estimation
BAE 5333 Applied Water Resources Statistics
Descriptive measures Capture the main 4 basic Ch.Ch. of the sample distribution: Central tendency Variability (variance) Skewness kurtosis.
Understanding and Comparing Distributions
Understanding and Comparing Distributions
Describing Distributions Numerically
STAT 206: Chapter 6 Normal Distribution.
Description of Data (Summary and Variability measures)
Checking Regression Model Assumptions
Understanding and Comparing Distributions
Numerical Descriptive Measures
Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine
Alafia river: Autocorrelation Autocorrelation of standardized flow.
Checking Regression Model Assumptions
Quartile Measures DCOVA
Types of (random) variables
Statistics for Managers Using Microsoft® Excel 5th Edition
Understanding and Comparing Distributions
The normal distribution
Regression Assumptions
Honors Statistics Review Chapters 4 - 5
Linear Regression and Correlation
Advanced data management
Lesson Plan Day 1 Lesson Plan Day 2 Lesson Plan Day 3
Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine
Regression Assumptions
The Normal Distribution
Presentation transcript:

Canadian Bioinformatics Workshops

2Module #: Title of Module

Module 2 Exploratory Data Analysis (EDA) Exploratory Data Analysis and Essential Statistics using R Boris Steipe Toronto, September 8– D EPARTMENT OF B IOCHEMISTRY D EPARTMENT OF M OLECULAR G ENETICS Includes material originally developed by Raphael Gottardo † † Odysseus listneing to the song of the sirens. Late Archaic (500–480 BCE)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca Exploratory Data Analysis (EDA) What is EDA? Basics of EDA: Boxplots, Histograms, Scatter plots,Transformations,QQ-plot Applications to microarray and flow cytometry data

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca what is EDA ? Statistical practice concerned with (among other things): uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop models Named by John Tukey Extremely Important

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA techniques Mostly graphical Plotting the raw data (histograms, scatterplots, etc.) Plotting simple statistics such as means, standard deviations, medians, box plots, etc Positioning such plots so as to maximize our natural pattern-recognition abilities A clear picture is worth a thousand words!

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca a few tips Avoid 3-D graphics Don't show too much information on the same graph (color, patterns, etc) Stay away from Excel, Excel is not a statistics package! R provides a great environment for EDA with good graphics capabilities.

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca examples of poor plots

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca examples of poor plots

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca examples of poor plots

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca examples of poor plots

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca probability distributions Can be either discrete or continuous (uniform, Bernoulli, normal, etc) Defined by a density function, p(x) or f(x) Bernoulli distribution Be(p): flip a coin (T=0, H=1). Assume probability of H is x <- 0:1 f <- dbinom(x, size=1, prob=0.1) plot(x, f, xlab="x", ylab="density", type="h", lwd=5)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca probability distributions Random sampling: Generate 100 observations from a Be(0.1) set.seed(100) x <- rbinom(100, size=1, prob=0.1) hist(x)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca probability distributions Normal distribution N(μ,σ 2 ) μ is the mean and σ 2 is the variance. Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small random variables, it will be normally distributed. x <- seq(-4, 4, 0.1) f <- dnorm(x, mean=0, sd=1) plot(x, f, xlab="x", ylab="density", lwd=5, type="l") The area under the curve is the probability of observing a value between 0 and 2.

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca probability distributions Random sampling: Generate 100 observations from a N(0,1) set.seed(100) x <- rnorm(100, mean=0, sd=1) hist(x) Histograms can be used to estimate densities!

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca quantiles (Theoretical) Quantiles: The p-quantile has the property that there is a probability p of getting a value less than or equal to it. The 50% quantile is called the median. 90% of the probability (area under the curve) is to the left of the red vertical line. q90 <- qnorm(0.90, mean = 0, sd = 1) x <- seq(-4, 4, 0.1) f <- dnorm(x, mean=0, sd=1) plot(x, f, xlab="x", ylab="density", type="l", lwd=5) abline(v=q90, col=2, lwd=5)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca descriptive statistics Empirical Quantiles: The p-quantile has the property that p% of the observations are less than or equal to it. Empirical quantiles can be easily obtained in R. > set.seed(100) > x <- rnorm(100, mean=0, sd=1) > quantile(x) 0% 25% 50% 75% 100% > quantile(x, probs=c(0.1, 0.2, 0.9)) 10% 20% 90%

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca descriptive statistics We often need to quickly 'quantify' a data set, and this can be done using a set of summary statistics (mean, median, variance, standard deviation). > mean(x) [1] > median(x) [1] > IQR(x) [1] > var(x) [1] > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca Box-plot Descriptive statistics can be intuitively summarized in a Box-plot. > boxplot(x) IQR 1.5 x IQR Everything above and below 1.5 x IQR is considered an "outlier". 75% quantile Median 25% quantile IQR = Inter Quantile Range = 75% quantile – 25% quantile

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca QQ–plot Many statistical methods make some assumption about the distribution of the data (e.g. Normal). The quantile-quantile plot provides a way to visually verify such assumptions. The QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the distribution assumed (theoretical one) is indeed the correct one, we should observe a straight line.

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca QQ–plot Only valid for the normal distribution! qqnorm(x) qqline(x, col=2)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca QQ–plot Clearly the t distribution with two degrees of freedom is not Normal. set.seed(100) t <- rt(100, df=2) qqnorm(t) qqline(t, col=2) x <- seq(-4, 4, 0.1) f1 <- dnorm(x, mean=0, sd=1) f2 <- dt(x, df=2) plot(x, f1, xlab="x", ylab="density", lwd=5, type="l") lines(x, f2, lwd=5, col=2) legend(-4,.4, c("Normal", "t2"), col=1:2, lwd=5)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca QQ–plot set.seed(101) generateVariates <- function(n) { Nvar < Vout <- c() for (i in 1:n) { x <- runif(Nvar, -0.01, 0.01) Vout <- c(Vout, sum(x) ) } return(Vout) } x <- generateVariates(1000) y <- rnorm(1000, mean=0, sd=1) qqplot(x, y) qqline(x, y, col=2) Verify the CLT.

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca QQ–plot Comparing two samples. set.seed(100) x <- rt(100, df=2) y <- rnorm(100, mean=0, sd=1) qqplot(x, y) Exercise: try different values of df for rt().

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca scatter plots Biological data sets often contain several variables, they are multivariate. Scatter plots allow us to look at two variables at a time. This can be used to assess independence. # GvHD flow cytometry data gvhd <- read.table("GvHD+.txt", header=TRUE) # Only extract the CD3 positive cells gvhdCD3p 280, 3:6]) plot(gvhdCD3p[, 1:2]) cor(gvhdCD3p[, 1], gvhdCD3p[, 2])

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca scatter plots vs. correlations Note that in this example, the correlation between CD8.B.PE and CD4.FITC is Correlation is only a good descriptor for linear dependence. # Quick comment on correlation set.seed(100) theta <- runif(1000, 0, 2*pi) x <- cos(theta) y <- sin(theta) plot(x, y) What is the correlation?

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca trellis graphics Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets. plot(gvhdCD3p, pch=".") Note that I have changed the plotting symbol. Many more possibilities in the "lattice" package

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA of flow-cytometric data The boxplot function can be used to display several variables at a time. boxplot(gvhdCD3p) Exercise: Interpret this plot.

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA of flow-cytometric data We discover a mix of distinct cell subpopulations. par(mfrow=c(2, 2)) hist(gvhdCD3p[, 1], 50, main=names(gvhdCD3p)[1], xlab="fluorescent intensity", ylim=c(0, 120)) hist(gvhdCD3p[, 2], 50, main=names(gvhdCD3p)[2], xlab="fluorescent intensity", ylim=c(0, 120)) hist(gvhdCD3p[, 3], 50, main=names(gvhdCD3p)[3], xlab="fluorescent intensity", ylim=c(0, 120)) hist(gvhdCD3p[, 4], 50, main=names(gvhdCD3p)[4], xlab="fluorescent intensity", ylim=c(0, 120))

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: HIV microarray data HIV Data: The expression levels of 7680 genes were measured in CD4-T-cell lines at time t = 24 hours after infection with HIV type 1 virus. 12 positive controls (HIV genes). 4 replicates (2 with a dye swap)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: HIV microarray data We get the data after the image analysis is done. For each spot we have an estimate of the intensity in both channels. The data matrix therefore is of size 7680 x 8. One of the array chips.

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: HIV microarray data data <- read.table(file="hiv.raw.data.24h.txt", sep="\t", header=TRUE) summary(data) boxplot(data)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: HIV microarray data par(mfrow=c(2, 2)) hist(data[, 1], 50, main=names(data)[1], xlab="fluorescent intensity") hist(data[, 2], 50, main=names(data)[2], xlab="fluorescent intensity") hist(data[, 5], 50, main=names(data)[5], xlab="fluorescent intensity") hist(data[, 6], 50, main=names(data)[6], xlab="fluorescent intensity") par(OldPar)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: HIV microarray data # 'apply' will apply the function to all rows of the data matrix mean <- apply(data[, 1:4], 1, "mean") sd <- apply(data[, 1:4], 1, "sd") plot(mean, sd) trend <- lowess(mean, sd) lines(trend, col=2, lwd=5) lowess fit LOcally WEighted Scatter plot Smoother; used to estimate the trend in a scatter plot. Non parametric!

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: Transformations Observations: The data are highly skewed. The standard deviation is not constant, as it increases with the mean. Solution: Look for a transformation that will make the data more symmetric and the variance more constant. With positive data the log transformation is often appropriate.

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: Transformations data <- log(read.table(file="hiv.raw.data.24h.txt", sep="\t", header=TRUE)) summary(data) boxplot(data)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: HIV microarray data par(mfrow=c(2, 2)) hist(data[, 1], 50, main=names(data)[1], xlab="fluorescent intensity") hist(data[, 2], 50, main=names(data)[2], xlab="fluorescent intensity") hist(data[, 5], 50, main=names(data)[5], xlab="fluorescent intensity") hist(data[, 6], 50, main=names(data)[6], xlab="fluorescent intensity") par(OldPar)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA: Transformations The standard deviation has become almost independent of the mean. # 'apply' will apply the function to all rows of the data matrix mean <- apply(data[, 1:4], 1, "mean") sd <- apply(data[, 1:4], 1, "sd") plot(mean, sd) trend <- lowess(mean, sd) lines(trend, col=2, lwd=5)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDAfor microarray data: always log Makes the data more symmetric, large observations are not as influential The variance is (more) constant Turns multiplication into addition (log(ab)=log(a)+log(b)) In practice use log base 2, log2(x)=log(x)/log(2)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA for gene expression Is this a good way to look at the data? # scatter plot plot(data[, 1], data[, 5], xlab=names(data)[1], ylab=names(data)[5])

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA for gene expression: MA plots M (minus) is the log ratio. A (average) is overall intensity. # MA plot A <- (data[, 1]+data[, 5])/2 M <- (data[, 1]-data[, 5]) plot(A, M, xlab="A", ylab="M")

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca EDA for gene expression: MA plots How do we find differentially expressed genes? # MA plots per replicate par(mfrow=c(2, 2)) A1 <- (data[, 1]+data[, 5])/2 M1 <- (data[, 1]-data[, 5]) plot(A1, M1, xlab="A", ylab="M", main="rep 1") trend <- lowess(A1, M1) lines(trend, col=2, lwd=5) A2 <- (data[, 2]+data[, 6])/2 M2 <- (data[, 2]-data[, 6]) plot(A2, M2, xlab="A", ylab="M", main="rep 2") trend <- lowess(A2, M2) lines(trend, col=2, lwd=5) A3 <- (data[, 3]+data[, 7])/2 M3 <- (data[, 3]-data[, 7]) plot(A3, M3, xlab="A", ylab="M", main="rep 3") trend <- lowess(A3, M3) lines(trend, col=2, lwd=5) A4 <- (data[, 4]+data[, 8])/2 M4 <- (data[, 4]-data[, 8]) plot(A4, M4, xlab="A", ylab="M", main="rep 4") trend <- lowess(A4, M4) lines(trend, col=2, lwd=5) par(OldPar)

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca Summary EDA should be the first step in any statistical analysis! Good modeling starts and ends with EDA. R provides a great framework for EDA.

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca We are on a Coffee Break & Networking Session