Statistics for genomics

Slides:



Advertisements
Similar presentations
Design of Experiments Lecture I
Advertisements

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall.
Multivariate Methods Pattern Recognition and Hypothesis Testing.
A Bayesian hierarchical modeling approach to reconstructing past climates David Hirst Norwegian Computing Center.
Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
New Methods in Ecology Complex statistical tests, and why we should be cautious!
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
1 1 Slide © 2003 South-Western/Thomson Learning™ Slides Prepared by JOHN S. LOUCKS St. Edward’s University.
Chapter 12 Section 1 Inference for Linear Regression.
Discriminant Analysis Testing latent variables as predictors of groups.
1 Doing Statistics for Business Doing Statistics for Business Data, Inference, and Decision Making Marilyn K. Pelosi Theresa M. Sandifer Chapter 11 Regression.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 12 Analyzing the Association Between Quantitative Variables: Regression Analysis Section.
Data Collection & Processing Hand Grip Strength P textbook.
RNAseq analyses -- methods
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Introduction to machine learning and data mining 1 iCSC2014, Juan López González, University of Oviedo Introduction to machine learning Juan López González.
Introduction to Statistics Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.
1 Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting Authors: A. Dupuy and R.M. Simon.
Microarray data analysis David A. McClellan, Ph.D. Introduction to Bioinformatics Brigham Young University Dept. Integrative Biology.
Chapter 4 Linear Regression 1. Introduction Managerial decisions are often based on the relationship between two or more variables. For example, after.
PCB 3043L - General Ecology Data Analysis. OUTLINE Organizing an ecological study Basic sampling terminology Statistical analysis of data –Why use statistics?
© Copyright McGraw-Hill Correlation and Regression CHAPTER 10.
MATH – High School Common Core Vs Kansas Standards.
Blind Information Processing: Microarray Data Hyejin Kim, Dukhee KimSeungjin Choi Department of Computer Science and Engineering, Department of Chemical.
Chapter 15 The Chi-Square Statistic: Tests for Goodness of Fit and Independence PowerPoint Lecture Slides Essentials of Statistics for the Behavioral.
Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.
Data Mining and Decision Support
Data Analytics CMIS Short Course part II Day 1 Part 1: Introduction Sam Buttrey December 2015.
What is Science? SECTION 1.1. What Is Science and Is Not  Scientific ideas are open to testing, discussion, and revision  Science is an organize way.
Chapter 13 Understanding research results: statistical inference.
1 Statistics & R, TiP, 2011/12 Multivariate Methods  Multivariate data  Data display  Principal component analysis Unsupervised learning technique 
Class Seven Turn In: Chapter 18: 32, 34, 36 Chapter 19: 26, 34, 44 Quiz 3 For Class Eight: Chapter 20: 18, 20, 24 Chapter 22: 34, 36 Read Chapters 23 &
Methods of multivariate analysis Ing. Jozef Palkovič, PhD.
Advanced Data Analytics
Unsupervised Learning
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Exploring Microarray data
Applied Biostatistics: Lecture 2
Inference for Regression
PCB 3043L - General Ecology Data Analysis.
Understanding Results
CHAPTER 11 Inference for Distributions of Categorical Data
Basic machine learning background with Python scikit-learn
12 Inferential Analysis.
Machine Learning Ali Ghodsi Department of Statistics
CHAPTER 10 Correlation and Regression (Objectives)
Computer Science & Engineering Department University of Connecticut
Review of Hypothesis Testing
Statistical Inference about Regression
CHAPTER 11 Inference for Distributions of Categorical Data
Residuals and Residual Plots
Dimension reduction : PCA and Clustering
12 Inferential Analysis.
CHAPTER 11 Inference for Distributions of Categorical Data
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
CHAPTER 11 Inference for Distributions of Categorical Data
Multivariate Methods Berlin Chen
CHAPTER 11 Inference for Distributions of Categorical Data
Multivariate Methods Berlin Chen, 2005 References:
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Analyzing and Interpreting Quantitative Data
CHAPTER 11 Inference for Distributions of Categorical Data
Hypothesis Testing - Chi Square
Unsupervised Learning
Differential Expression of RNA-Seq Data
Presentation transcript:

Statistics for genomics Mayo-Illinois Computational Genomics Course June 11, 2019 Dave Zhao Department of Statistics University of Illinois at Urbana-Champaign

Preparation install.packages(c("Seurat", "glmnet", "ranger", "caret")) Download sample GSM2818521 of GSE109158 from https://urlzs.com/7UNr6

Objective

We will illustrate how to identify the appropriate statistical method for a genomics analysis Statistical toolbox Statistical method

Classifying statistical tools Data structure No dependent variables Continuous outcome Censored outcomes Etc. Visualize Identify latent factors Cluster observations Select features Statistical task APPROPRIATE STATISTICAL METHODS

Examples from basic statistics

Examples from basic statistics Rosner (2015)

Increased uncertainty Increased scale Distinctive features of genomic data can require specialized statistical tools to handle: Increased complexity Increased uncertainty Increased scale Wu, Chen et al. (2011)

Increased uncertainty Increased scale Distinctive features of genomic data can require specialized statistical tools to handle: Increased complexity Increased uncertainty Increased scale New technology Wu, Chen et al. (2011)

Increased uncertainty Increased scale Many variables Distinctive features of genomic data can require specialized statistical tools to handle: Increased complexity Increased uncertainty Increased scale Wu, Chen et al. (2011)

Increased uncertainty Increased scale Distinctive features of genomic data can require specialized statistical tools to handle: Increased complexity Increased uncertainty Increased scale Few samples Wu, Chen et al. (2011)

Increased uncertainty Increased scale Distinctive features of genomic data can require specialized statistical tools to handle: Increased complexity Increased uncertainty Increased scale High heterogeneity Wu, Chen et al. (2011)

We will discuss some statistical tools that are frequently used in genomics We will also discuss a general framework for organizing new tools.

Choosing the right tool for a given biological question requires creativity and experience Statistical toolbox Statistical method

There are other frameworks that are organized by biological question rather than statistical tool

Example analysis using R

Use R packages to perform data analysis

To illustrate these tools, we will analyze single-cell RNA-seq data in R

Anatomy of a basic R command pandey = read.table("GSM2818521_larva_counts_matrix.txt") Case-sensitive Function(): performs pre-programmed calculations given inputs and options Variable: stores values and outputs of function, name cannot contain whitespace and cannot start with a special character

Basic preprocessing of single-cell RNA-seq data using Seurat library(Seurat) s_obj = CreateSeuratObject(counts = pandey, min.cells = 3, min.features = 200) s_obj = NormalizeData(s_obj) s_obj = FindVariableFeatures(s_obj) s_obj = ScaleData(s_obj)

Statistical methods for genomics

Where statistics appears in a standard genomic analysis workflow Not covered today Experimental design Quality control Preprocessing Normalization and batch correction Analysis Biological interpretation Uses statistics but is highly dependent on technology The focus of today’s discussion

Classifying statistical tools Data structure No dependent variables Continuous outcome Censored outcomes Etc. Visualize Identify latent factors Cluster observations Select features Statistical task APPROPRIATE STATISTICAL METHODS

Classifying statistical tasks Supervised Testing Plots Description Inference Estimation Tables Prediction Unsupervised Latent factors Latent groups

Classifying data structures Can vary widely, and classification is difficult Important factors in genomics: Data type Number of samples relative to number of variables

PCA > dim(pandey) [1] 24105 4365 > pandey[1:3, 1:2] larvalR2_AAACCTGAGACAGAGA.1 larvalR2_AAACCTGAGACTTTCG.1 SYN3 0 0 PTPRO 1 0 EPS8 0 0 Research question: Can the gene expression information be summarized in fewer features?

PCA Supervised Testing Plots Description Inference Estimation Tables Prediction Statistical task: calculate a (usually small) set of latent factors that captures most of the information in the dataset Data structure: can be applied to all data structures Unsupervised Latent factors Latent groups

PCA ISL Chapter 6

PCA using Seurat s_obj = RunPCA(s_obj)

Graph clustering Research question: How many cell types exist in the larval zebrafish habenula?

Graph clustering Supervised Testing Plots Description Inference Estimation Tables Prediction Statistical task: construct latent groups into which the observations fall Data structure: relatively small number of features and complicated cluster structure Unsupervised Latent factors Latent groups

Graph clustering

Comparison to other clustering methods Distance Hierarchical clustering K-means clustering vs. Relationship Spectral clustering Graph clustering http://webpages.uncc.edu/jfan/itcs4122.html

Graph clustering using Seurat s_obj = FindNeighbors(s_obj, dims = 1:10) s_obj = FindClusters(s_obj, dims = 1:10, resolution = 0.1)

t-SNE plot Research question: How to visualize the different cell types?

t-SNE plot Statistical task: visualize observations in low dimensions Supervised Testing Plots Description Inference Estimation Tables Prediction Statistical task: visualize observations in low dimensions Data structure: can be applied to all data structures Unsupervised Latent factors Latent groups

t-SNE plot https://en.wikipedia.org/wiki/Spring_system

t-SNE plot s_obj = RunTSNE(s_obj) DimPlot(s_obj, reduction = "tsne", label = TRUE)

Wilcoxon test Research question: Does the expression of the gene PDYN differ between clusters 5 and 6?

Wilcoxon test Supervised Testing Plots Description Inference Estimation Tables Prediction Statistical task: test whether there is an association between two variables Data structure: one variable is dichotomous Unsupervised Latent factors Latent groups

Wilcoxon test using Seurat markers = FindMarkers(s_obj, ident.1 = 5, ident.2 = 6) markers["PDYN",]

FDR control Research question: Which genes differ between clusters 5 and 6?

FDR control Supervised Testing Plots Description Inference Estimation Tables Prediction Statistical task: identify which of many hypothesis tests are truly significant Data structure: p-values are available and statistically independent Unsupervised Latent factors Latent groups

FDR control FDR = False Discovery Rate = expected value of # false discoveries total # of discoveries Statistical methods reject the largest number of hypothesis tests while maintaining FDR ≤𝛼, for some preset 𝛼

FDR control using Seurat markers = FindMarkers(s_obj, ident.1 = 5, ident.2 = 6) head(markers) sum(markers$p_val_adj <= 0.05)

Random forest classification Research question: Given the principal components of the RNA-seq expression values of all genes from a new cell, how can we determine the cell’s type? ?

Random forest classification Supervised Testing Plots Description Inference Estimation Tables Prediction Statistical task: learn prediction rule between dependent and independent variables Data structure: one categorical dependent variable, relatively few independent variables, complicated relationship Unsupervised Latent factors Latent groups

Random forest classification Independent variables Prediction ISL Chapter 8

Random forest classification using caret and ranger library(caret) pcs = Embeddings(s_obj, reduction = "pca")[-(1:2), 1:10] class = as.factor(Idents(s_obj))[-(1:2)] dataset = data.frame(class, pcs) rf_fit = train(class ~ ., data = dataset, method = "ranger", trControl = trainControl(method = "cv", number = 3)) new_pcs = Embeddings(s_obj, reduction = "pca")[1:2, 1:10] predict(rf_fit, new_pcs) Idents(s_obj)[1:2]

Lasso Research question: If we can only measure the expression of 10 genes in a new cell, which should we measure in order to most accurately predict the cell’s type? ?

Lasso Supervised Testing Plots Description Inference Estimation Tables Prediction Statistical task: identify a small set of independent variables that are useful for predicting dependent variables Data structure: simple (linear) prediction rule, dependent variables are associated with only a few (unknown) independent variables Unsupervised Latent factors Latent groups

Lasso Estimates coefficients in the regression model 𝑌= 𝛽 0 + 𝑋 1 𝛽 1 +…+ 𝑋 𝑝 𝛽 𝑝

Lasso using glmnet library(glmnet) counts = t(GetAssayData(s_obj, slot = "counts"))[-(1:2),] pcs = Embeddings(s_obj, reduction = "pca")[-(1:2), 1:10] lasso = cv.glmnet(counts, pcs, family = "mgaussian", nfolds = 3) lambda = min(lasso$lambda[lasso$nzero <= 10]) coefs = coef(lasso, s = lambda) rownames(coefs$PC_1)[which(coefs$PC_1 != 0)] new_counts = t(GetAssayData(s_obj, slot = "counts"))[1:2,] new_pcs = predict(lasso, newx = new_counts, s = lambda)[,, 1] predict(rf_fit, new_pcs) Idents(s_obj)[1:2]

Conclusions

Statistical toolbox Supervised Testing Plots Description Inference Estimation Tables Prediction Unsupervised Latent factors Latent groups

Statistical tools PCA Graph clustering t-SNE plot Wilcoxon test FDR control Random forest classification Lasso

Can be applied to single-cell RNA-seq and beyond

To learn more Take systematic courses in basic statistics, statistical learning, and R/python Study recently published papers in your field of interest that use your technology of interest Consult tutorials, workshops, lab mates, and Google Thank you