Bioinformatics for biologists Dr. Habil Zare, PhD PI of Oncinfo Lab Assistant Professor, Department of Computer Science Texas State University Presented at University of Texas, Health Science Center – San Antonio 20 November 2015
Part 2 - R - Differential expression analysis (DESeq2) - Finding gene homologs (Ensembl Biomart)
R: R provides an excellent environment for biological data analysis. Many R packages facilitate bioinformatics and statistical analysis. For example, DESeq2 is an R package that identifies differentially expressed genes; edgeR is an alternative. Bioinformatics for biologists workshop, Dr. Habil Zare, Oncinfo Lab UTHSC San Antonio, 20 Nov 2015
How to start R in BioLinux? Open a terminal, type “R”, and press “Enter”. The “>” prompt indicates that R is ready to take commands.
Functions in R: In R, most processes are done by calling a function. E.g., the date() function prints the current date: type “date()” and press “Enter”. To quit an R session, call the q() function. For now, we do NOT save our workspace.
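A few more function calls may help to see the pattern. This is only a small sketch; sqrt() and round() are standard R functions chosen here for illustration and are not part of the workshop material.

```r
# A function call is the function name followed by parentheses.
date()                      # no arguments: returns the current date and time

# Arguments (the inputs) go inside the parentheses.
sqrt(16)                    # one argument
round(3.14159, digits = 2)  # one positional argument and one named argument

# The result of a call can be stored in a variable with "<-".
x <- sqrt(16)
x
```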
Getting help within R: Put a “?” in front of a function name and press “Enter” to read the corresponding help document. For example, “?q” provides information on the function q(). The arguments are the inputs to the function. Press “q” to leave the help page.
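The arguments of a function can also be listed without opening the full help page. The sketch below uses the standard functions args() and formals(); the calls that open a help page or end the session are left as comments so that the example can be pasted safely.

```r
# ?q           # opens the help page for q() (interactive use)
# help("q")    # the same, written as a regular function call

# Show the arguments of q() and their default values.
args(q)

# The argument names alone.
q_arguments <- names(formals(q))
q_arguments

# The "save" argument controls whether the workspace is saved on exit.
# q(save = "no")   # would quit R without saving (not run here)
```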
Installing DESeq2 package: The simplest way to install DESeq2 is to open an R session and copy and paste the following:
    source('https://bioconductor.org/biocLite.R')
    biocLite('DESeq2')
For now, we do NOT want to update other packages. When asked, type “n” and then “Enter”.
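Note for newer installations: from Bioconductor 3.8 onward, the biocLite() script was replaced by the BiocManager package. If the commands above fail on a recent version of R, the following sketch is the usual replacement. It assumes an internet connection; update = FALSE plays the role of answering “n” to the update question.

```r
# Install the BiocManager helper package from CRAN if it is missing.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install DESeq2 from Bioconductor without updating other packages.
BiocManager::install("DESeq2", update = FALSE)
```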
Differentially expressed genes: These genes are relatively more expressed in AML compared to MDS. How to identify such genes?
DESeq2: Tests for differential expression based on the negative binomial distribution.
Input: count data for 2 different conditions (number of mapped reads from RNA-Seq).
Output: p-value for the null hypothesis that the distribution of the counts is identical in the two conditions.
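Why the negative binomial distribution? Read counts from biological replicates usually vary more than a Poisson distribution allows. The simulation below is only an illustration with made-up values for the mean and the dispersion; it is not how DESeq2 estimates these quantities.

```r
# Simulate read counts for one gene across many samples.
set.seed(1)
mu <- 100          # mean count (illustrative value)
dispersion <- 0.1  # dispersion (illustrative value)

# rnbinom() is parameterised by size = 1 / dispersion.
nb_counts   <- rnbinom(10000, mu = mu, size = 1 / dispersion)
pois_counts <- rpois(10000, lambda = mu)

# Theoretical variance of the negative binomial: mu + dispersion * mu^2
nb_theoretical_var <- mu + dispersion * mu^2

cat("Poisson:           mean", mean(pois_counts), "variance", var(pois_counts), "\n")
cat("Negative binomial: mean", mean(nb_counts), "variance", var(nb_counts), "\n")
cat("Theoretical negative binomial variance:", nb_theoretical_var, "\n")
```

Both simulations have about the same mean, but the negative binomial counts are much more spread out, which is closer to what is seen in real RNA-Seq replicates.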
Are you in the right directory? Before you start, make sure you are in the correct directory. The pwd command in Linux shows the current directory. Typing “pwd” and then “Enter” will show your current path.
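The same check can be done from inside R. A minimal sketch, assuming the data file sits at the relative path used later in this workshop:

```r
# Print the directory R is currently working in (the R counterpart of pwd).
getwd()

# Check that the data file used in the next step is reachable from here.
datafile <- 'sample_data/pasilla_gene_counts.tsv'
if (file.exists(datafile)) {
  message("Found ", datafile)
} else {
  message("Cannot find ", datafile,
          "; use setwd() to move to the folder that contains sample_data.")
}
```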
Loading the data in R: Use the read.table() function to read count data in R. E.g., copy the following lines to the R session to read the Pasilla data:
    datafile <- 'sample_data/pasilla_gene_counts.tsv'
    pasillaCountTable <- read.table(datafile, header=TRUE, row.names=1)
pasillaCountTable is a data table (a data frame, organized like a matrix). Use dim() to get its size (dimensions): it has 14,599 rows and 7 columns.
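To see what header=TRUE and row.names=1 do, it can help to try read.table() on a tiny file first. The sketch below writes a three-gene file to a temporary location; the gene and sample names are made up for illustration.

```r
# Write a tiny tab-separated file to a temporary location.
toy_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "gene_id\tuntreated1\ttreated1",
  "geneA\t10\t25",
  "geneB\t0\t3",
  "geneC\t120\t98"
), toy_file)

# header = TRUE: the first line holds the column names.
# row.names = 1: the first column holds the row names (gene IDs).
toy_counts <- read.table(toy_file, header = TRUE, row.names = 1)

dim(toy_counts)    # number of rows (genes) and columns (samples)
class(toy_counts)  # read.table() returns a data frame
toy_counts
```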
pasilla data: The splicing factor pasilla (NOVA1 and NOVA2 ortholog) was knocked down in Drosophila cell cultures.
Loading the data in R: Let’s look at the first rows of data and at the names of the first 6 rows (genes). Each column corresponds to a sample, labeled by its column name. We can access any specific element of the data by its row and column.
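The steps above can be sketched on a small made-up table. The functions head(), rownames() and colnames() are standard R; the table itself and its names are invented for illustration, and the same calls should work on pasillaCountTable.

```r
# A made-up count table with 7 genes and 2 samples.
toy_counts <- data.frame(
  untreated1 = c(10, 0, 120, 7, 33, 54, 2),
  treated1   = c(25, 3, 98, 9, 41, 60, 0),
  row.names  = paste0("gene", 1:7)
)

head(toy_counts)            # the first 6 rows of the table
head(rownames(toy_counts))  # names of the first 6 rows (genes)
colnames(toy_counts)        # column names (samples)

# Access one element, by name or by position: [row, column].
toy_counts["gene3", "treated1"]
toy_counts[3, 2]
```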
Open the file with LibreOffice Calc: Double click on pasilla_gene_counts.tsv. The default settings are fine. Click on “OK”.
LibreOffice Calc: It is an open-source alternative to Excel. The values are the same as in the table read in R.
Selecting a subset of data: According to the meta-data (not shown here), only 4 out of the 7 available samples are paired-end. For simplicity, we restrict our analysis to these 4 samples. It is easy to do so in R:
    countTable <- pasillaCountTable[,c('untreated3', 'untreated4', 'treated2', 'treated3')]
Equivalently, copy and paste the following in R to select the same columns by position:
    countTable <- pasillaCountTable[,c(3,4,6,7)]
The selected data has the same number of rows (genes) but a smaller number of columns (samples).
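The equivalence of the two selections can be checked on a small stand-in table. The sketch assumes the seven samples are ordered as listed below; the counts are placeholders, not real data.

```r
# Assumed column order of the full table.
sample_names <- c('untreated1', 'untreated2', 'untreated3', 'untreated4',
                  'treated1', 'treated2', 'treated3')

# A stand-in table with 3 genes and placeholder counts.
toy <- as.data.frame(matrix(1:21, nrow = 3,
                            dimnames = list(paste0("gene", 1:3), sample_names)))

by_name     <- toy[, c('untreated3', 'untreated4', 'treated2', 'treated3')]
by_position <- toy[, c(3, 4, 6, 7)]

identical(by_name, by_position)  # do both selections give the same table?
dim(toy)                         # size before selecting
dim(by_name)                     # same rows, fewer columns
```

Selecting by name is usually the safer choice, because it keeps working if the column order of the file changes.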
Defining the conditions: We store the meta-data on the biological condition of the samples in a “data frame”:
    metaData <- data.frame(row.names=colnames(countTable), condition=c('untr', 'untr', 'trea', 'trea'))
Type “metaData” to make sure it has the correct information.
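One detail that may matter when reading fold changes later: DESeq2 compares against the first level of the condition factor, and R orders levels alphabetically by default, so 'trea' would be the reference here. The sketch below is an optional variant, not part of the workshop commands, that makes 'untr' the reference explicitly.

```r
sample_names <- c('untreated3', 'untreated4', 'treated2', 'treated3')

# Levels are sorted alphabetically unless stated otherwise.
condition <- factor(c('untr', 'untr', 'trea', 'trea'))
levels(condition)

# Make the untreated samples the reference level.
condition <- relevel(condition, ref = 'untr')
levels(condition)

metaData <- data.frame(row.names = sample_names, condition = condition)
metaData
```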
Computing p-values: Run the following:
    library(DESeq2)   # Loads the DESeq2 package into R.
    dds <- DESeqDataSetFromMatrix(countData=countTable, colData=metaData, design= ~ condition)   # Converts the data into the appropriate format for DESeq2.
    dds <- DESeq(dds)   # Computes p-values.
    res <- results(dds)   # Extracts the results.
If you have difficulty typing “~” in BioLinux on Windows, try the “Page Down” key.
Understanding DESeq2 results: “res” has 6 columns and one row per gene. The last column is the most interesting one because it reports the adjusted p-values.
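The layout of the results can also be explored on simulated counts, without the Pasilla files. The sketch below assumes DESeq2 is installed and uses its makeExampleDESeqDataSet() helper; the sizes n = 200 genes and m = 4 samples are arbitrary choices for a quick run.

```r
library(DESeq2)

set.seed(1)
# Simulated data set: 200 genes, 4 samples in two conditions.
dds <- makeExampleDESeqDataSet(n = 200, m = 4)
dds <- DESeq(dds)
res <- results(dds)

colnames(res)  # names of the result columns; the last one is 'padj'
dim(res)       # one row per gene

# Genes sorted from the smallest to the largest adjusted p-value.
head(res[order(res$padj), ])
```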
Distribution of p-values:
    hist(res[,'padj'])
plots the histogram of the adjusted p-values.
Cumulative distribution function: Cumulative distribution functions (CDFs) are generally preferred to histograms, in part because they do not depend on a choice of bin width.
    plot(ecdf(res[,'padj']))
ecdf() computes the empirical CDF.
    abline(v=0.01,col='red',lty=2)
abline() adds a line to the plot; here, a dashed red vertical line at 0.01.
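How to read the CDF plot: the height of the curve at the red line is the fraction of genes with an adjusted p-value of at most 0.01. The sketch below shows this on simulated p-values; the mixture of 900 "null" and 100 "signal" genes is invented, and p.adjust() with the Benjamini-Hochberg method stands in for the adjustment that DESeq2 reports in 'padj'.

```r
set.seed(1)
# Simulated raw p-values: 900 genes without signal (uniform) and 100 with signal.
p_raw <- c(runif(900), rbeta(100, 0.5, 20))
p_adj <- p.adjust(p_raw, method = "BH")  # Benjamini-Hochberg adjustment

cdf <- ecdf(p_adj)  # ecdf() returns a function
plot(cdf, main = "ECDF of adjusted p-values", xlab = "adjusted p-value")
abline(v = 0.01, col = 'red', lty = 2)

# Height of the curve at 0.01 equals the fraction of genes at or below 0.01.
cdf(0.01)
mean(p_adj <= 0.01)
```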
Which p-values are significant?
    which(res[,'padj']<10^(-9))
provides the indices of the genes whose adjusted p-value is below the threshold used in the command.
The differentially expressed genes:
    inds <- which(res[,'padj']< 10^(-9))
    de <- rownames(res)[inds]
Remember that the rows of the input matrix countTable were named based on the Flybase gene IDs.
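A reason to use which() here: DESeq2 can report NA as the adjusted p-value of some genes, and which() silently skips those. The sketch below uses four invented genes to show the difference from plain logical indexing.

```r
# Made-up adjusted p-values; geneB has no value (NA).
padj <- c(geneA = 1e-12, geneB = NA, geneC = 0.3, geneD = 5e-10)

# which() keeps only the positions where the comparison is TRUE.
inds <- which(padj < 10^(-9))
de <- names(padj)[inds]
de

# Without which(), the NA is carried into the result.
names(padj)[padj < 10^(-9)]
```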
Saving the results:
    write.csv(de,file='de.csv',row.names=FALSE)
saves the de genes in Comma Separated Values (CSV) format.
Looking at the saved results: Double click on the de.csv file you just saved. LibreOffice Calc shows the contents of the CSV file.
Cleaning the saved results: Delete the row, then save the cleaned file.
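The manual cleaning can probably be skipped by writing the file without a header. When write.csv() saves a plain list of names, it adds a header line (“x”) at the top, which is presumably the row deleted by hand above. The sketch below uses write.table() instead; the gene IDs and the file name are placeholders.

```r
de <- c("geneA", "geneD")  # placeholder IDs, not real results

# Hypothetical output file in a temporary folder.
out_file <- file.path(tempdir(), "de_ids.csv")

# col.names = FALSE drops the header line; quote = FALSE drops the quotes.
write.table(de, file = out_file, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# The file now contains one gene ID per line and nothing else.
readLines(out_file)
```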
Applause: You are done with identifying the differentially expressed genes.
Finding gene homologs: Use Biomart to find the homologs of the de genes in human. http://www.ensembl.org/biomart/
Ensembl Biomart: Select the genes database. Choose the Drosophila species.
Uploading data to Ensembl Biomart: Click on Filters. Open the GENE section. Choose Flybase Gene IDs. Upload the de.csv file that you just created.
Getting results from Ensembl Biomart: Click on Attributes. Click on Homologs. Optionally, uncheck Ensembl Transcript ID. Open the ORTHOLOGS section. Scroll down and check Human Ensembl Gene ID. Click on Results at the top.
Saving results from Ensembl Biomart: Choose the XLS file format and check “Unique results only”. Download the results by clicking on “GO”.
Results from Ensembl Biomart: Open the mart_export.xls file that you just downloaded. Copy the Human Ensembl Gene IDs provided in the corresponding column. No homologs were found for some genes.
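The same lookup can, in principle, be done from R with the Bioconductor package biomaRt. The sketch below is not part of the workshop and is only a starting point: the function name find_human_orthologs is made up, and the dataset, filter and attribute names are assumptions that may differ between Ensembl releases. They can be checked with listDatasets(), listFilters() and listAttributes() from biomaRt.

```r
# Requires the biomaRt package and an internet connection when called.
find_human_orthologs <- function(fly_gene_ids) {
  # Connect to the Ensembl genes database for Drosophila melanogaster
  # (dataset name is an assumption; verify with biomaRt::listDatasets()).
  mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")

  # Ask for the fly gene ID and the matching human gene ID
  # (filter and attribute names are assumptions; verify with
  # biomaRt::listFilters() and biomaRt::listAttributes()).
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "hsapiens_homolog_ensembl_gene"),
    filters    = "ensembl_gene_id",
    values     = fly_gene_ids,
    mart       = mart
  )
}

# Usage (not run here):
# orthologs <- find_human_orthologs(de)
```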
References:
DESeq2 manual: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
Huber W and Reyes A. pasilla: Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down, Genome Research 2011. R package version 0.10.0.
I prepared these guidelines to facilitate the “Bioinformatics for biologists workshop”, 20 Nov 2015, UTHSC – San Antonio. http://oncinfo.org/Bioinformatics+for+biologist+workshop
Installing BioLinux using VM, Dr. Habil Zare 27 Oct 2015