Bioinformatics for biologists


1 Bioinformatics for biologists
Dr. Habil Zare, PhD
PI of Oncinfo Lab
Assistant Professor, Department of Computer Science, Texas State University
Presented at University of Texas Health Science Center – San Antonio, 20 November 2015

2 Part 2
- R
- Differential expression analysis (DESeq2)
- Finding gene homologs (Ensembl Biomart)

3 R
R provides an excellent environment for biological data analysis. Many R packages facilitate bioinformatics and statistical analysis. For example, DESeq2 is an R package that identifies differentially expressed genes; edgeR is an alternative.
Bioinformatics for biologists workshop, Dr. Habil Zare, Oncinfo Lab UTHSC San Antonio, 20 Nov 2015

4 How to start R in BioLinux?
Open a terminal, type “R”, and press “Enter”. The R prompt (“>”) indicates that R is ready to take commands.

5 Functions in R
In R, most tasks are done by calling a function. E.g., the date() function prints the current date and time. Type “date()” and press “Enter”. To quit an R session, call the q() function. For now, we do NOT save our workspace.

6 Getting help within R
Put a “?” in front of a function name and press “Enter” to read the corresponding help document. For example, “?q” provides information on the function q(). The arguments are the inputs to the function. Press “q” to leave the help page.
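As a small illustration of functions and their arguments, the sketch below calls date() and lists the arguments of q() without actually quitting. The variable names are only illustrative.

```r
# Calling a function: date() takes no arguments and returns a character string.
now <- date()
print(now)

# The arguments of q() are its inputs; formals() lists them with their defaults.
q_args <- names(formals(q))
print(q_args)

# To quit without saving the workspace, one would call (not run here):
# q(save = "no")
```

The save argument listed here is the one that controls whether the workspace is saved on exit.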

7 Installing DESeq2 package
The simplest way to install DESeq2 is to open an R session and copy and paste the following (the address passed to source() is truncated in this transcript):
    source('
    biocLite('DESeq2')
For now, we do NOT want to update other packages. When asked, type “n” and then “Enter”.
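On recent Bioconductor releases, biocLite() has been superseded by the BiocManager package. The following is a minimal sketch under that assumption, not the procedure used in the workshop; install_deseq2 is a hypothetical helper name, and running it needs internet access.

```r
# Hypothetical helper: installs DESeq2 through BiocManager (needs internet access).
install_deseq2 <- function() {
  # Install BiocManager from CRAN only if it is not already available.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # update = FALSE mirrors the advice above not to update other packages.
  BiocManager::install("DESeq2", update = FALSE)
}
```

Calling install_deseq2() in an R session would then perform the installation; the function is only defined here, not run.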

8 Differentially expressed genes

9 Differentially expressed genes
These genes are relatively more highly expressed in AML than in MDS.

10 Differentially expressed genes
How can we identify such genes?

11 DESeq2
Tests for differential expression based on the negative binomial distribution.
Input: count data for 2 different conditions (number of mapped reads from RNA-Seq)
Output: p-value for the null hypothesis that the distribution of the counts is identical in the two conditions
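To give a feel for why a negative binomial distribution is used for counts, the sketch below simulates counts with the same mean from a negative binomial and from a Poisson distribution and compares their variances. The mean and size values are arbitrary assumptions chosen for illustration; this is not DESeq2's actual fitting procedure.

```r
set.seed(1)
mu   <- 100   # assumed mean count
size <- 5     # assumed size parameter; smaller values mean more dispersion

nb   <- rnbinom(10000, size = size, mu = mu)   # negative binomial counts
pois <- rpois(10000, lambda = mu)              # Poisson counts, same mean

# For the negative binomial the variance is mu + mu^2 / size,
# which is well above the mean; for the Poisson it equals the mean.
theoretical_var <- mu + mu^2 / size

print(c(mean = mean(nb), var = var(nb)))
print(c(mean = mean(pois), var = var(pois)))
print(theoretical_var)
```

The extra variance is what lets the negative binomial accommodate replicate-to-replicate variability that a Poisson model would understate.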

12 Are you in the right directory?
Before you start, make sure you are in the correct directory. The pwd command in Linux shows the current directory: type “pwd” and press “Enter” to see your current path.
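The same check can be made from inside R. A minimal sketch, assuming the count file sits under the current directory at the relative path used when loading the data in this walkthrough:

```r
# getwd() is R's counterpart of the Linux pwd command.
wd <- getwd()
print(wd)

# FALSE here means the file is missing or R was started in another directory.
found <- file.exists(file.path("sample_data", "pasilla_gene_counts.tsv"))
print(found)

# setwd("/path/to/workshop") would change the directory; that path is hypothetical.
```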

13 Loading the data in R
Use the read.table() function to read count data into R. E.g., copy the following lines to the R session to read the pasilla data:
    datafile <- 'sample_data/pasilla_gene_counts.tsv'
    pasillaCountTable <- read.table(datafile, header=TRUE, row.names=1)
pasillaCountTable is a data table (a data frame). Use dim() to get its size (dimensions): it has 14,599 rows and 7 columns.
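To see what header=TRUE and row.names=1 do without needing the workshop file, here is a small self-contained sketch that reads a made-up tab-separated table from a string. The gene names and counts are invented for illustration.

```r
# A tiny tab-separated table with made-up counts; the first column holds gene IDs.
tsv_text <- paste(
  "gene_id\tuntreated1\ttreated1",
  "geneA\t0\t1",
  "geneB\t92\t140",
  "geneC\t5\t4",
  sep = "\n"
)

# header=TRUE uses the first line as column names;
# row.names=1 turns the first column into row names instead of data.
toyCounts <- read.table(text = tsv_text, header = TRUE, row.names = 1, sep = "\t")

print(dim(toyCounts))    # number of rows (genes) and columns (samples)
print(class(toyCounts))  # read.table() returns a data frame
```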

14 pasilla data
The splicing factor pasilla (NOVA1 and NOVA2 ortholog) was knocked down in Drosophila cell cultures.

15 Loading the data in R
Let’s look at the first rows of data.

16 Loading the data in R
Let’s look at the first rows of data. The row labels are the names of the first 6 rows (genes).

17 Loading the data in R
Each column corresponds to a sample, identified by its column name.

18 Loading the data in R
We can access any specific element of the data by its row and column.
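The commands shown on the original slides did not survive in this transcript. The sketch below shows one common way to do these inspection steps in base R, using a made-up table so that it runs on its own; with the workshop data, pasillaCountTable would take the place of toyCounts.

```r
# Toy count table with made-up values (gene and sample names are hypothetical).
toyCounts <- data.frame(
  untreated1 = c(0, 92, 5, 0, 4664, 583, 0, 10),
  treated1   = c(0, 140, 4, 1, 8714, 761, 0, 11),
  row.names  = paste0("gene", 1:8)
)

print(head(toyCounts))            # first 6 rows by default
print(head(rownames(toyCounts)))  # names of the first 6 rows (genes)
print(colnames(toyCounts))        # one name per sample

# Access a single element by row and column, using names or positions.
by_name  <- toyCounts["gene2", "treated1"]
by_index <- toyCounts[2, 2]
print(by_name)
```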

19 Open the file with LibreOffice Calc
Double-click on pasilla_gene_counts.tsv. The default settings are fine. Click on “OK”.

20 LibreOffice Calc
It is an open-source alternative to Excel.

21 LibreOffice Calc
The values are the same as in the table read into R.

22 Selecting a subset of data
According to the meta-data (not shown here), only 4 out of the 7 available samples are paired-end. For simplicity, we restrict our analysis to these 4 samples. It is easy to do so in R:
    countTable <- pasillaCountTable[,c('untreated3', 'untreated4', 'treated2', 'treated3')]
Equivalently, copy and paste the following in R:
    countTable <- pasillaCountTable[,c(3,4,6,7)]
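The equivalence of selecting columns by name and by position can be checked on a small made-up table. The column order below is an assumption chosen to be consistent with the indices 3, 4, 6, 7 above; the counts are invented.

```r
# Toy table: 2 genes x 7 samples with made-up counts.
sampleNames <- c("untreated1", "untreated2", "untreated3", "untreated4",
                 "treated1", "treated2", "treated3")
toyTable <- as.data.frame(
  matrix(1:14, nrow = 2, dimnames = list(c("geneA", "geneB"), sampleNames))
)

# Select the same four samples by name and by column position.
byName  <- toyTable[, c("untreated3", "untreated4", "treated2", "treated3")]
byIndex <- toyTable[, c(3, 4, 6, 7)]

same <- identical(byName, byIndex)
print(same)
print(dim(byName))  # same number of rows, fewer columns
```

Selecting by name is usually the safer choice, since it keeps working if the column order of the input file changes.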

23 Selecting a subset of data
The selected data has the same number of rows (genes) but a smaller number of columns (samples).

24 Defining the conditions
We store the meta-data on the biological condition of the samples in a “data frame”:
    metaData <- data.frame(row.names=colnames(countTable), condition=c('untr', 'untr', 'trea', 'trea'))
Type “metaData” to make sure it has the correct information.
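A variation on this step, offered as a sketch rather than as the workshop's method: storing the condition as a factor with explicit levels. R orders factor levels alphabetically by default ('trea' before 'untr'), and DESeq2 treats the first level as the reference, so naming the levels explicitly makes 'untr' the baseline. The toy countTable below is a made-up stand-in so that the example runs on its own.

```r
# Toy stand-in for countTable: only the column (sample) names matter here.
countTable <- data.frame(untreated3 = 1:2, untreated4 = 3:4,
                         treated2 = 5:6, treated3 = 7:8,
                         row.names = c("geneA", "geneB"))

# Explicit levels make 'untr' the first (reference) level.
metaData <- data.frame(
  row.names = colnames(countTable),
  condition = factor(c("untr", "untr", "trea", "trea"), levels = c("untr", "trea"))
)
print(metaData)

# Each row of metaData should describe the matching column of countTable.
stopifnot(identical(rownames(metaData), colnames(countTable)))
```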

25 Computing p-values
Run the following:
    library(DESeq2)
    dds <- DESeqDataSetFromMatrix(countData=countTable, colData=metaData, design= ~ condition)
    dds <- DESeq(dds)
    res <- results(dds)

26 Computing p-values
library(DESeq2) loads the DESeq2 package into R.

27 Computing p-values
DESeqDataSetFromMatrix() converts the data into the appropriate format for DESeq2.

28 Computing p-values
If you have difficulty typing “~” in BioLinux on Windows, try the “Page Down” key.

29 Computing p-values
DESeq() computes the p-values.

30 Computing p-values
results() extracts the results.

31 Understanding DESeq2 results
“res” has 6 columns (baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) and one row per gene.

32 Understanding DESeq2 results
The last column (padj) is the most interesting one because it reports the adjusted p-values.

33 Distribution of p-values
hist(res[,'padj']) plots the histogram of the adjusted p-values.

34 Cumulative distribution function
Cumulative distribution functions (CDFs) are generally preferred to histograms.
    plot(ecdf(res[,'padj']))
ecdf() computes the empirical CDF.

35 Cumulative distribution function
    abline(v=0.01,col='red',lty=2)
abline() adds a line to the plot; here, a dashed red vertical line at 0.01.
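To make the CDF plot easier to read: the height of the curve where the vertical line at 0.01 crosses it is the fraction of genes whose adjusted p-value is at or below 0.01. A minimal sketch with made-up p-values (no plotting, so it runs anywhere):

```r
# Made-up adjusted p-values for six genes.
p <- c(0.001, 0.004, 0.02, 0.2, 0.5, 0.9)

# ecdf() returns a function: Fn(t) is the fraction of values <= t.
Fn <- ecdf(p)
at_cutoff <- Fn(0.01)
print(at_cutoff)

# The same fraction computed directly.
direct <- mean(p <= 0.01)
print(direct)
```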

36 Which p-values are significant?
    which(res[,'padj']<10^(-9))
Provides the indices of the genes with adjusted p-value < 10^(-9).

37 The differentially expressed genes
    inds <- which(res[,'padj']< 10^(-9))
    de <- rownames(res)[inds]
Remember that the rows of the input matrix countTable were named based on the FlyBase gene IDs.
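The two lines above can be tried on a small made-up table. One detail worth seeing is that adjusted p-values can be NA in DESeq2 output, and which() simply skips those entries. The gene names and values below are invented.

```r
# Toy results table; NA mimics a gene with no adjusted p-value reported.
toyRes <- data.frame(
  padj = c(1e-12, 0.3, NA, 1e-10),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

flags <- toyRes[, "padj"] < 10^(-9)   # TRUE, FALSE, NA, TRUE
inds  <- which(flags)                 # which() keeps only the TRUE positions
de    <- rownames(toyRes)[inds]

print(inds)
print(de)
```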

38 Saving the results
    write.csv(de,file='de.csv',row.names=FALSE)
Saves the de genes in Comma Separated Values (CSV) format.
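It may help to see exactly what write.csv() puts in the file. When given a plain character vector, it writes a header line "x" before the values, which is presumably the row removed in the cleaning step of this walkthrough. The sketch below writes to temporary files with made-up gene names; writeLines() is shown as an alternative that writes no header, not as the workshop's method.

```r
de <- c("geneA", "geneD")   # made-up gene names

# write.csv() adds a header line and quotes each value.
csv_path <- tempfile(fileext = ".csv")
write.csv(de, file = csv_path, row.names = FALSE)
csv_lines <- readLines(csv_path)
print(csv_lines)

# writeLines() writes one value per line, with no header and no quotes.
txt_path <- tempfile(fileext = ".txt")
writeLines(de, txt_path)
txt_lines <- readLines(txt_path)
print(txt_lines)
```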

39 Looking at the saved results
Double-click on the de.csv file you just saved.

40 Looking at the saved results
LibreOffice Calc shows the contents of the CSV file.

41 Cleaning the saved results
Delete the first row (the header added by write.csv(), which is not a gene ID), then save the file.

42 Saving the cleaned file
Save the cleaned file.

43 Applause
You are done with identifying the differentially expressed genes.

44 Finding gene homologs
Use Biomart to find the human homologs of the de genes.

45 Ensembl Biomart
Select the genes database. Choose the Drosophila species.

46 Uploading data to Ensembl Biomart
Click on Filters. Open the GENE section.

47 Uploading data to Ensembl Biomart
Choose FlyBase Gene IDs. Upload the de.csv file that you just created.

48 Getting results from Ensembl Biomart
Click on Attributes. Click on Homologs. Optionally, uncheck Ensembl Transcript ID. Open the ORTHOLOGS section.

49 Getting results from Ensembl Biomart
Scroll down and check Human Ensembl Gene ID.

50 Getting results from Ensembl Biomart
Click on Results at the top.

51 Saving results from Ensembl Biomart
Choose the XLS file format and check “Unique results only”.

52 Saving results from Ensembl Biomart
Download the results by clicking on “GO”.

53 Results from Ensembl Biomart
Open the mart_export.xls file that you just downloaded. Copy the Human Ensembl Gene IDs from the corresponding column. No homologs were found for some genes.

54 References
- DESeq2 manual: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
- Huber W and Reyes A. pasilla: Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down, Genome Research. R package.
I prepared these guidelines to facilitate the “Bioinformatics for biologists workshop”, Nov 2015, UTHSC – San Antonio.

