Bioinformatics for biologists

Dr. Habil Zare, PhD, PI of Oncinfo Lab, Assistant Professor, Department of Computer Science, Texas State University.
Presented at the University of Texas Health Science Center at San Antonio, 20 November 2015.

Part 2: R; differential expression analysis (DESeq2); finding gene homologs (Ensembl Biomart)

R
R provides an excellent environment for biological data analysis. There are many useful R packages that facilitate bioinformatics and statistical analysis. For example, DESeq2 is an R package that identifies differentially expressed genes. edgeR is an alternative.
Bioinformatics for biologists workshop, Dr. Habil Zare, Oncinfo Lab, UTHSC San Antonio, 20 Nov 2015

How to start R in BioLinux?
Open a terminal, type “R”, and press “Enter”. The R prompt (>) then appears, indicating that R is ready to take commands.

Functions in R
In R, most tasks are done by calling a function. E.g., the date() function prints the current date and time: type “date()” and press “Enter”. To quit an R session, call the q() function. For now, we do NOT save our workspace (answer “n” when R asks).

Getting help within R
Put a “?” in front of a function name and press “Enter” to read the corresponding help document. E.g., “?q” provides information on the function q(). The arguments are the inputs to the function. Press “q” to leave the help page.

Installing DESeq2 package
The simplest way to install DESeq2 is to open an R session and copy and paste the following (note that R is case-sensitive, so the function is source(), not Source()):
    source('https://bioconductor.org/biocLite.R')
    biocLite('DESeq2')
For now, we do NOT want to update other packages. When asked, type “n” and then “Enter”.
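On recent R and Bioconductor releases, the biocLite.R script has been replaced by the BiocManager package, so the commands above may fail there. The following is a minimal sketch of the equivalent installation, assuming a newer R than the one shipped with the workshop's BioLinux image. The wrapper name install_deseq2 is made up for illustration.

```r
# Sketch: install DESeq2 through BiocManager (for newer R versions).
# install_deseq2 is a hypothetical helper name, not part of any package.
install_deseq2 <- function() {
  # Install BiocManager from CRAN only if it is missing.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # update = FALSE mirrors the "do NOT update other packages" advice.
  BiocManager::install("DESeq2", update = FALSE)
}

# Uncomment to run the installation (needs an internet connection):
# install_deseq2()
```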

Differentially expressed genes
Some genes are relatively more expressed in AML compared to MDS. How to identify such genes?

DESeq2
Tests for differential expression based on the negative binomial distribution.
Input: count data for 2 different conditions (number of mapped reads from RNA-Seq).
Output: the p-value of the null hypothesis that the distribution of the counts is identical in the two conditions.
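To give a feel for why a negative binomial model is used for read counts, the sketch below compares simulated Poisson and negative binomial counts with the same mean. The negative binomial allows the variance to exceed the mean, which replicate RNA-Seq counts typically do. The values of mu and size are arbitrary choices for illustration, not estimates from any data set.

```r
# Sketch: negative binomial counts are more variable than Poisson counts
# with the same mean. mu and size are arbitrary illustrative values.
set.seed(1)
mu   <- 100   # mean count
size <- 5     # smaller size means more extra variability

pois <- rpois(10000, lambda = mu)
nb   <- rnbinom(10000, mu = mu, size = size)

# Sample variances next to the theoretical negative binomial variance.
c(poisson_var = var(pois),
  negbin_var  = var(nb),
  negbin_theory = mu + mu^2 / size)
```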

Are you in the right directory?
Before you start, make sure you are in the correct directory. The pwd command in Linux shows the current directory: typing “pwd” and then “Enter” will show your current path.
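The same check can be made from inside an R session, which is useful because relative paths such as 'sample_data/...' are resolved against R's working directory. A small sketch using base R follows; the path in the commented setwd() call is a placeholder, not a real workshop path.

```r
# Sketch: check the working directory from within R.
current_dir <- getwd()
print(current_dir)

# TRUE only if a sample_data folder is visible from the current directory.
file.exists(file.path(current_dir, "sample_data"))

# To move elsewhere, use setwd() with your own path (placeholder below):
# setwd("path/to/workshop/folder")
```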

Loading the data in R
Use the read.table() function to read count data into R. E.g., copy the following lines to the R session to read the pasilla data:
    datafile <- 'sample_data/pasilla_gene_counts.tsv'
    pasillaCountTable <- read.table(datafile, header=TRUE, row.names=1)
pasillaCountTable is a data table (a data frame). Use dim() to get its size (dimensions): it has 14,599 rows and 7 columns.
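To see what header=TRUE and row.names=1 do without needing the workshop file, here is a self-contained sketch that writes a tiny tab-separated file to a temporary location and reads it back. The gene and sample names are invented for illustration.

```r
# Sketch: read a tiny, made-up count table the same way as the pasilla file.
toy_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "gene_id\tuntreated1\ttreated1",
  "geneA\t10\t25",
  "geneB\t0\t3",
  "geneC\t7\t7"
), toy_file)

# header=TRUE: the first line holds the column names.
# row.names=1: the first column holds the row (gene) names.
toyCounts <- read.table(toy_file, header = TRUE, row.names = 1)

dim(toyCounts)     # number of rows (genes) and columns (samples)
class(toyCounts)   # read.table() returns a data frame
```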

pasilla data
The splicing factor pasilla (NOVA1 and NOVA2 ortholog) was knocked down in Drosophila cell cultures.

Loading the data in R
Let’s look at the first rows of the data, including the names of the first 6 rows (genes). Each column corresponds to a sample, and the column names identify the samples. We can access any specific element of the data by its row and column.
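The commands for these inspection steps appeared only as screenshots on the slides. The sketch below shows one common way to do each step in base R on a small invented table; with the workshop data, the same calls would be applied to pasillaCountTable.

```r
# Sketch: inspect a small, made-up count table.
toyCounts <- data.frame(
  untreated1 = c(10, 0, 7, 3, 52, 1, 8),
  treated1   = c(25, 3, 7, 0, 40, 2, 9),
  row.names  = paste0("gene", 1:7)
)

head(toyCounts)            # the first 6 rows of the table
rownames(toyCounts)[1:6]   # names of the first 6 rows (genes)
colnames(toyCounts)        # column names (samples)

# Access one element, by names or by row and column index.
toyCounts["gene2", "treated1"]
toyCounts[2, 2]
```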

Open the file with LibreOffice Calc
Double-click on pasilla_gene_counts.tsv. The default settings are fine; click on “OK”. LibreOffice Calc is an open-source alternative to Excel. The values it shows are the same as in the table read into R.

Selecting a subset of data
According to the metadata (not shown here), only 4 of the 7 available samples are paired-end. For simplicity, we restrict our analysis to these 4 samples. It is easy to do so in R by selecting the columns by name:
    countTable <- pasillaCountTable[, c('untreated3', 'untreated4', 'treated2', 'treated3')]
Equivalently, select the same columns by index:
    countTable <- pasillaCountTable[, c(3,4,6,7)]
The selected data has the same number of rows (genes) but a smaller number of columns (samples).
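The equivalence of selecting columns by name and by position can be checked on a small example. This is a sketch with invented data; it only illustrates the indexing, not the pasilla table itself.

```r
# Sketch: selecting columns by name and by index gives the same table.
toy <- data.frame(
  untreated1 = c(1, 2),
  untreated2 = c(3, 4),
  treated1   = c(5, 6),
  row.names  = c("geneA", "geneB")
)

by_name  <- toy[, c("untreated2", "treated1")]
by_index <- toy[, c(2, 3)]

identical(by_name, by_index)   # are the two selections the same?
dim(toy)                       # rows and columns before selection
dim(by_name)                   # same rows, fewer columns
```

Selecting by name is usually the safer habit, since it keeps working if the column order of the input file changes.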

Defining the conditions
We store the metadata on the biological condition of the samples in a “data frame”:
    metaData <- data.frame(row.names=colnames(countTable), condition=c('untr', 'untr', 'trea', 'trea'))
Type “metaData” to make sure it has the correct information.
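Before handing the metadata to DESeq2, it can help to confirm that its row names line up with the sample columns and to make the untreated group the first factor level, which R treats as the reference in a design such as ~ condition. The sketch below uses a plain character vector in place of colnames(countTable) so that it runs on its own; declaring the factor levels explicitly is a suggestion, not something the slides do.

```r
# Sketch: build the metadata with an explicit reference level and check it.
# sampleNames stands in for colnames(countTable).
sampleNames <- c("untreated3", "untreated4", "treated2", "treated3")

metaData <- data.frame(
  row.names = sampleNames,
  condition = factor(c("untr", "untr", "trea", "trea"),
                     levels = c("untr", "trea"))
)

# Row names of the metadata should match the sample names, in order.
stopifnot(identical(rownames(metaData), sampleNames))

levels(metaData$condition)   # the first level acts as the reference
table(metaData$condition)    # number of samples per condition
```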

Computing p-values
Run the following:
    # Load the DESeq2 package into R.
    library(DESeq2)
    # Convert the data into the appropriate format for DESeq2.
    dds <- DESeqDataSetFromMatrix(countData=countTable, colData=metaData, design = ~ condition)
    # Compute p-values.
    dds <- DESeq(dds)
    # Extract the results.
    res <- results(dds)
If you have difficulty typing “~” in BioLinux on Windows, try the “Page Down” key.

Understanding DESeq2 results
“res” has 6 columns and one row per gene. The last column (padj) is the most interesting one because it reports the adjusted p-values.
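Adjusted p-values correct for the fact that thousands of genes are tested at once. By default, DESeq2 reports Benjamini-Hochberg adjusted values, and the same adjustment is available in base R through p.adjust(). The sketch below applies it to a handful of invented raw p-values to show that adjustment never makes a p-value smaller.

```r
# Sketch: Benjamini-Hochberg adjustment on made-up raw p-values.
raw_p <- c(0.0001, 0.004, 0.03, 0.2, 0.9)
adj_p <- p.adjust(raw_p, method = "BH")

# Raw and adjusted values side by side.
data.frame(raw_p, adj_p)
```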

Distribution of p-values
    hist(res[,'padj'])
plots the histogram of the adjusted p-values.

Cumulative distribution function
Cumulative distribution functions (CDFs) are generally preferred to histograms.
    plot(ecdf(res[,'padj']))
ecdf() computes the empirical CDF.
    abline(v=0.01,col='red',lty=2)
abline() adds a line to the plot; here, a dashed red vertical line at 0.01.
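The plotting calls can be tried on simulated values when no DESeq2 result is at hand. In this sketch, toy_padj is an invented vector standing in for res[,'padj'], and the plots are sent to a null PDF device so that the code also runs without a display. In an interactive session, drop the pdf(NULL) and dev.off() lines to see the figures.

```r
# Sketch: histogram and empirical CDF of made-up p-values.
set.seed(42)
toy_padj <- c(runif(900), rbeta(100, 0.5, 50))  # mostly uniform, some small

pdf(NULL)                       # null device: nothing is written to disk
hist(toy_padj, breaks = 20)     # histogram of the values
toy_cdf <- ecdf(toy_padj)       # ecdf() returns a function
plot(toy_cdf)                   # plot the empirical CDF
abline(v = 0.01, col = "red", lty = 2)  # dashed red line at 0.01
invisible(dev.off())

# The CDF evaluated at 0.01 is the fraction of values at or below 0.01.
toy_cdf(0.01)
mean(toy_padj <= 0.01)
```

Reading the CDF at a threshold gives the fraction of genes below it directly, which is harder to judge from histogram bars.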

Which p-values are significant?
    which(res[,'padj'] < 10^(-9))
provides the indices of the genes with adjusted p-value < 10^(-9).

The differentially expressed genes
    inds <- which(res[,'padj'] < 10^(-9))
    de <- rownames(res)[inds]
Remember that the rows of the input matrix countTable were named based on the FlyBase gene IDs.
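One detail worth knowing is that DESeq2 can report NA in the padj column for some genes (for example, genes removed by its filtering step). which() skips NA values, so the selection above still works. The sketch below shows this on a small named vector with invented values.

```r
# Sketch: which() ignores NA values when selecting significant genes.
toy_padj <- c(geneA = 1e-12, geneB = NA, geneC = 0.3, geneD = 5e-10)

inds <- which(toy_padj < 10^(-9))   # positions that pass the threshold
de   <- names(toy_padj)[inds]       # corresponding gene names
de
```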

Saving the results
    write.csv(de,file='de.csv',row.names=FALSE)
saves the de genes in Comma Separated Values (CSV) format.

Looking at the saved results
Double-click on the de.csv file you just saved. LibreOffice Calc shows the content of the CSV file.

Cleaning the saved results
Delete the first row, which holds a column header rather than a gene ID.

Saving the cleaned file
Save the cleaned file using the save button in LibreOffice Calc.
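The extra row exists because write.csv() always writes a header line, which for a plain vector is "x". As an alternative to deleting it by hand, the IDs can be written one per line with writeLines(). The sketch below compares the two outputs on invented IDs and writes to a temporary folder; the file names are placeholders.

```r
# Sketch: write.csv() adds a header line, writeLines() does not.
de_toy <- c("FBgnEXAMPLE1", "FBgnEXAMPLE2")   # made-up gene IDs

out_csv <- file.path(tempdir(), "de_toy.csv")
write.csv(de_toy, file = out_csv, row.names = FALSE)
readLines(out_csv)   # the first line is the header, then the IDs

out_txt <- file.path(tempdir(), "de_toy.txt")
writeLines(de_toy, out_txt)
readLines(out_txt)   # only the IDs, one per line
```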

Applause
You are done with identifying the differentially expressed genes.

Finding gene homologs
Use Biomart to find the human homologs of the de genes.
http://www.ensembl.org/biomart/

Ensembl Biomart
Select the genes database. Choose the Drosophila species.

Uploading data to Ensembl Biomart
Click on Filters. Open the GENE section. Choose FlyBase Gene IDs and upload the de.csv file that you just created.

Getting results from Ensembl Biomart
Click on Attributes. Click on Homologs. Optionally, uncheck Ensembl Transcript ID. Open the ORTHOLOGS section. Scroll down and check Human Ensembl Gene ID. Click on Results at the top.

Saving results from Ensembl Biomart
Choose the XLS file format and check “Unique results only”. Download the results by clicking on “GO”.

Results from Ensembl Biomart
Open the mart_export.xls file that you just downloaded. Copy the Human Ensembl Gene IDs from the corresponding column. No homologs were found for some genes, so some cells in that column are empty.
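If the exported table is brought into R instead of being copied by hand, the genes without a homolog can be dropped with a simple filter. The sketch below works on an invented data frame; the column names fly_gene_id and human_gene_id and all IDs are hypothetical and will differ from the headers in the actual export.

```r
# Sketch: keep only rows that have a human homolog.
# Column names and IDs are made up for illustration.
mart_toy <- data.frame(
  fly_gene_id   = c("flyGene1", "flyGene2", "flyGene3"),
  human_gene_id = c("humanGeneA", "", "humanGeneB"),
  stringsAsFactors = FALSE
)

# A homolog is present when the cell is neither NA nor empty.
has_homolog <- !is.na(mart_toy$human_gene_id) & mart_toy$human_gene_id != ""

human_ids <- unique(mart_toy$human_gene_id[has_homolog])
human_ids
```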

References:
DESeq2 manual: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
Huber W and Reyes A. pasilla: Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down, Genome Research 2011. R package version 0.10.0.
I prepared these guidelines to facilitate the “Bioinformatics for biologists workshop”, 20 Nov 2015, UTHSC – San Antonio. http://oncinfo.org/Bioinformatics+for+biologist+workshop