Differentially expressed genes Sample class prediction etc.

Biological question: differentially expressed genes, sample class prediction, etc.
Experimental design (Churchill, March 15)
Microarray experiment (Bult, Lecture 5)
Image analysis (Bult, Lecture 6)
Normalization (Hibbs, Lectures 10 and 11)
Estimation, testing, clustering, discrimination
Biological verification and interpretation (Blake, Lectures 16 and 17)
http://statwww.epfl.ch/davison/teaching/Microarrays/ETHZ/

Project Steps
Find and download array data
Normalize array data
Analyze data, i.e., generate gene lists (differentially expressed genes, genes in clusters, etc.)
Interpret gene lists
  Use the annotations of the genes in your lists
  Gene Ontology terms are available for many organisms, but not all

Getting The Data
Search GEO (or another repository) for a data set of interest.
Download the data files, e.g., Affy .CEL files, Affy .CDF files, etc.
Upload to home directory.

Normalize the Data
I sent you all a script (2/23/2012) to RMA-normalize the Ackerman array data available from my home directory.

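library(affy)
library(makecdfenv)
# Build a CDF environment from the array's CDF file
Array.CDF = make.cdf.env("MoGene-1_0-st-v1.cdf")
CELData = ReadAffy()
# Point the AffyBatch at the CDF environment created above
CELData@cdfName = "Array.CDF"
rma.CELData = rma(CELData)
rma.expr = exprs(rma.CELData)
rma.expr.df = data.frame(ProbeID=row.names(rma.expr), rma.expr)
# Write a tab-delimited table without row names or quotes
write.table(rma.expr.df, "rma.expr.dat", sep="\t", row.names=FALSE, quote=FALSE)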
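
RMA includes a quantile normalization step, which forces every array to share the same intensity distribution. The sketch below illustrates that one step on a made-up matrix using base R only. It is not the affy implementation: the function name quantile_normalize and the toy values are assumptions for illustration, and ties are broken by position rather than averaged.

```r
# Toy quantile normalization in base R (illustrative only; not the affy code).
quantile_normalize <- function(x) {
  # x: numeric matrix, rows = probes, columns = arrays
  # Rank of each value within its own array (ties broken by position)
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each array, then average across arrays at each rank
  sorted <- apply(x, 2, sort)
  target <- rowMeans(sorted)
  # Replace every value by the target value for its rank
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Hypothetical intensities: 4 probes measured on 3 arrays
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("probe", 1:4), c("A", "B", "C")))
toy_norm <- quantile_normalize(toy)
print(toy_norm)
```

After normalization every column contains the same set of values, and within each array the probes keep their original rank order.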

What is a library?
What does the ReadAffy() function do?
What are possible arguments for the ReadAffy() function?
What class of R object is rma.CELData?
What class of R object is rma.expr?
What class of R object is rma.expr.df?

slotNames(CELData)
phenoData(CELData)

This is what rma.expr.df looks like in Excel…

Plotting summarized probeset intensities across the Ackerman arrays (non-normalized)
jpeg("boxplot.jpeg")
boxplot(CELData, names=CELData$sample, col="blue")
dev.off()

Plotting summarized probeset intensities across the Ackerman arrays (normalized)
mydata = rma.expr.df
jpeg("normal_boxplot.jpg")
boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
dev.off()

Next time
Posted articles from Gary Churchill. If you only read one article, read Churchill 2004.
See also Gary’s web site: http://churchill.jax.org/software/rmaanova.shtml
  Look at Sample Data and Tutorial
After that lecture we will begin analysis of microarray data with MAANOVA.

[Figure: sequencing cost (cost per Kb) and throughput (gigabases). Lucinda Fulton, The Genome Center at Washington University]

Sequencing Technologies http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Sequence “Space”
Roche 454 – Flow space
  Measures pyrophosphate released by a nucleotide when it is added to a growing DNA chain
  Flow space describes sequence in terms of these base incorporations
  http://www.youtube.com/watch?v=bFNjxKHP8Jc
AB SOLiD – Color space
  Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
  Each base is sequenced twice
  http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
Illumina/Solexa – Base space
  Single-base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
  Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
  http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
GenomeTV – Next Generation Sequencing (lecture)
  http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html

“Standard” File formats
Sequence containers: FASTA, FASTQ, BAM/SAM
Alignments: BAM/SAM, MAF
Annotation: BED, GFF/GTF/GFF3, WIG
Variation: VCF, GVF

Tools
Alignments: BLAST (not for NGS), BWA, Bowtie, Maq, …
Transcriptomics: TopHat, Cufflinks, …
Variant calling: ssahaSNP, Mosaic, …
Counting (ChIP-Seq, etc.): FindPeaks, PeakSeq

FASTQ: Data Format
Text based
Encodes sequence calls and quality scores with ASCII characters
Stores minimal information about the sequence read
4 lines per sequence
  Line 1: begins with “@”, followed by the sequence identifier and an optional description
  Line 2: the sequence
  Line 3: begins with “+”, followed by the sequence identifier and description (both are optional)
  Line 4: encoding of quality scores for the sequence in line 2
References/Documentation
  http://maq.sourceforge.net/fastq.shtml
  Cock et al. (2009). Nuc Acids Res 38:1767-1771.
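
The four-line record layout above can be sketched in base R as follows. This is a minimal illustration, assuming unwrapped records (exactly four lines each); the function name parse_fastq and the two example reads are made up for this sketch and do not come from the lecture data.

```r
# Parse FASTQ text (4 lines per record) into a data frame; base R only.
parse_fastq <- function(lines) {
  stopifnot(length(lines) > 0, length(lines) %% 4 == 0)
  idx <- seq(1, length(lines), by = 4)   # index of line 1 of each record
  header <- lines[idx]
  plus <- lines[idx + 2]
  # Line 1 must start with "@" and line 3 with "+"
  stopifnot(all(startsWith(header, "@")), all(startsWith(plus, "+")))
  reads <- data.frame(
    id = sub("^@", "", header),          # identifier plus optional description
    sequence = lines[idx + 1],
    quality = lines[idx + 3],
    stringsAsFactors = FALSE
  )
  # One quality character is expected per base
  stopifnot(all(nchar(reads$sequence) == nchar(reads$quality)))
  reads
}

# Hypothetical two-read example (made-up identifiers and bases)
fastq_lines <- c(
  "@read1 example description",
  "GATTACA",
  "+",
  "IIIIHHG",
  "@read2",
  "ACGT",
  "+read2",
  "!!#5"
)
reads <- parse_fastq(fastq_lines)
print(reads)
```

The parser relies on line position rather than on the leading character, because a quality string may itself begin with “@”.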

FASTQ Example
For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina stores quality scores ranging from 0-62; Sanger quality scores range from 0-93. Solexa quality scores have to be converted to PHRED quality scores.
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
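
The conversions mentioned above can be sketched in base R. The ASCII offsets used here (33 for Sanger, 64 for Illumina 1.3+ and Solexa) and the Solexa-to-PHRED formula follow Cock et al.; the function names are assumptions made for this sketch, not part of any package.

```r
# Decode and re-encode ASCII quality strings (one string at a time).
decode_quality <- function(qual, offset) utf8ToInt(qual) - offset
encode_quality <- function(scores, offset) intToUtf8(scores + offset)

# Illumina 1.3+ FASTQ (PHRED scores, ASCII offset 64) to Sanger FASTQ (offset 33)
illumina_to_sanger <- function(qual) {
  scores <- decode_quality(qual, 64L)
  stopifnot(all(scores >= 0, scores <= 62))   # valid Illumina 1.3+ range
  encode_quality(scores, 33L)
}

# Solexa score to PHRED score: Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)
solexa_to_phred <- function(q_solexa) {
  round(10 * log10(10^(q_solexa / 10) + 1))
}

sanger_qual <- illumina_to_sanger("hhh@")     # "h" encodes 40, "@" encodes 0
phred <- solexa_to_phred(c(-5, 0, 40))
print(sanger_qual)
print(phred)
```

The two scales differ mostly at the low-quality end, so Solexa and PHRED scores are nearly identical for high-quality bases.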

SAM (Sequence Alignment/Map)
It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format.
SAM is the output of aligners that map reads to a reference genome
Tab delimited, with a header section and an alignment section
Header lines begin with “@” (the header section is optional)
Alignment section has 11 mandatory fields
BAM is the binary format of SAM
http://samtools.sourceforge.net/

Mandatory Alignment Fields http://samtools.sourceforge.net/SAM1.pdf
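
As a rough illustration of the 11 mandatory fields, the base R sketch below splits SAM alignment lines on tabs and names the columns. The field names follow the current SAM specification (older versions of the document call RNEXT, PNEXT and TLEN by the names MRNM, MPOS and ISIZE). The function name parse_sam and the example records are made up for this sketch.

```r
# Names of the 11 mandatory SAM alignment fields, in order
sam_fields <- c("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

parse_sam <- function(lines) {
  body <- lines[!startsWith(lines, "@")]            # drop header lines
  parts <- strsplit(body, "\t", fixed = TRUE)
  stopifnot(all(lengths(parts) >= 11))              # optional tags may follow
  mandatory <- do.call(rbind, lapply(parts, function(p) p[1:11]))
  aln <- as.data.frame(mandatory, stringsAsFactors = FALSE)
  names(aln) <- sam_fields
  for (col in c("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")) {
    aln[[col]] <- as.integer(aln[[col]])
  }
  # Decode two bits of the bitwise FLAG field
  aln$unmapped <- bitwAnd(aln$FLAG, 4L) != 0
  aln$reverse_strand <- bitwAnd(aln$FLAG, 16L) != 0
  aln
}

# Hypothetical SAM text: two header lines and two alignments
sam_lines <- c(
  "@HD\tVN:1.0\tSO:coordinate",
  "@SQ\tSN:chr1\tLN:1000",
  "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIHHG",
  "read2\t16\tchr1\t250\t30\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
)
aln <- parse_sam(sam_lines)
print(aln[, c("QNAME", "RNAME", "POS", "CIGAR", "reverse_strand")])
```

For real data, a dedicated tool such as samtools is the safer choice; this sketch only shows how the columns are laid out.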

Alignment Examples Alignments in SAM format http://samtools.sourceforge.net/SAM1.pdf

Valid BED files

chr1   86114265  86116346  nsv433165
chr2   1841774   1846089   nsv433166
chr16  2950446   2955264   nsv433167
chr17  14350387  14351933  nsv433168
chr17  32831694  32832761  nsv433169
chr17  32831694  32832761  nsv433170
chr18  61880550  61881930  nsv433171

chr1  16759829  16778548  chr1:21667704   270866  -
chr1  16763194  16784844  chr1:146691804  407277  +
chr1  16763194  16784844  chr1:144004664  408925  -
chr1  16763194  16779513  chr1:142857141  291416  -
chr1  16763194  16779513  chr1:143522082  293473  -
chr1  16763194  16778548  chr1:146844175  284555  -
chr1  16763194  16778548  chr1:147006260  284948  -
chr1  16763411  16784844  chr1:144747517  405362  +
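
A short base R sketch of reading a four-column BED file like the first example on this slide. BED start coordinates are 0-based and end coordinates are exclusive, so the feature length is simply end minus start. The column names follow the usual BED convention (chrom, chromStart, chromEnd, name); reading from an in-memory character vector instead of a file is a choice made for this sketch.

```r
# First three records of the four-column BED example, tab-separated
bed_text <- c(
  "chr1\t86114265\t86116346\tnsv433165",
  "chr2\t1841774\t1846089\tnsv433166",
  "chr16\t2950446\t2955264\tnsv433167"
)

bed <- read.table(text = bed_text, sep = "\t", header = FALSE,
                  col.names = c("chrom", "chromStart", "chromEnd", "name"),
                  stringsAsFactors = FALSE)

# Basic sanity checks: non-negative start, start before end
stopifnot(all(bed$chromStart >= 0), all(bed$chromStart < bed$chromEnd))

# 0-based, half-open intervals: length is end minus start
bed$length <- bed$chromEnd - bed$chromStart
print(bed)
```

To read an actual file, replace the text argument with the file path.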

Galaxy
http://main.g2.bx.psu.edu/
See Tutorial 1
Build and share data and analysis workflows
No programming experience required
Strong and growing development and user community

[Screenshot of the Galaxy interface. Panel labels: Tools, Dialog/Parameter Selection, History]

Tutorial Web Site
http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml
Tutorial 5

RNA Seq Workflow
Convert data to FASTQ
Upload files to Galaxy
Quality control
  Throw out low-quality sequence reads, etc.
Map reads to a reference genome
  Many algorithms available
  Trade-off between speed and sensitivity
Data summarization
  Associating alignments with genome annotations
  Counts
Data visualization
Statistical analysis
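
The data summarization step (associating alignments with annotations to obtain counts) can be illustrated with a deliberately simplified base R sketch. It assigns each read to a gene by its leftmost mapped position only, ignoring read length, strand and spliced alignments, so it is not a substitute for a real counting tool. The gene intervals, read positions and the function name count_reads are all hypothetical.

```r
# Hypothetical gene annotation (1-based, inclusive coordinates)
genes <- data.frame(
  gene = c("geneA", "geneB"),
  chrom = c("chr1", "chr1"),
  start = c(100, 500),
  end = c(300, 900),
  stringsAsFactors = FALSE
)

# Hypothetical leftmost mapped positions of four reads
mapped <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr2"),
  pos = c(120, 250, 600, 150),
  stringsAsFactors = FALSE
)

# Count the reads whose start position falls inside each gene interval
count_reads <- function(genes, mapped) {
  counts <- vapply(seq_len(nrow(genes)), function(i) {
    sum(mapped$chrom == genes$chrom[i] &
          mapped$pos >= genes$start[i] &
          mapped$pos <= genes$end[i])
  }, integer(1))
  setNames(counts, genes$gene)
}

counts <- count_reads(genes, mapped)
print(counts)
```

The resulting per-gene counts are the kind of table that feeds the visualization and statistical analysis steps.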

Typical RNA-Seq Project Work Flow
Tissue Sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon Expression → Visualization → Statistical Analysis
JAX Computational Sciences Service

TopHat
http://tophat.cbcb.umd.edu/
TopHat is better suited to aligning RNA-Seq data than general-purpose aligners such as Maq and BWA because it takes splicing into account during the alignment process.
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Trapnell et al. (2009). Bioinformatics 25:1105-1111.

TopHat is built on the Bowtie alignment algorithm.
The TopHat pipeline: RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside as initially unmapped (IUM) reads. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are indexed and aligned to these splice junction sequences.
Trapnell C et al. Bioinformatics 2009;25:1105-1111

Cufflinks
http://cufflinks.cbcb.umd.edu/
Assembles transcripts,
Estimates their abundances, and
Tests for differential expression and regulation in RNA-Seq samples
Trapnell et al. (2010). Nature Biotechnology 28:511-515.