STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology NGS Part I RNA-Seq Short Reads Sequence Analysis Feb 29, 2012 Daniel Fernandez and Alejandro Quiroz

Presentation transcript:

NGS Part I: RNA-Seq Short Reads Sequence Analysis
Feb 29, 2012
Daniel Fernandez and Alejandro Quiroz

1st ACT (1 hour): Introduction
INTERLUDE (10 min): Chill Out Sessions with DJ Bowtie
2nd ACT (1 hour 50 min): Homework help, Q4 and Q5

Central Dogma of Molecular Biology
[Figure labels: Genome, Transcriptome, Biology, Reverse Engineering]

Reverse Engineering: We can use sequencing to find the genome state
[Figure labels: RNA-Seq, Transcription. Source: Wang, Z., Nature Reviews Genetics 2009]

Reverse Engineering: Once sequenced, the problem becomes computational
[Figure labels: cells, library preparation, sequencer, sequenced reads, alignment, genome, read coverage]

Overview of the session
We'll cover the 3 main computational challenges of sequence analysis for counting applications:
Read mapping: Placing short reads in the genome.
Reconstruction: Finding the regions that originated the reads.
Quantification: Assigning scores to regions; finding regions that are differentially represented between two or more samples.

[Figure source: Trapnell and Salzberg, Nature Biotechnology 2009]

Short read mapping software for RNA-Seq

Seed-extend | Short indels | Use base qual
Maq | No | Yes
BFAST | Yes | No
GASSST | Yes | No
RMAP | Yes | Yes
SeqMap | Yes | No
SHRiMP | Yes | No

B-W (Burrows-Wheeler) | Use base qual
BWA | Yes
Bowtie | No
Soap2 | No

What software to use
If read quality is good (error rate < 1%) and there is a reference, BWA is a very good choice.
If read quality is not good, or the reference is phylogenetically far (e.g. wolf to dog), and you have a server with enough memory, SHRiMP or BFAST should be a sensitive but relatively fast choice.
What about RNA-Seq?

RNA-Seq read mapping is more complex than just sequencing
[Figure scale labels: 10s of kb, 100s of bp]
RNA-Seq reads can be spliced, and spliced reads are the most informative.

Method I: Seed-extend spliced alignment

Method II: Exon-first spliced alignment

Short read mapping software for RNA-Seq

Seed-extend | Short indels | Use base qual
GSNAP | No | No
QPALMA | Yes | No
STAMPY | Yes | Yes
BLAT | Yes | No

Exon-first | Use base qual
MapSplice | No
SpliceMap | No
TopHat | No

Exon-first aligners will map contiguous reads first, at the expense of spliced hits.

IGV: Integrative Genomics Viewer
The Broad Institute of MIT and Harvard
A desktop application for the visualization and interactive exploration of genomic data: microarrays, epigenomics, RNA-Seq, NGS alignments, comparative genomics.

Visualizing read alignments with IGV
[Figure labels: long marks, medium marks, punctate marks]

Visualizing read alignments with IGV: RNA-Seq
Gap between reads spanning exons

Visualizing read alignments with IGV: RNA-Seq close-up
What are the gray reads? We will revisit later.

Overview of the session
The 3 main computational challenges of sequence analysis for counting applications:
Read mapping: Placing short reads in the genome.
Reconstruction: Finding the regions that originate the reads.
Quantification: Assigning scores to regions; finding regions that are differentially represented between two or more samples.

Scripture for RNA-Seq: Extending segmentation to discontiguous regions

The transcript reconstruction problem
[Figure scale labels: 10s of kb, 100s of bp]
Challenges:
Genes exist at many different expression levels, spanning several orders of magnitude.
Reads originate from both mature mRNA (exons) and immature mRNA (introns), and it can be problematic to distinguish between them.
Reads are short and genes can have many isoforms, making it challenging to determine which isoform produced each read.
There are two main approaches to this problem; first, let's discuss Scripture's.

Scripture Overview
Map reads
Scan "discontiguous" windows
Merge windows & build transcript graph
Filter & report isoforms

Method I: Direct assembly

Method II: Genome-guided

Transcriptome reconstruction method summary

Pros and cons of each approach
Transcript assembly methods are the obvious choice for organisms without a reference sequence.
Genome-guided approaches are ideal for annotating high-quality genomes, expanding the catalog of expressed transcripts, and comparing transcriptomes of different cell types or conditions.
Hybrid approaches suit lower-quality genomes or transcriptomes that underwent major rearrangements, such as in cancer cells.
More than 1000-fold variability in expression levels makes transcriptome assembly a harder problem than regular genome assembly.
Genome-guided methods are very sensitive to alignment artifacts.

RNA-Seq transcript reconstruction software

Assembly | Published
Oases | No
Trans-ABySS | Yes
Trinity | No

Genome guided: Cufflinks, Scripture

Differences between Cufflinks and Scripture
Scripture was designed with annotation in mind. It reports all possible transcripts that are significantly expressed given the aligned data (maximum sensitivity).
Cufflinks was designed with quantification in mind. It limits reported isoforms to the minimal number that explains the data (maximum precision).

Differences between Cufflinks and Scripture: Example
[Figure tracks: Annotation, Scripture, Cufflinks, Alignments]

Overview of the session
The 3 main computational challenges of sequence analysis for counting applications:
Read mapping: Placing short reads in the genome.
Reconstruction: Finding the regions that originate the reads.
Quantification: Assigning scores to regions; finding regions that are differentially represented between two or more samples.

Quantification
Fragmentation of transcripts results in length bias: longer transcripts have higher counts.
Different experiments have different yields, so normalization is required for cross-lane comparisons.
RPKM: reads per kilobase of exonic sequence per million mapped reads (Mortazavi et al., Nature Methods 2008).
This is all good when genes have one isoform.
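The RPKM normalization named on this slide can be written out in a few lines. This is a minimal sketch: the function name `rpkm` and its argument names are illustrative choices, not taken from any particular tool.

```python
def rpkm(read_count, exonic_length_bp, total_mapped_reads):
    """Reads per kilobase of exonic sequence per million mapped reads."""
    kilobases = exonic_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions_mapped


# Two transcripts with the same abundance but different lengths:
# the longer one collects twice the reads, yet both get the same RPKM.
short_tx = rpkm(read_count=100, exonic_length_bp=1_000, total_mapped_reads=10_000_000)
long_tx = rpkm(read_count=200, exonic_length_bp=2_000, total_mapped_reads=10_000_000)
print(short_tx, long_tx)
```

Dividing by length removes the fragmentation bias, and dividing by the number of mapped reads removes the difference in yield between lanes.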

Quantification with multiple isoforms
How do we define the gene expression?
How do we compute the expression of each isoform?

Computing gene expression
Idea 1: RPKM of the constitutive reads (Neuma, Alexa-Seq, Scripture)

Computing gene expression: isoform deconvolution

Computing gene expression: isoform deconvolution
If we knew the origin of the reads we could compute each isoform's expression.
The gene's expression would be the sum of the expression of all its isoforms:
$E = \mathrm{RPKM}_1 + \mathrm{RPKM}_2 + \mathrm{RPKM}_3$

Programs to measure transcript expression

Program | Implemented method
Alexa-seq | Gene expression by constitutive exons
ERANGE | Gene expression by using all exons
Scripture | Gene expression by constitutive exons
Cufflinks | Transcript deconvolution by solving the maximum likelihood problem
MISO | Transcript deconvolution by solving the maximum likelihood problem
RSEM | Transcript deconvolution by solving the maximum likelihood problem
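To make "transcript deconvolution by solving the maximum likelihood problem" concrete, here is a toy expectation-maximization loop. It is a sketch under simplifying assumptions (reads are uniform along each isoform, no insert size or bias modelling) and is not the implementation used by Cufflinks, MISO or RSEM; the function name and input layout are hypothetical.

```python
def em_isoform_fractions(compatible, lengths, n_iter=200):
    """Estimate the fraction of reads coming from each isoform.

    compatible: one set per read, holding the indices of the isoforms
                that read could have come from.
    lengths:    isoform lengths in bp.
    """
    k = len(lengths)
    theta = [1.0 / k] * k  # start from equal read fractions
    for _ in range(n_iter):
        expected = [0.0] * k
        # E-step: split each read among its compatible isoforms,
        # weighting by current fraction divided by isoform length.
        for isoforms in compatible:
            weights = {j: theta[j] / lengths[j] for j in isoforms}
            total = sum(weights.values())
            if total == 0.0:
                continue
            for j, w in weights.items():
                expected[j] += w / total
        # M-step: new fractions are the normalized expected counts.
        assigned = sum(expected)
        theta = [e / assigned for e in expected]
    return theta


# 30 reads unique to isoform 0, 10 unique to isoform 1, 20 ambiguous.
reads = [{0}] * 30 + [{1}] * 10 + [{0, 1}] * 20
fractions = em_isoform_fractions(reads, lengths=[1000, 1000])
print(fractions)
```

The ambiguous reads end up split in proportion to the evidence from the unique reads, which is the intuition behind the deconvolution methods in the table.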

Impact of library construction methods

Library construction improvements: Paired-end sequencing
[Figure adapted from the Helicos website]

Paired-end reads are easier to associate to isoforms
[Figure: read pairs P1, P2 and P3 aligned against Isoform 1, Isoform 2 and Isoform 3]
Paired ends increase isoform deconvolution confidence:
P1 originates from isoform 1 or 2 but not 3.
P2 and P3 originate from isoform 1.
Do paired-end reads also help identify reads originating in isoform 3?

We can estimate the insert size distribution
[Figure: read pairs P1 and P2 with insert distances d1 and d2]
Get all single-isoform reconstructions.
Splice and compute insert distance.
Estimate the empirical insert size distribution.

... and use it for probabilistic read assignment
[Figure: insert distances d1 and d2 implied by Isoform 1, Isoform 2 and Isoform 3; each is scored with $P(d > d_i)$]
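The idea of scoring each candidate isoform by $P(d > d_i)$ under the empirical insert size distribution can be sketched as follows. The slide only shows the tail probability, so normalizing the scores into assignment weights is an assumption made here for illustration, and all function names are hypothetical.

```python
from bisect import bisect_right


def empirical_tail(insert_sizes):
    """Return a function d -> fraction of observed inserts larger than d."""
    sizes = sorted(insert_sizes)
    n = len(sizes)

    def tail(d):
        return (n - bisect_right(sizes, d)) / n

    return tail


def assignment_weights(implied_distance, tail):
    """Weight each isoform by the tail probability of the insert it implies.

    implied_distance: dict mapping isoform name -> insert distance d_i
                      the read pair would have on that isoform.
    """
    raw = {iso: tail(d) for iso, d in implied_distance.items()}
    total = sum(raw.values())
    if total == 0.0:
        # No isoform is plausible under the observed distribution.
        return {iso: 0.0 for iso in raw}
    return {iso: score / total for iso, score in raw.items()}


# Insert sizes observed on single-isoform genes (made-up numbers).
tail = empirical_tail([180, 190, 200, 210, 220])
weights = assignment_weights({"isoform 1": 200, "isoform 2": 500}, tail)
print(weights)
```

A pair whose implied insert on an isoform is far longer than anything observed gets a weight of zero for that isoform, which is how paired ends rule out candidates.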

And improve quantification
[Figure source: Katz et al., Nature Methods 2010]

Paired-end reads improve reconstructions
Paired-end data complements the connectivity graph.

And merge regions
[Figure tracks: single reads, paired reads]

Or split regions
[Figure tracks: single reads, paired reads]

Summary
Paired-end reads are now routine in Illumina and SOLiD sequencers.
Paired-end alignment is supported by most short read aligners.
Transcript quantification depends heavily on paired-end data.
Transcript reconstruction is greatly improved when using paired ends (work in progress).

The libraries we will work with are strand specific

Summary
Several methods now exist to build strand-specific RNA-Seq libraries.
Quantification methods support strand-specific libraries. For example, Scripture will compute expression on both strands if desired.

Overview of the session
The 3 main computational challenges of sequence analysis for counting applications:
Read mapping: Placing short reads in the genome.
Reconstruction: Finding the regions that originate the reads.
Quantification: Assigning scores to regions; finding regions that are differentially represented between two or more samples.

The problem
Find genes that have different expression between two or more conditions.
Find genes with isoforms expressed at different levels between two or more conditions.
Find differentially used splicing events.
Find alternatively used transcription start sites.
Find alternatively used 3' UTRs.

Differential gene expression using RNA-Seq
(Normalized) read counts versus hybridization intensity
We observe the individual events.

The Poisson model
Suppose you have 2 conditions and $R$ replicates for each condition, with each replicate in its own lane.
Consider a single gene $G$. Let $C_{ik}$ be the number of reads aligned to $G$ in lane $i$ of condition $k$, where $k = 1, 2$ and $i = 1, \dots, R$.
Assume for simplicity that all lanes give the same number of reads (otherwise introduce a normalization constant).
Assume $C_{ik}$ follows a Poisson distribution with unknown mean $m_{ik}$.
Use a GLM to estimate $m_{ik}$ with two parameters, a gene-dependent parameter $a$ and a sample-dependent parameter $s_k$: $\log(m_{ik}) = a + s_k$. This yields two estimators, $m_1$ and $m_2$.
Alternatively, estimate a single mean $m$ using all replicates for all conditions.

The Poisson model
$G$ is differentially expressed when $m_1 \neq m_2$.
The question is whether $P(C_{i1}, C_{i2} \mid m)$ is close to $P(C_{i1} \mid m_1)\,P(C_{i2} \mid m_2)$.
The likelihood ratio test is ideal for this, and since the two models differ by one parameter, the test statistic follows a $\chi^2$ distribution with 1 degree of freedom. The $\chi^2$ distribution can be used to assess significance.
For details see:
Auer and Doerge, Statistical Design and Analysis of RNA Sequencing Data, Genetics.
Marioni et al., RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays, Genome Research 2008.
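The likelihood ratio test described on this slide can be sketched with the standard library alone. The sketch assumes, as the slides do, that all lanes have the same depth, so the maximum likelihood estimates of the Poisson means are simply the sample means; the function names are illustrative and this is not the code of any published package.

```python
import math


def poisson_loglik(counts, mean):
    """Log-likelihood of independent Poisson counts sharing one mean."""
    if mean == 0.0:
        return 0.0  # only reachable when every count is zero
    return sum(c * math.log(mean) - mean - math.lgamma(c + 1) for c in counts)


def poisson_lrt(counts1, counts2):
    """Return (statistic, p_value) for H0: m1 == m2 versus H1: m1 != m2."""
    m1 = sum(counts1) / len(counts1)
    m2 = sum(counts2) / len(counts2)
    pooled = list(counts1) + list(counts2)
    m = sum(pooled) / len(pooled)
    alt = poisson_loglik(counts1, m1) + poisson_loglik(counts2, m2)
    null = poisson_loglik(pooled, m)
    stat = max(0.0, 2.0 * (alt - null))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Read counts for one gene in R = 3 lanes per condition (made-up numbers).
stat_same, p_same = poisson_lrt([10, 12, 11], [10, 12, 11])
stat_diff, p_diff = poisson_lrt([10, 12, 11], [100, 110, 105])
print(p_same, p_diff)
```

With unequal lane depths, a normalization constant would have to enter each mean, as the earlier Poisson model slide notes.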

Cufflinks differential isoform usage
Let a gene $G$ have $n$ isoforms and let $p_1, \dots, p_n$ be the estimated fraction of expression of each isoform. Call this the isoform expression distribution $P$ for $G$.
Given two samples, testing differential isoform usage amounts to determining whether $H_0: P_1 = P_2$ or $H_1: P_1 \neq P_2$ is true.
To compare distributions, Cufflinks uses an information-content-based metric of how different two distributions are, called the Jensen-Shannon divergence: $JS(P_1, P_2) = \tfrac{1}{2} KL(P_1 \,\|\, M) + \tfrac{1}{2} KL(P_2 \,\|\, M)$, where $M = \tfrac{1}{2}(P_1 + P_2)$ and $KL$ is the Kullback-Leibler divergence.
The square root of the JS divergence follows a normal distribution.
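A small example may help show how the Jensen-Shannon divergence behaves on isoform expression distributions. This sketch uses the textbook definition with base-2 logarithms; it is not Cufflinks code, and the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


# Isoform fractions of one gene in two samples (made-up numbers).
sample1 = [0.7, 0.2, 0.1]
sample2 = [0.1, 0.2, 0.7]
js = js_divergence(sample1, sample2)
print(js, math.sqrt(js))
```

The divergence is 0 when both samples use their isoforms identically and reaches 1 bit when they use completely different isoforms; its square root is the quantity the slide refers to.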

RNA-Seq differential expression software

Program | Underlying model | Notes
DEGseq | Normal. Mean and variance estimated from replicates | Works directly from reference transcriptome and read alignment
edgeR | Negative binomial | Gene expression table
DESeq | Poisson | Gene expression table
Myrna | Empirical | Sequence reads and reference transcriptome