Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)

Transcriptome Reconstruction from Single RNA-Seq Reads Using EM Algorithm with Expected Deviation Minimization Enhancement
Adrian Caciula, Department of Computer Science, Georgia State University
Joint work with Serghei Mangul (UCLA), Ion Mandoiu (UCONN), Alex Zelikovsky (GSU)
ISBRA 2013, Charlotte, NC

Outline
- RNA-Seq: Background and Related Work
- EM-EDM: EM Algorithm with Expected Deviation Minimization
  1. Candidate transcripts construction
  2. EM for Isoform Expression Estimation
  3. EDM: Expected Deviation Minimization
- Experimental Results
- Conclusions

Alternative Splicing [Griffith and Marra 07]

Advances in Next Generation Sequencing
- Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. length
- Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length
- SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50bp read length
- Ion Proton Sequencer
High-throughput RNA sequencing (RNA-Seq) reduces sequencing cost and significantly increases data throughput.
Source: http://www.economist.com/node/16349358

Genome-Guided RNA-Seq Protocol [Nicolae et al., 10]
1. Make cDNA from RNA (through the process of hybridization) and shatter it into fragments
2. Sequence fragment ends
3. Map reads to genome
4. Analyze: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID)
[Figure: protocol diagram with labels ABCDE, ABC, AC, DE]

Transcriptome Reconstruction
Given partial or incomplete information, we need to make an informed guess about the missing or unknown data.

Transcriptome Reconstruction Types
- Genome-independent reconstruction (de novo)
  - de Bruijn k-mer graph
- Genome-guided reconstruction (ab initio)
  - Spliced read mapping
  - Exon identification
  - Splice graph
- Annotation-guided reconstruction
  - Use existing annotation (known transcripts)
  - Focus on discovering novel transcripts

Previous approaches
- Genome-independent reconstruction
  - Trinity (2011), Velvet (2008), TransABySS (2008)
- Genome-guided reconstruction
  - Scripture (2010): reports "all" transcripts
  - Cufflinks (2010), IsoLasso (2011), SLIDE (2012): minimize the set of transcripts explaining the reads
- Annotation-guided reconstruction
  - RABT (2011), DRUT (2011)

Outline
- RNA-Seq: Background and Related Work
- EM-EDM: EM Algorithm with Expected Deviation Minimization
  1. Candidate transcripts construction
  2. EM for Isoform Expression Estimation
  3. EDM: Expected Deviation Minimization
- Experimental Results
- Conclusions

EM Algorithm with Expected Deviation Minimization
The EM-EDM algorithm:
- starts with a set of N known candidate transcripts,
- initializes their frequencies (expression levels) f_t with EM estimates,
- then incorporates EDM to improve the accuracy of EM.

EM Algorithm with Expected Deviation Minimization
Step 1: Map the RNA-Seq reads to the genome (using TopHat)
Step 2: Construct the splice graph G(V,E)
  - V: exons
  - E: splicing events
Step 3: Build the candidate transcripts
  - depth-first search (DFS)
Step 4: Apply EM-EDM to compute expression levels for all candidates
Step 5: Filter candidate transcripts based on expression levels
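Steps 2 and 3 can be illustrated with a minimal sketch. It assumes the splice graph is acyclic (exons in genomic order) and that a candidate transcript is any path from a start exon to an end exon. The function names, the toy exon labels, and the junction list are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def build_splice_graph(exons, junctions):
    """Adjacency list G(V, E): vertices are exons, edges are splicing events."""
    graph = defaultdict(list)
    for exon in exons:
        graph[exon]  # touch the key so isolated exons are still vertices
    for src, dst in junctions:
        graph[src].append(dst)
    return graph


def enumerate_candidates(graph, sources, sinks):
    """Enumerate every source-to-sink path with a depth-first search."""
    sinks = set(sinks)
    candidates = []

    def dfs(node, path):
        path.append(node)
        if node in sinks:
            candidates.append(tuple(path))
        for nxt in graph[node]:
            dfs(nxt, path)
        path.pop()  # backtrack

    for source in sources:
        dfs(source, [])
    return candidates


# Toy gene (assumed data): five exons and the junctions supported by spliced reads.
exons = ["A", "B", "C", "D", "E"]
junctions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("C", "E")]
graph = build_splice_graph(exons, junctions)
candidates = enumerate_candidates(graph, sources=["A"], sinks=["E"])
```

The number of paths can grow exponentially with the number of alternative junctions, which is one reason a filtering step such as Step 5 is needed after expression estimation.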

1. Candidate transcripts construction: Gene representation
- Pseudo-exons (pse_i): regions of a gene between consecutive transcriptional or splicing events
- Gene: set of non-overlapping pseudo-exons
[Figure: transcripts Tr_1, Tr_2, Tr_3 over exons e_1 to e_6, partitioned into pseudo-exons pse_1 to pse_7, each with start S_pse_i and end E_pse_i]
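One way to derive pseudo-exons from a set of transcripts is to cut the gene at every exon boundary and keep the segments covered by at least one exon. The sketch below assumes half-open (start, end) coordinates and made-up transcripts; it illustrates the definition above, not the authors' implementation.

```python
def pseudo_exons(transcripts):
    """Split a gene into non-overlapping pseudo-exons.

    transcripts: list of transcripts, each a list of (start, end) exons
    in half-open coordinates (an assumption of this sketch).
    """
    exons = [exon for transcript in transcripts for exon in transcript]
    # Every exon start or end is a transcriptional or splicing event.
    boundaries = sorted({pos for exon in exons for pos in exon})
    result = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep the segment only if some exon fully covers it.
        if any(start <= left and right <= end for start, end in exons):
            result.append((left, right))
    return result


# Hypothetical transcripts of one gene.
tr1 = [(0, 100), (200, 300), (400, 500)]
tr2 = [(0, 100), (250, 300), (400, 500)]
tr3 = [(0, 150), (400, 500)]
pses = pseudo_exons([tr1, tr2, tr3])
```

Each original exon is then a concatenation of consecutive pseudo-exons, so every transcript can be written as a sequence of pseudo-exon indices.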

1. Candidate transcripts construction: Splice Graph Construction
[Figure: splice graph along the genome with pseudo-exon nodes pse_1 to pse_9, a TSS node and a TES node, and edges supported by single spliced reads]

2. EM for Isoform Expression Estimation: Read Ambiguity in Isoform Expression
[Figure: reads shared between isoforms ABCDE and AC]

Previous approaches to Isoform Expression
- [Nicolae et al. 10]: fragment length distribution
- [Li et al. 10]: EM algorithm, single reads
- [Feng et al. 10]: convex quadratic program, pairs used only for ID
- [Trapnell et al. 10]: extends Jiang's model to paired reads; fragment length distribution

Read-Isoform Compatibility
Q_a is the probability of observing the read from the genome locations described by the alignment a. It is computed from the base quality scores as described in [Nicolae et al., 10].
[Figure: bipartite graph between reads and transcripts]
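As a rough illustration of how a quantity like Q_a can be derived from base qualities, the sketch below uses a common per-base error model: a Phred score q gives an error probability of 10^(-q/10), a matching base contributes (1 - error), and a mismatching base contributes error/3. This model and the function names are assumptions of the sketch; see [Nicolae et al., 10] for the exact definition used by the authors.

```python
def phred_to_error(q):
    """Convert a Phred quality score into a base-call error probability."""
    return 10 ** (-q / 10)


def alignment_quality(read, reference, phred_scores):
    """Probability of observing `read` given the aligned reference bases.

    Assumes an ungapped alignment and that a miscalled base is equally
    likely to be any of the three other nucleotides.
    """
    prob = 1.0
    for base, ref_base, q in zip(read, reference, phred_scores):
        eps = phred_to_error(q)
        prob *= (1 - eps) if base == ref_base else eps / 3
    return prob


# Toy example: a perfect match and a single mismatch at quality 30.
q_match = alignment_quality("ACGT", "ACGT", [30, 30, 30, 30])
q_mismatch = alignment_quality("ACGA", "ACGT", [30, 30, 30, 30])
```

A mismatch at a high-quality position lowers the alignment probability sharply, so alignments that disagree with confident base calls receive little weight.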

Fragment length distribution
For single reads, F_a is defined as the probability of observing a fragment with a length of u bases or fewer. For more details see IsoEM [Nicolae et al., 10].
[Figure: reads at positions i and j on isoforms ABC and AC, with the corresponding values F_a(i) and F_a(j)]
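The probability of a fragment of length u or fewer is a cumulative distribution function. The sketch below assumes, purely for illustration, that fragment lengths are normally distributed with a hypothetical mean of 250 and standard deviation of 25; the talk does not state the distribution or its parameters.

```python
import math


def fragment_cdf(u, mean=250.0, std=25.0):
    """P(fragment length <= u) under an assumed normal length distribution."""
    return 0.5 * (1.0 + math.erf((u - mean) / (std * math.sqrt(2.0))))


# Two hypothetical read positions leaving room for fragments of up to
# 300 and 200 bases respectively.
f_i = fragment_cdf(300)
f_j = fragment_cdf(200)
```

A read that leaves more room for the fragment gets a larger F_a, so the same read can be weighted differently against two isoforms of different lengths.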

Generic EM algorithm
- Initialization: uniform transcript frequencies f_t
- E-step: compute the expected number n_t of reads sampled from transcript t, assuming the current transcript frequencies f_t
- M-step: for each transcript t, set f_t = portion of reads emitted by transcript t among all reads in the sample
- ML estimates: f_t = n_t / (n_1 + ... + n_T)
CAME 2011, Atlanta, GA
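The E-step and M-step above can be sketched as follows. The input format (one dictionary of read-transcript compatibility weights per read), the fixed iteration count, and the toy read counts are assumptions of this sketch; it follows the M-step as stated on the slide and leaves out refinements such as transcript length normalization.

```python
def em_frequencies(read_weights, num_transcripts, iterations=200):
    """Estimate transcript frequencies with a plain EM loop.

    read_weights: one dict per read mapping transcript index -> weight
    of the read being compatible with that transcript.
    """
    freqs = [1.0 / num_transcripts] * num_transcripts  # uniform initialization
    for _ in range(iterations):
        counts = [0.0] * num_transcripts
        # E-step: split each read among its transcripts in proportion
        # to (current frequency x compatibility weight).
        for weights in read_weights:
            total = sum(freqs[t] * w for t, w in weights.items())
            if total == 0:
                continue
            for t, w in weights.items():
                counts[t] += freqs[t] * w / total
        # M-step: f_t = n_t / (n_1 + ... + n_T).
        norm = sum(counts)
        if norm == 0:
            break
        freqs = [c / norm for c in counts]
    return freqs


# Toy data: 6 ambiguous reads, 3 reads unique to transcript 0, 1 unique to transcript 1.
reads = [{0: 1.0, 1: 1.0}] * 6 + [{0: 1.0}] * 3 + [{1: 1.0}]
freqs = em_frequencies(reads, num_transcripts=2)
```

In this toy case the fixed point can be checked by hand: f_0 = (3 + 6 f_0) / 10 gives f_0 = 0.75, so the ambiguous reads end up split 3:1 like the unique ones.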

3. EDM: Expected Deviation Minimization
EDM Motivation: Reducing the error rate is critical for detecting similar transcripts, especially when one is a subset of another. EDM is a fine-tuning step for frequency estimation that further improves its accuracy.

Expected Deviation Minimization method (EDM).

The transcript frequencies can be estimated by the following iterative process:
- Initialize f_t <- corresponding EM frequency
- EDM increments and decrements transcript frequencies in order to decrease the total deviation between observed and expected read frequencies.
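The quantity being decreased can be illustrated with a small sketch. It assumes that reads are grouped into classes, that `emit[t][j]` is the probability that a read from transcript t falls into class j, and that the total deviation is the sum of absolute differences between observed and expected class frequencies. These definitions and all numbers are assumptions for illustration; the sketch does not reproduce the update rule of the EDM steps in the talk.

```python
def expected_read_frequencies(freqs, emit):
    """Expected frequency of each read class given transcript frequencies.

    emit[t][j]: assumed probability that a read from transcript t lands in class j.
    """
    num_classes = len(emit[0])
    return [sum(f * row[j] for f, row in zip(freqs, emit)) for j in range(num_classes)]


def total_deviation(observed, expected):
    """Sum of absolute differences between observed and expected frequencies."""
    return sum(abs(o - e) for o, e in zip(observed, expected))


# Two hypothetical transcripts and three read classes.
emit = [
    [0.5, 0.5, 0.0],  # transcript 0 emits reads into classes 0 and 1
    [0.0, 0.5, 0.5],  # transcript 1 emits reads into classes 1 and 2
]
observed = [0.35, 0.5, 0.15]
dev_start = total_deviation(observed, expected_read_frequencies([0.6, 0.4], emit))
dev_adjusted = total_deviation(observed, expected_read_frequencies([0.7, 0.3], emit))
```

Here raising the first frequency and lowering the second brings the expected read frequencies in line with the observed ones, which is the kind of adjustment the iterative process looks for.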

Expected Deviation Minimization method (EDM).
Each iteration consists of the following three steps:
Step 1: Set D=1 and C=0.05.

Expected Deviation Minimization method (EDM). Step 2:

Expected Deviation Minimization method (EDM). Step 3:

Outline
- RNA-Seq: Background and Related Work
- EM-EDM: Expectation Maximization Algorithm with Expected Deviation Minimization Enhancement
  1. Gene representation and candidate transcripts
  2. EM for Isoform Expression Estimation
  3. EDM: Expected Deviation Minimization
- Experimental Results
- Conclusions

Simulation Setup
- Human genome data (UCSC hg18)
- UCSC database: 66,803 isoforms, 19,372 genes
- Single error-free reads: 60M reads of length 100bp
- For the partially annotated genome: exactly one isoform is removed from every gene
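The partially annotated setup can be mimicked with a short sketch. It assumes the annotation is a mapping from gene name to a list of unique isoform identifiers and that the removed isoform is chosen at random with a fixed seed; the talk does not say how the isoform was chosen, and the gene and isoform names below are made up.

```python
import random


def remove_one_isoform_per_gene(annotation, seed=0):
    """Return (partial annotation, removed isoforms) with one isoform dropped per gene."""
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    kept, removed = {}, {}
    for gene, isoforms in annotation.items():
        victim = rng.choice(isoforms)
        removed[gene] = victim
        kept[gene] = [iso for iso in isoforms if iso != victim]
    return kept, removed


# Hypothetical annotation with two genes.
annotation = {
    "geneA": ["A.1", "A.2", "A.3"],
    "geneB": ["B.1", "B.2"],
}
partial, hidden = remove_one_isoform_per_gene(annotation)
```

The hidden isoforms then serve as the novel transcripts that a reconstruction method is expected to rediscover. Note that a gene with a single isoform ends up with an empty annotation under this scheme.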

Distribution of isoform lengths and gene cluster sizes in the UCSC dataset

Comparison Between Methods
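The talk reports the comparison in terms of sensitivity and PPV. Using the standard definitions (sensitivity = TP / (TP + FN), PPV = TP / (TP + FP)), a minimal sketch looks like this. Treating a predicted transcript as correct only when its identifier appears in the reference set is an assumption of the sketch; the matching criterion used in the experiments is not given in the slides.

```python
def sensitivity_and_ppv(reference, predicted):
    """Sensitivity and positive predictive value of a predicted transcript set."""
    reference, predicted = set(reference), set(predicted)
    true_positives = len(reference & predicted)
    sensitivity = true_positives / len(reference) if reference else 0.0
    ppv = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Hypothetical example: 4 reference transcripts, 3 predictions, 2 of them correct.
sens, ppv = sensitivity_and_ppv({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t5"})
```

Reporting many candidate transcripts tends to raise sensitivity and lower PPV, which is why filtering by expression level affects the two measures in opposite directions.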

Conclusions
- We proposed EM-EDM, an annotation-guided method for transcriptome discovery and reconstruction.
- EM-EDM outperforms existing genome-guided transcriptome assemblers (i.e., Cufflinks) in terms of sensitivity.
- For future work, we plan to improve the filtering algorithm in order to increase the PPV, and to extend our work to paired-end reads.

Thanks!
