
Maximum Likelihood Estimation of Incomplete Genomic Spectrum from HTS Data
Serghei Mangul, Department of Computer Science, Georgia State University
Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu, and Alex Zelikovsky

Outline
- Introduction
- ML Model
- EM Algorithm
- VSEM Algorithm
- Experimental Results
  - RNA-Seq
  - 454
- Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Advances in High-Throughput Sequencing (HTS)
- Roche/454 FLX Titanium: [...] million reads/run, 400 bp avg. length
- Illumina HiSeq 2000: up to 6 billion PE reads/run, [...] bp read length
- SOLiD 4/[...]: [...] billion PE reads/run, 35-50 bp read length


ML Model
Panel: bipartite graph
- LEFT: genomic sequences (strings)
  - unknown frequencies
- RIGHT: reads
  - observed frequencies
- EDGES: probability that the read is emitted by the string
  - weights are calculated from the mapping of the reads to the strings
[Figure: bipartite panel linking strings S1, S2, S3 to reads R1, R2, R3, R4]


ML estimates of string frequencies
- The probability that a read is sampled from string j is proportional to its frequency f(j)
- The ML estimate of f(j) is n(j)/(n(1) + ... + n(N))
  - n(j): number of reads sampled from string j
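As a minimal illustration, assume for a moment that the source string of every read were known, which real sequencing data does not provide. The ML estimate is then just a normalized count. The function name `ml_frequencies` and the toy counts are illustrative and not taken from the talk.

```python
def ml_frequencies(counts):
    """Return f(j) = n(j) / (n(1) + ... + n(N)) for per-string read counts n(j)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one read is required")
    return [n / total for n in counts]


# Hypothetical toy counts: 6, 3 and 1 reads sampled from three strings.
print(ml_frequencies([6, 3, 1]))
```

In practice a read can map to several strings, so the counts n(j) are not observed directly and have to be estimated.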

EM algorithm
- Initialization
- E-step: compute the expected number n(j) of reads that come from string j, assuming the current string frequencies f(j) are correct
- M-step: for each string j, set the new value of f(j) to the fraction of all reads in the sample that originate from string j
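The E-step and M-step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the panel is a dense list of lists `h[i][j]` (weight of the edge between string i and read j), `o[j]` is the observed count of read j, initialization is uniform, and the stopping tolerance `tol` is an arbitrary choice.

```python
def em_frequencies(h, o, iters=200, tol=1e-9):
    """Estimate string frequencies by EM on a bipartite panel.

    h[i][j]: probability that string i emits read j (assumed given).
    o[j]:    observed count (or frequency) of read j.
    """
    n_strings = len(h)
    f = [1.0 / n_strings] * n_strings  # initialization: uniform frequencies
    for _ in range(iters):
        # E-step: expected number n[i] of reads coming from string i.
        n = [0.0] * n_strings
        for j, oj in enumerate(o):
            total = sum(f[i] * h[i][j] for i in range(n_strings))
            if total == 0.0:
                continue  # read is not explained by any string in the panel
            for i in range(n_strings):
                n[i] += oj * f[i] * h[i][j] / total
        s = sum(n)
        if s == 0.0:
            return f  # no read is explained at all; keep the current estimate
        # M-step: new frequency is the share of reads attributed to string i.
        new_f = [x / s for x in n]
        converged = max(abs(a - b) for a, b in zip(f, new_f)) < tol
        f = new_f
        if converged:
            break
    return f


# Toy panel (hypothetical): S1 emits R1 or R2, S2 emits R2 or R3.
h = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
o = [3, 2, 1]
print(em_frequencies(h, o))
```

On this toy panel only R1 and R3 are informative, since R2 is equally likely from both strings, and the estimates converge to roughly (0.75, 0.25).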


ML Model Quality
- How well does the maximum likelihood model explain the reads?
- Measured by the deviation between expected and observed read frequencies
  - expected read frequency: [...]
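One natural way to make this concrete is sketched below. Both definitions are assumptions made for illustration rather than formulas taken from the slide: the expected frequency of read j is taken as the sum over strings i of f(i) times the weight h[i][j] normalized over the reads of string i, and the deviation is the mean absolute difference between observed and expected read frequencies.

```python
def expected_read_frequencies(h, f):
    """Expected frequency of each read given string frequencies f (assumed definition)."""
    e = [0.0] * len(h[0])
    for weights, fi in zip(h, f):
        row_sum = sum(weights)
        if row_sum == 0.0:
            continue  # a string with no edges emits no reads
        for j, w in enumerate(weights):
            e[j] += fi * w / row_sum
    return e


def deviation(o, e):
    """Mean absolute difference between observed and expected read frequencies."""
    total = sum(o)
    return sum(abs(oj / total - ej) for oj, ej in zip(o, e)) / len(o)


# Toy panel and frequencies (hypothetical values).
h = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
e = expected_read_frequencies(h, [0.75, 0.25])
print(e, deviation([3, 2, 1], e))
```

A deviation near zero suggests the panel explains the reads well; a large one hints that some reads come from strings the panel does not contain.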

VSEM: Virtual String EM
1. Start with the (incomplete) panel plus a virtual string whose read weights are all 0.
2. Run EM to obtain ML estimates of string frequencies.
3. Compute expected read frequencies.
4. Update the weights of reads in the virtual string.
5. If the virtual string frequency changed by more than ε (YES), rerun EM and repeat; otherwise (NO), output the string frequencies and weights.
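The loop can be sketched end to end as below. This is a hedged sketch, not the authors' code. In particular, the weight-update rule (add the deficit between observed and expected read frequency to the virtual-string weight and clamp at zero) is an assumption chosen for illustration, as are the helper names, the dense-matrix panel layout, and the default `eps`.

```python
def em(h, o, iters=1000, tol=1e-12):
    """Plain EM for string frequencies; h[i][j] weights edge (string i, read j)."""
    f = [1.0 / len(h)] * len(h)
    for _ in range(iters):
        n = [0.0] * len(h)
        for j, oj in enumerate(o):
            total = sum(f[i] * h[i][j] for i in range(len(h)))
            if total > 0.0:
                for i in range(len(h)):
                    n[i] += oj * f[i] * h[i][j] / total
        s = sum(n)
        if s == 0.0:
            return f
        new_f = [x / s for x in n]
        if max(abs(a - b) for a, b in zip(f, new_f)) < tol:
            return new_f
        f = new_f
    return f


def expected(h, f):
    """Expected read frequencies, with each string's weights normalized (assumed)."""
    e = [0.0] * len(h[0])
    for weights, fi in zip(h, f):
        row_sum = sum(weights)
        if row_sum > 0.0:
            for j, w in enumerate(weights):
                e[j] += fi * w / row_sum
    return e


def vsem(h, o, eps=1e-6, max_rounds=100):
    """Return (frequencies, virtual-string weights); the last frequency is the virtual string's."""
    total = sum(o)
    o = [x / total for x in o]            # observed read frequencies
    h_vs = [0.0] * len(o)                 # virtual string starts with 0-weights
    f = em(h + [h_vs], o)
    for _ in range(max_rounds):
        e = expected(h + [h_vs], f)
        # ASSUMED update rule: shift weight toward reads that are under-explained.
        h_vs = [max(0.0, w + oj - ej) for w, oj, ej in zip(h_vs, o, e)]
        new_f = em(h + [h_vs], o)
        converged = abs(new_f[-1] - f[-1]) <= eps
        f = new_f
        if converged:
            break
    return f, h_vs


# Hypothetical incomplete panel: S1 emits R1 or R2; nothing in the panel emits R3.
f, h_vs = vsem([[0.5, 0.5, 0.0]], [3, 3, 4])
print(f, h_vs)
```

In this toy case the virtual string absorbs the unexplained read R3 and ends up with a frequency of about 0.4, which is the share of reads the known string cannot account for.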

Example: 1st iteration
[Figure: the full panel (strings S1, S2, S3; reads R1-R4) beside the incomplete panel (strings S1, S2; reads R1-R4), to which the virtual string VS is added. Labels O and E (presumably observed and expected read frequencies) and ML annotate the panels; of the numeric labels only 0.25 survives.]

Example: last iteration
[Figure: the same full and incomplete panels with the O, E, ML, and VS labels at the last iteration.]

VSEM: Virtual String EM
- Decide if the panel is likely to be incomplete
- Estimate the total frequency of missing strings
- Identify the read spectrum emitted by missing strings

VSEM: Applications
- RNA-Seq
  - inferring isoform expression from RNA-Seq
- Viral quasispecies sequencing by 454 pyrosequencing
  - inferring the viral quasispecies spectrum from pyrosequencing shotgun reads


RNA-Seq
- Make cDNA and shatter into fragments
- Sequence fragment ends
- Map reads
- Analyses: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE)
[Figure: RNA-Seq workflow on a gene labeled A B C D E, with example isoforms ABC, AC, and DE]

Previous Approach
IsoEM [Nicolae et al. 2011]: novel expectation-maximization algorithm for inference of alternative splicing isoform frequencies from RNA-Seq data
- Single and/or paired reads
- Fragment length distribution
- Strand information
- Base quality scores
- Insert sizes (library preparation)

Simulation Setup
- Human genome UCSC/CCDS known isoforms
  - UCSC: [...] isoforms, [...] genes
  - CCDS: [...] isoforms, [...] genes
- GNFAtlas2 gene expression levels
  - geometric expression of gene isoforms
- Normally distributed fragment lengths
  - Mean 250, std. dev. [...]

EXP1: Reduced transcriptome data
Comparison between IsoEM and IsoVSEM on reduced transcriptome data
- in every gene, 25% of isoforms are missing
- isoform frequencies inside a gene follow a geometric distribution (p=0.5)
- only genes with at most 3 isoforms are selected
- removed isoforms with frequency [...]
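A small sketch of how geometric isoform frequencies with p=0.5 could be generated for a gene. Truncating the geometric distribution to the gene's k isoforms and renormalizing is an assumption made here; the slides do not say how the tail is handled, and the function name is illustrative.

```python
def geometric_isoform_frequencies(k, p=0.5):
    """Frequencies proportional to p * (1 - p)**i for isoforms i = 0..k-1, renormalized."""
    if k < 1:
        raise ValueError("a gene needs at least one isoform")
    weights = [p * (1 - p) ** i for i in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


# A gene with 3 isoforms: each isoform is half as abundant as the previous one.
print(geometric_isoform_frequencies(3))
```

For three isoforms this gives frequencies in the ratio 4:2:1.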

EXP2: CCDS panel
- The UCSC database represents the full panel
- CCDS represents the incomplete panel
  - reads were generated from the UCSC library of isoforms
  - only frequencies of known isoforms (CCDS) were estimated

Error Fraction Curves
[Figure: error fraction curves for EXP1, 30M reads of length [...]]


454 Pyrosequencing
- Pyrosequencing = Sequencing by Synthesis
- GS FLX Titanium:
  - Divides the source genetic material into reads ([...] bp)

Previous Approach
ViSpA [Astrovskaya et al. 2011]: viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyrosequencing shotgun reads
- align reads
- build a read graph:
  - V: reads
  - E: overlaps between reads
  - each path: candidate sequence
- filter based on ML frequencies
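The read-graph idea can be sketched on toy data as below. This is a deliberately simplified illustration and not ViSpA itself: reads are assumed to be error-free and already aligned, each given as a hypothetical `(start, sequence)` pair; an edge requires exact agreement on the overlap; and all source-to-sink paths are enumerated, which is only feasible for tiny inputs.

```python
def overlaps(a, b):
    """True if read b starts inside read a, extends past its end, and agrees on the overlap."""
    (sa, qa), (sb, qb) = a, b
    ea, eb = sa + len(qa), sb + len(qb)
    if not (sa < sb < ea < eb):
        return False
    return qa[sb - sa:] == qb[:ea - sb]


def candidate_sequences(reads):
    """Enumerate sequences spelled by source-to-sink paths in the read graph."""
    n = len(reads)
    # V: reads; E: consistent overlaps between reads.
    adj = {i: [j for j in range(n) if j != i and overlaps(reads[i], reads[j])]
           for i in range(n)}
    has_incoming = {j for targets in adj.values() for j in targets}
    sources = [i for i in range(n) if i not in has_incoming]
    results = []

    def extend(node, end, seq):
        if not adj[node]:
            results.append(seq)  # reached a sink: the path spells one candidate
            return
        for nxt in adj[node]:
            start, s = reads[nxt]
            # Append only the part of the next read that lies beyond the current end.
            extend(nxt, start + len(s), seq + s[end - start:])

    for src in sources:
        start, s = reads[src]
        extend(src, start + len(s), s)
    return results


# Hypothetical aligned reads: two variants diverge after position 4.
reads = [(0, "ACGT"), (2, "GTAA"), (2, "GTCC")]
print(candidate_sequences(reads))
```

Edges always point from a read to one that starts later, so the graph is acyclic and the path enumeration terminates. Candidate sequences obtained this way would then be filtered by their ML frequencies.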

ViSpA-VSEM
[Flowchart. Components: reads; ViSpA weighted assembler, producing assembled Qsps; ViSpA ML estimator, removing duplicated and rare Qsps; Qsps library; VSEM (Virtual String EM), producing reads and weights; stopping condition (YES/NO); viral spectrum + statistics.]

Simulation Setup
Real quasispecies sequence data from [von Hahn et al. 2006]
- 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus
- population sizes: 10, 20, 30, and 40 sequences
- population distributions: geometric, skewed normal, uniform

Experimental Validation of VSEM
- Detection of panel incompleteness
  - VSEM can detect >1% of missing strings
- Improving quasispecies frequency estimates
- Detection of reads emitted by missing strings
  - Correlation between predicted reads and reads emitted by missing strings >65%

VSEM improving frequency estimates
[Table, numeric values not recovered. Columns: % of missing strings in bins <10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, >50%; reported metrics: r and err. Rows, by method and r.l./n.r.: ViSpA-EM 100/20K, ViSpA-VSEM 100/20K, ViSpA-EM 300/20K, ViSpA-VSEM 300/20K, ViSpA-EM 100/100K, ViSpA-VSEM 100/100K, ViSpA-EM 300/100K, ViSpA-VSEM 300/100K.]
r - correlation between real and predicted frequencies; err - average prediction error

ViSpA vs ViSpA-VSEM
[Table, numeric values not recovered. Rows: Geometric, Skewed, Uniform distributions. Columns: PPV, Sensitivity, r, err for ViSpA; PPV, Sensitivity, r, err for ViSpA-VSEM; Gain. Data: [...]K reads from 10 QSPS, average length 300.]
r - correlation between real and predicted frequencies; err - average prediction error
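For reference, the two metrics can be computed as in the sketch below. It assumes r is the Pearson correlation coefficient and err is the mean absolute difference between real and predicted frequencies; the slides do not spell out either definition, and the frequency vectors are made-up toy values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of frequencies."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_error(real, predicted):
    """Mean absolute difference between real and predicted frequencies (assumed definition)."""
    return sum(abs(a - b) for a, b in zip(real, predicted)) / len(real)


# Hypothetical real and predicted quasispecies frequencies.
real = [0.5, 0.3, 0.2]
predicted = [0.4, 0.4, 0.2]
print(pearson_r(real, predicted), average_error(real, predicted))
```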


Conclusions
- We propose VSEM, a novel modification of the EM algorithm
  - improves the ML frequency estimates of multiple genomic sequences
  - identifies reads that belong to unassembled (missing) sequences
- We applied VSEM to improve two tools:
  - IsoEM
  - ViSpA

Future work
- Assemble strings from the set of reads emitted by missing strings
- Improve other metagenomics tools

Acknowledgments
- NSF awards IIS [...], IIS [...], and DBI [...]
- NSF award IIS [...]
- Agriculture and Food Research Initiative Competitive Grant no. [...] from the USDA National Institute of Food and Agriculture

Thanks