Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences Song Gao, Niranjan Nagarajan, Wing-Kin Sung National University.

Presentation transcript:

Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences
Song Gao, Niranjan Nagarajan, Wing-Kin Sung
National University of Singapore
Genome Institute of Singapore

Outline
-> Overview
Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
Results
Ongoing Work

[Figure: Biological Entity (Genome, Transcripts, Microbial Community) -> Sequencing Machine -> Reads (ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…, CTTAGAATCGGATAGAC…, AGGCATAGACTAGAG…) -> Assembly and Analysis -> Data Entity (Genomic Sequence, Transcript Assembly, Metagenome)]

Sequence Assembly
Reads -> Contigs -> Scaffolds (using Paired-end Reads)
Related Research Works
Contig Level (I)
- OLC Framework: Celera Assembler [Myers et al, 2000], Edena [Hernandez et al, 2008], Arachne [Batzoglou et al, 2002], PE Assembler [Ariyaratne et al, 2011]
- De Bruijn Graph: EULER [Pevzner et al, 2001], Velvet [Zerbino et al, 2008], ALLPATHS [Butler et al, 2008], SOAPdenovo [Li et al, 2010]
Scaffold Level (II)
- Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]
- Embedded Module: EULER [Pevzner et al, 2001], Arachne [Batzoglou et al, 2002], Celera Assembler [Myers et al, 2000], Velvet [Zerbino, 2008]
- Standalone Module: Bambus [Pop et al, 2004], SOPRA [Dayarian et al, 2010]

Scaffolding Problem [Huson et al, 2002]
[Figure: Contigs linked into a Scaffold by Paired-end Reads, with one Discordant Read; distance labels 1k, 3k, 2.5k]
Value Addition
- Gap Filling: GapCloser Module of SOAPdenovo
- Repeat Resolution
- Long-Range Genomic Structure
* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)

Data
- Sequencing Errors
- Read Length
- Coverage
Analysis
- Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]
Statistics of Assembled Genomes [Schatz et al, 2010]
Organism     Genome Size   # of Contigs   Contig N50   # of Scaffolds   Scaffold N50
Grapevine    500Mb         58,            kb           2,               Mb
Panda        2.4Gb         200,           kb           81,              Mb
Strawberry   220Mb         16,            kb           3,               Mb
Turkey       1.1Gb         128,           kb           26,917           1.5Mb
* Zerbino, D.R.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS ONE, 4(12) (2009)
* Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, (2009)
* Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Research, 20(9), (2010)
* N50: Given a set of sequences of varying lengths, the N50 length is defined as the largest length N for which at least 50% of all bases in the sequences are in a sequence of length L >= N.
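The N50 definition above can be made concrete with a short sketch. The function name and the example lengths are made up for illustration and are not taken from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length N such that sequences of length >= N
    together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down and stop as soon as the
    # sequences seen so far cover at least half of all bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (290 bases in total).
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

With these lengths, the two longest contigs (80 + 70 = 150) already cover more than half of the 290 bases, so the N50 is 70.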

NP-Complete [Huson et al, 2002]
* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)

Heuristic Methods
- Celera Assembler [Myers et al, 2000]
- Euler [Pevzner et al, 2001]
- Jazz [Chapman et al, 2002]
- Arachne [Batzoglou et al, 2002]
- Velvet [Zerbino et al, 2008]
- Bambus [Pop et al, 2004]
“True Complexity”
- Phase transition based on parameters [Hayes, 1996]: 3-SAT Problem
- Parametric Complexity [Rodney et al, 1999]: Vertex Cover Problem, Fixed-parameter tractability
* Hayes, B. Can't get no satisfaction. American Scientist. 85, (1996).
* Rodney G. D., et al. Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS. Vol

Outline
Overview
-> Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
Results
Ongoing Work

1. Pre-Processing
Paired-end Reads -> Clusters [Huson et al, 2002]
Chimeric Noise: Filtered by simulation
* Upper Bound of Paired-end Reads
[Figure: Chimera; label: 3]
* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
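The step from paired-end reads to clusters can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure from the talk or from Huson et al: the tuple layout, the `min_support` threshold and the use of the median are all hypothetical choices, and a real bundling step would also check that the distances within a group agree with each other.

```python
from collections import defaultdict
from statistics import median


def cluster_links(links, min_support=2):
    """Group paired-end links by the contig pair and orientation they connect.

    links: iterable of (contig_a, contig_b, orientation, distance) tuples,
           where distance is the contig separation implied by one read pair.
    min_support: hypothetical threshold; groups with fewer links are
                 treated as likely chimeric noise and dropped.
    Returns {(contig_a, contig_b, orientation): (median_distance, n_links)}.
    """
    groups = defaultdict(list)
    for contig_a, contig_b, orientation, distance in links:
        groups[(contig_a, contig_b, orientation)].append(distance)

    clusters = {}
    for key, distances in groups.items():
        if len(distances) >= min_support:
            clusters[key] = (median(distances), len(distances))
    return clusters


# Made-up links: three read pairs support A->B, a single pair suggests A->C.
example_links = [
    ("A", "B", "++", 1000),
    ("A", "B", "++", 1100),
    ("A", "B", "++", 900),
    ("A", "C", "+-", 3000),
]
print(cluster_links(example_links))
```

In this toy input the lone A to C link is discarded, which is the kind of effect a support threshold is meant to have on chimeric pairs.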

2. A Special Case
No discordant clusters in final scaffold
Naïve Solution (contigs A, B, C, D)
+A
+A+B, +A-B, +A+C, +A-C
+A+B+C, +A+B-C, +A-C+B, +A-C-B, …
…
Exponential Time
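To see why the naïve enumeration takes exponential time, the sketch below lists every complete ordering of the contigs with every choice of orientation. It is only an illustration of the search-space size; the function names are made up, and it enumerates full scaffolds rather than growing partial ones as the slide's tree does.

```python
from itertools import permutations, product
from math import factorial


def signed_orderings(contigs):
    """Yield every ordering of the contigs with every +/- orientation."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(order)):
            yield "".join(sign + name for sign, name in zip(signs, order))


def count_signed_orderings(n):
    """Number of candidates for n contigs: n! orderings times 2^n orientations."""
    return factorial(n) * 2 ** n


print(list(signed_orderings("AB")))
print(count_signed_orderings(4))
```

For the four contigs A, B, C, D this already gives 4! * 2^4 = 384 candidates, and the count grows faster than exponentially as contigs are added.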

Dynamic Programming
Scaffold Tail is Sufficient
Analogous to Bandwidth Problem [Saxe, 1980]
- Orientation of Nodes
- Direction of Edges
- Discordant Edges
- …
[Figure labels: width (w), Upper Bound]
* J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods, 1(4), (1980)

Equivalence class of scaffolds
S1 and S2 have the same tail -> They are in the same class
Feature of equivalence class:
- Use of the same set of contigs;
- All or none of them can be extended to a solution
[Figure: Tail; +A-B+C +D+E; -A+C +D+E+F; …]

3. Full Algorithm
Consider discordant clusters
Equivalence Class: Number of Discordant Edges (p)
- Chimeric Reads
- Sequencing Errors
- Mapping Errors
[Figure: ACCAAAATTT vs. ACCAAGAATTT; CTAGAA vs. CAAGAA ?]

4. Graph Contraction
[Figure; label: 20k]

4. Graph Contraction

5. Gap Estimation
Utility
- Genome finishing (Genome Size Estimation)
- Scaffold Correctness
Calculate Gap Sizes
- Maximum Likelihood
- Quadratic Function
- Solved through quadratic programming [Goldfarb et al, 1983]
- Polynomial Time
[Figure: gaps g1, g2, g3; library parameters μ, σ]
* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27 (1983)
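One way to see how maximum likelihood leads to a quadratic function is sketched below. The notation is an assumption made for this illustration and does not come from the slides: E is the set of paired-end clusters, each cluster e has library mean μ_e and standard deviation σ_e, spans the set of gaps G(e), and covers contigs of total length c_e. If each cluster is modelled as an independent Gaussian observation, taking the logarithm of the likelihood leaves a weighted sum of squares in the gap sizes g_i:

```latex
% Assumed notation (hypothetical, not from the slides):
%   E       set of paired-end clusters
%   G(e)    gaps spanned by cluster e
%   c_e     total length of the contigs spanned by cluster e
%   \mu_e, \sigma_e   mean and standard deviation of the library behind e
\begin{align*}
\hat{g}
  &= \arg\max_{g} \prod_{e \in E} \frac{1}{\sqrt{2\pi}\,\sigma_e}
     \exp\!\left( -\frac{\bigl(\mu_e - c_e - \sum_{i \in G(e)} g_i\bigr)^2}{2\sigma_e^2} \right) \\
  &= \arg\min_{g} \sum_{e \in E}
     \frac{\bigl(\mu_e - c_e - \sum_{i \in G(e)} g_i\bigr)^2}{\sigma_e^2}
\end{align*}
```

The objective in the second line is a convex quadratic in the gap sizes, which is the kind of problem a quadratic programming solver such as the Goldfarb and Idnani dual method handles in polynomial time.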

Outline
Overview
Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
-> Results
Ongoing Work

Runtime Comparison
          ◆ E. coli   ★ B. pseudomallei   ◆ S. cerevisiae   ◆ D. melanogaster
Bambus    50s         16m                 2m                3m
SOPRA     49m         -                   2h                5h
Opera     4s          7m                  11s               30s
Coverage of 300bp insert library: >20X
Coverage of 10kbp insert library: 2X
Contigs assembled using Velvet
◆ Simulated data set using MetaSim
★ In house data

Scaffold Contiguity

Scaffold Correctness

Scaffold Correctness
E. coli, S. cerevisiae, D. melanogaster
Opera134
Bambus

Ongoing Work
A Rodent Genome
          Genome Size   N50
Opera     ~2Gbp         765.5Kbp
SSpace                  281.7Kbp
A Tree Genome
          Genome Size   N50        Max Length
Opera     ~300Mbp       209.9Kbp   921.8Kbp

Ongoing Work
- Repeats
- Lower bounds and better scaffold
- Multiple Libraries
- Other applications: Metagenomics, Cancer Genomics
Link:

Acknowledgement
Wing-Kin Sung, Niranjan Nagarajan, Pramila N. Ariyaratne
Funding:
- A*STAR of Singapore
- Ministry of Education, Singapore
- NUS Graduate School for Integrative Sciences and Engineering (NGS)
Questions?