Published by Herbert Collins. Modified over 9 years ago.
1
Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences
Song Gao, Niranjan Nagarajan, Wing-Kin Sung
National University of Singapore
Genome Institute of Singapore
2
Outline
Overview
Methods
 - 1. Pre-Processing
 - 2. A Special Case
 - 3. Full Algorithm
 - 4. Graph Contraction
 - 5. Gap Estimation
Results
Ongoing Work
3
[Figure: biological entities (Genome, Transcripts, Microbial Community) pass through a Sequencing Machine to give Reads (ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…, CTTAGAATCGGATAGAC…, AGGCATAGACTAGAG…); Analysis of the reads gives data entities (Genomic Sequence, Transcript Assembly, Metagenome).]
4
Sequence Assembly
Reads -> (I) -> Contigs -> (II) -> Scaffolds, with Paired-end Reads
Related Research Works
Contig Level
 - OLC Framework: Celera Assembler [Myers et al, 2000], Edena [Hernandez et al, 2008], Arachne [Batzoglou et al, 2002], PE Assembler [Ariyaratne et al, 2011]
 - De Bruijn Graph: EULER [Pevzner et al, 2001], Velvet [Zerbino et al, 2008], ALLPATHS [Butler et al, 2008], SOAPdenovo [Li et al, 2010]
Scaffold Level
 - Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]
 - Embedded Module: EULER [Pevzner et al, 2001], Arachne [Batzoglou et al, 2002], Celera Assembler [Myers et al, 2000], Velvet [Zerbino, 2008]
 - Standalone Module: Bambus [Pop et al, 2004], SOPRA [Dayarian et al, 2010]
5
Scaffolding Problem [Huson et al, 2002]
[Figure: contigs joined into a scaffold by paired-end reads, with distance labels 1k, 3k and 2.5k and one discordant read marked.]
Value Addition
 - Gap Filling: GapCloser Module of SOAPdenovo
 - Repeat Resolution
 - Long-Range Genomic Structure
* Huson, D.H., Reinert, K., Myers, E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
6
Data
 - Sequencing Errors
 - Read Length
 - Coverage
Analysis
 - Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]
Statistics of Assembled Genomes [Schatz et al, 2010]
Organism    Genome Size  # of Contigs  Contig N50  # of Scaffolds  Scaffold N50
Grapevine   500Mb        58,611        18.2kb      2,093           1.33Mb
Panda       2.4Gb        200,604       36.7kb      81,469          1.22Mb
Strawberry  220Mb        16,487        28.1kb      3,263           1.44Mb
Turkey      1.1Gb        128,271       12.6kb      26,917          1.5Mb
* N50: Given a set of sequences of varying lengths, the N50 length is the largest length N such that at least 50% of all bases lie in sequences of length L >= N.
* Zerbino, D.R.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the Velvet short-read de novo assembler. PLoS ONE 4(12) (2009)
* Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, 336-346 (2009)
* Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Research 20(9), 1165-1173 (2010)
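The N50 definition on this slide can be made concrete with a few lines of Python. This is a minimal sketch, not code from the talk: the function name `n50` and the toy contig lengths are assumptions chosen for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # The first length at which the running sum reaches half of all
        # bases is the N50.
        if 2 * running >= total:
            return length


# Toy values for illustration only (not data from the talk).
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # prints 70
```

Here the total is 290 bases; the two longest contigs hold 150 of them, which is the first point where at least half of the bases are covered, so the N50 is 70.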
7
NP-Complete [Huson et al, 2002]
* Huson, D.H., Reinert, K., Myers, E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
8
Heuristic Methods
 - Celera Assembler [Myers et al, 2000]
 - Euler [Pevzner et al, 2001]
 - Jazz [Chapman et al, 2002]
 - Arachne [Batzoglou et al, 2002]
 - Velvet [Zerbino et al, 2008]
 - Bambus [Pop et al, 2004]
“True Complexity”
 - Phase transition based on parameters [Hayes, 1996]: 3-SAT Problem
 - Parametric Complexity [Rodney et al, 1999]: Vertex Cover Problem, fixed-parameter tractability
* Hayes, B.: Can't get no satisfaction. American Scientist 85, 108-112 (1996)
* Rodney G. D., et al.: Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS, Vol. 49 (1999)
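The Vertex Cover Problem is the textbook illustration of fixed-parameter tractability: deciding whether a cover of size at most k exists takes time exponential only in k, not in the size of the graph. The sketch below is the standard bounded search tree for vertex cover, shown only to illustrate that idea; it is not part of Opera, and the function name `vertex_cover` is an assumption.

```python
def vertex_cover(edges, k):
    """Return a vertex cover with at most k vertices, or None if none exists.

    edges: list of (u, v) pairs. The recursion depth is at most k and each
    level branches two ways, so at most 2**k branches are explored.
    """
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is spent
    u, v = edges[0]
    # Any cover must contain u or v, so try both choices.
    for chosen in (u, v):
        remaining = [e for e in edges if chosen not in e]
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {chosen}
    return None


path = [("a", "b"), ("b", "c"), ("c", "d")]
print(vertex_cover(path, 1))  # prints None: one vertex cannot cover the path
print(vertex_cover(path, 2) is not None)  # prints True
```

The running time is bounded by roughly 2^k times the number of edges, which stays practical whenever the parameter k is small even if the graph is large.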
10
1. Pre-Processing
 - Paired-end Reads -> Clusters [Huson et al, 2002]
 - Chimeric Noise: Filtered by simulation
* Upper Bound of Paired-end Reads 3
[Figure label: Chimera]
* Huson, D.H., Reinert, K., Myers, E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
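A rough sketch of the reads-to-clusters step follows. It is a simplification, not the bundling procedure of Huson et al. or Opera's own code: the tuple layout, the orientation strings, and the fixed `min_support` cut-off are all assumptions made for illustration (the slide says the cut-off for chimeric noise is derived by simulation, which is not reproduced here).

```python
from collections import defaultdict
from statistics import mean


def cluster_links(links, min_support=2):
    """Bundle read-pair links into one edge per (contig pair, orientation).

    links: iterable of (contig_a, contig_b, orientation, distance) tuples.
    min_support: hypothetical cut-off; clusters supported by fewer read
    pairs are treated as chimeric noise and dropped.
    Returns {(contig_a, contig_b, orientation): (mean_distance, support)}.
    """
    groups = defaultdict(list)
    for contig_a, contig_b, orientation, distance in links:
        groups[(contig_a, contig_b, orientation)].append(distance)

    clusters = {}
    for key, distances in groups.items():
        if len(distances) >= min_support:
            clusters[key] = (mean(distances), len(distances))
    return clusters


# Toy links for illustration only.
links = [
    ("A", "B", "++", 1000),
    ("A", "B", "++", 1100),
    ("A", "C", "+-", 5000),  # a single supporting pair: dropped as noise
]
print(cluster_links(links))
```

With the toy input, only the A-B cluster survives, with a mean distance of 1050 from two supporting pairs.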
11
2. A Special Case
No discordant clusters in final scaffold
Naïve Solution
 - +A
 - +A+B, +A-B, +A+C, +A-C
 - +A+B+C, +A+B-C, +A-C+B, +A-C-B, …
 - …
Exponential Time
[Figure: contigs A, B, C, D]
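To see why the naïve solution takes exponential time, one can simply enumerate every ordering and orientation of the contigs. The sketch below does that; the function name `naive_scaffolds` is an assumption, and unlike the slide it does not fix the first contig, which changes the count only by a constant factor.

```python
from itertools import permutations, product


def naive_scaffolds(contigs):
    """Yield every ordering of the contigs with every orientation choice."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(order)):
            yield "".join(sign + name for sign, name in zip(signs, order))


# Count the candidates for 1 to 4 contigs.
for n in range(1, 5):
    count = sum(1 for _ in naive_scaffolds("ABCD"[:n]))
    print(n, count)
```

The number of candidates is n! · 2^n (2, 8, 48, 384 for n = 1 to 4), so checking each one against the clusters quickly becomes infeasible.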
12
Dynamic Programming
 - Scaffold Tail is Sufficient
 - Analogous to Bandwidth Problem [Saxe, 1980]
 - Orientation of Nodes
 - Direction of Edges
 - Discordant Edges
 - …
[Figure labels: width (w), Upper Bound]
* Saxe, J.: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods 1(4), 363-369 (1980)
13
Equivalence class of scaffolds
S1 and S2 have the same tail -> They are in the same class
Feature of equivalence class:
 - Use of the same set of contigs;
 - All or none of them can be extended to a solution
[Figure: scaffold examples +A-B+C +D+E -A+C +D+E+F … with the label "Tail"]
14
3. Full Algorithm
Consider discordant clusters
 - Equivalence Class
 - Number of Discordant Edges (p)
 - Chimeric Reads
 - Sequencing Errors
 - Mapping Errors
[Figure: sequence examples ACCAAAATTT / ACCAAGAATTT and CTAGAA / CAAGAA ?]
15
4. Graph Contraction
[Figure label: 20k]
16
4. Graph Contraction
18
5. Gap Estimation
Utility
 - Genome finishing (Genome Size Estimation)
 - Scaffold Correctness
Calculate Gap Sizes
 - Maximum Likelihood
 - Quadratic Function
 - Solved through quadratic programming [Goldfarb et al, 1983]
 - Polynomial Time
[Figure labels: gaps g1, g2, g3; μ, σ]
* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27 (1983)
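The slide does not spell out the likelihood, so the following is only a sketch of why maximum likelihood leads to a quadratic function. It assumes that each cluster e gives a Gaussian distance estimate with mean μ_e and standard deviation σ_e; the symbols E (set of clusters), G(e) (gaps spanned by cluster e) and ℓ_e (total length of the contigs lying between the two ends of e) are notation introduced here, not taken from the talk.

```latex
\begin{aligned}
\log L(g_1,\dots,g_m) &= \text{const} - \sum_{e \in E}
  \frac{\bigl(\mu_e - \ell_e - \sum_{i \in G(e)} g_i\bigr)^2}{2\sigma_e^2}, \\
(\hat g_1,\dots,\hat g_m) &= \arg\min_{g_1,\dots,g_m} \sum_{e \in E}
  \frac{\bigl(\mu_e - \ell_e - \sum_{i \in G(e)} g_i\bigr)^2}{\sigma_e^2}.
\end{aligned}
```

Under these assumptions, maximizing the likelihood is the same as minimizing a convex quadratic function of the gap sizes, which is the kind of problem a quadratic programming method such as that of Goldfarb and Idnani solves in polynomial time.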
20
Runtime Comparison
          ◆ E. coli   ★ B. pseudomallei   ◆ S. cerevisiae   ◆ D. melanogaster
Bambus    50s         16m                 2m                3m
SOPRA     49m         -                   2h                5h
Opera     4s          7m                  11s               30s
Coverage of 300bp insert library: >20X
Coverage of 10kbp insert library: 2X
Contigs assembled using Velvet
◆ Simulated data set using MetaSim
★ In-house data
21
Scaffold Contiguity
22
Scaffold Correctness
23
Scaffold Correctness
Columns: E. coli, S. cerevisiae, D. melanogaster
Opera: 1, 3, 4
Bambus: 1955423
24
Ongoing Work
A Rodent Genome
          Genome Size   N50
Opera     ~2Gbp         765.5Kbp
SSpace                  281.7Kbp
A Tree Genome
          Genome Size   N50        Max Length
Opera     ~300Mbp       209.9Kbp   921.8Kbp
25
Ongoing Work
 - Repeats
 - Lower bounds and better scaffold
 - Multiple Libraries
 - Other applications: Metagenomics, Cancer Genomics
Link: https://sourceforge.net/projects/operasf/
26
Acknowledgement
 - Wing-Kin Sung
 - Niranjan Nagarajan
 - Pramila N. Ariyaratne
Funding:
 - A*STAR of Singapore
 - Ministry of Education, Singapore
 - NUS Graduate School for Integrative Sciences and Engineering (NGS)
Questions?