ILP-BASED MAXIMUM LIKELIHOOD GENOME SCAFFOLDING
James Lindsay, Ion Mandoiu (University of Connecticut)
Hamed Salooti, Alex Zelikovsky (Georgia State University)



De-novo Assembly
Sequencing turns the genome into reads, assembly turns reads into contigs, and scaffolding turns contigs into scaffolds.

Why Scaffolding?
Annotation. Comparative biology. Re-sequencing and gap filling. Structural variation.
Figure: gene XYZ with its 5' UTR and 3' UTR, shown on a scaffold and with no scaffold.
Figure: the same gene after Sanger sequencing. Biologist: "There are holes in my genes!"

Read Pairs
Paired read construction: reads R1 and R2 are about 2 kb apart, on the same strand and in the same orientation.
Informative reads: align each read against the contigs. Keep uniquely mapped reads. Save repetitive reads. Both reads in a pair must map to different contigs.
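The filtering rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the `alignments` layout (pair id mapped to two hit lists of `(contig, position, strand)` tuples) and the function name are assumptions made for the example.

```python
def informative_pairs(alignments):
    """Keep pairs whose two reads each map uniquely, and to different contigs.

    alignments: dict pair_id -> (hits_r1, hits_r2), where each hit is a
    (contig, position, strand) tuple. This layout is hypothetical.
    """
    kept = {}
    for pair_id, (hits_r1, hits_r2) in alignments.items():
        # A read with zero hits or several hits is not uniquely mapped.
        if len(hits_r1) != 1 or len(hits_r2) != 1:
            continue
        # Pairs inside a single contig carry no linkage information.
        if hits_r1[0][0] == hits_r2[0][0]:
            continue
        kept[pair_id] = (hits_r1[0], hits_r2[0])
    return kept


example = {
    "p1": ([("c1", 100, "+")], [("c2", 40, "+")]),                     # informative
    "p2": ([("c1", 500, "+")], [("c1", 900, "+")]),                    # same contig
    "p3": ([("c1", 10, "+"), ("c3", 77, "-")], [("c2", 5, "+")]),      # repetitive
}
print(sorted(informative_pairs(example)))
```

Repetitive pairs are simply skipped here; a fuller pipeline could store them separately for later use.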

Linkage Information
Two contigs are adjacent if a read pair spans the contigs.
Possible states (A, B, C, D): the state depends on the orientation of the reads. The order of the contigs is arbitrary.
Each read pair can be "consistent" with one of the four states.
Figure: contig i and contig j drawn 5' to 3', with reads R1 and R2 placed in each of the states A, B, C, D.

Scaffolding Problem
Input: contigs; linkage information from paired reads.
Output: contig orientation; contig ordering; relative distance between contigs.
Objective: find the longest and most accurate scaffolds.

Existing Tools
Bambus2, GRASS, OPERA, MIP, SCARPA, SOPRA, SGA, SOAPdenovo, SSPACE.

SILP Flow
1. Mapping of reads onto contigs with bowtie2.
2. Scaffolding graph construction.
3. Graph decomposition into 3-connected components via SPQR trees.
4. Maximum likelihood contig orientation via ILP + NSDP.
5. Decomposition into paths of orientation-compatible edges via bipartite matching.
6. Gap estimation via quadratic programming.

Scaffolding Graph
Scaffolding graph: G = (V, E, w).
V = set of all contigs.
E = set of pairs of contigs connected by mapped read pairs of a particular state (A, B, C, or D).
Edge weight = probability of the read pairs being correctly aligned, based on the amount of repeat overlap and the contig coverage dissimilarity.
Figure: contigs joined by read pairs.
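A minimal sketch of how such a graph might be assembled from informative read pairs. The input tuple layout `(contig_i, contig_j, state, p)` and the function name are assumptions for illustration; the slide does not say how the per-pair probability `p` is computed, so it is taken as given.

```python
from collections import defaultdict

VALID_STATES = {"A", "B", "C", "D"}


def build_scaffolding_graph(links):
    """Group read-pair links into edges keyed by (contig_i, contig_j, state).

    links: iterable of (contig_i, contig_j, state, p), where p is the
    (externally supplied) probability that the pair is correctly aligned.
    The caller is assumed to list each contig pair in one fixed order.
    """
    vertices = set()
    bundles = defaultdict(list)
    for contig_i, contig_j, state, p in links:
        if state not in VALID_STATES:
            raise ValueError(f"unknown state {state!r}")
        vertices.update((contig_i, contig_j))
        bundles[(contig_i, contig_j, state)].append(p)
    return vertices, dict(bundles)


links = [
    ("c1", "c2", "A", 0.95),
    ("c1", "c2", "A", 0.90),
    ("c1", "c2", "B", 0.55),   # a conflicting pair for the same contigs
    ("c2", "c3", "D", 0.80),
]
V, E = build_scaffolding_graph(links)
print(sorted(V), E[("c1", "c2", "A")])
```

Keeping one bundle per state preserves conflicting evidence for the same contig pair, which the later orientation step has to resolve.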

Structure of Scaffold Graph
Paired reads have an upper bound on their span (2 kb here).
Contigs have a minimum size.
Together these give an upper bound on the number of contigs a read pair can span.
The scaffold graph therefore has bounded width (Opera).
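The bound can be made concrete with a small calculation. This sketch assumes non-negative gaps and ignores read lengths; the function name and the 500 bp minimum contig size are illustrative values, not figures from the slides.

```python
def max_contigs_touched(max_insert, min_contig):
    """Upper bound on contigs a single read pair can involve.

    Every contig lying strictly between the two reads must fit inside the
    insert, so at most max_insert // min_contig contigs fit in between,
    plus the two contigs that hold the reads themselves.
    """
    if max_insert <= 0 or min_contig <= 0:
        raise ValueError("sizes must be positive")
    return max_insert // min_contig + 2


# With a 2 kb insert and contigs of at least 500 bp (an assumed cutoff):
print(max_contigs_touched(2000, 500))
```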

Elimination Order: SPQR-tree
Figure: the graph is decomposed into bi-connected and then tri-connected components; the resulting tree gives the elimination order.
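A full SPQR-tree construction is too long to show here. As a hedged illustration of the first level of the decomposition only, the sketch below finds the 1-cuts (articulation points) that separate bi-connected components, using the classic depth-first low-link method. The adjacency-dict input and the function name are assumptions, and the recursive form is meant for small graphs.

```python
def articulation_points(adj):
    """Return the cut vertices of a simple undirected graph.

    adj: dict vertex -> set of neighbouring vertices.
    """
    disc, low, cuts = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier vertex.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot climb above u, so u separates it.
                if parent is not None and low[v] >= disc[u]:
                    cuts.add(u)
        # The DFS root is a cut vertex only with two or more DFS children.
        if parent is None and children > 1:
            cuts.add(u)

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return cuts


# Two triangles sharing contig "c": removing "c" disconnects them.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
}
print(articulation_points(graph))
```

Splitting bi-connected components further at 2-cuts is what the SPQR tree adds on top of this step.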

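Maximum Likelihood Contig Orientation
Given read pair $r$, let $p_r$ be the probability of $r$ being correctly aligned.
For a given orientation $O$, let $R_O$ be the subset of all reads $R$ agreeing with $O$.
The probability of $O$ being correct is estimated as $\prod_{r \in R_O} p_r \prod_{r \in R \setminus R_O} (1 - p_r)$.
Log likelihood: $\sum_{r \in R_O} \log p_r + \sum_{r \in R \setminus R_O} \log (1 - p_r)$.
Since $\sum_{r \in R} \log (1 - p_r)$ does not depend on $O$, this is equivalent to maximizing $\sum_{r \in R_O} \log \frac{p_r}{1 - p_r}$.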
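A small numeric check of the equivalence, assuming read pairs are independent so that the likelihood of an orientation is the product of $p_r$ over agreeing pairs and $1 - p_r$ over the rest. The function names and the example probabilities are illustrative.

```python
import math


def log_likelihood(p, agrees):
    """Log likelihood of an orientation; `agrees` is the set R_O."""
    return sum(
        math.log(p[r]) if r in agrees else math.log(1.0 - p[r])
        for r in p
    )


def log_odds_score(p, agrees):
    """Sum of log(p_r / (1 - p_r)) over the agreeing read pairs only."""
    return sum(math.log(p[r] / (1.0 - p[r])) for r in agrees)


p = {"r1": 0.9, "r2": 0.6, "r3": 0.75}
# Constant term shared by every orientation.
constant = sum(math.log(1.0 - q) for q in p.values())

for agrees in (set(), {"r1"}, {"r1", "r3"}, set(p)):
    gap = log_likelihood(p, agrees) - log_odds_score(p, agrees)
    print(sorted(agrees), round(gap - constant, 12))
```

The two scores differ by the same constant for every orientation, so they rank orientations identically.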

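Integer Linear Program Formulation
Variables:
Binary $S_i$ = 0 if the i-th contig keeps its default orientation, = 1 if it is flipped.
Binary $S_{ij}$ = 0 if the contigs of edge (i, j) are both flipped or both not flipped, = 1 otherwise.
Binary $A_{ij}$ = 1 if edge (i, j) is in state A, = 0 otherwise (similarly $B_{ij}$, $C_{ij}$, $D_{ij}$).
Weight of edge (i, j) in state A: $w^A_{ij} = \sum_r \log \frac{p_r}{1 - p_r}$ over the read pairs $r$ of that edge consistent with state A (similarly $w^B_{ij}$, $w^C_{ij}$, $w^D_{ij}$).
Objective: maximize $\sum_{(i,j) \in E} \left( w^A_{ij} A_{ij} + w^B_{ij} B_{ij} + w^C_{ij} C_{ij} + w^D_{ij} D_{ij} \right)$.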
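To show what the orientation variables encode, here is an exhaustive-search stand-in for the ILP on a tiny instance. It is not an ILP solver and is only feasible for a handful of contigs. As a simplifying assumption, the four state weights of each edge are collapsed into two hypothetical numbers: one for "same relative orientation" ($S_{ij} = 0$) and one for "opposite" ($S_{ij} = 1$).

```python
from itertools import product


def best_orientation(contigs, edges):
    """Try every flip assignment and return (best score, flips).

    edges: dict (i, j) -> (w_same, w_opposite); both weights are
    illustrative aggregates, not the slide's exact per-state weights.
    """
    best_score, best_flip = float("-inf"), None
    for bits in product((0, 1), repeat=len(contigs)):
        flip = dict(zip(contigs, bits))          # flip[i] plays the role of S_i
        score = 0.0
        for (i, j), (w_same, w_opposite) in edges.items():
            s_ij = flip[i] ^ flip[j]             # S_ij = S_i XOR S_j
            score += w_opposite if s_ij else w_same
        if score > best_score:
            best_score, best_flip = score, flip
    return best_score, best_flip


contigs = ["a", "b", "c"]
edges = {("a", "b"): (2.0, 0.5), ("b", "c"): (0.1, 3.0), ("a", "c"): (1.0, 1.5)}
score, flips = best_orientation(contigs, edges)
print(score, flips)
```

Flipping every contig at once gives the same score, so the search reports the first of the two symmetric optima.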

ILP Constraints
Connecting $S_i$ and $S_{ij}$.
Connecting $S_{ij}$ and $A_{ij}$.
Forbidding 2-cycles.
Forbidding 3-cycles.
Figure: the forbidden 2-cycles on contigs i, j and 3-cycles on contigs i, j, k.
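The constraint formulas themselves are not in the transcript. One standard way to tie $S_{ij}$ to $S_i$ and $S_j$, given that $S_{ij}$ is 0 exactly when both contigs are flipped or both are not, is the usual linearization of XOR with four inequalities. The sketch below checks by enumeration that those inequalities leave exactly one feasible value. It illustrates the technique and is not necessarily the constraint set used on the slide.

```python
from itertools import product


def xor_constraints_hold(s_i, s_j, s_ij):
    """Four linear inequalities that force s_ij = s_i XOR s_j on binaries."""
    return (
        s_ij <= s_i + s_j          # both unflipped  -> s_ij = 0
        and s_ij <= 2 - s_i - s_j  # both flipped    -> s_ij = 0
        and s_ij >= s_i - s_j      # only i flipped  -> s_ij = 1
        and s_ij >= s_j - s_i      # only j flipped  -> s_ij = 1
    )


# For each (S_i, S_j), list the values of S_ij the inequalities allow.
feasible = {
    (a, b): [c for c in (0, 1) if xor_constraints_hold(a, b, c)]
    for a, b in product((0, 1), repeat=2)
}
print(feasible)
```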

Non-Serial Dynamic Programming: 1-cuts
Figure: splitting at a 1-cut separates 2-component A (solved by ILP_A) from 2-component B (solved by ILP_B); collapsing the 1-cut merges the results.

Non-Serial Dynamic Programming: 2-cuts
Figure: splitting at a 2-cut separates 3-component A from 3-component B; ILP_B is solved first, the 2-cut is collapsed, and ILP_A is then solved with the results.

Total Ordering via Bipartite Matching
Figure: ILP output, then bipartite graph, then bipartite matching, then total order, illustrated on contigs A to F.
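One way to picture the matching step: give every contig an "out" slot and an "in" slot, match out-slots to in-slots along oriented edges, and read paths off the matched pairs. The sketch below uses a maximum-cardinality matching (Kuhn's augmenting-path algorithm) and ignores edge weights, which is a simplification. It also assumes the oriented graph is acyclic. The arc list is made up for the example.

```python
def path_cover(nodes, arcs):
    """Decompose a DAG into vertex-disjoint paths via bipartite matching.

    nodes: list of contigs; arcs: dict u -> list of successors v.
    """
    match_in = {}  # v -> u, meaning u is chosen as v's predecessor

    def try_assign(u, seen):
        # Look for an augmenting path starting at u's out-slot.
        for v in arcs.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            if v not in match_in or try_assign(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    for u in nodes:
        try_assign(u, set())

    successor = {u: v for v, u in match_in.items()}
    paths = []
    for start in nodes:
        if start in match_in:
            continue  # has a predecessor, so it is not a path start
        path = [start]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        paths.append(path)
    return paths


nodes = ["A", "B", "C", "D", "E", "F"]
arcs = {"A": ["C", "D"], "B": ["F", "E"], "C": ["D"], "E": ["A"], "F": ["E"]}
print(path_cover(nodes, arcs))
```

A weighted matching would prefer better-supported edges when several orderings are possible; the unweighted version only maximizes the number of joins.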

Gap Estimation via QP
Maximum likelihood gap estimation, following Opera and previous scaffolders.
Figure: consecutive contigs i, i+1, i+2, i+3 separated by gaps.
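The slide does not give the quadratic program. A simplified sketch in the same spirit, under stated assumptions: each link says that the sum of a run of consecutive gaps should be close to a target (expected insert size minus the sequence already accounted for), all links have equal variance, and gaps are unconstrained. Minimizing the squared residuals then reduces to solving the normal equations, done here with plain Gaussian elimination. The function name and input layout are hypothetical.

```python
def estimate_gaps(n_gaps, links):
    """Least-squares gap sizes along one scaffold.

    links: list of (first_gap, last_gap, target), meaning the gaps with
    indices first_gap..last_gap should sum to roughly `target`.
    """
    # Build the normal equations M g = b.
    M = [[0.0] * n_gaps for _ in range(n_gaps)]
    b = [0.0] * n_gaps
    for first, last, target in links:
        span = range(first, last + 1)
        for r in span:
            b[r] += target
            for c in span:
                M[r][c] += 1.0
    # Forward elimination with partial pivoting.
    for col in range(n_gaps):
        pivot = max(range(col, n_gaps), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("gaps are not determined by the links")
        M[col], M[pivot] = M[pivot], M[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n_gaps):
            factor = M[r][col] / M[col][col]
            for c in range(col, n_gaps):
                M[r][c] -= factor * M[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    gaps = [0.0] * n_gaps
    for r in reversed(range(n_gaps)):
        rest = sum(M[r][c] * gaps[c] for c in range(r + 1, n_gaps))
        gaps[r] = (b[r] - rest) / M[r][r]
    return gaps


# Two gaps: two links over gap 0, one over gap 1, one spanning both.
print(estimate_gaps(2, [(0, 0, 100), (0, 0, 120), (1, 1, 50), (0, 1, 170)]))
```

A production version would weight each link by its insert-size variance and add bounds on the gaps, which is what makes it a constrained quadratic program.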

Results
Verification on simulation: simulate contigs from real draft assemblies (GAGE) and real reads; control the error rate; the actual orientation, order, and position are known.
Evaluation on real assembly: draft human genome and real reads; alignment-based evaluation is challenging.

GAGE Simulation
Steps: take the Staph, Rhodo, and Chr14 assemblies; mix all contig sizes and sample uniformly at random to build simulated contigs (or gaps); align reads against the simulated contigs; control error rates; repeat 10x.
Metrics: binary classification of the (n-1) real edges; corrected N50 (break edges at bad points); runtime; scaffold size distribution.
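Two of these metrics have standard definitions that are easy to state in code: the Matthews correlation coefficient (MCC) for the binary edge classification, and N50 for scaffold lengths. The sketch below uses the textbook formulas; how true and false edges are counted in the simulation is not specified here, so the confusion-matrix counts are taken as inputs.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denom


def n50(lengths):
    """Largest L such that pieces of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


print(mcc(5, 0, 5, 0), n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A corrected N50 would apply `n50` to the scaffold pieces obtained after breaking each scaffold at its incorrect joins.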

Simulation: MCC (figure)

Simulation: N50: Staph (figure)

Simulation: N50: Rhodo (figure)

Simulation: N50: Chr14 (figure)

NA x: Runtime (figure)

Metagenomics: GAGE Simulation (figure)

Conclusions
Powerful and flexible ILP for genome-scale problems.
NSDP is a viable solution for big problems.
ILP will probably work well in the metagenomics domain.
Validation through N50, gene content,
Future work: validate through a recent framework from Genome Biology 03/

Thank you!

Nonserial Dynamic Programming (NSDP): Scaffolding NSDP Algorithm
Compute the ILP solution in stages such that each stage uses results from the previous stage. Use the SPQR-tree to determine the variable elimination order.

stack = [root of SPQR-tree]
visited = {empty}
while stack is not empty do
    p = stack.pop()
    if p not in visited then
        add p to visited
        stack.push(p)
        foreach child q of p do
            stack.push(q)
        endfor
    else if p is root then
        sol_final = ILP(p)
    else
        (s, t) = getcut(p, parent(p))
        sol00 = ILP(p, s=0, t=0), sol01 = ILP(p, s=0, t=1)
        sol10 = ILP(p, s=1, t=0), sol11 = ILP(p, s=1, t=1)
    endif
endwhile
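The traversal can be sketched in runnable form as below. This is an illustration of the bottom-up order only: `solve` is a hypothetical stand-in for the ILP call, `children` is an assumed dict encoding of the SPQR-tree, and the cut vertices are represented just by their four 0/1 assignments.

```python
def nsdp_order(root, children, solve):
    """Solve every tree node after all of its children.

    children: dict node -> list of child nodes.
    solve(node, cut_values, child_tables): placeholder for the ILP call;
    cut_values is None at the root, else an (s, t) pair of 0/1 values.
    """
    results = {}
    stack, expanded = [root], set()
    while stack:
        p = stack.pop()
        if p not in expanded:
            # First visit: revisit p only after its children are done.
            expanded.add(p)
            stack.append(p)
            stack.extend(children.get(p, ()))
            continue
        child_tables = [results[q] for q in children.get(p, ())]
        if p == root:
            results[p] = solve(p, None, child_tables)
        else:
            # One solution per assignment of the two cut vertices.
            results[p] = {
                (s, t): solve(p, (s, t), child_tables)
                for s in (0, 1) for t in (0, 1)
            }
    return results


calls = []


def record(node, cut_values, child_tables):
    calls.append(node)
    return 0


tables = nsdp_order("R", {"R": ["X", "Y"], "X": ["Z"]}, record)
print(calls)
```

Each non-root component is solved four times, once per assignment of its 2-cut, so the parent can pick the combination that fits its own solution.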