ILP-BASED MAXIMUM LIKELIHOOD GENOME SCAFFOLDING. James Lindsay and Ion Mandoiu (University of Connecticut); Hamed Salooti and Alex Zelikovsky (Georgia State University)
De-novo Assembly. [Figure: Genome → Sequencing → Reads → Assembly → Contigs → Scaffolding → Scaffolds]
Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation! [Figure labels: Scaffold: 5' UTR, gene XYZ, 3' UTR. No scaffold: gene XYZ]
Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation! [Figure labels: 5' UTR, gene XYZ, 3' UTR; Sanger sequencing; 5' UTR, gene XYZ, 3' UTR] Biologist: "There are holes in my genes!"
Read Pairs. Paired read construction: reads R1 and R2 come from a 2 kb insert, on the same strand and in the same orientation. Informative reads: align each read against the contigs; keep uniquely mapped reads and set repetitive ones aside; both reads in a pair must map to different contigs.
Linkage Information. Two contigs are adjacent if a read pair spans them. Possible states (A, B, C, D): the state depends on the orientation of the reads; the order of the contigs is arbitrary; each read pair can be "consistent" with one of the four states. [Figure: contig i and contig j drawn 5' to 3', with reads R1 and R2 in configurations A, B, C, D]
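The four states can be illustrated by tallying, for one contig pair, which strand each read of a spanning pair aligns to. This is only a sketch: the assignment of strand combinations to the letters A to D below is an assumption made for illustration (the slide names the states but the mapping lived in the figure), and `pair_state` and `count_states` are hypothetical helpers.

```python
from collections import Counter

# ASSUMPTION: illustrative mapping from (strand of R1 on contig i,
# strand of R2 on contig j) to a state label. The real letter
# assignment is defined by the slide's figure, which is not available.
STATE_BY_STRANDS = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}


def pair_state(strand_r1, strand_r2):
    """Return the state a single read pair is consistent with."""
    return STATE_BY_STRANDS[(strand_r1, strand_r2)]


def count_states(strand_pairs):
    """Count how many read pairs support each state for one contig pair."""
    return Counter(pair_state(s1, s2) for s1, s2 in strand_pairs)


support = count_states([("+", "+"), ("+", "+"), ("+", "-")])
print(support.most_common(1)[0][0])  # state with the most supporting pairs
```

Each read pair votes for exactly one state, so a contig pair may end up with evidence for several conflicting states; resolving those conflicts is what the orientation step is for.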
Scaffolding Problem. Input: contigs; linkage information from paired reads. Output: contig orientation; contig ordering; relative distance between contigs. Objective: find the longest and most accurate scaffolds.
Existing Tools: Bambus2, GRASS, OPERA, MIP, SCARPA, SOPRA, SGA, SOAPdenovo, SSPACE
SILP Flow: (1) mapping of reads onto contigs with bowtie2; (2) scaffolding graph construction; (3) graph decomposition into 3-connected components via SPQR trees; (4) maximum likelihood contig orientation via ILP+NSDP; (5) decomposition into paths of orientation-compatible edges via bipartite matching; (6) gap estimation via quadratic programming
Scaffolding Graph. G = (V, E, w), where V = set of all contigs and E = set of pairs of contigs connected by mapped read pairs of a particular state (A, B, C, or D). Edge weight = probability of the read pairs being correctly aligned, based on: amount of repeat overlap; contig coverage dissimilarity. [Figure: contigs connected by read pairs]
Structure of Scaffold Graph. Paired reads have an upper bound on insert size (2 kb) and contigs have a minimum size, so the number of contigs a read pair can span has an upper bound. Hence the scaffold graph has bounded width (Opera).
Elimination Order: SPQR-tree. [Figure: bi-connected components, tri-connected components, elimination order]
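Maximum Likelihood Contig Orientation. Given read pair r, let p_r be the probability of r being correctly aligned. For a given orientation O, let R_O be the subset of all reads R agreeing with O. Assuming independent read pairs, the probability of O being correct is estimated as $\prod_{r \in R_O} p_r \prod_{r \in R \setminus R_O} (1 - p_r)$. Log likelihood: $\sum_{r \in R_O} \log p_r + \sum_{r \in R \setminus R_O} \log (1 - p_r)$. Since $\sum_{r \in R} \log (1 - p_r)$ does not depend on O, this is equivalent to maximizing $\sum_{r \in R_O} \log \frac{p_r}{1 - p_r}$.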
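To make the likelihood concrete, the sketch below scores an orientation by its log likelihood and finds the best one by brute force on a toy instance. It is an illustration, not the method of the talk: it collapses the four states into "same" versus "opposite" relative orientation, assumes read pairs are independent, and enumerates all orientations, which is only feasible for a handful of contigs. All names and numbers are made up.

```python
import math
from itertools import product


def log_likelihood(orientation, read_pairs):
    """Log likelihood of an orientation.

    orientation: dict contig -> 0 (default) or 1 (flipped).
    read_pairs: list of (i, j, same, p); `same` is True when the pair
    implies contigs i and j have the same relative orientation, and p is
    the probability that the pair is correctly aligned (0 < p < 1).
    """
    total = 0.0
    for i, j, same, p in read_pairs:
        agrees = (orientation[i] == orientation[j]) == same
        total += math.log(p) if agrees else math.log(1.0 - p)
    return total


def best_orientation(contigs, read_pairs):
    """Exhaustive search over all 2^n orientations (toy sizes only)."""
    best_ll, best_o = None, None
    for bits in product((0, 1), repeat=len(contigs)):
        o = dict(zip(contigs, bits))
        ll = log_likelihood(o, read_pairs)
        if best_ll is None or ll > best_ll:
            best_ll, best_o = ll, o
    return best_ll, best_o


# Three mutually inconsistent pairs: the least reliable one is sacrificed.
pairs = [("a", "b", True, 0.9), ("b", "c", False, 0.8), ("a", "c", True, 0.6)]
ll, o = best_orientation(["a", "b", "c"], pairs)
print(o, round(ll, 3))
```

In this toy instance no orientation satisfies all three pairs, so the maximum likelihood solution disagrees with the pair that has the lowest probability of being correct.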
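Integer Linear Program Formulation. Variables: binary S_i = 0 if the i-th contig keeps its default orientation, 1 if it is flipped; binary S_ij = 0 if the contigs of edge (i,j) are both flipped or both not flipped, 1 otherwise; binary A_ij = 1 if edge (i,j) is in state A, 0 otherwise (similarly B_ij, C_ij, D_ij). Weight of edge (i,j) in state A: $w^A_{ij} = \sum_{r} \log \frac{p_r}{1 - p_r}$, summed over the read pairs r between i and j that are consistent with state A (similarly for B, C, D). Objective: maximize $\sum_{(i,j) \in E} (w^A_{ij} A_{ij} + w^B_{ij} B_{ij} + w^C_{ij} C_{ij} + w^D_{ij} D_{ij})$.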
ILP Constraints: connecting S_i and S_ij; connecting S_ij and A_ij; forbidding 2-cycles; forbidding 3-cycles. [Figure: 2-cycle configurations on contigs i, j and 3-cycle configurations on contigs i, j, k]
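Since S_ij is 1 exactly when one of the two contigs is flipped, it is the XOR of S_i and S_j. A common way to express XOR with linear inequalities is sketched below and checked by enumeration; whether the talk used exactly these four inequalities is an assumption.

```python
from itertools import product


def xor_constraints_hold(s_i, s_j, s_ij):
    """Four linear inequalities that together force s_ij = s_i XOR s_j
    when all three variables are binary."""
    return (
        s_ij <= s_i + s_j          # both 0  -> s_ij = 0
        and s_ij >= s_i - s_j      # (1, 0)  -> s_ij = 1
        and s_ij >= s_j - s_i      # (0, 1)  -> s_ij = 1
        and s_ij <= 2 - s_i - s_j  # both 1  -> s_ij = 0
    )


def feasible_sij(s_i, s_j):
    """Values of s_ij that the inequalities allow for fixed s_i, s_j."""
    return [v for v in (0, 1) if xor_constraints_hold(s_i, s_j, v)]


for s_i, s_j in product((0, 1), repeat=2):
    print(s_i, s_j, feasible_sij(s_i, s_j))
```

For every binary choice of S_i and S_j exactly one value of S_ij remains feasible, so the inequalities can be handed to an ILP solver without any nonlinear term.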
Non-Serial Dynamic Programming: 1-cuts. [Figure: splitting a 1-cut into 2-component A (ILP_A) and 2-component B (ILP_B); collapsing the 1-cut]
Non-Serial Dynamic Programming: 2-cuts. [Figure: splitting a 2-cut into 3-component A (ILP_A) and 3-component B (ILP_B); collapsing the 2-cut]
Total Ordering via Bipartite Matching. [Figure: ILP output, bipartite graph, bipartite matching, total order; example on contigs A to F]
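One way to see why a bipartite matching yields an ordering: give every contig an "out" side and an "in" side, and let a matching choose at most one successor and at most one predecessor per contig. The sketch below uses a simple unweighted augmenting-path matching and assumes the candidate successor edges are acyclic; a weighted matching, as the scaffolder presumably uses, is not shown. The contig names and edges are made up.

```python
def max_matching(succ):
    """Maximum bipartite matching between contigs (as predecessors) and
    contigs (as successors). succ: dict contig -> candidate successors.
    Returns dict contig -> chosen successor."""
    pred_of = {}  # successor -> matched predecessor

    def try_assign(u, seen):
        for v in succ.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current predecessor
            # can be re-matched to another successor.
            if v not in pred_of or try_assign(pred_of[v], seen):
                pred_of[v] = u
                return True
        return False

    for u in succ:
        try_assign(u, set())
    return {u: v for v, u in pred_of.items()}


def paths_from_matching(contigs, next_of):
    """Chain matched edges into paths (assumes the edges are acyclic)."""
    has_pred = set(next_of.values())
    paths = []
    for c in contigs:
        if c in has_pred:
            continue  # not the start of a path
        path = [c]
        while path[-1] in next_of:
            path.append(next_of[path[-1]])
        paths.append(path)
    return paths


succ = {"E": ["B", "F"], "F": ["B"], "B": ["A"], "A": ["C"], "C": ["D"]}
next_of = max_matching(succ)
print(paths_from_matching(["A", "B", "C", "D", "E", "F"], next_of))
```

In this example E first claims B as its successor, then F can only follow with B, so the augmenting step moves E to F and a single path covering all six contigs results.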
Gap Estimation via QP. Maximum likelihood gap estimation following Opera and previous scaffolders. [Figure: consecutive contigs i, i+1, i+2, i+3]
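For intuition only, the simplest special case of maximum likelihood gap estimation can be written in closed form. The sketch assumes a single gap, Gaussian insert sizes with known mean and equal variance, and no bias correction; under those assumptions the estimate is the average of "mean insert minus the part of the insert lying inside the two contigs". The quadratic program of the talk, which couples several gaps, is not reproduced here, and `estimate_gap` is a hypothetical helper.

```python
from statistics import mean


def estimate_gap(mean_insert, covered_lengths):
    """Least-squares (Gaussian maximum likelihood) estimate of one gap.

    mean_insert: mean insert size of the library, in bases.
    covered_lengths: for each read pair spanning the gap, the number of
    insert bases that fall inside the two flanking contigs.
    """
    return mean(mean_insert - c for c in covered_lengths)


# Three spanning pairs from a 2 kb library (made-up numbers).
print(estimate_gap(2000, [1500, 1700, 1600]))
```

When a read pair spans more than one gap, the individual estimates are no longer independent, which is where a joint quadratic objective over all gaps becomes useful.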
Results. Verification on simulation: simulate contigs from real draft assemblies (GAGE) and real reads; control the error rate; the actual orientation, order and position are known. Evaluation on real assembly: draft human genome and real reads; alignment-based evaluation is challenging.
GAGE Simulation. Steps: take the Staph, Rhodo and Chr14 assemblies; mix all contig sizes and sample uniformly at random to build simulated contigs (or gaps); align reads against the simulated contigs; control error rates; repeat 10x. Metrics: binary classification of the (n-1) real edges; corrected N50 (break edges at bad points); runtime; scaffold size distribution.
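Two of these metrics have standard definitions that are easy to state in code. The sketch below computes the Matthews correlation coefficient (MCC) from edge classification counts and the plain N50 of a set of scaffold lengths; the "corrected" N50 would first break scaffolds at misjoined edges, which is not shown. Returning 0.0 when the MCC denominator is zero is a common convention, assumed here.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for a binary classification."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


def n50(lengths):
    """Largest length L such that scaffolds of length >= L cover at
    least half of the total length."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


print(mcc(tp=5, fp=0, tn=5, fn=0))
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```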
Simulation: MCC
Simulation: N50: Staph
Simulation: N50: Rhodo
Simulation: N50: Chr14
NA x: Runtime
Metagenomics: GAGE Simulation
Conclusions. ILP is powerful and flexible for genome-scale problems. NSDP is a viable solution for big problems. ILP will probably work well in the metagenomics domain. Validation through N50 and gene content. Future work: validate through the recent framework from Genome Biology 03/
Thank you!
Nonserial Dynamic Programming (NSDP): Scaffolding NSDP Algorithm. Compute the ILP solution in stages such that each stage uses results from the previous stage. Use the SPQR-tree to determine the variable elimination order, solving every child before its parent:

    stack = [root of SPQR-tree]
    visited = {empty}
    while stack is not empty do
        p = stack.top()
        if p has a child not in visited then
            foreach child q of p do
                if q not in visited then stack.push(q) endif
            endfor
        else
            stack.pop()
            visited.add(p)
            if p is root then
                sol_final = ILP(p)
            else
                (s, t) = getcut(p, parent(p))
                sol00 = ILP(p, s=0, t=0), sol01 = ILP(p, s=0, t=1)
                sol10 = ILP(p, s=1, t=0), sol11 = ILP(p, s=1, t=1)
            endif
        endif
    endwhile
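The staged computation can be sketched in Python as a post-order walk of the tree, so that every component is solved only after its children. This is a sketch under assumptions: the tree is given as a plain dict of children, and `solve` is a hypothetical stand-in for the ILP call that receives the node, the fixed values of the two cut variables (or None at the root), and the tables already computed for its children.

```python
def nsdp_order(tree, root):
    """Return the nodes so that every child precedes its parent."""
    order, stack, expanded = [], [root], set()
    while stack:
        p = stack[-1]
        if p not in expanded:
            # First visit: leave p on the stack and schedule its children.
            expanded.add(p)
            stack.extend(tree.get(p, []))
        else:
            # Second visit: all children are done, so p can be solved.
            stack.pop()
            order.append(p)
    return order


def run_nsdp(tree, root, solve):
    """Solve each non-root component for the four assignments of its
    cut variables (s, t), then solve the root using the child tables."""
    tables = {}
    for node in nsdp_order(tree, root):
        child_tables = {c: tables[c] for c in tree.get(node, [])}
        if node == root:
            return solve(node, None, child_tables)
        tables[node] = {
            (s, t): solve(node, (s, t), child_tables)
            for s in (0, 1)
            for t in (0, 1)
        }


# Toy tree and a dummy solver that only records the call order.
tree = {"R": ["X", "Y"], "X": ["Z"]}
calls = []


def toy_solve(node, cut, child_tables):
    calls.append((node, cut))
    return len(calls)


print(nsdp_order(tree, "R"))
print(run_nsdp(tree, "R", toy_solve))
```

Each non-root component is solved four times, once per assignment of its two cut variables, so the parent can look up the best child solution for whichever orientation it chooses for the shared contigs.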