ILP-BASED MAXIMUM LIKELIHOOD GENOME SCAFFOLDING. James Lindsay, Ion Mandoiu (University of Connecticut); Hamed Salooti, Alex Zelikovsky (Georgia State University).


1 ILP-BASED MAXIMUM LIKELIHOOD GENOME SCAFFOLDING. James Lindsay, Ion Mandoiu (University of Connecticut); Hamed Salooti, Alex Zelikovsky (Georgia State University)

2 De-novo Assembly. [Figure: Genome → (Sequencing) → Reads → (Assembly) → Contigs → (Scaffolding) → Scaffolds]

3 Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation! [Figure: with a scaffold, 5’ UTR, gene XYZ and 3’ UTR; with no scaffold, gene XYZ only]

4 Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation! [Figure: 5’ UTR, gene XYZ, 3’ UTR; Sanger sequencing] Biologist: There are holes in my genes!

5 Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation!

6 Read Pairs. Paired read construction: 2 kb insert, R1 and R2 on the same strand and orientation. Informative reads: align each read against the contigs; uniquely mapped reads; save repetitive; both reads in a pair must map to different contigs.
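The filter for informative reads described on this slide can be sketched in Python. The `Alignment` record and its `num_hits` field are hypothetical names introduced for illustration; they are not bowtie2 output fields, and how repetitive reads are saved for later use is not shown.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    contig: str    # contig the read maps to
    pos: int       # leftmost mapped position on that contig
    strand: str    # '+' or '-'
    num_hits: int  # number of alignment locations (1 means unique); hypothetical field

def informative_pairs(pairs):
    """Keep pairs whose two reads both map uniquely and to different contigs."""
    kept = []
    for r1, r2 in pairs:
        if r1 is None or r2 is None:
            continue  # at least one read did not align
        if r1.num_hits != 1 or r2.num_hits != 1:
            continue  # ambiguous, possibly repetitive, mapping
        if r1.contig == r2.contig:
            continue  # both reads inside one contig: no linkage information
        kept.append((r1, r2))
    return kept
```

Only pairs that pass all three checks contribute linkage information between contigs.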

7 Linkage Information. Two contigs are adjacent if a read pair spans the contigs. Possible states (A, B, C, D): the state depends on the orientation of the reads; the order of contigs is arbitrary; each read pair can be “consistent” with one of the four states. [Figure: contig i and contig j, 5’ to 3’, with R1 and R2 shown in states A, B, C, D]

8 Scaffolding Problem. Input: contigs; linkage information from paired reads. Output: contig orientation; contig ordering; relative distance between contigs. Objective: find the longest and most accurate scaffolds.

9 Existing Tools: Bambus2, GRASS, OPERA, MIP, SCARPA, SOPRA, SGA, SOAPdenovo, SSPACE

10 SILP Flow: mapping of reads onto contigs with bowtie2 → scaffolding graph construction → graph decomposition into 3-connected components via SPQR trees → maximum likelihood contig orientation via ILP+NSDP → decomposition into paths of orientation-compatible edges via bipartite matching → gap estimation via quadratic programming

11 Scaffolding Graph. Scaffolding graph: G = (V, E, w). V = set of all contigs. E = set of pairs of contigs connected with mapped read pairs of a particular state (A, B, C, or D). Edge weight = probability of read pairs being correctly aligned, based on: amount of repeat overlap; contig coverage dissimilarity. [Figure: contigs and read pairs]
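A minimal Python sketch of collecting read-pair links into weighted edges of G = (V, E, w). Summing log-odds of the per-pair probabilities into one weight per (contig pair, state) is an assumption chosen to match a likelihood objective; the function name and the input tuple layout are hypothetical.

```python
import math
from collections import defaultdict

def build_scaffolding_graph(links):
    """links: iterable of (contig_i, contig_j, state, p), where state is one of
    'A', 'B', 'C', 'D' and p is the probability (0 < p < 1) that the read pair
    is correctly aligned. Returns {(contig_i, contig_j, state): weight}."""
    graph = defaultdict(float)
    for i, j, state, p in links:
        if state not in ("A", "B", "C", "D"):
            raise ValueError(f"unknown state: {state!r}")
        if not 0.0 < p < 1.0:
            raise ValueError("p must lie strictly between 0 and 1")
        # Assumption: an edge weight is the sum of log-odds of its read pairs.
        graph[(i, j, state)] += math.log(p / (1.0 - p))
    return dict(graph)
```

The caller is expected to pass each contig pair in a consistent order, because swapping the two contigs may change which state a read pair supports.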

12 Structure of Scaffold Graph. Paired reads have an upper bound (2 kb); contigs have a minimum size; so there is an upper bound on the number of contigs spanned; the scaffold graph has bounded width (Opera).

13 Elimination Order: SPQR-tree. [Figure: bi-connected components, tri-connected components, elimination order]

14 Maximum Likelihood Contig Orientation. Given read pair r, let p_r be the probability of r being correctly aligned. For a given orientation O, let R_O be the subset of all reads R agreeing with O. The probability of O being correct is estimated as [formula not preserved in transcript]. Log likelihood: [formula not preserved in transcript]. Equivalent to maximizing: [formula not preserved in transcript].
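The formulas on this slide did not survive in the transcript. Assuming read pairs are independent, a reconstruction consistent with the definitions of p_r, R and R_O above would be the following; it is a sketch, not the slide's exact notation.

```latex
\begin{align*}
P(O) &= \prod_{r \in R_O} p_r \prod_{r \in R \setminus R_O} (1 - p_r) \\
\log P(O) &= \sum_{r \in R_O} \log p_r + \sum_{r \in R \setminus R_O} \log (1 - p_r) \\
          &= \sum_{r \in R} \log (1 - p_r) + \sum_{r \in R_O} \log \frac{p_r}{1 - p_r}
\end{align*}
```

The first sum in the last line does not depend on O, so under this assumption maximizing the log likelihood amounts to maximizing the sum of log(p_r / (1 - p_r)) over the read pairs in R_O.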

15 Integer Linear Program Formulation. Variables: binary S_i = 0 if the i-th contig keeps its default orientation, = 1 if it is flipped; binary S_ij = 0 if the contigs in edge (i,j) are both flipped or both not, = 1 otherwise; binary A_ij = 1 if the edge (i,j) is in state A, = 0 otherwise (similarly B_ij, C_ij, D_ij). Weight of edge: [formula not preserved in transcript]. Objective: [formula not preserved in transcript].

16 ILP Constraints. Connecting S_i and S_ij; connecting S_ij and A_ij; forbidding 2-cycles; forbidding 3-cycles. [Figure: 2-cycles on contigs i, j and 3-cycles on contigs i, j, k]
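The constraint formulas themselves are not in the transcript. As an illustration only, a standard way to connect S_ij to S_i and S_j is to linearize S_ij = S_i XOR S_j with four inequalities; whether the slide used exactly these is an assumption. The Python sketch below checks by enumeration that the four inequalities admit exactly the XOR-consistent assignments.

```python
from itertools import product

def xor_constraints_hold(s_i, s_j, s_ij):
    """Four linear inequalities that force s_ij == s_i XOR s_j on binary values."""
    return (
        s_ij <= s_i + s_j          # neither flipped  -> s_ij must be 0
        and s_ij <= 2 - s_i - s_j  # both flipped     -> s_ij must be 0
        and s_ij >= s_i - s_j      # only i flipped   -> s_ij must be 1
        and s_ij >= s_j - s_i      # only j flipped   -> s_ij must be 1
    )

def feasible_assignments():
    """All binary triples (s_i, s_j, s_ij) that satisfy the inequalities."""
    return [t for t in product((0, 1), repeat=3) if xor_constraints_hold(*t)]
```

Each of the eight binary triples is tested, so the check is exhaustive for a single edge.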

17 Non-Serial Dynamic Programming: 1-cuts. [Figure: splitting at a 1-cut into 2-component A (ILP_A) and 2-component B (ILP_B); collapsing the 1-cut]

18 Non-Serial Dynamic Programming: 2-cuts. [Figure: splitting at a 2-cut into 3-component A (ILP_A) and 3-component B (ILP_B); collapsing the 2-cut]

19 Total Ordering via Bipartite Matching. ILP output → bipartite graph → bipartite matching → total order. [Figure: contigs A to F at each of the four stages]
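One way to picture this step: treat each oriented edge u → v as a bipartite edge between "u as predecessor" and "v as successor", so that a matching gives every contig at most one successor and one predecessor, and the matched edges chain into paths. The Python sketch below uses unweighted maximum-cardinality matching (Kuhn's augmenting-path algorithm) as a simplification; the slides do not say which matching variant was used, and the function names are hypothetical.

```python
def max_bipartite_matching(succ):
    """succ: {u: [v, ...]} of directed edges u -> v.
    Returns {u: v} with at most one successor per u and one predecessor per v."""
    pred_of = {}  # v -> u currently matched to it

    def try_assign(u, seen):
        for v in succ.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current predecessor can move elsewhere.
            if v not in pred_of or try_assign(pred_of[v], seen):
                pred_of[v] = u
                return True
        return False

    for u in succ:
        try_assign(u, set())
    return {u: v for v, u in pred_of.items()}

def paths_from_matching(nodes, next_of):
    """Chain matched edges into paths. Assumes the matched edges form no cycle."""
    has_pred = set(next_of.values())
    paths = []
    for start in nodes:
        if start in has_pred:
            continue  # not the first contig of a path
        path = [start]
        while path[-1] in next_of:
            path.append(next_of[path[-1]])
        paths.append(path)
    return paths
```

For example, with edges B → C, B → D and A → C, the matching pairs A with C and B with D, giving the two paths A, C and B, D.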

20 Gap Estimation via QP. Maximum likelihood gap estimation following Opera and previous scaffolders. [Figure: CONTIG i, CONTIG i+1, CONTIG i+2, CONTIG i+3]
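As a much reduced illustration of the idea, consider a single gap between two adjacent contigs. If insert sizes are assumed to be Gaussian around a known mean, the least-squares (and maximum likelihood) gap is the mean insert size minus the average length already covered by contig sequence. The slide's quadratic program estimates all gaps of a scaffold jointly; this single-gap Python sketch and its parameter names are assumptions.

```python
def estimate_gap(mean_insert, covered_lengths):
    """Least-squares estimate of one gap between two adjacent contigs.

    mean_insert: expected insert size of the library (e.g. 2000 for 2 kb).
    covered_lengths: for each read pair spanning the gap, the part of the
        insert that lies on contig sequence (hypothetical input).
    A negative result suggests that the two contigs overlap.
    """
    if not covered_lengths:
        raise ValueError("need at least one spanning read pair")
    implied = [mean_insert - c for c in covered_lengths]  # gap implied by each pair
    return sum(implied) / len(implied)
```

With a 2 kb library and spanning pairs that cover 1500, 1700 and 1600 bases of contig sequence, the estimate is 400 bases.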

21 Results. Verification on simulation: simulate contigs from real draft assemblies (GAGE) and real reads; control error rate; actual orientation, order and position are known. Evaluation on real assembly: draft human genome and real reads; alignment-based evaluation is challenging.

22 GAGE Simulation. Steps: Staph, Rhodo and Chr14 assemblies; mix all contig sizes, sample uniformly at random to build simulated contigs (or gaps); align reads against simulated contigs; control error rates; repeat 10x. Metrics: binary classification ((n-1) real edges); corrected N50 (break edges at bad points); runtime; scaffold size distribution.
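Two of these metrics have standard definitions that are easy to state in code. The Python sketch below computes plain N50 and the Matthews correlation coefficient (MCC) used for binary classification of edges; a corrected N50 would apply the same `n50` function after scaffolds have been broken at incorrect joins, which is not shown here.

```python
import math

def n50(lengths):
    """Largest length L such that scaffolds of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (every prediction wrong) to 1 (every prediction right).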

23 Simulation: MCC [figure]

24 Simulation: N50: Staph [figure]

25 Simulation: N50: Rhodo [figure]

26 Simulation: N50: Chr14 [figure]

27 NA12878 2x: Runtime [figure]

28 Metagenomics: GAGE Simulation [figure]

29 Conclusions. Powerful and flexible ILP for genome-scale problems. NSDP is a viable solution for big problems. ILP will probably work well in the metagenomics domain. Validation through N50, gene content, ... Future work: validate through the recent framework from Genome Biology 03/2014: http://genomebiology.com/2014/15/3/R42 http://genomebiology.com/2014/15/3/R42/table/T1

30 Thank you!

31 Nonserial Dynamic Programming (NSDP): Scaffolding NSDP Algorithm. Compute the ILP solution in stages such that each stage uses results from the previous stage. Use the SPQR-tree to determine the variable elimination order.

    stack = [root of SPQR-tree]
    visited = {empty}
    while stack is not empty do
        p = stack.pop()
        foreach child q of p do
            if q not in visited then
                stack.push(q)
            endif
        endfor
        if p is root then
            solfinal = ILP(p)
        else
            (s,t) = getcut(p, parent(p))
            sol00 = ILP(p, s=0, t=0), sol01 = ILP(p, s=0, t=1)
            sol10 = ILP(p, s=1, t=0), sol11 = ILP(p, s=1, t=1)
        endif
    endwhile
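For a stage to reuse earlier results, a component has to be solved after its children in the SPQR-tree. A minimal Python sketch of such a post-order walk follows. The `solve` callback stands in for the ILP call and is hypothetical, as is the representation of the tree as a dictionary of child lists; no real ILP solver is invoked.

```python
from itertools import product

def nsdp(root, children, solve):
    """Post-order walk over a tree of components.

    children: {node: [child, ...]}.
    solve(node, fixed, child_tables): hypothetical stand-in for the ILP call.
        fixed is None for the root, otherwise an (s, t) tuple giving the
        orientations of the two cut contigs shared with the parent.
    Returns (final_solution, tables), where tables[node][(s, t)] holds the
    result for each of the four settings of the cut.
    """
    tables = {}
    final = None
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not expanded:
            stack.append((node, True))  # come back once the children are done
            stack.extend((k, False) for k in kids)
            continue
        child_tables = {k: tables[k] for k in kids}
        if node == root:
            final = solve(node, None, child_tables)
        else:
            tables[node] = {
                st: solve(node, st, child_tables)
                for st in product((0, 1), repeat=2)
            }
    return final, tables
```

Each non-root component is solved four times, once per setting of (s, t), and the parent then reads those four results from `child_tables` instead of re-solving the child.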

