Published by Emily Martinez. Modified over 10 years ago.
1
Scaffolding Large Genomes Using Integer Linear Programming
  James Lindsay*, Hamed Salooti, Alex Zelikovsky, Ion Mandoiu*
  *University of Connecticut; Georgia State University
2
De Novo Assembly Paradigm
  The Genome -> (Sequencing) -> The Reads -> (Assembly) -> The Contigs -> (Scaffolding) -> The Scaffolds
3
Why Scaffolding?
  Annotation
  Comparative biology
  Re-sequencing and gap filling
  Structural variation!
  [Figure labels: gene XYZ, 3’ UTR, 5’ UTR (Scaffold); gene XYZ (No scaffold)]
4
Why Scaffolding?
  Annotation
  Comparative biology
  Re-sequencing and gap filling
  Structural variation!
  [Figure labels: gene XYZ, 3’ UTR, 5’ UTR; Sanger Sequencing; gene XYZ, 3’ UTR, 5’ UTR]
  Biologist: There are holes in my genes!
5
Why Scaffolding?
  Annotation
  Comparative biology
  Re-sequencing and gap filling
  Structural variation!
6
Read Pairs
  Paired Read Construction
    2 kb, same strand and orientation (reads R1 and R2)
  Informative Reads
    Align each read against the contigs
    Only accept uniquely mapped reads
    Use the non-unique reads later
    Both reads in a pair must map to different contigs
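The filtering rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout `(pair_id, mate, contig, pos, strand)` and the function name `informative_pairs` are hypothetical and not taken from the presentation.

```python
from collections import defaultdict

def informative_pairs(alignments):
    """Keep read pairs whose two mates each map uniquely, to different contigs.

    alignments: iterable of (pair_id, mate, contig, pos, strand) tuples,
    one tuple per alignment hit; mate is 1 or 2. (Hypothetical layout.)
    """
    hits = defaultdict(lambda: {1: [], 2: []})
    for pair_id, mate, contig, pos, strand in alignments:
        hits[pair_id][mate].append((contig, pos, strand))

    kept = {}
    for pair_id, mates in hits.items():
        # Only accept uniquely mapped reads: exactly one hit per mate.
        if len(mates[1]) != 1 or len(mates[2]) != 1:
            continue
        r1, r2 = mates[1][0], mates[2][0]
        # Both reads in a pair must map to different contigs.
        if r1[0] == r2[0]:
            continue
        kept[pair_id] = (r1, r2)
    return kept
```

Pairs dropped here because a mate is non-unique are not lost for good; the slide notes that non-unique reads are used later, which this sketch does not model.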
7
Linkage Information
  Two contigs are adjacent if a read pair spans the contigs
  Possible States (A, B, C, D)
    The state depends on the orientation of the read
    Order of contigs is arbitrary
    Each read pair can be “consistent” with one of the four states
  [Figure labels: 5’, 3’; contig i, contig j; R1, R2; states A, B, C, D]
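As a rough sketch of how read pairs might be tallied per contig pair and state: the slide names the states A to D but does not say which strand combination each letter denotes, so the `STATE_BY_STRANDS` table below is purely an assumption, as are the function names.

```python
from collections import Counter

# Hypothetical labeling: the slide does not define which strand combination
# corresponds to which of the states A-D, so this table is an assumption.
STATE_BY_STRANDS = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}

def pair_state(r1, r2):
    """r1, r2: (contig, strand) for the two mates of one read pair.

    Returns ((contig_i, contig_j), state) with the contigs in sorted order.
    """
    # The order of the contigs is arbitrary, so pick a canonical one.
    (ci, si), (cj, sj) = sorted([r1, r2])
    return (ci, cj), STATE_BY_STRANDS[(si, sj)]

def tally_states(pairs):
    """Count how many read pairs are consistent with each state per contig pair."""
    counts = Counter()
    for r1, r2 in pairs:
        counts[pair_state(r1, r2)] += 1
    return counts
```

The resulting counts are one possible source of per-state support for a contig pair; how the presentation actually derives its states is not shown in the transcript.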
8
The Scaffolding Problem
  Given: contigs, paired reads
  Find: orientation, ordering, relative distance
  Goal: recreate true scaffolds
  Possible Objectives
    Un-weighted: max number of consistent read pairs
    Weighted: each state is weighted by
      Overlap with repeat
      Deviation from expected distance
      …
9
Graph Representation
  Using the input we can define a scaffolding graph:
  This is an undirected multi-graph
  Assume it is connected
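A minimal sketch of such an undirected multi-graph, assuming one edge per informative read pair between the two contigs it spans. The input layout `(contig_i, contig_j, state)` and both function names are hypothetical. The component search is included because the slide assumes a connected graph; in practice each connected component could be handled separately.

```python
from collections import defaultdict, deque

def build_multigraph(pair_links):
    """pair_links: iterable of (contig_i, contig_j, state), one per read pair.

    Returns adjacency lists that keep parallel edges (a multi-graph).
    """
    adj = defaultdict(list)
    for ci, cj, state in pair_links:
        adj[ci].append((cj, state))
        adj[cj].append((ci, state))
    return adj

def connected_components(adj):
    """Breadth-first search over the multi-graph; returns sorted node lists."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v, _state in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(comp))
    return components
```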
10
Integer Linear Program Formulation
  Variables
    Contig Pair State:
    Contig Orientation:
    Pairwise Contig Consistency:
  Objective
    Maximize weight of consistent pairs
11
Constraints
  Pairwise Orientation
  Mutual Exclusivity
  Forbid 2- and 3-Cycles Explicitly
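To make the objective and the first two constraints concrete, here is a toy that is deliberately not an ILP solver: it enumerates every orientation vector and scores it. It assumes, without support from the slides, that each state A to D is compatible with exactly one combination of the two contig orientations (the `STATE_FOR` table), and it ignores the ordering and cycle constraints entirely. All names are hypothetical.

```python
from itertools import product

# Assumed mapping (not given on the slides): each state is compatible with
# exactly one combination of the two contig orientations (0 or 1).
STATE_FOR = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}

def best_orientation(contigs, weights):
    """Exhaustively search orientations; feasible only for toy instances.

    weights: {(ci, cj): {"A": w, "B": w, "C": w, "D": w}}, missing states
    count as weight 0.
    Returns (best total weight, {contig: orientation}).
    """
    best_score, best_assign = float("-inf"), None
    for bits in product((0, 1), repeat=len(contigs)):
        orient = dict(zip(contigs, bits))
        score = 0
        for (ci, cj), w in weights.items():
            # Mutual exclusivity: one state per contig pair, namely the one
            # consistent with both chosen orientations.
            state = STATE_FOR[(orient[ci], orient[cj])]
            score += w.get(state, 0)
        if score > best_score:
            best_score, best_assign = score, orient
    return best_score, best_assign
```

The search space doubles with every contig, which is one way to see why a real instance needs an ILP solver and the graph decomposition described on the following slides.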
12
Graph Decomposition: Articulation Points
  [Figure labels: solve (×4); Articulation point]
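Articulation points can be found with the standard depth-first low-link method. The sketch below is a generic textbook version, not the presenters' code, and it assumes the graph is given as a plain adjacency dictionary. It is recursive, so a graph with hundreds of thousands of contigs would need an iterative variant or a raised recursion limit.

```python
def articulation_points(adj):
    """adj: {node: iterable of neighbor nodes} for an undirected graph.

    Returns the set of nodes whose removal disconnects their component.
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u separates v's subtree if nothing in that
                # subtree reaches above u.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is an articulation point iff it has 2+ DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points
```

Splitting the graph at each returned node yields subgraphs that can be solved independently, as long as the shared contig's orientation is kept consistent across them.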
13
Graph Decomposition: 2-cuts
  [Figure labels: 2-cut; sign pairs + +, + -, - +, - -]
14
Non-Serial Dynamic Programming
  SPQR-tree to schedule decomposition
  Traverse tree using DFS
  NSDP utilizes solutions of previous stage in current stage
15
Largest Connected Component
16
Largest Biconnected Component
17
Largest Triconnected Component
18
Post Processing
  ILP Solution
    May have cycles
    Not a total ordering for each connected component
  Bipartite matching
    [Figure labels: ILP solution on contigs A, B, C, D, E, F; outgoing side A to F; incoming side A to F]
    Objectives:
      Max weight
      Max cardinality
      Max cardinality / Max weight
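One way to picture the bipartite matching step: put every contig once on an "outgoing" side and once on an "incoming" side, and let each candidate adjacency u -> v be an edge from outgoing u to incoming v. A matching then gives each contig at most one successor and at most one predecessor. The sketch below covers only the max-cardinality objective, using the classic augmenting-path method (Kuhn's algorithm); the input format and function name are assumptions, and the weighted objectives would need a different algorithm.

```python
def max_cardinality_matching(succ):
    """succ: {contig: list of candidate successor contigs}.

    Returns {contig: chosen successor} such that no contig has more than
    one successor or more than one predecessor.
    """
    pred_of = {}  # incoming side: successor contig -> matched predecessor

    def try_assign(u, visited):
        for v in succ.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current predecessor can be
            # re-matched to some other successor.
            if v not in pred_of or try_assign(pred_of[v], visited):
                pred_of[v] = u
                return True
        return False

    for u in succ:
        try_assign(u, set())
    return {u: v for v, u in pred_of.items()}
```

For example, with `succ = {"A": ["C", "B"], "B": ["C"]}` contig A first takes C, then gives it up to B and falls back to its other candidate, so both contigs end up with a successor.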
19
Testing Framework
  Venter Genome
    Read Type | Total Reads | Total Bases | Avg length | Coverage
    Sanger | 31,861,976 | 2.79E+10 | 875 | 9.930637
    SOLiD pairs | 4.85E+08 | 2.42E+10 | 50 | 8.623028
  4x Assembly
    # Reads | # Bases in reads | # Contigs | # Bases in contigs | N50
    112,00,000 | 1.1E+10 | 422,837 | 2.26E+09 | 7704
20
Testing Metrics
  Computer Scientists
    Finding a scaffold = binary classification test
    n contigs, try to predict n-1 adjacencies
    TP, FP, TN, FN, sensitivity, PPV
  Biologists (main focus)
    N50 (roughly a typical scaffold size: the length at which scaffolds that long or longer hold half of all bases; gaps ignored)
    TP50: break scaffolds at incorrect edges, then find N50
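These metrics are simple to compute. The sketch below uses the usual definition of N50 and follows the slide's description of TP50; the data layout (scaffolds as lists of `(contig, length)` tuples) and the `is_correct` callback are assumptions made for illustration. Gaps are ignored, so a scaffold's size is the sum of its contig lengths.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold at least half
    of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def tp50(scaffolds, is_correct):
    """scaffolds: list of scaffolds, each a list of (contig, length).

    is_correct(a, b) returns True if placing contig b right after contig a
    is a true adjacency. Scaffolds are broken at incorrect edges, then N50
    is taken over the resulting pieces.
    """
    pieces = []
    for scaffold in scaffolds:
        current, prev = 0, None
        for contig, length in scaffold:
            if prev is not None and not is_correct(prev, contig):
                pieces.append(current)  # break at the incorrect edge
                current = 0
            current += length
            prev = contig
        if scaffold:
            pieces.append(current)
    return n50(pieces)

def sensitivity_ppv(tp, fp, fn):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    return tp / (tp + fn), tp / (tp + fp)
```

TP50 can never exceed N50 on the same scaffolds, since breaking a scaffold only makes pieces shorter; the gap between the two is a quick indicator of how many long scaffolds rely on wrong joins.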
21
Results
  test case | method | bundle size | sensitivity | ppv | N50 | TP50
  10% | opera | 2 | 81.13% | 99.26% | 27,567 | 27,327
  10% | mip | 2 | 59.01% | 98.94% | 19,988 | 19,755
  10% | ilp | 1 | 79.86% | 98.58% | 26,814 | 26,459
  25% | opera | 2 | 80.44% | 98.27% | 27,296 | 26,849
  25% | mip | 2 | 58.95% | 97.56% | 19,842 | 19,518
  25% | ilp | 1 | 79.30% | 96.93% | 26,684 | 26,079
  100% | opera | 3 | pending | … | … | …
  100% | mip | 3 | failed | n/a
  100% | ilp | 1 | 68.25% | 89.90% | 20,538 | 19,006
22
Conclusions
  Success
    ILP solves scaffolding problem!
    NSDP works.
  Improvements
    Finalize large test cases (then publish?!)
    Practical considerations (read style, multi-libraries, merge contigs)
  Future Work
    Where else can I apply NSDP?
    Scaffold before assembly??
    Structural variation??
23
Questions?