ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads


1 ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe
Presented by: Mohit Jain

2 Motivation
(P) DNA Sequencing:
Chr length: ~ ~250,000,000 bps
Longest sequence-able fragment: ~600 bps
(S) Shotgun Method: determine the sequence by breaking the genome into many small segments (reads)
(P) Sequence/Genome Assembly: combining these reads to reconstruct the genome
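As a rough illustration of the shotgun idea, the sketch below samples fixed-length reads from random positions of a toy sequence. The function name `shotgun_reads`, the toy genome, and the uniform sampling model are assumptions made for this example; real reads also carry sequencing errors and come from both strands.

```python
import random

def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample n_reads substrings of length read_len at uniformly random start positions."""
    rng = random.Random(seed)            # fixed seed so the toy run is repeatable
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]

genome = "ACGTTGCAAGGCTTACGATC"          # made-up 20-bp "genome"
reads = shotgun_reads(genome, read_len=6, n_reads=10)

# Average depth of coverage: total read bases divided by genome length.
coverage = len(reads) * 6 / len(genome)
```

Assembly is the inverse problem: given only `reads`, recover `genome`.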

3 Motivation
Shortest Common Superstring (SCS) problem: find the shortest sequence that contains every read as a substring
(P) Repeats: genomes contain repeats, but the SCS represents each repeat only once
(S) Overlap Graph
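The repeat problem can be seen on a toy case. The sketch below is an exhaustive SCS search (it tries every read order, so it is only usable for a handful of reads); the function names and the 10-bp example sequence are assumptions made for illustration, not part of the presentation.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def shortest_common_superstring(reads):
    """Exhaustive SCS: try every read order, merging neighbours at maximal overlap."""
    best = None
    for order in permutations(reads):
        s = order[0]
        for r in order[1:]:
            s += r[overlap(s, r):]       # append only the non-overlapping tail
        if best is None or len(s) < len(best):
            best = s
    return best

genome = "ACGACGACGT"                    # toy genome with a tandem ACG repeat
reads = sorted({genome[i:i + 4] for i in range(len(genome) - 3)})
scs = shortest_common_superstring(reads)
# scs is "ACGACGT": 7 bases instead of 10, because one repeat copy has been collapsed.
```

Every read is still a substring of the result, yet the result is not the genome, which is why SCS alone is not a satisfactory formulation of assembly.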

4 (S) Overlap Graph
Each read forms a node
An edge exists between two nodes if the reads overlap
Algorithm:
Step 1: Remove redundant edges; classify the remaining edges as required or optional
Step 2: Find the shortest walk that includes all required edges
(Figure legend) Red/thin edges: false overlaps
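A minimal sketch of the overlap graph and of the "remove redundant edges" step, assuming exact suffix-prefix matches and a naive transitive reduction (an edge a->c is dropped when a->b and b->c both exist). The function names, the `min_len` threshold, and the three toy reads are assumptions for illustration; the required/optional classification is not modelled.

```python
def overlap(a, b):
    """Longest proper suffix of a that is also a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len):
    """Nodes are reads; edge a->b carries the overlap length if it is >= min_len."""
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b)
                if n >= min_len:
                    graph[a][b] = n
    return graph

def drop_transitive_edges(graph):
    """Naive reduction: remove a->c whenever some b gives a->b and b->c."""
    reduced = {a: dict(nbrs) for a, nbrs in graph.items()}
    for a, nbrs in graph.items():
        for b in nbrs:
            for c in graph[b]:
                reduced[a].pop(c, None)
    return reduced

reads = ["ACGTAC", "GTACGG", "ACGGTT"]   # toy reads from ACGTACGGTT
full = overlap_graph(reads, min_len=2)
reduced = drop_transitive_edges(full)
```

Here `full` contains the 2-base edge ACGTAC->ACGGTT, which `reduced` no longer has because the same layout is implied by the two 4-base overlaps.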

5 (P) Overlap Graph
Microreads = only bases long (for HTS, high-throughput sequencing)
Shorter reads => shorter overlaps => more reads => more overlaps
- Very large number of (mostly false) overlaps
- Large number of reads + short overlaps + higher error rate

6 (S) De Bruijn Graph
To construct the de Bruijn graph:
All reads are broken into overlapping subsequences of length k (k-mers)
Each subsequence of length k-1 ((k-1)-mer) represents a node
A directed edge e exists from node a to node b iff there exists a k-mer whose prefix of length k-1 is a and whose suffix of length k-1 is b
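The construction above can be sketched in a few lines. The function name `de_bruijn`, the dictionary representation, and the two toy reads are assumptions for illustration; error handling, reverse complements, and k-mer multiplicities are left out.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the sorted list of its successor (k-1)-mers."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # One edge per distinct k-mer: (k-1)-prefix -> (k-1)-suffix.
            edges[kmer[:-1]].add(kmer[1:])
    return {node: sorted(succ) for node, succ in edges.items()}

reads = ["ACGTAC", "GTACGG"]             # toy reads from ACGTACGG
graph = de_bruijn(reads, k=4)
```

The node `ACG` ends up with two successors (`CGG` and `CGT`) because ACG occurs twice in the underlying sequence, which is how repeats show up as branches. Note that no read-against-read overlap computation was needed.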

7 (S) De Bruijn Graph
Condensed by collapsing non-ambiguous (unbranched) paths
Genome: an Eulerian path (superwalk: a walk that includes all edges) in this graph
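Collapsing non-ambiguous paths can be sketched as follows: start at every node that is not "one edge in, one edge out" and follow each outgoing edge until the next such node. The function name `condense` and the hard-coded toy graph are assumptions for illustration; isolated cycles without any branching node are not reported by this simple version.

```python
from collections import defaultdict

def condense(edges):
    """edges: dict mapping a (k-1)-mer to its successor (k-1)-mers.
    Returns the sorted sequences spelled by maximal unbranched paths."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, succs in edges.items():
        for v in succs:
            outdeg[u] += 1
            indeg[v] += 1

    def unbranched(n):
        return indeg[n] == 1 and outdeg[n] == 1

    paths = []
    for u, succs in edges.items():
        if unbranched(u):
            continue                      # interior node: reached from elsewhere
        for v in succs:
            seq = u + v[-1]
            while unbranched(v):          # extend through interior nodes
                v = next(iter(edges[v]))
                seq += v[-1]
            paths.append(seq)
    return sorted(paths)

# Toy graph for ACGTACGG with k = 4 (nodes are 3-mers).
edges = {"ACG": ["CGG", "CGT"], "CGT": ["GTA"], "GTA": ["TAC"], "TAC": ["ACG"]}
paths = condense(edges)
```

The five original edges collapse into two condensed edges, `ACGTACG` (a loop through the repeated `ACG`) and `ACGG`; walking the loop once and then the exit spells the toy sequence.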

8 Paired Reads (Mate pairs)
Sequence the two ends of a fragment of known size
Results in better assemblies, but makes assembly more complicated
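A small sketch of what a read pair carries, assuming the common convention that the second read is reported on the opposite strand (reverse-complemented). The function names, coordinates, and toy sequence are hypothetical; the point is that the fragment length constrains how far apart the two reads sit in the genome.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_reads(genome, start, fragment_len, read_len):
    """Return the two end reads of the fragment genome[start:start+fragment_len].
    Assumption: the right-hand read is given on the opposite strand."""
    fragment = genome[start:start + fragment_len]
    return fragment[:read_len], revcomp(fragment[-read_len:])

genome = "ACGTTGCAAGGCTTACGATC"          # made-up toy sequence
left, right = paired_reads(genome, start=2, fragment_len=12, read_len=4)

# Unsequenced bases between the two reads, known from the fragment size.
gap = 12 - 2 * 4
```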

9 Current Approaches
Velvet, EULER-USR, ALLPATHS
(Velvet and EULER-USR are based on the de Bruijn graph method)

10 ALLPATHS
Step I. Builds the unipath graph
Step II. Localizes read sequences before assembly
Unipath: maximal unbranched sequence

11 Read Localization
Select a unipath with ‘normal’ coverage
Avoid large repeat regions
All other unipaths and paired tags connected to this unipath are considered to be in its neighborhood
Assemble each neighborhood separately

12 Short Fragment Pair Merger
Fills the gap between two paired reads
Builds a local unipath graph
Extends both ends (of all reads) based on the local unipath graph
For each pair, searches for other pairs that overlap on both ends, and merges them to obtain longer reads

13 Short Fragment Pair Merger
Repeat the process for all pairs
Once the sequence is complete, update the local unipath graph
Iteratively merge local unipath graphs to obtain a global unipath graph, representing the genome

14 ALLPATHS Paired-Read Assembly Algorithm
Step 1: Creating Approximate Unipaths
1a: Error correction
1b: k-mer numbering and searchable data structure (ignoring any overlaps between reads)
1c: Computing unipaths from the data structure by walking along the reads until a branch is encountered
Read pairs -> Unipaths -> Localize

15 Step 2: Selecting Seeds
Seeds = unipaths around which assemblies of genomic regions are built
Ideal seed: a long unipath with low copy number (= 1)
Copy number = inferred from the read coverage of the unipath
2a: For each unipath, find the closest unipaths in the set to its left and to its right
2b: If the distance between the left and right neighbours is less than 4 kb, remove the middle unipath
2c: After all such unipaths are removed, the remaining unipaths form the seeds
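Steps 2a to 2c can be sketched as a thinning pass over candidate unipaths. This is only an illustration under strong assumptions: each unipath is given as a hypothetical `(name, start, end, copy_number)` record with known coordinates (in practice positions are unknown and distances come from read-pair links), distance is measured from the end of the left neighbour to the start of the right one, and candidates are scanned left to right.

```python
def select_seeds(unipaths, min_gap=4000):
    """unipaths: list of (name, start, end, copy_number) records (hypothetical layout).
    Returns the names of the unipaths kept as seeds."""
    # Ideal seeds have copy number 1, so only those are candidates.
    candidates = sorted((u for u in unipaths if u[3] == 1), key=lambda u: u[1])
    i = 1
    while i < len(candidates) - 1:
        left, right = candidates[i - 1], candidates[i + 1]
        if right[1] - left[2] < min_gap:
            del candidates[i]             # neighbours are within 4 kb: drop the middle one
        else:
            i += 1
    return [u[0] for u in candidates]

unipaths = [
    ("u1", 0, 1000, 1),
    ("u2", 1500, 2500, 1),
    ("u3", 3000, 4000, 1),
    ("u4", 9000, 10000, 1),
    ("u5", 12000, 12500, 3),              # high copy number: never a seed
    ("u6", 20000, 21000, 1),
]
seeds = select_seeds(unipaths)
```

In this toy input `u2` is dropped because `u1` and `u3` are only 2 kb apart, so the neighbourhood of either one already covers it.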

16 Step 3: Assembling neighbourhoods around the seeds
Neighbourhood = seed + 10 kb on each side
3a: Define a collection of low-copy-number unipaths, using iterative linking
3b: Construct two sets of read clouds:
primary (B): generally contains only reads whose true genomic locations are near the seed (but not all such reads)
secondary (C): generally contains all the short-fragment read pairs (~0.5 kb) near the seed, and some outsiders as well
(Figure labels: paired-read links, partners, unipaths)
The problem of too many closures persists, hence the Short-Fragment Pair Merger is used (progressively merge the secondary read cloud pairs)

17 Step 4: Finding All Paths
Compute the closures (including false closures) of all the merged short-fragment pairs
Step 5: Gluing Together the Local Assembly
A sequence graph is formed by iteratively joining closures
Step 6: Building the Global Assembly
The outputs of the local assemblies are glued together to yield a single sequence graph
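A closure of a read pair can be thought of as a path through the unipath graph that joins the two reads and whose length agrees with the fragment size. The sketch below enumerates all such paths by depth-first search. The function name `closures`, the graph and length dictionaries, and the length model (plain sum of unipath lengths, ignoring the overlap between adjacent unipaths) are assumptions for illustration, not the ALLPATHS implementation.

```python
def closures(graph, lengths, start, end, min_len, max_len):
    """Enumerate every path from start to end whose summed node length
    lies in [min_len, max_len]. Node lengths must be positive."""
    found = []

    def walk(path, total):
        if total > max_len:
            return                        # too long for the fragment: prune
        node = path[-1]
        if node == end and total >= min_len:
            found.append(list(path))
        for nxt in graph.get(node, ()):
            path.append(nxt)
            walk(path, total + lengths[nxt])
            path.pop()

    walk([start], lengths[start])
    return found

# Toy unipath graph with two alternative routes from A to D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
lengths = {"A": 100, "B": 50, "C": 300, "D": 100}

tight = closures(graph, lengths, "A", "D", min_len=200, max_len=300)
loose = closures(graph, lengths, "A", "D", min_len=200, max_len=600)
```

With a tight size range only A-B-D survives; with a loose range both routes are returned, which illustrates why well-constrained fragment sizes and localized reads keep the number of closures manageable.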

18 Step 7: Editing the Assembly
To remove detritus, eliminate ambiguity, and pull apart regions where repeats are assembled on top of each other

19 Experiments
Simulated Data: 10 reference genomes from bacteria and fungi, and one 10-Mb segment of the human genome, with introduced errors
Real Data: Solexa

20 Results
Simulated Data:
Highly complete and contiguous assemblies (proportion of genome covered > 96%)
Ambiguous assembly regions: < 20 per megabase
Assemblies of C. jejuni and E. coli have no errors
Very high accuracy: less than one error per 10^6 bases
Real Data:
High coverage (99.1%)
High continuity
High accuracy (the final assembly matches the reference sequence exactly, with only 12 exceptions)

21 + / -
+ Read localization
+ Multi-CPU compatible
+ Extremely good (accurate) results
- Slow
- Very memory intensive
- Impractical assumptions on input data (500 bp +/- 5 bp insert size)

22 Thank you

