1
Sequence Assembly
2
Assembling the data
Problem: the longest single sequence possible is 1,000 bp, and most technology is bp.
Microbial genomes are 2,000,000 bp.
Therefore, how do you sequence a whole genome?
3
Sequencing the genomes
Extract DNA
Shear DNA into small pieces
Ligate adapters on each end
Sequence using “next generation sequencing”
4
Sequence assembly
Before we look at the data
Can we make longer pieces?
5
The assembly
A hierarchical data structure that maps sequence data to a reconstruction of the target.
The assembly groups reads into contigs, and contigs into scaffolds.
Contigs provide a multiple sequence alignment of reads and a consensus sequence.
Scaffolds provide contig order and orientation, and the sizes of the gaps between contigs.
6
Sequence assembly
Reads -> Contigs -> Scaffolds
7
Four approaches to assembly
Naïve approach
Greedy approach
Overlap / Layout / Consensus
de Bruijn graphs
8
Naïve approach
Compare every sequence to every other sequence
Find stretches that are the same
Need to account for phred scores: what if a base is wrong?
How long does a sequence need to be to be unique?
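To make the phred point concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance the base is wrong. The sketch below is only an illustration; it assumes FASTQ quality strings in Phred+33 encoding, and the function names and the `min_q = 20` cutoff are hypothetical choices, not something the slides prescribe.

```python
def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (Phred+33 encoding is an assumption)."""
    return [ord(ch) - offset for ch in qual]


def mask_low_quality(seq: str, qual: str, min_q: int = 20) -> str:
    """Replace bases scoring below min_q with 'N' (min_q is an illustrative cutoff)."""
    scores = decode_quality(qual)
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, scores))


if __name__ == "__main__":
    print(phred_to_error_prob(30))         # error probability at Q30
    print(mask_low_quality("ACG", "I5!"))  # last base has Q0 and is masked
```

Masking is one simple option; a comparison could instead tolerate mismatches at low-scoring positions.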
9
Sequence composition
4 bases
1 in 4^n chance of finding a given sequence of length n, if all bases are used evenly (they are not)
3 bp: 4^3 = 64
8 bp: 4^8 = 65,536
20 bp: 4^20 = 1,099,511,627,776
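The arithmetic above can be checked with a few lines of code. This sketch also estimates how often a given sequence would turn up by chance in a 2,000,000 bp genome (the genome size used earlier in the deck), under the same idealized assumption of evenly used, independent bases on a single strand. The function names are made up for the example.

```python
def possible_sequences(n: int) -> int:
    """Number of distinct DNA sequences of length n (4 bases)."""
    return 4 ** n


def expected_chance_hits(n: int, genome_length: int) -> float:
    """Expected chance occurrences of one given n-mer, assuming uniform random bases."""
    positions = genome_length - n + 1
    return positions / possible_sequences(n)


if __name__ == "__main__":
    for n in (3, 8, 20):
        print(n, possible_sequences(n), expected_chance_hits(n, 2_000_000))
```

Under these assumptions an 8 bp sequence is expected about 30 times in a 2 Mbp genome, while a 20 bp sequence is expected far less than once.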
10
Problems with this approach
Sequences are not random
Most genomes contain biased information
Repeat sequences in the genome
11
Greedy approaches
Start with a sequence
Keep extending it while another sequence matches the end
When it cannot be extended further, mark it as a contig
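The three steps above can be sketched as follows. This is a minimal illustration, not how SSAKE or any other real greedy assembler is implemented: it only extends to the right, uses exact matches on the forward strand, and ignores quality scores. The function names and the `min_len` threshold are assumptions for the example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_contigs(reads: list[str], min_len: int = 3) -> list[str]:
    """Seed with a read, extend its right end while any unused read overlaps it."""
    unused = list(reads)
    contigs = []
    while unused:
        contig = unused.pop(0)  # start with a sequence
        while True:
            # Greedy choice: the unused read with the longest overlap.
            best = max(unused, key=lambda r: overlap(contig, r, min_len), default=None)
            n = overlap(contig, best, min_len) if best is not None else 0
            if n == 0:
                break  # cannot be extended further
            contig += best[n:]
            unused.remove(best)
        contigs.append(contig)  # mark as a contig
    return contigs


if __name__ == "__main__":
    print(greedy_contigs(["AACCGGT", "CCGGTTA"]))
    print(greedy_contigs(["CCGGTTA", "AACCGGT"]))
```

Because this sketch never extends leftwards, the second call leaves the two reads as separate contigs, a small-scale example of how a simplistic greedy strategy can fragment an assembly.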
12
Improve greedy approaches
Only use high quality sequence
Use reads that are represented more than n times in the sample (SSAKE)
End to end overlap vs. partial overlap
Ignores low coverage regions
… also incorporate quality scores (SHARCGS)
In general, greedy approaches are fast but not very good: they make lots of short contigs
13
Overlap / Layout / Consensus
All versus all comparison (done with K-mers for speed).
Generate approximate read layout as an overlap graph.
Use multiple sequence alignments to resolve layout.
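The "done with K-mers for speed" step can be illustrated with a small index: rather than aligning every read against every other read, reads are bucketed by the K-mers they contain, and only reads that share a K-mer become candidate overlaps. This is a sketch of the overlap stage only (no layout or consensus), on the forward strand, with hypothetical function names.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq: str, k: int):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def candidate_pairs(reads: list[str], k: int) -> set[tuple[int, int]]:
    """Pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> indices of the reads containing it
    for i, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    print(candidate_pairs(["AACCGGT", "CCGGTTA", "TTTTTTT"], 4))
```

Only the candidate pairs would then go on to a real overlap alignment, which is what keeps the all-versus-all comparison tractable.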
14
Newbler (O/L/C)
Makes unitigs: single contigs with no discrepancies
Merges unitigs into contigs
May split unitigs and even reads (could be chimeras)
Uses coverage to compensate for base calls
Works in flow space to calculate homopolymeric tracts
More accurate than average of averages
15
Assembly is a “graph” problem
Overlap/Layout/Consensus
de Bruijn graph
Greedy graphs
A graph is nodes + edges
16
Assemble these two sequences!
AACCGGT
  CCGGTTA
Consensus: AACCGGTTA
17
AACCGGT as graphs
aacc -> accg -> ccgg -> cggt
Nodes = K-mers; edges = nodes that overlap by K-1 bases.
Here K = 4, but in reality K = 19 to 31
18
CCGGTTA as graphs
ccgg -> cggt -> ggtt -> gtta
19
Join the two graphs
aacc -> accg -> ccgg -> cggt
ccgg -> cggt -> ggtt -> gtta
21
Join the two graphs
aacc -> accg -> ccgg -> cggt -> ggtt -> gtta
AACCGGTTA
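The worked example on these slides can be reproduced in a few lines. The sketch below builds a de Bruijn graph as the slides define it (nodes are K-mers, edges join successive K-mers that overlap by K-1 bases), merges both reads into one graph, and spells out the sequence along the resulting path. It is a toy: it assumes error-free reads, the forward strand only, and a graph that is a single unbranched path. The function names are illustrative.

```python
def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Nodes are k-mers; an edge joins successive k-mers in a read (overlap of k-1)."""
    graph: dict[str, set[str]] = {}
    for read in reads:
        read = read.lower()
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # shared k-mers collapse into one node
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


def walk_unbranched(graph: dict[str, set[str]]) -> str:
    """Spell the sequence along a graph that is one unbranched, acyclic path."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    seen = {node}
    seq = node
    while graph[node]:
        if len(graph[node]) != 1:
            raise ValueError("graph branches; a simple walk cannot resolve it")
        (node,) = graph[node]
        if node in seen:
            raise ValueError("graph contains a cycle")
        seen.add(node)
        seq += node[-1]  # each step adds one new base
    return seq.upper()


if __name__ == "__main__":
    g = de_bruijn(["AACCGGT", "CCGGTTA"], 4)
    print(sorted(g))
    print(walk_unbranched(g))
```

Joining the graphs needs no explicit alignment step: the shared nodes ccgg and cggt are simply the same dictionary keys.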
22
Differences between overlap graphs and de Bruijn graphs for assembly.
Based on the set of ten 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note that here we have only considered the forward orientation of each sequence to simplify the figure.
Schatz M C et al. Genome Res. 2010;20: ©2010 by Cold Spring Harbor Laboratory Press
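The statement that repeats create a fork can be seen on a toy sequence. In the sketch below, the 12-base string is invented for illustration (it is not the read set from the figure); it contains the 3-mer GCT twice, followed by a different base each time. The function names are hypothetical.

```python
from collections import defaultdict


def successors(seq: str, k: int) -> dict[str, set[str]]:
    """Map each k-mer in seq to the set of k-mers that directly follow it."""
    succ = defaultdict(set)
    for i in range(len(seq) - k):
        succ[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return succ


def forks(seq: str, k: int) -> list[str]:
    """k-mers with more than one successor, i.e. branch points in the graph."""
    return sorted(node for node, nxt in successors(seq, k).items() if len(nxt) > 1)


if __name__ == "__main__":
    toy = "AAGCTTGGCTAC"  # invented example; GCT occurs twice
    print(forks(toy, 3))
    print(forks(toy, 4))
```

With k = 3 the node GCT has two outgoing edges, so the graph alone cannot say which branch to follow first. With k = 4 every k-mer in this toy sequence is unique and the fork disappears, because the k-mer is now longer than the repeated stretch.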
23
Problems with all assemblies
Sequences are not random
Most genomes contain biased information
Repeat sequences in the genome
24
Repeats
Exact repeats
  How does basecalling cope?
  High coverage versus high error rates
Polymorphic repeats
  Real SNPs (between non-clonal individuals)
  Polymorphic haplotypes (eukaryotes)
25
Graphs get very complex
26
“Spurs” from bad base calls
27
Polymorphisms cause “bubbles”
28
Repeats have multiple sinks/sources
29
Repeats have multiple sinks/sources
Salmonella has 7 rrn operons
Salmonella recombines at rrn operons
Helm and Maloy
30
Repeat sequences
What happens if the repeat is longer than the read length?
Need paired end reads to resolve order
Need pairs that span the repeat
Need pairs with one end in the repeat
31
Paired end sequencing
32
Paired end sequencing
Add linkers
33
Paired end sequencing
Sequencing
Nick migration
34
Repeats
Paired end reads or mate pairs
35
Discussion point
Should you pair sequences before analyzing?
Should you throw away singletons?
What happens if half of the reads pair and half do not?
36
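N50
Sort the contigs from longest to shortest: N50 is the length of the contig at which the running total first reaches 50% of the assembled bases
Measure of assembly quality
Longer N50 is generally better, although N50 measures contiguity, not correctness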
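A short sketch of the N50 calculation, using the common definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembled bases. The function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length of the contig at which the running sum first reaches half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Hypothetical contig lengths: total is 290, half is 145.
    print(n50([80, 70, 50, 40, 30, 20]))
```

In the example the two longest contigs (80 + 70 = 150) already hold more than half of the 290 assembled bases, so the N50 is 70.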
37
N50 of Vibrio sequence assemblies
38
Assemblers
39
Current assemblers
AMOS
Celera WGA Assembler
CLC Genomics Workbench
DNA Dragon
DNAnexus
Euler
Geneious
IDBA (Iterative De Bruijn graph short read Assembler)
LIGR Assembler (derived from TIGR Assembler)
MIRA (Mimicking Intelligent Read Assembly)
Newbler
Phrap
SSAKE
SOAPdenovo
SPAdes
Velvet
40
Assembly RAST
Pipeline for automatic assembly
Works with fasta, fastq, single end, paired end
Runs multiple assemblers in parallel
Combines contigs
41
ARAST
Module            Stages                             Description
a                 preprocess,assembler,post-process  A5 microbial assembly pipeline
a                 preprocess,assembler,post-process  Modified A5 microbial assembly pipeline
bhammer           preprocess                         SPAdes component for quality control of sequence data
bowtie            post-process                       Bowtie2 aligner that maps reads to contigs
bwa               post-process                       BWA aligner that maps reads to contigs
fastqc            preprocess                         FastQC quality control tool for sequence data
filter_by_length  preprocess                         Length-based sequencing reads filter and trimmer based on seqtk
idba              assembler                          IDBA iterative graph-based assembler for single-cell
kiki              assembler                          Kiki overlap-based parallel microbial and metagenomic assembler
quast             post-process                       QUAST assembly quality assessment tool (run by default)
ray               assembler                          Ray graph-based parallel microbial and metagenomic assembler
reapr             post-process                       REAPR assembly error recognizer using paired-end reads
sga_ec            preprocess                         SGA component for error correction
sga_preprocess    preprocess                         SGA component for preprocessing reads
spades            preprocess,assembler               SPAdes based on paired de Bruijn graphs
sspace            post-process                       SSPACE pre-assembled contig scaffolder
swap              assembler                          SWAP Assembler
tagdust           preprocess                         TagDust sequencing artifacts remover
trim_sort         preprocess                         DynamicTrim and LengthSort from SolexaQA
velvet            assembler                          Velvet de Bruijn graph-based assembler
42
ARAST
ar-run --single Ecoli_DH10B_Control_200.fastq -m "E. coli DH10B Control 200" -a velvet spades
43
ARAST
46
Hybrid assembly
Geni Silva
47
Sequence assembly
Reads -> Contigs -> Scaffolds
48
scaffold_builder http://edwards.sdsu.edu/scaffold_builder
Silva et al. Source Code for Biology and Medicine 2013, 8:23
49
Bandage to view graphs Bandage: Ryan Wick:
51
Discussion points
Should we assemble or not?
How does assembly affect ecological analyses?