Sequence Assembly.


Assembling the data. Problem: the longest single sequence possible is 1,000 bp, and most technology is bp. Microbial genomes are 2,000,000 bp. Therefore, how do you sequence a whole genome?

Sequencing the genomes
Extract DNA. Shear DNA into small pieces. Ligate adapters on each end. Sequence using “next generation sequencing”.

Sequence assembly: before we look at the data, can we make longer pieces?

The assembly: a hierarchical data structure that maps sequence data to a reconstruction of the target. The assembly groups reads into contigs, and contigs into scaffolds. Contigs provide a multiple sequence alignment of reads and a consensus sequence. Scaffolds provide contig order and orientation, and the sizes of the gaps between contigs.
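The hierarchy described above can be sketched as a few nested record types. This is only an illustration: the class and field names (PlacedRead, Contig, ScaffoldEntry, Scaffold, gap_after) are hypothetical and do not come from any particular assembler.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlacedRead:
    """A read positioned inside a contig's multiple sequence alignment."""
    name: str
    sequence: str
    offset: int            # start position of the read within the contig
    reverse: bool = False  # True if the read aligns as its reverse complement


@dataclass
class Contig:
    """A group of aligned reads plus the consensus sequence derived from them."""
    name: str
    consensus: str
    reads: List[PlacedRead] = field(default_factory=list)


@dataclass
class ScaffoldEntry:
    """One contig in a scaffold: its orientation and the gap that follows it."""
    contig: Contig
    reverse: bool = False
    gap_after: int = 0     # estimated gap size (bp) to the next contig


@dataclass
class Scaffold:
    """Ordered, oriented contigs with estimated gap sizes between them."""
    name: str
    entries: List[ScaffoldEntry] = field(default_factory=list)

    def span(self) -> int:
        """Estimated length: contig lengths plus the gaps between contigs."""
        contig_total = sum(len(e.contig.consensus) for e in self.entries)
        gap_total = sum(e.gap_after for e in self.entries[:-1])
        return contig_total + gap_total
```

The gap after the last contig is ignored in span(), since a gap is only meaningful between two contigs.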

Sequence assembly: Reads, Contigs, Scaffolds

Four approaches to assembly
Naïve approach; greedy approach; Overlap / Layout / Consensus; de Bruijn graphs

Naïve approach: compare every sequence to every other sequence.
Find stretches that are the same. Need to account for phred scores: what if a base is wrong? How long does a sequence need to be to be unique?
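A minimal sketch of the all-against-all comparison, assuming exact matches only. It looks for the longest suffix of one read that equals a prefix of another. The function names and the min_len parameter are illustrative, and phred scores are deliberately ignored here, so a single wrong base breaks the match.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_vs_all(reads, min_len: int = 3):
    """Compare every read to every other read: O(n^2) pairs."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            n = suffix_prefix_overlap(a, b, min_len)
            if n:
                overlaps[(i, j)] = n  # read i runs into read j by n bases
    return overlaps
```

For example, all_vs_all(["AACCGGT", "CCGGTTA"]) reports a 5-base overlap from the first read into the second and nothing in the other direction. The quadratic number of comparisons is what makes this approach impractical for real data sets.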

Sequence composition: 4 bases
1 in 4^n chance of finding a given sequence of length n if all bases are evenly used (they are not). 3 bp: 4^3 = 64. 8 bp: 4^8 = 65,536. 20 bp: 4^20 = 1,099,511,627,776.
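The arithmetic can be checked in a few lines. The second function is an added illustration, not from the slides: it estimates how many times a given n-mer would turn up by chance in a 2,000,000 bp genome, assuming all four bases are equally likely and independent (which, as noted, they are not).

```python
def distinct_sequences(n: int, alphabet_size: int = 4) -> int:
    """Number of different sequences of length n over the alphabet."""
    return alphabet_size ** n


def expected_chance_hits(n: int, genome_length: int = 2_000_000) -> float:
    """Rough expected count of one specific n-mer in a random genome.

    Assumes uniform, independent bases and ignores the reverse strand.
    """
    return genome_length / distinct_sequences(n)


for n in (3, 8, 20):
    print(f"{n} bp: 4^{n} = {distinct_sequences(n):,}; "
          f"expected chance hits in 2 Mbp: {expected_chance_hits(n):.2g}")
```

Under these assumptions an 8 bp sequence is expected about 30 times in a microbial genome, while a 20 bp sequence is very unlikely to occur by chance.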

Problems with this approach
Sequences are not random. Most genomes contain biased information. Repeat sequences in the genome.

Greedy approaches: start with a sequence.
Keep extending it while another sequence matches the end. When it cannot be extended further, mark it as a contig.
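The greedy idea can be sketched as follows, assuming exact suffix-to-prefix matches and extension in one direction only. The function names and min_len are illustrative; real greedy assemblers such as SSAKE are considerably more involved.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of `a` that exactly matches a prefix of `b` (0 if none)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def greedy_extend(seed: str, reads, min_len: int = 3):
    """Extend `seed` to the right with the best-overlapping unused read.

    Stops when no remaining read overlaps the end, then returns the
    finished contig and the reads that were never used.
    """
    contig = seed
    unused = list(reads)
    while True:
        best_len, best_read = 0, None
        for read in unused:
            n = overlap(contig, read, min_len)
            if n > best_len:
                best_len, best_read = n, read
        if best_read is None:
            return contig, unused        # cannot extend: mark as a contig
        contig += best_read[best_len:]   # append only the non-overlapping part
        unused.remove(best_read)
```

Because each step commits to the single best local overlap, a repeat can send the extension down the wrong path, which is one reason greedy assemblies tend to be fragmented or misjoined.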

Improve greedy approaches
Only use high quality sequence. Use reads that are represented more than n times in the sample (SSAKE). End-to-end overlap vs. partial overlap. Ignores low coverage regions … also incorporate quality scores (SHARCGS). In general, greedy approaches are fast but not very good: they make lots of short contigs.
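The "represented more than n times" filter can be illustrated with a simple counter. This is a sketch of the idea only, not SSAKE's implementation; the function name is hypothetical and reads are compared as exact strings.

```python
from collections import Counter


def reads_seen_more_than(reads, n: int):
    """Return the distinct reads that occur more than n times in the sample."""
    counts = Counter(reads)
    return sorted(read for read, count in counts.items() if count > n)
```

A read containing a sequencing error is unlikely to be seen repeatedly, so this removes many errors, but it also discards genuine reads from low coverage regions.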

Overlap / Layout / Consensus
All versus all comparison (done with K-mers for speed). Generate approximate read layout as an overlap graph. Use multiple sequence alignments to resolve layout.
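A minimal sketch of the overlap step, assuming exact matches: a k-mer index first finds pairs of reads that share at least one k-mer, and only those candidate pairs are checked for a suffix-to-prefix overlap. The function names and parameters are illustrative, and the layout and consensus steps are not shown.

```python
from collections import defaultdict


def overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of `a` that exactly matches a prefix of `b` (0 if none)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def candidate_pairs(reads, k: int):
    """Pairs of read indices (i < j) that share at least one k-mer."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update((i, j) for i in ids for j in ids if i < j)
    return pairs


def overlap_graph(reads, k: int, min_len: int):
    """Directed graph: graph[i][j] = overlap length from read i into read j."""
    graph = defaultdict(dict)
    for i, j in candidate_pairs(reads, k):
        for a, b in ((i, j), (j, i)):
            n = overlap(reads[a], reads[b], min_len)
            if n:
                graph[a][b] = n
    return graph
```

Reads that share no k-mer are never compared, which is where the speed-up over a full all-against-all comparison comes from.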

Newbler (O/L/C): makes unitigs, which are single contigs with no discrepancies.
Merges unitigs into contigs. May split unitigs and even reads (could be chimeras). Uses coverage to compensate for base calls. Works in flow space to calculate homopolymeric tracts. More accurate than average of averages.

Assembly is a “graph” problem
Overlap/Layout/Consensus; de Bruijn graph; greedy graphs. A graph is nodes + edges.

Assemble these two sequences!
AACCGGT CCGGTTA Consensus: AACCGGTTA

AACCGGT as graphs: aacc accg ccgg cggt
Nodes = K-mers; edges = nodes that overlap by K-1 bases. Here K = 4, but in reality K = 19 to 31.

CCGGTTA as graphs ccgg cggt ggtt gtta

Join the two graphs: ccgg cggt ggtt gtta joined with aacc accg ccgg cggt gives AACCGGTTA
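The worked example can be reproduced with a small sketch that follows the convention used in these slides: each node is a K-mer, and an edge joins successive K-mers that overlap by K-1 bases. The function names are illustrative, only the forward strand is considered, and the walk handles just the simple case of a single unbranched path.

```python
def kmers(seq: str, k: int):
    """All k-mers of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k: int):
    """Map each k-mer node to the set of k-mers that follow it in some read."""
    edges = {}
    for read in reads:
        ks = kmers(read.lower(), k)
        for node in ks:
            edges.setdefault(node, set())
        for left, right in zip(ks, ks[1:]):
            edges[left].add(right)   # left and right overlap by k-1 bases
    return edges


def walk_unbranched(edges):
    """Spell the sequence along a single unbranched path.

    Returns None if the graph has a fork, a cycle, or no unique start node.
    """
    targets = {t for outs in edges.values() for t in outs}
    starts = [node for node in edges if node not in targets]
    if len(starts) != 1:
        return None
    node = starts[0]
    sequence, seen = node, {node}
    while edges[node]:
        if len(edges[node]) != 1:
            return None              # a fork, e.g. caused by a repeat
        node = next(iter(edges[node]))
        if node in seen:
            return None              # a cycle
        seen.add(node)
        sequence += node[-1]         # each step adds one new base
    return sequence.upper()
```

Building the graph from AACCGGT and CCGGTTA with K = 4 merges the shared nodes ccgg and cggt automatically, because identical K-mers are the same node, and walking the path spells AACCGGTTA.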

Differences between an overlap graph and a de Bruijn graph for assembly.
Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note that here we have only considered the forward orientation of each sequence to simplify the figure. Schatz M C et al. Genome Res. 2010;20: ©2010 by Cold Spring Harbor Laboratory Press

Problems with all assemblies
Sequences are not random. Most genomes contain biased information. Repeat sequences in the genome.

Repeats. Exact repeats: how does basecalling cope?
High coverage versus high error rates. Polymorphic repeats. Real SNPs (between non-clonal individuals). Polymorphic haplotypes (eukaryotes).

Graphs get very complex

“Spurs” from bad base calls

Polymorphisms cause “bubbles”

Repeats have multiple sinks/sources

Repeats have multiple sinks/sources
Salmonella has 7 rrn operons. Salmonella recombines at rrn operons (Helm and Maloy).

Repeat sequences: what happens if the repeat is longer than the read length? Need paired end reads to resolve order. Need pairs that span the repeat. Need pairs with one end in the repeat.

Paired end sequencing

Paired end sequencing: add linkers

Paired end sequencing: sequencing, nick migration

Repeats: A B C. Paired end reads or mate pairs.

Discussion point Should you pair sequences before analyzing?
Should you throw away singletons? What happens if half of the reads pair and half do not?

N50: the length of the shortest contig such that contigs of that length or longer contain at least 50% of the assembled sequence.
Measure of assembly quality. Longer N50 is better.
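N50 can be computed from the contig lengths alone. A minimal sketch, using the common definition (sort contigs from longest to shortest and report the length at which the running total first reaches half of the assembled bases); the function name is illustrative.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # no contigs


print(n50([80, 70, 50, 40, 30, 20, 10]))
```

In this example the contigs total 300 bp, and the two longest (80 + 70) reach 150 bp, so the N50 is 70. N50 says nothing about correctness: joining contigs wrongly also raises it.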

N50 of Vibrio sequence assemblies

Assemblers

Current assemblers: AMOS, Celera WGA Assembler, CLC Genomics Workbench, DNA Dragon, DNAnexus, Euler, Geneious, IDBA (Iterative De Bruijn graph short read Assembler), LIGR Assembler (derived from TIGR Assembler), MIRA (Mimicking Intelligent Read Assembly), Newbler, Phrap, SSAKE, SOAPdenovo, SPAdes, Velvet

Assembly RAST: pipeline for automatic assembly.
Works with fasta, fastq, single end, paired end. Runs multiple assemblers in parallel. Combines contigs.

ARAST modules

Module            Stages                             Description
a                 preprocess,assembler,post-process  A5 microbial assembly pipeline
a                 preprocess,assembler,post-process  Modified A5 microbial assembly pipeline
bhammer           preprocess                         SPAdes component for quality control of sequence data
bowtie            post-process                       Bowtie2 aligner that maps reads to contigs
bwa               post-process                       BWA aligner that maps reads to contigs
fastqc            preprocess                         FastQC quality control tool for sequence data
filter_by_length  preprocess                         Length-based sequencing reads filter and trimmer based on seqtk
idba              assembler                          IDBA iterative graph-based assembler for single-cell
kiki              assembler                          Kiki overlap-based parallel microbial and metagenomic assembler
quast             post-process                       QUAST assembly quality assessment tool (run by default)
ray               assembler                          Ray graph-based parallel microbial and metagenomic assembler
reapr             post-process                       REAPR assembly error recognizer using paired-end reads
sga_ec            preprocess                         SGA component for error correction
sga_preprocess    preprocess                         SGA component for preprocessing reads
spades            preprocess,assembler               SPAdes based on paired de Bruijn graphs
sspace            post-process                       SSPACE pre-assembled contig scaffolder
swap              assembler                          SWAP Assembler
tagdust           preprocess                         TagDust sequencing artifacts remover
trim_sort         preprocess                         DynamicTrim and LengthSort from SolexaQA
velvet            assembler                          Velvet de Bruijn graph based assembler

ARAST
ar-run --single Ecoli_DH10B_Control_200.fastq -m "E. coli DH10B Control 200" -a velvet spades

Hybrid assembly Geni Silva

Sequence assembly: Reads, Contigs, Scaffolds

scaffold_builder http://edwards.sdsu.edu/scaffold_builder
Silva et al. Source Code for Biology and Medicine 2013, 8:23

Bandage to view graphs Bandage: Ryan Wick:

Discussion points Should we assemble or not?
How does assembly affect ecological analyses?
