Sequence Assembly.

1 Sequence Assembly

2 Assembling the data
Problem: the longest single sequence possible is 1,000 bp, and most technology is bp.
Microbial genomes are 2,000,000 bp.
Therefore, how do you sequence a whole genome?

3 Sequencing the genomes
Extract DNA
Shear DNA into small pieces
Ligate adapters on each end
Sequence using “next generation sequencing”

4 Sequence assembly
Before we look at the data: can we make longer pieces?

5 The assembly
A hierarchical data structure that maps sequence data to a reconstruction of the target.
The assembly groups reads into contigs, and contigs into scaffolds.
Contigs provide a multiple sequence alignment of reads and a consensus sequence.
Scaffolds provide contig order and orientation, and the sizes of the gaps between contigs.
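
The hierarchy described on this slide can be sketched as nested records. This is only an illustrative sketch: the class and field names (`PlacedRead`, `Contig`, `ScaffoldEntry`, `Scaffold`, `gap_after`) are assumptions made for the example and do not come from any particular assembler.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlacedRead:
    """A read placed on a contig; offset is its 0-based start in the consensus."""
    name: str
    sequence: str
    offset: int


@dataclass
class Contig:
    """Reads aligned against each other, plus the consensus they support."""
    name: str
    consensus: str
    reads: List[PlacedRead] = field(default_factory=list)


@dataclass
class ScaffoldEntry:
    """One contig in a scaffold, with its orientation and the gap that follows it."""
    contig: Contig
    forward: bool = True   # False means the contig is reverse-complemented
    gap_after: int = 0     # estimated gap (bp) to the next contig


@dataclass
class Scaffold:
    name: str
    entries: List[ScaffoldEntry] = field(default_factory=list)

    def span(self) -> int:
        """Contig lengths plus the estimated gaps between consecutive contigs."""
        total = sum(len(e.contig.consensus) for e in self.entries)
        total += sum(e.gap_after for e in self.entries[:-1])
        return total


c1 = Contig("c1", "AACCGGTTA",
            [PlacedRead("r1", "AACCGGT", 0), PlacedRead("r2", "CCGGTTA", 2)])
c2 = Contig("c2", "GGGTTTCC")
scaffold = Scaffold("s1", [ScaffoldEntry(c1, True, 100), ScaffoldEntry(c2, False, 0)])
print(scaffold.span())  # 9 + 100 + 8 = 117
```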

6 Sequence assembly
Reads
Contigs
Scaffolds

7 Four approaches to assembly
Naïve approach
Greedy approach
Overlap / Layout / Consensus
de Bruijn graphs

8 Naïve approach
Compare every sequence to every other sequence
Find stretches that are the same
Need to account for phred scores: what if a base is wrong?
How long does a sequence need to be to be unique?
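
One way to picture "accounting for phred scores" is to let low-quality base calls disagree without vetoing a match. The sketch below relies on the standard phred definition (quality Q corresponds to an error probability of 10^(-Q/10)); the function names and the `min_q=20` cutoff are assumptions chosen for illustration, not part of the slides.

```python
def phred_to_error_prob(q: int) -> float:
    """Phred quality Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def same_stretch(seq_a, qual_a, seq_b, qual_b, min_q=20):
    """Return True if two equal-length sequences agree at every position where
    both base calls have quality >= min_q. A mismatch involving a low-quality
    base is not trusted, so it does not rule out the match."""
    if len(seq_a) != len(seq_b):
        return False
    for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b):
        if a != b and qa >= min_q and qb >= min_q:
            return False
    return True


print(phred_to_error_prob(20))  # about 0.01, i.e. 1 error in 100 calls
# The last base differs, but it was called with quality 5, so the match stands.
print(same_stretch("AACCGGT", [30] * 7, "AACCGGA", [30] * 6 + [5]))
# The same mismatch with two confident calls is treated as a real difference.
print(same_stretch("AACCGGT", [30] * 7, "AACCGGA", [30] * 7))
```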

9 Sequence composition
4 bases
1 in 4^n chance of finding a given n-bp sequence if all bases are evenly used (they are not)
3 bp: 4^3 = 64
8 bp: 4^8 = 65,536
20 bp: 4^20 = 1,099,511,627,776
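
The arithmetic on this slide can be checked directly, and extended to ask how often a given sequence would turn up by chance in a genome. The sketch assumes a random, evenly composed, single-stranded genome of 2,000,000 bp (the microbial genome size quoted earlier); the function names are made up for the example.

```python
GENOME_SIZE = 2_000_000  # bp, the microbial genome size quoted earlier in the slides


def distinct_kmers(k: int) -> int:
    """Number of different sequences of length k over 4 bases."""
    return 4 ** k


def expected_chance_hits(k: int, genome_size: int = GENOME_SIZE) -> float:
    """Expected occurrences of one given k-mer in a random genome:
    (number of start positions) / (number of possible k-mers)."""
    return (genome_size - k + 1) / distinct_kmers(k)


for k in (3, 8, 20):
    print(k, distinct_kmers(k), expected_chance_hits(k))
```

Under these assumptions an 8-bp sequence is expected roughly 30 times by chance, while a 20-bp sequence is expected far less than once, which is why short stretches cannot be treated as unique.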

10 Problems with this approach
Sequences are not random
Most genomes contain biased information
Repeat sequences in the genome

11 Greedy approaches
Start with a sequence
Keep extending it while another sequence matches the end
When it cannot be extended further, mark it as a contig
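
The greedy loop described above might look like the following minimal sketch. It only extends to the right, uses exact matching, and always takes the longest overlap; the names `overlap_len` and `greedy_contig` and the `min_overlap=3` default are assumptions for illustration, not how SSAKE or any other tool is implemented.

```python
def overlap_len(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`,
    or 0 if no such overlap of at least `min_overlap` bases exists."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def greedy_contig(seed, reads, min_overlap=3):
    """Extend `seed` rightwards while some unused read overlaps its end."""
    contig = seed
    unused = list(reads)
    while True:
        best_read, best_n = None, 0
        for read in unused:
            n = overlap_len(contig, read, min_overlap)
            if n > best_n:
                best_read, best_n = read, n
        if best_read is None:
            # Cannot be extended further: report what we have as a contig.
            return contig, unused
        unused.remove(best_read)
        contig += best_read[best_n:]


contig, leftover = greedy_contig("AACCGGT", ["GGTTAC", "TTACGA", "CCCCCC"])
print(contig)    # AACCGGTTACGA
print(leftover)  # ['CCCCCC']
```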

12 Improve greedy approaches
Only use high quality sequence
Use reads that are represented more than n times in the sample (SSAKE)
End to end overlap vs. partial overlap
Ignores low coverage regions
… also incorporate quality scores (SHARCGS)
In general, greedy approaches are fast but not very good: they make lots of short contigs

13 Overlap / Layout / Consensus
All versus all comparison (done with k-mers for speed).
Generate approximate read layout as an overlap graph.
Use multiple sequence alignments to resolve layout.
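
To illustrate why k-mers speed up the all-versus-all step: instead of aligning every pair of reads, index each k-mer once and only consider pairs of reads that share one. This is a rough sketch of that idea, with `candidate_pairs` and `k=4` as assumed names and values; real overlappers add seeds, filtering, and alignment on top of it.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=4):
    """Return index pairs (i, j), i < j, of reads sharing at least one k-mer."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    pairs = set()
    for read_ids in index.values():
        # Every pair of reads listed under the same k-mer is a candidate overlap.
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["AACCGGT", "CCGGTTA", "TTTTTTT"]
print(candidate_pairs(reads))  # {(0, 1)}
```

Only the candidate pairs then need a full comparison, so reads with nothing in common (the third read here) are never aligned against the others.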

14 Newbler (O/L/C)
Makes unitigs: single contigs with no discrepancies
Merges unitigs into contigs
May split unitigs and even reads (could be chimeras)
Uses coverage to compensate for base calls
Works in flow space to calculate homopolymeric tracts
More accurate than average of averages

15 Assembly is a “graph” problem
Overlap/Layout/Consensus
de Bruijn graph
Greedy graphs
A graph is nodes + edges

16 Assemble these two sequences!
AACCGGT
  CCGGTTA
Consensus: AACCGGTTA
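
The two-read example can be reproduced with a few lines that look for the longest suffix of the first read matching a prefix of the second. This is a minimal sketch assuming exact matches only; `merge_reads` and its `min_overlap` parameter are hypothetical names for the illustration.

```python
def merge_reads(left, right, min_overlap=3):
    """Merge two reads on their longest exact suffix/prefix overlap.
    Returns None if they overlap by fewer than `min_overlap` bases."""
    for n in range(min(len(left), len(right)), max(min_overlap, 1) - 1, -1):
        if left[-n:] == right[:n]:
            return left + right[n:]
    return None


print(merge_reads("AACCGGT", "CCGGTTA"))  # AACCGGTTA (overlap CCGGT, 5 bases)
print(merge_reads("AACCGGT", "TTTTTTT"))  # None
```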

17 AACCGGT as graphs
aacc → accg → ccgg → cggt
Nodes = k-mers; edges join nodes that overlap by K-1 bases.
Here K = 4, but in reality K = 19 to 31

18 CCGGTTA as graphs
ccgg → cggt → ggtt → gtta

19 Join the two graphs
ccgg → cggt → ggtt → gtta
aacc → accg → ccgg → cggt

20 Join the two graphs
ccgg → cggt → ggtt → gtta
aacc → accg → ccgg → cggt

21 Join the two graphs
ccgg → cggt → ggtt → gtta
aacc → accg → ccgg → cggt
AACCGGTTA
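
The joining step on these slides can be sketched in code using the same convention: nodes are k-mers and an edge links successive k-mers that overlap by K-1 bases. Because both reads contain ccgg and cggt, those nodes are shared and the two paths fuse automatically. The function names are assumptions for illustration, and the walk only handles a single unbranched path, which is exactly the situation that repeats break.

```python
def kmers(seq, k):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k=4):
    """Map each k-mer node to the set of k-mers that follow it in some read."""
    edges = {}
    for read in reads:
        nodes = kmers(read.lower(), k)
        for node in nodes:
            edges.setdefault(node, set())
        for a, b in zip(nodes, nodes[1:]):
            edges[a].add(b)
    return edges


def walk_unbranched(edges):
    """Spell out the sequence, assuming the graph is one simple path."""
    targets = {b for successors in edges.values() for b in successors}
    starts = [node for node in edges if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    seen = {node}
    seq = node
    while edges[node]:
        if len(edges[node]) != 1:
            raise ValueError("branch in graph; a simple walk cannot resolve it")
        (node,) = edges[node]
        if node in seen:
            raise ValueError("cycle in graph; a simple walk cannot resolve it")
        seen.add(node)
        seq += node[-1]  # each step along an edge adds one new base
    return seq.upper()


graph = build_graph(["AACCGGT", "CCGGTTA"], k=4)
print(len(graph))              # 6 distinct k-mer nodes
print(walk_unbranched(graph))  # AACCGGTTA
```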

22 Differences between overlap graphs and de Bruijn graphs for assembly.
Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.
Schatz M C et al. Genome Res. 2010;20: ©2010 by Cold Spring Harbor Laboratory Press
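
For contrast with the de Bruijn construction, an overlap graph keeps each read as a node and draws a directed edge wherever the end of one read matches the start of another. The sketch below uses three made-up 8-bp reads (not the ones in the figure) and an assumed minimum overlap of 5 bp. Its test for transitive edges is deliberately simplified: an edge a→c is flagged when some read b gives a→b and b→c, without the length checks a real transitive reduction would apply.

```python
def suffix_prefix_overlap(a, b, min_overlap):
    """Longest proper overlap between the end of a and the start of b (0 if none)."""
    for n in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_overlap=5):
    """Directed edges (i, j) -> overlap length, for every ordered pair of reads."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = suffix_prefix_overlap(a, b, min_overlap)
                if n:
                    edges[(i, j)] = n
    return edges


def transitive_edges(edges):
    """Edges a->c that are implied by a path a->b->c through another read."""
    nodes = {node for edge in edges for node in edge}
    return {
        (a, c)
        for (a, c) in edges
        if any((a, b) in edges and (b, c) in edges for b in nodes)
    }


# Hypothetical reads taken at offsets 0, 1 and 2 of ACGTTGCATG.
reads = ["ACGTTGCA", "CGTTGCAT", "GTTGCATG"]
edges = overlap_graph(reads)
print(edges)                    # {(0, 1): 7, (0, 2): 6, (1, 2): 7}
print(transitive_edges(edges))  # {(0, 2)}
```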

23 Problems with all assemblies
Sequences are not random
Most genomes contain biased information
Repeat sequences in the genome

24 Repeats
Exact repeats
How does basecalling cope?
High coverage versus high error rates
Polymorphic repeats
Real SNPs (between non-clonal individuals)
Polymorphic haplotypes (eukaryotes)

25 Graphs get very complex

26 “Spurs” from bad base calls

27 Polymorphisms cause “bubbles”

28 Repeats have multiple sinks/sources

29 Repeats have multiple sinks/sources
Salmonella has 7 rrn operons
Salmonella recombines at rrn operons
Helm and Maloy

30 Repeat sequences
What happens if the repeat is longer than the read length?
Need paired end reads to resolve order
Need pairs that span the repeat
Need pairs with one end in the repeat

31 Paired end sequencing

32 Paired End Sequencing
Add linkers

33 Paired end sequencing
Sequencing
Nick migration

34 Repeats
A B C
Paired end reads or mate pairs

35 Discussion point
Should you pair sequences before analyzing?
Should you throw away singletons?
What happens if half of the reads pair and half do not?

36 N50
Length of the shortest contig among the longest contigs that together contain 50% of the assembled bases
Measure of assembly quality
Longer N50 is better
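
N50 is easiest to understand by computing it: sort the contigs from longest to shortest, add up their lengths, and report the length of the contig at which the running total first reaches half of all assembled bases. The sketch below follows that common definition; the function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total of lengths, taken from
    longest to shortest, first reaches half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # no contigs at all


# Total is 290 bp, so half is 145 bp: 80 alone is not enough, 80 + 70 = 150 is.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Because it is weighted by bases rather than by contig count, many tiny contigs barely move the N50, while breaking one long contig lowers it sharply.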

37 N50 of Vibrio sequence assemblies

38 Assemblers

39 Current assemblers
AMOS
Celera WGA Assembler
CLC Genomics Workbench
DNA Dragon
DNAnexus
Euler
Geneious
IDBA (Iterative De Bruijn graph short read Assembler)
LIGR Assembler (derived from TIGR Assembler)
MIRA (Mimicking Intelligent Read Assembly)
Newbler
Phrap
SSAKE
SOAPdenovo
SPAdes
Velvet

40 Assembly RAST
Pipeline for automatic assembly
Works with fasta, fastq, single end, paired end
Runs multiple assemblers in parallel
Combines contigs

41 ARAST
Module            Stages                             Description
a                 preprocess,assembler,post-process  A5 microbial assembly pipeline
a                 preprocess,assembler,post-process  Modified A5 microbial assembly pipeline
bhammer           preprocess                         SPAdes component for quality control of sequence data
bowtie            post-process                       Bowtie2 aligner that maps reads to contigs
bwa               post-process                       BWA aligner that maps reads to contigs
fastqc            preprocess                         FastQC quality control tool for sequence data
filter_by_length  preprocess                         Length-based sequencing reads filter and trimmer based on seqtk
idba              assembler                          IDBA iterative graph-based assembler for single-cell
kiki              assembler                          Kiki overlap-based parallel microbial and metagenomic assembler
quast             post-process                       QUAST assembly quality assessment tool (run by default)
ray               assembler                          Ray graph-based parallel microbial and metagenomic assembler
reapr             post-process                       REAPR assembly error recognizer using paired-end reads
sga_ec            preprocess                         SGA component for error correction
sga_preprocess    preprocess                         SGA component for preprocessing reads
spades            preprocess,assembler               SPAdes based on paired de Bruijn graphs
sspace            post-process                       SSPACE pre-assembled contig scaffolder
swap              assembler                          SWAP Assembler
tagdust           preprocess                         TagDust sequencing artifacts remover
trim_sort         preprocess                         DynamicTrim and LengthSort from SolexaQA
velvet            assembler                          Velvet de-bruijn graph based assembler

42 ARAST
ar-run --single Ecoli_DH10B_Control_200.fastq -m "E. coli DH10B Control 200" -a velvet spades

43 ARAST

44 ARAST

45 ARAST

46 Hybrid assembly
Geni Silva

47 Sequence assembly
Reads
Contigs
Scaffolds

48 scaffold_builder
http://edwards.sdsu.edu/scaffold_builder
Silva et al. Source Code for Biology and Medicine 2013, 8:23

49 Bandage to view graphs Bandage: Ryan Wick:

50 Bandage to view graphs Bandage: Ryan Wick:

51 Discussion points
Should we assemble or not?
How does assembly affect ecological analyses?

