Sequence Assembly.


Assembling the data: the problem is that the longest single read possible is about 1,000 bp, and most technologies produce reads of 50-500 bp. Microbial genomes are around 2,000,000 bp. How, then, do you sequence a whole genome?

Sequencing the genomes: extract DNA; shear the DNA into small pieces; ligate adapters onto each end; sequence using “next generation sequencing”.

Sequence assembly: before we look at the data, can we make longer pieces?

The assembly: a hierarchical data structure that maps sequence data to a reconstruction of the target. The assembly groups reads into contigs and contigs into scaffolds. Contigs provide a multiple sequence alignment of the reads and a consensus sequence. Scaffolds provide the contig order and orientation, and the sizes of the gaps between contigs.

Sequence assembly: reads are assembled into contigs, and contigs into scaffolds.

Four approaches to assembly: the naïve approach; the greedy approach; overlap / layout / consensus; de Bruijn graphs.

Naïve approach: compare every sequence to every other sequence and find stretches that are the same. Need to account for Phred scores: what if a base is wrong? How long does a sequence need to be to be unique?
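As a rough illustration of why Phred scores matter when comparing reads, the sketch below converts quality scores into error probabilities using the standard definition Q = -10 log10(P). The function names are hypothetical, and the FASTQ offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption about the input data.

```python
def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(qual_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual_string]


def expected_errors(qual_string: str, offset: int = 33) -> float:
    """Expected number of wrong bases in a read, summed over its positions."""
    return sum(phred_to_error_prob(q) for q in decode_quality(qual_string, offset))


if __name__ == "__main__":
    quals = "IIII5555"  # hypothetical quality string: four Q40 bases, four Q20 bases
    print(decode_quality(quals))
    print(expected_errors(quals))
```

A Q20 base has a 1 in 100 chance of being wrong, so an exact-match comparison over many low-quality bases will miss true overlaps.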

Sequence composition: with 4 bases there are 4^n possible sequences of length n, so a given sequence has a 1 in 4^n chance of occurring if all bases are evenly used (they are not). 3 bp: 4^3 = 64. 8 bp: 4^8 = 65,536. 20 bp: 4^20 = 1,099,511,627,776.
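The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch under the slide's own simplifying assumption that all four bases are used evenly; the genome size of 2,000,000 bp is the microbial figure quoted earlier, and the function names are hypothetical.

```python
def possible_sequences(n: int) -> int:
    """Number of distinct DNA sequences of length n (4 choices per position)."""
    return 4 ** n


def expected_chance_hits(n: int, genome_size: int = 2_000_000) -> float:
    """Expected chance occurrences of one specific n-mer on one strand of a
    uniformly random genome (an idealisation; real genomes are not random)."""
    positions = max(genome_size - n + 1, 0)
    return positions / possible_sequences(n)


if __name__ == "__main__":
    for n in (3, 8, 20):
        print(f"{n} bp: 4^{n} = {possible_sequences(n):,}; "
              f"expected chance hits in 2 Mbp: {expected_chance_hits(n):.3g}")
```

Under these assumptions an 8 bp sequence is expected to occur by chance dozens of times in a 2 Mbp genome, while a 20 bp sequence is very unlikely to occur by chance at all.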

Problems with this approach: sequences are not random; most genomes contain biased information; genomes contain repeat sequences.

Greedy approaches: start with a sequence and keep extending it while another sequence matches the end. When it cannot be extended further, mark it as a contig.
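The greedy idea can be sketched as follows. This is a toy illustration rather than the algorithm of any particular assembler: it extends only to the right, requires exact matches, ignores reverse complements and quality scores, and the names `overlap`, `greedy_contig`, and `min_len` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_contig(reads: list[str], min_len: int = 3) -> tuple[str, list[str]]:
    """Extend the first read rightwards with the best-overlapping unused read
    until nothing matches its end; return the contig and the leftover reads."""
    contig, unused = reads[0], list(reads[1:])
    while True:
        best = max(unused, key=lambda r: overlap(contig, r, min_len), default=None)
        if best is None or overlap(contig, best, min_len) == 0:
            return contig, unused  # cannot be extended further: mark as a contig
        contig += best[overlap(contig, best, min_len):]
        unused.remove(best)


if __name__ == "__main__":
    print(greedy_contig(["AACCGGT", "GTTACGA", "CCGGTTA"]))
```

Because the best local overlap is always taken, a repeat can send the extension down the wrong path, which is one reason greedy assemblers tend to produce many short contigs.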

Improving greedy approaches: only use high-quality sequence. Use reads that are represented more than n times in the sample (SSAKE). End-to-end overlap vs. partial overlap. Ignores low-coverage regions. Also incorporate quality scores (SHARCGS). In general, greedy approaches are fast but not very good: they make lots of short contigs.

Overlap / Layout / Consensus: all-versus-all comparison (done with k-mers for speed). Generate an approximate read layout as an overlap graph. Use multiple sequence alignments to resolve the layout.
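The overlap step can be sketched like this: a k-mer index first narrows the all-versus-all comparison to pairs of reads that share at least one k-mer, and only those candidates are checked for a suffix-prefix overlap. This is a minimal sketch assuming exact matches and forward-strand reads only; the function names and the choice of k as the minimum overlap are assumptions for illustration.

```python
from collections import defaultdict
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads: list[str], k: int = 4) -> dict[tuple[int, int], int]:
    """Directed overlap graph: (i, j) -> overlap length between read i and read j."""
    # Index every k-mer to the reads that contain it.
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    # Any overlap of at least k bases implies a shared k-mer, so only
    # reads that share a k-mer need to be compared.
    candidates = set()
    for ids in index.values():
        candidates.update(permutations(ids, 2))
    graph = {}
    for i, j in sorted(candidates):
        length = suffix_prefix_overlap(reads[i], reads[j], k)
        if length:
            graph[(i, j)] = length
    return graph


if __name__ == "__main__":
    print(overlap_graph(["AACCGGT", "CCGGTTA", "TTTTTTT"]))
```

The layout and consensus stages would then order the reads along paths in this graph and align them to call each consensus base; those steps are not shown here.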

Newbler (O/L/C): makes unitigs (single contigs with no discrepancies), then merges unitigs into contigs. May split unitigs and even reads (which could be chimeras). Uses coverage to compensate for base calls. Works in flow space to calculate homopolymeric tracts, which is more accurate than an average of averages.

Assembly is a “graph” problem: overlap/layout/consensus, de Bruijn graphs, and greedy graphs. A graph is nodes + edges.

Assemble these two sequences! Reads: AACCGGT and CCGGTTA. Consensus: AACCGGTTA.

AACCGGT as a graph: aacc -> accg -> ccgg -> cggt. Nodes are k-mers; edges join nodes that overlap by k-1 bases. Here k = 4, but in reality k = 19 to 31.

CCGGTTA as a graph: ccgg -> cggt -> ggtt -> gtta.

Join the two graphs: ccgg -> cggt -> ggtt -> gtta and aacc -> accg -> ccgg -> cggt share the nodes ccgg and cggt.

Join the two graphs: aacc -> accg -> ccgg -> cggt -> ggtt -> gtta, which spells AACCGGTTA.
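The worked example on these slides can be reproduced in a few lines. The sketch below builds a de Bruijn graph whose nodes are k-mers and whose edges link successive k-mers in a read, then walks the single unbranched path to spell the contig. It assumes error-free, forward-strand reads and a graph with exactly one start node; `de_bruijn` and `walk_unbranched` are hypothetical names.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # successive k-mers overlap by k-1 bases
    return dict(graph)


def walk_unbranched(graph: dict[str, set[str]]) -> str:
    """Spell the contig from the unique start node until a fork or dead end."""
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node, seen = starts[0], {starts[0]}
    contig = node
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop rather than loop forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


if __name__ == "__main__":
    graph = de_bruijn(["AACCGGT", "CCGGTTA"], k=4)
    print(walk_unbranched(graph))
```

Shared k-mers such as CCGG and CGGT collapse into single nodes, which is how the two reads are joined without ever comparing them to each other directly.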

Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure. Schatz M C et al. Genome Res. 2010;20:1165-1173. ©2010 by Cold Spring Harbor Laboratory Press

Problems with all assemblies: sequences are not random; most genomes contain biased information; genomes contain repeat sequences.

Repeats: Exact repeats. How does basecalling cope? High coverage versus high error rates. Polymorphic repeats: real SNPs (between non-clonal individuals); polymorphic haplotypes (eukaryotes).

Graphs get very complex

“Spurs” from bad base calls

Polymorphisms cause “bubbles”

Repeats have multiple sinks/sources. Salmonella has 7 rrn operons, and Salmonella recombines at rrn operons (Helm and Maloy).

Repeat sequences: what happens if the repeat is longer than the read length? Need paired-end reads to resolve the order: pairs that span the repeat, and pairs with one end in the repeat.

Paired end sequencing

Paired end sequencing: add linkers.

Paired end sequencing: sequencing; nick migration.

Repeats (figure: regions A, B, C): paired-end reads or mate pairs.

Discussion point: Should you pair sequences before analyzing? Should you throw away singletons? What happens if half of the reads pair and half do not?

N50: the length of the contig at which, with contigs ordered from longest to shortest, the running total reaches 50% of all assembled bases. A measure of assembly quality; a longer N50 is better.
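N50 is easy to misread, so a short calculation may help. The sketch below uses the common definition (the length of the contig at which the cumulative length of contigs, sorted longest first, reaches at least half of the total assembly length); the function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths: list[int]) -> int:
    """Contig length at which the longest-first running total reaches
    at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths in kb
    print(n50(lengths))
```

Note that N50 rewards contiguity only: an assembler that wrongly joins contigs can raise N50 while making the assembly worse, so it is usually read alongside other quality checks.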

N50 of Vibrio sequence assemblies

Assemblers

Current assemblers: AMOS; Celera WGA Assembler; CLC Genomics Workbench; DNA Dragon; DNAnexus; Euler; Geneious; IDBA (Iterative De Bruijn graph short read Assembler); LIGR Assembler (derived from TIGR Assembler); MIRA (Mimicking Intelligent Read Assembly); Newbler; Phrap; SSAKE; SOAPdenovo; SPAdes; Velvet.

Assembly RAST: a pipeline for automatic assembly. Works with FASTA and FASTQ, single-end and paired-end reads. Runs multiple assemblers in parallel. Combines contigs.

ARAST
Module | Stages | Description
a5 | preprocess, assembler, post-process | A5 microbial assembly pipeline
a6 | preprocess, assembler, post-process | Modified A5 microbial assembly pipeline
bhammer | preprocess | SPAdes component for quality control of sequence data
bowtie2 | post-process | Bowtie2 aligner that maps reads to contigs
bwa | post-process | BWA aligner that maps reads to contigs
fastqc | preprocess | FastQC quality control tool for sequence data
filter_by_length | preprocess | Length-based sequencing reads filter and trimmer based on seqtk
idba | assembler | IDBA iterative graph-based assembler for single-cell
kiki | assembler | Kiki overlap-based parallel microbial and metagenomic assembler
quast | post-process | QUAST assembly quality assessment tool (run by default)
ray | assembler | Ray graph-based parallel microbial and metagenomic assembler
reapr | post-process | REAPR assembly error recognizer using paired-end reads
sga_ec | preprocess | SGA component for error correction
sga_preprocess | preprocess | SGA component for preprocessing reads
spades | preprocess, assembler | SPAdes based on paired de Bruijn graphs
sspace | post-process | SSPACE pre-assembled contig scaffolder
swap | assembler | SWAP Assembler
tagdust | preprocess | TagDust sequencing artifacts remover
trim_sort | preprocess | DynamicTrim and LengthSort from SolexaQA
velvet | assembler | Velvet de Bruijn graph based assembler

ARAST example command:
ar-run --single Ecoli_DH10B_Control_200.fastq -m "E. coli DH10B Control 200" -a velvet spades

Hybrid assembly (Geni Silva)

Sequence assembly: reads are assembled into contigs, and contigs into scaffolds.

scaffold_builder: http://edwards.sdsu.edu/scaffold_builder (Silva et al. Source Code for Biology and Medicine 2013, 8:23)

Bandage to view graphs. Bandage (Ryan Wick): http://rrwick.github.io/Bandage

Discussion points: Should we assemble or not? How does assembly affect ecological analyses?