NGS Bioinformatics Workshop 2


NGS Bioinformatics Workshop 2.1: Next Generation Sequencing and Sequence Assembly Algorithms. May 2nd, 2012, IRMACS 10900, SFU. Facilitators: NGS Technology, thanks to Jim Mattson; NGS Assembly Algorithms, Richard Bruskiewich.

NGS Bioinformatics – 2nd Part: Next Generation Sequence Analysis and Beyond
Topic | Lecture (12:30 – 14:30, Wednesdays) | Demo/Lab (9:30 – 11:30, Thursdays)
Next Generation Sequencing and Sequence Assembly Algorithms | May 2nd | May 3rd
Sequence Assembly of Whole Genomes | May 9th | May 10th
Sequence Assembly of Transcriptomes | May 16th | May 17th
Identification and Analysis of Sequence Variation | May 23rd | May 24th
Comparative Genomic Analysis and Visualization | May 30th | May 31st
Meta-Analysis of Genomic Data | June 6th | June 7th

Overview. Sequencing: Sanger method (brief review); “Next Generation” Sequencing (NGS), with in-depth treatment by Jim Mattson. Overview of sequence assembly.

What is a “Sequence Read”? A single instance of an experimentally obtained subsequence, representing a (possibly erroneous, likely biased) subsampling of a sequence space of (generally larger) target nucleotide molecules, which may also have an associated computed measure of quality.

SANGER Sequencing

Contigs and Sanger Automated Sequencing. [Figure: a chromosome is divided into a large insert library of large-insert clones; shotgun cloning produces subclones, and sequencing the subclones yields sequence reads such as CAGACTACCGTTAGACTT.]

Dideoxy chain-termination (“Sanger”) Method

NGS Trend. PubMed was “searched in two-year increments for key words and the number of hits plotted over time.” From: Kahvejian A, Quackenbush J and Thompson JF. 2008. What would you do if you could sequence everything? Nature Biotechnology 26, 1125 - 1133. doi:10.1038/nbt1494

“Next generation” or “Deep” sequencing. Example: Illumina Genome Analyzer II sequencing. Rapid, short-read sequencing. Less accurate, but higher coverage compensates. Benefits greatly from Paired-End sequencing (Mate pairs): sequence the two ends of a fragment of known size. Fragment length (insert size) can range from 2 – 5 kb. Illumina reads can range from 25-77 bp (longer is better, except for high GC sequence; most use 100 or 150 bp reads now). ~200 million reads.

What is a Mate-Pair (or Long Paired-End) library? Mate-pair library is the Illumina synonym for the Roche long paired-end library (LPE). While the long paired end library is adapted to be sequenced on GS FLX, the mate-pair library is adapted to the Illumina HiSeq 2000 technology. The library consists of approximately 150-300 bp fragments. These are composed of 2 DNA segments originally located 2-5 kbp apart in the genome of interest. With a mate-pair library it is therefore possible to span gaps or repeats of up to 2-5 kbp.

Paired End Reads are Important! [Figure: Read 1 and Read 2 lie a known distance apart, across repetitive DNA and unique DNA.] A single read maps to multiple positions; a paired read maps uniquely.

NGS Sequencing technologies: over to Jim Mattson

1.5 Principles of Genomics, Next Generation Sequencing and Sequence Assembly Algorithms

What is a sequence assembly? An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target(*). (*) Miller JR, Koren S and Sutton G. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95:315-327

(Classical, Sanger) Sequence Assembly Early genomes were sequenced on the basis of decomposition of genomes and chromosomes into tractable sizes of (subcloned) DNA (~100 kb), ordered and oriented by detailed genetic and physical maps. Clone-by-clone based sequence assembly was a simpler computational problem given the relatively small size and reduced complexity of the sequence target (subclone) and relatively long Sanger reads, but was extremely costly due to the experimental overhead of the DNA decomposition.

“Classical” Sequence Assembly: (1) read, edit & trim DNA chromatograms; (2) remove overlaps & ambiguous calls; (3) read in all sequence files (10-10,000); (4) reverse complement all sequences (doubles # of sequences to align); (5) remove vector sequences (vector trim); (6) remove regions of low complexity; (7) perform multiple sequence alignment & merge; (8) fill (“finish”) gaps using a variety of experimental procedures.

Contig Alignment - Process. Reads: ATCGATGCGTAGC, TAGCAGACTACCGTT, GTTACGATGCCTT, GCTACGCATCGT, CGATGCGTAGCA. The reads are aligned at their overlaps and merged into the consensus ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
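The merge step on this slide can be sketched in a few lines of Python. This is a minimal illustration only, assuming exact suffix-to-prefix overlaps and reads that are already ordered left to right; the function names (`reverse_complement`, `overlap_length`, `merge_in_order`) and the `min_overlap` parameter are hypothetical, and real assemblers such as Phrap tolerate mismatches and must also discover the read order.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(reads, min_overlap=3):
    """Merge reads that are already ordered left to right into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_overlap)
        if size == 0:
            raise ValueError("no overlap found for read " + read)
        contig += read[size:]  # append only the part that extends the contig
    return contig


# Three of the forward-strand reads from the slide.
reads = ["ATCGATGCGTAGC", "TAGCAGACTACCGTT", "GTTACGATGCCTT"]
contig = merge_in_order(reads)
print(contig)  # ATCGATGCGTAGCAGACTACCGTTACGATGCCTT
```

The read CGATGCGTAGCA from the slide is fully contained in this contig, so it adds depth but no new sequence.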

Sanger Automated Sequence Assembly Software. Phred: base calling program that does detailed statistical analysis on Sanger chromatogram (“trace”) files. It introduced the concept of the “PHRED” quality score for base calls, $Q = -10 \log_{10}(P_e)$, where $P_e$ is the estimated probability that the base call is wrong; e.g. 1 error in 1000 gives $Q = -10 \log_{10}(10^{-3}) = 30$. Phrap: sequence assembly program. http://www.phrap.org/phredphrapconsed.html See Ewing et al. Genome Research 1998, Vol. 8, Issue 3 for a good overview.
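The quality score formula on this slide is easy to check numerically. The sketch below simply evaluates Q = -10 log10(Pe) and its inverse; the function names are illustrative and are not part of Phred itself.

```python
import math


def phred_quality(error_probability):
    """Convert a base-call error probability Pe into a Phred quality score Q."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Convert a Phred quality score Q back into an error probability Pe."""
    return 10 ** (-quality / 10)


# 1 error in 1000 base calls corresponds to a quality of about 30.
print(phred_quality(0.001))
# A quality of 20 corresponds to about 1 error in 100 base calls.
print(error_probability(20))
```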

Whole Genome Shot-Gun Sequencing. Shotgun sequencing takes reads from random positions along a target molecule. Whole-genome shotgun (WGS) sequencing samples all of the chromosomes that make up one genome.

What is a WGS sequence assembly? WGS assembly is the reconstruction of sequence up to chromosome length, through over-sampling such that reads overlap. It groups reads into contigs, and contigs into scaffolds. Contigs document a multiple sequence alignment of reads into a contiguous consensus. Scaffolds are gapped sequences composed of ordered and oriented contigs, with inferred gaps indicated as indeterminate bases (N’s).

Confounding factors for NGS WGS assembly. Assembly of NGS reads is generally limited by the fact that read lengths are much shorter than even the smallest genome. Random oversampling of the target tries to overcome this limitation; however, all current NGS technologies are intrinsically error prone, hence imperfect alignments are tolerated in assembly algorithms. But target genomes are generally full of subtly divergent repetitive content of diverse nature, generally longer than NGS read lengths, so it is not always easy to know what is a sequencing error and what is a sequence variant.

Opportunity cost of sequencing errors… Software must tolerate imperfect sequence alignments to avoid missing true joins; however, this error tolerance may introduce false-positive joins of reads that induce chimeric assemblies.

Opportunity cost of sequencing errors. In practice, tolerance for sequencing error makes it difficult to resolve a wide range of genomic variation: polymorphic repeats; polymorphic differences between non-clonal asexual individuals; polymorphic differences between non-inbred sexual individuals; polymorphic haplotypes from one non-inbred individual.

Other Limitations in Assemblies. Assembly is also confounded by non-uniform coverage of the target: (1) cellular copy number variation in source molecules (remember too that this kind of variation may have intrinsic scientific interest, e.g. estimation of gene expression via NGS transcriptome assembly, RNA-Seq); (2) compositional biases of sequencing technology (e.g. paucity of representation of AT rich and/or GC rich fragments in NGS templates).

NGS Contig & Scaffold Assembly: (1) strive for proper experimental design for libraries (e.g. sample quality, mate pair insert sizes, etc.); (2) transfer raw read data into the analysis environment (it may be a non-trivial task to load the enormous raw read files into the computer for the analysis); (3) perform bulk statistical analysis of raw read data to ascertain overall dataset quality; (4) filter out and trim reads based on low quality thresholds; (5) select a k-mer size; (6) perform assembly based on the selected k-mer size (and perhaps, mate pair library insert size); (7) measure quality of assembly; (8) iterate on other k-mer sizes to improve quality; (9) use multiple assembly programs and compare results.
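The "filter out and trim reads" step in the workflow above might look roughly like the following sketch. It assumes each read is held as a (sequence, list of Phred qualities) pair and that low-quality bases are trimmed from the 3' end only; the thresholds `min_quality=20` and `min_length=5` are arbitrary illustrative values, not recommendations from the workshop.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Trim bases from the 3' end while their quality is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def filter_reads(reads, min_quality=20, min_length=5):
    """Trim every (sequence, qualities) pair and keep reads still long enough."""
    kept = []
    for sequence, qualities in reads:
        trimmed_seq, trimmed_quals = trim_low_quality_tail(
            sequence, qualities, min_quality
        )
        if len(trimmed_seq) >= min_length:
            kept.append((trimmed_seq, trimmed_quals))
    return kept


reads = [
    ("ACGTACGTAC", [35, 36, 34, 33, 30, 31, 28, 12, 9, 5]),  # poor 3' tail
    ("TTGCA", [10, 8, 7, 5, 2]),                              # poor throughout
]
kept = filter_reads(reads)
print(kept)  # only the first read survives, trimmed to ACGTACG
```

Dedicated read-trimming tools apply more elaborate rules (sliding windows, adapter removal), so this is only the basic idea.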

Measuring the quality of an assembly. The most common (and crude) metric of assembly “quality” is N50, a length-weighted median: N50 is the largest length L such that the items of length greater than or equal to L together account for at least half of the total length of all of the items. The items in question are either the contigs or the scaffolds of the assembly, and a reported N50 should state which. A number of additional metrics have now been developed by the bioinformatics community(*). (*) Earl DA et al. 2011. Assemblathon 1: A competitive assessment of de novo short read assembly methods. http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111
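The N50 definition is easier to follow as code. The sketch below takes a list of contig (or scaffold) lengths and walks down from the longest item until half of the total length is covered; the function name `n50` and the example lengths are illustrative.

```python
def n50(lengths):
    """Return N50: the length of the item at which the running total of
    lengths, taken from longest to shortest, first reaches half of the
    total length of all items."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must not be empty")


contig_lengths = [80, 70, 50, 40, 30, 20]  # total 290, half is 145
print(n50(contig_lengths))  # 70, because 80 + 70 = 150 >= 145
```

Note that N50 says nothing about correctness: joining contigs wrongly raises N50, which is one reason the metric is described as crude.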

Two Classes of Assembly. Map-based WGS assembly refers to reconstruction of the underlying sequence facilitated by alignments to a previously resolved reference sequence. De novo WGS assembly refers to reconstruction of the underlying sequence without a previously resolved reference sequence.

Map-Based Assembly

Map alignment assembly of short reads. Strategy: index the reference genome sequence and search it efficiently. For this purpose, map-alignment sequence assembly approaches generally use a computing strategy called Burrows–Wheeler indexing to notably reduce compute time and memory usage, see http://bio-bwa.sourceforge.net
MAQ (Mapping and Assembly with Quality): Heng Li, Sanger Centre, http://maq.sourceforge.net/maq-man.shtml
Bowtie (an ultrafast memory-efficient short read aligner): Ben Langmead and Cole Trapnell, University of Maryland, http://bowtie-bio.sourceforge.net/
SOAPAligner from SOAP (Short Oligonucleotide Analysis Package): http://soap.genomics.org.cn/soapaligner.html
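To give a feel for Burrows–Wheeler indexing, here is a deliberately naive Python sketch: it builds the transform by sorting all rotations of the reference and then counts exact occurrences of a read by "backward search". It is not how the aligners listed above are implemented (they use compressed, sampled index structures); the function names and the toy reference are assumptions made for illustration.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end marker."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    # first_row_start[c] = number of characters in the text that sort before c
    first_row_start = {}
    for index, char in enumerate(sorted(bwt_string)):
        first_row_start.setdefault(char, index)
    low, high = 0, len(bwt_string)  # half-open range of matching rotations
    for char in reversed(pattern):
        if char not in first_row_start:
            return 0
        # Naive rank queries; real indexes precompute these counts.
        low = first_row_start[char] + bwt_string[:low].count(char)
        high = first_row_start[char] + bwt_string[:high].count(char)
        if low >= high:
            return 0
    return high - low


reference = "ACGTACGTTACG"  # toy reference sequence
index = bwt(reference)
print(count_occurrences(index, "ACG"))  # 3
print(count_occurrences(index, "GTT"))  # 1
```

The point of the design is that the search cost depends on the read length, not on the reference length, once the index has been built.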

de novo Assemblies

Graph Theory. The mathematical concept(*) of a graph is a topological data abstraction in which a set of nodes (vertices) is connected by a set of edges (arcs). Nodes are objects in a collection and edges are the relationships between them. This is a widely used data structure in computing science, used to represent many diverse computational problems (and used in many computing algorithms). (*) First defined by the great mathematician Leonhard Euler in 1735, as a tool to solve the “Bridges of Königsberg” problem. See http://en.wikipedia.org/wiki/Graph_theory

What is a k-mer? For efficiency, all NGS assembly software relies to some extent on the notion of a k-mer. A k-mer is a contiguous sequence of k base calls, where k is any positive integer. Intuitively, sequence reads with high similarity must share k-mers in their overlapping regions. Fast detection (by indexing) of shared k-mer content is computationally far less expensive than “all-against-all” sequence alignment detection of variable length sequence overlaps.
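A small sketch of k-mer indexing, under the assumption that reads are plain strings held in a list: each k-mer is mapped to the set of reads containing it, and any two reads that share a k-mer become a candidate overlap. The names `kmers`, `build_kmer_index` and `candidate_overlaps` are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """List every contiguous substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_kmer_index(reads, k):
    """Map each k-mer to the set of read indices in which it occurs."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    return index


def candidate_overlaps(reads, k):
    """Pairs of reads sharing at least one k-mer (candidates, not proof)."""
    pairs = set()
    for read_ids in build_kmer_index(reads, k).values():
        ordered = sorted(read_ids)
        for position, first in enumerate(ordered):
            for second in ordered[position + 1:]:
                pairs.add((first, second))
    return pairs


reads = ["ATCGATGCGTAGC", "TAGCAGACTACCGTT", "GTTACGATGCCTT"]
print(kmers("ACGTA", 3))                    # ['ACG', 'CGT', 'GTA']
print(sorted(candidate_overlaps(reads, 4)))  # [(0, 1), (0, 2)]
```

With k = 4, reads 0 and 2 are flagged because both contain CGATGC, even though they do not overlap end to end. Shared k-mers are therefore only a cheap filter, and candidates still need to be verified.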

de Bruijn Graphs of k-mers. de Bruijn graphs were developed outside of the field of sequence assembly as a data representation for arbitrary strings spanning a finite alphabet. The nodes of a de Bruijn graph represent all possible fixed length strings. The edges represent suffix-to-prefix perfect overlaps. A graph of all k-mers, and their fixed length overlaps, observed in a target nucleotide sequence (or in a sequence sampling of that target) is a kind of de Bruijn graph. The primary advantage of de Bruijn graphs is that their size is generally bounded by the complexity of the underlying genome, not by the total number (“depth”) of reads. The primary disadvantage is that different sequences (i.e. reads) can theoretically resolve to identical de Bruijn graphs due to internal repeat content (cycles in the graph). See: Compeau PEC, Pevzner PA and Tesler G. 2011. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29, 987–991. doi:10.1038/nbt.2023
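As a rough sketch of the idea, the Python below builds a graph whose nodes are the k-mers observed in a set of reads, with an edge wherever one k-mer directly follows another in a read (a perfect overlap of k-1 bases). This is one of several conventions in use (some formulations put k-mers on the edges instead), and the function name and toy sequences are assumptions made for illustration.

```python
def de_bruijn_graph(reads, k):
    """Map each observed k-mer to the set of k-mers that directly follow it."""
    graph = {}
    for read in reads:
        # Register every k-mer as a node, including those with no successor.
        for i in range(len(read) - k + 1):
            graph.setdefault(read[i:i + k], set())
        # Consecutive k-mers in a read overlap by k-1 bases: add an edge.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


target = "ATGGCGTGCA"
from_target = de_bruijn_graph([target], 3)

# Overlapping reads sampled from the target, one of them sequenced twice.
from_reads = de_bruijn_graph(["ATGGCGT", "GCGTGCA", "ATGGCGT"], 3)

print(len(from_target))           # 8 distinct 3-mers
print(from_reads == from_target)  # True
```

The last line shows the advantage noted above: sequencing the same region more deeply adds no nodes or edges, so graph size tracks the target, not the read count.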

Example of a DNA Sequence de Bruijn Graph. From http://www.homolog.us/blogs/2011/07/28/de-bruijn-graphs-i/ accessed 30/4/2012

Sequencing Errors Generate Lightly Travelled Divergent Paths in de Bruijn graphs. Sequence assembly algorithms can prune such lightly travelled paths and reconstruct the genome from the heavily traversed paths.
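One simple way to picture this pruning is to count how often each k-mer is seen across all reads and discard the rarely seen ones before building the graph. The sketch below does only that; it is a simplification under the assumption of uniform coverage, with a hypothetical `min_count` threshold, whereas real assemblers also use graph structure (for example short dead-end branches) when deciding what to remove.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer is observed across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_count=2):
    """Keep k-mers seen at least min_count times (the heavily travelled ones)."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, count in counts.items() if count >= min_count}


# Three error-free copies of a read plus one copy with a single
# substitution (G -> T at the sixth base).
reads = ["ATGGCGTGCA"] * 3 + ["ATGGCTTGCA"]
kept = solid_kmers(reads, 4)

print(sorted(kept))
print("GCTT" in kept)  # False: seen once, on the divergent path
```

A single substitution creates up to k new k-mers, each seen only once here, which is why the erroneous path is "lightly travelled".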

Repeat Content in Targets Add Graph Cycles

Spanning or Mate-Pair Reads Resolve Complexity

The Computational Challenge of Assembly. Sequencing generates enormous data sets with highly heterogeneous sequence reads. This is especially true of NGS. This adds complexity to computation: (1) many of the graph processing problems that arise in assembly are in the category of NP complete computing problems (=> really hard!), hence assemblers rely heavily on heuristics to solve them (but still demand a significant number of CPU cycles); (2) de Bruijn graphs constructed to merge read data are inherently very large, due to the observed number of distinct k-mers in the target sequences, hence require significant computer memory to hold the constructed graph.

De novo NGS WGS assembly of short reads
Velvet: Daniel Zerbino and Ewan Birney, EMBL-EBI, http://www.ebi.ac.uk/~zerbino/velvet/
ABySS: Inanç Birol, Shaun Jackman, Steve Jones and others, GSC, http://www.bcgsc.ca/platform/bioinfo/software/abyss
ALLPATHS-LG: Jaffe et al., CRD, Broad Institute, http://www.broadinstitute.org/software/allpaths-lg/blog/
SOAPdenovo: Li et al., Beijing Genome Institute, http://soap.genomics.org.cn/soapdenovo.html
Additional software is listed in Earl DA et al. 2011. Assemblathon 1: A competitive assessment of de novo short read assembly methods. http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111

Transcriptome Assembly has Additional Considerations. Transcripts from highly similar paralogous loci are another instance of the repetitive sequence problem. Alternate splicing also generates branching de Bruijn graphs. Catch-22: assembly efficacy is biased by gene expression variation in the numbers of raw reads (especially for transcripts with low gene expression), while gene expression estimates from NGS sequencing (“RNA-Seq”) are confounded by NGS sequence technology compositional biases and the above graph resolution issues.

Detection of Sequence Variants is a Challenge. Sequencing errors can masquerade as variation. Gene copy number polymorphism, segmental duplication, resolution of linkage haplotypes, etc. are special cases of the repeat duplication problem.