GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology.


Session Outline
Planning a genome sequencing project
Assembly strategies and algorithms
Assessing the quality of the assembly
Assessing the quality of the assemblers
Genome annotation

Genome sequencing

Schematic overview of genome assembly. (a) DNA is collected from the biological sample and sequenced. (b) The output from the sequencer consists of many billions of short, unordered DNA fragments from random positions in the genome. (c) The short fragments are compared with each other to discover how they overlap. (d) The overlap relationships are captured in a large assembly graph, shown as nodes representing k-mers or reads, with edges drawn between overlapping k-mers or reads. (e) The assembly graph is refined to correct errors and simplified into the initial set of contigs, shown as large ovals connected by edges. (f) Finally, mates, markers and other long-range information are used to order and orient the initial contigs into large scaffolds, shown as thin black lines connecting the initial contigs. Schatz et al. Genome Biology :243

Planning a genome sequencing project
How large is my genome?
How much of it is repetitive, and what is the repeat size distribution?
Is a good quality genome of a related species available?
What will be my strategy for performing the assembly?

How large is my genome?
The size of the genome can be estimated from the ploidy of the organism and the DNA content per cell
This will affect:
» How many reads will be required to attain sufficient coverage (typically 10x to 100x)
» What sequencing technology to use
» What computational resources will be needed
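The link between genome size, coverage and read count is simple arithmetic: average coverage is the total number of sequenced bases divided by the genome size. The sketch below is illustrative only; the function names and the 5 Mb bacterial example are assumptions, not part of the original slides.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Smallest read count N with N * read_length_bp / genome_size_bp >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


# Hypothetical 5 Mb bacterial genome, 50x coverage, 300 nt reads.
bacterial_reads = reads_needed(5_000_000, 50, 300)

# 1 Gb genome, 100x coverage, 100 nt reads.
vertebrate_reads = reads_needed(1_000_000_000, 100, 100)

print(bacterial_reads, vertebrate_reads)
```

The second case gives one billion reads, the same order of magnitude quoted for a vertebrate genome under "Typical sequencing strategies".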

Repetitive sequences
Most common source of assembly errors
If the sequencing technology produces reads longer than the repeat size, the impact is much smaller
Most common solution: generate reads or mate pairs with spacing greater than the largest known repeat

Assemblies can collapse around repetitive sequences. Salzberg S L, and Yorke J A Bioinformatics 2005;21:

Genome(s) from related species
Preferably of good quality, with large reliable scaffolds
Help guide the assembly of the target species
Help verify the completeness of the assembly
Can themselves be improved in some cases
But use with caution: they can cause errors when the genome architectures are different!

Strategies for assembly
The sequencing approaches and assembly strategies are interdependent!
» E.g., for bacterial genome assembly, one can generate PacBio reads and assemble with Celera Assembler, or generate Illumina reads and assemble with Velvet or SPAdes
» Optimal sequencing strategies are very different for a SOAPdenovo or an ALLPATHS-LG assembly

Typical sequencing strategies
Bacterial genome:
» 2x300 overlapping paired-end reads from an Illumina MiSeq machine, assembly with SPAdes
» PacBio CLR sequences at 200x coverage, self-correction and/or hybrid correction and assembly using Celera Assembler or PBJelly
Vertebrate genome:
» Combination of paired-end (2x250 nt overlapping fragments) and mate-pair (1, 3 and 10 kb libraries) 100 nt reads from an Illumina machine at 100x coverage (~1B reads for a 1 Gb genome), assembly with ALLPATHS-LG

Illumina paired end and mate pair sequencing

Additional useful data
Fosmid libraries
» End sequencing adds long-range contiguity information
» Pooled fosmids (~5000) can often be assembled more efficiently
Moleculo (Illumina TSLR) libraries
» Technology acquired by Illumina, allows generation of fully assembled 10 kb sequences
PacBio reads
» Provide 5-8 kb reads, but in most cases need parallel coverage by Illumina data for error correction

Assembly strategies and algorithms
In all cases, start with cleanup and error correction of raw reads
For long reads (>500 nt), Overlap/Layout/Consensus (OLC) algorithms work best
For short reads, De Bruijn graph-based assemblers are most widely used

Cleaning up the data
Trim reads with low quality calls
Remove short reads
Correct errors:
» Find all distinct k-mers (typically k=15) in input data
» Plot coverage distribution
» Correct low-coverage k-mers to match high-coverage ones
» Error correction is part of several assemblers, and is also available in the stand-alone programs Quake and khmer
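The k-mer steps above can be sketched in a few lines. This is a toy illustration, not how Quake or khmer are implemented: it counts k-mers exactly in memory, builds the coverage distribution, and only flags low-coverage k-mers rather than correcting them. The function names, the tiny reads and the cutoff of 2 are assumptions chosen for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map k-mer coverage -> number of distinct k-mers with that coverage."""
    return Counter(counts.values())


def weak_kmers(counts, min_count):
    """K-mers seen fewer than min_count times; likely to contain errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Three identical reads plus one read with a single substitution (A -> T).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, k=4)
histogram = coverage_histogram(counts)
suspect = weak_kmers(counts, min_count=2)

print(sorted(histogram.items()))
print(sorted(suspect))
```

In this toy data the four k-mers that overlap the substituted base each occur once, while the genuine k-mers occur three or more times, which is the separation the coverage plot is meant to reveal.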

Overlap-layout-consensus

OLC assembly steps
Calculate overlaps
» Can use a BLAST-like method, but finding common k-mers is more efficient
Assemble the layout graph, try to simplify the graph and remove nodes (reads), then find a Hamiltonian path
Generate consensus from the alignments between reads (overlaps)
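A minimal sketch of the overlap step, assuming error-free reads and exact suffix/prefix matches (real OLC assemblers tolerate mismatches and index k-mers instead of comparing all pairs). The function names and the three toy reads are assumptions made for illustration.

```python
from itertools import permutations


def overlap_length(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 if no overlap of at least min_length exists.
    """
    start = 0
    while True:
        # Look for the first min_length bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Check whether the whole suffix of a from this point is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_length=3):
    """Directed edges (a, b) -> overlap length for every ordered pair of reads."""
    edges = {}
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_length)
        if length > 0:
            edges[(a, b)] = length
    return edges


reads = ["ACGGTA", "GTACCT", "CCTGAA"]
edges = overlap_graph(reads)
print(edges)
```

Here the only edges form a single chain through the three reads, so the layout is unambiguous; with repeats the graph branches and the layout step becomes the hard part.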

Some OLC-based assemblers
Celera Assembler with the Best Overlap Graph (CABOG)
» Designed for Sanger sequences, but works with 454 and PacBio reads (with or without error correction)
Newbler, a.k.a. GS de novo Assembler
» Designed for 454 sequences, but works with Sanger reads

De Bruijn graphs - concept

Converting reads to a De Bruijn graph
[Figure: reads are 7 nt long; graph with k=3; deduced sequence (main branch)]
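A small sketch of the conversion, using the same settings as the slide (7 nt reads, k=3). It assumes one common convention, in which nodes are k-mers and an edge joins two k-mers that are adjacent in some read; other descriptions use (k-1)-mers as nodes. The reads and function names are invented for illustration and are not taken from the original figure.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are k-mers; an edge joins k-mers that are consecutive in a read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return dict(graph)


def walk_unbranched(graph, start):
    """Spell the sequence along a path while each node has exactly one successor."""
    sequence = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:  # stop on a cycle instead of looping forever
            break
        sequence += successor[-1]
        seen.add(successor)
        node = successor
    return sequence


# Three overlapping 7 nt reads sampled from a hypothetical 10 nt sequence.
reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA"]
graph = de_bruijn(reads, k=3)
deduced = walk_unbranched(graph, "ATG")
print(deduced)
```

Because every 3-mer in this toy sequence is unique, the graph is a single unbranched path; a repeated k-mer would create a branch and the walk would stop there.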

DBG implementation in the Velvet assembler

Examples of DBG-based assemblers
EULER (P. Pevzner), the first assembler to use DBG
Velvet (D. Zerbino), a popular choice for small genomes
SOAPdenovo (BGI), widely used by BGI, best for relatively unstructured assemblies
ALLPATHS-LG, probably the most reliable assembler for large genomes (but with strict input requirements)

Repeats often split genome into contigs
[Figure labels: contig derived from unique sequences; reads from multiple repeats collapse into artefactual contig]

Pairs Give Order & Orientation
Assembly without pairs results in contigs whose order and orientation are not known.
Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.
[Figure labels: Reads; Consensus (15-30 Kbp); Contig; 2-pair; Mean & Std. Dev. is known; Scaffold]
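How a group of corroborating pairs characterizes a gap can be shown with a simplified model. The sketch assumes contig A precedes contig B, both in forward orientation, that the forward read starts at a known position on A and its mate ends at a known position on B, and that the library's mean insert size is known. The function names, coordinates and the 3 kb library are hypothetical.

```python
from statistics import mean


def gap_from_pair(insert_mean, len_a, start_on_a, end_on_b):
    """Gap implied by one pair: insert = (bases on A) + gap + (bases on B)."""
    covered_on_a = len_a - start_on_a  # from the read start to the end of contig A
    covered_on_b = end_on_b            # from the start of contig B to the mate's far end
    return insert_mean - covered_on_a - covered_on_b


def estimate_gap(insert_mean, len_a, pairs):
    """Average the per-pair estimates; also report how many pairs support the link."""
    estimates = [gap_from_pair(insert_mean, len_a, s, e) for s, e in pairs]
    return mean(estimates), len(estimates)


# Hypothetical 3 kb mate-pair library linking a 1000 bp contig A to contig B.
pairs = [(800, 300), (900, 450), (700, 150)]
gap, support = estimate_gap(3000, 1000, pairs)
print(gap, support)
```

Each single pair gives a noisy estimate because real insert sizes vary around the mean; averaging several corroborating pairs is what makes the gap size well characterized.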

Anatomy of a WGS Assembly
[Figure labels: Chromosome; STS; STS-mapped Scaffolds; Contig; Gap (mean & std. dev. known); Read pair (mates); Consensus; Reads (of several haplotypes); SNPs; External “Reads”]

Assembly gaps
Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
Physical gap: no information is known about the adjacent contigs, nor about the DNA spanning the gap
[Figure labels: Sequencing gaps; Physical gaps]

Handling repeats
1. Repeat detection
» Pre-assembly: find fragments that belong to repeats, either statistically (most existing assemblers) or with a repeat database (RepeatMasker)
» During assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
» Post-assembly: find repetitive regions and potential mis-assemblies with REPuter, RepeatMasker, or "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
» Find DNA fragments belonging to the repeat
» Determine correct tiling across the repeat
» Obtain long reads spanning repeats
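The "unhappy" mate-pair check can be sketched as a simple classification of each mapped pair against the library's insert-size distribution. The three-standard-deviation threshold, the function name and the example numbers are assumptions for illustration; real tools choose their own cutoffs.

```python
def classify_pair(distance, properly_oriented, insert_mean, insert_sd, n_sd=3.0):
    """Label a mapped mate pair as happy, too close, too far or mis-oriented.

    distance: observed separation of the two mates in the assembly.
    properly_oriented: True if the mates face each other as the library expects.
    n_sd: tolerated deviation from the mean, in standard deviations (assumed).
    """
    if not properly_oriented:
        return "mis-oriented"
    if distance < insert_mean - n_sd * insert_sd:
        return "too close"
    if distance > insert_mean + n_sd * insert_sd:
        return "too far"
    return "happy"


# Hypothetical 3 kb library with a 300 bp standard deviation.
observed = [(3100, True), (1200, True), (5200, True), (2900, False)]
labels = [classify_pair(d, ok, 3000, 300) for d, ok in observed]
print(labels)
```

A cluster of "too close" pairs at one location is a typical signature of a collapsed repeat, while "too far" or mis-oriented clusters point to other kinds of mis-assembly.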

How good is my assembly?
How much total sequence is in the assembly relative to estimated genome size?
How many pieces, and what is their size distribution?
Are the contigs assembled correctly?
Are the scaffolds connected in the right order / orientation?
How were the repeats handled?
Are all the genes I expected in the assembly?

N50: the most common measure of assembly quality
N50 = length of the shortest contig in the smallest set of the longest contigs that together make up at least 50% of the total assembly length
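The definition translates directly into a short calculation: sort contigs from longest to shortest and accumulate lengths until half of the total is reached. The function name and the contig lengths below are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical assembly of seven contigs, 300 kb in total.
lengths_kb = [80, 70, 50, 40, 30, 20, 10]
print(n50(lengths_kb))
```

The two longest contigs (80 + 70 = 150 kb) already cover half of the 300 kb total, so the N50 is 70 kb. Note that N50 says nothing about correctness: joining contigs wrongly raises it.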

Order and orientation of contigs – more errors in one assembly than in another

REAPR overview

REAPR Summary
REAPR is a toolkit that assesses the quality of a genome assembly independently of the assembler, and without needing a “gold” reference assembly
REAPR is not a variant calling tool; it examines the consistency of a genome assembly with the same data that were used to assemble it
REAPR output can be visualized in many ways, and helps genome finishing projects
Every genome assembly project should use REAPR or a similar toolkit to perform quality checks on the assemblies being produced

BUSCO and CEGMA: conserved gene sets
CEGMA (Core Eukaryotic Genes Mapping Approach): from Ian Korf’s group, UC Davis
BUSCO: from Evgeny Zdobnov’s group, University of Geneva
Coverage of the conserved gene set is indicative of quality and completeness of the assembly

Even the best genomes are not perfect

There is no such thing as a “perfect” assembler (results from GAGE competition)

The computational demands and effectiveness of assemblers are very different

Assessing assembly strategies
Assemblathon (UC Davis and UC Santa Cruz)
» Provide challenging datasets to assemble in open competition (synthetic for edition 1, real for edition 2)
» Assess competitor assemblies by many different metrics
» Publish extensive reports
GAGE (U. of Maryland and Johns Hopkins)
» Select datasets associated with known high-quality genomes
» Run a set of open source assemblers with parameter sweeps on these datasets
» Compare the results, publish in scholarly journals with complete documentation of parameters

Some advice on running assemblies
Perform parameter sweeps
» Use many different values of key parameters, especially k-mer size for DBG assemblers, and evaluate the output (some assemblers can do this automatically)
Try different subsets of the data
» Sometimes libraries are of poor quality and degrade the quality of the assembly
» Artefacts in the data (e.g. PCR duplicates, homopolymer runs, …) can also badly affect output quality
Try more than one assembler
» There is no such thing as “the best” assembler

Genome annotation
A genome sequence is useless without annotation
Three steps in genome annotation:
» Find features not associated with protein-coding genes (e.g. tRNA, rRNA, snRNA, SINE/LINE, miRNA precursors)
» Build models for protein-coding genes, including exons, coding regions, regulatory regions
» Associate biologically relevant information with the genome features and genes

Methods for genome annotation
Ab initio, i.e. based on sequence alone
» INFERNAL/Rfam (RNA genes), miRBase (miRNAs), RepeatMasker (repeat families), many gene prediction algorithms (e.g. AUGUSTUS, Glimmer, GeneMark, …)
Evidence-based
» Require transcriptome data for the target organism (the more the better)
» Align cDNA sequences to the assembled genome and generate gene models: TopHat/Cufflinks, Scripture

Methods for biological annotation
BLAST of gene models against protein databases
» Sequence similarity to known proteins
InterProScan of predicted proteins against databases of protein domains (Pfam, Prosite, HAMAP, PANTHER, …)
Mapping against Gene Ontology terms (BLAST2GO)

MAKER, integration framework for genome annotation
MAKER runs many software tools on the assembled genome and collates the outputs
See

Acknowledgements
For this slide deck I “borrowed” figures and slides from many publications, Web pages and presentations by
» M. Schatz, S. Salzberg, K. Bradnam, K. Krampis, D. Zerbino, J. J. Cook, M. Pop, G. Sutton, T. Seemann
Thank you!