Graph-based variant detection

Slides:



Advertisements
Similar presentations
Considerations for Analyzing Targeted NGS Data HLA
Advertisements

ILP-BASED MAXIMUM LIKELIHOOD GENOME SCAFFOLDING James Lindsay Ion Mandoiu University of Connecticut Hamed Salooti Alex ZelikovskyGeorgia State University.
Next-generation sequencing
Targeted Data Introduction  Many mapping, alignment and variant calling algorithms  Most of these have been developed for whole genome sequencing and.
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.
The Sorcerer II Global ocean sampling expedition Katrine Lekang Global Ocean Sampling project (GOS) Global Ocean Sampling project (GOS) CAMERA CAMERA METAREP.
Considerations for Analyzing Targeted NGS Data Introduction Tim Hague, CTO.
De-novo Assembly Day 4.
CS 394C March 19, 2012 Tandy Warnow.
Todd J. Treangen, Steven L. Salzberg
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp , 2004.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
How I learned to quit worrying Deanna M. Church Staff Scientist, Short Course in Medical Genetics 2013 And love multiple coordinate.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
Genome alignment Usman Roshan. Applications Genome sequencing on the rise Whole genome comparison provides a deeper understanding of biology – Evolutionary.
P. Tang ( 鄧致剛 ); RRC. Gan ( 甘瑞麒 ); PJ Huang ( 黄栢榕 ) Bioinformatics Center, Chang Gung University. Genome Sequencing Genome Resequencing De novo Genome.
Metagenomics Assembly Hubert DENISE
CS177 Lecture 10 SNPs and Human Genetic Variation
The Changing Face of Sequencing
The iPlant Collaborative
Chapter 21 Eukaryotic Genome Sequences
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
© 2010 by The Samuel Roberts Noble Foundation, Inc. 1 The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK, 73401, USA 2 National Center.
Genomics and Forensics
MEME homework: probability of finding GAGTCA at a given position in the yeast genome, based on a background model of A = 0.3, T = 0.3, G = 0.2, C = 0.2.
Human Genome.
Introduction to RNAseq
billion-piece genome puzzle
De novo assembly validation
__________________________________________________________________________________________________ Fall 2015GCBA 815 __________________________________________________________________________________________________.
Ke Lin 23 rd Feb, 2012 Structural Variation Detection Using NGS technology.
Analyzing DNA using Microarray and Next Generation Sequencing (1) Background SNP Array Basic design Applications: CNV, LOH, GWAS Deep sequencing Alignment.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.
Canadian Bioinformatics Workshops
Discussion on Genomic/Metagenomic Data for ANGUS Course Adina Howe.
The Bovine Genome Sequence: potential resources and practical uses. Nicola Hastings, Andy Law and John L. Williams * * Department of Genetics and Genomics,
Rob Edwards San Diego State University
Metagenomic Species Diversity.
Introduction to Bioinformatics Resources for DNA Barcoding
Lesson: Sequence processing
Week-6: Genomics Browsers
Preprocessing Data Rob Schmieder.
Quality Control & Preprocessing of Metagenomic Data
Bioinformatics Research Group
CAP5510 – Bioinformatics Sequence Assembly
Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite.
Metagenomic assembly Cedric Notredame
Pre-assembly analyses
Research in Computational Molecular Biology , Vol (2008)
Introduction to Genome Assembly
CSE182-L12 Gene Finding.
Henrik Lantz - NBIS/SciLife/Uppsala University
Metagenomics Image: Iverson et al. 2012, Science.
CS 598AGB Genome Assembly Tandy Warnow.
Haley J. Abel, Hussam Al-Kateb, Catherine E. Cottrell, Andrew J
Discovery tools for human genetic variations
Genome organization and Bioinformatics
BLAST.
The ability of the SOP to sequence and identify unknown samples.
Eric Samorodnitsky, Jharna Datta, Benjamin M
Volume 21, Issue 8, Pages (August 2014)
Taxonomic identification and phylogenetic profiling
BF528 - Genomic Variation and SNP Analysis
Presentation transcript:

Graph-based variant detection Todd Treangen

Goals for a metagenomic assembler Must work well for clonal data handle repeats handle errors deal with low coverage regions Must deal with polymorphisms distinguish between errors and polymorphisms – distinguish between repeats and polymorphisms – enable discovery of polymorphisms/variation

Repeats in metagenome High coverage regions -> abundant organisms Repeats are genome sized due to closely related strains in the sample Isolate genome Metagenome Although the statistical methods seem to work well in single genome assembly, they can not be applied to metagenome assembly. Metagenome consists of a mixture of organisms. So if some organism is more abundant in the sample, it will have higher coverage despite of not being a repeat. By removing such regions, a valuable information is lost. Also, repeats are very large in length in metagenome due to closely related strains in the sample. For example, in this figure, regions marked by green are repetitive regions. In case of isolate genome, repeats are small in length but in case of metagenome, repeats are much larger in length. We do not want to remove such regions as it would result in loss of valuable information while reconstruction of genomes. Species 1 Species 2 Species 3

Why assemblers break contigs? Genome Reads Contigs Not enough coverage to assemble!

Why assemblers break contigs? Genome Contig 2 ? Contig 1 ? Contig 3

Orthogonal Linkage Information Genome scaffolding Orthogonal Linkage Information Scaffolds Assembled Contigs

Variant Detection Using Graph Topology SPQR Tree If two nodes lie in the same node of SPQR tree and share a virtual edge, then they are start and end of a bubble! Figure credits: Nijkamp et al., 2013

Prior work Very few scaffolding methods available for metagenome data Assemblers like MetaSPAdes (Nurk et al., 2017) have inbuilt scaffolding modules Bambus2 (Koren et al., 2011) is the only standalone metagenome scaffolder Not easy to use, requires specific file formats Both don’t scale to large data

Link Construction and Bundling MetaCarvel Pipeline Paired end reads Link Construction and Bundling Repeat Detection Contig Orientation Variant Detection (Kececioglu and Myers, 1995) (Huson et al., 2001) Approximate Betweenness Centrality and coverage skew Short read assembly Ghurye, Jay, and Mihai Pop. "Better Identification of Repeats in Metagenomic Scaffolding." WABI, Aarhus-Denmark, 2016.

Variant detection using graph topology Species 1 …AACTTACGTGAAAATACAACTTAGGACTA… Variations in the sequences cause bubble in the graph Naïve search for all bubble is exponential in the number of nodes Use a SPQR tree based approach to find out bubbles (Nijkamp et al. 2013) Runs in O(V+E) time (Gutwenger et al. 2001) …AACTTACTTGAAAATACAACTTTGGACTA.. Species 2 ACGTG TTAGG AAAAT ACAAC GACTA AACTT ACTTG TTTGG

MetaCarvel and MetagenomeScope Overview of the different types of variants detected by MetaCarvel. ​ We focus on five types of variants:​ A) Triangular bubbles,​ B) Simple four bubbles ​ C) High centrality nodes​ D) Simple Cycles, and ​ E) Multinode cycles https://github.com/marbl/MetaCarvel https://github.com/marbl/MetagenomeScope

Biological variants captured by graph motifs High-coverage nodes (inter/intragenomic repeats)           Simple bubbles (strain variants, gene loss/gain)           Cycles  Dataset Coverage Tandem repeats  Interspersed repeats Four-node Bubbles  Three-node Bubbles Inversions/ Inverted repeats Plasmids (closed) HMP2 Stool      17X 99442 95062 4243 16967 1182573 8106 HMP2 Tongue 28X 167099 120876 19872 62814 2651898 15551 HMP2 Plaque 1023956 62431 15106 50543 1661594 21710 HMP2 Cheek 15X 175226 32615 2923 9455 486016 3059

3 bubbles 4 bubbles Mobilome Mobilome RNA processing RNA processing

Putting it all together: the search for a pathogen in a mosquito microbiome

Pathogen in Peru The Orthobunyavirus genus comprises a diverse set of viral species, represented by multiple serogroups, including Their RNA genome (10 kbp nt) includes three segments (Small [S], Medium [M], and Large [L]). Speculated to cause febrile like illness

Virus isolation Mosquitoes were captured at Aotus monkey-baited traps as part of an enzootic dengue study conducted in the vicinity of Iquitos, Peru (Andrews et al. 2014). Mosquitoes were identified to species, pooled (up to 25 specimens/pool), frozen on dry ice, and kept at -70°C until tested for infectious virus.

Orthobunyavirus biology 3 segments: Small, Medium, Large Single-strand, negative sense, RNA genome

Reference DB

Needle in the haystack: strategies to consider De novo assembly and post-assembly annotation Let the assembler figure it out Reference-guided assembly No good reference available to target! Subtractive approach Remove everything but target (fingers crossed) Recruitment Run sensitive alignments (HMM scan) of all reads against related proteins No risk of missing novel gene content

Full de novo vs recruited asm IDBA-UD for full de novo assembly of 50 million 100bp HiSeq reads Resulted in 340,327 contigs, covering all of segment L & M (no segment S), with some notable additions* Recruited Blasted reads to Orthobunyavirus DB, word size =7, permissive e-value, 230K total reads (0.5% of sample) L & M segments assembled into a single contig, S segment not found (800- 1000bp)

Coverage plot of Segment L 50 million reads total 120K mapped back to recruited asm

Coverage plot of Segment M 50 million reads total 30K mapped back to recruited asm

Coverage plot of segment L 50 million reads total 500 mapped back to recruited asm

Segment L assembly Green monkey chr7

About those pesky assembly errors…

Perils In a metagenome, associating genes to replicons, segments to a virus, is hard (shared gene content between plasmids and chromosomes) and requires assembly/assembly graphs Single SNPs are important Annotation is important Poorly assembled/unavailable host/background sequence is common

Low abundance and RNA virus detection underexplored Initially only 4 reads out of 400 million using several binning tools were classified in Orthobunyavisus genus. More sensitive profile HMMs found more (1-2K reads) Present or not present? No straightforward way to link viral segments