De Novo Assembly of Mitochondrial Genomes from Low Coverage Whole-Genome Sequencing Reads Fahad Alqahtani and Ion Mandoiu University of Connecticut Computer.

Presentation transcript:

De Novo Assembly of Mitochondrial Genomes from Low Coverage Whole-Genome Sequencing Reads Fahad Alqahtani and Ion Mandoiu University of Connecticut Computer Science and Engineering Department

Outline
- Introduction & prior work
- Our approach
- Preliminary results
- Conclusion and future work

Nuclear Genome Vs. Mitochondrial Genome Source:

mt10k pipeline
- Read filtering using BLAST against a database of mitogenomes
- 3 de novo assemblers

mt10k results: 20/60 complete circular mitogenomes. Source:

MITObim: Mitochondrial Baiting and Iterative Mapping. Source:

ARC: Assembly by Reduced Complexity
Steps:
1) Align sequence reads to reference sequences of related species.
2) Use the alignment results to distribute reads into target-specific bins.
3) Perform assemblies for each bin (target) to produce contigs.
4) Replace the previous reference targets with the assembled contigs and iterate.
Source:


Our Approach
Reads → Read filtering → Geneious → Circular mtDNA sequence
- Unlike previous similarity-based filters, we use k-mer coverage-based read classifiers
- Less biased than using the genome of a related species
- Final assembly is done with a circular-genome-aware assembler (Geneious)

K-mer coverage histogram
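A k-mer coverage histogram of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's code: the talk uses Jellyfish for k-mer counting, whereas the pure-Python counter below, the function names kmer_counts and coverage_histogram, and the toy reads are all assumptions made for the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads.

    Hypothetical stand-in for a dedicated k-mer counter such as Jellyfish;
    it does not merge a k-mer with its reverse complement.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map each coverage value to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Toy reads, invented for the example.
reads = ["ACGTAC", "CGTACG", "TTTTTT"]
counts = kmer_counts(reads, k=3)
hist = coverage_histogram(counts)

for coverage in sorted(hist):
    print(coverage, hist[coverage])
```

On real data the histogram would be plotted with coverage on the x-axis and the number of distinct k-mers on the y-axis.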

Coverage is uniform across the mitogenome, but varies between individuals and sequencing centers
So, we need to learn the coverage from each sample.
Source: Li, Mingkun, et al. "Transmission of human mtDNA heteroplasmy in the Genome of the Netherlands families: support for a variable-size bottleneck." Genome Research 26.4 (2016):
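One way to learn the coverage from a given sample is to look up that sample's own counts for k-mers taken from a known mitochondrial sequence and summarize them. The sketch below is only an assumption about how this could be done: the names coverage_stats, sample_counts and reference_kmers are hypothetical, and summarizing by mean and population standard deviation is an illustrative choice.

```python
import statistics


def coverage_stats(sample_counts, reference_kmers):
    """Return (mean, standard deviation) of the sample's coverage over reference k-mers.

    sample_counts: dict mapping k-mer -> count in this sample (hypothetical input)
    reference_kmers: iterable of k-mers from a known mitochondrial sequence
    Reference k-mers that never occur in the sample are skipped.
    """
    coverages = [sample_counts[km] for km in reference_kmers if km in sample_counts]
    if not coverages:
        raise ValueError("no reference k-mers found in the sample")
    return statistics.mean(coverages), statistics.pstdev(coverages)


# Toy counts, invented for the example.
sample_counts = {"ACG": 98, "CGT": 102, "GTA": 100, "TTT": 3}
reference_kmers = {"ACG", "CGT", "GTA", "AAA"}
mu, sigma = coverage_stats(sample_counts, reference_kmers)
print(mu, round(sigma, 3))
```

Because the statistics are recomputed per sample, differences in coverage between individuals or sequencing centers do not need to be known in advance.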

COI Gene
- The cytochrome c oxidase subunit 1 mitochondrial region ("COI", ~648 base pairs long) has been selected as a "DNA barcode" for taxonomic classification
- The Barcode of Life Data System (BOLD) has 4,954K COI sequences from 168K animal species

Detailed pipeline (diagram)
- Inputs: reads; BOLD/GenBank COI gene
- Jellyfish: k-mers & counts
- COI k-mer coverage distribution
- K-mer classifier: hashtable of k-mers with coverage similar to COI
- Reads → Read filtering (keep if ≥ l − 3k k-mers in hashtable) → Geneious → Circular mtDNA sequence
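The read-filtering rule in the pipeline can be sketched as follows. The sketch assumes that l is the read length and k the k-mer size, and that "hashtable" means the set of k-mers accepted by the k-mer classifier; the function names keep_read and filter_reads and the toy data are hypothetical.

```python
def keep_read(read, kept_kmers, k):
    """Keep a read if at least l - 3k of its k-mers are in kept_kmers (l = len(read)).

    A read of length l contains l - k + 1 k-mers, so the rule tolerates a few
    misses. Note that a read shorter than 3k is always kept, because the
    threshold is then zero or negative.
    """
    hits = sum(
        1
        for i in range(len(read) - k + 1)
        if read[i:i + k] in kept_kmers
    )
    return hits >= len(read) - 3 * k


def filter_reads(reads, kept_kmers, k):
    """Return only the reads that pass keep_read."""
    return [read for read in reads if keep_read(read, kept_kmers, k)]


# Toy data, invented for the example: k = 3, reads of length 12, threshold 12 - 9 = 3.
kept_kmers = {"ACG", "CGT", "GTA", "TAC"}
reads = ["ACGTACGTACGT", "TTTTTTTTTTTT", "ACGTTTTTTTTT"]
filtered = filter_reads(reads, kept_kmers, k=3)
print(filtered)
```

In the pipeline the reads that survive this filter are the ones passed on to the assembler.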

K-mer Classifiers
1. Likelihood ratio k-mer classifier: keep k-mer x if P(x | µ_COI, σ_COI) > P(x | µ_genome, σ_genome)
2. Coverage k-mer classifier: keep k-mer x if |coverage(x) − µ_COI| ≤ 3σ_COI
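The two classifiers can be sketched as simple predicates on a k-mer's coverage. The slide gives only a mean and a standard deviation for each class, so the sketch assumes a normal (Gaussian) density for P; that choice, the function names, and the example parameter values are assumptions rather than details from the talk.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def keep_by_likelihood_ratio(coverage, mu_coi, sigma_coi, mu_genome, sigma_genome):
    """Classifier 1: keep if the coverage is more likely under the COI model."""
    return normal_pdf(coverage, mu_coi, sigma_coi) > normal_pdf(
        coverage, mu_genome, sigma_genome
    )


def keep_by_coverage(coverage, mu_coi, sigma_coi):
    """Classifier 2: keep if the coverage is within 3 standard deviations of the COI mean."""
    return abs(coverage - mu_coi) <= 3 * sigma_coi


# Example parameters, invented for illustration.
mu_coi, sigma_coi = 100.0, 10.0
mu_genome, sigma_genome = 2.0, 1.0

for cov in (2, 95, 130):
    print(
        cov,
        keep_by_likelihood_ratio(cov, mu_coi, sigma_coi, mu_genome, sigma_genome),
        keep_by_coverage(cov, mu_coi, sigma_coi),
    )
```

The coverage classifier needs only the COI statistics, while the likelihood ratio classifier also needs an estimate of the nuclear genome's coverage distribution.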


Human Data
- 4 individuals from the 1000 Genomes Project
  - 2 male, 2 female
  - One male and one female are siblings
- Illumina paired-end reads
  - Up to 5 million reads
  - Read length: 108 bp
  - Insert length: between 100 bp and 600 bp
- Ground truth
  - Generated by the 1000 Genomes Project by mapping reads to the reference genome

Tammar wallaby (Macropus eugenii)
- Illumina paired-end reads
  - 10 million reads
  - Read length: 100 bp
  - Insert lengths: 108 bp and 550 bp
- Ground truth
  - Macropus eugenii voucher ABTC18205 mitochondrion, partial genome
  - Sequence ID: gb|KJ |gb|KJ |
  - Length: 16865

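Results (Human, 2M reads)
Sample   | Classifier | Output # reads | Hisat # reads | Geneious: is circular? | Length | Edit distance
Male 1   | None       | 2,000,000      | 2,920         | Yes                    | 16,570 | 4
Male 1   | Ratio      | 1,994,128      | 2,920         | Yes                    | 16,570 | 4
Male 1   | Coverage   | 1,926,126      | 2,920         | Yes                    | 16,570 | 4
Female 1 | None       | 2,000,000      | 3,796         | No                     |        |
Female 1 | Ratio      | 1,996,658      | 3,796         | No                     |        |
Female 1 | Coverage   | 1,920,104      | 3,796         | No                     |        |
Male 2   | None       | 2,000,000      | 3,788         | Yes                    | 16,    |
Male 2   | Ratio      | 1,995,158      | 3,788         | Yes                    | 16,    |
Male 2   | Coverage   | 1,937,146      | 3,776         | Yes                    | 16,    |
Female 2 | None       | 2,000,000      | 2,936         | Yes                    | 16,569 | 5
Female 2 | Ratio      | 1,997,062      | 2,936         | Yes                    | 16,569 | 5
Female 2 | Coverage   | 1,934,266      | 2,936         | Yes                    | 16,569 | 5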

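Results (Human, 5M reads)
Sample   | Classifier | Output # reads | Hisat # reads | Geneious: is circular? | Length | Edit distance
Male 1   | None       | 5,000,000      | 7,248         | Yes                    | 16,570 | 4
Male 1   | Ratio      | 661,730        | 6,846         | Yes                    | 16,570 | 4
Male 1   | Coverage   | 4,808,080      | 7,238         | Yes                    | 16,570 | 4
Female 1 | None       | 5,000,000      | 9,514         | Yes                    | 16,570 | 4
Female 1 | Ratio      | 546,586        | 8,942         | Yes                    | 16,570 | 4
Female 1 | Coverage   | 4,794,066      | 9,514         | Yes                    | 16,570 | 4
Male 2   | None       | 5,000,000      | 9,864         | Yes                    | 16,568 | 6
Male 2   | Ratio      | 667,100        | 9,668         | Yes                    | 16,568 | 6
Male 2   | Coverage   | 4,829,716      | 9,864         | Yes                    | 16,568 | 6
Female 2 | None       | 5,000,000      | 7,682         | Yes                    | 16,567 | 7
Female 2 | Ratio      | 646,740        | 7,308         | Yes                    | 16,567 | 7
Female 2 | Coverage   | 4,835,622      | 7,682         | Yes                    | 16,567 | 7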

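Results (Tammar, 10M reads)
Insert length | Classifier | Output # reads | Hisat # reads | Geneious: is circular? | Length | Blast % identity | Blast # gaps
500           | None       | 10M            | 7,232         | Yes                    | 16,    |                  |
500           | Ratio      | 642,874        | 7,084         | Yes                    | 16,    |                  |
500           | Coverage   | 795,994        | 6,578         | Yes                    | 16,    |                  |
              | None       | 10M            | 7,778         | Yes                    | 16,    |                  |
              | Ratio      | 1,658,232      | 7,720         | Yes                    | 16,    |                  |
              | Coverage   | 9,525,122      | 7,032         | Yes                    | 16,    |                  |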


Conclusion & Future Work
- Preliminary results show a high success rate in assembling complete circular mitogenomes
- Future work:
  - Improve k-mer classifier accuracy by incorporating GC bias
  - Direct comparison with previous methods (mt10k, MITObim, ARC)
  - Application to different sequencing technologies (e.g., Ion Torrent)
  - Detection of heteroplasmies
  - Assembly of mitogenomes from metagenomic samples
  - Assembly of chloroplast genomes from low-coverage DNA sequencing of plants

GC-content bias in coverage

THANK YOU FOR YOUR ATTENTION ANY QUESTIONS?

How does de novo assembly work? Source:

Human mtDNA Copy Number
The mtDNA copy number also varies from tissue to tissue (6970 +/- 920 in heart muscle compared to … +/- 620 in skeletal muscle).
Figure panels: Right Atrium of Heart; Skeletal Muscle
Source: Miller, Francis J., et al. "Precise determination of mitochondrial DNA copy number in human skeletal and cardiac muscle by a PCR-based assay: lack of change of copy number with age." Nucleic Acids Research (2003): e61-e61.