De Novo Assembly of Mitochondrial Genomes from Low-Coverage Whole-Genome Sequencing Reads
Fahad Alqahtani and Ion Mandoiu
University of Connecticut, Computer Science and Engineering Department
Outline: Introduction & prior work; Our approach; Preliminary results; Conclusion and future work
Nuclear Genome Vs. Mitochondrial Genome Source:
mt10k pipeline: 1) read filtering using BLAST against a database of mitogenomes 2) 3 de novo assemblers
mt10k results: 20/60 complete circular mitogenomes. Source:
MITObim: Mitochondrial Baiting and Iterative Mapping. Source:
ARC: Assembly by Reduced Complexity
Steps:
1) align sequence reads to reference sequences of related species
2) use alignment results to distribute reads into target-specific bins
3) perform assemblies for each bin (target) to produce contigs
4) replace previous reference targets with assembled contigs and iterate
Source:
Our Approach
Pipeline: Reads → Read filtering → Geneious → Circular mtDNA sequence
- Unlike previous similarity-based filters, we use k-mer coverage-based read classifiers
- Less biased than use of a related species' genome
- Final assembly is done with a circular-genome-aware assembler (Geneious)
K-mer coverage histogram
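A k-mer coverage histogram counts, for each coverage value c, how many distinct k-mers occur exactly c times in the reads. The sketch below is a toy pure-Python stand-in for illustration only: the function names are hypothetical, the pipeline in these slides uses Jellyfish for counting, and the sketch does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(kmer_counts):
    """Map each coverage value c to the number of distinct k-mers seen c times."""
    return Counter(kmer_counts.values())


# Toy input: two overlapping reads plus one low-complexity read.
reads = ["ACGTAC", "CGTACG", "TTTTTT"]
counts = count_kmers(reads, k=3)
hist = coverage_histogram(counts)
# Here ACG, CGT, GTA and TAC each occur twice and TTT occurs four times,
# so hist[2] == 4 and hist[4] == 1.
```

In a real sample, k-mers from the high-copy mitogenome would be expected to sit at much higher coverage than k-mers from the nuclear genome, which is what makes coverage usable as a filter.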
Coverage is Uniform across the Mitogenome, but Varies between Individuals & Sequencing Centers
Source: Li, Mingkun, et al. "Transmission of human mtDNA heteroplasmy in the Genome of the Netherlands families: support for a variable-size bottleneck." Genome Research 26.4 (2016).
So, we need to learn the coverage from each sample.
COI Gene
The cytochrome c oxidase subunit 1 mitochondrial region ("COI", ~648 base pairs long) has been selected as a "DNA barcode" for taxonomic classification.
The Barcode of Life Data System (BOLD) has 4,954K COI sequences from 168K animal species.
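One way to learn a per-sample mitochondrial coverage from COI is to look up the sample's k-mer counts for the k-mers that occur in COI reference sequences and summarize them by mean and standard deviation. The sketch below assumes this reading; the function name, the choice to skip COI k-mers absent from the sample, and the use of the population standard deviation are illustrative assumptions, not details given in the slides.

```python
from statistics import mean, pstdev


def coi_coverage_stats(coi_sequences, sample_kmer_counts, k):
    """Return (mean, std) of sample coverage over k-mers found in COI references."""
    coverages = []
    seen = set()
    for seq in coi_sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in seen:
                continue  # count each distinct COI k-mer once
            seen.add(kmer)
            c = sample_kmer_counts.get(kmer, 0)
            if c > 0:  # assumption: ignore COI k-mers absent from this sample
                coverages.append(c)
    if not coverages:
        raise ValueError("no COI k-mers found in the sample")
    return mean(coverages), pstdev(coverages)


# Toy example: three COI k-mers with sample coverages 10, 12 and 14.
sample_counts = {"ACG": 10, "CGT": 12, "GTA": 14, "TTT": 2}
mu_coi, sigma_coi = coi_coverage_stats(["ACGTA"], sample_counts, k=3)
```

The resulting pair plays the role of µ_COI and σ_COI in the k-mer classifiers.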
Detailed pipeline (diagram)
Components: BOLD/GenBank COI gene; k-mers & counts (Jellyfish); COI k-mer coverage distribution; k-mer classifier; hashtable of k-mers with coverage similar to COI
Main flow: Reads → Read filtering → Geneious → Circular mtDNA sequence
Read filtering rule: keep a read if ≥ l-3k of its k-mers are in the hashtable
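The read filtering rule can be sketched as follows. The slide does not define l and k, so this sketch assumes l is the read length and k is the k-mer size; the function names are hypothetical and a Python set stands in for the hashtable.

```python
def read_kmers(read, k):
    """List the overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def keep_read(read, kept_kmers, k):
    """Keep the read if at least l - 3k of its k-mers are in kept_kmers.

    Assumption: l is the read length. A read of length l has l - k + 1
    k-mers, so the rule tolerates up to 2k + 1 missing k-mers.
    """
    threshold = len(read) - 3 * k
    hits = sum(1 for kmer in read_kmers(read, k) if kmer in kept_kmers)
    return hits >= threshold


def filter_reads(reads, kept_kmers, k):
    """Return only the reads that pass keep_read."""
    return [r for r in reads if keep_read(r, kept_kmers, k)]


# Toy example with k = 2 and a read of length 10 (threshold = 10 - 6 = 4).
read = "ACGTACGTAC"
passes = keep_read(read, {"AC", "CG"}, k=2)  # 5 matching k-mers
fails = keep_read(read, {"TA"}, k=2)         # 2 matching k-mers
```

Note that under this reading any read with l ≤ 3k always passes, so the rule is only selective when reads are longer than three k-mer lengths.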
K-mer Classifiers
1. Likelihood Ratio K-mer Classifier: keep k-mer x if P(x | µ_COI, σ_COI) > P(x | µ_genome, σ_genome)
2. Coverage K-mer Classifier: keep k-mer x if |coverage(x) - µ_COI| ≤ 3σ_COI
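Both classifiers can be sketched in a few lines. The slide does not state the form of P, so this sketch assumes a normal (Gaussian) density over the coverage of k-mer x; the function names and the example parameter values are hypothetical. Densities are compared in log space to avoid floating-point underflow for coverages far from either mean.

```python
import math


def normal_log_pdf(x, mu, sigma):
    """Log-density of a normal distribution (assumed model, not from the slides)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def keep_by_likelihood_ratio(cov, mu_coi, sigma_coi, mu_genome, sigma_genome):
    """Keep the k-mer if its coverage is more likely under COI than under the genome."""
    return normal_log_pdf(cov, mu_coi, sigma_coi) > normal_log_pdf(cov, mu_genome, sigma_genome)


def keep_by_coverage(cov, mu_coi, sigma_coi):
    """Keep the k-mer if its coverage is within 3 standard deviations of the COI mean."""
    return abs(cov - mu_coi) <= 3 * sigma_coi


# Hypothetical parameters: COI coverage ~ (100, 10), genome-wide coverage ~ (2, 1.5).
ratio_high = keep_by_likelihood_ratio(95, 100, 10, 2, 1.5)  # coverage near COI mean
ratio_low = keep_by_likelihood_ratio(3, 100, 10, 2, 1.5)    # coverage near genome mean
cov_edge = keep_by_coverage(130, 100, 10)                   # exactly 3 sigma away
cov_out = keep_by_coverage(131, 100, 10)                    # just beyond 3 sigma
```

Under these assumptions the coverage classifier uses only the COI parameters, while the likelihood ratio classifier also needs an estimate of the genome-wide coverage distribution.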
Human Data
- 4 individuals from the 1000 Genomes Project: 2 male, 2 female; one male and one female are siblings
- Illumina paired-end reads, up to 5 million reads
- Read length: 108 bp
- Insert length: between 100 bp and 600 bp
- Ground truth: generated by the 1000 Genomes Project by mapping reads to the reference genome
Tammar wallaby (Macropus eugenii)
- Illumina paired-end reads, 10 million reads
- Read length: 100 bp
- Insert lengths: 108 bp and 550 bp
- Ground truth: Macropus eugenii voucher ABTC18205 mitochondrion, partial genome. Sequence ID: gb|KJ |gb|KJ | Length: 16865
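Results (Human, 2M reads)
| Sample   | Classifier | Classifier output (# reads) | Hisat (# reads) | Geneious: is circular? | Length | Edit distance |
| Male 1   | None       | 2,000,000 | 2,920 | Yes | 16,570 | 4 |
| Male 1   | Ratio      | 1,994,128 | 2,920 | Yes | 16,570 | 4 |
| Male 1   | Coverage   | 1,926,126 | 2,920 | Yes | 16,570 | 4 |
| Female 1 | None       | 2,000,000 | 3,796 | No  |        |   |
| Female 1 | Ratio      | 1,996,658 | 3,796 | No  |        |   |
| Female 1 | Coverage   | 1,920,104 | 3,796 | No  |        |   |
| Male 2   | None       | 2,000,000 | 3,788 | Yes | 16,    |   |
| Male 2   | Ratio      | 1,995,158 | 3,788 | Yes | 16,    |   |
| Male 2   | Coverage   | 1,937,146 | 3,776 | Yes | 16,    |   |
| Female 2 | None       | 2,000,000 | 2,936 | Yes | 16,569 | 5 |
| Female 2 | Ratio      | 1,997,062 | 2,936 | Yes | 16,569 | 5 |
| Female 2 | Coverage   | 1,934,266 | 2,936 | Yes | 16,569 | 5 |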
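Results (Human, 5M reads)
| Sample   | Classifier | Classifier output (# reads) | Hisat (# reads) | Geneious: is circular? | Length | Edit distance |
| Male 1   | None       | 5,000,000 | 7,248 | Yes | 16,570 | 4 |
| Male 1   | Ratio      | 661,730   | 6,846 | Yes | 16,570 | 4 |
| Male 1   | Coverage   | 4,808,080 | 7,238 | Yes | 16,570 | 4 |
| Female 1 | None       | 5,000,000 | 9,514 | Yes | 16,570 | 4 |
| Female 1 | Ratio      | 546,586   | 8,942 | Yes | 16,570 | 4 |
| Female 1 | Coverage   | 4,794,066 | 9,514 | Yes | 16,570 | 4 |
| Male 2   | None       | 5,000,000 | 9,864 | Yes | 16,568 | 6 |
| Male 2   | Ratio      | 667,100   | 9,668 | Yes | 16,568 | 6 |
| Male 2   | Coverage   | 4,829,716 | 9,864 | Yes | 16,568 | 6 |
| Female 2 | None       | 5,000,000 | 7,682 | Yes | 16,567 | 7 |
| Female 2 | Ratio      | 646,740   | 7,308 | Yes | 16,567 | 7 |
| Female 2 | Coverage   | 4,835,622 | 7,682 | Yes | 16,567 | 7 |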
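Results (Tammar, 10M reads)
| Insert length | Classifier | Classifier output (# reads) | Hisat (# reads) | Geneious: is circular? | Length | Blast: % identity | Blast: # gaps |
| 500 | None     | 10M       | 7,232 | Yes | 16, |  |  |
| 500 | Ratio    | 642,874   | 7,084 | Yes | 16, |  |  |
| 500 | Coverage | 795,994   | 6,578 | Yes | 16, |  |  |
|     | None     | 10M       | 7,778 | Yes | 16, |  |  |
|     | Ratio    | 1,658,232 | 7,720 | Yes | 16, |  |  |
|     | Coverage | 9,525,122 | 7,032 | Yes | 16, |  |  |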
Conclusion & Future Work
Preliminary results show a high success rate in assembling complete circular mitogenomes.
Future work:
- Improve k-mer classifier accuracy by incorporating GC bias
- Direct comparison with previous methods (mt10k, MITObim, ARC)
- Application to different sequencing technologies (e.g., Ion Torrent)
- Detection of heteroplasmies
- Assembly of mitogenomes from metagenomic samples
- Assembly of chloroplast genomes from low-coverage DNA sequencing of plants
GC-content bias in coverage
Thank you for your attention. Any questions?
How does de novo assembly work? Source:
Human mtDNA Copy Number
The mtDNA copy number also varies from tissue to tissue (6970 +/- 920 in heart muscle compared to /- 620 in skeletal muscle).
Figure panels: Right Atrium of Heart; Skeletal Muscle
Source: Miller, Francis J., et al. "Precise determination of mitochondrial DNA copy number in human skeletal and cardiac muscle by a PCR-based assay: lack of change of copy number with age." Nucleic Acids Research (2003): e61-e61.