A Maximum Likelihood Method for Quasispecies Reconstruction Nicholas Mancuso, Georgia State University Bassam Tork, Georgia State University Pavel Skums,


A Maximum Likelihood Method for Quasispecies Reconstruction
Nicholas Mancuso, Georgia State University
Bassam Tork, Georgia State University
Pavel Skums, Centers for Disease Control
Lilia Ganova-Raeva, Centers for Disease Control
Ion Mandoiu, University of Connecticut
Alex Zelikovsky, Georgia State University
CANGS 2012

Outline
Background
Quasispecies spectrum reconstruction from amplicon NGS reads
Ongoing and future work

Cost of DNA Sequencing

Cost/Performance Comparison [Glenn 2011]

Viral Quasispecies and NGS
RNA Viruses
  o HIV, HCV, SARS, Influenza
  o Higher (than DNA) mutation rates
  o ➔ quasispecies: a set of closely related variants rather than a single species
Knowing the quasispecies can help
  o Interferon HCV therapy effectiveness (Skums et al 2011)
NGS makes it possible to find individual quasispecies sequences
  o 454 Life Sciences : Mb with reads bp long
Sequencing is challenging
  o multiple quasispecies
  o qsps sequences are very similar: different qsps may be indistinguishable for > 1kb (longer than reads)

Shotgun vs. Amplicon Reads
Shotgun reads
  o starting positions distributed ~uniformly
Amplicon reads
  o reads have predefined start/end positions covering fixed overlapping windows

Quasispecies Spectrum Reconstruction (QSR) Problem
Given
  o Shotgun/amplicon pyrosequencing reads from a quasispecies population of unknown size and distribution
Reconstruct the quasispecies spectrum
  o Sequences
  o Frequencies

Prior Work
Eriksson et al 2008
  o maximum parsimony using Dilworth’s theorem, clustering, EM
Westbrooks et al
  o min-cost network flow
Zagordi et al (ShoRAH)
  o probabilistic clustering based on a Dirichlet process mixture
Prosperi et al 2011 (amplicon based)
  o based on a measure of population diversity
Huang et al 2011 (QColors)
  o parsimonious reconstruction of quasispecies subsequences using constraint programming within regions with sufficient variability

Outline
Background
Quasispecies spectrum reconstruction from amplicon NGS reads
Ongoing and future work

Amplicon Sequencing Challenges
Distinct quasispecies may be indistinguishable in an amplicon interval
Multiple reads from consecutive amplicons may match over their overlap

Prosperi et al 2011
First published approach for amplicons
Based on the idea of a guide distribution
  o choose the most variable amplicon
  o extend to right/left with matching reads, breaking ties by rank

Read Graph for Amplicons
K amplicons → K-staged read graph
  o vertices → distinct reads
  o edges → reads with consistent overlap
  o vertices and edges have a count function
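
A minimal sketch of how such a staged read graph could be built, not the authors' implementation. It assumes the reads are already grouped per amplicon as read-to-count mappings, that consecutive amplicons overlap by a fixed number of bases (the hypothetical parameter `overlap`, which must be at least 1), and that "consistent overlap" means the suffix of the left read equals the prefix of the right read. Edge counts are left out for brevity.

```python
from collections import Counter

def build_read_graph(stages, overlap):
    """Build a K-staged read graph.

    stages  -- list of Counter objects, one per amplicon, mapping a distinct
               read (string) to its observed count (assumed input layout)
    overlap -- number of bases shared by consecutive amplicons (assumed fixed, >= 1)

    Returns (vertices, edges): vertices maps (stage, read) -> count and
    edges is a list of ((stage, read), (stage + 1, read)) pairs.
    """
    vertices = {}
    for k, reads in enumerate(stages):
        for read, count in reads.items():
            vertices[(k, read)] = count

    edges = []
    for k in range(len(stages) - 1):
        for left in stages[k]:
            for right in stages[k + 1]:
                # Reads are consistent if they agree on the shared window.
                if left[-overlap:] == right[:overlap]:
                    edges.append(((k, left), (k + 1, right)))
    return vertices, edges
```

With two amplicons and a 2-base overlap, the reads `ACGT` and `GTTT` would be joined by an edge, while `ACGT` and `GATT` would not.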

Read Graph
May transform bi-cliques into 'fork' subgraphs
  o common overlap is represented by the fork vertex

Observed vs Ideal Read Frequencies
Ideal frequency
  o consistent frequency across forks
Observed frequency (count)
  o inconsistent frequency across forks

Fork Balancing Problem
Given
  o Set of reads and respective frequencies
Find
  o Minimal frequency offsets balancing all forks
Simplest approach is to scale frequencies from left to right
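
The left-to-right scaling idea can be sketched for a single fork as follows. This is an illustration under assumptions: a fork is taken to be balanced when the total count of its left reads equals the total count of its right reads, and the function name and dictionary layout are hypothetical.

```python
def balance_fork_by_scaling(left, right):
    """Rescale the right-hand reads of one fork so their total matches the left.

    left, right -- dicts mapping read -> observed count (assumed layout)
    Returns a new dict with the rescaled right-hand counts.
    """
    total_left = sum(left.values())
    total_right = sum(right.values())
    if total_right == 0:
        raise ValueError("right side of the fork has no reads to scale")
    factor = total_left / total_right
    # Every right-hand read keeps its relative share; only the total changes.
    return {read: count * factor for read, count in right.items()}
```

Applying this fork by fork from the first amplicon to the last propagates the totals of the leftmost stage through the graph, which is simple but lets early sampling noise influence every later stage.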

Least Squares Balancing
Quadratic Program for read offsets
  o q: fork, o_i: observed frequency, x_i: frequency offset
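
The quadratic program itself is not preserved in this transcript. One plausible reading, stated here as an assumption, is to minimize the sum of squared offsets x_i subject to each fork q being balanced, i.e. the corrected frequencies o_i + x_i on the left of q summing to the same total as those on the right. For a single fork in isolation this has a closed form: the imbalance is spread evenly over all reads of the fork. The sketch below implements only that single-fork case and ignores non-negativity of the corrected frequencies.

```python
def least_squares_offsets(left_obs, right_obs):
    """Minimum-norm offsets balancing one fork (single-fork special case).

    left_obs, right_obs -- lists of observed frequencies o_i on each side
    Returns (left_x, right_x), the offsets x_i to add to each observation.
    """
    n = len(left_obs) + len(right_obs)
    if n == 0:
        raise ValueError("fork has no reads")
    # Imbalance the offsets must absorb: right total minus left total.
    d = sum(right_obs) - sum(left_obs)
    # Minimizing sum(x_i^2) under one linear constraint spreads d evenly.
    left_x = [d / n] * len(left_obs)
    right_x = [-d / n] * len(right_obs)
    return left_x, right_x
```

In a full read graph the forks share reads, so the constraints are coupled and a general QP solver would be needed; the closed form above is only meant to show what "least squares balancing" does to one fork.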

Fork Resolution: Parsimony
[Figure: two candidate fork resolutions, (a) and (b)]

Fork Resolution: Max Likelihood
Observation
  o Potential quasispecies has extra bases in overlap
  o There must be at least two instances of this quasispecies to produce both of these reads
Assumption
  o Solution is a forest

Fork Resolution: Max Likelihood
Given a forest, ML = # of ways to produce observed reads / 2^(#qsp)
Can be computed efficiently for trees: multiply by the binomial coefficient of a leaf and its parent edge, prune the edge, and iterate
Solution (b) has a larger likelihood than (a), although both have 3 qsp’s
  (a) (4 choose 2) * (8 choose 4) * (8 choose 4) / 2^20 = 29400/2^20 ~ 2.8%
  (b) (12 choose 6) * (4 choose 2) * (4 choose 2) / 2^20 = 33264/2^20 ~ 3.2%
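
The arithmetic of the two candidate solutions can be checked with a few lines of Python. This is only a numerical check of the worked example: the binomial terms and the exponent 20 are taken directly from the slide, while the function name and argument layout are hypothetical.

```python
from math import comb

def forest_likelihood(binomial_terms, exponent):
    """Multiply the binomial coefficients and divide by 2**exponent.

    binomial_terms -- list of (n, k) pairs, one per pruned leaf edge
    exponent       -- power of two in the denominator (20 in the slide example)
    Returns (number_of_ways, likelihood).
    """
    ways = 1
    for n, k in binomial_terms:
        ways *= comb(n, k)
    return ways, ways / 2 ** exponent

ways_a, like_a = forest_likelihood([(4, 2), (8, 4), (8, 4)], 20)
ways_b, like_b = forest_likelihood([(12, 6), (4, 2), (4, 2)], 20)
print(ways_a, ways_b, like_b > like_a)
```

The numerators come out as 29400 for (a) and 33264 for (b), so (b) is the more likely resolution.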

Fork Resolution: Min Entropy
Solution (b) also has a lower entropy than (a)
  (a) -[ (8/20)log(8/20) + (8/20)log(8/20) + (4/20)log(4/20) ]
  (b) -[ (12/20)log(12/20) + (4/20)log(4/20) + (4/20)log(4/20) ]
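
The entropy comparison can be reproduced numerically. The slide does not state the base of the logarithm, so base 2 (bits) is an assumption here; the ordering of the two solutions does not depend on the base.

```python
from math import log2

def entropy_bits(counts):
    """Shannon entropy, in bits, of the frequency distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

h_a = entropy_bits([8, 8, 4])    # solution (a): frequencies 8/20, 8/20, 4/20
h_b = entropy_bits([12, 4, 4])   # solution (b): frequencies 12/20, 4/20, 4/20
print(round(h_a, 2), round(h_b, 2))
```

Under the base-2 assumption this gives about 1.52 bits for (a) and about 1.37 bits for (b).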

Fork Resolution: Min Entropy
Local Resolution
  o Greedily match maximum count reads in overlap
  o Repeat for all forks until graph is fully resolved
Global Resolution
  o Maximum bandwidth paths
  o Find s-t path, reduce counts by minimum edge, repeat until exhausted

Local Optimization: Greedy Method

Greedy Method
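
A sketch of the greedy local step for one fork, under assumptions: the fork is given as two read-to-count dictionaries with integer counts, and "greedily match maximum count reads" is interpreted as repeatedly pairing the currently largest left read with the currently largest right read. The function name is hypothetical and the tie-breaking rule (first key encountered) is arbitrary.

```python
def greedy_resolve_fork(left, right):
    """Greedily pair left and right reads of one fork by maximum count.

    left, right -- dicts mapping read -> integer count (assumed layout)
    Returns a list of (left_read, right_read, matched_count) triples.
    """
    left, right = dict(left), dict(right)   # work on copies
    matches = []
    while left and right:
        l = max(left, key=left.get)
        r = max(right, key=right.get)
        m = min(left[l], right[r])
        matches.append((l, r, m))
        # Reduce both sides by the matched amount and drop exhausted reads.
        for side, key in ((left, l), (right, r)):
            side[key] -= m
            if side[key] == 0:
                del side[key]
    return matches
```

For a fork with left counts 8 and 4 and right counts 6 and 6, this yields three pairings with counts 6, 4 and 2.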

Global Optimization: Maximum Bandwidth

Maximum Bandwidth Method
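
The global resolution loop (find an s-t path, reduce counts by its minimum edge, repeat until exhausted) can be sketched as below. This is an illustration, not the authors' code: the graph is assumed to be a nested dictionary of remaining edge counts with an added source and sink, and the maximum bandwidth path is found with a standard widest-path variant of Dijkstra's algorithm.

```python
import heapq

def widest_path(cap, source, sink):
    """Return (bottleneck, path) for the s-t path maximizing the minimum edge count."""
    best = {source: float("inf")}
    parent = {}
    heap = [(-best[source], source)]
    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(u, 0):
            continue                      # stale heap entry
        if u == sink:
            break
        for v, c in cap.get(u, {}).items():
            w = min(width, c)
            if w > best.get(v, 0):        # edges with zero count are never used
                best[v] = w
                parent[v] = u
                heapq.heappush(heap, (-w, v))
    if best.get(sink, 0) <= 0:
        return 0, None
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[sink], path[::-1]

def max_bandwidth_decomposition(cap, source, sink):
    """Repeatedly extract the widest s-t path and subtract its bottleneck."""
    cap = {u: dict(vs) for u, vs in cap.items()}   # do not modify the input
    result = []
    while True:
        width, path = widest_path(cap, source, sink)
        if path is None:
            break
        for u, v in zip(path, path[1:]):
            cap[u][v] -= width
        result.append((path, width))
    return result
```

Each extracted path corresponds to one candidate quasispecies and its bottleneck to that candidate's count. The loop terminates because every iteration drives at least one edge count to zero.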

Experimental Setup
Error free reads simulated from 1739bp long fragments of HCV quasispecies
  - Frequency distributions: uniform, geometric, …
  - 5k-100k reads
  - Amplicon width = 300bp
  - Shift (= width – overlap, i.e., how much to slide the next amplicon) between 50 and 250
Quality measures
  - Sensitivity
  - PPV
  - Jensen-Shannon divergence
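
The slides name the quality measures without defining them. The sketch below uses the usual definitions as an assumption: sensitivity is the fraction of true quasispecies that are reconstructed exactly, PPV is the fraction of reconstructed sequences that are true, and the Jensen-Shannon divergence (base 2 here) compares the true and estimated frequency distributions. Function names are hypothetical.

```python
from math import log2

def sensitivity_ppv(true_seqs, predicted_seqs):
    """Sensitivity and PPV based on exact sequence matches (assumed criterion)."""
    true_set, pred_set = set(true_seqs), set(predicted_seqs)
    hits = len(true_set & pred_set)
    return hits / len(true_set), hits / len(pred_set)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two dicts sequence -> frequency."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture distribution.
        return sum(a[k] * log2(a[k] / mix[k]) for k in keys if a.get(k, 0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

With base-2 logarithms the divergence ranges from 0 for identical distributions to 1 for distributions with no sequence in common.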

Sensitivity for 100k Reads (Uniform Qsps)

PPV for 100k Reads (Uniform Qsps)

JS Divergence for 100k Reads (Uniform Qsps)

Amplicon vs. Shotgun Reads (avg. sensitivity/PPV over 10 runs)

Real HBV Data
Real HBV data from two patients
Sequenced using GS FLX LR25
Twenty-five amplicons were generated
Error correction with KEC (Skums 2011)
Aligned with MosaikAligner tool

Real HBV Data

Outline
Background
Quasispecies spectrum reconstruction from amplicon NGS reads
Ongoing and future work

Ongoing and Future Work
Correction for coverage bias
Comparison of shotgun and amplicon based reconstruction methods on real data
Quasispecies reconstruction from Ion Torrent reads
Combining long and short read technologies
Optimization of vaccination strategies

Acknowledgements
University of Connecticut
  o Rachel O’Neill, Ph.D.
  o Mazhar Khan, Ph.D.
  o Hongjun Wang, Ph.D.
  o Craig Obergfell
  o Andrew Bligh
Georgia State University
  o Alex Zelikovsky, Ph.D.
  o Bassam Tork
  o Nicholas Mancuso
  o Serghei Mangul
University of Maryland
  o Irina Astrovskaya, Ph.D.
Centers for Disease Control and Prevention
  o Pavel Skums, Ph.D.