ICCABS 2013 kGEM: An EM-based Algorithm for Local Reconstruction of Viral Quasispecies Alexander Artyomenko.


Introduction
Reconstructing the spectrum of a viral population is a well-motivated task for epidemiology.
Challenges:
- Assembling short reads to span the entire genome
- Distinguishing sequencing errors from mutations
Avoid assembly: identify sequences via a high-variability region.
Sequencing technology cannot read the entire coding region in one piece, so short reads have to be assembled, which is a major challenge in itself. Sequencing errors add a second challenge.

Previous Work
KEC (k-mer Error Correction) [Skums et al.]
- Incorporates counts (frequencies) of k-mers (substrings of length k)
QuasiRecomb (Quasispecies Recombination) [Töpfer et al.]
- Hidden Markov Model-based approach
- Incorporates the possibility of recombinant progeny
- Parameter: k generators (ancestor haplotypes)

Problem Formulation
Given: a set of reads R emitted by a set of unknown haplotypes H’
Find: a set of haplotypes H = {H1, …, Hk} maximizing Pr(R|H)

Fractional Haplotype
Fractional haplotype: a string of 5-tuples of probabilities, one 5-tuple per position, giving the probability of each possible symbol: a, c, t, g, d = ‘-’.
[Table: example fractional haplotype shown with the string a c - t c t g c; rows are the symbols a, c, t, g, d and columns are positions.]

kGEM
Initialize (fractional) haplotypes
Repeat until haplotypes are unchanged
Estimate Pr(r|Hi), the probability of a read r being emitted by haplotype Hi
Estimate frequencies of haplotypes
Update and round haplotypes
Collapse identical and drop rare haplotypes
Output haplotypes

Initialization
Find a set of reads representing the haplotype population:
- Start with a random read.
- Each subsequent read maximizes the minimum distance to the previously chosen reads.
[Figure: illustration with items labeled 1 to 4.]
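The selection rule on this slide can be sketched as a greedy farthest-point procedure. This is a minimal illustration, not the authors' code: it assumes reads are aligned strings of equal length and uses Hamming distance, which the slide does not specify. The names `hamming` and `select_seed_reads` are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length aligned reads differ."""
    return sum(x != y for x, y in zip(a, b))


def select_seed_reads(reads, k, rng=None):
    """Pick up to k distinct reads (k >= 1): a random first one, then
    repeatedly the read whose minimum distance to the chosen reads is largest.

    Assumption: Hamming distance on aligned reads (not stated on the slide).
    """
    rng = rng or random.Random()
    unique = sorted(set(reads))
    k = min(k, len(unique))
    chosen = [rng.choice(unique)]
    # min_dist[r] = distance from read r to its nearest already-chosen read
    min_dist = {r: hamming(r, chosen[0]) for r in unique}
    while len(chosen) < k:
        nxt = max(unique, key=lambda r: min_dist[r])
        chosen.append(nxt)
        for r in unique:
            min_dist[r] = min(min_dist[r], hamming(r, nxt))
    return chosen
```

Keeping a running minimum distance per read avoids recomputing all pairwise distances each time a new seed is added.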

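Initialization
Transform the selected reads into fractional haplotypes: at each position m, the symbol s_m receives probability 1 − 4ε and each of the other four symbols receives ε, where s_m is the m-th nucleotide of the selected read s.
Example with ε = 0.01: for the read a c - t g - g a - c, the row for symbol a begins 0.96, 0.01.
[Table: rows a, c, t, g, d; columns are positions of the read.]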
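A small sketch of this transformation follows. The rule (1 − 4ε for the read's own symbol, ε for the rest) is inferred from the slide's example values of 0.96 and 0.01 at ε = 0.01; the list-of-dicts layout and the name `to_fractional` are assumptions made for illustration.

```python
ALPHABET = "acgt-"  # '-' plays the role of the deletion symbol d


def to_fractional(read, eps=0.01):
    """Turn a read into a fractional haplotype: one dict per position that
    maps each of the five symbols to a probability.

    Assumed rule: the read's own symbol gets 1 - 4*eps, every other symbol eps.
    """
    return [
        {e: (1 - 4 * eps if e == s else eps) for e in ALPHABET}
        for s in read
    ]
```

Because ε is strictly positive, no symbol ever has probability zero, so a read that disagrees with the seed at some position can still be assigned to it.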

Read Emission Probability
For each i = 1, …, k and for each read rj from R, compute the probability Pr(rj|Hi) that haplotype Hi emits read rj.
[Figure: reads 1, 2, 3 linked to haplotypes by the values h1,1, h2,1, h3,1, h1,2, h2,2, h3,2.]
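The formula itself did not survive in the transcript. One natural reading, assumed here, is that positions are independent, so the emission probability is the product over positions of the probability the fractional haplotype assigns to the read's symbol. The name `emission_prob` and the list-of-dicts layout are hypothetical.

```python
import math


def emission_prob(read, haplotype):
    """Pr(read | haplotype) under an assumed position-independence model.

    haplotype: one dict per position mapping symbol -> probability.
    read: aligned string of the same length.
    """
    return math.prod(col[sym] for sym, col in zip(read, haplotype))
```

For long reads with many mismatches this product can become very small; summing logarithms instead is a common alternative if underflow is a concern.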

Estimate Frequencies
Estimate haplotype frequencies via the Expectation Maximization (EM) method.
Repeat two steps until the change < σ:
- E-step: expected portion of r emitted by Hi
- M-step: updated frequency of haplotype Hi
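The two steps can be sketched with the standard EM updates for mixture proportions. The slide's own formulas are not in the transcript, so the updates below are the textbook ones and are assumed to match. The function name, the uniform starting point, and the `max_iter` cap are illustrative choices.

```python
def estimate_frequencies(h, sigma=1e-6, max_iter=1000):
    """EM for haplotype frequencies.

    h[i][j] = Pr(read j | haplotype i); every read is assumed to have
    nonzero probability under at least one haplotype.
    Returns (f, p): frequencies f[i] and, from the last E-step,
    p[i][j] = expected portion of read j emitted by haplotype i.
    """
    k, n = len(h), len(h[0])
    f = [1.0 / k] * k  # assumed starting point: uniform frequencies
    p = [[0.0] * n for _ in range(k)]
    for _ in range(max_iter):
        # E-step: share of each read attributed to each haplotype
        for j in range(n):
            total = sum(f[i] * h[i][j] for i in range(k))
            for i in range(k):
                p[i][j] = f[i] * h[i][j] / total
        # M-step: frequency = average share over all reads
        new_f = [sum(p[i]) / n for i in range(k)]
        change = max(abs(a - b) for a, b in zip(f, new_f))
        f = new_f
        if change < sigma:
            break
    return f, p
```

If identical reads are stored once with a count, the M-step average would need to be weighted by those counts.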

Update Haplotypes
Update the allele frequencies of each haplotype according to each read’s contribution.
[Table: example updated fractional haplotype; rows are the symbols a, c, t, g, d and columns are positions.]
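A possible form of this update is sketched below. The slide's formula is missing from the transcript, so the rule used here is an assumption: at each position, the probability of a symbol is the weighted share of reads showing that symbol, with weights equal to the expected portions from the E-step. The name `update_haplotype` is hypothetical.

```python
ALPHABET = "acgt-"


def update_haplotype(reads, weights):
    """Recompute one fractional haplotype from the reads.

    weights[j] = expected portion of read j emitted by this haplotype
    (assumed to sum to a positive number).
    """
    total = sum(weights)
    haplotype = []
    for m in range(len(reads[0])):
        col = {e: 0.0 for e in ALPHABET}
        for read, w in zip(reads, weights):
            col[read[m]] += w  # each read votes for its symbol at position m
        haplotype.append({e: v / total for e, v in col.items()})
    return haplotype
```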

Round Haplotypes
Round each position of each haplotype to its most probable allele.
[Figure: a fractional haplotype table (rows a, c, t, g, d) shown alongside its rounded string a c - t g … and a table with entries 0.96 and 0.01.]
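Rounding amounts to taking the most probable symbol at each position. A minimal sketch, assuming the list-of-dicts layout and a hypothetical name `round_haplotype`:

```python
def round_haplotype(haplotype):
    """Return the string made of the most probable symbol at each position.

    Ties are broken by dict order, which is an arbitrary choice here.
    """
    return "".join(max(col, key=col.get) for col in haplotype)
```

The slide's example appears to show the rounded string written back as a table of 0.96 and 0.01 entries, which would correspond to reapplying the initialization rule with ε = 0.01; that step is not included in this sketch.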

Collapse and Drop Rare
- Collapse haplotypes that have the same integral (rounded) strings.
- Drop haplotypes with coverage ≤ δ.
- Empirically, choosing δ < 5 lowers PPV (positive predictive value) without improving sensitivity.
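These two operations can be sketched as follows. The sketch assumes that coverage means the estimated number of reads assigned to a haplotype and that the coverage of collapsed haplotypes is summed; neither detail is stated on the slide. The name `collapse_and_drop` and the default δ = 5 are illustrative.

```python
from collections import defaultdict


def collapse_and_drop(haplotypes, coverage, delta=5):
    """Merge identical rounded haplotypes and discard rare ones.

    haplotypes: rounded haplotype strings.
    coverage:   estimated read count per haplotype (assumed meaning).
    Returns a dict mapping each surviving haplotype to its merged coverage.
    """
    merged = defaultdict(float)
    for hap, cov in zip(haplotypes, coverage):
        merged[hap] += cov  # assumption: identical haplotypes pool coverage
    # keep only haplotypes with coverage strictly greater than delta
    return {hap: cov for hap, cov in merged.items() if cov > delta}
```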


Experimental Setup
- HCV E1E2 sub-region (315 bp)
- 20 simulated data sets of 10 variants
- 100,000 reads from Grinder 0.5
- 10 data sets with homopolymer errors
- Frequency distribution: uniform and power-law model with parameter α = 2.0

Acknowledgements
Nicholas Mancuso, Alex Zelikovsky, Ion Măndoiu, Pavel Skums

Thank you! Questions?