VirVarSeq vs ViVaMBC

Presentation transcript:

Statistical methods for improved variant calling of massively parallel sequencing data. VirVarSeq vs ViVaMBC. Bie Verbist | NCS Brugge | 10-10-2014

OUTLINE: Viral dynamics. Massively parallel sequencing. Variant calling: VirVarSeq, ViVaMBC. Results: HCV plasmids, HCV clinical sample.

Viral dynamics: A virus is a small infectious agent that replicates only inside the living cells of other organisms. High replication rate (10^11 replications a day for HIV). High mutation rate. The viral population consists of closely related subgroups, viral quasispecies, which we want to identify and quantify.

Viral dynamics. [Figure: number of viruses in the population against time, before treatment and on treatment. Labels: heterogeneous viral population, drug-sensitive variants, drug-resistant variant, undetectable.]

Sequencing. Sanger sequencing: detection limit 20-30%; no accurate estimate of frequency. Massively parallel sequencing: detection limit << 20%; more accurate estimate of frequency. Example reads:
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGATTTCTGTCTGGG
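The frequency estimate that massively parallel sequencing makes possible can be illustrated by counting bases per position over the eight example reads on this slide. This is a minimal sketch that assumes the reads are already aligned and of equal length; the function name `base_frequencies` is hypothetical and no sequencing error is modelled.

```python
from collections import Counter

# The eight example reads shown on the slide (assumed already aligned).
reads = [
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGATTTCTGTCTGGG",
]


def base_frequencies(aligned_reads):
    """Return one dict per position mapping each observed base to its frequency."""
    n = len(aligned_reads)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*aligned_reads)
    ]


freqs = base_frequencies(reads)

# Positions (0-based) where more than one base is observed.
variable_sites = {pos: f for pos, f in enumerate(freqs) if len(f) > 1}

for pos, f in variable_sites.items():
    print(pos, f)
```

With only eight reads, a base seen in a single read already has a frequency of 12.5%, so a raw count like this cannot separate a rare variant from a sequencing error. That is the problem the variant callers in this talk address.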

Massively parallel sequencing. [Figure labels: viral population, DNA, fragmentation, fragments, amplification, sequencing by synthesis.] Example, one fragment: T G C C A A A G A C G G T T T C T

Massively parallel sequencing: viral population. Example reads (FASTQ):
@HWUSI-EAS1524:17:FC:1:120:19254:21417 1:N:0:GATCAG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
+
G@GG@GG@GGHHHBH>GEGDGGBGEGG?GGHHHH>GEGBG@?BEF?DBB<GDGGGGFGG3GGEBA>EC:;
@HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG
ATCGGAAGAGCACACGNCTGAACTCCAGTCACGATCAGATCCCGTATGCCGTCTTCTGCTTGAAAAAAAA
+
DDDDDDDDDD2DDDDD#DDDDDDDDDDDDDDDDDDDDDDD2:8:7;<@>;DDDDDDDDDDD:DDDDD###
@HWUSI-EAS1524:17:FC:1:120:12760:21420 1:N:0:GATCAG
ATCATACTGTCTTACTNTGATAAAACCTCCAATTCCCCCTANCATTNTTGGTTNCCATCTTCCTTGCAAA
+
HHHHHHHHHHHHHHHG#GGGFFFF@HHHHHHGHHHHHHHHF#FFEB#BBBA>B#BFFFFFHHHHHHHHHG
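Each FASTQ record pairs a sequence with a string of per-base quality characters. The sketch below parses such records and converts the characters to Phred scores and error probabilities. It assumes four-line records and the Phred+33 encoding, which the slides do not state. The record is a made-up toy example and the function names are hypothetical.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def phred_scores(quality_string, offset=33):
    """Decode quality characters to Phred scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error), so P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Toy record, not taken from the slides.
toy_fastq = "@toy_read_1\nACGNT\n+\nHD#2;\n"

name, seq, qual = next(parse_fastq(toy_fastq))
scores = phred_scores(qual)
for base, q in zip(seq, scores):
    print(base, q, round(error_probability(q), 4))
```

Under this encoding the character `#` decodes to Q2, which is why uncalled bases (`N`) in the reads above coincide with `#` in the quality string.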

Variant calling: Distinguish low-frequency variants from sequencing error. VirVarSeq: adaptive filtering approach based on quality scores. Verbist et al. 2014, Bioinformatics. doi: 10.1093/bioinformatics/btu587. ViVaMBC: model-based clustering approach which models the error probabilities based on quality scores. Verbist et al. 2014, BMC Bioinformatics, under revision.

VirVarSeq: (1) Extract reads that cover the codon of interest. (2) Filter based on the quality scores. (3) Build a codon table. Reads at the position of interest: ... CGA CCA CGT GGA ... Codon table after filtering, position x (codon, frequency): CGA 0.62; CCA 0.25; GGA 0.13; ... (* codon = nucleotide triplet which specifies a single amino acid.)
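The three steps on this slide can be sketched as follows. The exact filtering rule used by VirVarSeq is not given in the slides, so this example assumes a codon is kept only when all three of its bases reach the quality threshold. The observations, the threshold value of 20 and the function name `codon_table` are illustrative assumptions.

```python
from collections import Counter


def codon_table(observations, q_threshold):
    """Build a codon frequency table from (codon, [q1, q2, q3]) observations.

    Assumed rule: a codon is discarded if any of its bases has a quality
    score below q_threshold.
    """
    kept = [codon for codon, quals in observations if min(quals) >= q_threshold]
    if not kept:
        return {}
    total = len(kept)
    return {codon: count / total for codon, count in Counter(kept).most_common()}


# Toy observations at one codon position (not data from the talk).
observations = (
    [("CGA", [38, 39, 37])] * 5
    + [("CCA", [36, 35, 38])] * 2
    + [("GGA", [33, 37, 36])]
    + [("CGT", [38, 37, 2])] * 2   # third base has a very low quality score
)

table = codon_table(observations, q_threshold=20)
for codon, freq in table.items():
    print(codon, freq)
```

In this toy data the CGT codons are removed because their third base falls below the threshold, so they never reach the codon table.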

VirVarSeq: Definition of the Q-threshold (QIT). Fit a mixture distribution on the Q-scores with 3 components: a point probability around Q2, an error distribution, and a reliable-call distribution. The intersection point is the threshold (QIT).
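Once a mixture has been fitted, locating an intersection point is a one-dimensional root-finding problem. The sketch below assumes the threshold is where the weighted densities of the error and reliable-call components cross, and it assumes both components are Gaussian. The slides specify neither the distribution family nor the fitted parameters, so all numbers here are made up and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def intersection_threshold(w_err, mu_err, sd_err, w_ok, mu_ok, sd_ok, tol=1e-9):
    """Find Q between the two means where the weighted densities are equal.

    Uses bisection; assumes the error component dominates at its own mean
    and the reliable-call component dominates at its own mean.
    """
    def diff(q):
        return w_err * normal_pdf(q, mu_err, sd_err) - w_ok * normal_pdf(q, mu_ok, sd_ok)

    lo, hi = mu_err, mu_ok
    if not (diff(lo) > 0 > diff(hi)):
        raise ValueError("components do not cross between their means")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Made-up component parameters, for illustration only.
q_sym = intersection_threshold(0.5, 10.0, 5.0, 0.5, 30.0, 5.0)
q_skew = intersection_threshold(0.2, 10.0, 5.0, 0.8, 30.0, 5.0)
print(q_sym, q_skew)
```

With equal weights and spreads the crossing sits midway between the means. Giving the reliable-call component more weight moves the threshold toward the error component, so fewer bases are filtered.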

ViVaMBC: (1) Extract reads that cover the codon of interest. (2) Perform model-based clustering: model the error probability; the clusters are unknown, so the EM algorithm is used. Reads at the position of interest: ... CGA CCA CGT GGA ... Codon table after clustering, position x (codon, frequency): CGA 0.62; CCA 0.25; GGA 0.13; ... Cluster medoids = variants. Size of cluster = frequency. Number of clusters = number of variants.

Results – HCV plasmids: Two plasmids covering amino acids 1 to 181 of the NS3 region, which differ at two codon positions (36 and 155), were mixed in 4 different proportions.

Results – HCV plasmids: Other variants (11481 max) are false positives. VirVarSeq reports more false positives, with frequencies going up to 0.5%.

Results – HCV clinical sample: VirVarSeq reports more variants. Above 1% the methods are in agreement, and even above 0.5%. A few false positives in the GC region for ViVaMBC? [Figure: VirVarSeq compared with ViVaMBC.]

VirVarSeq vs ViVaMBC: When applying reporting limits of 1% or 0.5%, the methods are in agreement. Below this limit there is a trade-off between sensitivity and specificity, with VirVarSeq less specific. VirVarSeq: adaptive approach; easy development; runs fast. ViVaMBC: more elegant; longer development time; longer run time.

Acknowledgements. Promoters: Prof. Dr. O. Thas (1), Prof. Dr. L. Clement (1) and Prof. Dr. L. Bijnens (2). Yves Wetzels, Tobias Verbeke, Joris Meys (1) for IT support. Scientists within discovery sciences (2). Non-clinical statistics team (2).

Thank you bverbis2@its.jnj.com 10-10-2014

Back-up

ViVaMBC notation: r_i: best base calls of read i (i = 1, ..., n). s_i: second-best base calls of read i (i = 1, ..., n). z_ij: z_ij = 1 when read i belongs to haplotype j (j = 1, ..., k). τ_j: probability to belong to haplotype j. Complete Data Likelihood:

ViVaMBC Complete Data Likelihood: The likelihood depends on the cluster membership z_ij → EM algorithm.
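Because the cluster memberships are unobserved, an EM algorithm alternates between estimating them and re-estimating the haplotype proportions τ. The sketch below is a deliberately simplified stand-in, not the ViVaMBC likelihood, which is not reproduced in this transcript. It assumes a fixed set of candidate haplotype codons, uses only the best base calls, and takes each base's error probability from its Phred score, with an error equally likely to yield any of the three other bases. All names and data are hypothetical.

```python
def read_likelihood(read, quals, haplotype):
    """P(read | haplotype) under an assumed independent per-base error model."""
    p = 1.0
    for base, q, hap_base in zip(read, quals, haplotype):
        e = 10 ** (-q / 10)                     # Phred score to error probability
        p *= (1.0 - e) if base == hap_base else e / 3.0
    return p


def em_frequencies(observations, haplotypes, n_iter=200, tol=1e-10):
    """Estimate mixing proportions tau and soft memberships z by EM."""
    k = len(haplotypes)
    tau = [1.0 / k] * k
    lik = [[read_likelihood(r, q, h) for h in haplotypes] for r, q in observations]
    z = []
    for _ in range(n_iter):
        # E-step: posterior probability that read i belongs to haplotype j.
        z = []
        for row in lik:
            weighted = [t * l for t, l in zip(tau, row)]
            total = sum(weighted)
            z.append([w / total for w in weighted])
        # M-step: tau_j is the average membership over all reads.
        new_tau = [sum(zi[j] for zi in z) / len(z) for j in range(k)]
        converged = max(abs(a - b) for a, b in zip(tau, new_tau)) < tol
        tau = new_tau
        if converged:
            break
    return tau, z


# Toy data: 6 CGA reads, 2 CCA reads, and 1 CGT read whose last base is Q2.
observations = (
    [("CGA", [40, 40, 40])] * 6
    + [("CCA", [40, 40, 40])] * 2
    + [("CGT", [40, 40, 2])]
)
tau, z = em_frequencies(observations, haplotypes=["CGA", "CCA"])
print([round(t, 4) for t in tau])
```

In this toy run the low-quality CGT read is absorbed into the CGA cluster rather than being reported as a separate variant, which is the behaviour that modelling error probabilities from quality scores is meant to give.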

Library preparation Sequencing by synthesis