VARiD: A Variation Detection Framework for Color-space and Letter-space platforms By A.V. Dalca, S. M. Rumble, S. Levy, M. Brudno Presented by Velian.


VARiD: A Variation Detection Framework for Color-space and Letter-space platforms
By A.V. Dalca, S. M. Rumble, S. Levy, M. Brudno
Presented by Velian Pandeliev

VARiD Overview
- Purpose: variation detection (SNPs, indels)
- Pitch: first to use both colour-space and letter-space data
- Principle: Hidden Markov Model with the Forward-Backward algorithm
- Platforms: 454/Roche, Solexa, ABI SOLiD
- Pros: can work with unconverted sets of both formats simultaneously
- Performance: linear in the length of the reference; great on mixed-format data

ABI SOLiD Basics
- Reads bases two at a time
- Outputs one of four colours based on a transition state machine [figure not preserved]
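The state-machine figure is not preserved here, so the following is a minimal sketch of the colour encoding, assuming the standard SOLiD two-base scheme in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the colour is the XOR of two adjacent codes. That assumption agrees with the examples later in the talk (state CA emits colour 1, the pair A G emits colour 2). The names `BASE_CODE`, `colour` and `encode` are illustrative only.

```python
# Assumed standard SOLiD two-base encoding; not taken from the lost figure.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def colour(first: str, second: str) -> int:
    """Colour emitted for a pair of adjacent bases: XOR of their 2-bit codes."""
    return BASE_CODE[first] ^ BASE_CODE[second]

def encode(seq: str) -> list:
    """Translate a letter-space sequence into its colour-space sequence."""
    return [colour(a, b) for a, b in zip(seq, seq[1:])]

print(colour("C", "A"))   # 1, matching the CA example in the emission slide
print(colour("A", "G"))   # 2, matching the microindel example
print(encode("ACGTT"))    # [1, 3, 1, 0]
```

A sequence of n bases yields n - 1 colours, which is why a colour-space read also needs a known first base before it can be translated back to letters.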

ABI SOLiD Properties
- Read errors and SNPs present differently.
[Figure not preserved: reference, error, and SNP examples]
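Since the example reads are not preserved, here is a small sketch of the difference, again assuming the standard XOR-based two-base encoding. Under that encoding an isolated SNP changes the two adjacent colours that overlap the changed base, while a single colour read error changes only one colour. The sequences and helper names are made up for illustration.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Letter-space to colour-space (assumed XOR two-base encoding)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]

def diff_positions(x, y):
    """Indices at which two colour sequences disagree."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]

reference = "ACGTACGT"
ref_colours = encode(reference)            # [1, 3, 1, 3, 1, 3, 1]

# SNP: the donor has T instead of G at letter position 2.
snp_colours = encode("ACTTACGT")           # [1, 2, 0, 3, 1, 3, 1]

# Read error: one colour (index 2) is misread, the donor matches the reference.
err_colours = list(ref_colours)
err_colours[2] = (err_colours[2] + 1) % 4

print(diff_positions(ref_colours, snp_colours))  # [1, 2]: two adjacent colours
print(diff_positions(ref_colours, err_colours))  # [2]: a single colour
```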

ABI SOLiD Properties
- A read error propagates through the rest of the sequence on translation to letter-space.
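The propagation can be seen by decoding a colour read back to letters. This is a sketch under the same assumed XOR encoding; `decode` is a hypothetical helper that rebuilds each base from the previous base and the next colour, so one wrong colour corrupts every base after it.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def encode(seq):
    """Letter-space to colour-space (assumed XOR two-base encoding)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]

def decode(first_base, colours):
    """Colour-space to letter-space, starting from a known first base."""
    letters = [first_base]
    for c in colours:
        letters.append(CODE_BASE[BASE_CODE[letters[-1]] ^ c])
    return "".join(letters)

reference = "ACGTACGT"
colours = encode(reference)
colours[2] = (colours[2] + 1) % 4          # a single colour read error

decoded = decode(reference[0], colours)
mismatches = [i for i, (a, b) in enumerate(zip(reference, decoded)) if a != b]

print(reference)    # ACGTACGT
print(decoded)      # ACGATGCA
print(mismatches)   # [3, 4, 5, 6, 7]: every base after the error is wrong
```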

Consequences
- Colour-space encoding is better suited to calling SNPs than letter-space encoding
- In letter-space data, errors do not propagate through to the rest of the read
- Wouldn’t it be great to have a SNP-calling framework that could use both kinds of data?

VARiD
A Hidden Markov Model for Variation Detection
In general, HMMs have the following elements:
- States (hidden)
- Transitions (probabilities of reaching any particular state from the previous one)
- Emissions (observed outputs)

Building a Basic HMM
States: pairs of consecutive letter-space positions:
S = {AA, AT, AC, AG, TT, TA, TC, TG, CC, CA, CT, CG, GG, GA, GT, GC}

Building a Basic HMM
Transitions: since consecutive states share a nucleotide, probabilities are defined as follows:
$$P(\text{transition } WX \to YZ) = \begin{cases} \text{frequency}(Z) & \text{if } X = Y \\ 0 & \text{if } X \neq Y \end{cases}$$
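As a rough sketch, the 16 states and the transition rule can be written out directly. The nucleotide frequencies in `FREQ` are an assumption (uniform placeholders); the slides do not say which frequencies VARiD uses.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# Hypothetical background frequencies; uniform values are only a placeholder.
FREQ = {n: 0.25 for n in NUCLEOTIDES}

# The 16 states: every pair of consecutive letter-space nucleotides.
STATES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]

def transition(prev: str, nxt: str) -> float:
    """P(WX -> YZ) = frequency(Z) if X == Y, otherwise 0."""
    return FREQ[nxt[1]] if prev[1] == nxt[0] else 0.0

# From CA, only states that start with A are reachable.
row = {s: transition("CA", s) for s in STATES}
print({s: p for s, p in row.items() if p > 0})
# {'AA': 0.25, 'AC': 0.25, 'AG': 0.25, 'AT': 0.25}
```

Because the frequencies sum to 1 over Z, each row of the transition table sums to 1 as well.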

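Building a Basic HMM
Emissions: a letter and a colour from donor reads at each state. E.g. for colour space:
$$P(\text{emission} = c \mid \text{state} = CA) = q(c \mid CA) = \begin{cases} 1 - 3\varepsilon & \text{if } c \text{ is } 1 \\ \varepsilon & \text{if } c \text{ is } 0, 2, 3 \end{cases}$$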

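Building a Basic HMM
Emissions: a letter and a colour from donor reads at each state. E.g. for letter space:
$$P(\text{emission} = n \mid \text{state} = CA) = q(n \mid CA) = \begin{cases} 1 - 3\xi & \text{if } n \text{ is } A \\ \xi & \text{if } n \text{ is } C, G, T \end{cases}$$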
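The two single-read emission rules can be sketched as follows, generalising the CA example to any state. This assumes the XOR two-base colour encoding and that the letter emitted by a state is its second nucleotide (as in the CA example, where the expected letter is A). The error rates 0.01 are placeholders, not values from the talk.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def colour_emission(c: int, state: str, eps: float) -> float:
    """q(c | state): 1 - 3*eps for the expected colour, eps for the other three."""
    expected = BASE_CODE[state[0]] ^ BASE_CODE[state[1]]
    return 1 - 3 * eps if c == expected else eps

def letter_emission(n: str, state: str, xi: float) -> float:
    """q(n | state): 1 - 3*xi for the state's second nucleotide, xi otherwise."""
    return 1 - 3 * xi if n == state[1] else xi

EPS = XI = 0.01  # placeholder error rates (assumption)
colour_probs = [colour_emission(c, "CA", EPS) for c in range(4)]
letter_probs = [letter_emission(n, "CA", XI) for n in "ACGT"]
print(colour_probs)  # highest at colour 1
print(letter_probs)  # highest at letter A
```

Each distribution sums to 1, since one outcome gets 1 - 3ε (or 1 - 3ξ) and the remaining three get ε (or ξ) each.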

Building a Basic HMM
Emission probabilities from all reads: P(emissions = E | state = s) = [formula not preserved in the transcript], which combines colour-space and letter-space data
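The combining formula itself is missing from this transcript. One plausible reading, offered only as an assumption, is that reads are treated as independent given the state, so the combined probability is the product of the per-read colour and letter emissions. The sketch below computes that product in log space; `log_emission` and the error rates are hypothetical.

```python
import math
from itertools import product

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
STATES = ["".join(p) for p in product("ACGT", repeat=2)]

def log_emission(state, colours, letters, eps=0.01, xi=0.01):
    """Log of the product of per-read emissions (independence is an assumption)."""
    expected = BASE_CODE[state[0]] ^ BASE_CODE[state[1]]
    total = 0.0
    for c in colours:                      # colour-space reads covering this position
        total += math.log(1 - 3 * eps if c == expected else eps)
    for n in letters:                      # letter-space reads covering this position
        total += math.log(1 - 3 * xi if n == state[1] else xi)
    return total

# Three colour reads (one erroneous) and two letter reads at one position.
colours, letters = [1, 1, 2], ["A", "A"]
best = max(STATES, key=lambda s: log_emission(s, colours, letters))
print(best)  # CA
```

Working in log space avoids underflow when many reads cover a position.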

Building a Basic HMM
Detecting variation is accomplished by finding the maximum-likelihood state for each position in the genotype (the donor) and comparing it against the reference nucleotide.

Building a Basic HMM
By running the Forward-Backward algorithm on the HMM, a probability distribution over the possible states is obtained at each position and a base is called (in bold).
[Figure not preserved. Source: Dalca, A. & Brudno, M. (poster)]
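A compact sketch of Forward-Backward on the basic 16-state model follows. It is not VARiD's implementation: the uniform initial distribution, the uniform nucleotide frequencies, the error rates, the product form of the emissions and the toy observations are all assumptions made for illustration.

```python
from itertools import product

NUC = "ACGT"
CODE = {n: i for i, n in enumerate(NUC)}
STATES = ["".join(p) for p in product(NUC, repeat=2)]
FREQ = {n: 0.25 for n in NUC}   # placeholder nucleotide frequencies (assumption)
EPS = XI = 0.01                 # placeholder error rates (assumption)

def trans(prev, nxt):
    """P(WX -> YZ) = frequency(Z) if X == Y, else 0."""
    return FREQ[nxt[1]] if prev[1] == nxt[0] else 0.0

def emit(state, colours, letters):
    """Product of per-read emissions (independence of reads is an assumption)."""
    expected = CODE[state[0]] ^ CODE[state[1]]
    p = 1.0
    for c in colours:
        p *= (1 - 3 * EPS) if c == expected else EPS
    for n in letters:
        p *= (1 - 3 * XI) if n == state[1] else XI
    return p

def normalise(v):
    total = sum(v.values())
    return {s: x / total for s, x in v.items()}

def forward_backward(observations):
    """observations[t] = (colours, letters); returns posterior P(state | all data) per t."""
    T = len(observations)
    e = [{s: emit(s, *obs) for s in STATES} for obs in observations]
    # Forward pass, rescaled at every position to avoid underflow.
    fwd = [normalise({s: e[0][s] / len(STATES) for s in STATES})]
    for t in range(1, T):
        fwd.append(normalise({
            s: e[t][s] * sum(fwd[t - 1][p] * trans(p, s) for p in STATES)
            for s in STATES}))
    # Backward pass, rescaled the same way.
    bwd = [None] * T
    bwd[T - 1] = {s: 1.0 for s in STATES}
    for t in range(T - 2, -1, -1):
        bwd[t] = normalise({
            s: sum(trans(s, q) * e[t + 1][q] * bwd[t + 1][q] for q in STATES)
            for s in STATES})
    return [normalise({s: fwd[t][s] * bwd[t][s] for s in STATES}) for t in range(T)]

# Toy donor ACGT: states AC, CG, GT; the last position has one erroneous colour read.
obs = [([1, 1], ["C"]), ([3, 3], ["G"]), ([1, 1, 0], ["T", "T"])]
post = forward_backward(obs)
calls = [max(p, key=p.get) for p in post]
print(calls)                                           # ['AC', 'CG', 'GT']
print(calls[0] + "".join(c[1] for c in calls[1:]))     # ACGT
```

Rescaling each forward and backward vector changes only a per-position constant, which cancels when the posterior is normalised.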

Extensions
The HMM described above is quite simple and only calls a single nucleotide for each position. VARiD extends the model to detect heterozygous SNPs, as well as to handle indels.

Microindels
To deal with microindels (<5 bp) in the sample, gap states are required.
E.g. [A G] (would emit colour 2)
- 4 dummy ‘gap’ nucleotides are defined, one each for A, C, G, T
- [A G] = {(A, gap-A), (gap-A, gap-A), (gap-A, gap-A), (gap-A, G)} → colour 2

Microindels
Requires 24 more states:
- (X, gapX) x 4
- (gapX, gapX) x 4
- (gapX, Y) x 16
- Total (incl. orig.): 40 states
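The state count can be checked by enumerating the four groups. This is only a counting sketch; the tuple representation and the `gap-X` labels are illustrative, not VARiD's internal encoding.

```python
from itertools import product

NUC = "ACGT"
GAPS = ["gap-" + n for n in NUC]              # one dummy gap nucleotide per base

original = list(product(NUC, repeat=2))       # (X, Y): the 16 basic states
open_gap = [(x, "gap-" + x) for x in NUC]     # (X, gapX): a gap opens after X
extend_gap = [(g, g) for g in GAPS]           # (gapX, gapX): the gap continues
close_gap = list(product(GAPS, NUC))          # (gapX, Y): the gap closes into Y

states = original + open_gap + extend_gap + close_gap
print(len(open_gap) + len(extend_gap) + len(close_gap))  # 24
print(len(states))                                       # 40
```

Tagging each gap with the last real nucleotide is what lets the closing state (gap-A, G) still emit the colour for the pair A G.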

Heterozygous SNPs
For diploid samples, each state has to account for heterozygous differences.
Each state in VARiD’s HMM is a unique combination of two of the original 40 states (obtained by S x S).
40^2 = 1600 states!
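A short sketch of the diploid state space as the product S x S. The helper `is_heterozygous` is hypothetical: it simply compares the current (second) symbol of the two haplotype states, which is one way a heterozygous position could be read off a diploid state.

```python
from itertools import product

NUC = "ACGT"
GAPS = ["gap-" + n for n in NUC]
# The 40 haploid states: basic pairs plus gap-open, gap-extend and gap-close states.
haploid = (list(product(NUC, repeat=2)) + [(x, "gap-" + x) for x in NUC]
           + [(g, g) for g in GAPS] + list(product(GAPS, NUC)))

diploid = list(product(haploid, repeat=2))    # S x S
print(len(diploid))                           # 1600

def is_heterozygous(state):
    """True if the two haplotypes disagree at the current position (illustrative)."""
    first, second = state
    return first[1] != second[1]

print(is_heterozygous((("C", "A"), ("C", "G"))))  # True: A on one copy, G on the other
print(is_heterozygous((("C", "A"), ("C", "A"))))  # False
```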

Features
- Keeps track of quality scores and positions within a read to augment HMM error rates (ε, ξ) for greater accuracy
- Post-processing ensures that all heterozygous SNP calls are supported by enough reads

Features
[Figure not preserved. Source: original paper]

Features
First T in a read is NOT part of the sequence.

Features
First T is NOT part of the genotype! VARiD eliminates the linker remnant without having to translate fully.

VALiDation
- 260 kb from the human genome
- Sequenced with ABI SOLiD and 454/Roche
- Reference obtained through Sanger reads
- Artificial datasets created with varying amounts of coverage
- Tested in colour-space alone (against Corona), in letter-space alone (against gigaBayes) with various aligners, and with a combination of data

VALiDation
Measures:
- True Positives (correctly identified SNPs)
- False Positives (SNPs not in Sanger set)
- Precision (TP as fraction of all predictions)
- Recall (TP as fraction of Sanger set SNPs)
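The four measures can be computed from two sets of SNP positions. The positions below are made up purely to illustrate the arithmetic; they are not results from the talk or the paper.

```python
def evaluate(predicted: set, truth: set):
    """Return (TP, FP, precision, recall) for predicted SNPs against a truth set."""
    tp = len(predicted & truth)            # correctly identified SNPs
    fp = len(predicted - truth)            # predictions absent from the truth set
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return tp, fp, precision, recall

# Hypothetical SNP positions, for illustration only.
sanger_snps = {101, 250, 400, 512, 900}
called_snps = {101, 250, 377, 512}

tp, fp, precision, recall = evaluate(called_snps, sanger_snps)
print(tp, fp, precision, recall)  # 3 1 0.75 0.6
```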

VALiDation
Colour space only
In colour space, VARiD had slightly higher precision than the Corona caller on AB-mapped reads, with comparable but slightly lower recall. Using VARiD with SHRiMP produced higher recall but lower precision than VARiD + AB mapper. (No significance statistics were presented.)

VALiDation
Letter space only
In letter space, gigaBayes + mosaik performed better than VARiD (using the same mosaik mapper) at low coverage, but fell behind at higher coverage. VARiD + SHRiMP did better than VARiD + mosaik at both low and high coverage, and clearly outperformed gigaBayes at 20x coverage.

VALiDation
Mixed space
VARiD’s true strength lies in being able to combine colour- and letter-space reads and to perform better on them than on cost-equivalent letter-only or colour-only data. [Figure not preserved]

Issues
- No statistical significance presented on performance improvement
- Experimental size relatively small (260 kb)
- Not ideal for low-coverage data
- Would be interesting to see how VARiD performs on more diverse data sets (more/fewer SNPs, indels, etc.)
- Any more?

The End.

References
- Dalca, A.V., Rumble, S.M., Levy, S., Brudno, M. VARiD: A Variation Detection Framework for Color-space and Letter-space platforms (in progress).
- Dalca, A.V. & Brudno, M. VARiD: Variation Detection in Color-space and Letter-space (poster).
- Hidden Markov model. (2010, February 2). In Wikipedia, The Free Encyclopedia. Retrieved 13:24, February 10, 2010, from model&oldid=
- Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A. and Brudno, M. (2009) SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol.