VARiD: Variation Detection in Color-Space and Letter-Space Adrian Dalca 1 and Michael Brudno 1,2 University of Toronto 1 Department of Computer Science.

Presentation transcript:

VARiD: Variation Detection in Color-Space and Letter-Space Adrian Dalca 1 and Michael Brudno 1,2 University of Toronto 1 Department of Computer Science 2 Banting & Best Department of Medical Research

Motivation
We have different color-space and letter-space platforms, and we need to bring them together while taking advantage of both.
Outline: Motivation, Methods, Results, Advantages.

motivation | methods | results | advantages: combining the different Color-space and Letter-space platforms

Sequencing Platforms
Letter-space: Sanger, 454, Illumina, etc.
> NC_ | BRCA1 SX3
TCAGCATCGGCATCGACTGCACAGG
Color-space: AB SOLiD. Not as many software tools are available.
> NC_ | BRCA1 AF3
T [color digits not preserved]
The platforms have different sequencing biases, different inherent errors and different advantages, so it is useful to combine their information.

Color Space
[Figure: color-space encoding. Picture reference: SHRiMP.]

Color Space: Translating
Sequencing Error vs SNP
Reference, in letters: TCAGCATCGGCATCGACTGCACAGG
Read with a SNP, translated: TCAGCATCGGCAGCGACTGCACAGG (one letter differs from the reference)
Read with a sequencing error, translated: TCAGCATCGGCAAGCTGACGTGTCC (every letter after the miscalled color differs from the reference)
[The color strings of the reads, each starting with T, were not preserved.]

Notes
There is a clear distinction between a sequencing error and a SNP. Can this help us in SNP detection? Sounds like it!
A single color change indicates an error; 2 adjacent colors changed indicate a (likely) SNP.
Example
[Figure: VARiD toolbox GUI showing the reference, the aligned reads, sequencing errors and a SNP. Letters visible in the figure: TTTTTGAGAGGAATA, A.]
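The rule of thumb above can be written as a small classifier. This is only a sketch of the heuristic, not VARiD's method (which is the HMM described later). The function name and the labels are made up for illustration. The extra XOR check is an addition that follows from the two-base encoding assumed here (color = XOR of the 2-bit letter codes): substituting one letter leaves the XOR of its two flanking colors unchanged.

```python
def classify_color_mismatches(ref_colors, read_colors):
    """Label mismatching color positions of one aligned read.

    ref_colors, read_colors: equal-length lists of colors (ints 0-3).
    Returns a dict mapping a tuple of positions to a label.
    """
    mismatches = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors))
                  if r != c]
    labels = {}
    i = 0
    while i < len(mismatches):
        pos = mismatches[i]
        nxt = mismatches[i + 1] if i + 1 < len(mismatches) else None
        # Two adjacent changed colors whose XOR is preserved are consistent
        # with a single-letter substitution.
        if nxt == pos + 1 and (ref_colors[pos] ^ ref_colors[nxt]) == (
                read_colors[pos] ^ read_colors[nxt]):
            labels[(pos, nxt)] = "likely SNP"
            i += 2
        else:
            labels[(pos,)] = "likely sequencing error"
            i += 1
    return labels

# Reference letters ACGT give colors [1, 3, 1]; the read ATGT gives [3, 1, 1].
print(classify_color_mismatches([1, 3, 1], [3, 1, 1]))
# A single changed color.
print(classify_color_mismatches([1, 3, 1], [1, 3, 0]))
```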

Examples (more realistically)
[Figure: reference and reads from real data, with heterozygous SNPs and a lot more errors than in the example above. Guess for the highlighted position: Het SNP? Letters shown: ACT, AGT, AGT.]

Realistically, the situation is tougher:
Heterozygous SNPs
Homozygous SNPs
Tri-allelic SNPs
Small indels
A lot more errors than in the previous example
Misalignment (by chance)
Misalignment (consistent)
Motivation: we want a SNP caller that handles both traditional letter-space reads and color-space reads.

Methods
Model the system with an HMM.
Expand the HMM and apply heuristics.
Quick breath.

motivation | methods | results | advantages: HMM models and heuristics

Hidden Markov Model
A statistical model for a system (so we have states).
Assume that the system is a Markov process whose state is unobserved.
Markov process: the future state depends only on the current state.
We can observe the state's emission (output); each state has a probability distribution over outputs.
Applied here: we don't know the state (the donor?), but we can observe some output determined by the state (the reads?).

Our Hidden Markov Model
At every pair of consecutive positions we don't know the donor nucleotides, but we have some color-space and/or letter-space reads.
The donor could be:
letters AA, color 0
letters AC, color 1
:
letters TT, color 0
There are 16 letter combinations, but only 4 colors.
Note: AA and TT give the same color, so we have redundancy.
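The redundancy can be made concrete by grouping the 16 donor letter pairs by the color they produce. The sketch assumes the two-base encoding in which a color is the XOR of the 2-bit letter codes (A=0, C=1, G=2, T=3), which agrees with the AA, AC and TT examples above.

```python
from itertools import product

# Assumed two-base encoding: color = XOR of the 2-bit letter codes.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

by_color = {}
for a, b in product("ACGT", repeat=2):
    by_color.setdefault(CODE[a] ^ CODE[b], []).append(a + b)

for color in sorted(by_color):
    print(color, by_color[color])
```

Each color stands for four different letter pairs, so a color alone never identifies the donor letters.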

Colors and Letters
AA and TT give the same color, so we have redundancy:
letters AA, color 0
letters TT, color 0
We can't just call colors, since each color can represent one of several letter translations. To properly call SNPs, we need to model the underlying letters.

States of the Model
Consider the donor at positions 532, 533 and 534. At each pair of positions we have one color and two letters.
[Figure: two columns of states, one for positions 532/3 and one for positions 533/4 (AA, CA, AT, GA, CT, ..., TT), connected by the allowed transitions; positions are labelled ... X Y Z ...]
Only certain transitions are allowed: a state with letters X Y can only be followed by a state with letters Y Z, because consecutive pairs share a position.
Each state depends on the previous state, but no further back (Markov process).
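The transition constraint can be sketched as follows. The uniform probability of 0.25 over the four allowed successors is an assumption made for illustration; the slides do not say how VARiD weights the allowed transitions.

```python
from itertools import product

STATES = ["".join(p) for p in product("ACGT", repeat=2)]

def allowed(prev_state, next_state):
    """A state XY may only be followed by a state YZ (shared middle letter)."""
    return prev_state[1] == next_state[0]

# Hypothetical transition table: uniform over the allowed successors.
TRANS = {s: {t: (0.25 if allowed(s, t) else 0.0) for t in STATES}
         for s in STATES}

print([t for t in STATES if allowed("CA", t)])   # successors of CA
```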

Emissions
Unknown genome: …NNNNNNNNNNNNNN…
Color reads (color emissions): three reads, each starting with T [color digits not preserved]
Letter reads (letter emissions): ATTGCGCAATGCG, TTGGGCAATGCGA, GCGCACTGCGAC

Our Hidden Markov Model: Emissions
[Figure: emission probabilities of two states that share a color. Both states have the same distribution of emissions in color-space over color 0, color 1, color 2 and color 3 (extracted as "1 – ε/3 ε ε ε"; presumably 1 – ε for the expected color and ε/3 for each other color). They have different emissions in letter-space over letters A, C, G and T (extracted as "ξ (1- ξ/3 ) ξ ξ"; presumably 1 – ξ for the expected letter and ξ/3 for each other letter).]

Emission Probability
How do we use emissions? Assign an emission probability to each state: what is the probability that this state emitted these reads?
[Figure: the unknown genome …NNNNNNNNNNNNNN… with the color reads and the letter reads ATTGCGCAATGCG, TTGGGCAATGCGA and GCGCACTGCGAC aligned below it, and a worked example for state CC that was not preserved.]
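One way to turn the reads covering a position pair into an emission probability for a state is sketched below. It rests on several assumptions that are not stated in the slides: reads are treated as independent (so their probabilities multiply), the expected color is emitted with probability 1 – ε and each other color with ε/3, letters follow the same pattern with ξ, a letter read is compared with the second letter of the state, and the values of `eps` and `xi` are arbitrary.

```python
import math

# Assumed two-base encoding: color = XOR of the 2-bit letter codes.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def color_emission(state, observed_color, eps):
    """P(observed color | donor pair), assumed 1 - eps / eps/3 model."""
    expected = CODE[state[0]] ^ CODE[state[1]]
    return 1 - eps if observed_color == expected else eps / 3

def letter_emission(state, observed_letter, xi):
    """P(observed letter | donor pair); compared with the pair's 2nd letter
    (a convention chosen for this sketch)."""
    return 1 - xi if observed_letter == state[1] else xi / 3

def log_emission(state, colors, letters, eps=0.02, xi=0.01):
    """Log-probability that `state` emitted all reads at this position pair."""
    total = sum(math.log(color_emission(state, c, eps)) for c in colors)
    total += sum(math.log(letter_emission(state, l, xi)) for l in letters)
    return total

# Colors alone cannot separate AA from TT; one letter read can.
print(log_emission("AA", [0, 0, 0], []) == log_emission("TT", [0, 0, 0], []))
print(log_emission("AA", [0, 0, 0], ["A"]) > log_emission("TT", [0, 0, 0], ["A"]))
```

Working in log space keeps the product of many small per-read probabilities from underflowing.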

Our Hidden Markov Model
So we have the unknown (the donor pair at some location), the emissions (the output: the read colors at some location), and the dependency on the previous state.

Our Hidden Markov Model
We have set up a form of an HMM.
Run the Forward-Backward algorithm.
Get a probability distribution over the states at each position and pick the likely state.
[Figure: the columns of states (AA, CA, AT, GA, CT, ..., TT) with the likely state highlighted.]
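A compact, generic Forward-Backward pass over the 16 letter-pair states might look like the sketch below. It is not VARiD's implementation. The uniform start and transition probabilities, the error rates `EPS` and `XI`, and the toy observations (a color read C10 whose first letter is also used as a letter emission) are all assumptions chosen to keep the example small. Each row is rescaled to sum to 1, which avoids numerical underflow and does not change the posteriors.

```python
from itertools import product

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}          # assumed two-base encoding
STATES = ["".join(p) for p in product("ACGT", repeat=2)]

def normalize(row):
    total = sum(row.values())
    return {s: v / total for s, v in row.items()}

def forward_backward(states, start, trans, emit):
    """Posterior P(state at i | all observations) for every position i.

    start[s]: prior of s at the first position; trans[s][t]: P(s -> t);
    emit[i][s]: P(observations at position i | s).
    """
    n = len(emit)
    fwd = [normalize({s: start[s] * emit[0][s] for s in states})]
    for i in range(1, n):
        fwd.append(normalize({
            t: emit[i][t] * sum(fwd[i - 1][s] * trans[s][t] for s in states)
            for t in states}))
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for i in range(n - 2, -1, -1):
        bwd[i] = normalize({
            s: sum(trans[s][t] * emit[i + 1][t] * bwd[i + 1][t] for t in states)
            for s in states})
    return [normalize({s: fwd[i][s] * bwd[i][s] for s in states})
            for i in range(n)]

EPS, XI = 0.05, 0.05                              # hypothetical error rates

def color_p(state, color):
    return 1 - EPS if (CODE[state[0]] ^ CODE[state[1]]) == color else EPS / 3

def first_letter_p(state, letter):
    return 1 - XI if state[0] == letter else XI / 3

start = {s: 1 / len(STATES) for s in STATES}
trans = {s: {t: 0.25 if s[1] == t[0] else 0.0 for t in STATES} for s in STATES}

# Toy observations: one color read "C10" covering two position pairs.
emit = [
    {s: color_p(s, 1) * first_letter_p(s, "C") for s in STATES},
    {s: color_p(s, 0) for s in STATES},
]
posterior = forward_backward(STATES, start, trans, emit)
best = [max(p, key=p.get) for p in posterior]
print(best)   # most likely donor pair at each position pair
```

With only the two colors, four letter paths would be equally likely; the single letter emission is what singles out one of them.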

Expansion and Heuristics
The current form of the HMM only detects homozygous SNPs. We also include:
short indels
heterozygous SNPs

Expansion: Gaps and heterozygous SNPs
Expand the states.
Have states that include gaps; these emit a gap or a color.
Have larger states for diploids; these emit colors.
Example states shown: A-, --, -G, AG, TG, T-.
Same algorithm, but in all we have 1600 states.

Expansion: Gaps and heterozygous SNPs
Use variable error rates for emissions
o can support quality values (alter the emission probabilities)
Translate through the first letter
o gives guidance in letter-space
o know the error rate (= error rate at first color)
o handle like a normal letter-space emission
Note: it is not OK to translate the whole read, because of the effects of color-space errors, but one letter is safe.
[Example: a color read starting with >T becomes a read starting with the translated letter, >>C; the color digits were not preserved.]
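As a rough illustration of how a quality value could alter an emission probability, the sketch below converts a quality score into a per-observation error rate. It assumes Phred-scaled qualities (error = 10^(-Q/10)) and the 1 – ε / ε/3 emission pattern; the slides say only that quality values are supported, so both are assumptions.

```python
def phred_to_error(q):
    """Error probability for a Phred-scaled quality value (assumed scale)."""
    return 10 ** (-q / 10)

def color_emission(expected_color, observed_color, q):
    """Emission probability of one observed color, given its quality value."""
    eps = phred_to_error(q)
    return 1 - eps if observed_color == expected_color else eps / 3

# A mismatching color is much less surprising when its quality is low.
print(color_emission(0, 2, q=10))
print(color_emission(0, 2, q=30))
```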

Post Processing: Uncorrelated Errors
The HMM doesn't know which read each emission came from.
Example: [figure with colored reads not preserved] we will get a lot of confidence in states voting for a color pair that looks like a het SNP, but there are NO reads supporting Blue-Green.
Post processing: for each proposed variant, check that there actually are enough reads supporting this variant. Several other cases are handled with a similar check.
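The support check could be sketched as below. The read representation (a start offset plus a list of colors), the function names and the `min_support` threshold are all hypothetical; the slides only state that each proposed variant must be backed by enough reads.

```python
def count_support(reads, start, variant_colors):
    """Count reads that carry ALL variant colors at the variant position.

    reads: list of (offset, colors), where offset is the reference color
    position of the read's first color. start: position of the first
    variant color. variant_colors: the adjacent colors the variant implies.
    """
    support = 0
    for offset, colors in reads:
        lo = start - offset
        hi = lo + len(variant_colors)
        covers = lo >= 0 and hi <= len(colors)
        if covers and tuple(colors[lo:hi]) == tuple(variant_colors):
            support += 1
    return support

def keep_variant(reads, start, variant_colors, min_support=2):
    return count_support(reads, start, variant_colors) >= min_support

variant = (3, 0)   # proposed colors at positions 5 and 6

# Uncorrelated errors: each read shows only one of the two colors.
uncorrelated = [(3, [0, 0, 3, 2, 1]), (4, [0, 1, 0, 1])]
# Real variant: the same reads carry both colors together.
correlated = [(3, [0, 0, 3, 0, 1]), (4, [0, 3, 0, 1]), (5, [3, 0, 2])]

print(keep_variant(uncorrelated, 5, variant))
print(keep_variant(correlated, 5, variant))
```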

Results
Quicker breath.

motivation | methods | results | advantages: simulations and real data

Working Results
Simulations
Color-space dataset. Source: JCVI. Validated with Sanger. Mappings are done with SHRiMP.
Examples (~25000 bp)
8 datasets, all with similar performance:
83-87% True Positives (real SNPs called)
few False Positives (non-variants called as SNPs): % of calls, 0.02% of nucleotides
Results are very similar to Corona.

Example of False Positive
[Figure rows: Sanger ("real") haplomes; color-space reads; VARiD het SNP prediction. Letters shown: ACT, AGT, AGT.]
Example of False Negative (missed call)
[Figure rows: Sanger ("real") haplomes; color-space reads; VARiD prediction. Letters shown: CCT, CTT, CTT, ATG, ACG, ACG.]

Advantages
Take advantage of both color-space and letter-space reads.
Adjacent SNPs, short indels.
Quicker breath.

motivation | methods | results | advantages: combining both platforms natively

Summary of VARiD
Treats color-space and letter-space together in the same framework: no translation, so it takes advantage of each technology's properties.
Fully probabilistic.
Handles adjacent SNPs.
Example (donor and reference):
CAAG translates to C102
CTTG translates to C201
This looks like 2 sequencing errors, but VARiD can detect the 2 SNPs.

Thank you: Sam Levy at JCVI
Find the poster session: U61, Monday (June 29) evening
VARiD website:
Contact:
VARiD, Adrian Dalca & Michael Brudno, University of Toronto
