Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments. Zhixiang Chen, University of Texas Pan American.

Presentation transcript:

Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments
Zhixiang Chen (University of Texas Pan American), Bin Fu (University of Texas Pan American), Robert Schweller (University of Texas Pan American), Boting Yang (University of Regina), Zhiyu Zhao (University of New Orleans), Binhai Zhu (Montana State University)
The Sixth Asia Pacific Bioinformatics Conference, January 16, 2008

Outline
- Background, Haplotypes
- Problem Formulation
- Algorithms
- Empirical Data

Haplotypes: full chromosome for a cat
AGATTACCACATAGTCGATAGAGATTACTAGTATCGATC
AGTTTACCACATAGTCGATCGAGATTACCAGTATCGAAC
AGATTACCACATAGGCGATCGAGATTACCAGTATCGATC

Haplotypes: Most sites are the same; those that differ are single nucleotide polymorphisms (SNPs). In the three sequences above, the SNP sites are positions 3, 15, 20, 29, and 38.

Haplotypes: ATATT, TTCCA, AGCCT (the letters at the SNP sites of the three sequences above). Haplotype: a compact genetic fingerprint specifying identifying characteristics among individual specimens within the same species.
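The reduction from full sequences to haplotypes can be sketched in a few lines of Python. This is only an illustration: the helper names `snp_sites` and `haplotypes` are hypothetical, and the sketch assumes the sequences are already aligned and of equal length.

```python
# Illustrative sketch (helper names are hypothetical, not from the slides).
# Assumes the input sequences are aligned and have equal length.

def snp_sites(seqs):
    """Return the 0-based columns where the sequences are not all identical."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

def haplotypes(seqs):
    """Keep only the SNP columns of each sequence."""
    sites = snp_sites(seqs)
    return ["".join(s[i] for i in sites) for s in seqs]

seqs = [
    "AGATTACCACATAGTCGATAGAGATTACTAGTATCGATC",
    "AGTTTACCACATAGTCGATCGAGATTACCAGTATCGAAC",
    "AGATTACCACATAGGCGATCGAGATTACCAGTATCGATC",
]

print(snp_sites(seqs))   # [2, 14, 19, 28, 37]
print(haplotypes(seqs))  # ['ATATT', 'TTCCA', 'AGCCT']
```

The printed columns are 0-based, so they correspond to positions 3, 15, 20, 29, and 38 when counting from 1.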

Haplotypes, simplified even further: ATATT, TTCCA, AGCCT. Each SNP site varies between only two possible values, so use a binary representation. Determination of an individual's haplotype is a key step in the analysis of genetic variation:
- Drug design
- Medical applications
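One possible binary encoding is sketched below. The slides do not say which allele maps to 0, so the convention used here (the allele of the first haplotype is 0, the other allele is 1) is an assumption, and `to_binary` is a hypothetical helper name.

```python
# Illustrative sketch. Assumed convention: at each SNP site, the allele of
# the first haplotype is encoded as 0 and the other allele as 1.

def to_binary(haps):
    """Encode biallelic haplotype strings as strings over {0, 1}."""
    for column in zip(*haps):
        if len(set(column)) > 2:
            raise ValueError("site has more than two alleles")
    reference = haps[0]
    return ["".join("0" if c == r else "1" for c, r in zip(h, reference))
            for h in haps]

print(to_binary(["ATATT", "TTCCA", "AGCCT"]))  # ['00000', '10111', '01110']
```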

Haplotype Reconstruction Problem. Diploid organism: two haplotypes, both unknown. Given: a set of n fragments, arranged as an n x m SNP matrix. The fragments are
1) Incomplete
2) Inconsistent
3) Of unknown origin: we don't know which fragments came from the same haplotype
[Figure: example n x m SNP matrix with entries 0 and 1 and holes]

Input: an n x m SNP matrix. Output: the 2 "best" haplotypes for generating the input matrix. What is meant by "best" defines multiple optimization problems.

Haplotype Reconstruction Problem. What is meant by "best" defines multiple optimization problems on the n x m SNP matrix:
- Minimum Error Correction (MEC): flip the minimum number of matrix positions such that the matrix can be partitioned into 2 sets of consistent strings.
- Minimum Fragment Removal (MFR): remove the minimum number of fragments (rows) such that the remaining matrix can be partitioned in this way.
- Minimum SNP Removal (MSR): remove the minimum number of SNP sites (columns) such that the remaining matrix can be partitioned in this way.
Great problems. However, they are all computationally hard. [Bafna, Istrail, Lancia, Rizzi, 2005], [Cilibrasi, Iersel, Kelk, Tromp, 2005], [Lippert, Schwartz, Lancia, Istrail, 2002]
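To make the MEC objective concrete, the sketch below scores a candidate haplotype pair and, for very small matrices only, finds the optimum by exhaustive search. The matrix is a made-up toy example (the matrix on the slide is not recoverable), `-` is assumed to mark a hole, and the exponential brute force is for illustration, not a practical method.

```python
from itertools import product

HOLE = "-"  # assumed marker for a missing entry

def mismatches(fragment, haplotype):
    """Count defined positions where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != HOLE and f != h)

def mec_cost(matrix, h1, h2):
    """Flips needed if every row is assigned to the closer of h1 and h2."""
    return sum(min(mismatches(row, h1), mismatches(row, h2)) for row in matrix)

def mec_brute_force(matrix):
    """Exact MEC value by trying all haplotype pairs (exponential in m)."""
    m = len(matrix[0])
    candidates = list(product("01", repeat=m))
    return min(mec_cost(matrix, a, b) for a in candidates for b in candidates)

# Toy 4 x 5 matrix; the third row carries one error at its third position.
matrix = ["0-01-", "00-10", "1100-", "-1101"]

print(mec_cost(matrix, "00010", "11101"))  # 1
print(mec_brute_force(matrix))             # 1
```

In this toy matrix the first, third, and fourth rows conflict pairwise, so no partition into two consistent sets exists without at least one flip.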

Haplotype Reconstruction Problem. Dealing with the hardness of haplotype reconstruction:
- Approximation algorithms for MEC, MFR, MSR. Some work here, but some versions are provably hard to approximate, and there is no known O(1) approximation for MEC.
- Consider a probabilistic model for SNP matrix generation (our approach).

Haplotype Reconstruction Problem. Consider the following model parameters:
1) α₁, inconsistency rate
2) α₂, incompleteness rate
3) β, difference rate
A fragment is (α₁, α₂)-generated by a haplotype H if
1) mismatch/inconsistency errors happen with probability at most α₁;
2) holes happen with probability at most α₂.
Haplotype Reconstruction Problem: Let α₁, α₂, and β be small positive constants. Let H1 and H2 be any two length-m haplotypes such that HAMM(H1, H2) / m ≥ β.
Input: an n x m SNP matrix such that each row is (α₁, α₂)-generated by either H1 or H2.
Output: H1 and H2 (with high probability).
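A small simulator helps to see what such an input looks like. The sketch below is one possible reading of the model, under stated assumptions: errors and holes are applied independently per position with probability exactly α₁ and α₂ (the slides only say "at most"), each row picks H1 or H2 uniformly at random, and H2 differs from H1 in exactly ⌈βm⌉ positions. The function name `generate_instance` and the use of `None` for holes are hypothetical choices.

```python
import math
import random

def generate_instance(m, n, alpha1, alpha2, beta, seed=0):
    """Return (h1, h2, matrix, origin) for a simulated SNP matrix.

    Assumptions (not from the slides): exact rates alpha1/alpha2 per position,
    uniform choice of source haplotype per row, None marks a hole.
    """
    rng = random.Random(seed)
    h1 = [rng.randint(0, 1) for _ in range(m)]
    # Flip ceil(beta * m) positions so that HAMM(h1, h2) / m >= beta.
    flipped = set(rng.sample(range(m), math.ceil(beta * m)))
    h2 = [1 - b if i in flipped else b for i, b in enumerate(h1)]

    matrix, origin = [], []
    for _ in range(n):
        source = rng.choice((0, 1))
        row = []
        for bit in (h1, h2)[source]:
            if rng.random() < alpha2:
                row.append(None)       # hole
            elif rng.random() < alpha1:
                row.append(1 - bit)    # inconsistency error
            else:
                row.append(bit)
        matrix.append(row)
        origin.append(source)
    return h1, h2, matrix, origin

h1, h2, matrix, origin = generate_instance(m=20, n=6, alpha1=0.03, alpha2=0.03, beta=0.3)
for row in matrix:
    print("".join("-" if x is None else str(x) for x in row))
```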

Algorithm 1
1) Grab a random fragment X.
2) Compare each remaining fragment Y to X: IF diff(Y, X) < 2(α₁ + ε) THEN put Y in GROUP1 ELSE put Y in GROUP2.
For example, with α₁ = .03 and ε = .04: 2 or fewer mismatches, GROUP1; 3 or more, GROUP2.
The consensus strings of GROUP1 and GROUP2 are the reconstructed haplotypes.
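The two steps of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `diff` is assumed to be the fraction of mismatches over positions where both fragments are defined, holes are represented by `None`, and consensus is taken as a per-column majority vote with ties resolved to 0.

```python
import random

def diff(x, y):
    """Assumed definition: mismatch fraction over positions defined in both."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def consensus(group, m):
    """Per-column majority vote, ignoring holes (ties and empty columns give 0)."""
    result = []
    for j in range(m):
        ones = sum(1 for row in group if row[j] == 1)
        zeros = sum(1 for row in group if row[j] == 0)
        result.append(1 if ones > zeros else 0)
    return result

def algorithm1(matrix, alpha1, eps, rng=random):
    """Split rows by their distance to one random pivot, then take consensus."""
    m = len(matrix[0])
    pivot = rng.randrange(len(matrix))
    x = matrix[pivot]
    threshold = 2 * (alpha1 + eps)
    group1, group2 = [x], []
    for i, y in enumerate(matrix):
        if i == pivot:
            continue
        (group1 if diff(y, x) < threshold else group2).append(y)
    return consensus(group1, m), consensus(group2, m)

# Error-free toy input: three copies each of two haplotypes that differ in 50% of sites.
h1 = [0] * 10
h2 = [1] * 5 + [0] * 5
print(algorithm1([h1, h2, h1, h2, h1, h2], alpha1=0.03, eps=0.04))
```

Each row is compared to the pivot once and each column is scanned once for the consensus, so the work is proportional to the size of the matrix.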

Algorithm 1: Theorem. Let n₁ and n₂ denote the number of fragments generated from H1 and H2, respectively. If 4(α₁ + ε) < β and 0 < 2α₁ + α₂ - ε < 1, then the target haplotypes are reconstructed with probability
Pr ≥ 1 - 2n·e^(-ε²m/3) - 2m·e^(-ε²n₁/2) - 2m·e^(-ε²n₂/2).
Example: α₁ = 3%, α₂ = 3%, β = 30%. [The example values of ε, n, m, n₁, n₂ and the resulting bound on Pr[success] are missing.]
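The bound in the theorem is easy to evaluate numerically. In the sketch below, the function names are hypothetical and the sizes passed in are made-up values chosen only to show a non-trivial bound; they are not the example values from the slide, which are missing.

```python
import math

def preconditions_hold(alpha1, alpha2, beta, eps):
    """Check the two conditions of the theorem."""
    return 4 * (alpha1 + eps) < beta and 0 < 2 * alpha1 + alpha2 - eps < 1

def success_lower_bound(n, m, n1, n2, eps):
    """Evaluate 1 - 2n e^(-eps^2 m/3) - 2m e^(-eps^2 n1/2) - 2m e^(-eps^2 n2/2)."""
    return (1
            - 2 * n * math.exp(-eps ** 2 * m / 3)
            - 2 * m * math.exp(-eps ** 2 * n1 / 2)
            - 2 * m * math.exp(-eps ** 2 * n2 / 2))

# Hypothetical sizes, chosen for illustration only.
print(preconditions_hold(alpha1=0.03, alpha2=0.03, beta=0.30, eps=0.04))
print(success_lower_bound(n=40000, m=50000, n1=20000, n2=20000, eps=0.04))
```

For small n and m the expression is negative, so the guarantee is only informative once ε²m and ε²n₁, ε²n₂ are large.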

Downside: Algorithm 1 needs to know the parameter α₁.

Algorithm 2
1) Pick 2 fragments at random as representatives.
2) Partition: assign each remaining fragment to its closest representative fragment (GROUP1 or GROUP2).
This can give a bad partition, so try again: repeatedly pick 2 random fragments, and partition.
- For each iteration, compute the MAX distance between the representative fragment and any other fragment in the same group.
- Repeat O(1) times; output the best partition found over all iterations.
Achieves success with high probability. Does not require prior knowledge of model parameters.

Algorithm 3: Repeatedly pick 2 random fragments and partition, as before, but measure the SUM of distances between the representative fragment and ALL other fragments in the same group (instead of the MAX). Repeat O(1) times and output the best partition found over all iterations. This achieves the best empirical results.
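The repeated pick-and-partition scheme, with either the MAX or the SUM score, can be sketched as one function with a pluggable aggregate. Several details are assumptions, since the slides do not specify them: the distance is the mismatch fraction over positions defined in both fragments, the two group scores are added to score a partition, the lowest score counts as "best", ties go to GROUP1, and `trials` stands in for the O(1) repetition count.

```python
import random

def diff(x, y):
    """Assumed distance: mismatch fraction over positions defined in both."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0

def consensus(group, m):
    """Per-column majority vote, ignoring holes (ties give 0)."""
    return [1 if sum(r[j] == 1 for r in group) > sum(r[j] == 0 for r in group) else 0
            for j in range(m)]

def partition_once(matrix, rng):
    """Pick two random representatives and assign every other row to the closer one."""
    i, j = rng.sample(range(len(matrix)), 2)
    rep1, rep2 = matrix[i], matrix[j]
    group1, group2 = [rep1], [rep2]
    for k, y in enumerate(matrix):
        if k in (i, j):
            continue
        (group1 if diff(y, rep1) <= diff(y, rep2) else group2).append(y)
    return (rep1, group1), (rep2, group2)

def reconstruct(matrix, agg=max, trials=20, rng=random):
    """agg=max follows the MAX score, agg=sum the SUM score."""
    m = len(matrix[0])
    best_score, best_parts = None, None
    for _ in range(trials):
        parts = partition_once(matrix, rng)
        # Assumed scoring: add the aggregate distance of both groups.
        score = sum(agg(diff(y, rep) for y in group) for rep, group in parts)
        if best_score is None or score < best_score:
            best_score, best_parts = score, parts
    (_, group1), (_, group2) = best_parts
    return consensus(group1, m), consensus(group2, m)

h1 = [0] * 10
h2 = [1] * 5 + [0] * 5
rows = [h1, h2, h1, h2, h1, h2]
print(reconstruct(rows, agg=max, rng=random.Random(0)))
print(reconstruct(rows, agg=sum, rng=random.Random(0)))
```

Neither variant takes α₁ or ε as input, which matches the point that no prior knowledge of the model parameters is required.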

Empirical Tests: In experiments, Algorithm 3 gave the best performance.

Summary / Future Work
Summary
- Provably good probabilistic algorithms for singular haplotype reconstruction.
- Algorithms are fast: linear time.
- Good empirical performance on simulated data.
Future Work
- Consider a model with short fragments.
- Test with real biological data.
The software of our algorithms is available for public access and for real-time online demonstration at [URL missing]. Thank you for listening. Questions?