Approximation Algorithms for the Selection of Robust Tag SNPs

Presentation transcript:

Approximation Algorithms for the Selection of Robust Tag SNPs
Yao-Ting Huang and Kun-Mao Chao, Dept. of Computer Science & Information Engineering, National Taiwan University. Kui Zhang, Dept. of Biostatistics, University of Alabama at Birmingham, USA. Ting Chen, Dept. of Biological Sciences, University of Southern California, USA.
This talk is about how to handle SNP genotyping with missing data. My name is Yao-Ting Huang and my advisor is Kun-Mao Chao. We have two coauthors who are not here today: Prof. Zhang and Prof. Chen.

Haplotype Blocks and SNPs
Recent studies (Daly et al.; Patil et al.) have shown that chromosome recombination takes place only at some narrow hot spots. Based on these hot spots, the chromosome can be partitioned into many haplotype blocks, where a haplotype block is the chromosome segment between these recombination hot spots. A single nucleotide polymorphism (SNP) is, roughly speaking, a single DNA base variation observed with frequency > 1%. Tag SNPs are a small subset of the SNPs in a block that is able to capture the haplotype pattern of the block. Our research builds on these previous studies.

A Haplotype Block Example
Patil et al. partition Chromosome 21 into 4,135 haplotype blocks over 24,047 SNPs.
[Figure: 18 haplotype blocks defined by 147 SNPs. Blue box: major allele. Yellow box: minor allele.]
This picture shows the SNPs in Chromosome 21 found by Patil et al. Each box, blue or yellow, is a SNP, and each chromosome region is a haplotype block.

Identification of an Unknown Haplotype Sample
[Figure: haplotype patterns P1 to P4 and an unknown haplotype sample across SNP loci S1 to S12; each locus is marked as major or minor allele.]
We can genotype all SNPs to identify an unknown haplotype sample. Once we have enough SNP data, the next step is to perform association studies and to identify unknown samples. The naive approach to identifying a sample is to genotype every SNP of the sample and compare the result with the database. However, there are millions of SNPs in the human genome, so this approach is both costly and time-consuming.

Examples of Tag SNPs
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with S3, S4, and S5 extracted.]
In fact, it is not necessary to genotype all SNPs. For example, SNPs S3, S4, and S5 are already sufficient, so they can form a set of tag SNPs.

Examples of Tag SNPs
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with S1, S2, and S3 extracted.]
A negative example is SNPs S1, S2, and S3. They cannot form a set of tag SNPs because P1 and P4 would be ambiguous: we would not be able to tell whether a sample belongs to pattern P1 or pattern P4.

Examples of Tag SNPs
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with S1 and S12 extracted.]
SNPs S1 and S12 can form a set of tag SNPs, and this set is the minimum solution in this example (Bafna et al.; Zhang et al.). In order to minimize the genotyping cost, we wish to find the minimum number of tag SNPs. Many studies have worked on finding the minimum set of tag SNPs in a haplotype block. However, they did not consider the influence of missing data.

The Influence of Missing Data
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with tag SNPs S1 and S12 extracted.]
A SNP is called missing data if it does not pass the threshold of data quality. In other words, we sometimes lose, or are unable to obtain, the SNP data at some locus. If S12 is genotyped as missing data, a sample can be identified as either pattern P2 or pattern P3. If S1 is genotyped as missing data, a sample can be identified as either pattern P1 or pattern P3. This is the problem our paper tries to solve.

Auxiliary Tag SNPs
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12; S5 separates P2 from P3, and S8 separates P1 from P3.]
We can re-genotype auxiliary tag SNPs, which are able to resolve the ambiguity caused by missing data. Let's take a closer look at this problem. Suppose we already know that the sample is either pattern P2 or pattern P3. We can then find a SNP that is able to distinguish these two patterns; here, SNP S5 can distinguish them. Similarly, if we want to distinguish patterns P1 and P3, SNP S8 is what we need. We call these additional SNPs auxiliary tag SNPs.
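The lookup of an auxiliary tag SNP can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 4-pattern, 4-SNP block below is hypothetical (it is not the 12-SNP figure from the slide), and the helper names `candidates` and `auxiliary_snps` are made up for this sketch.

```python
# Hypothetical block: rows are patterns, columns are SNPs S1..S4
# (0 = major allele, 1 = minor allele). These values are assumed for
# illustration and are not the 12-SNP example from the slides.
SNPS = ["S1", "S2", "S3", "S4"]
PATTERNS = {
    "P1": [0, 0, 0, 0],
    "P2": [0, 0, 1, 1],
    "P3": [1, 0, 0, 1],
    "P4": [1, 1, 1, 0],
}


def candidates(observed):
    """Return the patterns consistent with the observed alleles (None = missing)."""
    return [
        name
        for name, alleles in PATTERNS.items()
        if all(v is None or alleles[SNPS.index(s)] == v for s, v in observed.items())
    ]


def auxiliary_snps(observed):
    """Return SNPs outside the genotyped set that separate the two candidates."""
    cands = candidates(observed)
    if len(cands) != 2:
        raise ValueError("this sketch only handles exactly two candidate patterns")
    a, b = cands
    return [
        s
        for k, s in enumerate(SNPS)
        if s not in observed and PATTERNS[a][k] != PATTERNS[b][k]
    ]


# Tag SNPs S1 and S3 were genotyped, but S3 came back as missing data.
observed = {"S1": 0, "S3": None}
print(candidates(observed))
print(auxiliary_snps(observed))
```

In this assumed block, losing S3 leaves two candidate patterns, and the sketch reports which remaining SNPs could be re-genotyped to separate them.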

Robust Tag SNPs
[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with S1, S5, S8, and S12 extracted.]
Alternatively, we can work with a set of SNPs that can tolerate missing data, called robust tag SNPs. For example, if we want to tolerate one missing SNP, we can genotype SNPs S1, S5, S8, and S12. If any one of them is missing data, no two patterns are identical on the remaining three SNPs, so there is no ambiguity. The benefit of robust tag SNPs is that we do not need to re-genotype whenever we encounter missing data.

Our Result
Finding minimum sets of robust tag SNPs and of auxiliary tag SNPs are both shown to be NP-hard. Auxiliary tag SNPs can be found efficiently once robust tag SNPs have been computed in advance, so we focus on the problem of finding robust tag SNPs. We propose two greedy algorithms and one LP-relaxation algorithm to find robust tag SNPs, and we give mathematical proofs of the approximation bound of each algorithm. (The bound formulas shown on the slide are not preserved in this transcript.)

Transformation
[Figure: patterns P1 to P4 over SNPs S1 to S4, and a bipartite graph with SNP nodes S1 to S4 on top and pattern-pair nodes (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) at the bottom.]
Each SNP can distinguish some pairs of patterns. S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4). S2 can distinguish (P1, P4), (P2, P4), and (P3, P4). With four patterns there are 6 pairs of patterns.
To solve this problem, we first take a closer look at the function of each SNP. If we pick SNP S1, we can be sure that we can distinguish patterns P1 and P3, because they have different colors at this SNP locus. SNP S1 can also distinguish patterns P1 and P4, and so on. We formulate this relation as a bipartite graph.
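The transformation can be sketched as a small Python function that builds, for every SNP, the set of pattern pairs it distinguishes (the edges of the bipartite graph). The allele matrix is an assumption: the columns for S1 and S2 reproduce the pairs listed on the slide, while the columns for S3 and S4 are hypothetical values chosen to be consistent with the later slides.

```python
from itertools import combinations

# Assumed allele matrix (0 = major allele, 1 = minor allele).
# Columns are S1, S2, S3, S4. S1 and S2 match the pairs listed on the
# slide; S3 and S4 are hypothetical values used only for illustration.
SNPS = ["S1", "S2", "S3", "S4"]
PATTERNS = {
    "P1": [0, 0, 0, 0],
    "P2": [0, 0, 1, 1],
    "P3": [1, 0, 0, 1],
    "P4": [1, 1, 1, 0],
}


def distinguished_pairs(patterns, snps):
    """Map each SNP to the set of pattern pairs whose alleles differ at it."""
    edges = {s: set() for s in snps}
    for a, b in combinations(sorted(patterns), 2):
        for k, s in enumerate(snps):
            if patterns[a][k] != patterns[b][k]:
                edges[s].add((a, b))
    return edges


EDGES = distinguished_pairs(PATTERNS, SNPS)
for snp in SNPS:
    print(snp, sorted(EDGES[snp]))
```

Each printed line corresponds to the edges leaving one SNP node in the bipartite graph.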

Observation 1: Tag SNPs
[Figure: the bipartite graph of SNPs S1 to S4 and pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).]
A set of SNPs can form a set of tag SNPs iff each pair of patterns is covered by at least one edge from those SNPs. So the question of which SNPs can be tag SNPs is easy to answer: check whether the bottom nodes in the graph are all covered by edges from the chosen SNPs. For example, S1 and S3 can form a set of tag SNPs. S1 and S2 cannot, because the pair (1,2) is not covered, so we cannot distinguish patterns P1 and P2.
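Observation 1 amounts to a set-cover check, which the following Python sketch makes explicit. The edge sets for S1 and S2 come from the slides; those for S3 and S4 are assumed for illustration, chosen so that the two examples on this slide hold. The function names are hypothetical.

```python
from itertools import combinations

# Pattern pairs each SNP distinguishes. S1 and S2 are taken from the
# slides; the entries for S3 and S4 are assumed for illustration.
EDGES = {
    "S1": {(1, 3), (1, 4), (2, 3), (2, 4)},
    "S2": {(1, 4), (2, 4), (3, 4)},
    "S3": {(1, 2), (1, 4), (2, 3), (3, 4)},
    "S4": {(1, 2), (1, 3), (2, 4), (3, 4)},
}
ALL_PAIRS = set(combinations(range(1, 5), 2))


def uncovered_pairs(selected):
    """Return the pattern pairs that no selected SNP distinguishes."""
    covered = set().union(*(EDGES[s] for s in selected))
    return ALL_PAIRS - covered


def is_tag_set(selected):
    """A set of SNPs is a tag set iff every pattern pair is covered."""
    return not uncovered_pairs(selected)


print(is_tag_set(["S1", "S3"]))       # every pair is covered
print(uncovered_pairs(["S1", "S2"]))  # the pair (1, 2) is left uncovered
```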

Observation 2: Missing Data
[Figure: the bipartite graph with the node S4 and its edges removed, supposing S4 is genotyped as missing data.]
Another important question is the effect of missing data. It is easy to see in this graph: if a SNP is genotyped as missing data, it is the same as removing its node and edges from the graph.

Problem Reformulation
[Figure: the bipartite graph restricted to S1, S3, and S4; each pair of patterns is covered by at least two edges.]
From the above two observations, we claim that to tolerate m missing tag SNPs, we need to find a set of SNPs such that each pair of patterns (each bottom node) is covered by at least m + 1 edges. For example, if we want to tolerate one missing tag SNP, SNPs S1, S3, and S4 can be robust tag SNPs, because each bottom node is covered by at least two edges.
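The reformulation can be checked on a small instance. The Python sketch below compares the coverage condition (every pair covered at least m + 1 times) with a brute-force test that removes every possible set of m SNPs. The edge sets for S3 and S4 are assumed for illustration, and the function names are hypothetical.

```python
from itertools import combinations

# Pattern pairs each SNP distinguishes. S1 and S2 are taken from the
# slides; the entries for S3 and S4 are assumed for illustration.
EDGES = {
    "S1": {(1, 3), (1, 4), (2, 3), (2, 4)},
    "S2": {(1, 4), (2, 4), (3, 4)},
    "S3": {(1, 2), (1, 4), (2, 3), (3, 4)},
    "S4": {(1, 2), (1, 3), (2, 4), (3, 4)},
}
ALL_PAIRS = list(combinations(range(1, 5), 2))


def coverage(selected):
    """Count, for each pattern pair, how many selected SNPs distinguish it."""
    return {p: sum(p in EDGES[s] for s in selected) for p in ALL_PAIRS}


def is_robust(selected, m):
    """Coverage condition: every pair is covered at least m + 1 times."""
    return all(c >= m + 1 for c in coverage(selected).values())


def survives_any_loss(selected, m):
    """Brute force: drop every possible set of m SNPs and re-check tagging."""
    for lost in combinations(selected, m):
        rest = [s for s in selected if s not in lost]
        if not is_robust(rest, 0):
            return False
    return True


print(is_robust(["S1", "S3", "S4"], 1), survives_any_loss(["S1", "S3", "S4"], 1))
print(is_robust(["S1", "S3"], 1), survives_any_loss(["S1", "S3"], 1))
```

On this assumed instance the two tests agree, which is what the reformulation predicts.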

The First Greedy Algorithm
[Figure: patterns P1 to P4 over SNPs S1 to S4, and a table with one column per pattern pair (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) whose cells are filled with the selected SNPs. Suppose we want to tolerate one missing tag SNP.]
Suppose this graph is implemented by a table-like data structure. Roughly speaking, the greedy approach is to pick the SNP that contributes the most edges to the bottom nodes. In this example, suppose the first algorithm picks SNP S1 first. Then it will pick SNP S4. In other words, this algorithm works in a row-by-row manner.
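One possible reading of the row-by-row greedy algorithm is sketched below in Python. The details are assumptions, since the slide only outlines the idea: the table has m + 1 rows and one column per pattern pair, a selected SNP fills the next empty cell in each column it covers, and while working on a row the algorithm picks the SNP that fills the most empty cells of that row. The edge sets for S3 and S4 are hypothetical, and ties are broken by SNP order, so the selection order need not match the slide.

```python
from itertools import combinations

# Pattern pairs each SNP distinguishes. S1 and S2 are taken from the
# slides; the entries for S3 and S4 are assumed for illustration.
EDGES = {
    "S1": {(1, 3), (1, 4), (2, 3), (2, 4)},
    "S2": {(1, 4), (2, 4), (3, 4)},
    "S3": {(1, 2), (1, 4), (2, 3), (3, 4)},
    "S4": {(1, 2), (1, 3), (2, 4), (3, 4)},
}
ALL_PAIRS = list(combinations(range(1, 5), 2))


def greedy_row_by_row(edges, pairs, m):
    """Fill the table one row at a time; return SNPs in selection order."""
    filled = {p: 0 for p in pairs}  # number of filled cells per column
    chosen, remaining = [], list(edges)
    for row in range(1, m + 2):
        # Row `row` is complete once every column has at least `row` cells.
        while any(filled[p] < row for p in pairs):

            def gain(snp):
                return sum(1 for p in edges[snp] if filled[p] < row)

            best = max(remaining, key=gain, default=None)
            if best is None or gain(best) == 0:
                raise ValueError("some pair cannot be covered m + 1 times")
            chosen.append(best)
            remaining.remove(best)
            for p in edges[best]:
                filled[p] += 1
    return chosen


print(greedy_row_by_row(EDGES, ALL_PAIRS, m=1))
```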

The Second Greedy Algorithm
[Figure: patterns P1 to P4 over SNPs S1 to S4, and the same table with one column per pattern pair (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). Suppose we want to tolerate one missing tag SNP.]
The second algorithm picks each SNP by looking at the whole table rather than at one row. For example, it will pick SNP S1 and then SNP S2. It does not care whether the first row is still uncovered.
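A whole-table greedy step can be sketched as follows. This is an interpretation, not a transcription of the authors' algorithm: every pattern pair starts with a demand of m + 1, and each round picks the SNP that reduces the most outstanding demand anywhere in the table. The edge sets for S3 and S4 are assumed for illustration, and ties are broken by SNP order, so the selection order need not match the slide.

```python
from itertools import combinations

# Pattern pairs each SNP distinguishes. S1 and S2 are taken from the
# slides; the entries for S3 and S4 are assumed for illustration.
EDGES = {
    "S1": {(1, 3), (1, 4), (2, 3), (2, 4)},
    "S2": {(1, 4), (2, 4), (3, 4)},
    "S3": {(1, 2), (1, 4), (2, 3), (3, 4)},
    "S4": {(1, 2), (1, 3), (2, 4), (3, 4)},
}
ALL_PAIRS = list(combinations(range(1, 5), 2))


def greedy_whole_table(edges, pairs, m):
    """Pick the SNP covering the most unmet demand in the whole table."""
    need = {p: m + 1 for p in pairs}  # remaining demand per pattern pair
    chosen, remaining = [], list(edges)
    while any(need.values()):

        def gain(snp):
            return sum(1 for p in edges[snp] if need[p] > 0)

        best = max(remaining, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("some pair cannot be covered m + 1 times")
        chosen.append(best)
        remaining.remove(best)
        for p in edges[best]:
            if need[p] > 0:
                need[p] -= 1
    return chosen


print(greedy_whole_table(EDGES, ALL_PAIRS, m=1))
```

The only difference from a row-by-row variant is the gain function: it counts unmet demand in any row instead of empty cells in the current row.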

An Iterative LP-relaxation Algorithm
The final algorithm is an LP-relaxation algorithm. Let xi indicate whether the i-th SNP is selected: xi = 1 if the i-th SNP is selected, and xi = 0 otherwise. Let D(Pi, Pj) be the set of SNPs that can distinguish patterns Pi and Pj.
Step 1. Integer programming formulation. First we need to formulate this problem as an integer programming problem. As mentioned, the constraint is that each bottom node needs to be covered by at least m + 1 SNPs.
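The integer program itself is not preserved in the transcript. Under the definitions above, one natural way to write it is the following sketch. The symbols n (number of SNPs) and K (number of patterns) are assumptions introduced here, and the SNP index is written k to keep it apart from the pattern indices i and j; the exact form on the original slide may differ.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{k=1}^{n} x_k \\
\text{subject to}\quad & \sum_{S_k \in D(P_i, P_j)} x_k \;\ge\; m + 1
  \qquad \text{for all } 1 \le i < j \le K, \\
& x_k \in \{0, 1\} \qquad \text{for } k = 1, \dots, n.
\end{aligned}
```

The objective counts the selected SNPs, and each constraint says that one pattern pair is distinguished by at least m + 1 selected SNPs.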

An Iterative LP-relaxation Algorithm
Step 2. Linear programming relaxation. We relax the integer constraint and solve the resulting linear programming problem. This is the so-called LP-relaxation technique.
Step 3. Randomized rounding method. After computing the fractional solution, we obtain an integer solution by the randomized rounding method.
Step 4. Repeat Steps 1, 2, and 3 for those unsatisfied inequalities until all of them are satisfied. That is, we check whether any constraint is still unsatisfied by the integer solution, and repeat the process until all constraints are satisfied.
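The rounding step can be illustrated with a simplified Python sketch. It deliberately does not solve a linear program: the fractional values in `x` are supplied by hand as a feasible fractional solution of the relaxed problem for m = 0 on an assumed instance, and the loop reuses the same values in every round instead of re-solving the LP for the unsatisfied inequalities as Step 4 describes. The edge sets for S3 and S4, the function names, and the `max_rounds` safeguard are all assumptions.

```python
import random
from itertools import combinations

# Pattern pairs each SNP distinguishes. S1 and S2 are taken from the
# slides; the entries for S3 and S4 are assumed for illustration.
EDGES = {
    "S1": {(1, 3), (1, 4), (2, 3), (2, 4)},
    "S2": {(1, 4), (2, 4), (3, 4)},
    "S3": {(1, 2), (1, 4), (2, 3), (3, 4)},
    "S4": {(1, 2), (1, 3), (2, 4), (3, 4)},
}
ALL_PAIRS = list(combinations(range(1, 5), 2))


def unsatisfied(selected, m):
    """Return the pairs covered fewer than m + 1 times by the selection."""
    return [p for p in ALL_PAIRS if sum(p in EDGES[s] for s in selected) < m + 1]


def round_once(x, rng):
    """Randomized rounding: select each SNP with probability x[snp]."""
    return {s for s, value in x.items() if rng.random() < value}


def repeated_rounding(x, m, rng, max_rounds=1000):
    """Round repeatedly, keeping earlier picks, until all pairs are satisfied."""
    selected, rounds = set(), 0
    while unsatisfied(selected, m):
        if rounds == max_rounds:
            raise RuntimeError("rounding did not satisfy all constraints")
        selected |= round_once(x, rng)
        rounds += 1
    return selected


# Hand-supplied fractional solution (assumed): every pair sums to at least 1.
x = {"S1": 0.5, "S2": 0.0, "S3": 0.5, "S4": 0.5}
result = repeated_rounding(x, m=0, rng=random.Random(0))
print(sorted(result), unsatisfied(result, 0))
```

A fuller implementation would call an LP solver in each round on the remaining unsatisfied constraints; that part is left out here to keep the sketch self-contained.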

References
Daly, M.J. et al. High-resolution haplotype structure in the human genome. Nat Genet, 2001.
Patil, N. et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723, 2001.
Gusfield, D. Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. RECOMB, 2002.
Zhang, K., Sun, F., Waterman, M.S., Chen, T. Dynamic programming algorithms for haplotype block partitioning: applications to human chromosome 21 haplotype data. RECOMB, 2003.
Bafna, V. et al. Haplotypes and informative SNP selection algorithms: don't block out information. RECOMB, 2003.
Huang, Y.T., Zhang, K., Chen, T., and Chao, K.M. Approximation algorithms for the selection of robust tag SNPs. WABI, 2004.