Introduction to SNP and Haplotype Analysis Yao-Ting Huang Kun-Mao Chao Algorithms and Computational Biology Lab, Department of Computer Science & Information Engineering, National Taiwan University, Taiwan.
Genetic Variations Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences. All humans share about 99% of the same DNA sequence. A genetic variation in a coding region may change a codon and thereby alter the amino acid sequence.
Single Nucleotide Polymorphism A single nucleotide polymorphism (SNP), pronounced “snip,” is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is kept through heredity. SNP: a single DNA base variation found at a frequency above 1%. Mutation: a single DNA base variation found at a frequency below 1%. [Figure: CTTAGCTT (94%) versus CTTAGTTT (6%) is a SNP; CTTAGCTT (99.9%) versus CTTAGTTT (0.1%) is a mutation.]
Mutations and SNPs [Figure: timeline from the common ancestor to the present; the genetic variations observed today consist of SNPs and mutations.]
Single Nucleotide Polymorphism SNPs are the most frequent form of genetic variation: 90% of human genetic variations come from SNPs. SNPs occur about once every 300 to 600 base pairs. Millions of SNPs have been identified (e.g., by HapMap and Perlegen). SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.
Single Nucleotide Polymorphism A SNP is usually assumed to be a binary variable, because the probability of a repeated mutation at the same SNP locus is quite small. Tri-allelic cases are usually attributed to genotyping errors. The nucleotide at a SNP locus is called the major allele if its allele frequency is above 50%, or the minor allele if its allele frequency is below 50%. Example: ACTTAGCTT is observed at 94% and ACTTAGCTC at 6%, so T is the major allele and C is the minor allele.
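The major/minor distinction and the 1% threshold from the earlier slide can be illustrated with a small sketch. The function name `classify_alleles` and the sample of 100 chromosomes are assumptions made for illustration; only the 94%/6% split comes from the slide.

```python
from collections import Counter

def classify_alleles(observed):
    """Return (major, minor, minor_allele_frequency) for one biallelic SNP locus."""
    counts = Counter(observed)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a biallelic SNP")
    (major, _), (minor, minor_n) = counts.most_common(2)
    return major, minor, minor_n / len(observed)

# Hypothetical sample: 94 chromosomes carry T and 6 carry C.
sample = ["T"] * 94 + ["C"] * 6
major, minor, maf = classify_alleles(sample)

# A variation above the 1% threshold counts as a SNP, otherwise as a mutation.
kind = "SNP" if maf > 0.01 else "mutation"
print(major, minor, maf, kind)
```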
Haplotypes A haplotype is a set of linked SNPs on the same chromosome. Since each SNP is binary, a haplotype can be treated simply as a binary string. Example with three SNP loci:
Sequence       SNP1  SNP2  SNP3
-ACTTAGCTT-    C     A     T      (Haplotype 1)
-AATTTGCTC-    A     T     C      (Haplotype 2)
-ACTTTGCTC-    C     T     C      (Haplotype 3)
Genotypes The use of haplotype information has been limited because the human genome is diploid, that is, composed of two copies of each chromosome. To obtain haplotype data, the two copies must first be separated and the SNPs extracted from each one; in the example this yields the two haplotypes AT and CG. In large sequencing projects the chromosomes are not separated because of cost considerations, so genotypes are collected instead of haplotypes. Genotype data is less informative: it gives only the two alleles at each locus (here A/C at SNP1 and G/T at SNP2), not how the alleles at different loci combine. For example, we cannot tell whether the haplotypes are AT and CG or AG and CT, yet the haplotype data is what we are interested in.
Problems of Genotypes Genotypes only tell us the alleles at each SNP locus, not how the alleles at different SNP loci are connected, so several haplotype pairs can be consistent with the same genotype. For example, the genotype A/C at SNP1 and G/T at SNP2 can arise from the haplotype pair AT and CG or from the pair AG and CT. We do not know which haplotype pair is real.
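The ambiguity can be made concrete by enumerating every haplotype pair that is consistent with a genotype. This is a minimal sketch; the function name `resolve` and the encoding of a genotype as a list of two-character strings are assumptions made for illustration.

```python
from itertools import product

def resolve(genotype):
    """List every unordered haplotype pair consistent with an unphased genotype.

    `genotype` holds one 2-character string per SNP locus, giving the two
    alleles observed at that locus (their order carries no information).
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # Pick one allele per locus for the first haplotype; the second
        # haplotype receives the other allele at every locus.
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

print(resolve(["AC", "GT"]))  # two candidate pairs: the genotype is ambiguous
print(resolve(["AA", "TT"]))  # one candidate pair: the genotype is homozygous
```

A genotype with k heterozygous loci has 2^(k-1) candidate pairs, which is why inference quickly becomes hard as k grows.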
Research Directions of SNPs and Haplotypes in Recent Years SNP database. Haplotype inference: maximum parsimony, perfect phylogeny, statistical methods. Tag SNP selection: haplotype block, LD bin, prediction accuracy. …
Haplotype Inference The problem of inferring the haplotypes from a set of genotypes is called haplotype inference. Under the maximum parsimony model this problem is known to be not only NP-hard but also APX-hard. Most combinatorial methods adopt the maximum parsimony model, which assumes that the number of distinct haplotypes in a natural population is small. The solution is then a minimum set of haplotypes that can explain the given genotypes.
Maximum Parsimony Suppose we are given two genotypes, G1 and G2. G1 (A/C at SNP1, G/T at SNP2) can be explained by h1 = AT and h2 = CG, or by h3 = AG and h4 = CT. G2 (A/A at SNP1, T/T at SNP2) can be explained only by two copies of h1 = AT. The task is to find a minimum set of haplotypes that explains the given genotypes.
Related Works Statistical methods: Niu, et al. (2002) developed a PL-EM algorithm called HAPLOTYPER. Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE. Combinatorial methods: Gusfield (2003) proposed an integer linear programming algorithm. Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution. Brown and Harrower (2004) proposed a new integer linear formulation of this problem.
Our Results We formulated this problem as an integer quadratic programming (IQP) problem. We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem. This algorithm finds an O(log n)-approximate solution. We implemented this algorithm in MATLAB and compared it with existing methods. Huang, Y.-T., Chao, K.-M., and Chen, T., 2005, “An Approximation Algorithm for Haplotype Inference by Maximum Parsimony,” Journal of Computational Biology, 12: 1261-1274.
Problem Formulation Input: a set of n genotypes and m possible haplotypes. Output: a minimum set of haplotypes that can explain the given genotypes. Example: for G1 (A/C at SNP1, G/T at SNP2) and G2 (A/A at SNP1, T/T at SNP2), the output is {h1 = AT, h2 = CG}; G1 is explained by h1 and h2, and G2 by two copies of h1.
Integer Quadratic Programming (IQP) The first step is to formulate the haplotype inference problem as an IQP problem. Let m be the number of candidate haplotypes, and define xi as an integer variable with value 1 or -1: xi = 1 if the i-th haplotype is selected, and xi = -1 if it is not. Minimizing the number of selected haplotypes amounts to minimizing an integer quadratic function with one term per haplotype. The i-th term is 1 if the i-th haplotype is selected and 0 if it is not, so the sum of the m terms is exactly the number of selected haplotypes. In addition, for each genotype at least one pair of haplotypes that resolves it must be selected. For example, G1 can be resolved by h1 and h2 or by h3 and h4.
Integer Quadratic Programming (IQP) Each genotype must be resolved by at least one pair of haplotypes. For genotype G1 (A/C at SNP1, G/T at SNP2), which is resolved by h1 = AT and h2 = CG or by h3 = AG and h4 = CT, an integer quadratic constraint over these two pairs must be satisfied. If h1 and h2 are both selected, the term for that pair equals 1 and the constraint holds.
Integer Quadratic Programming (IQP) Objective function (maximum parsimony): find a minimum set of haplotypes. Constraint functions: resolve all genotypes. We use the SDP-relaxation technique to solve this IQP problem.
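The formulation can be checked on the running example by brute force. The sketch below assumes one concrete form of the quadratic terms, ((1 + xi) / 2)^2 for the objective and ((1 + xj) / 2)((1 + xk) / 2) for a resolving pair (hj, hk), chosen because it matches the behaviour described on the slides (1 when selected, 0 otherwise); the exact expressions in the original work may be written differently. Exhaustive enumeration stands in for the SDP relaxation, which is not reproduced here.

```python
from itertools import product

haplotypes = ["AT", "CG", "AG", "CT"]  # h1..h4 from the slides
resolving_pairs = {
    "G1": [(0, 1), (2, 3)],  # (h1, h2) or (h3, h4)
    "G2": [(0, 0)],          # homozygous genotype: (h1, h1)
}

def selected(x):
    """Map x = -1 to 0 and x = +1 to 1 (assumed form of the indicator term)."""
    return (1 + x) / 2

def objective(xs):
    """Number of selected haplotypes, written as a sum of quadratic terms."""
    return sum(selected(x) ** 2 for x in xs)

def feasible(xs):
    """Every genotype needs at least one fully selected resolving pair."""
    return all(
        sum(selected(xs[j]) * selected(xs[k]) for j, k in pairs) >= 1
        for pairs in resolving_pairs.values()
    )

candidates = (xs for xs in product((-1, 1), repeat=len(haplotypes)) if feasible(xs))
best = min(candidates, key=objective)
chosen = [h for h, x in zip(haplotypes, best) if x == 1]
print(best, chosen)
```

Enumeration takes 2^m steps, so it only serves to make the objective and constraints tangible on a toy instance.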
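The Flow of the Iterative SDP Relaxation Algorithm The integer quadratic program (NP-hard) is turned into a vector formulation by relaxing the integer constraint, and the vector formulation is then reformulated as a semidefinite program, which can be solved in polynomial time (P). An existing SDP solver produces the SDP solution. Incomplete Cholesky decomposition converts the SDP solution into a vector solution, and randomized rounding converts the vector solution into an integral solution. All genotypes resolved? Yes: done. No: repeat this algorithm.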
Problems of Using SNPs for Association Studies The number of SNPs is too large for all of them to be used in association studies: there are millions of SNPs in the human genome. To reduce the SNP genotyping cost, we wish to use as few SNPs as possible. Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs. We will first study the definition of tag SNPs based on the haplotype-block model.
Haplotype Blocks and Tag SNPs Some studies have shown that the chromosome can be partitioned into haplotype blocks separated by recombination hotspots. Within a haplotype block, little or no recombination has occurred, so the SNPs within a block tend to be inherited together. Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block. We therefore only need to genotype the tag SNPs instead of all SNPs within a haplotype block.
Recombination Hotspots and Haplotype Blocks [Figure: a chromosome divided into haplotype blocks by recombination hotspots; each block shows its haplotype patterns over the SNP loci, with major and minor alleles marked.]
A Haplotype Block Example Human chromosome 21 is partitioned into 4,135 haplotype blocks over 24,047 SNPs by Patil et al. (Science, 2001). [Figure: SNPs and haplotype blocks in chromosome 21; blue box: major allele, yellow box: minor allele.] When we study SNPs, we usually assume that each SNP is biallelic. In other words, at each SNP locus there is either a major allele or a minor allele, and no third type of allele exists.
Examples of Tag SNPs Suppose we wish to identify an unknown haplotype sample. We can genotype all SNPs to identify the haplotype sample. [Figure: haplotype patterns over SNP loci S1 to S12, with major and minor alleles marked, next to an unknown haplotype sample.]
Examples of Tag SNPs In fact, it is not necessary to genotype all SNPs. SNPs S3, S4, and S5 can form a set of tag SNPs. [Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with rows S3, S4, and S5 extracted.]
Examples of Wrong Tag SNPs SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 would be indistinguishable. [Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with rows S1, S2, and S3 extracted.]
Examples of Tag SNPs SNPs S1 and S12 can form a set of tag SNPs. This set of SNPs is the minimum solution in this example. [Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with rows S1 and S12 extracted.]
Problems of Finding Tag SNPs The problem of finding the minimum set of tag SNPs is known to be NP-hard; it is an instance of the minimum test set problem. A number of methods have been proposed to find the minimum set of tag SNPs. Here we illustrate how to recast the tag SNP selection problem as the set cover problem and the integer linear programming problem.
Problem Formulation To solve this problem, we first take a closer look at what each SNP can do. If we pick S1, we are sure that we can distinguish patterns P1 and P3, because they carry different alleles (different colors in the figure) at this SNP locus. The relation between SNPs and pairs of haplotype patterns can be formulated as a bipartite graph, with the SNPs S1 to S4 on one side and the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) on the other. S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4). S2 can distinguish (P1, P4), (P2, P4), and (P3, P4). Given h patterns, we have h(h-1)/2 pairs of patterns.
Set Cover One unanswered question is which sets of SNPs can serve as tag SNPs. A set of SNPs can form a set of tag SNPs if each pair of patterns, i.e., each bottom node (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) of the bipartite graph, is connected by at least one edge to a selected SNP. For example, S1 and S3 can form a set of tag SNPs. S1 and S2 cannot, because the pair (1,2) is not covered, so patterns P1 and P2 cannot be distinguished.
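The covering condition is easy to test mechanically. The sketch below uses the sets D(Pj, Pk) listed on the later Problem Formulation slide; the helper names `uncovered_pairs` and `is_tag_set` are assumptions made for illustration.

```python
# D[(Pj, Pk)] = set of SNPs that distinguish patterns Pj and Pk (from the slides).
D = {
    ("P1", "P2"): {"S3", "S4"},
    ("P1", "P3"): {"S1", "S3"},
    ("P1", "P4"): {"S1", "S2", "S4"},
    ("P2", "P3"): {"S1", "S4"},
    ("P2", "P4"): {"S1", "S2", "S3"},
    ("P3", "P4"): {"S2", "S3", "S4"},
}

def uncovered_pairs(snps):
    """Pairs of patterns that no SNP in `snps` can distinguish."""
    chosen = set(snps)
    return sorted(pair for pair, d in D.items() if not (d & chosen))

def is_tag_set(snps):
    """A set of SNPs is a tag set iff every pair of patterns is covered."""
    return not uncovered_pairs(snps)

print(is_tag_set({"S1", "S3"}))       # every pair is covered
print(uncovered_pairs({"S1", "S2"}))  # the pair (P1, P2) is left uncovered
```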
A Greedy Algorithm Suppose the bipartite graph is stored in a table-like data structure, with one row per SNP and one column per pair of patterns (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). Roughly speaking, the greedy approach repeatedly picks the SNP that contributes the most edges to the bottom nodes that are still uncovered. In this example, suppose the algorithm picks S1 first; it will then pick S4. In other words, this algorithm works row by row.
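The greedy rule can be sketched as follows, again using the D(Pj, Pk) sets from the slides. The function name `greedy_tag_snps` is an assumption, and so is the tie-breaking rule: ties go to the alphabetically first SNP, so this sketch picks S3 as its second SNP where the slide picks S4. Both choices cover the two pairs left over after S1.

```python
D = {
    ("P1", "P2"): {"S3", "S4"},
    ("P1", "P3"): {"S1", "S3"},
    ("P1", "P4"): {"S1", "S2", "S4"},
    ("P2", "P3"): {"S1", "S4"},
    ("P2", "P4"): {"S1", "S2", "S3"},
    ("P3", "P4"): {"S2", "S3", "S4"},
}

def greedy_tag_snps(D):
    """Repeatedly pick the SNP that distinguishes the most uncovered pairs."""
    snps = sorted(set().union(*D.values()))
    uncovered = set(D)
    chosen = []
    while uncovered:
        def gain(s):
            return sum(s in D[pair] for pair in uncovered)
        best = max(snps, key=gain)  # max() keeps the first SNP among ties
        if gain(best) == 0:
            raise ValueError("some pair of patterns cannot be distinguished")
        chosen.append(best)
        uncovered = {pair for pair in uncovered if best not in D[pair]}
    return chosen

tags = greedy_tag_snps(D)
print(tags)
```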
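Integer Linear Programming The final algorithm is an LP-relaxation algorithm. First we need to formulate the problem as an integer programming problem. Given n SNPs and h patterns, let xi = 1 if the i-th SNP is selected and xi = 0 otherwise, and let D(Pj, Pk) be the set of SNPs that can distinguish patterns Pj and Pk. Integer programming formulation: minimize x1 + x2 + ... + xn subject to, for every pair of patterns Pj and Pk with j < k, the sum of xi over all SNPs i in D(Pj, Pk) being at least 1, with each xi in {0, 1}. In other words, each bottom node of the bipartite graph must be covered by at least one selected SNP (by at least m + 1 selected SNPs in the variant that tolerates up to m missing tag SNPs).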
Problem Formulation For the four SNPs S1, S2, S3, and S4 of the example: D(P1, P2) = {S3, S4}; D(P1, P3) = {S1, S3}; D(P1, P4) = {S1, S2, S4}; D(P2, P3) = {S1, S4}; D(P2, P4) = {S1, S2, S3}; D(P3, P4) = {S2, S3, S4}.
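With these sets, the integer program for the example has one covering constraint per pair of patterns and can be solved by exhaustive search. This is only a sketch for a four-SNP instance; the function name `solve_ilp_by_enumeration` is an assumption, and enumeration is used in place of a real ILP solver.

```python
from itertools import product

snps = ["S1", "S2", "S3", "S4"]
D = {
    ("P1", "P2"): {"S3", "S4"},
    ("P1", "P3"): {"S1", "S3"},
    ("P1", "P4"): {"S1", "S2", "S4"},
    ("P2", "P3"): {"S1", "S4"},
    ("P2", "P4"): {"S1", "S2", "S3"},
    ("P3", "P4"): {"S2", "S3", "S4"},
}

def solve_ilp_by_enumeration():
    """Minimize sum(x) subject to sum(x_i for i in D(Pj, Pk)) >= 1 for every pair."""
    best = None
    for x in product((0, 1), repeat=len(snps)):
        value = dict(zip(snps, x))
        satisfied = all(sum(value[s] for s in d) >= 1 for d in D.values())
        if satisfied and (best is None or sum(x) < sum(best)):
            best = x
    return best

best = solve_ilp_by_enumeration()
print(best, [s for s, xi in zip(snps, best) if xi])
```

The optimum value is 2: no single SNP covers all six pairs, while several two-SNP sets do.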
An Iterative LP-relaxation Algorithm Step 1, linear programming relaxation: relax the integer constraint and solve the resulting linear programming problem; this is the so-called LP-relaxation technique. Step 2, randomized rounding: after computing the fractional solution, obtain an integer solution by the randomized rounding method. Step 3: check whether any constraint is still unsatisfied by this integer solution, and repeat the steps for the unsatisfied inequalities until all of them are satisfied.
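The rounding loop can be sketched on the four-SNP example. Two simplifications are assumed here and are not taken from the slides: the fractional solution `x_frac` is written down by hand instead of being computed by an LP solver (the code checks that it is feasible), and the same fractional values are reused in every round rather than re-solving the LP for the unsatisfied inequalities. The function name `round_iteratively` and the `max_rounds` guard are also illustrative.

```python
import random

D = {
    ("P1", "P2"): {"S3", "S4"},
    ("P1", "P3"): {"S1", "S3"},
    ("P1", "P4"): {"S1", "S2", "S4"},
    ("P2", "P3"): {"S1", "S4"},
    ("P2", "P4"): {"S1", "S2", "S3"},
    ("P3", "P4"): {"S2", "S3", "S4"},
}

# Hand-written fractional solution of the relaxed problem (an assumption).
x_frac = {"S1": 0.5, "S2": 0.0, "S3": 0.5, "S4": 0.5}
assert all(sum(x_frac[s] for s in d) >= 1 for d in D.values())  # LP-feasible

def round_iteratively(x_frac, D, rng, max_rounds=1000):
    """Select SNP s with probability x_frac[s]; repeat for unsatisfied constraints."""
    chosen = set()
    for _ in range(max_rounds):
        unsatisfied = [d for d in D.values() if not (d & chosen)]
        if not unsatisfied:
            return chosen
        # Only SNPs that appear in an unsatisfied constraint are rounded again.
        for s in sorted(set().union(*unsatisfied) - chosen):
            if rng.random() < x_frac[s]:
                chosen.add(s)
    raise RuntimeError("rounding did not converge within max_rounds")

tags = round_iteratively(x_frac, D, random.Random(0))
print(sorted(tags))
```

The result always covers every pair, but because the rounding is random it may contain more SNPs than the optimum of two.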
Discussion In this chapter, we illustrate how to recast the tag SNP selection problem as the set cover problem and the integer linear programming problem. Both are hard problems, for which approximation algorithms are used. Related topics: missing data, LD bins, and selecting a specified number of tag SNPs.
References: Huang, Y.-T., Zhang, K., Chen, T. and Chao, K.-M., 2005, “Selecting Additional Tag SNPs for Tolerating Missing Data in Genotyping,” BMC Bioinformatics, 6: 263. Chang, C.-J., Huang, Y.-T., and Chao, K.-M., 2006, “A Greedier Approach for Finding Tag SNPs,” Bioinformatics, 22: 685-691.
Linkage Disequilibrium The problem of finding tag SNPs can also be solved from a statistical point of view. We can measure the correlation between SNPs and identify sets of highly correlated SNPs. For each set of correlated SNPs, only one SNP needs to be genotyped, and it can be used to predict the values of the other SNPs. Linkage disequilibrium (LD) is a measure of such correlation between two SNPs. We will introduce LD formally later.
Linkage Disequilibrium Bins The statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs. An LD bin is a set of SNPs such that SNPs within the same bin are highly correlated with each other. The value of a single SNP in one LD bin can predict the values of other SNPs of the same bin. These methods try to identify the minimum set of LD bins.
An Example of LD Bins (1/3) SNP1 and SNP2 cannot form an LD bin; e.g., A in SNP1 may imply either G or A in SNP2. [Table: alleles of individuals 1 to 8 at SNP1 to SNP6.]
An Example of LD Bins (2/3) SNP1, SNP2, and SNP3 can form an LD bin. Any SNP in this bin is sufficient to predict the values of the others. [Table: alleles of individuals 1 to 8 at SNP1 to SNP6.]
An Example of LD Bins (3/3) There are three LD bins, and only three tag SNPs need to be genotyped (e.g., SNP1, SNP2, and SNP4). [Table: alleles of individuals 1 to 8 at SNP1 to SNP6.]
Difference between Haplotype Blocks and LD bins Haplotype blocks are based on the assumption that SNPs in close proximity tend to be correlated with each other, because recombination is less likely to occur between them. LD bins can group correlated SNPs that are distant from each other, which is useful because a disease is usually affected by multiple genes instead of a single one. The SNPs in one LD bin can be shared by other bins, whereas the SNPs in a haplotype block do not appear in another block.
Introduction to Linkage Disequilibrium A, B: major alleles (value: 1). a, b: minor alleles (value: 0). PA: probability of allele A at SNP1. Pa: probability of allele a at SNP1. PB: probability of allele B at SNP2. Pb: probability of allele b at SNP2. PAB, PAb, PaB, Pab: probabilities of the haplotypes AB, Ab, aB, and ab.
                 SNP2
                 B      b      Total
SNP1   A         PAB    PAb    PA
       a         PaB    Pab    Pa
       Total     PB     Pb     1.0
Linkage Equilibrium Two SNPs are in linkage equilibrium when every haplotype probability in the SNP1 by SNP2 table equals the product of its allele probabilities: PAB = PA PB; PAb = PA Pb = PA (1-PB).
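Linkage Disequilibrium Two SNPs are in linkage disequilibrium when the haplotype probabilities in the SNP1 by SNP2 table differ from the products of the allele probabilities: PAB ≠ PA PB; PAb ≠ PA Pb = PA (1-PB); PaB ≠ Pa PB = (1-PA) PB; Pab ≠ Pa Pb = (1-PA) (1-PB).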
An Example of Linkage Disequilibrium Suppose we have three haplotypes: AG, CG, and CC. At SNP1, PA = 1/3 and PC = 2/3; at SNP2, PG = 2/3 and PC = 1/3. There is no AC haplotype, i.e., PAC = 0. However, the product of the frequency of A at SNP1 and the frequency of C at SNP2 is 1/3 × 1/3 = 1/9, so PAC ≠ PA PC. These two SNPs are in linkage disequilibrium.
An Example of Linkage Equilibrium Before recombination there are three haplotypes: AG, CG, and CC. After recombination there are four: AG, CG, CC, and AC. Now PA = 1/2 and PC = 1/2 at SNP1, and PG = 1/2 and PC = 1/2 at SNP2. After recombination, PAG = PA PG = 1/4, PCG = PC PG = 1/4, PCC = PC PC = 1/4, and PAC = PA PC = 1/4. These two SNPs are in linkage equilibrium.
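Both examples can be verified by comparing each haplotype frequency with the product of its allele frequencies. This is a minimal sketch that assumes equally frequent haplotypes, as in the slides; the function name `in_linkage_equilibrium` is an assumption, and exact fractions are used to avoid rounding issues.

```python
from collections import Counter
from fractions import Fraction

def in_linkage_equilibrium(haplotypes):
    """True iff P(xy) == P(x at SNP1) * P(y at SNP2) for every allele pair."""
    n = len(haplotypes)
    first = Counter(h[0] for h in haplotypes)   # allele counts at SNP1
    second = Counter(h[1] for h in haplotypes)  # allele counts at SNP2
    joint = Counter(haplotypes)                 # haplotype counts
    return all(
        Fraction(joint[a + b], n) == Fraction(first[a], n) * Fraction(second[b], n)
        for a in first
        for b in second
    )

before = ["AG", "CG", "CC"]       # no AC haplotype: P(AC) = 0 but the product is 1/9
after = ["AG", "CG", "CC", "AC"]  # every haplotype has frequency 1/4

print(in_linkage_equilibrium(before), in_linkage_equilibrium(after))
```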
Linkage Disequilibrium There are many formulas for computing LD between two SNPs, and most are normalized to the range from -1 to 1 or from 0 to 1. LD = 1 (perfect positive correlation). LD = 0 (no correlation, i.e., linkage equilibrium). LD = -1 (perfect negative correlation). LD = 0.8 (strong positive correlation). LD = 0.12 (weak positive correlation).
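Linkage Disequilibrium Formulas Mathematical formulas for computing LD, with D = PAB - PA PB: r2 (also written Δ2) = D^2 / (PA Pa PB Pb); D' = D / Dmax, where Dmax is the largest value D can take for the given allele frequencies; the chi-square test; the P value.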
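Correlation Coefficient The correlation between two random variables A and B can be measured by the correlation coefficient: r = Cov(A, B) / (σA σB), where Cov(A, B) is the covariance of A and B, and σA and σB are their standard deviations.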
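For two biallelic SNPs coded as 0/1 variables, the squared correlation coefficient reduces to r2 = D^2 / (PA Pa PB Pb) with D = PAB - PA PB. The sketch below computes D, |D'|, and r2 using these standard definitions; the function name `ld_measures` is an assumption, and it reports the absolute value of D', which is one common convention.

```python
from fractions import Fraction

def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) from P(AB), P(A), and P(B).

    Assumes both SNPs are polymorphic, i.e. 0 < p_a < 1 and 0 < p_b < 1.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Haplotypes AG, CG, CC from the linkage disequilibrium example, tracking
# allele A at SNP1 and allele G at SNP2: P(AG) = 1/3, P(A) = 1/3, P(G) = 2/3.
d, d_prime, r2 = ld_measures(Fraction(1, 3), Fraction(1, 3), Fraction(2, 3))
print(d, d_prime, r2)
```

Here |D'| is 1 because one of the four possible haplotypes is absent, while r2 is only 1/4, so the pair would not pass an r2 ≥ 0.8 threshold.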
Examples of Computing LD [Table: alleles of individuals 1 to 5 at SNP1 to SNP6.]
Minimum Clique Cover Problem The problem asks for a minimum set of LD bins. The minimum LD value required between two SNPs in one bin is usually set to 0.8. This problem is known to be the minimum clique cover problem (by Huang and Chao, 2005): consider each SNP as a node of a graph, with an edge between two nodes iff the LD between the two SNPs is ≥ 0.8.
Relaxation of This Problem The minimum clique cover problem is hard to approximate. The relaxed problem asks for a minimum set of LD bins such that at least one SNP in each LD bin has r2 ≥ 0.8 with all other SNPs in the same bin. The relaxed problem is known to be the minimum dominating set problem, which is still NP-hard but easier to approximate.
Minimum Dominating Set Problem Given a graph G(V, E), a minimum dominating set C is a smallest set of nodes such that each node in V either belongs to C or has at least one edge connecting it to a node in C. Consider each node as a SNP and each edge as strong LD (r2 ≥ 0.8) between two SNPs. The minimum dominating set of this graph is the set of tag SNPs: genotyping only these SNPs is enough to predict the other SNPs.
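A greedy heuristic gives a feel for the relaxed problem. The sketch below is not the method of any cited paper: the r2 values are invented for illustration, the function name `greedy_dominating_set` is an assumption, and ties are broken by SNP name.

```python
# Hypothetical pairwise r^2 values; pairs that are not listed are treated as 0.
r2 = {
    ("SNP1", "SNP2"): 0.90,
    ("SNP2", "SNP3"): 0.85,
    ("SNP1", "SNP3"): 0.40,
    ("SNP4", "SNP5"): 0.95,
    ("SNP5", "SNP6"): 0.30,
}
nodes = ["SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP6"]
edges = [pair for pair, value in r2.items() if value >= 0.8]

def greedy_dominating_set(nodes, edges):
    """Repeatedly pick the SNP that dominates the most undominated SNPs."""
    # Closed neighbourhoods: every SNP dominates itself and its strong-LD partners.
    neighbors = {v: {v} for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    undominated = set(nodes)
    tags = []
    while undominated:
        best = max(sorted(nodes), key=lambda v: len(neighbors[v] & undominated))
        tags.append(best)
        undominated -= neighbors[best]
    return tags

tags = greedy_dominating_set(nodes, edges)
print(tags)
```

SNP1 and SNP3 end up in the bin tagged by SNP2 even though their mutual r2 is only 0.40, which is exactly what the relaxation from cliques to dominating sets permits.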
Experimental Data Sets Hinds et al. (2005) identified 1,586,383 SNPs across three human populations, of African, European, and Asian ancestry. The database provides both genotype data and inferred haplotype data.