The Haplotype Blocks Problems
Wu Ling-Yun
References
Daly, M. et al. High-resolution haplotype structure in the human genome. Nature Genetics 29, 2001.
Patil, N. et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294, 2001.
Gusfield, D. Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. RECOMB 02, 2002.
Gabriel, S. B. et al. The structure of haplotype blocks in the human genome. Science 296, 2002.
Zhang, K. et al. A dynamic programming algorithm for haplotype block partitioning. Proc. Natl. Acad. Sci. USA 99(11), 2002.
Kimmel, G. et al. Computational Problems in Noisy SNP and Haplotype Analysis: Block Scores, Block Identification and Population Stratification. Working paper, 2003.
Methods
Hidden Markov Model (Daly, M. et al. 2001)
Perfect Phylogeny (Gusfield, D. 2002)
Linkage Disequilibrium (Gabriel, S. B. et al. 2002)
Greedy Algorithm (Daly, M. et al. 2001; Patil, N. et al. 2001)
Dynamic Programming (Zhang, K. et al. 2002; Kimmel, G. et al. 2003)
What’s a SNP
A genetic polymorphism is a difference in DNA sequence among individuals, groups, or populations.
A genetic mutation is a change in the nucleotide sequence of a DNA molecule.
Genetic mutations are a kind of genetic polymorphism.
A Single Nucleotide Polymorphism (SNP) is a single-base mutation in DNA.
SNPs (pronounced "snips") are the simplest form and the most common source of genetic polymorphism in the human genome (90% of all human DNA polymorphisms).
Types of SNPs
Two types of substitution result in SNPs:
Transition: a substitution between purines (A, G) or between pyrimidines (C, T). Transitions constitute two thirds of all SNPs.
Transversion: a substitution between a purine and a pyrimidine.
A non-synonymous SNP is a coding-region SNP in which the substitution alters the encoded amino acid.
One half of all coding-sequence SNPs result in non-synonymous codon changes.
Common SNP: minor allele frequency above 5%.
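The transition/transversion rule above can be checked mechanically. The following is a minimal sketch; the function name `classify_substitution` is made up for illustration and does not come from the slides.

```python
# Base classes as given on the slide.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution.

    Hypothetical helper for illustration only.
    """
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```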
What’s a Haplotype
A genotype is an exact description of the genetic constitution of an individual.
A haplotype is a “haploid genotype”: a particular pattern of sequential SNPs (or alleles) observed on a single chromosome.
Disease genes can be associated with SNPs because they are inherited together in a haplotype.
Haplotypes have been used successfully to identify genes for diseases.
The general properties of haplotypes in the human genome have remained unclear.
Haplotyping
Haplotyping involves grouping subjects by haplotypes, that is, by particular patterns of sequential SNPs found on a single chromosome.
With n SNPs there are 2^n possible haplotypes, but in reality only O(n) haplotypes are observed.
There are thought to be a small number of haplotype patterns for each chromosome.
Instead of finding haplotypes over the whole genome, we find them in small pieces and recombine them.
Haplotype Blocks
Recombination of haplotypes occurs primarily in narrow regions called hot spots.
The haplotype regions between two neighboring hot spots are called blocks.
Limited haplotype diversity is observed within blocks.
A few representative SNPs (tag SNPs) from each block suffice to unambiguously distinguish the haplotypes in that block.
Block Properties
Little recombination within blocks.
Large probability of exchange (recombination) between blocks.
Blocks do not have absolute boundaries and may be defined in different ways, depending on the specific application.
Daly’s Model
HETobs = observed haplotypic heterozygosity
HETexp = expected haplotypic heterozygosity
Block score = HETobs / HETexp
A smaller score represents lower haplotype diversity than expected.
Start from windows of five SNPs.
Windows are expanded or contracted by adding or removing SNPs at the ends to find the longest window that is a local minimum of the score.
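The slide does not say how the two heterozygosities are computed, so the sketch below rests on assumptions that are not from the source: HETobs is taken as the chance that two chromosomes drawn at random carry different haplotypes, and HETexp as the same quantity if the SNPs in the window were statistically independent. The function names are illustrative only.

```python
from collections import Counter


def observed_heterozygosity(haplotypes):
    """1 - sum of squared haplotype frequencies (assumed definition)."""
    rows = [tuple(h) for h in haplotypes]
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(rows).values())


def expected_heterozygosity(haplotypes):
    """Same quantity if every SNP varied independently (assumed definition)."""
    rows = [tuple(h) for h in haplotypes]
    n = len(rows)
    prob_identical = 1.0
    for site in zip(*rows):  # iterate over SNP columns
        prob_identical *= sum((c / n) ** 2 for c in Counter(site).values())
    return 1.0 - prob_identical


def block_score(haplotypes):
    """HETobs / HETexp; smaller means less diversity than expected."""
    het_exp = expected_heterozygosity(haplotypes)
    if het_exp == 0.0:
        raise ValueError("window has no polymorphic SNP")
    return observed_heterozygosity(haplotypes) / het_exp


# Two haplotypes in perfect association across three SNPs.
window = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1)]
print(round(block_score(window), 4))  # 0.5714
```

Under these assumptions a window containing every allele combination equally often scores 1, and strongly associated SNPs push the score below 1.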
Haplotype Blocks at 5q31
LD on Haplotype Blocks
Patil’s Model
Consider all possible blocks of one or more physically consecutive SNPs.
Select the block with the maximum ratio of the total number of SNPs in the block to the minimum number of SNPs required to uniquely discriminate the haplotypes represented more than once in the block.
Discard any remaining blocks that physically overlap the selected block.
Repeat until the selected blocks form a contiguous, non-overlapping set that covers the whole chromosome with no gaps, so that every SNP is assigned to a block.
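The greedy procedure above can be sketched as follows. Several choices here are assumptions, not part of the slide: haplotypes are complete 0/1 tuples with no missing data, a block needing zero discriminating SNPs is counted as needing one to avoid division by zero, and a hypothetical `min_common_fraction` threshold rejects multi-SNP blocks whose common haplotypes cover too few chromosomes. The tag count is found by brute force, so this only suits small examples.

```python
from collections import Counter
from itertools import combinations


def common_haplotypes(rows, start, end):
    """Patterns seen more than once in SNPs [start, end), and their coverage."""
    counts = Counter(tuple(r[start:end]) for r in rows)
    common = [h for h, c in counts.items() if c > 1]
    return common, sum(counts[h] for h in common) / len(rows)


def min_tag_snps(patterns):
    """Fewest columns that tell all patterns apart (brute force)."""
    width = len(patterns[0]) if patterns else 0
    for k in range(width + 1):
        for cols in combinations(range(width), k):
            if len({tuple(p[c] for c in cols) for p in patterns}) == len(patterns):
                return k
    return width


def greedy_blocks(rows, min_common_fraction=0.8):
    """Pick blocks by best SNPs-per-tag ratio, dropping overlaps each round."""
    n_snps = len(rows[0])
    candidates = []  # entries are (ratio, start, end)
    for start in range(n_snps):
        for end in range(start + 1, n_snps + 1):
            common, covered = common_haplotypes(rows, start, end)
            # Single-SNP blocks are always kept so a full cover exists.
            if end - start > 1 and covered < min_common_fraction:
                continue
            tags = max(1, min_tag_snps(common))
            candidates.append(((end - start) / tags, start, end))
    chosen = []
    while candidates:
        # Ties: prefer the longer block, then the leftmost one.
        _, s, e = max(candidates, key=lambda c: (c[0], c[2] - c[1], -c[1]))
        chosen.append((s, e))
        candidates = [c for c in candidates if c[2] <= s or c[1] >= e]
    return sorted(chosen)


rows = ([(0, 0, 0, 0, 0, 0)] * 3 + [(1, 1, 1, 1, 1, 1)] * 3
        + [(0, 0, 0, 1, 1, 1), (1, 1, 1, 0, 0, 0)])
print(greedy_blocks(rows))  # [(0, 3), (3, 6)]
```

In the example, two rare recombinant chromosomes keep any block from spanning SNPs 2 and 3, so the greedy choice splits the region there.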
Haplotype Blocks Figure
Gabriel’s Model
A haplotype block is defined as a region over which a very small proportion (<5%) of comparisons among informative SNP pairs show strong evidence of historical recombination.
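A rough sketch of this kind of pairwise test follows. It is a simplification under stated assumptions, not the procedure from the paper: it uses the point estimate of the standard |D'| statistic on phased 0/1 haplotypes, treats a pair as informative when both SNPs are polymorphic, and treats |D'| below a hypothetical `recomb_threshold` as evidence of recombination. The slide does not specify any of these details.

```python
from itertools import combinations


def d_prime(rows, a, b):
    """|D'| between SNP columns a and b, or None if either is monomorphic."""
    n = len(rows)
    p_a = sum(r[a] for r in rows) / n
    p_b = sum(r[b] for r in rows) / n
    p_ab = sum(1 for r in rows if r[a] == 1 and r[b] == 1) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return None  # uninformative pair
    return abs(d) / d_max


def is_block(rows, start, end, recomb_threshold=0.9, max_fraction=0.05):
    """True if under max_fraction of informative pairs look recombinant."""
    informative = recombinant = 0
    for a, b in combinations(range(start, end), 2):
        value = d_prime(rows, a, b)
        if value is None:
            continue
        informative += 1
        if value < recomb_threshold:  # assumed cut-off, see lead-in
            recombinant += 1
    return informative > 0 and recombinant / informative < max_fraction


rows = ([(0, 0, 0, 0, 0, 0)] * 3 + [(1, 1, 1, 1, 1, 1)] * 3
        + [(0, 0, 0, 1, 1, 1), (1, 1, 1, 0, 0, 0)])
print(is_block(rows, 0, 3), is_block(rows, 0, 6))  # True False
```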
Types of Haplotypes
Common haplotypes
  Represented more than once
  70~90% of the haplotypes within a block
  Very few (2-5)
Rare haplotypes
  Represented only once
  10~30% of the haplotypes within a block
Ambiguous Haplotypes
Two haplotypes are said to be compatible if their alleles are identical at all loci where neither has missing data; otherwise they are incompatible.
A haplotype is ambiguous if it is compatible with two other haplotypes that are themselves incompatible.
H1 = (1, 1, ?, 0)
H2 = (1, 1, 0, ?)
H3 = (1, 1, 1, 0)
Here H1 is compatible with both H2 and H3, but H2 and H3 differ at the third locus, so H1 is ambiguous.
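The two definitions translate directly into code. This is a minimal sketch that represents missing data with the string "?" as on the slide; the function names are illustrative.

```python
MISSING = "?"  # marker for missing data, as written on the slide


def compatible(h1, h2):
    """Alleles agree at every locus where neither haplotype is missing."""
    return all(a == b for a, b in zip(h1, h2)
               if a != MISSING and b != MISSING)


def is_ambiguous(h, haplotypes):
    """h is compatible with two others that are mutually incompatible."""
    partners = [o for o in haplotypes if o is not h and compatible(h, o)]
    return any(not compatible(x, y)
               for i, x in enumerate(partners)
               for y in partners[i + 1:])


H1 = (1, 1, "?", 0)
H2 = (1, 1, 0, "?")
H3 = (1, 1, 1, 0)
sample = [H1, H2, H3]
print([is_ambiguous(h, sample) for h in sample])  # [True, False, False]
```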
Zhang’s Model
Find a partition that minimizes the total number of representative SNPs required to distinguish at least α percent of the unambiguous haplotypes in each block, over the entire chromosome (α is a chosen threshold).
Among all block partitions with the minimum number of representative SNPs, choose one with the minimum number of blocks.
Finding the minimum number of representative SNPs within a block that uniquely distinguish all of its haplotypes is known as the Minimum Test Set problem, which has been proven to be NP-complete.
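For a single small block, the representative-SNP subproblem can be solved by exhaustive search. The sketch below is an illustration only, assuming complete haplotypes with no missing data; it tries column subsets in order of increasing size, which takes exponential time in the worst case, consistent with the NP-completeness noted above. The function name is hypothetical.

```python
from itertools import combinations


def minimum_tag_snps(haplotypes):
    """Smallest set of SNP columns that separates all distinct haplotypes."""
    patterns = sorted({tuple(h) for h in haplotypes})
    if not patterns:
        raise ValueError("no haplotypes given")
    n_snps = len(patterns[0])
    for k in range(n_snps + 1):  # smallest subsets first
        for cols in combinations(range(n_snps), k):
            projected = {tuple(p[c] for c in cols) for p in patterns}
            if len(projected) == len(patterns):
                return cols
    return tuple(range(n_snps))  # not reached: all columns always separate


block = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
print(minimum_tag_snps(block))  # (0, 2)
```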
Dynamic Programming
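A minimal sketch of a dynamic program for this kind of partitioning follows. It is not taken from the paper and makes several assumptions: haplotypes are complete 0/1 tuples, a multi-SNP block is allowed only if haplotypes seen more than once cover at least a fraction `alpha` of the chromosomes, the block cost is the brute-force number of SNPs needed to separate those common haplotypes, and single-SNP blocks are always allowed so a partition exists. Ties in the SNP total are broken by the number of blocks.

```python
from collections import Counter
from itertools import combinations


def block_cost(rows, start, end, alpha):
    """Tag SNPs needed for block [start, end), or None if the block is invalid."""
    counts = Counter(tuple(r[start:end]) for r in rows)
    common = [h for h, c in counts.items() if c > 1]
    coverage = sum(counts[h] for h in common) / len(rows)
    if end - start > 1 and coverage < alpha:
        return None
    width = end - start
    for k in range(width + 1):
        for cols in combinations(range(width), k):
            if len({tuple(h[c] for c in cols) for h in common}) == len(common):
                return k
    return width


def partition_blocks(rows, alpha=0.8):
    """Minimize (total tag SNPs, number of blocks) lexicographically."""
    n = len(rows[0])
    best = [(0, 0)] + [None] * n  # best[j] covers SNPs [0, j)
    prev = [0] * (n + 1)          # start of the last block ending at j
    for j in range(1, n + 1):
        for i in range(j):
            cost = block_cost(rows, i, j, alpha)
            if cost is None:
                continue
            candidate = (best[i][0] + cost, best[i][1] + 1)
            if best[j] is None or candidate < best[j]:
                best[j], prev[j] = candidate, i
    blocks, j = [], n
    while j > 0:  # walk back through the stored split points
        blocks.append((prev[j], j))
        j = prev[j]
    return best[n], blocks[::-1]


rows = ([(0, 0, 0, 0, 0, 0)] * 3 + [(1, 1, 1, 1, 1, 1)] * 3
        + [(0, 0, 0, 1, 1, 1), (1, 1, 1, 0, 0, 0)])
print(partition_blocks(rows))  # ((2, 2), [(0, 3), (3, 6)])
```

Comparing `(tags, blocks)` tuples gives the two-level objective in one pass: fewer representative SNPs first, fewer blocks second.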
Coverage vs. Diversity
Another measure of haplotype quality in a block is the minimum total number of SNPs required to explain β percent of the haplotype diversity in each block (β is a chosen threshold).
The haplotype block partition based on diversity can be solved using the same dynamic programming method.
Kimmel’s Model
Find a block partition that minimizes the total number of distinct haplotypes observed in all the blocks.
Minimizing the total number of haplotypes in blocks can be done in polynomial time if there are no data errors.
Several problems are studied:
  Total Block Errors (TBE)
  Local Block Errors (LBE)
  Incomplete Haplotypes (IH)
  Minimum Block Haplotypes (MBH)
Probabilistic Model Block Scoring (PMBS) algorithm.
A simulated annealing algorithm is used to solve MBH.
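For the error-free case, one way to see why polynomial time is enough is a simple dynamic program over block end points. The sketch below is an illustration under assumptions, not the algorithm from the working paper: haplotypes are complete tuples with no errors or missing entries, and the function name is made up.

```python
from itertools import product


def min_block_haplotypes(rows):
    """Partition SNPs into blocks minimizing total distinct haplotypes."""
    n = len(rows[0])
    best = [0] + [None] * n  # best[j]: minimum total for SNPs [0, j)
    prev = [0] * (n + 1)     # start of the last block ending at j
    for j in range(1, n + 1):
        for i in range(j):
            distinct = len({tuple(r[i:j]) for r in rows})
            total = best[i] + distinct
            if best[j] is None or total < best[j]:
                best[j], prev[j] = total, i
    blocks, j = [], n
    while j > 0:  # recover the partition from the split points
        blocks.append((prev[j], j))
        j = prev[j]
    return best[n], blocks[::-1]


# Two 2-SNP blocks, each with three haplotypes, in all nine combinations.
patterns = [(0, 0), (0, 1), (1, 1)]
rows = [a + b for a, b in product(patterns, repeat=2)]
print(min_block_haplotypes(rows))  # (6, [(0, 2), (2, 4)])
```

Treating the region as one block would count nine haplotypes, while splitting it in the middle counts three per block.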