Published by Mildred Wilkinson. Modified over 9 years ago.
1
Linkage Disequilibrium and Recent Studies of Haplotypes and SNPs
Speaker: Yao-Ting Huang
Advisor: Kun-Mao Chao
Algorithms and Computational Biology Lab, Dept. of Computer Science & Information Engineering, National Taiwan University
Speaker notes: Good afternoon. This talk is about how to handle SNP genotyping with missing data. My name is Yao-Ting Huang and my advisor is Kun-Mao Chao. We also have two other coauthors, Prof. Zhang and Prof. Chen, but they are not here today.
2
Variations in DNA Sequence
Variants in the human genome include Single Nucleotide Polymorphisms (SNPs), deletions (e.g., loss of heterozygosity), and insertions. SNPs have become the preferred DNA markers for association studies because of their high abundance (roughly 1 SNP per 1,000 base pairs) and because high-throughput genotyping technology allows large SNP databases to be built (e.g., by the International HapMap Project).
3
SNPs Arise from Mutations
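[Figure: a genealogy running along a time axis from a common ancestor to the present. Mutations accumulate over time, including a disease mutation, and produce the variations observed in a population.]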
4
Haplotype
A set of closely linked SNPs located on one chromosome.
DNA sequences (the variant sites are columns 4, 12, and 15):
GATATTCGTACGGA-T
GATGTTCGTACTGAAT
GATATTCGTACGGAAT
Haplotypes and their frequencies:
AG-  2/6
GTA  3/6
AGA  1/6
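As a rough illustration of how the haplotypes on this slide can be read off the aligned sequences, the sketch below picks the alignment columns where the sequences differ and counts the resulting strings. The function names are hypothetical, and the multiplicities (2, 3, 1) are an assumption taken from the frequencies shown on the slide, which displays only the three distinct sequences.

```python
from collections import Counter

def variant_columns(seqs):
    """Return 0-based indices of alignment columns where the sequences differ."""
    return [i for i, col in enumerate(zip(*seqs)) if len(set(col)) > 1]

def haplotype_counts(seqs):
    """Count the haplotypes formed by the variant columns of aligned sequences."""
    cols = variant_columns(seqs)
    return Counter("".join(s[i] for i in cols) for s in seqs)

# Six aligned sequences; the multiplicities 2, 3, 1 are assumed from the
# slide's haplotype frequencies (2/6, 3/6, 1/6).
seqs = (["GATATTCGTACGGA-T"] * 2
        + ["GATGTTCGTACTGAAT"] * 3
        + ["GATATTCGTACGGAAT"] * 1)

counts = haplotype_counts(seqs)
for hap, n in counts.items():
    print(hap, f"{n}/{len(seqs)}")
```

The deletion ("-") is treated here as just another allele, which matches how the slide writes the haplotype AG-.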
5
Factors Affecting Haplotypes
Chromosomal recombination breaks up and reorganizes haplotypes. If SNPs are closely linked, recombination is less likely to occur between them, so they tend to be inherited together as a haplotype. Linkage Disequilibrium (LD) is a measure of the non-random association of alleles at linked loci.
6
Linkage Disequilibrium
Consider only two SNPs, one with alleles A/a and the other with alleles B/b. There are 4 possible haplotypes: AB, Ab, aB, and ab.
The probabilities for each haplotype:
          B           b           Total
A         $P_{AB}$    $P_{Ab}$    $P_A$
a         $P_{aB}$    $P_{ab}$    $P_a$
Total     $P_B$       $P_b$       1.0
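The haplotype table can be estimated from a sample of chromosomes by simple counting. The following is a minimal sketch; `haplotype_table` is a hypothetical helper and the eight-chromosome sample is invented purely for illustration.

```python
from collections import Counter
from fractions import Fraction

def haplotype_table(haplotypes):
    """Return joint and marginal frequencies for a list of two-SNP haplotypes.

    Each haplotype is a 2-character string: the allele at the first SNP
    followed by the allele at the second SNP.
    """
    n = len(haplotypes)
    joint = {h: Fraction(c, n) for h, c in Counter(haplotypes).items()}
    first = {a: Fraction(c, n)
             for a, c in Counter(h[0] for h in haplotypes).items()}
    second = {b: Fraction(c, n)
              for b, c in Counter(h[1] for h in haplotypes).items()}
    return joint, first, second

# Hypothetical sample of eight chromosomes (not from the slides).
sample = ["AB", "AB", "AB", "Ab", "aB", "ab", "ab", "ab"]
joint, first, second = haplotype_table(sample)
print(joint)
print(first)
print(second)
```

Each row total equals the sum of its row, e.g. the frequency of A is the frequency of AB plus the frequency of Ab.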
7
Linkage Equilibrium
$P_{AB} = P_A P_B$
$P_{Ab} = P_A P_b = P_A (1 - P_B)$
8
Linkage Disequilibrium
$P_{AB} \neq P_A P_B$
$P_{Ab} \neq P_A P_b = P_A (1 - P_B)$
$P_{aB} \neq P_a P_B = (1 - P_A) P_B$
$P_{ab} \neq P_a P_b = (1 - P_A)(1 - P_B)$
9
An Example of Linkage Disequilibrium
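Before mutation: haplotypes AG and CG, so at SNP 1 $P_A = 1/2$, $P_C = 1/2$ and at SNP 2 $P_G = 1$.
After mutation: haplotypes AG, CG, and CC, so at SNP 1 $P_A = 1/3$, $P_C = 2/3$ and at SNP 2 $P_G = 2/3$, $P_C = 1/3$.
We get only three haplotypes: AG, CG, and CC. There is no AC haplotype, i.e., $P_{AC} = 0$. However, with A at SNP 1 and C at SNP 2, $P_A P_C = 1/3 \times 1/3 = 1/9$, thus $P_A P_C \neq P_{AC}$. These two SNPs are in linkage disequilibrium.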
10
An Example of Linkage Equilibrium
Before recombination: haplotypes AG, CG, and CC.
After recombination: haplotypes AG, CG, CC, and AC, so at SNP 1 $P_A = 1/2$, $P_C = 1/2$ and at SNP 2 $P_G = 1/2$, $P_C = 1/2$.
After recombination, $P_{AG} = P_A P_G = 1/4$, $P_{CG} = P_C P_G = 1/4$, $P_{CC} = P_C P_C = 1/4$, and $P_{AC} = P_A P_C = 1/4$. Thus, these two SNPs are in linkage equilibrium.
11
D Coefficient
We can measure the non-randomness of two loci by means of a deviation, D, defined as follows:
$D = P_{AB} - P_A P_B$, or equivalently $D = P_{AB} P_{ab} - P_{Ab} P_{aB}$
$P_{AB} = P_A P_B + D$
$P_{Ab} = P_A (1 - P_B) - D$
$P_{aB} = (1 - P_A) P_B - D$
$P_{ab} = (1 - P_A)(1 - P_B) + D$
These two SNPs are in linkage equilibrium iff D = 0.
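To make the definition concrete, here is a small sketch that evaluates D on the haplotype sets from the two examples on slides 9 and 10. The function name `d_coefficient` is hypothetical, and each haplotype is assumed to appear once, as drawn on those slides.

```python
from fractions import Fraction

def d_coefficient(haplotypes, a, b):
    """D = P_ab - P_a * P_b for allele `a` at the first SNP and `b` at the second."""
    n = len(haplotypes)
    p_ab = Fraction(sum(h == a + b for h in haplotypes), n)
    p_a = Fraction(sum(h[0] == a for h in haplotypes), n)
    p_b = Fraction(sum(h[1] == b for h in haplotypes), n)
    return p_ab - p_a * p_b

after_mutation = ["AG", "CG", "CC"]             # LD example (slide 9)
after_recombination = ["AG", "CG", "CC", "AC"]  # equilibrium example (slide 10)

print(d_coefficient(after_mutation, "A", "C"))       # -1/9
print(d_coefficient(after_recombination, "A", "C"))  # 0
```

The nonzero value for the first set and the zero for the second agree with the conclusions of the two examples.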
12
Standardization of D Coefficient
The D coefficient can be standardized in many ways. D’ = D/Dmax, where Dmax is the maximum possible absolute value of D.
[Figure: the range of D, with the labels $-P_A P_B$ and $P_a P_B$.]
13
Interpretation of D’
D’ is constrained between -1 and +1.
D’ = 1 (perfect positive LD between SNP alleles)
D’ = 0 (linkage equilibrium between SNP alleles)
D’ = -1 (perfect negative LD between SNP alleles)
D’ = 0.87 (strong positive LD between SNP alleles)
D’ = 0.12 (weak positive LD between SNP alleles)
Other measures based on the D coefficient: $r^2$ or $\Delta^2$; chi-square test; P value.
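The slides do not spell out Dmax or r², so the sketch below assumes the commonly used definitions: Dmax is min(P_A P_b, P_a P_B) when D > 0 and min(P_A P_B, P_a P_b) when D < 0, and r² = D² / (P_A P_a P_B P_b). The function names are hypothetical.

```python
from fractions import Fraction

def d_prime(p_ab, p_a, p_b):
    """Signed D' = D / Dmax, using the assumed standard normalization."""
    d = p_ab - p_a * p_b
    if d == 0:
        return Fraction(0)
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max

def r_squared(p_ab, p_a, p_b):
    """r^2 = D^2 / (P_A P_a P_B P_b); both SNPs must be polymorphic."""
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Frequencies from the "after mutation" example: haplotypes AG, CG, CC.
third = Fraction(1, 3)
print(d_prime(third, third, 2 * third))        # alleles A and G: 1
print(d_prime(Fraction(0), third, third))      # alleles A and C: -1
print(r_squared(third, third, 2 * third))      # 1/4
```

In that example D’ reaches its extreme values because one of the four possible haplotypes is absent, while r² stays well below 1.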
14
Decay of LD over Time
Chromosomal recombination decreases LD over time, and the population should eventually reach linkage equilibrium.
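The slide gives no formula for the decay. A common textbook model, assumed here rather than taken from the talk, is that under random mating D shrinks by a factor of (1 - r) each generation, where r is the recombination fraction between the two loci. The function name and parameter values below are illustrative only.

```python
def ld_after(d0, r, generations):
    """D after a number of generations, assuming D_t = D_0 * (1 - r) ** t."""
    return d0 * (1 - r) ** generations

# Hypothetical starting value and recombination fractions.
d0 = 0.25
for r in (0.5, 0.01):
    values = [ld_after(d0, r, t) for t in (0, 1, 10, 100)]
    print(f"r = {r}:", [round(v, 4) for v in values])
```

Under this model tightly linked loci (small r) keep their LD for many generations, whereas unlinked loci (r = 0.5) lose half of it every generation.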
15
Haplotype Blocks in Human Genome
The human genome has been shown to contain regions of high LD interspersed with regions of low LD. Recombination occurs frequently in the low LD regions, while the high LD regions can form haplotype blocks. The International HapMap Project aims to build a haplotype map across the human genome.
[Figure: a chromosome divided into haplotype blocks (high LD regions) separated by recombination hot spots (low LD regions).]
16
Genotype Data vs. Haplotype Data
The use of the haplotype map has been limited because the human genome is diploid: genotype data, rather than haplotype data, are obtained. Phase problem: genotyping loses the information about which chromosome each base appears on. E.g., for a diploid individual with bases G and T at SNP1 and bases A and C at SNP2, we don’t know whether the haplotypes are (GA, TC) or (GC, TA).
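The phase ambiguity can be shown by enumerating every haplotype pair that is consistent with an unphased genotype. This is a brute-force sketch with a hypothetical function name, intended only for a handful of SNPs since the number of candidates doubles with each heterozygous site.

```python
from itertools import product

def possible_phasings(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a list of 2-character strings, one per SNP, giving the two
    alleles observed at that site (e.g. "GT" for a G/T heterozygote).
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

# The slide's example: G/T at SNP1 and A/C at SNP2.
print(possible_phasings(["GT", "AC"]))  # [('GA', 'TC'), ('GC', 'TA')]
```

A site that is homozygous adds no ambiguity, so only heterozygous sites contribute to the number of candidate pairs.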
17
Haplotype Reconstruction with Pedigree
Haplotype reconstruction with pedigree (Li and Jiang, 2004). It is assumed that no mutations occur within a pedigree, only recombination events. Given a pedigree and genotype data for each member of the pedigree, find a haplotype configuration for the pedigree that requires the minimum number of recombinants.
[Figure: a pedigree whose members are labelled with allele pairs such as 1|2, 3|2, 3|1, 1|3, and 2|2.]
18
Haplotype Block Partition and Tag SNP Selection Using Genotype Data
Zhang et al. (2004) combine a dynamic programming algorithm and an EM algorithm to partition haplotype blocks. The EM algorithm infers the haplotypes for a range of SNPs. The dynamic programming algorithm minimizes the number of tag SNPs used in the haplotype block partition. The experiments examine the factors that affect the block partition and the tag SNPs used, which include the number of haplotypes, density of SNPs, minor allele frequency of SNPs, missing data, and genotyping error rate.
19
Thoughts
How can the tag SNP selection algorithm be modified to process genotype data? The naïve approach is to infer haplotype data with existing algorithms and then find tag SNPs. Is it possible to determine tag SNPs directly from genotype data? Assume 0: homozygous wild type, 1: homozygous mutant, 2: heterozygous.
[Table: a matrix with columns P1 to P4 and one row per SNP (S); the cell values are not recoverable.]
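The 0/1/2 genotype coding can be sketched as a function of the two underlying haplotypes. The haplotype alphabet used here ("0" for the wild-type allele, "1" for the mutant allele) and the function name are assumptions made for the example; the slide defines only the genotype codes.

```python
def encode_genotype(hap1, hap2):
    """Encode a haplotype pair as genotype codes, one per SNP.

    0: homozygous wild type, 1: homozygous mutant, 2: heterozygous.
    Haplotypes are strings over "0" (wild type) and "1" (mutant).
    """
    codes = []
    for a, b in zip(hap1, hap2):
        if a != b:
            codes.append(2)
        elif a == "0":
            codes.append(0)
        else:
            codes.append(1)
    return codes

print(encode_genotype("0110", "0101"))  # [0, 1, 2, 2]
print(encode_genotype("0111", "0100"))  # [0, 1, 2, 2]
```

Two different haplotype pairs map to the same genotype vector, which is why tag SNPs cannot be read off genotype data as directly as off haplotype data.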
20
The Relation Between Minor Allele Frequency and Tag SNPs
The minor allele frequency ranges from 0% to 50%. The higher the frequency, the more useful tag SNPs are available (e.g., > 20% or > 40%); a SNP with a high minor allele frequency can distinguish more haplotype patterns. What is the relation between the minor allele frequency and the number of tag SNPs?
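For reference, the minor allele frequency of a SNP can be computed from the alleles observed across a sample of chromosomes. The sketch below assumes a biallelic SNP; the function name and the sample data are hypothetical.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Frequency of the rarer allele at a biallelic SNP (0.0 if monomorphic)."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(alleles)

# Hypothetical sample: 8 chromosomes carry A and 2 carry G.
observed = ["A"] * 8 + ["G"] * 2
maf = minor_allele_frequency(observed)
print(maf)          # 0.2
print(maf > 0.20)   # False: this SNP would not pass a "> 20%" threshold
```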
21
Block-Free Selection of Tagging SNPs
Bafna et al. (2004) propose algorithms for selecting tag SNPs without considering haplotype block structure. They define a new measure called “Informativeness,” which measures how well a set of SNPs can predict another set of SNPs. The goal is to find a subset of SNPs that has the maximum Informativeness. The total number of tag SNPs used across a whole genome is smaller than with block-dependent approaches.