Introduction to SNP and Haplotype Analysis. Ho Kim, School of Public Health, Seoul National University
Contents: SNP (Single Nucleotide Polymorphism); Haplotypes; Linkage & Linkage disequilibrium; Association study design; SNP vs. Haplotype for association study; Haplotype estimation; Data analysis
SNPs (pronounced snips)
Mutation
Polymorphism – Definition. Polymorphism: a sequence variation that occurs at least 1 percent of the time (>= 1%); 90% of variations are SNPs. Mutation: a variation that is present less than 1 percent of the time (< 1%).
SNPs in the Human Genome: All humans share 99.9% of the same genetic sequence. SNPs occur about every 1000 base pairs. The human genome contains more than 2 million SNPs. ~21,000 SNPs are found in genes. SNPs are not evenly spaced along the sequence: there are SNP-rich regions and SNP-poor regions.
SNPs as DNA Landmarks Help in DNA sequencing Help in the discovery of genes responsible for many major diseases: asthma, diabetes, heart disease, schizophrenia and cancer among others
From SNP to Haplotype
DNA sequences:
Black eye: GATATTCGTACGGA-T
Brown eye: GATGTTCGTACTGAAT
GATATTCGTACGGAAT
Haplotypes (alleles at the variable sites) by phenotype: Black eye: AG- (2/6); Brown eye: GTA (3/6); Blue eye: AGA (1/6)
SNP: simple to measure and understand.
Haplotype: in the appropriate circumstances, haplotypes have the advantage of carrying more information about the genotype-phenotype link than do the underlying SNPs.
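The step from aligned sequences to SNP haplotypes can be illustrated with a minimal Python sketch. It assumes the sequences are already aligned and of equal length, and it rebuilds a six-chromosome sample from the haplotype counts on the slide (2, 3 and 1 copies); the function name `snp_haplotypes` is hypothetical.

```python
from collections import Counter

def snp_haplotypes(sequences):
    """Return (variable_positions, haplotype_counts) for aligned sequences.

    Positions are 0-based. A gap character ('-') is treated as an allele.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # A site is variable (a SNP) if more than one character occurs in its column.
    positions = [i for i in range(length) if len({s[i] for s in sequences}) > 1]
    # The haplotype of a sequence is its alleles at the variable sites only.
    haplotypes = Counter("".join(s[i] for i in positions) for s in sequences)
    return positions, haplotypes

# Assumed sample: the three distinct sequences, repeated per the slide's counts.
seqs = (["GATATTCGTACGGA-T"] * 2
        + ["GATGTTCGTACTGAAT"] * 3
        + ["GATATTCGTACGGAAT"] * 1)
positions, haps = snp_haplotypes(seqs)
print(positions)  # 0-based indices of the variable sites
print(haps)       # haplotype -> number of chromosomes carrying it
```

Only the variable columns are kept, which is why three SNPs are enough to tell the three sequences apart.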
SNP & Haplotype. SNP: Single Nucleotide Polymorphism. Haplotype: a set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination). Example: the alleles G, A, C at a set of SNPs form a SNP haplotype.
Linkage and Linkage Disequilibrium (1) Linkage: the tendency of genes or other DNA sequences at specific loci to be inherited together as a consequence of their physical proximity on a single chromosome. Linkage disequilibrium (allelic association): particular alleles at two or more neighboring loci show allelic association if they occur together with frequencies significantly different from those predicted from the individual allele frequencies. Linkage is a relation between loci, but association is a relation between alleles.
Linkage and Linkage Disequilibrium (2)
Linkage ($\theta$ = recombination fraction): no linkage: $\theta = 0.5$; perfect linkage: $\theta = 0$.
Linkage disequilibrium ($\rho$ = probability of allelic association): $0 \le \rho \le 1$; linkage equilibrium: $\rho = 0$; complete linkage disequilibrium: $\rho = 1$.
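Allelic Association (LD), Morton et al. (2001)
Locus A \ Locus B | Allele 1 | Allele 2 | Allele frequency
Allele 1 | $\pi_{11}$ | $\pi_{12}$ | $Q$
Allele 2 | $\pi_{21}$ | $\pi_{22}$ | $1-Q$
Allele frequency | $R$ | $1-R$ | 1
A, B: diallelic loci; $\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}$: haplotype frequencies; $\rho$: association probability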
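Measures of LD
Covariance: $D = |\pi_{11}\pi_{22} - \pi_{12}\pi_{21}|$
Association: $\rho = D / [Q(1-R)]$
Here $\pi_{ij}$ are the haplotype frequencies and $Q$, $R$ are the allele frequencies at the two loci. All other measures are functions of $Q$, $R$, $\rho$.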
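As a worked illustration of these two measures, the sketch below computes the covariance D and the association ρ from the four haplotype frequencies of a 2×2 table. It assumes Q is the frequency of allele 1 at locus A (row margin) and R the frequency of allele 1 at locus B (column margin), and that alleles are labeled so that ρ falls between 0 and 1; the function name `ld_measures` and the example frequencies are hypothetical.

```python
def ld_measures(p11, p12, p21, p22):
    """Return (Q, R, D, rho) for haplotype frequencies p11, p12, p21, p22.

    Assumed layout: rows are alleles of locus A, columns are alleles of locus B.
    """
    if abs(p11 + p12 + p21 + p22 - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    q = p11 + p12                      # Q: allele 1 frequency at locus A
    r = p11 + p21                      # R: allele 1 frequency at locus B
    d = abs(p11 * p22 - p12 * p21)     # covariance D
    denom = q * (1.0 - r)
    if denom == 0:
        raise ValueError("rho is undefined when Q(1 - R) is zero")
    rho = d / denom                    # association rho = D / [Q(1 - R)]
    return q, r, d, rho

# Illustrative frequencies (not from the slides).
q, r, d, rho = ld_measures(0.4, 0.1, 0.1, 0.4)
print(q, r, d, rho)

# With one haplotype absent (p12 = 0), association is complete.
_, _, _, rho_complete = ld_measures(0.3, 0.0, 0.2, 0.5)
print(rho_complete)
```

In the second call the haplotype with frequency `p12` is missing, so D equals Q(1-R) and ρ reaches its maximum.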
New Findings on Linkage Disequilibrium In the chromosome, there are blocks of limited haplotype diversity in which more than 80% of a global human sample can typically be characterized by only three common haplotypes (Patil et al., Science 2001). Haplotype blocks are the more precise units to reflect genetic variation. Identification of haplotype structure, i.e., construction of a haplotype map, provides a basis for accurate and efficient association studies.
Daly et al. (2001). LD by distance from two markers
The Problem It’s not yet easy to measure an individual’s (only two) haplotypes Molecular haplotyping (nucleotide sequencing) is the gold standard A more efficient strategy: Focus on regions, such as certain genes Estimate haplotypes from SNP data (genotypes) Use LD map, and reduce the number of loci to represent the haplotype Use haplotype map (DB) = key SNPs + haplotype blocks with strong LD
Haplotyping: Phase Problem. Diploid, observed genotype: SNP1 G/T, SNP2 A/C. Possible haplotypes: (GA, TC) or (GC, TA). n SNPs: $2^n$ possible haplotypes.
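The phase problem can be made concrete by listing every haplotype pair that is compatible with an unphased genotype. The following is a minimal sketch, assuming a genotype is given as one pair of alleles per SNP; the function name `phase_resolutions` is hypothetical.

```python
from itertools import product

def phase_resolutions(genotype):
    """List the unordered haplotype pairs compatible with an unphased genotype.

    genotype: sequence of (allele, allele) pairs, one per SNP.
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # Keep the first heterozygous site fixed so each pair is generated once,
    # and try both orientations of every other heterozygous site.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1 = "".join(b if flip_at.get(i) else a for i, (a, b) in enumerate(genotype))
        h2 = "".join(a if flip_at.get(i) else b for i, (a, b) in enumerate(genotype))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

# The genotype from the slide: SNP1 G/T, SNP2 A/C.
two_snps = phase_resolutions([("G", "T"), ("A", "C")])
print(two_snps)

# Three heterozygous SNPs already give four candidate pairs.
three_snps = phase_resolutions([("G", "T"), ("A", "C"), ("C", "G")])
print(len(three_snps))
```

An individual with k heterozygous SNPs has 2^(k-1) compatible haplotype pairs, so the ambiguity grows quickly with the number of SNPs.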
Molecular Haplotyping. Hetero-duplex analysis, mismatch detection, allele-specific PCR: have the potential for high throughput; only practical for short haplotypes (2-5 kb vs. 50-100 kb); costly. Rolling circle amplification method, etc.: can handle larger sizes; difficult to automate.
In-silico Haplotyping. Also known as: Haplotype Reconstruction, Haplotype Inference, Computational Haplotyping, Statistical Haplotyping, etc. Advantages: cost effective; high throughput. Difficulty: phase ambiguity; the number of possible haplotypes increases exponentially with the number of SNPs.
In-silico Haplotyping: Two Tasks. I. Reconstruction of the haplotypes of the sampled individuals. II. Estimation of haplotype frequencies in a population.
In-silico Haplotyping: Approaches Clark’s algorithm E-M algorithm (expectation-maximization algorithm) Bayesian algorithm
Clark’s Algorithm 1) Find homozygotes or heterozygotes at one locus.
Genotype SNP1 T/T, SNP2 A/A, SNP3 C/C: haplotype T-A-C (unambiguously defined).
Genotype SNP1 T/T, SNP2 A/A, SNP3 C/G: haplotypes T-A-C and T-A-G.
Clark’s Algorithm 2) Try to solve each ambiguous genotype as a combination of solved haplotypes.
Genotype SNP1 A/T, SNP2 A/A, SNP3 C/G: T-A-C is a solved haplotype, so the other haplotype is A-A-G.
Continue until either all haplotypes have been solved or no more haplotypes can be found in this way.
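The two steps above can be sketched in Python as follows. This is a simplified illustration rather than Clark's original implementation: it assumes genotypes are given as one pair of alleles per SNP, and the function name `clark` is hypothetical.

```python
def clark(genotypes):
    """Simplified sketch of Clark's algorithm.

    genotypes: list of genotypes, each a sequence of (allele, allele) pairs.
    Returns (known_haplotypes, resolved), where resolved maps the index of
    each solved individual to its haplotype pair.
    """
    known, resolved = [], {}
    # Step 1: genotypes with at most one heterozygous site are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(a != b for a, b in g) <= 1:
            h1 = "".join(a for a, _ in g)
            h2 = "".join(b for _, b in g)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: explain an ambiguous genotype as a known haplotype plus its
    # complement; repeat until a full pass solves nothing new.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                if all(h[k] in pair for k, pair in enumerate(g)):
                    # The complement carries the other allele at every site.
                    comp = "".join(b if h[k] == a else a
                                   for k, (a, b) in enumerate(g))
                    resolved[i] = (h, comp)
                    if comp not in known:
                        known.append(comp)
                    progress = True
                    break
    return known, resolved

# The three genotypes from the slides.
genotypes = [
    [("T", "T"), ("A", "A"), ("C", "C")],
    [("T", "T"), ("A", "A"), ("C", "G")],
    [("A", "T"), ("A", "A"), ("C", "G")],
]
known, resolved = clark(genotypes)
print(known)
print(resolved)
```

The third genotype is also compatible with the known haplotype T-A-G, so the answer depends on the order in which known haplotypes are tried; this sketch simply takes the first match.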
Clark’s Algorithm: Problems. If there are no homozygotes or single-SNP heterozygotes, the chain might never get started. Many unsolved haplotypes may be left at the end. Still, it is quite useful in practice!
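EM Algorithm. Use a multinomial likelihood under Hardy-Weinberg equilibrium (HWE). For the genotype SNP1 A/T, SNP2 A/A, SNP3 C/G:
Pr(AT//AA//CG) = pr(AAC/TAG) + pr(AAG/TAC) = 2 pr(AAC) pr(TAG) + 2 pr(AAG) pr(TAC),
where the factor 2 counts the two orderings of each pair of distinct haplotypes.
Fallin and Schork (2000) showed that EM is better than Clark’s algorithm.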
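A minimal sketch of EM estimation of haplotype frequencies under HWE is given below. It assumes genotypes are given as one pair of alleles per SNP, starts from uniform frequencies and runs a fixed number of iterations with no convergence check; the function names `resolutions` and `em_haplotype_freqs` are hypothetical.

```python
from collections import defaultdict
from itertools import product

def resolutions(genotype):
    """Unordered haplotype pairs compatible with an unphased genotype."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1 = "".join(b if flip_at.get(i) else a for i, (a, b) in enumerate(genotype))
        h2 = "".join(a if flip_at.get(i) else b for i, (a, b) in enumerate(genotype))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=100):
    """Estimate haplotype frequencies by EM, assuming HWE."""
    options = [resolutions(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for opts in options:
            # E-step: posterior probability of each compatible pair under HWE
            # (factor 2 for a pair of two distinct haplotypes).
            w = [(2.0 if h1 != h2 else 1.0) * freq[h1] * freq[h2]
                 for h1, h2 in opts]
            total = sum(w)
            for (h1, h2), wi in zip(opts, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        # M-step: expected haplotype counts divided by the 2N chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

# Illustrative sample: two unambiguous genotypes plus the ambiguous one above.
genotypes = [
    [("T", "T"), ("A", "A"), ("C", "C")],
    [("T", "T"), ("A", "A"), ("C", "G")],
    [("A", "T"), ("A", "A"), ("C", "G")],
]
freq = em_haplotype_freqs(genotypes)
for h, f in sorted(freq.items()):
    print(h, round(f, 4))
```

In this small sample the unambiguous individuals make TAC common, so the estimates move toward the resolution AAG/TAC for the ambiguous individual and the frequency of AAC shrinks toward zero.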
A Gibbs sampler, Stephens et al. (2001)
$G = (G_1, \ldots, G_n)$: observed multilocus genotypes
$H = (H_1, \ldots, H_n)$: unknown haplotype pairs
$F = (F_1, \ldots, F_M)$: $M$ unknown population haplotype frequencies
1. Choose an individual $i$ from all ambiguous individuals.
2. Sample $H_i^{(t+1)}$ from $\Pr(H_i \mid G, H_{-i}^{(t)})$.
3. Set $H_j^{(t+1)} = H_j^{(t)}$ for $j = 1, 2, \ldots, i-1, i+1, \ldots, n$.
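The update scheme on this slide can be sketched as follows. This is an illustration of the sampling loop only: the conditional distribution used by Stephens et al. (2001) is replaced here by a simple count-based weight (an assumption made for brevity), and the names `resolutions`, `gibbs_haplotypes`, `n_updates` and `alpha` are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def resolutions(genotype):
    """Unordered haplotype pairs compatible with an unphased genotype."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1 = "".join(b if flip_at.get(i) else a for i, (a, b) in enumerate(genotype))
        h2 = "".join(a if flip_at.get(i) else b for i, (a, b) in enumerate(genotype))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def gibbs_haplotypes(genotypes, n_updates=200, alpha=0.1, seed=0):
    """Return one sampled haplotype pair per individual after n_updates steps."""
    rng = random.Random(seed)
    options = [resolutions(g) for g in genotypes]
    ambiguous = [i for i, opts in enumerate(options) if len(opts) > 1]
    # Initial state: a random compatible pair for every individual.
    state = [rng.choice(opts) for opts in options]
    for _ in range(n_updates):
        if not ambiguous:
            break
        # Choose one ambiguous individual; all others keep their current pair.
        i = rng.choice(ambiguous)
        # Haplotype counts among the other individuals (H_{-i}).
        counts = Counter(h for j, pair in enumerate(state) if j != i for h in pair)
        # Assumed stand-in conditional: favor pairs made of haplotypes that
        # are already common; alpha keeps unseen haplotypes possible.
        weights = [(counts[h1] + alpha) * (counts[h2] + alpha)
                   for h1, h2 in options[i]]
        state[i] = rng.choices(options[i], weights=weights, k=1)[0]
    return state

genotypes = [
    [("T", "T"), ("A", "A"), ("C", "C")],
    [("T", "T"), ("A", "A"), ("C", "G")],
    [("A", "T"), ("A", "A"), ("C", "G")],
]
state = gibbs_haplotypes(genotypes)
print(state)
```

Every sampled pair stays compatible with the observed genotype because candidates are drawn only from that individual's own list of resolutions.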
Haplotype Inference. A: SNP data are coded 0 (MM), 1 (Mm), 2 (mm) for a single locus. B: Haplotype data are coded 0 (M), 1 (m) for a single locus.
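The relation between the two codings can be shown with a short sketch: the genotype code at each locus is the number of m alleles carried by the two haplotypes. The function name `genotype_from_haplotypes` and the example strings are illustrative assumptions.

```python
def genotype_from_haplotypes(h1, h2):
    """Convert two 0/1 haplotype strings to a 0/1/2 genotype list.

    0 = MM, 1 = Mm, 2 = mm at each locus.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same loci")
    return [int(a) + int(b) for a, b in zip(h1, h2)]

g = genotype_from_haplotypes("00000", "00100")
print(g)

# Phase is lost in the genotype coding: two different haplotype pairs
# give the same genotype.
same = genotype_from_haplotypes("01", "10") == genotype_from_haplotypes("00", "11")
print(same)
```

The second comparison shows why haplotypes must be inferred: the mapping from haplotype pairs to genotypes is not one-to-one.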
Individual | Haplotype IDs | Haplotype sequences
#1 | 1, 2 | 00000, 00100
#2 | 1, 3 | 00010
#3 | 1, 4 | 01001
#4 | 1, 5 | 00001
#5 | 1, 1 |
#6 | 1, 1 |
An Example Data Set: 169 cases and 231 controls; 11 haplotypes; sex and age information.
Logistic Regression Results. Without adjusting for age and sex: Haplotype 7 is most strongly associated, but the association is not statistically significant (p=0.07). Adjusting for age and sex: Haplotype 11 is most strongly associated (p=0.03). The association is slightly stronger when accounting for repeated measures (2 haplotypes per person) by the GEE procedure (p=0.02).
Other Examples
Drysdale et al. PNAS 2000, 97(19) 10483–10488
Wallenstein, Hodge, and Weston, Genetic Epidemiology 15:173–181 (1998)
Cohort study Case-control study
Shaw et al. Am J of Medical Genet 114 205-213 (2002)
References
Clark (1990). Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7: 111-122.
Excoffier and Slatkin (1995). Maximum likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12: 921-927.
Stephens, Smith, and Donnelly (2001). A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68: 978-989.
Niu, Qin, Xu, and Liu (2002). Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 70: 157-169.
Thank you! Email: hokim@snu.ac.kr This file is available at http://plaza.snu.ac.kr/~hokim (Open Lecture Room, Seminar Materials)