Bioinformatics: A Statistician’s Perspective Ho Kim School of Public Health Seoul National University
Contents Case-crossover design Microarray Data Analysis SNP and haplotype analysis
Case-crossover design Case-control + crossover Popular tool for estimating the effects of environmental and occupational exposures on acute outcomes Only cases are sampled; estimates are based on within-subject comparisons of exposures at failure times vs. control times
[Figure: exposure over time for subject Id=A, with exposures E1 and E2 marked at the control and case times]
Controls for time-invariant confounders by design Conditional logistic regression & matching can be applied Problems: confounding by time-varying factors -> selection bias Time trends in exposure of interest -> bias
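A minimal sketch of the within-subject comparison, assuming the simplest setting: one control time per case and a single binary exposure. In that setting the conditional logistic regression estimate of the odds ratio reduces to the ratio of discordant pairs. The function name and the input layout are hypothetical, chosen only for illustration.

```python
def matched_pair_odds_ratio(pairs):
    """Odds ratio from 1:1 matched case/control times of the same subjects.

    pairs: iterable of (exposed_at_case_time, exposed_at_control_time),
    each coded 0 or 1. This input layout is an assumption of the sketch.
    """
    pairs = list(pairs)
    # Only discordant pairs carry information; concordant pairs cancel out.
    case_only = sum(1 for case, control in pairs if case and not control)
    control_only = sum(1 for case, control in pairs if control and not case)
    if control_only == 0:
        raise ValueError("no subject was exposed only at the control time")
    return case_only / control_only


# Hypothetical data: 6 subjects exposed only at the case time,
# 3 exposed only at the control time, 6 concordant.
example = [(1, 0)] * 6 + [(0, 1)] * 3 + [(1, 1)] * 2 + [(0, 0)] * 4
odds_ratio = matched_pair_odds_ratio(example)
```

With several control times per case, or with continuous exposures and covariates, a full conditional logistic regression fit is needed instead of this shortcut.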
References, case-crossover design
Maclure (1991) The case-crossover design: a method for studying transient effects on the risk of acute events, Am J Epi 133:144-153
Mittleman, Maclure, Robins (1995) Control sampling strategies for case-crossover studies: an assessment of relative efficiency, Am J Epi 142:91-98
Lee, Kim, Schwartz (2000) Bidirectional case-crossover studies of air pollution: bias from skewed and incomplete waves, Environ Health Persp 108:1107-1115
Hong, Rha, Lee, Kwon, Ha, Kim, Ischemic stroke onset associated with decrease in temperature, Epidemiology (in press)
Microarray Data Analysis Simultaneous observation of expression for a large number of genes Contributes to understanding gene regulation and interaction Full Yeast Genome on a Chip (Brown Lab, Stanford Univ.)
Microarray Experiment cDNA Microarray Experiment using Apo AI
Microarray & Statistical Method Biological questions: differences in gene expression patterns; classification of genes and samples Experimental Design Microarray Experiment Image Processing Normalization Statistical problems in all the procedures Validation Estimation Testing Clustering Classification Biological interpretation and verification
Experimental Design Reference Design Loop Design
Image Analysis
A Statistician’s tip Use a robust (nonparametric) estimator
data            mean   median
1,2,3,4         2.5    2.5
1,2,3,4,1000    202    3
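The tip can be checked with a short Python sketch using the standard `statistics` module. The two data sets are the ones on the slide; the variable names are only illustrative.

```python
from statistics import mean, median

clean = [1, 2, 3, 4]
with_outlier = [1, 2, 3, 4, 1000]

# One extreme value drags the mean far from the bulk of the data,
# while the median barely moves.
summary = {
    "clean": (mean(clean), median(clean)),
    "with_outlier": (mean(with_outlier), median(with_outlier)),
}
for name, (m, med) in summary.items():
    print(name, "mean:", m, "median:", med)
```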
Steps in Image Analysis Addressing: locate spot centers. Segmentation: classify pixels as either signal or background. Information extraction: for each spot of the array, calculate signal intensity pairs, background, and quality measures.
Normalization Identify and remove the technical variation (systematic variation) that affects expression levels in microarray data
M&A Plot M = log2(R / G) A = log2(R*G) / 2
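As a small illustration, the two quantities can be computed per spot as follows. This is a sketch assuming R and G are positive, background-corrected red and green intensities; the function name and the example values are hypothetical.

```python
from math import log2


def ma_values(red, green):
    """Return (M, A) lists for paired red and green spot intensities."""
    M = [log2(r / g) for r, g in zip(red, green)]      # log ratio
    A = [log2(r * g) / 2 for r, g in zip(red, green)]  # average log intensity
    return M, A


# Hypothetical intensities for three spots.
red = [8.0, 4.0, 2.0]
green = [2.0, 4.0, 8.0]
M, A = ma_values(red, green)
```

Plotting M against A then shows whether the log ratio drifts with overall intensity.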
Global normalization: add constant to make mean zero Intensity dependent normalization: add values that depend on the intensity
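Both ideas can be sketched in a few lines of Python. The slide does not name a smoother for the intensity-dependent case, so the running median below is only a stand-in chosen to keep the example self-contained; the function names and the `window` parameter are assumptions of this sketch.

```python
from statistics import mean, median


def global_normalize(M):
    """Shift all log ratios by one constant so that their mean is zero."""
    shift = mean(M)
    return [m - shift for m in M]


def intensity_dependent_normalize(M, A, window=5):
    """Subtract a trend in M that depends on the intensity A.

    The trend at each spot is the median of M over the spots closest
    in A (a crude running median, used here as a stand-in smoother).
    """
    order = sorted(range(len(A)), key=lambda i: A[i])
    half = window // 2
    normalized = [0.0] * len(M)
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        trend = median(M[j] for j in order[lo:hi])
        normalized[i] = M[i] - trend
    return normalized
```

Global normalization removes a constant dye bias only; the intensity-dependent version also removes a bias that changes with A.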
Print-tip normalization
Effect of Normalization Print-tip normalization (just location) Location+scale normalization
Cluster Analysis K-means Hierarchical Clustering SOM (Self-Organizing Map) Gene Shaving SVM (Support Vector Machine)
K-Means Clustering Select k initial seeds (k is predetermined)
K-Means Clustering For each data point, assign it to the cluster of the nearest seed Calculate each cluster mean Move each seed to the mean of its cluster
K-Means Clustering Repeat until the seeds no longer change
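The steps on these three slides can be sketched in plain Python as follows. This is a minimal illustration, assuming Euclidean distance and caller-supplied initial seeds; the function names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, seeds, max_iter=100):
    """Return (final seeds, clusters) for the given initial seeds."""
    seeds = [tuple(s) for s in seeds]
    clusters = []
    for _ in range(max_iter):
        # Assign every point to the cluster of its nearest seed.
        clusters = [[] for _ in seeds]
        for p in points:
            nearest = min(range(len(seeds)), key=lambda j: sq_dist(p, seeds[j]))
            clusters[nearest].append(p)
        # Move each seed to the mean of its cluster (empty clusters keep theirs).
        new_seeds = [mean_point(c) if c else s for c, s in zip(clusters, seeds)]
        if new_seeds == seeds:  # seeds no longer change: stop
            break
        seeds = new_seeds
    return seeds, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, groups = kmeans(points, seeds=[(0, 0), (10, 10)])
```

The result depends on the initial seeds, so in practice the procedure is often rerun from several starting points.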
Hierarchical Clustering An evolutionary tree Hypohippus osborni Nannipus minor Neohipparion occidentale Calippus placidus Pliohippus mexicanus Merychippus seversus Merychippus secundus Parahippus pristinus Mesohippus barbouri
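We need a measure of similarity or distance (dissimilarity)
[Figure: agglomerative clustering goes from step 0 to step 4, merging the objects a, b, c, d, e into a,b; d,e; c,d,e; and finally a,b,c,d,e. Divisive clustering follows the same tree in the opposite direction, splitting a,b,c,d,e back into single objects.]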
Group average distance Cluster A = {1, 2}, Cluster B = {3, 4, 5} dAB = (d13 + d14 + d15 + d23 + d24 + d25) / 6
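A minimal Python sketch of the group average distance, together with a simple agglomerative loop that uses it. Euclidean distance between points is an assumption here (any dissimilarity could be substituted), and the function names are hypothetical.

```python
from itertools import product
from math import dist


def group_average_distance(cluster_a, cluster_b):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]  # start with every point on its own
    while len(clusters) > n_clusters:
        index_pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(
            index_pairs,
            key=lambda ij: group_average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Cluster A has 2 members and cluster B has 3, so 6 distances are averaged.
d_ab = group_average_distance([(0,), (2,)], [(4,), (5,), (6,)])
two_groups = agglomerate([(0,), (1,), (10,), (11,), (12,)], n_clusters=2)
```

Recording the distance at which each merge happens gives the heights of a dendrogram.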
Dendrogram indicating two groups
Dendrogram Hierarchical Clustering (Eisen et al. 1998) Clustered display of data from time course of serum stimulation of primary human fibroblasts
SNP and Haplotype Analysis SNP (Single Nucleotide Polymorphism) Haplotypes Linkage & Linkage disequilibrium Association study design SNP vs. Haplotype for association study Haplotype estimation Data analysis
SNPs in the Human Genome All humans share about 99.9% of the same genetic sequence SNPs occur about every 1000 base pairs 90% of human genome variation comes from SNPs The human genome contains more than 2 million SNPs ~21,000 SNPs are found in genes SNPs are not evenly spaced along the sequence SNP-rich regions SNP-poor regions
SNP Locations SNPs within a gene May alter protein structures SNPs in the regulatory region outside a gene May affect when and how the gene is turned on, which affects the quantity of the protein produced SNPs not within the vicinity of a gene genetic markers for locating disease-causing genes
SNPs as DNA Landmarks Help in DNA sequencing Help in the discovery of genes responsible for many major diseases: asthma, diabetes, heart disease, schizophrenia and cancer among others
SNP & Haplotype SNP: Single Nucleotide Polymorphism Haplotype: A set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination). G A C Set of SNP polymorphisms: a SNP haplotype
From SNP to Haplotype
DNA sequence by phenotype:
Black eye   GATATTCGTACGGA-T
Brown eye   GATGTTCGTACTGAAT
Blue eye    GATATTCGTACGGAAT
Haplotypes (alleles at the three variable sites) and their frequencies (out of 6):
Black eye   AG-   2/6
Brown eye   GTA   3/6
Blue eye    AGA   1/6
SNP: simple to measure and understand.
Haplotype: in appropriate circumstances, haplotypes carry more information about the genotype-phenotype link than do the underlying SNPs.
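The step from aligned sequences to haplotypes can be sketched in Python as follows. The three sequences are the ones on the slide; treating every column that differs between sequences as a SNP site, and the function names, are simplifying assumptions of this sketch.

```python
def variable_sites(sequences):
    """Indices (0-based) of alignment columns that differ between sequences."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def haplotypes(sequences):
    """Alleles of each sequence at the variable sites, joined into one string."""
    sites = variable_sites(sequences)
    return ["".join(seq[i] for i in sites) for seq in sequences]


aligned = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGAAT",
]
sites = variable_sites(aligned)
haps = haplotypes(aligned)
```

The sixteen-letter sequences reduce to three-letter haplotypes without losing any of the variation among them.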
Linkage and Linkage Disequilibrium (1) Linkage: the tendency of genes or other DNA sequences at specific loci to be inherited together as a consequence of their physical proximity on a single chromosome. Linkage disequilibrium (allelic association): particular alleles at two or more neighboring loci show allelic association if they occur together with frequencies significantly different from those predicted from the individual allele frequencies. Linkage is a relation between loci, but association is a relation between alleles.
Linkage and Linkage Disequilibrium (2) ($\theta$ = recombination fraction) No linkage: $\theta = 0.5$ Perfect linkage: $\theta = 0$ Linkage disequilibrium: $0 \le \rho \le 1$ ($\rho$ = probability of allelic association) Linkage equilibrium: $\rho = 0$ Complete linkage disequilibrium: $\rho = 1$
Allelic Association (LD) Morton et al. (2001)
                     Locus B
Locus A              Allele 1      Allele 2      Allele frequency
Allele 1             $\pi_{11}$    $\pi_{12}$    $Q$
Allele 2             $\pi_{21}$    $\pi_{22}$    $1-Q$
Allele frequency     $R$           $1-R$         1
A, B: diallelic loci; $\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}$: haplotype frequencies; $\rho$: association probability
Measures of LD Covariance $D = |\pi_{11}\pi_{22} - \pi_{12}\pi_{21}|$ Association $\rho = D / [Q(1-R)]$ All other measures are functions of $Q$, $R$, $\rho$.
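A minimal numeric sketch of the two measures, assuming the four haplotype frequencies are written p11, p12, p21, p22, that Q and R are the frequencies of allele 1 at loci A and B, and that the alleles are labeled so that Q(1 - R) is the appropriate denominator. The labeling conventions of Morton et al. are not checked here, and the function name is hypothetical.

```python
def ld_measures(p11, p12, p21, p22):
    """Return (D, rho) from the four haplotype frequencies of two diallelic loci."""
    total = p11 + p12 + p21 + p22
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    Q = p11 + p12  # frequency of allele 1 at locus A
    R = p11 + p21  # frequency of allele 1 at locus B
    D = abs(p11 * p22 - p12 * p21)  # covariance
    rho = D / (Q * (1 - R))         # association probability
    return D, rho


# Haplotype 12 is absent: complete linkage disequilibrium.
D_complete, rho_complete = ld_measures(0.2, 0.0, 0.3, 0.5)
# Haplotype frequencies equal the products of allele frequencies: equilibrium.
D_equil, rho_equil = ld_measures(0.1, 0.1, 0.4, 0.4)
```

The first call gives an association probability of 1 and the second gives 0, matching the two extremes of the scale.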
New Findings on Linkage Disequilibrium In the chromosome, there are blocks of limited haplotype diversity in which more than 80% of a global human sample can typically be characterized by only three common haplotypes (Patil et al., Science 2001). Haplotype blocks are the more precise units to reflect genetic variation. Identification of haplotype structure, i.e., construction of a haplotype map, provides a basis for accurate and efficient association studies.
Daly et al. (2001). LD by distance from two markers
Association Analysis So, what information do we need? For the group under study: Relationships between individuals (family structure) For each individual in the study: Disease status (or health status) Environmental information Haplotypes for the chromosomes where the candidate gene lies
The Problem It’s not yet easy to measure an individual’s (only two) haplotypes Molecular haplotyping (nucleotide sequencing) is the gold standard A more efficient strategy: Focus on regions, such as certain genes Estimate haplotypes from SNP data (genotypes) Use LD map, and reduce the number of loci to represent the haplotype Use haplotype map (DB) = key SNPs + haplotype blocks with strong LD
Haplotyping: Phase Problem Diploid, two loci (SNP1, SNP2) Observed: SNP1 G/T, SNP2 A/C Possible haplotypes: GA, TC or GC, TA n SNPs: 2^n possible haplotypes
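The ambiguity can be made concrete with a short Python sketch that lists every haplotype pair consistent with an unphased genotype. The input layout (one pair of alleles per SNP) and the function name are assumptions made for illustration.

```python
from itertools import product


def possible_haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per SNP.
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # At each SNP, one allele goes to the first chromosome
        # and the other allele goes to the second.
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(frozenset((h1, h2)))
    return pairs


# The slide's example: SNP1 is G/T and SNP2 is A/C.
observed = [("G", "T"), ("A", "C")]
resolutions = possible_haplotype_pairs(observed)
```

Each additional heterozygous SNP doubles the number of candidate pairs, which is why phase cannot be read directly from genotype data.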
Molecular Haplotyping Hetero-duplex analysis, mismatch detection, allele-specific PCR: Have potential to get high-throughput Only practical for short haplotypes (2-5 kb vs. 50-100kb) Costly Rolling Circle amplification method, etc: Can handle larger size Difficult to automate
In-silico Haplotyping Alias: Haplotype Reconstruction, Haplotype Inference, Computational Haplotyping, Statistical Haplotyping, etc. Advantages: Cost effective High-throughput Difficulty: Phase ambiguity: the number of possible haplotypes increases exponentially with the number of SNPs
In-silico Haplotyping: Two Tasks I. Reconstruction of the haplotypes of the sampled individuals II. Estimation of haplotype frequencies in a population
In-silico Haplotyping: Approaches Clark’s algorithm E-M algorithm (expectation-maximization algorithm) Bayesian algorithm
Haplotype Inference A: SNP data: 0 (MM), 1 (Mm), 2 (mm) for a single locus B: Haplotype data: 0(M), 1 (m) for a single locus
Individual   Haplotype pair   Haplotypes shown
#1           1, 2             00000, 00100
#2           1, 3             00010
#3           1, 4             01001
#4           1, 5             00001
#5           1, 1
#6           1, 1
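To show how haplotype frequencies can be estimated from unphased SNP data, here is a minimal E-M sketch in Python. It uses the 0/1/2 genotype coding and the 0/1 haplotype coding described above, and it assumes random mating (Hardy-Weinberg equilibrium) and a uniform starting point. It is not the implementation of any particular haplotyping program, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a 0/1/2 genotype vector."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [0 if g == 0 else 1 for g in genotype]  # het loci are overwritten below
    pairs = []
    # Fix the phase of the first heterozygous locus so no pair is listed twice.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(base), list(base)
        for k, locus in enumerate(het):
            allele = 0 if k == 0 else bits[k - 1]
            h1[locus], h2[locus] = allele, 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies from unphased genotypes by E-M."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start (an assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pair_lists:
            # E-step: probability of each phase resolution given current frequencies.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq


# Hypothetical two-SNP sample: six unambiguous subjects and one double heterozygote.
sample = [(0, 0)] * 3 + [(2, 2)] * 3 + [(1, 1)]
estimated = em_haplotype_frequencies(sample)
```

In this sample the unambiguous subjects carry only haplotypes 00 and 11, so the estimate resolves the double heterozygote as 00/11 rather than 01/10. E-M can converge to a local maximum, so several starting points are usually tried.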
Example Data 169 cases, 231 controls 11 haplotypes Sex and age information
Logistic Regression Results Without adjusting for age, sex: Haplotype 7 is most strongly associated, but not statistically significant (p=0.07) Adjusting for age, sex: Haplotype 11 is most strongly associated (p=0.03) Slightly stronger association when accounting for repeated measures (2 haplotypes per person) with the GEE procedure (p=0.02)
Other Examples
Drysdale et al. PNAS 2000, 97(19) 10483–10488
Wallenstein, Hodge, and Weston, Genetic Epidemiology 15:173–181 (1998)
Cohort study Case-control study
Shaw et al. Am J of Medical Genet 114 205-213 (2002)
Introduction to IMT2000
[Figure: the "Ultimate Human Genome" data layout. Biology side: chromosomes 1, 2, ..., functional or structural units (coding regions, genes, ...), SNP 1 ... SNP i, sequence alignment, Id and naming. Epidemiology side: subjects (Sub 1, ...), exposure, ethnic group, various phenotypes. Applications: genetic epidemiology, genomic epidemiology, medicine.]
Thank you! Email: hokim@snu.ac.kr This file is available at http://plaza.snu.ac.kr/~hokim under Open Classroom (열린 강의실), Seminar Materials (세미나자료)