1
Bioinformatics: A Statistician’s Perspective
Ho Kim, School of Public Health, Seoul National University
2
Contents
- Case-crossover design
- Microarray data analysis
- SNP and haplotype analysis
3
Case-crossover design
- Case-control + crossover
- Popular tool for estimating the effects of environmental and occupational exposures on acute outcomes
- Only cases are sampled; estimates are based on within-subject comparisons of exposure at failure times vs. control times
4
[Figure: exposure over time for subject Id=A, with exposures E1 and E2 marked at the control time and the case time]
5
Controls for time-invariant confounders by design
- Conditional logistic regression and matching can be applied
- Problems:
  - Confounding by time-varying factors -> selection bias
  - Time trends in the exposure of interest -> bias
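As an illustration of the within-subject comparison, here is a minimal sketch for the simplest setting: one control time per case and a binary exposure. In that setting the conditional logistic regression estimate of the odds ratio reduces to the ratio of the two kinds of discordant pairs. The function name and the counts below are hypothetical and are not taken from the studies cited in this talk.

```python
import math

def matched_pair_odds_ratio(pairs):
    """Odds ratio for 1:1 matched case/control times with a binary exposure.

    pairs: one (exposed_at_case_time, exposed_at_control_time) tuple per subject.
    Only discordant pairs contribute to the conditional estimate.
    """
    case_only = sum(1 for case, ctrl in pairs if case and not ctrl)
    control_only = sum(1 for case, ctrl in pairs if ctrl and not case)
    if case_only == 0 or control_only == 0:
        raise ValueError("need discordant pairs of both kinds")
    odds_ratio = case_only / control_only
    # Large-sample standard error of log(OR) for matched pairs.
    se_log_or = math.sqrt(1 / case_only + 1 / control_only)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, (lower, upper)

# Hypothetical data: 100 subjects, each with one case time and one control time.
pairs = (
    [(True, False)] * 30      # exposed at the case time only
    + [(False, True)] * 15    # exposed at the control time only
    + [(True, True)] * 40     # concordant pairs carry no information
    + [(False, False)] * 15
)
odds_ratio, interval = matched_pair_odds_ratio(pairs)
print(odds_ratio, interval)
```

Because each subject serves as their own control, anything that does not change between the two times (sex, genotype, chronic conditions) cancels out of the comparison.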
6
References, case-crossover design
- Maclure (1992). The case-crossover design: a method for studying transient effects on the risk of acute events. Am J Epi 133:
- Mittleman, Maclure, Robins (1995). Control sampling strategies for case-crossover studies: an assessment of relative efficiency. Am J Epi 142:91-98
- Lee, Kim, Schwartz (2000). Bidirectional case-crossover studies of air pollution: bias from skewed and incomplete waves. Environ Health Persp 108:
- Hong, Rha, Lee, Kwon, Ha, Kim. Ischemic stroke onset associated with decrease in temperature. Epidemiology (in press)
7
Microarray Data Analysis
- Simultaneous observation of expression for a large number of genes
- Contributes to understanding gene regulation and interaction
- Full Yeast Genome on a Chip (Brown Lab, Stanford Univ.)
8
Microarray Experiment
cDNA Microarray Experiment using Apo AI
9
Microarray & Statistical Method
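Biological questions
- Differences in gene expression patterns
- Classification of genes and samples
Steps (statistical problems arise in all of these procedures):
- Experimental Design
- Microarray Experiment
- Image Processing
- Normalization
- Validation
- Estimation, Testing, Clustering, Classification
- Biological interpretation and verification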
10
Experimental Design
- Reference Design
- Loop Design
11
Image Analysis
12
A Statistician’s tip
Use a robust (nonparametric) estimator

data            mean    median
1,2,3,4         2.5     2.5
1,2,3,4,1000    202     3
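The two data sets on this slide can be checked directly. This small sketch uses only Python's standard `statistics` module and shows how a single outlier moves the mean but barely moves the median.

```python
from statistics import mean, median

clean = [1, 2, 3, 4]
with_outlier = [1, 2, 3, 4, 1000]

# The mean is pulled toward the outlier; the median stays near the bulk of the data.
for data in (clean, with_outlier):
    print(data, "mean =", mean(data), "median =", median(data))
```

In spot quantification the same idea applies: a few saturated or contaminated pixels can dominate a mean intensity, while a median is largely unaffected.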
13
Steps in Image analysis
- Addressing: locate the spot centers.
- Segmentation: classify each pixel as either signal or background.
- Information extraction: for each spot of the array, calculate signal intensity pairs, background, and quality measures.
14
Normalization
Identify and remove the technical (systematic) variation that affects expression levels in microarray data
15
M&A Plot
M = log2(R / G)
A = log2(R*G) / 2
16
Global normalization: add the same constant to every M value so that the mean of M becomes zero
Intensity-dependent normalization: add a correction to each M value that depends on the spot's intensity A
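A minimal sketch of both normalizations, using M = log2(R / G) and A = log2(R*G) / 2. The red and green intensities are made up for illustration. The intensity-dependent step here subtracts the median of M within bins of A, which is only a crude stand-in for the smooth curve fit (for example lowess) that is more common in practice; the binning and the function names are assumptions of this sketch.

```python
import math
from statistics import mean, median

def ma_values(red, green):
    """Return (M, A) for paired red/green spot intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def global_normalize(m):
    """Shift every M value by the same constant so that the mean is zero."""
    shift = mean(m)
    return [x - shift for x in m]

def intensity_dependent_normalize(m, a, n_bins=4):
    """Subtract the median of M within bins of A (stand-in for a smooth fit)."""
    order = sorted(range(len(m)), key=lambda i: a[i])
    bin_size = math.ceil(len(m) / n_bins)
    normalized = [0.0] * len(m)
    for start in range(0, len(order), bin_size):
        indices = order[start:start + bin_size]
        center = median(m[i] for i in indices)
        for i in indices:
            normalized[i] = m[i] - center
    return normalized

# Hypothetical spot intensities for eight spots.
red = [120.0, 300.0, 950.0, 2100.0, 4000.0, 8000.0, 15000.0, 30000.0]
green = [100.0, 260.0, 700.0, 1500.0, 2600.0, 5000.0, 8800.0, 16000.0]

m, a = ma_values(red, green)
print(global_normalize(m))
print(intensity_dependent_normalize(m, a, n_bins=4))
```

Global normalization removes an overall dye bias; the binned version also removes a bias that grows or shrinks with intensity, which a single constant cannot do.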
17
Print-tip normalization
18
Effect of Normalization
- Print-tip normalization (location only)
- Location + scale normalization
19
Cluster Analysis
- K-means
- Hierarchical Clustering
- SOM (Self-Organizing Map)
- Gene Shaving
- SVM (Support Vector Machine)
20
K-Means Clustering
- Select k initial seeds
- k is pre-determined
21
K-Means Clustering
- Assign each data point to the cluster of its nearest seed
- Calculate each cluster mean
- Move each seed to the mean of its cluster
22
K-Means Clustering
- Repeat until the seeds no longer change
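The K-means steps on the preceding slides (choose k seeds, assign each point to its nearest seed, move each seed to its cluster mean, repeat until nothing changes) can be sketched as follows. This is a minimal illustration with Euclidean distance and made-up two-dimensional points; choosing the initial seeds at random from the data is an assumption of this sketch.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Plain K-means on tuples of coordinates; returns (seeds, clusters)."""
    rng = random.Random(seed)
    seeds = rng.sample(points, k)              # k initial seeds taken from the data
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest seed.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, seeds[j]))
            clusters[nearest].append(p)
        # Update step: each seed moves to the mean of its cluster.
        new_seeds = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else seeds[j]           # keep the old seed if a cluster is empty
            for j, cluster in enumerate(clusters)
        ]
        if new_seeds == seeds:                 # stop when the seeds no longer change
            break
        seeds = new_seeds
    return seeds, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds, clusters = k_means(points, k=2)
print(seeds)
print(clusters)
```

The result can depend on the initial seeds, so in practice the procedure is often run from several starting points.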
24
Hierarchical Clustering
[Figure: an evolutionary tree with leaves Hypohippus osborni, Nannippus minor, Neohipparion occidentale, Calippus placidus, Pliohippus mexicanus, Merychippus seversus, Merychippus secundus, Parahippus pristinus, Mesohippus barbouri]
25
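We need a measure of similarity or distance (dissimilarity)
[Figure: agglomerative clustering of items a, b, c, d, e, merging a and b into {a,b}, d and e into {d,e}, then c with {d,e} into {c,d,e}, and finally everything into {a,b,c,d,e}; divisive clustering traverses the same tree in the opposite direction]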
26
Group average distance
Cluster A = {1, 2}, Cluster B = {3, 4, 5}
dAB = (d13 + d14 + d15 + d23 + d24 + d25) / 6
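The group average distance is the mean of all pairwise distances between a point in one cluster and a point in the other. The sketch below computes it and uses it in a simple agglomerative loop that keeps merging the two closest clusters. Euclidean distance, the one-dimensional coordinates, and the function names are assumptions made for illustration.

```python
import math
from itertools import product

def group_average_distance(cluster_a, cluster_b):
    """Mean of all between-cluster pairwise distances (average linkage)."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def agglomerate(points, n_clusters):
    """Start from singletons and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: group_average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical coordinates: points 1, 2 form cluster A; points 3, 4, 5 form cluster B.
cluster_a = [(0.0,), (1.0,)]
cluster_b = [(4.0,), (5.0,), (6.0,)]

d_ab = group_average_distance(cluster_a, cluster_b)   # (4 + 5 + 6 + 3 + 4 + 5) / 6
merged = agglomerate(cluster_a + cluster_b, n_clusters=2)
print(d_ab)
print(merged)
```

Recording the distance at which each merge happens gives the heights of the branches in a dendrogram.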
27
Dendrogram indicating two groups
28
Dendrogram: Hierarchical Clustering (Eisen et al. 1998)
Clustered display of data from a time course of serum stimulation of primary human fibroblasts
29
SNP and Haplotype Analysis
- SNP (Single Nucleotide Polymorphism)
- Haplotypes
- Linkage & Linkage disequilibrium
- Association study design
- SNP vs. Haplotype for association study
- Haplotype estimation
- Data analysis
30
SNPs in the Human Genome
- All humans share 99.9% of the same genetic sequence
- SNPs occur about every 1000 base pairs
- 90% of human genome variation comes from SNPs
- The human genome contains more than 2 million SNPs
- ~21,000 SNPs are found in genes
- SNPs are not evenly spaced along the sequence
  - SNP-rich regions
  - SNP-poor regions
31
SNP Locations
- SNPs within a gene
  - May alter protein structures
- SNPs in the regulatory region outside a gene
  - May affect when and how the gene is turned on, which affects the quantity of the protein produced
- SNPs not within the vicinity of a gene
  - Genetic markers for locating disease-causing genes
32
SNPs as DNA Landmarks
- Help in DNA sequencing
- Help in the discovery of genes responsible for many major diseases: asthma, diabetes, heart disease, schizophrenia, and cancer, among others
33
SNP & Haplotype
- SNP: Single Nucleotide Polymorphism
- Haplotype: a set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination).
[Figure: three SNP alleles G, A, C on one chromosome; the set of SNP alleles forms a SNP haplotype]
34
From SNP to Haplotype

DNA sequence        Haplotype   Frequency   Phenotype
GATATTCGTACGGA-T    AG-         2/6         Black eye
GATGTTCGTACTGAAT    GTA         3/6         Brown eye
GATATTCGTACGGAAT    AGA         1/6         Blue eye

- SNP: simple to measure and understand
- Haplotypes have the advantage, in the appropriate circumstances, of carrying more information about the genotype-phenotype link than do the underlying SNPs.
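To make the step from sequences to SNP haplotypes concrete, the sketch below takes the three aligned sequences shown on this slide, finds the columns where they differ (the SNP sites, with the gap treated as an allele), and reads off each sequence's haplotype. The helper names are hypothetical, and the counts 2, 3, and 1 out of 6 chromosomes are the frequencies shown on the slide.

```python
from collections import Counter

# The three aligned sequences from the slide.
sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGAAT",
]

def variable_sites(aligned):
    """0-based alignment columns where at least two sequences differ."""
    return [i for i, column in enumerate(zip(*aligned)) if len(set(column)) > 1]

def snp_haplotypes(aligned):
    """Each sequence reduced to its alleles at the variable sites."""
    sites = variable_sites(aligned)
    return ["".join(seq[i] for i in sites) for seq in aligned]

sites = variable_sites(sequences)
haps = snp_haplotypes(sequences)
print(sites, haps)

# Haplotype frequencies among 6 chromosomes, using the counts from the slide.
chromosomes = ["AG-"] * 2 + ["GTA"] * 3 + ["AGA"] * 1
frequencies = {h: n / len(chromosomes) for h, n in Counter(chromosomes).items()}
print(frequencies)
```

Only three of the sixteen positions vary, so three characters per chromosome are enough to tell these sequences apart.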
35
Linkage and Linkage Disequilibrium (1)
Linkage: the tendency of genes or other DNA sequences at specific loci to be inherited together as a consequence of their physical proximity on a single chromosome. Linkage disequilibrium (allelic association): particular alleles at two or more neighboring loci show allelic association if they occur together with frequencies significantly different from those predicted from the individual allele frequencies. Linkage is a relation between loci, but association is a relation between alleles.
36
Linkage and Linkage Disequilibrium (2)
(θ = recombination fraction)
- No linkage: θ = 0.5
- Perfect linkage: θ = 0
Linkage disequilibrium: 0 ≤ ρ ≤ 1 (ρ = probability of allelic association)
- Linkage equilibrium: ρ = 0
- Complete linkage disequilibrium: ρ = 1
37
Allelic Association (LD) Morton et al. (2001)
                     Locus B
Locus A              Allele 1   Allele 2   Allele frequency
Allele 1             π11        π12        Q
Allele 2             π21        π22        1 - Q
Allele frequency     R          1 - R      1

A, B: diallelic loci; π11, π12, π21, π22: frequencies of haplotypes 11, 12, 21, 22; ρ: association probability
38
Measures of LD
- Covariance: D = |π11 π22 - π12 π21|
- Association: ρ = D / [Q(1 - R)]
All other measures are functions of Q, R, and ρ.
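A small sketch of the two measures, assuming the four numbers are the frequencies of haplotypes 11, 12, 21, 22, that Q is the frequency of allele 1 at locus A, and that R is the frequency of allele 1 at locus B. It also assumes the alleles are labelled so that π11 π22 ≥ π12 π21, in which case ρ falls between 0 and 1. The example frequencies are made up.

```python
def ld_measures(p11, p12, p21, p22):
    """Return (D, rho) from the four haplotype frequencies of two diallelic loci."""
    total = p11 + p12 + p21 + p22
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    q = p11 + p12            # Q: frequency of allele 1 at locus A (assumed labelling)
    r = p11 + p21            # R: frequency of allele 1 at locus B (assumed labelling)
    d = abs(p11 * p22 - p12 * p21)
    rho = d / (q * (1.0 - r))
    return d, rho

# Linkage equilibrium: every haplotype frequency is the product of its allele frequencies.
d_eq, rho_eq = ld_measures(0.2, 0.2, 0.3, 0.3)       # Q = 0.4, R = 0.5

# Complete disequilibrium: haplotype 12 is absent.
d_full, rho_full = ld_measures(0.4, 0.0, 0.1, 0.5)   # Q = 0.4, R = 0.5

print(d_eq, rho_eq)
print(d_full, rho_full)
```

With the same allele frequencies in both examples, only the way alleles pair up on chromosomes changes, and ρ moves from 0 to 1.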
39
New Findings on Linkage Disequilibrium
In the chromosome, there are blocks of limited haplotype diversity in which more than 80% of a global human sample can typically be characterized by only three common haplotypes (Patil et al., Science 2001). Haplotype blocks are the more precise units to reflect genetic variation. Identification of haplotype structure, i.e., construction of a haplotype map, provides a basis for accurate and efficient association studies.
40
Daly et al. (2001). LD by distance from two markers
41
Association Analysis
So, what information do we need?
- For the group under study:
  - Relationships between individuals (family structure)
- For each individual in the study:
  - Disease status (or health status)
  - Environmental information
  - Haplotypes for the chromosomes where the candidate gene lies
42
The Problem
- It’s not yet easy to measure an individual’s (only two) haplotypes
- Molecular haplotyping (nucleotide sequencing) is the gold standard
- A more efficient strategy:
  - Focus on regions, such as certain genes
  - Estimate haplotypes from SNP data (genotypes)
  - Use an LD map, and reduce the number of loci needed to represent the haplotype
  - Use a haplotype map (DB) = key SNPs + haplotype blocks with strong LD
43
Haplotyping: Phase Problem
Observed diploid genotype: SNP1 G/T, SNP2 A/C
Possible haplotype pairs: (GA, TC) or (GC, TA)
n SNPs -> 2^n possible haplotypes
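The phase problem can be made concrete by listing every pair of haplotypes that is consistent with an unphased genotype. The sketch below does this by brute force; the function name and the representation of a genotype as a list of allele pairs are assumptions of this illustration.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    genotype: one (allele, allele) tuple per SNP, in chromosome order.
    """
    pairs = set()
    # For each SNP, choose which of its two alleles goes on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(alleles[c] for alleles, c in zip(genotype, choice))
        hap2 = "".join(alleles[1 - c] for alleles, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)

# The genotype from the slide: SNP1 is G/T and SNP2 is A/C.
pairs = compatible_haplotype_pairs([("G", "T"), ("A", "C")])
print(pairs)

# With n diallelic SNPs there are 2**n possible haplotypes.
n_snps = 5
print(2 ** n_snps)
```

An individual who is heterozygous at k SNPs has 2**(k - 1) compatible pairs, so ambiguity grows quickly with the number of heterozygous sites.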
44
Molecular Haplotyping
- Hetero-duplex analysis, mismatch detection, allele-specific PCR:
  - Have potential to get high-throughput
  - Only practical for short haplotypes (2-5 kb vs kb)
  - Costly
- Rolling Circle amplification method, etc.:
  - Can handle larger size
  - Difficult to automate
45
In-silico Haplotyping
- Alias: Haplotype Reconstruction, Haplotype Inference, Computational Haplotyping, Statistical Haplotyping, etc.
- Advantages:
  - Cost effective
  - High-throughput
- Difficulty:
  - Phase ambiguity: the number of possible haplotypes grows exponentially with the number of SNPs
46
In-silico Haplotyping: Two Tasks
I. Reconstruction of the haplotypes of the sampled individuals
II. Estimation of haplotype frequencies in a population
47
In-silico Haplotyping: Approaches
- Clark’s algorithm
- E-M algorithm (expectation-maximization algorithm)
- Bayesian algorithm
48
Haplotype Inference
A: SNP data: 0 (MM), 1 (Mm), 2 (mm) for a single locus
B: Haplotype data: 0 (M), 1 (m) for a single locus
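Using the coding above (genotype 0, 1, 2 per locus; haplotype allele 0 or 1), here is a minimal sketch of how an E-M algorithm can estimate haplotype frequencies from unphased genotypes. It assumes random mating (Hardy-Weinberg proportions), enumerates all compatible haplotype pairs, and runs a fixed number of iterations. The function names and the four example genotypes are hypothetical; this is not the procedure used for the data analysed in this talk.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """Ordered haplotype pairs consistent with a genotype string coded 0/1/2."""
    options = {
        "0": [("0", "0")],               # MM
        "1": [("0", "1"), ("1", "0")],   # Mm: phase unknown
        "2": [("1", "1")],               # mm
    }
    pairs = []
    for combo in product(*(options[g] for g in genotype)):
        hap1 = "".join(first for first, _ in combo)
        hap2 = "".join(second for _, second in combo)
        pairs.append((hap1, hap2))
    return pairs

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by E-M, assuming Hardy-Weinberg proportions."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}    # start from equal frequencies
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pair_lists:
            # E-step: posterior weight of each compatible pair for this individual.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

# Hypothetical sample of four individuals at two loci; only the last is phase-ambiguous.
genotypes = ["00", "00", "22", "11"]
freq = em_haplotype_frequencies(genotypes)
print(freq)
```

Here the double heterozygote could carry haplotypes 00 and 11 or haplotypes 01 and 10; because 00 and 11 are already seen in the unambiguous individuals, the estimate moves toward that phase.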
49
Subject   Haplotype pair   Haplotypes shown
#1        1, 2             00000, 00100
#2        1, 3             00010
#3        1, 4             01001
#4        1, 5             00001
#5        1, 1
#6        1, 1
50
An Example Data
- 169 cases, 231 controls
- 11 haplotypes
- Sex and age information
52
Logistic Regression Results
- Without adjusting for age and sex: Haplotype 7 is most strongly associated, but not statistically significant (p=0.07)
- Adjusting for age and sex: Haplotype 11 is most strongly associated (p=0.03)
- Slightly stronger association when accounting for repeated measures (2 haplotypes per person) by the GEE procedure (p=0.02)
53
Other Examples
54
Drysdale et al. PNAS 2000, 97(19) 10483–10488
57
Wallenstein, Hodge, and Weston, Genetic Epidemiology 15:173–181 (1998)
58
Cohort study Case-control study
59
Shaw et al. Am J of Medical Genet 114 205-213 (2002)
60
Introduction to IMT2000
61
Ultimate
[Diagram: the human genome as the link between biology and epidemiology. Biology: chromosomes, functional or structural units (coding regions, genes, ...), SNPs 1, 2, ..., i, sequence alignment, ID and naming. Epidemiology: subjects, various phenotypes, exposure, ethnic group. Outcomes: genetic epidemiology, genomic epidemiology, medicine.]
62
Thank you! This file is available at /~hokim (Open Lecture Room, Seminar Materials)