Bioinformatics: A Statistician’s Perspective

Ho Kim School of Public Health Seoul National University

Contents
Case-crossover design
Microarray Data Analysis
SNP and haplotype analysis

Case-crossover design
Case-control + crossover
Popular tool for estimating the effects of environmental and occupational exposures on acute outcomes
Only cases are sampled; estimates are based on within-subject comparisons of exposures at failure times vs. control times

[Figure: exposure-time line for subject Id=A, marking a control time and a case time with exposures E1 and E2]

Controls for time-invariant confounders by design
Conditional logistic regression & matching can be applied
Problems:
Confounding by time-varying factors -> selection bias
Time trends in exposure of interest -> bias
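A minimal sketch of the within-subject comparison, assuming a binary exposure and exactly one control time per case. In that special case the conditional logistic regression estimate of the odds ratio reduces to the ratio of discordant pairs. The function name and the input layout are illustrative and do not come from the slides.

```python
def matched_pair_odds_ratio(pairs):
    """Conditional-likelihood odds ratio for 1:1 matched binary exposures.

    pairs: one (exposed_at_case_time, exposed_at_control_time) tuple per subject.
    Only discordant pairs carry information; concordant pairs cancel out.
    """
    pairs = list(pairs)
    case_only = sum(1 for case, control in pairs if case and not control)
    control_only = sum(1 for case, control in pairs if control and not case)
    if control_only == 0:
        raise ValueError("no subject was exposed only at the control time")
    return case_only / control_only


# Hypothetical data: 6 subjects exposed only at the case time,
# 3 exposed only at the control time, 9 concordant subjects.
example_pairs = (
    [(True, False)] * 6
    + [(False, True)] * 3
    + [(True, True)] * 4
    + [(False, False)] * 5
)
odds_ratio = matched_pair_odds_ratio(example_pairs)  # 6 / 3
```

With continuous exposures or several control times per case, a full conditional logistic regression fit would be needed instead of this shortcut.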

References, case-crossover design
Maclure (1992) The case-crossover design: a method for studying transient effects on the risk of acute events, Am J Epi 133:
Mittleman, Maclure, Robins (1995) Control sampling strategies for case-crossover studies: an assessment of relative efficiency, Am J Epi 142:91-98
Lee, Kim, Schwartz (2000) Bidirectional case-crossover studies of air pollution: bias from skewed and incomplete waves, Environ Health Persp 108:
Hong, Rha, Lee, Kwon, Ha, Kim, Ischemic stroke onset associated with decrease in temperature, Epidemiology (in press)

Microarray Data Analysis
Simultaneous observation of expression for a large number of genes
Contributes to understanding gene regulation and interaction
Full Yeast Genome on a Chip (Brown Lab, Stanford Univ.)

Microarray Experiment
cDNA Microarray Experiment using Apo AI

Microarray & Statistical Method
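Biological questions
- Differences in gene expression patterns
- Classification of genes and samples
Workflow: Experimental Design, Microarray Experiment, Image Processing, Normalization, then Estimation, Testing, Clustering, Classification, and Validation, followed by biological interpretation and verification
Statistical problems arise in all the procedures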

Experimental Design
Reference Design
Loop Design

Image Analysis

A Statistician’s tip
Use a robust (nonparametric) estimator

data           mean   median
1,2,3,4        2.5    2.5
1,2,3,4,1000   202    3
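The effect of a single outlier on the two estimators can be checked directly. This is a small sketch using Python's standard `statistics` module; the helper name `summarize` is made up for illustration.

```python
from statistics import mean, median


def summarize(data):
    """Return (mean, median) so the two location estimators can be compared."""
    return mean(data), median(data)


clean_mean, clean_median = summarize([1, 2, 3, 4])
outlier_mean, outlier_median = summarize([1, 2, 3, 4, 1000])
# The single value 1000 moves the mean from 2.5 to 202,
# while the median only moves from 2.5 to 3.
```

The median barely reacts to the outlier, which is why robust estimators are preferred for noisy pixel intensities.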

Steps in Image analysis
Addressing: locate the centers of the spots
Segmentation: classify pixels as either signal or background
Information extraction: for each spot of the array, calculate signal intensity pairs, background, and quality measures

Normalization
Identify and remove the technical (systematic) variation that affects expression levels in microarray data

M-A Plot (R, G: red and green channel intensities)
M = log2(R / G)
A = log2(R * G) / 2

Global normalization: add a constant to every M value so that the mean of M is zero
Intensity-dependent normalization: add values that depend on the intensity A
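A short sketch of the M and A transforms and of global normalization, assuming background-corrected, strictly positive red and green intensities. The function names are illustrative. Intensity-dependent normalization, which typically fits a smooth curve of M against A, is not covered here.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists: M = log2(R / G), A = log2(R * G) / 2."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(r * g) / 2 for r, g in zip(red, green)]
    return m, a


def global_normalize(m):
    """Add one constant to every M value so that the mean of M becomes zero."""
    shift = -sum(m) / len(m)
    return [value + shift for value in m]


# Hypothetical intensities in which red is always four times green.
m, a = ma_values([4, 8, 16], [1, 2, 4])
m_normalized = global_normalize(m)
```

In this toy example every spot has M = 2, so global normalization removes the whole dye bias and leaves M = 0 everywhere.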

Print-tip normalization

Effect of Normalization
Print-tip normalization (just location)
Location + scale normalization
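The difference between location-only and location + scale normalization can be sketched as follows. This is a simplified illustration, assuming the median as the location estimate and the median absolute deviation (MAD) as the scale estimate within each print-tip group. The slides do not say which estimators were used, and the function name is hypothetical.

```python
from statistics import median


def print_tip_normalize(m_values, tips, scale=False):
    """Center M values within each print-tip group; optionally rescale by the group MAD."""
    groups = {}
    for value, tip in zip(m_values, tips):
        groups.setdefault(tip, []).append(value)

    centers = {tip: median(values) for tip, values in groups.items()}
    spreads = {
        tip: median([abs(v - centers[tip]) for v in values])
        for tip, values in groups.items()
    }

    normalized = []
    for value, tip in zip(m_values, tips):
        centered = value - centers[tip]
        # Skip rescaling for groups with zero spread to avoid dividing by zero.
        if scale and spreads[tip] > 0:
            centered /= spreads[tip]
        normalized.append(centered)
    return normalized


# Hypothetical M values from two print tips with different location and spread.
m_values = [1, 2, 3, 10, 20, 30]
tips = [0, 0, 0, 1, 1, 1]
location_only = print_tip_normalize(m_values, tips)
location_and_scale = print_tip_normalize(m_values, tips, scale=True)
```

After location-only normalization the two groups share a center but still differ in spread; adding the scale step makes the spreads comparable as well.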

Cluster Analysis
K-means
Hierarchical Clustering
SOM (Self-Organizing Map)
Gene Shaving
SVM (Support Vector Machine)

K-Means Clustering
K is pre-determined
Select k initial seeds

K-Means Clustering
For each data point, assign it to the cluster of the nearest seed
Calculate the cluster means
Each seed is moved to the mean of its cluster

K-Means Clustering
Repeat until the seeds no longer change
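The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance and user-supplied initial seeds; the function name and the `max_iter` safeguard are assumptions, not part of the slides.

```python
import math


def k_means(points, seeds, max_iter=100):
    """Basic k-means: assign points to the nearest seed, move seeds to cluster means."""
    centers = [tuple(seed) for seed in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
            for point in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j, center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members))
                )
            else:
                new_centers.append(center)  # keep a seed that attracted no points
        if new_centers == centers:  # stop when the seeds no longer change
            break
        centers = new_centers
    return centers, labels


# Hypothetical two-dimensional data with two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = k_means(points, seeds=[(0, 0), (10, 10)])
```

The result depends on the initial seeds, so in practice the procedure is usually run from several starting configurations.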

Hierarchical Clustering
[Figure: an evolutionary tree with leaves Hypohippus osborni, Nannipus minor, Neohipparion occidentale, Calippus placidus, Pliohippus mexicanus, Merychippus seversus, Merychippus secundus, Parahippus pristinus, Mesohippus barbouri]

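We need a measure of similarity or distance (dissimilarity)
[Figure: clustering of five items a, b, c, d, e. Agglomerative: a and b merge into {a,b}; d and e merge into {d,e}; c joins {d,e} to form {c,d,e}; finally all items form {a,b,c,d,e}. Divisive: the same tree read in the opposite direction.]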

Group average distance
Cluster A: points 1, 2
Cluster B: points 3, 4, 5
dAB = (d13 + d14 + d15 + d23 + d24 + d25) / 6
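The group average distance and the agglomerative merging it drives can be sketched as below. This is an illustrative implementation with hypothetical one-dimensional data; the function names and the choice of absolute difference as the distance are assumptions.

```python
from itertools import combinations


def group_average_distance(cluster_a, cluster_b, dist):
    """Average of all pairwise distances between the two clusters."""
    total = sum(dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, dist):
    """Merge the two closest clusters repeatedly; return the clusters in merge order.

    points: dict mapping an item name to its coordinate.
    """
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: group_average_distance(
                [points[n] for n in pair[0]], [points[n] for n in pair[1]], dist
            ),
        )
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append(merged)
    return merges


def absolute_difference(x, y):
    return abs(x - y)


# Cluster A = {0, 1}, cluster B = {4, 5, 6}: six pairwise distances, averaged.
d_ab = group_average_distance([0, 1], [4, 5, 6], absolute_difference)

# Hypothetical positions for five items a..e on a line.
positions = {"a": 0, "b": 1, "c": 5, "d": 8, "e": 8.5}
merge_order = agglomerate(positions, absolute_difference)
```

Reading `merge_order` from first to last gives the dendrogram from the leaves upward; reading it in reverse corresponds to the divisive view.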

Dendrogram indicating two groups

Dendrogram: Hierarchical Clustering (Eisen et al. 1998)
Clustered display of data from time course of serum stimulation of primary human fibroblasts

SNP and Haplotype Analysis
SNP (Single Nucleotide Polymorphism)
Haplotypes
Linkage & Linkage disequilibrium
Association study design
SNP vs. Haplotype for association study
Haplotype estimation
Data analysis

SNPs in the Human Genome
All humans share 99.9% of the same genetic sequence
SNPs occur about every 1000 base pairs
90% of human genome variation comes from SNPs
The human genome contains more than 2 million SNPs
~21,000 SNPs are found in genes
SNPs are not evenly spaced along the sequence: there are SNP-rich regions and SNP-poor regions

SNP Locations
SNPs within a gene: may alter protein structures
SNPs in the regulatory region outside a gene: may affect when and how the gene is turned on, which affects the quantity of the protein produced
SNPs not within the vicinity of a gene: genetic markers for locating disease-causing genes

SNPs as DNA Landmarks
Help in DNA sequencing
Help in the discovery of genes responsible for many major diseases: asthma, diabetes, heart disease, schizophrenia, and cancer, among others

SNP & Haplotype
SNP: Single Nucleotide Polymorphism
Haplotype: a set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination)
Set of SNP polymorphisms: a SNP haplotype (e.g., G A C)

From SNP to Haplotype

DNA Sequence        SNP haplotype   Frequency   Phenotype
GATATTCGTACGGA-T    AG-             2/6         Black eye
GATGTTCGTACTGAAT    GTA             3/6         Brown eye
GATATTCGTACGGAAT    AGA             1/6         Blue eye

SNP: simple to measure & understand
Haplotypes have the advantage, in the appropriate circumstances, of carrying more information about the genotype-phenotype link than do the underlying SNPs.
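The step from aligned sequences to SNP haplotypes can be illustrated with the three sequences on this slide. This is a small sketch, assuming the sequences are already aligned and of equal length; the function name is hypothetical, and positions are reported 0-based.

```python
def snp_haplotypes(sequences):
    """Find variable positions in aligned sequences and read off each haplotype."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # A position is polymorphic if more than one symbol occurs there.
    variable = [
        i for i in range(length) if len({seq[i] for seq in sequences}) > 1
    ]
    haplotypes = ["".join(seq[i] for i in variable) for seq in sequences]
    return variable, haplotypes


sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGAAT",
]
variable_positions, haplotypes = snp_haplotypes(sequences)
```

Only three of the sixteen positions vary, so each sequence is summarized by a three-symbol haplotype.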

Linkage and Linkage Disequilibrium (1)
Linkage: the tendency of genes or other DNA sequences at specific loci to be inherited together as a consequence of their physical proximity on a single chromosome.
Linkage disequilibrium (allelic association): particular alleles at two or more neighboring loci show allelic association if they occur together with frequencies significantly different from those predicted from the individual allele frequencies.
Linkage is a relation between loci, but association is a relation between alleles.

Linkage and Linkage Disequilibrium (2)
Linkage ($\theta$ = recombination fraction)
No linkage: $\theta = 0.5$
Perfect linkage: $\theta = 0$
Linkage disequilibrium: $0 \le \rho \le 1$ ($\rho$ = probability of allelic association)
Linkage equilibrium: $\rho = 0$
Complete linkage disequilibrium: $\rho = 1$

Allelic Association (LD) Morton et al. (2001)
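                     Locus B
Locus A              Allele 1     Allele 2     Allele frequency
Allele 1             $\pi_{11}$   $\pi_{12}$   $Q$
Allele 2             $\pi_{21}$   $\pi_{22}$   $1 - Q$
Allele frequency     $R$          $1 - R$      1

A, B: diallelic loci; $\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}$: haplotype frequencies; $\rho$: association probability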

Measures of LD
Covariance: $D = |\pi_{11}\pi_{22} - \pi_{12}\pi_{21}|$
Association: $\rho = D / [Q(1 - R)]$
All other measures are functions of $Q$, $R$, $\rho$.
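A small sketch of the two measures, computed from the four haplotype frequencies. It assumes that Q is the frequency of allele 1 at locus A and R the frequency of allele 1 at locus B, which is one reading of the table above rather than something the slides state explicitly. The function applies the formulas as written and does not relabel alleles, which the original convention may require for the association measure to stay within [0, 1].

```python
def ld_measures(pi11, pi12, pi21, pi22):
    """Return (D, rho) for two diallelic loci from their haplotype frequencies."""
    q = pi11 + pi12  # assumed: frequency of allele 1 at locus A
    r = pi11 + pi21  # assumed: frequency of allele 1 at locus B
    d = abs(pi11 * pi22 - pi12 * pi21)
    denominator = q * (1 - r)
    if denominator == 0:
        raise ValueError("Q(1 - R) is zero; rho is undefined")
    return d, d / denominator


# Linkage equilibrium: every haplotype frequency is the product of its margins.
d_eq, rho_eq = ld_measures(0.25, 0.25, 0.25, 0.25)

# Haplotype 12 is absent, so allele 1 at A always travels with allele 1 at B.
d_full, rho_full = ld_measures(0.5, 0.0, 0.25, 0.25)
```

The first call gives zero for both measures, while the second gives an association of 1 even though the covariance D is only 0.125, which shows why D alone is hard to compare across loci.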

New Findings on Linkage Disequilibrium
In the chromosome, there are blocks of limited haplotype diversity in which more than 80% of a global human sample can typically be characterized by only three common haplotypes (Patil et al., Science 2001). Haplotype blocks are the more precise units to reflect genetic variation. Identification of haplotype structure, i.e., construction of a haplotype map, provides a basis for accurate and efficient association studies.

Daly et al. (2001). LD by distance from two markers

Association Analysis
So, what information do we need?
For the group under study:
- Relationships between individuals (family structure)
For each individual in the study:
- Disease status (or health status)
- Environmental information
- Haplotypes for the chromosomes where the candidate gene lies

The Problem
It’s not yet easy to measure an individual’s (only two) haplotypes
Molecular haplotyping (nucleotide sequencing) is the gold standard
A more efficient strategy:
- Focus on regions, such as certain genes
- Estimate haplotypes from SNP data (genotypes)
- Use an LD map, and reduce the number of loci needed to represent the haplotype
- Use a haplotype map (DB) = key SNPs + haplotype blocks with strong LD

Haplotyping: Phase Problem
Diploid individual, two SNPs (SNP1, SNP2)
Observed: SNP1 G/T, SNP2 A/C
Possible haplotypes: GA, TC or GC, TA
n SNPs -> $2^n$ possible haplotypes
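The phase ambiguity can be made concrete by enumerating every haplotype pair that is consistent with an unphased genotype. This is an illustrative sketch; the function name and the input layout (one pair of alleles per SNP) are assumptions.

```python
from itertools import product


def possible_haplotype_pairs(genotypes):
    """List all unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one (allele, allele) tuple per SNP, in chromosome order.
    """
    heterozygous = [i for i, (a, b) in enumerate(genotypes) if a != b]
    pairs = set()
    # The orientation of the first heterozygous site is fixed so that each
    # unordered pair is produced once; the remaining sites may each be flipped.
    for flips in product([False, True], repeat=max(len(heterozygous) - 1, 0)):
        flip_at = dict(zip(heterozygous[1:], flips))
        first, second = [], []
        for i, (a, b) in enumerate(genotypes):
            if flip_at.get(i, False):
                a, b = b, a
            first.append(a)
            second.append(b)
        pairs.add(("".join(first), "".join(second)))
    return sorted(pairs)


# The example from the slide: SNP1 is G/T and SNP2 is A/C.
pairs = possible_haplotype_pairs([("G", "T"), ("A", "C")])
```

An individual with k heterozygous SNPs has 2^(k-1) compatible haplotype pairs, so the ambiguity grows quickly with the number of SNPs.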

Molecular Haplotyping
Hetero-duplex analysis, mismatch detection, allele-specific PCR:
- Have the potential to be high-throughput
- Only practical for short haplotypes (2-5 kb vs kb)
- Costly
Rolling Circle amplification method, etc.:
- Can handle larger sizes
- Difficult to automate

In-silico Haplotyping
Alias: Haplotype Reconstruction, Haplotype Inference, Computational Haplotyping, Statistical Haplotyping, etc.
Advantages:
- Cost effective
- High-throughput
Difficulty:
- Phase ambiguity: the number of possible haplotypes increases exponentially with the number of SNPs

In-silico Haplotyping: Two Tasks
I. Reconstruction of the haplotypes of the sampled individuals
II. Estimation of haplotype frequencies in a population

In-silico Haplotyping: Approaches
Clark’s algorithm
E-M algorithm (expectation-maximization algorithm)
Bayesian algorithm

Haplotype Inference
A: SNP data: 0 (MM), 1 (Mm), 2 (mm) for a single locus
B: Haplotype data: 0 (M), 1 (m) for a single locus
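Using the 0/1/2 genotype coding and the 0/1 haplotype coding above, an E-M estimate of haplotype frequencies can be sketched as follows. This is a generic textbook-style illustration under random mating, not the specific implementation used for the results in these slides; the function names, the uniform starting values, and the fixed iteration count are all assumptions.

```python
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype coded 0/1/2 per locus."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]  # alleles at homozygous loci
    pairs = []
    for flips in product([0, 1], repeat=max(len(het) - 1, 0)):
        h1, h2 = list(base), list(base)
        for k, locus in enumerate(het):
            allele = 0 if k == 0 else flips[k - 1]
            h1[locus], h2[locus] = allele, 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies from unphased genotypes by E-M."""
    expansions = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = dict.fromkeys(haplotypes, 0.0)
        for pairs in expansions:
            # E-step: posterior weight of each compatible pair given current freq.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), weight in zip(pairs, weights):
                counts[h1] += weight / total
                counts[h2] += weight / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freq


# Hypothetical two-locus sample: two unambiguous homozygotes and one
# double heterozygote whose phase has to be inferred.
estimated = em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)])
```

In this toy sample the unambiguous individuals carry only haplotypes 00 and 11, so the iterations push the double heterozygote toward the 00/11 resolution and the 01 and 10 frequencies toward zero.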

Individual   Haplotype pair   Haplotype strings
#1           1, 2             00000, 00100
#2           1, 3             00010
#3           1, 4             01001
#4           1, 5             00001
#5           1, 1
#6           1, 1

An Example Data
169 cases, 231 controls
11 haplotypes
Sex, age information

Logistic Regression Results
Without adjusting for age, sex: Haplotype 7 is most strongly associated, but not statistically significant (p=0.07)
Adjusting for age, sex: Haplotype 11 is most strongly associated (p=0.03)
Accounting for repeated measures (2 haplotypes per person) by the GEE procedure gives a slightly stronger association (p=0.02)

Other Examples

Drysdale et al. PNAS 2000, 97(19) 10483–10488

Wallenstein, Hodge, and Weston, Genetic Epidemiology 15:173–181 (1998)

Cohort study Case-control study

Shaw et al. Am J of Medical Genet 114 205-213 (2002)

Introduction to IMT2000

[Figure: "Ultimate Human Genome" data layout. Labels: Biology (chromosome, sequence alignment, functional or structural units such as coding regions and genes, SNP 1, 2, ..., SNP i); Epidemiology (Sub 1, ..., id, naming, exposure, ethnic group, various phenotypes); genetic epi, genomic epi, medicine]

Thank you!
This file is available at /~hokim (Open Lecture Room, Seminar Materials)
