Bioinformatics: A Statistician’s Perspective


Bioinformatics: A Statistician’s Perspective. Ho Kim, School of Public Health, Seoul National University

Contents: Case-crossover design; Microarray data analysis; SNP and haplotype analysis

Case-crossover design: Case-control + crossover. A popular tool for estimating the effects of environmental and occupational exposures on acute outcomes. Only cases are sampled; estimates are based on within-subject comparisons of exposures at failure times vs. control times.

[Figure: exposure over time for subject Id=A, with the control and case periods and the exposures E1 and E2 marked.]

Controls for time-invariant confounders by design. Conditional logistic regression and matching can be applied. Problems: confounding by time-varying factors -> selection bias; time trends in the exposure of interest -> bias.
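The within-subject comparison can be illustrated with a minimal sketch. It assumes the simplest setting: each case contributes one case period and one control period, and exposure is binary. Under that assumption the conditional logistic regression estimate of the odds ratio reduces to the ratio of the two kinds of discordant pairs. The function name and the data are hypothetical, chosen only for illustration.

```python
def matched_pair_odds_ratio(pairs):
    """Odds ratio from 1:1 matched (case period, control period) exposure pairs.

    pairs: iterable of (exposed_at_case_time, exposed_at_control_time) booleans.
    Concordant pairs carry no information; only discordant pairs are counted.
    """
    case_only = sum(1 for case, ctrl in pairs if case and not ctrl)
    control_only = sum(1 for case, ctrl in pairs if ctrl and not case)
    if control_only == 0:
        raise ValueError("odds ratio undefined: no pair is exposed only at control time")
    return case_only / control_only


# Hypothetical data: 100 subjects, each observed at a case time and a control time.
pairs = (
    [(True, False)] * 30    # exposed at case time only
    + [(False, True)] * 20  # exposed at control time only
    + [(True, True)] * 10   # concordant, ignored
    + [(False, False)] * 40 # concordant, ignored
)
print(matched_pair_odds_ratio(pairs))
```

Because each subject serves as their own control, any characteristic that does not change between the two periods cancels out of the comparison.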

References, case-crossover design:
Maclure (1991) The case-crossover design: a method for studying transient effects on the risk of acute events, Am J Epi 133:144-153
Mittleman, Maclure, Robins (1995) Control sampling strategies for case-crossover studies: an assessment of relative efficiency, Am J Epi 142:91-98
Lee, Kim, Schwartz (2000) Bidirectional case-crossover studies of air pollution: bias from skewed and incomplete waves, Environ Health Persp 108:1107-1115
Hong, Rha, Lee, Kwon, Ha, Kim, Ischemic stroke onset associated with decrease in temperature, Epidemiology (in press)

Microarray Data Analysis: Simultaneous observation of expression for a large number of genes. Contributes to the understanding of gene regulation and interaction. Full Yeast Genome on a Chip (Brown Lab, Stanford Univ.)

Microarray Experiment cDNA Microarray Experiment using Apo AI

Microarray & Statistical Method: Biological question (differences in gene expression patterns; classification of genes and samples); experimental design; microarray experiment; image processing; normalization; estimation, testing, clustering, classification; validation; biological interpretation and verification. Statistical problems arise in all the procedures.

Experimental Design Reference Design Loop Design

Image Analysis

A Statistician’s tip: Use a robust (nonparametric) estimator.

data            mean    median
1,2,3,4         2.5     2.5
1,2,3,4,1000    202     3
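A quick check of the tip, using only the Python standard library: a single outlier moves the mean a long way while the median barely changes. The variable names are illustrative.

```python
from statistics import mean, median

clean = [1, 2, 3, 4]
with_outlier = [1, 2, 3, 4, 1000]

# The mean is pulled toward the outlier; the median is not.
for data in (clean, with_outlier):
    print(data, "mean =", mean(data), "median =", median(data))
```

The same reasoning is why spot intensities in image analysis are often summarized with medians rather than means.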

Steps in Image analysis: Addressing: locate centers. Segmentation: classify pixels as either signal or background. Information extraction: for each spot of the array, calculate signal intensity pairs, background and quality measures.

Normalization: Identify and remove the technical (systematic) variation that affects measured expression levels in microarray data.

M&A Plot: M = log2(R / G), A = log2(R*G) / 2, where R and G are the red and green channel intensities of a spot.

Global normalization: add a constant to M to make its mean zero. Intensity-dependent normalization: add values that depend on the intensity A.
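A minimal sketch of the M and A values and of global normalization, assuming two lists of positive red and green intensities. The function names and the intensities are hypothetical. Intensity-dependent normalization would subtract a smooth curve fitted to M as a function of A instead of a single constant, and is not implemented here.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists with M = log2(R / G) and A = log2(R * G) / 2."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(r * g) / 2 for r, g in zip(red, green)]
    return m, a


def global_normalize(m):
    """Shift all M values by one constant so that their mean is zero."""
    shift = sum(m) / len(m)
    return [x - shift for x in m]


# Hypothetical spot intensities for four genes.
red = [1200.0, 800.0, 3000.0, 450.0]
green = [600.0, 800.0, 750.0, 450.0]

m, a = ma_values(red, green)
m_norm = global_normalize(m)
print(m)       # log-ratios before normalization
print(m_norm)  # log-ratios centered at zero
```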

Print-tip normalization

Effect of Normalization: Print-tip normalization (just location); Location + scale normalization

Cluster Analysis: K-means; Hierarchical clustering; SOM (Self-Organizing Map); Gene shaving; SVM (Support Vector Machine)

K-Means Clustering: Select k initial seeds. k is predetermined.

K-Means Clustering: For each data point, assign a cluster. Calculate each cluster mean. Each seed is changed to the mean of its cluster.

K-Means Clustering: Repeat until the seeds don’t change.
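The three K-means slides can be sketched as follows. This is a plain-Python illustration, not an optimized implementation: the caller supplies the initial seeds, points are tuples of floats, distance is squared Euclidean, and an empty cluster keeps its previous seed (an assumption the slides do not address). All names and data are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, seeds, max_iter=100):
    """Return (centers, labels) after iterating until the seeds stop changing."""
    centers = [tuple(s) for s in seeds]
    k = len(centers)
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest seed.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Move each seed to the mean of the points assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep the old seed
        if new_centers == centers:  # seeds did not change: stop
            break
        centers = new_centers
    return centers, labels


# Hypothetical data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, labels = k_means(points, seeds=[points[0], points[3]])
print(labels)
print(centers)
```

The result depends on the initial seeds, so in practice the procedure is often run from several starting configurations.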

Hierarchical Clustering: An evolutionary tree. [Figure: tree with leaves Hypohippus osborni, Nannippus minor, Neohipparion occidentale, Calippus placidus, Pliohippus mexicanus, Merychippus seversus, Merychippus secundus, Parahippus pristinus, Mesohippus barbouri.]

We need a measure of similarity or distance (dissimilarity). [Figure: agglomerative clustering of the items a, b, c, d, e proceeds in steps 0 to 4, forming {a,b}, {d,e}, {c,d,e} and finally {a,b,c,d,e}; divisive clustering runs through the same steps in the reverse direction, 4 to 0.]

Group average distance between Cluster A = {1, 2} and Cluster B = {3, 4, 5}: dAB = (d13 + d14 + d15 + d23 + d24 + d25) / 6
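The group average distance is the mean of all pairwise distances between a point of one cluster and a point of the other. A short sketch, assuming Euclidean distance and hypothetical one-dimensional coordinates for the five points:

```python
import math
from itertools import product


def group_average_distance(cluster_a, cluster_b):
    """Mean Euclidean distance over all (point in A, point in B) pairs."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


# Hypothetical coordinates: points 1, 2 in cluster A and points 3, 4, 5 in cluster B.
cluster_a = [(0.0,), (1.0,)]
cluster_b = [(3.0,), (4.0,), (5.0,)]

# Six pairwise distances: 3, 4, 5, 2, 3, 4; their mean is 21 / 6.
print(group_average_distance(cluster_a, cluster_b))
```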

Dendrogram indicating two groups

Dendrogram: Hierarchical clustering (Eisen et al. 1998). Clustered display of data from time course of serum stimulation of primary human fibroblasts.

SNP and Haplotype Analysis: SNP (Single Nucleotide Polymorphism); Haplotypes; Linkage & linkage disequilibrium; Association study design; SNP vs. haplotype for association study; Haplotype estimation; Data analysis

SNPs in the Human Genome: All humans share 99.9% of the same genetic sequence. SNPs occur about every 1000 base pairs. 90% of human genome variation comes from SNPs. The human genome contains more than 2 million SNPs. ~21,000 SNPs are found in genes. SNPs are not evenly spaced along the sequence: there are SNP-rich regions and SNP-poor regions.

SNP Locations: SNPs within a gene may alter protein structures. SNPs in the regulatory region outside a gene may affect when and how the gene is turned on, which affects the quantity of the protein produced. SNPs not within the vicinity of a gene serve as genetic markers for locating disease-causing genes.

SNPs as DNA Landmarks: Help in DNA sequencing. Help in the discovery of genes responsible for many major diseases: asthma, diabetes, heart disease, schizophrenia and cancer, among others.

SNP & Haplotype: SNP: Single Nucleotide Polymorphism. Haplotype: a set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination). Example: the set of SNP alleles G, A, C forms a SNP haplotype.

From SNP to Haplotype: SNPs are simple to measure and understand. Haplotypes have the advantage, in the appropriate circumstances, of carrying more information about the genotype-phenotype link than do the underlying SNPs.

[Figure: DNA sequences GATATTCGTACGGA-T, GATGTTCGTACTGAAT and GATATTCGTACGGAAT, shown with labels 1 to 6 and the phenotypes black eye, brown eye and blue eye. Haplotypes and their frequencies: AG- 2/6, GTA 3/6, AGA 1/6.]
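The step from aligned sequences to haplotypes can be sketched in a few lines. The sketch assumes the sequences are already aligned and of equal length; it finds the columns where the sequences differ and reads off the alleles at those columns. Applied to the three sequences on this slide, it recovers the haplotype labels AG-, GTA and AGA. The function names are hypothetical.

```python
def variable_sites(sequences):
    """0-based positions at which the aligned sequences are not all identical."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def haplotypes(sequences):
    """Alleles of each sequence at the variable sites, joined into one string."""
    sites = variable_sites(sequences)
    return ["".join(seq[i] for i in sites) for seq in sequences]


sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGAAT",
]

print(variable_sites(sequences))  # positions that differ between sequences
print(haplotypes(sequences))      # one haplotype string per sequence
```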

Linkage and Linkage Disequilibrium (1) Linkage: the tendency of genes or other DNA sequences at specific loci to be inherited together as a consequence of their physical proximity on a single chromosome. Linkage disequilibrium (allelic association): particular alleles at two or more neighboring loci show allelic association if they occur together with frequencies significantly different from those predicted from the individual allele frequencies. Linkage is a relation between loci, but association is a relation between alleles.

Linkage and Linkage Disequilibrium (2): Linkage (θ = recombination fraction): no linkage, θ = 0.5; perfect linkage, θ = 0. Linkage disequilibrium (ρ = probability of allelic association): 0 ≤ ρ ≤ 1; linkage equilibrium, ρ = 0; complete linkage disequilibrium, ρ = 1.

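Allelic Association (LD), Morton et al. (2001). [Table: 2 × 2 table of haplotype frequencies π11, π12, π21, π22 for Locus A (allele 1, allele 2) by Locus B (allele 1, allele 2), with the allele frequencies as row and column margins and a grand total of 1.] A, B: diallelic loci; π11, π12, π21, π22: haplotypes; ρ: association probability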

Measures of LD: Covariance: D = |π11 π22 - π12 π21|. Association: ρ = D / [Q(1 - R)]. All other measures are functions of Q, R, ρ.
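A minimal sketch of the two measures, computed from the four haplotype frequencies. Q and R are passed in explicitly because the transcript does not preserve their definitions; the usage example takes them to be the frequency of allele 1 at each locus (the row and column margins), which is an assumption. The function name and the frequencies are hypothetical.

```python
def ld_measures(pi11, pi12, pi21, pi22, q, r):
    """Return (D, rho) with D = |pi11*pi22 - pi12*pi21| and rho = D / (Q * (1 - R))."""
    total = pi11 + pi12 + pi21 + pi22
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    d = abs(pi11 * pi22 - pi12 * pi21)
    rho = d / (q * (1 - r))
    return d, rho


# Hypothetical haplotype frequencies.
pi11, pi12, pi21, pi22 = 0.4, 0.1, 0.1, 0.4
q = pi11 + pi12  # assumed: frequency of allele 1 at locus A
r = pi11 + pi21  # assumed: frequency of allele 1 at locus B

d, rho = ld_measures(pi11, pi12, pi21, pi22, q, r)
print(d, rho)
```

With these assumed margins, equal haplotype frequencies of 0.25 give ρ = 0 (linkage equilibrium), and frequencies (0.5, 0, 0, 0.5) give ρ = 1 (complete linkage disequilibrium).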

New Findings on Linkage Disequilibrium In the chromosome, there are blocks of limited haplotype diversity in which more than 80% of a global human sample can typically be characterized by only three common haplotypes (Patil et al., Science 2001). Haplotype blocks are the more precise units to reflect genetic variation. Identification of haplotype structure, i.e., construction of a haplotype map, provides a basis for accurate and efficient association studies.

Daly et al. (2001). LD by distance from two markers

Association Analysis: So, what information do we need? For the group under study: relationships between individuals (family structure). For each individual in the study: disease status (or health status); environmental information; haplotypes for the chromosomes where the candidate gene lies.

The Problem: It’s not yet easy to measure an individual’s (only two) haplotypes. Molecular haplotyping (nucleotide sequencing) is the gold standard. A more efficient strategy: focus on regions, such as certain genes; estimate haplotypes from SNP data (genotypes); use an LD map, and reduce the number of loci needed to represent the haplotype; use a haplotype map (DB) = key SNPs + haplotype blocks with strong LD.

Haplotyping: Phase Problem: Diploid individual, two SNPs (SNP1, SNP2). Observed genotypes: SNP1 G/T, SNP2 A/C. Possible haplotype pairs: GA, TC or GC, TA. n SNPs => 2^n possible haplotypes.
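The phase ambiguity can be made concrete by enumerating every haplotype pair that is consistent with an unphased genotype. This is a brute-force sketch for illustration, with hypothetical names; it is practical only for a small number of SNPs, which is the point of the slide.

```python
from itertools import product


def possible_haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per SNP.
    """
    pairs = set()
    # For each SNP, choose which of its two alleles goes on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


genotype = [("G", "T"), ("A", "C")]  # SNP1 G/T, SNP2 A/C
print(possible_haplotype_pairs(genotype))
print(2 ** len(genotype))  # number of possible haplotypes for n biallelic SNPs
```

With k heterozygous SNPs, the number of distinct unordered pairs is 2^(k-1), so each additional heterozygous site doubles the ambiguity.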

Molecular Haplotyping: Hetero-duplex analysis, mismatch detection, allele-specific PCR: have potential to get high-throughput; only practical for short haplotypes (2-5 kb vs. 50-100 kb); costly. Rolling circle amplification method, etc.: can handle larger sizes; difficult to automate.

In-silico Haplotyping: Alias: Haplotype Reconstruction, Haplotype Inference, Computational Haplotyping, Statistical Haplotyping, etc. Advantages: cost effective; high-throughput. Difficulty: phase ambiguity; the number of possible haplotypes increases exponentially with the number of SNPs.

In-silico Haplotyping: Two Tasks: I. Reconstruction of the haplotypes of the sampled individuals. II. Estimation of haplotype frequencies in a population.

In-silico Haplotyping: Approaches: Clark’s algorithm; E-M algorithm (expectation-maximization algorithm); Bayesian algorithm

Haplotype Inference A: SNP data: 0 (MM), 1 (Mm), 2 (mm) for a single locus B: Haplotype data: 0(M), 1 (m) for a single locus
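Using this coding, the E-M approach to estimating haplotype frequencies can be sketched as below. This is a textbook-style illustration under stated assumptions, not the method used for the results in these slides: haplotypes are assumed to pair at random (Hardy-Weinberg equilibrium), all compatible haplotype pairs are enumerated explicitly, and a fixed number of iterations is run. All names and data are hypothetical.

```python
from collections import Counter
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) whose allele counts add up to the genotype.

    genotype: tuple of 0 (MM), 1 (Mm) or 2 (mm) per locus; haplotype alleles
    are 0 (M) or 1 (m).
    """
    options = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}
    pairs = []
    for combo in product(*(options[g] for g in genotype)):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # start from equal frequencies
    n_chromosomes = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = Counter()
        for pairs in pair_lists:
            # E-step: posterior weight of each compatible pair given current freq.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequencies are the expected haplotype counts per chromosome.
        freq = {h: counts[h] / n_chromosomes for h in haps}
    return freq


# Hypothetical sample at two loci: two unambiguous individuals and one double
# heterozygote, whose phase is either 00/11 or 01/10.
genotypes = [(0, 0), (2, 2), (1, 1)]
freq = em_haplotype_frequencies(genotypes)
for hap, f in sorted(freq.items()):
    print(hap, round(f, 4))
```

In this example the unambiguous individuals make haplotypes 00 and 11 common, so the double heterozygote is resolved almost entirely as 00/11.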

Subject   Haplotype pair   Haplotypes
#1        1, 2             00000 00100
#2        1, 3             00010
#3        1, 4             01001
#4        1, 5             00001
#5        1, 1
#6        1, 1

An Example Data: 169 cases, 231 controls; 11 haplotypes; sex and age information.

Logistic Regression Results: Without adjusting for age and sex: Haplotype 7 is most strongly associated, but not statistically significant (p=0.07). Adjusting for age and sex: Haplotype 11 is most strongly associated (p=0.03). Slightly stronger association when accounting for repeated measures (2 haplotypes per person) by the GEE procedure (p=0.02).

Other Examples

Drysdale et al. PNAS 2000, 97(19) 10483–10488

Wallenstein, Hodge, and Weston, Genetic Epidemiology 15:173–181 (1998)

Cohort study Case-control study

Shaw et al. Am J of Medical Genet 114 205-213 (2002)

Introduction to IMT2000

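Ultimate Human Genome. [Figure: data layout with subjects (Sub 1, ...) as rows and, for each chromosome (1, 2, ...), SNPs (SNP1 ... SNP i) grouped into functional or structural units (coding regions, genes, ...), together with ID and naming, exposure, ethnic group and various phenotypes. Labels on the figure: sequence alignment; biology; epidemiology; genetic epidemiology; genomic epidemiology; medicine.]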

Thank you! Email: hokim@snu.ac.kr. This file is available at http://plaza.snu.ac.kr/~hokim (Open Lecture Room, Seminar Materials).