Lectures 7 – Oct 19, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall.

Slides:



Advertisements
Similar presentations
CZ5225 Methods in Computational Biology Lecture 9: Pharmacogenetics and individual variation of drug response CZ5225 Methods in Computational Biology.
Advertisements

Combinatorial Algorithms for Haplotype Inference Pure Parsimony Dan Gusfield.
Note that the genetic map is different for men and women Recombination frequency is higher in meiosis in women.
METHODS FOR HAPLOTYPE RECONSTRUCTION
Multiple Comparisons Measures of LD Jess Paulus, ScD January 29, 2013.
Objectives Cover some of the essential concepts for GWAS that have not yet been covered Hardy-Weinberg equilibrium Meta-analysis SNP Imputation Review.
Basics of Linkage Analysis
Understanding GWAS Chip Design – Linkage Disequilibrium and HapMap Peter Castaldi January 29, 2013.
Association Mapping David Evans. Outline Definitions / Terminology What is (genetic) association? How do we test for association? When to use association.
MALD Mapping by Admixture Linkage Disequilibrium.
Ingredients for a successful genome-wide association studies: A statistical view Scott Weiss and Christoph Lange Channing Laboratory Pulmonary and Critical.
1 Cladistic Clustering of Haplotypes in Association Analysis Jung-Ying Tzeng Aug 27, 2004 Department of Statistics & Bioinformatics Research Center North.
More Powerful Genome-wide Association Methods for Case-control Data Robert C. Elston, PhD Case Western Reserve University Cleveland Ohio.
Computational Challenges in Whole-Genome Association Studies Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Aspects of Genetics and Genomics in Cancer Research Li Hsu Biostatistics and Biomathematics Program Fred Hutchinson Cancer Research Center.
Genome-wide association studies Usman Roshan. SNP Single nucleotide polymorphism Specific position and specific chromosome.
Introduction to Linkage Analysis March Stages of Genetic Mapping Are there genes influencing this trait? Epidemiological studies Where are those.
Single nucleotide polymorphisms Usman Roshan. SNPs DNA sequence variations that occur when a single nucleotide is altered. Must be present in at least.
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 14 Goodness-of-Fit Tests and Categorical Data Analysis.
Genomewide Association Studies.  1. History –Linkage vs. Association –Power/Sample Size  2. Human Genetic Variation: SNPs  3. Direct vs. Indirect Association.
Chi-Squared (  2 ) Analysis AP Biology Unit 4 What is Chi-Squared? In genetics, you can predict genotypes based on probability (expected results) Chi-squared.
Haplotype Discovery and Modeling. Identification of genes Identify the Phenotype MapClone.
The Chi-Square Test Used when both outcome and exposure variables are binary (dichotomous) or even multichotomous Allows the researcher to calculate a.
Lecture 5: Segregation Analysis I Date: 9/10/02  Counting number of genotypes, mating types  Segregation analysis: dominant, codominant, estimating segregation.
Haplotype Blocks An Overview A. Polanski Department of Statistics Rice University.
Computational research for medical discovery at Boston College Biology Gabor T. Marth Boston College Department of Biology
Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004.
Case(Control)-Free Multi-SNP Combinations in Case-Control Studies Dumitru Brinza and Alexander Zelikovsky Combinatorial Search (CS) for Disease-Association:
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
The medical relevance of genome variability Gabor T. Marth, D.Sc. Department of Biology, Boston College Medical Genomics Course – Debrecen,
CS177 Lecture 10 SNPs and Human Genetic Variation
Gene Hunting: Linkage and Association
Lectures 2 – Oct 3, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall.
Genome-Wide Association Study (GWAS)
National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department.
Quantitative Genetics. Continuous phenotypic variation within populations- not discrete characters Phenotypic variation due to both genetic and environmental.
Quantitative Genetics
Finnish Genome Center Monday, 16 November Genotyping & Haplotyping.
Polymorphism Haixu Tang School of Informatics. Genome variations underlie phenotypic differences cause inherited diseases.
Lab 13: Association Genetics December 5, Goals Use Mixed Models and General Linear Models to determine genetic associations. Understand the effect.
Lecture 12: Linkage Analysis V Date: 10/03/02  Least squares  An EM algorithm  Simulated distribution  Marker coverage and density.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
1 B-b B-B B-b b-b Lecture 2 - Segregation Analysis 1/15/04 Biomath 207B / Biostat 237 / HG 207B.
Lecture 3: Statistics Review I Date: 9/3/02  Distributions  Likelihood  Hypothesis tests.
N318b Winter 2002 Nursing Statistics Specific statistical tests Chi-square (  2 ) Lecture 7.
Linear Reduction Method for Tag SNPs Selection Jingwu He Alex Zelikovsky.
Association analysis Genetics for Computer Scientists Biomedicum & Department of Computer Science, Helsinki Päivi Onkamo.
February 20, 2002 UD, Newark, DE SNPs, Haplotypes, Alleles.
Association mapping for mendelian, and complex disorders January 16Bafna, BfB.
Genome wide association studies (A Brief Start)
The International Consortium. The International HapMap Project.
Practical With Merlin Gonçalo Abecasis. MERLIN Website Reference FAQ Source.
Linkage Disequilibrium and Recent Studies of Haplotypes and SNPs
Computational Biology and Genomics at Boston College Biology Gabor T. Marth Department of Biology, Boston College
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
1 1 Slide © 2008 Thomson South-Western. All Rights Reserved Chapter 12 Tests of Goodness of Fit and Independence n Goodness of Fit Test: A Multinomial.
Linkage. Announcements Problem set 1 is available for download. Due April 14. class videos are available from a link on the schedule web page, and at.
Linkage. Announcements Problem set 1 is available for download. Due April 14. class videos are available from a link on the schedule web page, and at.
Association tests. Basics of association testing Consider the evolutionary history of individuals proximal to the disease carrying mutation.
Analysis of Next Generation Sequence Data BIOST /06/2015.
The Haplotype Blocks Problems Wu Ling-Yun
Increasing Power in Association Studies by using Linkage Disequilibrium Structure and Molecular Function as Prior Information Eleazar Eskin UCLA.
Association Mapping in Families Gonçalo Abecasis University of Oxford.
Of Sea Urchins, Birds and Men
Statistical Methods for Quantitative Trait Loci (QTL) Mapping II
Haplotype Reconstruction
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Presentation transcript:

Lectures 7 – Oct 19, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 Disease Association Studies 1

2 Last Class … Haplotype reconstruction genetic markers …ACTCGGTTGGCCTTAATTCGGCCCGGACTCGGTTGGCCT AAATTCGGCCCGG … …ACCCGGTAGGCCTTAATTCGGCCCGGACCCGGTAGGCCTTAATTCGGCCCGG … …ACCCGGTTGGCCTTAATTCGGCCGGGACCCGGTTGGCCTTAATTCGGCCGGG … …ACCCGGTAGGCCTATATTCGGCCCGGACCCGGTAGGCCTATATTCGGCCCGG … …ACTCGGTAGGCCTATATTCGGCCGGGACTCGGTAGGCCTATATTCGGCCGGG … …ACCCGGTAGGCCTATATTCGGCCCGGACCCGGTAGGCCTATATTCGGCCCGG … …ACCCGGTAGGCCTAAATTCGGCCTGGACTCGGATGGCCTATATTCGGCCGGG … …ACCCGGTTGGCCTTTATTCGGCCGGGACTCGGTAGGCCT TTATTCGGCCGGG … …ACTCGGTTGGCCTAAATTCGGCCCGGACCCGGTTGGCCTTAATTCGGCCCGG … …ACTCGGTAGGCCTATATTCGGCCGGGACCCGGTTGGCCTT TATTCGGCCCGG … …ACTCGGTTGGCCTT TATTCGGCCCGGACTCGGTAGGCCTAAATTCGGCCCGG … Single nucleotide polymorphism (SNP) [snip] = a variation at a single site in DNA AT TTCG ATCTCG TT AA AT CC CG CC TT CT TATA TTTT CGCG ATAT CTCT CGCG TTTT CCCC GCGC AAAA ATAT ATAT

Outline Application to disease association analysis Single marker based association tests Haplotype-based approach Indirect association – predicting unobserved SNPs Selection of tag SNPs Genetic linkage analysis Pedigree-based gene mapping Elston-Stewart algorithm Association vs linkage 3

A single marker association test Data Genotype data from case/control individuals e.g. case: patients, control: healthy individuals Goals Compare frequencies of particular alleles, or genotypes, in set of cases and controls Typically, relies on standard contingency table tests Chi-square goodness-of-fit test Likelihood ratio test Fisher’s exact test 4

Construct contingency table Organize genotype counts in a simple table Rows: one row for cases, another for controls Columns: one of each genotype (or allele) Individual cells: count of observations 5 i: case, control j: 0/0, 0/1, 1/1 j=1j=2j=3 0/00/11/1 i=1Case (affected) O 1,1 O 1,2 O 1,3 i=2Control (unaffected) O 2,1 O 2,2 O 2,3 O ۰,1 =O 1,1 +O 2,1 O 1,۰ =o 1,1 +o 1,2 +o 1,3 O 2,۰ =o 2,1 +o 2,2 +o 2,3 O ۰,2 =O 1,2 +O 2,2 O ۰,3 =O 1,3 +O 2,3 Notation Let O ij denote the observed counts in each cell Let E ij denote the expected counts in each cell E ij = O i,۰ O ۰,j / O ۰, ۰

Goodness of fit tests (1/2) Null hypothesis There is no statistical dependency between the genotypes and the phenotype (case/control) P-value Probability of obtaining a test statistic at least as extreme as the one that was actually observed Chi-square test If counts are large, compare statistic to chi-squared distribution p = 0.05 threshold is 5.99 for 2 df (degrees of freedom, e.g. genotype test) p = 0.05 threshold is 3.84 for 1 df (e.g. allele test) If counts are small, exact or permutation tests are better 6 Degrees of freedom k

Goodness of fit tests (2/2) Likelihood ratio test The test statistics (usually denoted D) is twice the difference in the log-likelihoods: 7 How about we do this for haplotypes? When does it out-perform the single marker association test?

Haplotype association tests Calculate haplotype frequencies in each group Find most likely haplotype for each group Fill in contingency table to compare haplotypes in the two groups (case, control) Not recommended! 8

Case genotypes & haplotypes Observed case genotypes The phase reconstruction in the five ambiguous individuals will be driven by the haplotypes observed in individual 1 … 9 Inferred case haplotypes This kind of phenomenon will occur with nearly all population based haplotyping methods!

Control genotypes & haplotypes Observed control genotypes Note these are identical, except for the single homozygous individual … 10 Inferred case haplotypes Oops… The difference in a single genotype in the original data has been greatly amplified by estimating haplotypes…

Haplotype association tests Never impute haplotypes in two groups separately Alternatively, Consider both samples jointly Schaid et al (2002) Am J Hum Genet 70: Zaytkin et al (2002) Hum Hered. 53:79-91 Use maximum likelihood 11 individuals Possible haplotype pairs, conditional on genotype Haplotype pair frequency

Likelihood-based test Calculate 3 likelihoods Maximum likelihood for combined samples, L A Maximum likelihood for control sample, L B Maximum likelihood for case sample, L C df (degrees of freedom) corresponds to number of non-zero haplotype frequencies in large samples 12

Significance in small samples In reality sample sizes, it is hard to estimate the number of df accurately Instead, use a permutation approach to calculate empirical significance levels How? 13

Outline Application to disease association analysis Single marker based association tests Haplotype-based approach Indirect association – predicting unobserved SNPs Selection of tag SNPs Genetic linkage analysis Pedigree-based gene mapping Elston-Stewart algorithm Association vs linkage 14

15 G G A A A A A A G G G G G G Time Healthy controls Disease cases AG Indirect association: between proxy genotype and phenotype T T C C C C C C T T T T T T r 2 =1 r 2 : ranges between 0 and 1 1 when the two markers provide identical information 0 when they are in perfect linkage equilibrium In a typical GWAS, disease-causing SNPs have “proxies” that get high LOD scores

Pre-requisite for association studies How can we know which SNP pairs? Very dense genotype data Learn correlation between SNPs – haplotype structures Goal: dense genome-wide association scan G G A A C C T T A T T T C G G C T G G T r 2 =1 Genetic markers

17 Goal: Resource to enable genome-wide association studies Data: 420 human genomes Genomewide map: 3.8M SNPs Benchmark: “all” 17k SNPs/5Mb (ENCODE) G G A A C C T T A T T T C G G C T G G T

18 Main question for HapMap: Are genomewide association studies doable? or Do SNPs have enough proxies?

19 How many proxies will my causal SNP have? 2

20 perfect proxies G G A A A A A A G G G G G G Healthy controls Disease cases T T C C C C C C T T T T T T r 2 =1 r 2 =0.75 G A A A A A A A G G G G G G Im

21 How many proxies will my causal SNP have? % of SNPs can cover the genome

Computational challenges 22 Development of genotyping arrays Redundancy Efficiency

Optimizing SNP-set efficiency Select “tag“ SNPs that maximize the number of other SNPs whose alleles are revealed by them How? 23 G G A A C C T T A T T T C G G C T G G T      Markers tested: Markers captured: high r 2

Computational challenges 24 Development of genotyping arrays Redundancy Efficiency (tag SNP selection) Genotyping study cohort Analysis Power (predicting unobserved SNPs)

25 Analysis questions Can we quantify the coverage of common sequence variations measured by genome-wide SNP genotyping arrays? SNP genotyping arrays Arrays covering 100K/500K/1M SNPs from Affymetrix or Illumina ACTAAATACGTCAATTA/TAAATATAAGCGCTC/ACGCATCA GCAGTTAAATTTATAT CGTCAATTTAAATATA GCAGTTAATTTTATAT CGTCAATTTAAATATA ACTAAATACGTCAATTTAAATATAAGCGC DNA of individual i

Association tests with fixed markers 26 SNPs captured:     G G A A C C T T A T T T C G G C T G G T Tests of association: high r 2

Arrays cover many common alleles 27 Panel: African (most diverse)

Arrays cover many common alleles 28 Panel: European

29 Analysis questions Can we quantify the coverage of common sequence variations measured by genome-wide SNP genotyping arrays? Can we do better?

Association with haplotypes 30 SNPs captured:     G G A A C C T T A T T T C G G C T G G T Tests of association:

31 Tests of association:  SNPs captured:     Association with haplotypes G G A A C C T T A T T T C G G C T G G T TG

32 Increasing coverage (r 2 =0.8) by specified haplotypes 100k 500k Panel: African (most diverse) Panel: European

33 Which platform to use?

34 Summary Association analysis is a powerful strategy for common disease research HapMap and genomewide technologies enable whole-genome association scans

Acknowledgement These lecture notes were generated based on the slides from Prof. Itsik Pe’er (Columbia CS). 35