1 Genes and MS in Tasmania, completed. Lecture 7, Statistics 246 February 12, 2004.


1 Genes and MS in Tasmania, completed. Lecture 7, Statistics 246 February 12, 2004

2 Towards a sharing statistic Our aim was to come up with a statistic that effectively describes haplotype sharing differences between case and “control” haplotypes. The sharing statistic should be largest at markers closest to a disease locus, because there (i) haplotype sharing should extend the furthest, and (ii) the association of disease with particular haplotypes should be strongest.

3 Nonparametric haplotype sharing analysis Why nonparametric, rather than likelihood-based methods? Likelihood methods make assumptions regarding the genealogy of the population, and we don’t know how many of these assumptions are robust to violations. Likelihood methods are computationally intensive, especially for genome-wide scans, where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC). Likelihood methods have a hard time at the HLA region, because the LD there is extremely high and non-uniform (block-like structure). Simpler statistics will probably do better here, unless we can model background LD.

4 Haplotype sharing statistics for genome-wide scan data cf. fine mapping Previous (usually likelihood-based) statistics have concentrated on fine mapping and the exact localization of a variant allele. They assume a signal exists. For us, localization was not the primary interest. Rather, detection was our main interest, using a genome-wide scan. We needed something that was not as computationally intensive as DHSMAP (McPeek & Strahs, 1999), BLADE (Liu et al, 2001), DMLE+ (Rannala & Reeve, 2001), or the shattered coalescent (Morris et al, 2002).

5 Haplo_clusters (Melanie Bahlo) Calculates a sharing statistic at every marker. Obtains a p-value at every marker using a permutation test. Allows for several clusters of ancestral haplotypes (allelic heterogeneity).

6 Testing for shared haplotypes [Figure: case and control haplotypes; score for haplotype sharing (-log p) plotted from pter to qter]

7 Sharing drop-off & allelic heterogeneity [Figure: proportions of cases and proportions of controls by marker; legend: cluster 1 haplotypes, cluster 2 haplotypes, neither cluster 1 nor 2 haplotypes]

8 Haplo_cluster in action Example: sorting on marker 1 for a sample of 3 case and 4 control haplotypes. After sorting on the haplotype consisting only of marker 1, calculate a chi-square statistic, and move on.
Haplotype 1 2 3
Controls 3 0 1
Cases 0 3 0
After sorting on the haplotype consisting of marker 1 and marker 2, calculate a chi-square statistic, and so on.
Controls 1 1 1 0 1
Cases 0 0 0 3 0
Eventually stop, and sum the chi-square statistics. Then repeat for a suitably large number of random permutations of cases and controls.
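The chi-square step on this slide can be illustrated with a minimal sketch. It assumes the ordinary Pearson chi-square on a cases-by-haplotype count table; the slides do not say exactly which form Haplo_clusters uses, so the function name `pearson_chi_square` and the choice of statistic are assumptions. The two tables read the slide's counts as Controls 3, 0, 1 / Cases 0, 3, 0 and Controls 1, 1, 1, 0, 1 / Cases 0, 0, 0, 3, 0.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # an empty row or column contributes nothing
                stat += (observed - expected) ** 2 / expected
    return stat


# Rows: controls, cases. Columns: distinct haplotypes after sorting.
three_classes = [[3, 0, 1], [0, 3, 0]]
five_classes = [[1, 1, 1, 0, 1], [0, 0, 0, 3, 0]]

chi_three = pearson_chi_square(three_classes)
chi_five = pearson_chi_square(five_classes)

# The slide says the per-step statistics are eventually summed.
summed = chi_three + chi_five
print(chi_three, chi_five, summed)
```

In both toy tables every haplotype class is carried only by cases or only by controls, so each table gives the same value even though one partition is finer than the other.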

9 Statistic to evaluate haplotype sharing The sharing statistic is χ²-based, using the idea of multiple ancestral haplotypes (clusters) which are “grown” starting at each marker examined in the scan. Significance is evaluated via a permutation test: choose a random permutation of the pooled cases and controls, and recalculate the statistic; repeat ~20,000 times. A recursive form for the estimator and the SD of the p-value was used, to enable early termination of the program.

10 The permutation test The idea is this. We have 170 cases and 105 controls, and at any particular locus, we calculate the value of our statistic, calling it S. Now pool our cases and controls into 275 individuals, and sample 170 to be “cases” at random from the 275, calling the remainder “controls”. For this first artificial set of cases and controls, calculate the value of our statistic, S_1 say. Next, we repeat this procedure 9,999 more times, say, obtaining values S_2, S_3, S_4, …, S_{10,000}. As long as 10,000 is sufficiently many random permutations, we can get a good estimate of the p-value of our initial statistic relative to our empirically estimated null distribution, as p = #{i: S_i > S}/10,000.
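The procedure on this slide can be sketched in a few lines. This is an illustration, not the Haplo_clusters code: the toy statistic `mean_difference`, the per-individual numeric values, and the function names are all assumptions standing in for the haplotype sharing statistic. Shuffling the case/control labels keeps the number of “cases” at 170, which matches sampling 170 of the 275 at random.

```python
import random


def mean_difference(values, labels):
    """Toy statistic: absolute difference between case and control means."""
    cases = [v for v, is_case in zip(values, labels) if is_case]
    controls = [v for v, is_case in zip(values, labels) if not is_case]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(values, labels, statistic, n_perm=10_000, seed=0):
    """Estimate p = #{i: S_i > S} / n_perm by relabelling cases and controls."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment, same number of cases
        if statistic(values, shuffled) > observed:
            exceed += 1
    return exceed / n_perm


# Hypothetical data: 170 cases and 105 controls, as on the slide.
labels = [True] * 170 + [False] * 105
data_rng = random.Random(1)
null_values = [data_rng.random() for _ in labels]        # no case-control difference
separated = [1.0 if is_case else 0.0 for is_case in labels]  # extreme difference

p_null = permutation_p_value(null_values, labels, mean_difference, n_perm=1000)
p_separated = permutation_p_value(separated, labels, mean_difference, n_perm=1000)
print(p_null, p_separated)
```

The strict inequality follows the slide's definition; a common variant counts ties and adds one to numerator and denominator so that the estimate is never exactly zero.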

11 Exercises 1. How should we decide what number of resamplings is large enough? 2. Explain, in the simple case of a 2 × 2 table of cases and controls cross-classified as diseased and healthy, how using all possible resamplings, rather than a fixed-size random sample, leads to the p-value for the exact test. 3. To avoid carrying out an unnecessarily large number of permutations, the proportion of resampled values of our statistic exceeding the value S can be monitored. Can you describe a stopping rule for the random resamplings that should lead to “accurate enough” p-values, without going to the full number each time?
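One possible stopping rule for exercise 3 is sketched below, offered as an assumption rather than as the rule the program uses: keep a running count of exceedances, estimate the binomial standard deviation of the running p-value, and stop once that standard deviation falls below a chosen fraction of the estimate. The names `sequential_p_value`, `rel_sd`, `min_perms` and `max_perms` are hypothetical, and `draw_exceeds` stands in for "run one permutation and report whether S_i > S".

```python
import math
import random


def sequential_p_value(draw_exceeds, rel_sd=0.1, min_perms=100, max_perms=20_000):
    """Run permutations until sd(p_hat) <= rel_sd * p_hat, or max_perms is reached.

    Returns the estimated p-value and the number of permutations used.
    """
    exceed = 0
    n = 0
    while n < max_perms:
        n += 1
        if draw_exceeds():
            exceed += 1
        if n >= min_perms and exceed > 0:
            p_hat = exceed / n
            sd = math.sqrt(p_hat * (1.0 - p_hat) / n)  # binomial SD of p_hat
            if sd <= rel_sd * p_hat:
                break
    return exceed / n, n


# Hypothetical locus with a large true p-value: stops early.
rng = random.Random(0)
p_large, used_large = sequential_p_value(lambda: rng.random() < 0.5)

# Hypothetical locus where no permutation exceeds S: runs to the cap.
p_zero, used_zero = sequential_p_value(lambda: False, max_perms=500)
print(p_large, used_large, p_zero, used_zero)
```

Under a rule of this kind, uninteresting loci with large p-values finish after few permutations, while the small p-values of interest receive the full budget.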

12 Haplo_clusters - Output
-opt 1 Genetic distances used to decide order of markers to sort on
-c 1 The number of clusters of haplotypes to look for = 1
-miss 1 The missing data is replaced randomly using the 2 marker haplotype information.
-share 5 The number of haplotypes needed to share = 5
The standard deviation p values are calculated to 0.01*phat.
Marker names have been provided and will be used in the output files.
# of case haplotypes = 338
# of contol haplotypes = 208
# of markers = 11
# of perms =
[Table: one row for each of the 11 D21S markers, with columns Marker, Mapdistance, Chi_Square, p, sd(p), -log(p), perms; numeric values not recoverable]

13 Haplo_clusters - Output II
Table of haplotypes
[Table: columns Marker, Cluster, Haplotype, Length(Haplotype); haplotype columns D21S1911, D21S1904, D21S1899, D21S1922, D21S1884, D21S1914, D21S263, D21S125; each D21S row reports "# of haplos" and "Chi-square"; numeric values not recoverable]
Time taken (m) = 55, 23/6/2003, 11:15:12 Haplo_cluster.pl $Revision:1.15$

14 Output for Chromosome 6 HLA Region – p-value < Peak contains D6S105, MOGCA,

15 Empirical distributions of statistic, chr 6 Off scale

16 Comparison of Two Positive Controls against Two Negative Controls [Figure panels: Cases versus Controls; Cases versus Untransmitteds; Untransmitted versus Controls; Controls versus Untransmitted]

17 Uniform qq-plots and multiple testing When we carry out ~800 tests, as we have here, we expect to see many quite small p-values under the combined null hypothesis of no case-control haplotype differences anywhere: specifically, about 40 smaller than the usual 5% cutoff. In practice, we believe that at most a few of these 800 nulls will be false. How do we adjust our p-values for this multiplicity of tests? One fairly severe way is known as the Bonferroni adjustment: multiply all our p-values by 800. Another approach is this: rather than compare our 800 p-values to the single-test 5% cutoff, we compare them all to the value which the smallest of 800 i.i.d. uniforms will exceed 95% of the time. Exercises 1. Prove that the Bonferroni procedure is conservative, in that the family-wise type 1 error (the chance of one or more type 1 errors), under the assumption that all the null hypotheses are true, is ≤ 5%. 2. Calculate the 5th percentile of the smallest of 800 i.i.d. uniforms. How close is it to the Bonferroni 5th percentile?
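The numbers on this slide can be checked with a short calculation. It assumes exactly 800 tests at the 5% level and independent uniform p-values under the null; the variable names are illustrative. Multiplying each p-value by 800 and comparing to 5% is the same as comparing the raw p-value to 0.05/800, and the smallest of n i.i.d. uniforms exceeds c with probability (1 - c)^n.

```python
n_tests = 800
alpha = 0.05

# Expected number of null p-values below the single-test cutoff.
expected_small = n_tests * alpha

# Bonferroni: compare each raw p-value to alpha / n_tests.
bonferroni_cutoff = alpha / n_tests

# Minimum of n i.i.d. uniforms: solve (1 - c)**n = 1 - alpha for c.
min_uniform_cutoff = 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

print(expected_small, bonferroni_cutoff, min_uniform_cutoff)
```

The two cutoffs are both close to 6e-5, with the Bonferroni value slightly smaller, which is the sense in which it is the more severe of the two.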

18 Uniform qq-plots and multiple testing, cont. The procedure just described is still conservative, for two reasons. Firstly, the p-values are not independent, though they should be identically distributed under the null hypothesis. There are ways to incorporate this into our analysis, the most direct being to estimate the joint resampling distribution of the test statistics for every marker. This can be computationally prohibitive, especially if we also want to address the next point, which is: only the smallest of the p-values should be compared to the smallest of an i.i.d. or suitably dependent sequence of 800 p-values. The second smallest p-value should more correctly be compared to something slightly different, and so on. This leads us to the notion of step-wise multiple testing procedures. Resampling-based stepwise multiple testing corrections can be very computationally intensive. In our present case we did no more than create a uniform qq-plot, and look at the number of loci “off the line” at the low end. Why? In part for computational reasons; in part, because we plan to follow up “promising” regions even if they do not have small adjusted p-values.
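The coordinates of a uniform qq-plot can be computed as in the sketch below. The plotting positions i/(n + 1), which are the expected values of the order statistics of n i.i.d. uniforms, are an assumption; the slides do not say which positions or axis scale were used, and the function name `uniform_qq_points` is hypothetical.

```python
def uniform_qq_points(p_values):
    """Pair each sorted p-value with its expected value under uniformity.

    Returns (expected, observed) pairs, where the i-th smallest of n
    i.i.d. uniforms has expectation i / (n + 1).
    """
    n = len(p_values)
    observed = sorted(p_values)
    expected = [(i + 1) / (n + 1) for i in range(n)]
    return list(zip(expected, observed))


# Hypothetical p-values for three loci.
points = uniform_qq_points([0.5, 0.1, 0.9])
for expected, observed in points:
    print(f"expected={expected:.3f} observed={observed:.3f}")
```

Points with observed values well below their expected values at the low end are the loci “off the line” that the slide refers to.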

19 Distribution of p-values: uniform qq plots [Figure: uniform qq-plots; axes: Expected, Observed]

20 Reproducibility: same datasets, different random number seeds

21 Similar method/problem Similar method: Haplotype Pattern Mining (Toivonen et al, 2000). Ingileif Hallgrimsdottir (Statistics, UCB) modified and extended this method, and her (blindly derived) results were very similar to those obtained using Haplo_cluster on the MS data. Similar problem: a study of bipolar disorder in the Central Valley of Costa Rica (Service et al, 2001, Ophoff et al, 2002) involves an admixed population of Amerindian and Spanish people, with few founders and little immigration, ~300 years old. They used likelihood methods on 3-locus haplotypes, but didn’t use controls.

22 What next for the MS study (apart from more analysis)? A close study of the MHC (HLA) region was conducted and published. Relatedness of cases and controls was studied more carefully, and a few “too close” relatives identified and removed, but leaving the main conclusions unchanged. Fine mapping around peaks was carried out: some peaks were strengthened, others disappeared. Further fine mapping is under way. The Tasmanian cases are being joined by ethnically similar cases from the mainland, and genotyping of these new individuals in candidate regions is under way. International collaboration is also under way. We want to find genes and amino acid changes, if at all possible.

23 Fine mapping: two regions

24 Relatedness in cases and controls We assume that our cases and controls are mostly representative of the “Tasmanian population”. If they are too closely related (within cases or controls) we might expect bias in our sharing statistic. If they are not closely enough related (within cases or controls) we might expect trouble detecting a signal.

25 This pedigree is similar to the type of pedigree found in Tasmania. The “affected” individuals are represented by the filled in symbols.

26 Determining the relatedness of Tasmanians based on GWS data Determine the level of relatedness of all pairs based on the genome-wide scan data (another HMM analysis). We found several pairs which were much more closely related than the expected number of meioses (6-8 generations) would imply: 10 pairs in the case data, 6 pairs in the control data, and 2 pairs in the case and control data. Some of these relationships were subsequently verified with further genealogical research. We re-did the analyses without these people.

27 Does having closely related cases or controls make a difference? [Figure panels: Cases versus Controls; Cases versus Controls (relateds removed); Cases versus Untransmitteds; Cases versus Untransmitteds (relateds removed)]

28 HLA Region & MS MS is believed to be an autoimmune disease (similar to type I diabetes). HLA association with MS was previously identified. One or more genes? Log linear modelling with partial haplotypes suggests that two regions were responsible and that these did not interact.

29 The HLA complex Klein J. et al New Eng J Med, 2000; 343: An extremely gene-rich region.

30 Microsatellite markers that spanned the HLA complex generated a peak of association in an 850-Kb segment of the class I region. We have implicated an 850-Kb class I region in MS. [Figure: map with scale in kb and the 850-Kb segment marked]

31

32 Genetic dissection of the HLA region by haplotype analysis The HLA region encodes at least two independent susceptibility loci for MS. The TNF locus + 15 other class III genes have no influence on disease: the association is due to strong LD with DR15. [Figure: HLA map with classes I, III, II; loci MOG, F, G, A, E, C, B, TNF, DRB1, DQB1, DPB1; haplotype DRB1*1501-DQB1*0602; annotations √ √ X] (Rubio et al AJHG)

33 An extended haplotype across HLA confers increased risk to MS [Figure: HLA map with classes I, III, II; spans 3.6 Mb and 5.1 Mb (~1 cM); markers D6S299, D6S105, D6S464, D6S2223, MOGCA, D6S2655, HLA-F, D6S510, DQB1, DRB1, D6S; ancestral haplotype with alleles *1501 and *0602; RR=4.3, RR=5.7 (DR15)]

34 Acknowledgments MCPHR, Hobart Ingrid van der Mei Trish Groom Kristen Hazelwood Jane Pittaway Rhonda McCoy Lyn Hall Tracy Lowe Natasha Newton Emma Stubbs Michele Sale Maree Ring Annette Banks Joan Clough Tim Albion Jo Dickinson Shelly Brown Sue Sawbridge Deirbhile O’Byrne Bruce Taylor Stan Sjeica Andrew Hughes Bozidar Drulovic Terry Dwyer WEHI Justin Rubio Laura Johnson Rachel Burfoot Stewart Huxtable Simon Foote ANRMSF, Canberra. Rex Simmons MCRI Funding: The Genes-CRC The National Multiple Sclerosis Society (USA) Department of Neurosciences RMH MS Australia NH&MRC (Australia) VTIS The Tasmanian and Victorian public The MS Societies of Victoria and Tasmania Brian Tait Mike Varney Bob Williamson The AGRF (Melbourne) RMH Niall Tubridy Jo Baker John Cary Trevor Kilpatrick Helmut Butzkueven Mark Marriot Melanie Bahlo Jim Stankovich Chris Wilkinson