Published by Morris O’Neal. Modified over 9 years ago.
1
1 Genes and MS in Tasmania, completed. Lecture 7, Statistics 246 February 12, 2004
2
2 Towards a sharing statistic
Our aim was to come up with a statistic that effectively describes haplotype sharing differences between case and “control” haplotypes.
The sharing statistic should be largest at markers closest to a disease locus, as there
- haplotype sharing should extend the furthest; and
- the association of disease with particular haplotypes should be strongest.
3
3 Nonparametric haplotype sharing analysis
Why nonparametric, rather than likelihood-based, methods?
- Likelihood methods make assumptions regarding the genealogy of the population, and we don’t know how many of these assumptions are robust to violations.
- Likelihood methods are computationally intensive, especially for genome-wide scans, where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC).
- Likelihood methods have a hard time at the HLA region, because the LD there is extremely high and non-uniform (block-like structure).
- Simpler statistics will probably do better here, unless we can model background LD.
4
4 Haplotype sharing statistics for genome-wide scan data cf. fine mapping
- Previous (usually likelihood-based) statistics have concentrated on fine mapping and the exact localization of a variant allele. They assume a signal exists.
- For us, localization was not the primary interest. Rather, detection was our main interest, using a genome-wide scan.
- We needed something that was not as computationally intensive as DHSMAP (McPeek & Strahs, 1999), BLADE (Liu et al, 2001), DMLE+ (Rannala & Reeve, 2001), or the shattered coalescent (Morris et al, 2002).
5
5 Haplo_clusters (Melanie Bahlo)
- Calculates a sharing statistic at every marker
- Obtains a p-value at every marker using a permutation test
- Allows for several clusters of ancestral haplotypes (allelic heterogeneity)
6
6 Testing for shared haplotypes
[Figure: marker alleles along the case haplotypes and the control haplotypes, with the score for haplotype sharing (-log p) plotted along the chromosome, Pter to Qter]
7
7 Sharing drop-off & allelic heterogeneity
[Figure: proportions of cases and proportions of controls at markers 1 to 4 carrying cluster 1 haplotypes, cluster 2 haplotypes, or neither cluster 1 nor 2 haplotypes]
8
8 Haplo_cluster in action
Example: sorting on marker 1 for a sample of 3 case and 4 control haplotypes.
[Figure: marker alleles of the case and control haplotypes]
After sorting on the haplotype consisting only of marker 1, calculate a chi-square statistic, and move on.

Haplotype   1  2  3
Controls    3  0  1
Cases       0  3  0

After sorting on the haplotype consisting of marker 1 and marker 2, calculate a chi-square statistic, and so on.

Haplotype   (five two-marker haplotypes)
Controls    1  1  1  0  1
Cases       0  0  0  3  0

Eventually stop, and sum the chi-square statistics. Then repeat for a suitably large number of random permutations of cases and controls.
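The chi-square step on a table of haplotype counts can be sketched as follows. This is a minimal illustration that assumes the ordinary Pearson chi-square for a contingency table; the slides do not say exactly which chi-square form Haplo_cluster computes, and the function name `pearson_chi_square` is hypothetical.

```python
def pearson_chi_square(table):
    """Pearson chi-square for a table given as a list of rows of counts.

    Assumption: the plain Pearson statistic, sum of (O - E)^2 / E.
    Cells whose expected count is zero are skipped.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                chi_square += (observed - expected) ** 2 / expected
    return chi_square


# Marker 1 table from the slide: rows are controls and cases,
# columns are the haplotypes (alleles) 1, 2, 3.
marker_1_table = [[3, 0, 1],
                  [0, 3, 0]]
print(pearson_chi_square(marker_1_table))
```

Here the cases and controls share no haplotype at marker 1, so the statistic takes its largest possible value for 7 haplotypes in a two-row table.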
9
9 Statistic to evaluate haplotype sharing
- The sharing statistic is χ² based, using the idea of multiple ancestral haplotypes (clusters) which are “grown” starting at each marker examined in the scan.
- Significance is evaluated via a permutation test: choose a random permutation of the pooled cases and controls, and recalculate the statistic; repeat ~20,000 times.
- A recursive form for the estimator and the SD of the p-value was used, to enable early termination of the program.
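One way early termination could work is sketched below. It assumes the rule suggested by the program output ("standard deviation of p values calculated to 0.01*phat"): keep a running estimate of the p-value and stop once its binomial standard deviation falls below a fixed fraction of the estimate. The function name, the `min_perms` guard and the default values are assumptions for illustration, not the actual Haplo_cluster code.

```python
import math
import random


def sequential_p_value(observed, draw_null_statistic, rel_sd=0.01,
                       min_perms=100, max_perms=100_000):
    """Estimate a permutation p-value with early stopping.

    draw_null_statistic: callable returning one resampled statistic.
    Stops when sd(p_hat) <= rel_sd * p_hat (assumed rule), or at max_perms.
    """
    p_hat = 0.0
    n = 0
    while n < max_perms:
        n += 1
        exceeds = 1.0 if draw_null_statistic() > observed else 0.0
        # Recursive (running-mean) update of the estimated p-value.
        p_hat += (exceeds - p_hat) / n
        if n >= min_perms and p_hat > 0.0:
            sd = math.sqrt(max(0.0, p_hat * (1.0 - p_hat)) / n)
            if sd <= rel_sd * p_hat:
                break
    return p_hat, n


# Toy usage: the "statistic" is a uniform random number, so the true
# p-value of an observed value of 0.5 is 0.5.
rng = random.Random(0)
p_hat, n_used = sequential_p_value(0.5, rng.random)
print(p_hat, n_used)
```

Under this rule large p-values stop early, while small p-values, the interesting ones, run for many more permutations.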
10
10 The permutation test
The idea is this. We have 170 cases and 105 controls, and at any particular locus, we calculate the value of our statistic, calling it S.
Now pool our cases and controls into 275 individuals, and sample 170 to be “cases” at random from the 275, calling the remainder “controls”. For this first artificial set of cases and controls, calculate the value of our statistic, S_1 say.
Next, we repeat this procedure 9,999 more times, say, obtaining values S_2, S_3, S_4, …, S_10,000.
As long as 10,000 is sufficiently many random permutations, we can get a good estimate of the p-value of our initial statistic relative to our empirically estimated null distribution, as p = #{i: S_i > S}/10,000.
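The procedure can be sketched in a few lines. This is a generic illustration, not the Haplo_cluster implementation: the function names are hypothetical, and `mean_difference` is a stand-in statistic on toy numeric data rather than the haplotype sharing statistic.

```python
import random


def permutation_p_value(cases, controls, statistic, n_perms=10_000, seed=0):
    """Estimate p = #{i: S_i > S} / n_perms by relabelling the pooled sample."""
    rng = random.Random(seed)
    observed = statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    exceed = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)
        # The first n_cases of the shuffled pool play the role of "cases".
        s_i = statistic(pooled[:n_cases], pooled[n_cases:])
        if s_i > observed:
            exceed += 1
    return exceed / n_perms


def mean_difference(cases, controls):
    """Stand-in statistic (assumption): difference of group means."""
    return sum(cases) / len(cases) - sum(controls) / len(controls)


print(permutation_p_value([5, 6, 7, 8], [1, 2, 3, 4], mean_difference,
                          n_perms=1000))
```

With the strict inequality used on the slide, a data set in which no relabelling beats the observed statistic gives an estimated p-value of exactly 0, so in practice the estimate is only resolved down to about 1/n_perms.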
11
11 Exercises
1. How should we decide what number of resamplings is large enough?
2. Explain in the simple case of a 2 × 2 table of cases and controls cross-classified as diseased and healthy, how using all possible resamplings, rather than a fixed size random sample, leads to the p-value for the exact test.
3. To avoid carrying out an unnecessarily large number of permutations, the proportion of resampled values of our statistic exceeding the value S can be monitored. Can you describe a stopping rule for the random resamplings that should lead to “accurate enough” p-values, without going to the full number each time?
12
12 Haplo_clusters - Output
-opt 1    Genetic distances used to decide order of markers to sort on
-c 1      The number of clusters of haplotypes to look for = 1
-miss 1   The missing data is replaced randomly using the 2 marker haplotype information.
-share 5  The number of haplotypes needed to share = 5
The standard deviation p values are calculated to 0.01*phat.
Marker names have been provided and will be used in the output files.
# of case haplotypes = 338
# of contol haplotypes = 208
# of markers = 11
# of perms = 100000

Marker    Mapdistance  Chi_Square  p         sd(p)     -log(p)  perms
====================================================
D21S1911   0.00        5.34        4.44e-01  4.44e-03  0.35     12510
D21S1904   0.85        6.17        3.63e-01  3.63e-03  0.44     17577
D21S1899  10.36        5.89        4.37e-01  4.37e-03  0.36     12876
D21S1922  16.46        2.97        6.83e-01  6.83e-03  0.17      4636
D21S1884  17.26        4.74        4.14e-01  4.14e-03  0.38     14135
D21S1914  20.82        6.49        3.38e-01  3.38e-03  0.47     19571
D21S263   28.97        4.06        5.24e-01  5.24e-03  0.28      9077
D21S1252  39.41        1.18        8.66e-01  8.65e-03  0.06      1553
D21S1919  42.51        1.38        8.51e-01  8.51e-03  0.07      1751
D21S1255  43.81        2.24        7.24e-01  7.24e-03  0.14      3805
D21S266   51.51        3.86        5.70e-01  5.70e-03  0.24      7557
===================================================
13
13 Haplo_clusters - Output II
Table of haplotypes
Marker  Cluster  Haplotype  Length(Haplotype)
D21S1911 D21S1904 D21S1899 D21S1922 D21S1884 D21S1914 D21S263 D21S125
===================================================
D21S1884  1  --633811-
# of haplos:  --582163222-
Chi-square:  --0.20.03.20.11.2-
D21S1914  1  --73352-
# of haplos:  --416345810-
Chi-square:  --2.50.00.52.11.3-
D21S263  1  ---431025
# of haplos:  ---35241389
Chi-square:  ---1.91.20.70.00.3
D21S1252  1  ------45
# of haplos:  ------683
Chi-square:  ------0.60.1
D21S1919  1  ------22
# of haplos:  ------37
Chi-square:  ------0.30.0
Etc etc etc
===================================================
Time taken (m) = 55, 23/6/2003, 11:15:12
Haplo_cluster.pl $Revision:1.15$
14
14 Output for Chromosome 6 HLA Region – p-value <0.00001. Peak contains D6S105, MOGCA,
15
15 Empirical distributions of statistic, chr 6 Off scale
16
16 Comparison of Two Positive Controls against Two Negative Controls
[Figure panels: Cases versus Controls; Cases versus Untransmitteds; Untransmitted versus Controls; Controls versus Untransmitted]
17
17 Uniform qq-plots and multiple testing
When we carry out ~800 tests, as we have here, we expect to see many quite small p-values under the combined null hypothesis of no case-control haplotype differences anywhere, specifically, about 40 smaller than the usual 5% cutoff. In practice, we believe that at most a few of these 800 nulls will be false. How do we adjust our p-values for this multiplicity of tests?
One fairly severe way is known as the Bonferroni adjustment: multiply all our p-values by 800.
Another approach is this: rather than compare our 800 p-values to the single-test 5% cutoff, we compare them all to that value which the smallest of 800 i.i.d. uniforms will exceed 95% of the time.
Exercises
1. Prove that the Bonferroni procedure is conservative, in that the family-wise type 1 error (the chance of one or more type 1 errors), under the assumption that all the null hypotheses are true, is ≤ 5%.
2. Calculate the 5th percentile of the smallest of 800 i.i.d. uniforms. How close is it to the Bonferroni 5th percentile?
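The quantities mentioned above can be checked numerically. The sketch below assumes exactly 800 independent tests with uniform p-values under the null; the variable names are illustrative only. The cutoff for the smallest of n i.i.d. uniforms solves (1 - c)^n = 0.95.

```python
n_tests = 800
alpha = 0.05

# Expected number of null p-values falling below the single-test cutoff.
expected_below_cutoff = n_tests * alpha

# Bonferroni: compare each raw p-value with alpha / n_tests
# (equivalently, multiply each p-value by n_tests).
bonferroni_cutoff = alpha / n_tests

# Value that the minimum of n_tests i.i.d. uniforms exceeds 95% of the
# time: P(min > c) = (1 - c) ** n_tests = 1 - alpha.
min_uniform_cutoff = 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

print(expected_below_cutoff)
print(bonferroni_cutoff)
print(min_uniform_cutoff)
```

The two cutoffs differ by only a few percent, with the Bonferroni value being the smaller (more conservative) of the two.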
18
18 Uniform qq-plots and multiple testing, cont.
The procedure just described is still conservative, for two reasons.
Firstly, the p-values are not independent, though they should be identically distributed under the null hypothesis. There are ways to incorporate this into our analysis, the most direct being to estimate the joint resampling distribution of the test statistics for every marker. This can be computationally prohibitive, especially if we also want to address the next point.
Secondly, only the smallest of the p-values should be compared to the smallest of an i.i.d. or suitably dependent sequence of 800 p-values. The second smallest p-value should more correctly be compared to something slightly different, and so on. This leads us to the notion of step-wise multiple testing procedures. Resampling-based stepwise multiple testing corrections can be very computationally intensive.
In our present case we did no more than create a uniform qq-plot, and look at the number of loci “off the line” at the low end. Why? In part for computational reasons; in part, because we plan to follow up “promising” regions even if they do not have small adjusted p-values.
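The points of a uniform qq-plot can be computed as sketched below. This assumes the common convention of pairing the i-th smallest observed p-value with i/(n+1), the expected value of the i-th order statistic of n i.i.d. uniforms; the slides do not state which plotting positions were used, and the function name is hypothetical.

```python
def uniform_qq_points(p_values):
    """Return (expected, observed) pairs for a uniform qq-plot.

    Assumption: expected quantiles are i / (n + 1) for i = 1..n.
    """
    n = len(p_values)
    observed = sorted(p_values)
    expected = [(i + 1) / (n + 1) for i in range(n)]
    return list(zip(expected, observed))


# Toy usage with three p-values; points with observed well below
# expected at the low end are the ones "off the line".
for expected, observed in uniform_qq_points([0.5, 0.1, 0.9]):
    print(expected, observed)
```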
19
19 Distribution of p-values: uniform qq plots
[Figure: four uniform qq-plot panels, observed p-values against expected]
20
20 Reproducibility: same datasets, different random number seeds
21
21 Similar method/problem
Similar method: Haplotype Pattern Mining (Toivonen et al, 2000). Ingileif Hallgrimsdottir (Statistics, UCB) modified and extended this method, and her (blindly derived) results were very similar to those obtained using Haplo_cluster on the MS data.
Similar problem: A study of bipolar disorder in the Central Valley of Costa Rica (Service et al, 2001; Ophoff et al, 2002) involves an admixed population of Amerindian and Spanish people, with few founders and little immigration, ~300 years old. They used likelihood methods on 3-locus haplotypes, but didn’t use controls.
22
22 What next for the MS study (apart from more analysis)?
- A close study of the MHC (HLA) region was conducted and published.
- Relatedness of cases and controls was studied more carefully, and a few “too close” relatives identified and removed, but leaving the main conclusions unchanged.
- Fine mapping around peaks was carried out: some peaks were strengthened, others disappeared. Further fine mapping is under way.
- The Tasmanian cases are being joined by ethnically similar cases from the mainland, and genotyping of these new individuals in candidate regions is under way.
- International collaboration is also under way.
- We want to find genes and amino acid changes, if at all possible.
23
23 Fine mapping: two regions
24
24 Relatedness in cases and controls We assume that our cases and controls are mostly representative of the “Tasmanian population”. If they are too closely related (within cases or controls) we might expect bias in our sharing statistic. If they are not closely enough related (within cases or controls) we might expect trouble detecting a signal.
25
25 This pedigree is similar to the type of pedigree found in Tasmania. The “affected” individuals are represented by the filled in symbols.
26
26 Determining the relatedness of Tasmanians based on GWS data
- Determine the level of relatedness of all pairs based on the genome-wide scan data (another HMM analysis).
- We found several pairs which were much more closely related than the 10-12 meioses (6-8 generations) expected:
  - 10 pairs in the case data
  - 6 pairs in the control data
  - 2 pairs in the case and control data
- Some of these relationships were subsequently verified with further genealogical research.
- We re-did the analyses without these people.
27
27 Does having closely related cases or controls make a difference?
[Figure panels: Cases versus Controls; Cases versus Controls (relateds removed); Cases versus Untransmitteds; Cases versus Untransmitteds (relateds removed)]
28
28 HLA Region & MS
- MS is believed to be an autoimmune disease (similar to type I diabetes).
- HLA association with MS previously identified.
- One or more genes?
- Log linear modelling with partial haplotypes suggests that two regions were responsible and that these did not interact.
29
29 The HLA complex
An extremely gene-rich region.
[Figure from Klein J. et al, New Eng J Med, 2000; 343:702-709]
30
30 We have implicated an 850-Kb class I region in MS
Microsatellite markers that spanned the HLA complex generated a peak of association in an 850-Kb segment of the class I region.
[Figure: association peak with the 850 kb segment marked]
31
31
32
32 Genetic dissection of the HLA region by haplotype analysis
- The HLA region encodes at least two independent susceptibility loci for MS.
- The TNF locus + 15 other class III genes have no influence on disease: the association is due to strong LD with DR15.
[Figure: map of the HLA region (classes I, III, II) showing MOG, F, G, A, E, C, B, TNF, DRB1, DQB1, DPB1 and the DRB1*1501-DQB1*0602 haplotype, with regions marked √, √, X]
(Rubio et al. 2002 AJHG)
33
33 An extended haplotype across HLA confers increased risk to MS
[Figure: map of the HLA region (classes I, III, II; 3.6 Mb, 5.1 Mb (~1 cM)) with markers D6S299, D6S105, D6S464, D6S2223, MOGCA, D6S2655, HLA-F, D6S510, DQB1, DRB1, D6S291; ancestral haplotype alleles 3653351*1501*0602; RR=4.3, RR=5.7 (DR15)]
34
34 Acknowledgments MCPHR, Hobart Ingrid van der Mei Trish Groom Kristen Hazelwood Jane Pittaway Rhonda McCoy Lyn Hall Tracy Lowe Natasha Newton Emma Stubbs Michele Sale Maree Ring Annette Banks Joan Clough Tim Albion Jo Dickinson Shelly Brown Sue Sawbridge Deirbhile O’Byrne Bruce Taylor Stan Sjeica Andrew Hughes Bozidar Drulovic Terry Dwyer WEHI Justin Rubio Laura Johnson Rachel Burfoot Stewart Huxtable Simon Foote ANRMSF, Canberra. Rex Simmons MCRI Funding: The Genes-CRC The National Multiple Sclerosis Society (USA) Department of Neurosciences RMH MS Australia NH&MRC (Australia) VTIS The Tasmanian and Victorian public The MS Societies of Victoria and Tasmania Brian Tait Mike Varney Bob Williamson The AGRF (Melbourne) RMH Niall Tubridy Jo Baker John Cary Trevor Kilpatrick Helmut Butzkueven Mark Marriot Melanie Bahlo Jim Stankovich Chris Wilkinson