Accuracy of Haplotype Frequency Estimation for Biallelic Loci, via the Expectation- Maximization Algorithm for Unphased Diploid Genotype Data  Daniele.

Slides:



Advertisements
Similar presentations
Xiaoshu Chen, Jianzhi Zhang  Cell Systems 
Advertisements

The Structure of Common Genetic Variation in United States Populations
Extensive Linkage Disequilibrium in Small Human Populations in Eurasia
M. A. Escamilla, M. C. DeMille, E. Benavides, E. Roche, L. Almasy, S
Harald H.H. Göring, Joseph D. Terwilliger 
Fiona E. Müllner, Sheyum Syed, Paul R. Selvin, Fred J. Sigworth 
Volume 26, Issue 7, Pages (April 2016)
Genomewide Linkage Scan Identifies a Novel Susceptibility Locus for Restless Legs Syndrome on Chromosome 9p  Shenghan Chen, William G. Ondo, Shaoqi Rao,
CHEK2*1100delC and Susceptibility to Breast Cancer: A Collaborative Analysis Involving 10,860 Breast Cancer Cases and 9,065 Controls from 10 Studies 
An Extensive Analysis of Y-Chromosomal Microsatellite Haplotypes in Globally Dispersed Human Populations  Manfred Kayser, Michael Krawczak, Laurent Excoffier,
Meta-analysis of Genetic-Linkage Analysis of Quantitative-Trait Loci
Haplotype Estimation Using Sequencing Reads
Caroline Durrant, Krina T. Zondervan, Lon R
Association Mapping in Structured Populations
David H. Spencer, Kerry L. Bubb, Maynard V. Olson 
Thomas Willems, Melissa Gymrek, G
Tuuli Lappalainen, Stephen B. Montgomery, Alexandra C
Brian K. Maples, Simon Gravel, Eimear E. Kenny, Carlos D. Bustamante 
QTL Fine Mapping by Measuring and Testing for Hardy-Weinberg and Linkage Disequilibrium at a Series of Linked Marker Loci in Extreme Samples of Populations 
Rounak Dey, Ellen M. Schmidt, Goncalo R. Abecasis, Seunggeun Lee 
Gene-Expression Variation Within and Among Human Populations
10 Years of GWAS Discovery: Biology, Function, and Translation
John Wakeley, Rasmus Nielsen, Shau Neen Liu-Cordero, Kristin Ardlie 
Highly Significant Linkage to the SLI1 Locus in an Expanded Sample of Individuals Affected by Specific Language Impairment    The American Journal of.
Variant Association Tools for Quality Control and Analysis of Large-Scale Sequence and Genotyping Array Data  Gao T. Wang, Bo Peng, Suzanne M. Leal  The.
A Flexible Bayesian Framework for Modeling Haplotype Association with Disease, Allowing for Dominance Effects of the Underlying Causative Variants  Andrew.
The β-Globin Recombinational Hotspot Reduces the Effects of Strong Selection around HbC, a Recently Arisen Mutation Providing Resistance to Malaria  Elizabeth.
A Selection Operator for Summary Association Statistics Reveals Allelic Heterogeneity of Complex Traits  Zheng Ning, Youngjo Lee, Peter K. Joshi, James.
Volume 173, Issue 1, Pages e9 (March 2018)
Xiangqing Sun, Robert Elston, Nathan Morris, Xiaofeng Zhu 
Matching Strategies for Genetic Association Studies in Structured Populations  David A. Hinds, Renee P. Stokowski, Nila Patil, Karel Konvicka, David Kershenobich,
Are Rare Variants Responsible for Susceptibility to Complex Diseases?
Studying Gene and Gene-Environment Effects of Uncommon and Common Variants on Continuous Traits: A Marker-Set Approach Using Gene-Trait Similarity Regression 
Matthieu Foll, Oscar E. Gaggiotti, Josephine T
Simultaneous Genotype Calling and Haplotype Phasing Improves Genotype Accuracy and Reduces False-Positive Associations for Genome-wide Association Studies 
Haplotypes at ATM Identify Coding-Sequence Variation and Indicate a Region of Extensive Linkage Disequilibrium  Penelope E. Bonnen, Michael D. Story,
Jon Wakefield  The American Journal of Human Genetics 
Pier Francesco Palamara, Laurent C. Francioli, Peter R
Increased Level of Linkage Disequilibrium in Rural Compared with Urban Communities: A Factor to Consider in Association-Study Design  Veronique Vitart,
Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test  Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael.
Highly Punctuated Patterns of Population Structure on the X Chromosome and Implications for African Evolutionary History  Charla A. Lambert, Caitlin F.
Claus Skaanning Jensen, Augustine Kong 
Shuhua Xu, Wei Huang, Ji Qian, Li Jin 
Pritam Chanda, Aidong Zhang, Daniel Brazeau, Lara Sucheston, Jo L
Jonathan K. Pritchard, Joseph K. Pickrell, Graham Coop  Current Biology 
Estimating Genetic Effects and Quantifying Missing Heritability Explained by Identified Rare-Variant Associations  Dajiang J. Liu, Suzanne M. Leal  The.
Benjamin A. Rybicki, Robert C. Elston 
A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals  Brian L. Browning, Sharon.
James A. Lautenberger, J. Claiborne Stephens, Stephen J
Gad Kimmel, Ron Shamir  The American Journal of Human Genetics 
Population Structure in Admixed Populations: Effect of Admixture Dynamics on the Pattern of Linkage Disequilibrium  C.L. Pfaff, E.J. Parra, C. Bonilla,
Extensive Linkage Disequilibrium in Small Human Populations in Eurasia
Jared R. Kohler, David J. Cutler 
Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium  Christopher S. Carlson,
The Power of Genomic Control
Rational Inferences about Departures from Hardy-Weinberg Equilibrium
A Perspective on Epistasis: Limits of Models Displaying No Main Effect
Are Variants in the CAPN10 Gene Related to Risk of Type 2 Diabetes
Test for Interaction between Two Unlinked Loci
Yu Zhang, Tianhua Niu, Jun S. Liu 
Jung-Ying Tzeng, Chih-Hao Wang, Jau-Tsuen Kao, Chuhsing Kate Hsiao 
Tao Wang, Robert C. Elston  The American Journal of Human Genetics 
Data Mining Applied to Linkage Disequilibrium Mapping
Xiaoshu Chen, Jianzhi Zhang  Cell Systems 
Lactase Haplotype Diversity in the Old World
Genotype-Imputation Accuracy across Worldwide Human Populations
Messages through Bottlenecks: On the Combined Use of Slow and Fast Evolving Polymorphic Markers on the Human Y Chromosome  Peter de Knijff  The American.
Jung-Ying Tzeng, Daowen Zhang  The American Journal of Human Genetics 
Gonçalo R. Abecasis, Janis E. Wigginton 
Bruce Rannala, Jeff P. Reeve  The American Journal of Human Genetics 
Presentation transcript:

Accuracy of Haplotype Frequency Estimation for Biallelic Loci, via the Expectation- Maximization Algorithm for Unphased Diploid Genotype Data  Daniele Fallin, Nicholas J. Schork  The American Journal of Human Genetics  Volume 67, Issue 4, Pages 947-959 (October 2000) DOI: 10.1086/303069 Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 1 Conceptual framework for simulation studies and accuracy comparisons. The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 2 Distribution of maximum log-likelihoods from the estimation procedure, by program settings: convergence criterion, maximum iterations, and number of restarts at different random initial-frequency values. For these analyses, 500 data sets of 200 individuals each were simulated for a five-locus system (mean frequency .03125; variance 10.0). The analyses for each panel were performed on the same batch of 500 simulated sets each time, with the parameter of interest progressively adjusted to a more stringent value (the standard error of the maximum log-likelihood values for all situations was .098). The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 3 Influence of sample size on haplotype frequency estimates. A and B, Haplotype frequencies at the three steps of the simulation procedure. Generating frequencies (Gk [line]), sample frequencies (Sk [triangles]), and resulting haplotype frequency estimates from the EM algorithm (Ek [unblackened circles]) for a five-locus system with equally frequent population haplotype frequencies, with sample size set to N=50 (A) and N=500 (B) are shown. C, Average MSE and 95% CI for batches of 500 data sets of each sample size for five-locus haplotypes generated under the N(1/k,σ2) model. Unbroken line denotes comparisons of EM estimates to sample values (SE); dotted line, EM estimates to generating parameters (GE). The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 4 MSE of the final estimates as a function of the amount of missing phase information per data set. The X-axis indicates the proportion of heterozygous loci in the entire data set as a measure of the overall missing phase information in the sample. MSEs for the SE (unbroken line) and GE (dotted line) comparisons are plotted along the Y-axis. A, Data sets with generating haplotypes drawn for the normal distribution scenario. B, Data sets with generating haplotype frequencies drawn from a Dirichlet distribution with one haplotype parameter set at 50 and the rest set equal and with Hardy-Weinberg disequilibrium among the haplotypes set to .05. Both panels are based on 10,000 simulated sets (size 200 individuals), for a five-locus system with 15 restarts, 150 maximum iterations, and convergence set to 10−5. The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 5 Accuracy of program estimates by haplotype frequency distributions within data sets. MSE measures for the SE (unbroken line) and GE (dotted line) comparisons are plotted along the Y-axis. In panels A–C, the X-axis indicates the frequency of the most common estimated haplotype per data set. In panels D–F, the X-axis indicates the frequency of the least common (nonzero) estimated haplotype per data set. A and B, Batches of data sets simulated under the normal generating distribution scenario. C and D, Generating haplotype frequency parameter values drawn from a Dirichlet distribution with equal parameters. E and F, Generating haplotype frequency parameter values drawn from a Dirichlet distribution with one extreme parameter value (∼90%). Each panel is based on 10,000 simulated sets (size 200 individuals), for a five-locus system with 15 restarts, 150 maximum iterations, and convergence set to 10−5. The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 6 Accuracy of program estimates as a function of the dispersion of haplotype frequency values within a data set. MSE measures for the SE (solid line) and GE (dotted line) comparisons are plotted along the Y-axis. In panel A, the X-axis represents the variance used to derive generating haplotype frequency values under the normal distribution scenario. A total of 500 data sets were simulated for each variance value. In panels B and C, the X-axis represents the κ2 value for a test of equality of haplotype frequency values within each data set. These panels represent batches from simulations under the Dirichlet distribution with either uniform parameters (B) or one extreme parameter (C). Both panels are based on 10,000 simulated data sets. All simulations were done for samples of 200 individuals, for a five-locus system with 15 restarts, 150 maximum iterations, and convergence set to 10−5. The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 7 Accuracy of program estimates by number of constituent loci with “rare” allele frequencies per data set. MSE measures for the SE (unbroken line) and GE (dotted line) comparisons are plotted along the Y-axis. A, Batch simulated under the normal generating distribution scenario. MSEse and MSEge have separate axes because of orders of scale. B, Generating haplotype frequency parameter values drawn from a Dirichlet distribution with equal parameters. C, Generating haplotype frequency parameter values drawn from a Dirichlet distribution with one extreme parameter value (∼90%). Each panel is based on 10,000 simulated sets (size 200 individuals), for a five-locus system with 15 restarts, 150 maximum iterations, and convergence set to 10−5. The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 8 Accuracy of program estimates of HWD. The Y-axis indicates MSE between final haplotype frequency estimates and sample-set values (unbroken line) or generating population parameter values (dotted line). Each panel is based on 10,000 simulated sets under the extreme Dirichlet generating frequency model with HWD (either toward heterozygosity, [A and C] or homozygosity [B and D]) introduced at the haplotype level during the sampling process. Aand B, MSE by the number of loci per simulation with significant HWD. C and D, MSEse as a function of the most extreme HWD coefficient across loci per simulation. All simulations were done for 200 individuals, for a five-locus system with 15 restarts, 150 maximum iterations, and convergence set to 10−5. The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 9 Accuracy of program estimates by LD. The Y-axis indicates MSE between final haplotype frequency estimates and sample-set values (unbroken line) or generating population parameter values (dotted line). The X-axis indicates the average D′ LD value across all pairwise comparisons per simulation. A, Simulations generated under the normal distribution scenario. B, Population haplotype frequency values drawn from a Dirichlet distribution with equal parameters. C, Population haplotype frequency values drawn from a Dirichlet distribution with one extreme parameter value (∼50%) and HWD induced toward homozygosity. Each panel is based on 10,000 simulated sets (size 200 individuals), for a five-locus system with 15 restarts, 150 maximum iterations, and convergence set to 10−5. The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions

Figure 10 Accuracy of program estimates by number of loci in haplotype. The Y-axis indicates MSE between final haplotype frequency estimates and sample-set values (unbroken line) or generating population parameter values (dotted line). The categories represented on the X-axis correspond to 1,000 simulations each (2-, 3-, 4-, 5-, 7-, and 10-locus systems). All simulations were for 200 individuals, with 15 restarts, 150 maximum iterations, and convergence set to 10−5. The American Journal of Human Genetics 2000 67, 947-959DOI: (10.1086/303069) Copyright © 2000 The American Society of Human Genetics Terms and Conditions