Presented by Qing Duan Dr. Yun Li group UNC at Chapel Hill

Presentation transcript:

Genotype Imputation for African Americans and Hispanics in WHI using reference haplotypes from the 1000 Genomes Project
Presented by Qing Duan, Dr. Yun Li group, UNC at Chapel Hill, 09-13-2012
Qing Duan qduan@email.unc.edu

Outline
Imputation
  Study samples: WHI African American and Hispanic samples
  Reference haplotypes: 1000 Genomes Project (version 3, March 2012 release)
  Number of markers in reference haplotypes: ~38M
Post-imputation quality assessment
  Evaluation of imputation quality by comparing with actual genotypes from Metabochip genotyping
  Estimation of the total number of QC+ markers and the number of QC+ indels

QC on WHI Genotypes
QC was performed within the African American and Hispanic samples separately, for autosomes and chromosome X. We excluded markers failing any of the following:
  Hardy-Weinberg equilibrium: HW p-value < 1e-6
  Genotype completeness: < 90%
  Minor allele frequency: on chromosomes 1-22, MAF < 1%; on chromosome X, singleton or monomorphic markers
With thanks to Eric Yi Liu.
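The exclusion rules above can be sketched as a single predicate. This is only an illustration: the `Marker` record and its field names are hypothetical, and reading "singleton or monomorphic" as a minor allele count of 1 or 0 is an assumption, not something the slides spell out.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    # Hypothetical per-marker summary; field names are illustrative only.
    name: str
    chrom: str               # "1".."22" or "X"
    hwe_p: float             # Hardy-Weinberg equilibrium p-value
    call_rate: float         # genotype completeness, 0..1
    maf: float               # minor allele frequency, 0..0.5
    minor_allele_count: int  # copies of the minor allele in the sample


def passes_qc(m: Marker) -> bool:
    """Return True if the marker survives the QC rules listed on the slide."""
    if m.hwe_p < 1e-6:       # fails Hardy-Weinberg equilibrium
        return False
    if m.call_rate < 0.90:   # genotype completeness below 90%
        return False
    if m.chrom == "X":
        # Assumption: monomorphic = 0 minor alleles, singleton = 1.
        return m.minor_allele_count > 1
    return m.maf >= 0.01     # autosomes: drop MAF < 1%


markers = [
    Marker("snp_a", "1", 0.20, 0.99, 0.15, 2500),
    Marker("snp_b", "1", 1e-9, 0.99, 0.15, 2500),  # HWE failure
    Marker("snp_c", "X", 0.50, 0.95, 0.0001, 1),   # singleton on chrX
]
kept = [m.name for m in markers if passes_qc(m)]
```

In practice such filters are usually applied with a dedicated genotype QC tool; the point here is only to make the order and direction of each cutoff explicit.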

Summary of samples and GWAS QC+ markers
Number of individuals: WHI_AA 8,421; WHI_HA 3,587
Number of markers:
  Total, Chr1-22: 860,510
  Total, ChrX: 36,889
  QC+, Chr1-22: WHIAA 829,370; WHIHA 834,826
  QC+, ChrX: WHIAA 35,411; WHIHA 35,035
Note: chromosome X is currently under imputation, so the results on chromosome X will be available soon.

Reference Haplotypes
The complete set of 1000G Phase I Integrated Release version 3 haplotypes in VCF format (March 2012 release):
  A total of 2,184 haplotypes
  A total of ~38M markers, including singleton and monomorphic sites
  About 1.4M markers are short indels and large deletions; the rest are SNPs

Note on reference haplotypes
A more recent reduced set of reference haplotypes, with singleton and monomorphic markers removed, is also available.
  Number of markers: ~30M
  Every marker in the reduced set is included in the complete set of reference haplotypes.
We expect little influence on imputation quality from singleton and monomorphic markers, because:
  Phasing of the reference haplotypes was performed with the singleton and monomorphic markers included.
  Our previous evaluation shows little effect of singletons on the quality of imputation (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).

Two-step genotype imputation -- Procedure
Step 1: Pre-phasing (MaCH1). WHI African American and Hispanic samples were phased separately.
Step 2: Genotype imputation (minimac). WHI African American and Hispanic samples were imputed separately. This is haplotype-to-haplotype imputation: the haplotypes pre-phased in step 1 are imputed against the complete set of reference haplotypes from the 1000 Genomes Project.

Two-step genotype imputation -- Computational costs
Phasing and imputation strategy:
  Split chromosomes into segments
  Phase / impute each segment
  Ligate segments back to chromosomes
Computational costs:
  Phasing, split strategy (sample genotypes): core region 3000 markers, flanking 500 markers each
  Phasing, # segments after splitting: WHI_AA 277; WHI_HA 278
  Phasing, median run time: WHI_AA ~245 hours (~10 days); WHI_HA ~63 hours (~3 days)
  Imputation, split strategy (reference haplotypes): WHI_AA core region 5 Mb, flanking 500 Kb each; WHI_HA core region 20 Mb
  Imputation, # segments after splitting: WHI_AA 520; WHI_HA 150
  Imputation, median run time: WHI_AA ~41 hours (~2 days); WHI_HA ~71 hours (~3 days)
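The split / ligate strategy can be illustrated with a small sketch using the phasing settings from the slide (core region of 3000 markers, 500 flanking markers on each side). The function names are hypothetical, and the ligation shown here simply keeps each segment's core region and concatenates; real ligation of phased haplotypes also has to reconcile phase across the overlapping flanks, which this sketch does not attempt.

```python
from typing import List, Sequence, Tuple

# (window_start, core_start, core_end, window_end), all half-open marker indices
Segment = Tuple[int, int, int, int]


def split_into_segments(n_markers: int, core: int = 3000, flank: int = 500) -> List[Segment]:
    """Cut a chromosome into core regions, each padded by flanking markers."""
    segments = []
    for core_start in range(0, n_markers, core):
        core_end = min(core_start + core, n_markers)
        window_start = max(core_start - flank, 0)
        window_end = min(core_end + flank, n_markers)
        segments.append((window_start, core_start, core_end, window_end))
    return segments


def ligate(results: Sequence[Sequence], segments: Sequence[Segment]) -> list:
    """Keep only the core part of each per-segment result and concatenate."""
    ligated = []
    for result, (window_start, core_start, core_end, _) in zip(results, segments):
        ligated.extend(result[core_start - window_start:core_end - window_start])
    return ligated


segments = split_into_segments(7000)
# Stand-in for "phase / impute each segment": just slice out the window.
chromosome = list(range(7000))
per_segment = [chromosome[w0:w1] for (w0, _, _, w1) in segments]
rebuilt = ligate(per_segment, segments)
```

The flanks give each segment some haplotype context at its edges, so that markers near a core boundary are not phased or imputed with one-sided information.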

Summary of imputation results -- Before QC
  Number of individuals: WHIAA 8,421; WHIHA 3,587
  Total number of imputed markers: 38,050,692
  Number of imputed indels: 1,380,758
  File size (all files gz compressed): WHIAA 170 G; WHIHA 71 G
Note: Markers with a missing quality filter in the 1000G reference haplotypes are excluded from imputation. We found that all excluded markers are of type “MERGED_DEL”.

Evaluation of imputation quality -- Introduction
Main idea: compare imputed dosages with actual genotypes.
Quality metrics:
  Dosage r2: squared correlation coefficient between imputed dosages (continuous values between 0 and 2) and actual genotypes (coded as 0, 1 and 2). This is the true imputation accuracy (range 0 to 1).
  Rsq: estimated dosage r2. This is the estimated imputation accuracy.
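As a minimal sketch of the dosage r2 definition above, the squared Pearson correlation for a single marker can be computed as follows. The function name and the choice to return NaN for a marker with no variation are assumptions made for illustration.

```python
import math
from typing import Sequence


def dosage_r2(dosages: Sequence[float], genotypes: Sequence[int]) -> float:
    """Squared Pearson correlation between imputed dosages (0..2)
    and actual genotypes (0, 1, 2) for one marker."""
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        # Assumption: correlation is undefined when either side is constant.
        return float("nan")
    return (cov * cov) / (var_d * var_g)


# Imputed dosages for five samples versus their genotyped values.
r2 = dosage_r2([0.1, 0.9, 1.8, 0.2, 1.1], [0, 1, 2, 0, 1])
```

Because correlation is scale-invariant, dosages that are a linear shrinkage of the true genotypes still give r2 = 1; the metric penalizes mis-ordering and noise, not uniform under-confidence.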

Evaluation of imputation quality -- Study design
Figure: imputed dosage is compared with the actual genotype (Metabochip) to calculate dosage r2.
Individuals used in evaluation: 1,962 WHI African American samples.
Markers used in evaluation: markers overlapping between 1000G and Metabochip but not on Affymetrix 6.0 (all 22 autosomes).
Minor allele frequency (MAF) is defined within the 1,962 individuals.
Among the 8,421 imputed WHI African American samples, 1,962 have been genotyped on Metabochip. In other words, we have actual genotypes for a subset of imputed markers in 1,962 samples. The evaluation of imputation quality is therefore performed within these 1,962 WHI African Americans by comparing the imputed dosage with the actual genotype (coded as 0, 1, 2) on the markers that overlap between 1000 Genomes and Metabochip but are not on the Affymetrix 6.0 genotyping chip. The squared correlation coefficient (dosage r2) is calculated for each evaluated marker. All 22 autosomes are included in the assessment.

Estimation of imputation quality -- Results
We recommend Rsq QC thresholds of 0.7, 0.6, 0.3, 0.3 and 0.3 for the MAF categories 0.1~0.5%, 0.5~1%, 1~3%, 3~5% and 5~50%, respectively. The thresholds are chosen such that an average Rsq greater than 0.8 is achieved.

Estimation of imputation quality -- Summary
We recommend Rsq QC thresholds of 0.7, 0.6 and 0.3 for the MAF categories 0.1~0.5%, 0.5~1% and >1%, respectively. The thresholds are chosen such that an average Rsq greater than 0.8 is achieved in each MAF category (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).
Estimation based on imputation quality assessment:
  Total number of markers passing QC
  Total number of indels passing QC
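The MAF-dependent cutoffs can be written as a small lookup. This is a sketch under stated assumptions: the function names are hypothetical, category boundaries are treated as half-open intervals, a marker passes when Rsq is at or above the cutoff, and markers with MAF below 0.1% (for which the slides give no threshold) are treated as failing.

```python
from typing import Optional


def rsq_threshold(maf: float) -> Optional[float]:
    """Recommended Rsq cutoff for a marker, with MAF given as a fraction."""
    if maf < 0.001:   # below 0.1%: no threshold stated in the slides
        return None
    if maf < 0.005:   # 0.1% to 0.5%
        return 0.7
    if maf < 0.01:    # 0.5% to 1%
        return 0.6
    return 0.3        # above 1%


def passes_imputation_qc(maf: float, rsq: float) -> bool:
    """Assumption: pass means Rsq >= cutoff; no cutoff means fail."""
    threshold = rsq_threshold(maf)
    return threshold is not None and rsq >= threshold


# (estimated MAF, Rsq) pairs as they might appear in imputation output.
imputed = [(0.003, 0.75), (0.003, 0.65), (0.007, 0.65), (0.20, 0.30)]
flags = [passes_imputation_qc(maf, rsq) for maf, rsq in imputed]
```

Rarer variants get stricter cutoffs because, at a fixed Rsq cutoff, the retained rare markers would otherwise have lower average accuracy than the retained common ones.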

Estimation based on imputation quality assessment -- Note
The values are estimated because:
  Estimated Rsq cutoffs: the evaluation is based on markers on Metabochip.
  Estimated MAF: the MAF of imputed markers is calculated from imputed dosages.
Note that the number (and percentage) of markers passing QC in each MAF category is an estimated value because 1) Rsq cutoffs are set using the evaluation of markers on Metabochip, and the imputation quality for markers not on Metabochip may be different; 2) the actual MAF of imputed markers cannot be obtained within WHI African Americans or within WHI Hispanics, so the MAF is an estimated value provided in the imputation output; 3) we have no access to the actual genotypes of WHI Hispanics yet, so in this estimation we assumed that WHI Hispanics have Rsq cutoffs similar to those of WHI African Americans in each MAF category. Note that in this evaluation we also assume SNPs and indels have similar Rsq cutoffs. Since the current Rsq cutoffs are estimated using SNPs on Metabochip, those for indels may differ. However, until an evaluation of the imputation quality of indels is available, we may have to assume that Rsq cutoffs estimated in SNPs can be applied to indels.
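To make the "estimated MAF" point concrete, here is a minimal sketch of deriving a MAF from imputed dosages. It assumes each dosage is the expected number of copies (0 to 2) of one fixed allele per individual; the function name is hypothetical, and the imputation software reports its own estimate, which this does not claim to reproduce exactly.

```python
from typing import Sequence


def estimated_maf(dosages: Sequence[float]) -> float:
    """Estimate minor allele frequency from per-individual imputed dosages."""
    if not dosages:
        raise ValueError("at least one dosage is required")
    # Each individual carries two alleles, so divide the dosage sum by 2N.
    allele_freq = sum(dosages) / (2 * len(dosages))
    # The minor allele is whichever of the two alleles is rarer.
    return min(allele_freq, 1.0 - allele_freq)


maf_rare = estimated_maf([0.0, 0.0, 0.0, 2.0])    # counted allele is the minor one
maf_flip = estimated_maf([1.5, 2.0, 2.0, 2.0])    # counted allele is the major one
```

Because the dosages themselves are uncertain for poorly imputed markers, a MAF computed this way inherits that uncertainty, which is why the per-category counts are described as estimates.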

Estimation based on imputation quality assessment -- Note (cont’d)
The values are estimated because:
  Estimated QC thresholds for WHI Hispanic samples: we assumed that WHI Hispanics have Rsq cutoffs similar to those of WHI African Americans in each MAF category. We will do a similar quality assessment in the Hispanic samples once we have their QC+ Metabochip data.
  Estimated QC thresholds for indels: Rsq is set based on evaluation on SNPs. We assumed that indels have Rsq cutoffs similar to those of SNPs in each MAF category.

Estimation based on imputation quality assessment -- Total number of markers passing QC
To get a general idea of how many imputed markers are retained after applying QC, we estimated the number in each MAF category within WHI African Americans and Hispanics separately. We chose Rsq cutoffs of 0.7 and 0.6 for the MAF 0.1~0.5% and 0.5~1% categories such that the average Rsq is greater than 0.8 (as shown in Table 1), and we use 0.3 as the Rsq cutoff in the remaining MAF categories. Note: markers include both SNPs and indels.

Estimation based on imputation quality assessment -- Number of indels passing QC

Summary
We conducted genotype imputation for 8,421 African American and 3,587 Hispanic samples in the Women’s Health Initiative (WHI) study using reference haplotypes from the 1000 Genomes Project (version 3, March 2012 release).
Summary of imputation results before and after QC:
  Number of individuals: WHIAA 8,421; WHIHA 3,587
  Total number of markers: before QC 38,050,692; after QC WHIAA 18,940,103, WHIHA 15,214,231
  Number of indels: before QC 1,380,758; after QC WHIAA 1,219,538, WHIHA 1,126,704
  File size (all files gz compressed): WHIAA 170 G before QC, 102 G after QC; WHIHA 71 G before QC, 33 G after QC