Published by Efrain Treadway. Modified over 9 years ago.
1
Genotype Imputation for African Americans and Hispanics in WHI using reference haplotypes from the 1000 Genomes Project
Presented by Qing Duan, Dr. Yun Li group, UNC at Chapel Hill, 09-13-2012
2
Outline
Imputation
  Study samples: WHI African American and Hispanic samples
  Reference haplotypes: 1000 Genomes Project (version 3, March 2012 release)
  Number of markers in reference haplotypes: ~38M
Post-imputation quality assessment
  Evaluation of imputation quality by comparison with actual genotypes from Metabochip genotyping
  Estimation of the total number of QC+ markers and the number of QC+ indels
3
QC on WHI Genotypes
QC was performed within the African American and Hispanic samples separately, for autosomes and chromosome X.
We excluded markers that failed any of the following:
  Hardy-Weinberg equilibrium: HW p-value < 1e-6
  Genotype completeness: < 90%
  Minor allele frequency:
    Chromosomes 1-22: MAF < 1%
    Chromosome X: singleton or monomorphic markers
With thanks to Eric Yi Liu
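The autosomal exclusion criteria above can be sketched as a per-marker filter. This is only an illustration: the slides do not say which Hardy-Weinberg test was used, so the 1-df chi-square test here is an assumption, and the function names and the genotype-count input format are hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Hardy-Weinberg p-value from genotype counts (1-df chi-square test).

    Assumption: the slides do not name the test; an exact test may have
    been used instead.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic marker: the test is not informative
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chisq / 2.0))


def passes_autosomal_qc(n_aa, n_ab, n_bb, n_missing):
    """Apply the three autosomal filters: completeness, MAF, and HWE."""
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if total == 0 or called / total < 0.90:  # genotype completeness < 90%
        return False
    p = (2 * n_aa + n_ab) / (2 * called)
    if min(p, 1.0 - p) < 0.01:  # MAF < 1%
        return False
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= 1e-6  # HW p-value < 1e-6 fails
```

Chromosome X would need a separate rule (drop singleton or monomorphic markers), which is not shown here.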
4
Summary of samples and GWAS QC+ markers
Number of individuals
  WHI_AA: 8,421 / WHI_HA: 3,587
Number of markers
              Chr1-22                 ChrX
              WHIAA      WHIHA        WHIAA     WHIHA
  Total            860,510                 36,889
  QC+         829,370    834,826      35,411    35,035
Note: chromosome X imputation is still in progress, so results for chromosome X will be available soon.
5
Reference Haplotypes
The complete set of 1000G Phase I Integrated Release version 3 haplotypes in VCF format (March 2012 release)
  A total of 2184 haplotypes
  A total of ~38M markers, including singleton and monomorphic sites
  About 1.4M markers are short indels and large deletions; the rest are SNPs.
6
Note on reference haplotypes
A more recent reduced set of reference haplotypes, with singleton and monomorphic markers removed, is also available.
  Number of markers: ~30M
  Every marker in the reduced set is included in the complete set of reference haplotypes.
We expect singleton and monomorphic markers to have little influence on imputation quality, because:
  Phasing of the reference haplotypes was performed with the singleton and monomorphic markers included
  Our previous evaluation shows little effect of singletons on the quality of imputation (Liu, EY, et al., Genetic Epidemiology, 2012, 36: ).
7
Two-step genotype imputation -- Procedure
Step 1: Pre-phasing (MaCH1)
  WHI African American and Hispanic samples were phased separately.
Step 2: Genotype imputation (minimac)
  WHI African American and Hispanic samples were imputed separately.
  Haplotype-to-haplotype imputation: the haplotypes pre-phased in step 1 are imputed using the complete set of reference haplotypes from the 1000 Genomes Project.
8
Two-step genotype imputation -- Computational costs
Phasing and imputation strategy
  Split chromosomes into segments
  Phase / impute each segment
  Ligate segments back to chromosomes
Computational costs
                                           WHI_AA                   WHI_HA
  Phasing
    Split strategy (sample genotypes)      Core region: 3000 markers; flanking: 500 markers each
    # segments after splitting             277                      278
    Median run time                        ~245 hours (~10 days)    ~63 hours (~3 days)
  Imputation
    Split strategy (reference haplotypes)  Core region: 5 Mb;       Core region: 20 Mb
                                           flanking: 500 Kb each
    # segments after splitting             520                      150
    Median run time                        ~41 hours (~2 days)      ~71 hours (~3 days)
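The split and ligate strategy can be illustrated with a small sketch that works on marker indices, using the phasing settings from the table (core region of 3000 markers, 500 flanking markers on each side). The function names are hypothetical, and the ligation step is simplified: it only keeps per-marker results from each core region, whereas ligating real phased haplotypes also requires matching phase across the overlapping flanks.

```python
def split_into_segments(n_markers, core=3000, flank=500):
    """Return (seg_start, core_start, core_end, seg_end) index tuples.

    Each segment covers its core region plus up to `flank` markers on
    either side; end indices are exclusive.
    """
    segments = []
    for core_start in range(0, n_markers, core):
        core_end = min(core_start + core, n_markers)
        seg_start = max(core_start - flank, 0)
        seg_end = min(core_end + flank, n_markers)
        segments.append((seg_start, core_start, core_end, seg_end))
    return segments


def ligate(segment_results, segments):
    """Concatenate per-marker results, keeping only each core region.

    Simplification: flanking results are discarded; no phase alignment
    between neighbouring segments is attempted.
    """
    ligated = []
    for result, (seg_start, core_start, core_end, _) in zip(segment_results, segments):
        ligated.extend(result[core_start - seg_start:core_end - seg_start])
    return ligated
```

For a chromosome with 7000 genotyped markers this produces three segments, and ligating the per-segment results recovers exactly one value per marker in the original order.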
9
Summary of imputation results -- Before QC
                                        WHIAA        WHIHA
Number of individuals                   8,421        3,587
Total number of imputed markers              38,050,692
Number of imputed indels                     1,380,758
File size (all files gz compressed)     170 G        71 G
Note: Markers with a missing quality filter in the 1000G reference haplotypes are excluded from imputation. All excluded markers were found to be of type "MERGED_DEL".
10
Evaluation of imputation quality -- Introduction
Main idea
  Compare imputed dosages with actual genotypes
Quality metrics
  Dosage r2: squared correlation coefficient between imputed dosages (continuous values ranging between 0 and 2) and actual genotypes (coded as 0, 1 and 2)
    True imputation accuracy (range 0 ~ 1)
  Rsq: estimated dosage r2
    Estimated imputation accuracy
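As a minimal sketch of the dosage r2 metric defined above, the squared Pearson correlation between imputed dosages and actual genotypes for one marker can be computed as follows. The function name and the list-based input are illustrative choices, not part of the original analysis pipeline.

```python
def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages (0..2) and
    actual genotypes (0, 1, 2) for a single marker.

    Returns NaN when either vector has no variation, because the
    correlation is undefined in that case.
    """
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 samples")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")
    return (cov * cov) / (var_d * var_g)
```

Perfectly imputed dosages give a value of 1, and noisier dosages give values closer to 0.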
11
Evaluation of imputation quality -- Study design
[Figure: imputed dosages are compared with actual genotypes (Metabochip) to calculate dosage r2]
Individuals used in evaluation
  1,962 WHI African American samples
Markers used in evaluation
  Markers overlapping between 1000G and Metabochip but not on Affymetrix 6.0 (all 22 autosomes)
  Minor allele frequency (MAF) is defined within the 1,962 individuals
Among the 8,421 imputed WHI African American samples, 1,962 have also been genotyped on Metabochip, so actual genotypes are available for a subset of the imputed markers in these samples. Imputation quality is therefore evaluated within these 1,962 samples by comparing the imputed dosage with the actual genotype (coded as 0, 1, 2) at markers that are in both 1000 Genomes and Metabochip but not on the Affymetrix 6.0 genotyping chip. The squared correlation coefficient (dosage r2) is calculated for each evaluated marker.
12
Estimation of imputation quality -- Results
We recommend Rsq QC thresholds of 0.7, 0.6, 0.3, 0.3 and 0.3 for the MAF categories 0.1~0.5%, 0.5~1%, 1~3%, 3~5% and 5~50%, respectively. The thresholds are chosen such that an average Rsq greater than 0.8 is achieved.
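The recommended thresholds can be written as a lookup from MAF category to Rsq cutoff. The sketch below makes several assumptions that the slides do not state: category boundaries are treated as lower-inclusive and upper-exclusive (with 50% included in the last category), a marker passes when its Rsq is greater than or equal to the cutoff, and markers with MAF below 0.1% get no cutoff and are treated as failing. All names are hypothetical.

```python
# (lower MAF bound, upper MAF bound, Rsq cutoff), MAF given as a fraction.
RSQ_CUTOFFS = [
    (0.001, 0.005, 0.7),  # MAF 0.1~0.5%
    (0.005, 0.01, 0.6),   # MAF 0.5~1%
    (0.01, 0.03, 0.3),    # MAF 1~3%
    (0.03, 0.05, 0.3),    # MAF 3~5%
    (0.05, 0.50, 0.3),    # MAF 5~50%
]


def rsq_cutoff(maf):
    """Return the Rsq cutoff for a MAF, or None if no category applies."""
    for lower, upper, cutoff in RSQ_CUTOFFS:
        if lower <= maf < upper:
            return cutoff
    if maf == 0.5:  # assumption: the last category includes its upper bound
        return RSQ_CUTOFFS[-1][2]
    return None


def passes_post_imputation_qc(maf, rsq):
    """True if the marker's Rsq meets the cutoff of its MAF category."""
    cutoff = rsq_cutoff(maf)
    return cutoff is not None and rsq >= cutoff
```

For example, a marker with MAF 0.3% and Rsq 0.65 fails, while a marker with MAF 2% and Rsq 0.35 passes.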
13
Estimation of imputation quality -- Summary
We recommend Rsq QC thresholds of 0.7, 0.6 and 0.3 for the MAF categories 0.1~0.5%, 0.5~1% and >1%, respectively.
  The thresholds are chosen such that an average Rsq greater than 0.8 is achieved in each MAF category (Liu, EY, et al., Genetic Epidemiology, 2012, 36: ).
Estimation based on imputation quality assessment
  Total number of markers passing QC
  Total number of indels passing QC
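One way such a threshold could be chosen within a single MAF category is sketched below. This is an interpretation, not the documented procedure: it assumes that each evaluated marker has both an estimated Rsq and a dosage r2 measured against actual genotypes, that the "average" refers to the mean dosage r2 of the markers retained by the cutoff, and that candidate cutoffs lie on a grid in steps of 0.1. The function name is hypothetical.

```python
def choose_rsq_cutoff(markers, target=0.8, grid=None):
    """Pick the smallest Rsq cutoff whose retained markers have a mean
    dosage r2 above `target`.

    markers: list of (estimated_rsq, dosage_r2) pairs from one MAF category.
    Returns None if no cutoff on the grid reaches the target.
    """
    if grid is None:
        grid = [i / 10 for i in range(0, 11)]  # 0.0, 0.1, ..., 1.0
    for cutoff in sorted(grid):
        kept = [r2 for rsq, r2 in markers if rsq >= cutoff]
        if kept and sum(kept) / len(kept) > target:
            return cutoff
    return None
```

Using the smallest qualifying cutoff keeps as many markers as possible while still meeting the accuracy target, which is why rarer MAF categories end up with stricter cutoffs.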
14
Estimation based on imputation quality assessment -- Note
The values are estimated because:
  Estimated Rsq cutoffs
    Evaluation is based on markers on Metabochip
  Estimated MAF
    MAF of imputed markers is calculated from imputed dosages
Note that the number (and percentage) of markers passing QC in each MAF category is an estimate because: 1) Rsq cutoffs are set using an evaluation of markers on Metabochip, and the imputation quality of markers not on Metabochip may differ; 2) the actual MAF of imputed markers cannot be obtained within WHI African Americans or within WHI Hispanics, so the MAF used is the estimated value provided in the imputation output; 3) we do not yet have access to the actual genotypes of WHI Hispanics, so in this estimation we assumed that WHI Hispanics have Rsq cutoffs similar to those of WHI African Americans in each MAF category. We also assume that SNPs and indels have similar Rsq cutoffs. Because the current Rsq cutoffs are estimated using SNPs on Metabochip, the cutoffs for indels may differ. Until an evaluation of indel imputation quality is available, however, we have to assume that the cutoffs estimated from SNPs can be applied to indels.
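The estimated MAF mentioned above can be illustrated with a short sketch: the mean imputed dosage divided by two estimates the frequency of the counted allele, and the smaller of that frequency and its complement is the MAF. The function name is hypothetical, and the imputation software may report this value directly in its output.

```python
def maf_from_dosages(dosages):
    """Estimate the minor allele frequency from imputed dosages (0..2)."""
    if not dosages:
        raise ValueError("at least one dosage is required")
    # Each individual carries two alleles, so divide the dosage sum by 2n.
    allele_freq = sum(dosages) / (2 * len(dosages))
    return min(allele_freq, 1.0 - allele_freq)
```

Because dosages are expected allele counts rather than observed genotypes, the resulting MAF inherits the uncertainty of the imputation, which is one reason the QC pass counts are estimates.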
15
Estimation based on imputation quality assessment -- Note (cont’d)
The values are estimated because:
  Estimated QC thresholds for WHI Hispanic samples
    We assumed that WHI Hispanics have Rsq cutoffs similar to those of WHI African Americans in each MAF category
    We will perform a similar quality assessment in the Hispanic samples once we have their QC+ Metabochip data
  Estimated QC thresholds for indels
    Rsq cutoffs are set based on an evaluation of SNPs. We assumed that indels have Rsq cutoffs similar to those of SNPs in each MAF category
16
Estimation based on imputation quality assessment -- Total number of markers passing QC
To get a general idea of how many imputed markers are retained after applying QC, we estimated the number in each MAF category within WHI African Americans and Hispanics separately. We chose Rsq cutoffs of 0.7 and 0.6 for the MAF categories 0.1~0.5% and 0.5~1%, such that the average Rsq is greater than 0.8 (as shown in Table 1). We use 0.3 as the Rsq cutoff in the remaining MAF categories.
Note: Markers include both SNPs and indels
17
Estimation based on imputation quality assessment -- Number of indels passing QC
18
Summary
We conducted genotype imputation for 8,421 African American and 3,587 Hispanic samples in the Women's Health Initiative (WHI) study using reference haplotypes from the 1000 Genomes Project (version 3, March 2012 release).
Summary of imputation results before and after QC
                                                   WHIAA         WHIHA
Number of individuals                              8,421         3,587
Total number of markers, before QC                      38,050,692
Total number of markers, after QC                  18,940,103    15,214,231
Number of indels, before QC                             1,380,758
Number of indels, after QC                         1,219,538     1,126,704
File size, before QC (all files gz compressed)     170 G         71 G
File size, after QC (all files gz compressed)      102 G         33 G