ESP6800 BP Analysis (recessive model)
Zhengzheng Tang and Danyu Lin
March 26, 2013
Variant QC
- Started with the “ESP6900 May02 release” vcf (1908614 variants; 6823 individuals)
- SVM filter
  - deleted variants that were flagged in the vcf
- Per-genotype depth filter
  - set genotype to missing if DP < 10
- Bi-allelic filter
  - deleted non-bi-allelic variants
  - after this step, 1908596 variants remain (18 variants deleted)
- Per-variant depth filter
  - filtered out all variants with an average read depth > 500
  - applied Paul’s exclusion list Fail_DP500 (284 variants listed)
  - after this step, 1908312 variants remain (284 variants deleted)
- HWE filter
  - deleted variants if race-specific p-value < 5x10^-8
  - applied Paul’s race-specific exclusion lists Fail_hwe_AA (2779 variants listed) and Fail_hwe_EA (2592 variants listed)
  - after this step, 1904594 variants remain (3718 variants deleted)
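The two depth filters above can be sketched in a few lines. This is a minimal illustration, not the pipeline that was actually run: the function names and the list-based data layout are hypothetical, and it assumes the per-variant average is taken over non-missing DP values, which the slide does not specify.

```python
from statistics import mean

MIN_GENOTYPE_DP = 10   # per-genotype threshold stated on the slide
MAX_MEAN_DP = 500      # per-variant threshold stated on the slide


def mask_low_depth(genotypes, depths, min_dp=MIN_GENOTYPE_DP):
    """Set a genotype to None (missing) when its read depth is unknown or below min_dp."""
    return [
        g if (d is not None and d >= min_dp) else None
        for g, d in zip(genotypes, depths)
    ]


def passes_mean_depth(depths, max_mean_dp=MAX_MEAN_DP):
    """Keep a variant only if its average read depth does not exceed max_mean_dp.

    Assumption: the average is taken over subjects with a recorded depth.
    """
    observed = [d for d in depths if d is not None]
    return bool(observed) and mean(observed) <= max_mean_dp


# Hypothetical variant: four subjects, genotypes coded as minor-allele counts.
masked = mask_low_depth([0, 1, 2, 0], [25, 8, 40, None])
keep_variant = passes_mean_depth([25, 8, 40, None])
```

In this toy input the second and fourth genotypes become missing, while the variant itself is retained because its average depth is far below 500.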
Sample QC I
- Started with 6823 subjects
- Dropped 536 subjects in certain cohorts/phenotype groups
  - ESP_COHORT: PAH (80), CF (431)
  - ESP_PHENOTYPE: BP (1), Blind (22), EOMI_Case_Drop (2), SSC PAH (40), SSC no PAH (35), Other (5)
- Dropped 34 subjects whose self-reported race is not missing and is not AA or EA
- Dropped 52 subjects based on QC information contained in the phenotype file
  - intentional duplicates (22), high missing rates (1), sex mismatch (13), high homozygosity (1), poor concordance (5), no data to check concordance (3), unresolved ID (4), self-reported race is missing (30), PCA outliers (17), race mismatch (13)
- Dropped 16 subjects due to duplications
  - for each duplicated pair, dropped the subject with the lower genotype call rate
Sample QC II
- Dropped 155 subjects due to relatedness
  - Only 1st and 2nd degree relatedness is considered
  - For each related pair:
    - If one of the subjects in the pair is missing on SSR and the other is not, drop the one with missing SSR.
    - If both of them are missing on SSR or neither of them is missing on SSR, drop the one with the lower genotype call rate.
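The pair-wise rule above is easy to state as a small function. The sketch below is illustrative only: the dictionary keys `id`, `ssr`, and `call_rate` are hypothetical, and the tie-breaking behavior when both call rates are equal (the second subject is dropped) is an assumption, since the slide does not cover that case.

```python
def subject_to_drop(a, b):
    """Return the ID of the subject to drop from a related pair.

    Each subject is a dict with hypothetical keys:
      'id'        - subject identifier
      'ssr'       - SSR value, or None when missing
      'call_rate' - genotype call rate in [0, 1]
    """
    a_missing = a["ssr"] is None
    b_missing = b["ssr"] is None

    # Exactly one subject is missing SSR: drop that one.
    if a_missing != b_missing:
        return a["id"] if a_missing else b["id"]

    # Both or neither missing SSR: drop the lower call rate.
    # Assumption: on an exact tie, the second subject is dropped.
    return a["id"] if a["call_rate"] < b["call_rate"] else b["id"]


# Hypothetical related pair.
s1 = {"id": "S1", "ssr": None, "call_rate": 0.99}
s2 = {"id": "S2", "ssr": 1.2, "call_rate": 0.95}
dropped = subject_to_drop(s1, s2)
```

Here `S1` is dropped despite its higher call rate, because the SSR check takes precedence over the call-rate comparison.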
Data
- Phenotype groups (studies):
    Study: LDL  BMI  BP  EOMI  Stroke  DPR
    AA, EA: 286 318 6120 262 502 351 579 82 404 226 667
    AA w/BP, EA w/BP: 227 269 520 260 464 36 196 188 447
- Genetic variants:
  - Nonsynonymous T1/T5/VT burden scores
  - LOF T5
  - Single variants
- Annotation: SeattleSeq
- MAFs: extracted from vcf file
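For readers unfamiliar with the T1/T5 notation, the sketch below shows one common way a fixed-threshold burden score is formed: sum each subject's minor-allele counts over the variants in a gene whose MAF is at or below 1% (T1) or 5% (T5). This is a generic illustration with hypothetical names. It does not reproduce the recessive-model coding used in this analysis, and whether the threshold is inclusive is an assumption.

```python
def burden_scores(genotype_rows, mafs, maf_threshold):
    """Compute a simple fixed-threshold burden score per subject.

    genotype_rows: one list per subject with minor-allele counts (0/1/2),
                   one entry per variant in the gene.
    mafs:          minor allele frequency of each variant.
    maf_threshold: upper MAF bound (0.01 for T1, 0.05 for T5); inclusive here
                   by assumption.
    """
    keep = [j for j, maf in enumerate(mafs) if maf <= maf_threshold]
    return [sum(row[j] for j in keep) for row in genotype_rows]


# Hypothetical gene with three variants and three subjects.
mafs = [0.004, 0.03, 0.20]
rows = [
    [1, 0, 2],
    [0, 1, 1],
    [0, 0, 0],
]
t1 = burden_scores(rows, mafs, 0.01)  # only the first variant qualifies
t5 = burden_scores(rows, mafs, 0.05)  # first and second variants qualify
```

The common third variant (MAF 0.20) contributes to neither score, which is the point of thresholding by frequency.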
Data Processing
- Phenotypes, Genotypes, Genes, SSR
- Remove variants with call rates < 90%.
- Impute missing values by the expected number of minor alleles.
- Remove variants with MAC < 2 for single-variant analysis.
- Genes
  - Remove genes with MAC < 2 for rare-variant analysis.
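The variant-level processing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: genotypes are coded as minor-allele counts with `None` for missing, the expected number of minor alleles is estimated as the mean of the observed genotypes (that is, twice the estimated allele frequency), and MAC is counted on observed genotypes before imputation. The slide does not specify these details, and all function names are hypothetical.

```python
def call_rate(genotypes):
    """Fraction of subjects with a non-missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_count(genotypes):
    """Total minor-allele count over observed genotypes."""
    return sum(g for g in genotypes if g is not None)


def impute_expected(genotypes):
    """Replace missing genotypes by the mean observed minor-allele count."""
    observed = [g for g in genotypes if g is not None]
    expected = sum(observed) / len(observed)  # equals 2 * estimated MAF
    return [expected if g is None else g for g in genotypes]


def process_variant(genotypes, min_call_rate=0.90, min_mac=2):
    """Return imputed genotypes, or None if the variant is filtered out."""
    if call_rate(genotypes) < min_call_rate:
        return None
    if minor_allele_count(genotypes) < min_mac:
        return None
    return impute_expected(genotypes)


# Hypothetical variant: 10 subjects, one missing genotype.
processed = process_variant([0, 1, None, 0, 2, 0, 0, 0, 1, 0])
```

In this example the variant passes both filters (call rate 0.9, MAC 4) and the single missing genotype is replaced by 4/9.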
Analysis
- Perform race-specific analysis and meta-analyze the results.
- Total number of genes/variants analyzed:
  - 5658 (Nonsyn T1)
  - 8116 (Nonsyn T5)
  - 139 (LOF T5)
  - 210077 (Single variant)
- Covariates: pc1-2, sex, age, age squared, target, cohorts
Analysis
- Methods:
  - For each study, calculated the score statistic based on full likelihood (MLE).
  - Performed meta-analysis of the score statistics from the six studies.
  - The MLE approach properly adjusts for trait-dependent sampling. It has the highest power among all valid tests and provides unbiased estimates of genetic effects.
- Genetic models:
  - Recessive model
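One standard way to meta-analyze score statistics is to sum the per-study scores and their variances and refer the standardized sum to a normal distribution. The sketch below illustrates that fixed-effect combination. The slides do not give the exact formula used, so this is an assumption, and the input numbers are made up for illustration.

```python
import math


def meta_score_test(scores, variances):
    """Combine per-study score statistics U_k with variances V_k.

    Returns (z, p), where z = sum(U_k) / sqrt(sum(V_k)) and p is the
    two-sided normal p-value.
    """
    u_total = sum(scores)
    v_total = sum(variances)
    z = u_total / math.sqrt(v_total)
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical score statistics and variances from six studies.
study_scores = [1.2, 0.8, -0.3, 2.0, 0.5, 1.1]
study_variances = [1.0, 0.9, 1.1, 1.5, 0.7, 0.8]
z, p = meta_score_test(study_scores, study_variances)
```

Because only the scores and their variances are pooled, each study can be fitted separately with its own covariates and sampling design before the results are combined.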
Nonsyn T1: combined
Nonsyn T1: AA
Nonsyn T1: EA
Nonsyn T1: Top 1-25 genes
Nonsyn T1: Top 26-50 genes
Nonsyn T5: combined
Nonsyn T5: AA
Nonsyn T5: EA
Nonsyn T5: Top 1-25 genes
Nonsyn T5: Top 26-50 genes
Nonsyn VT: combined
Nonsyn VT: AA
Nonsyn VT: EA
Nonsyn VT: Top 1-25 genes
Nonsyn VT: Top 26-50 genes
LOF T5: combined
LOF T5: AA
LOF T5: EA
LOF T5: Top 1-25 genes
LOF T5: Top 26-50 genes
Single Variant: combined
Single Variant: AA
Single Variant: EA
Single Variant: combined Top 1-25 SNPs
Single Variant: combined Top 26-50 SNPs
Single Variant: AA Top 1-25 SNPs
Single Variant: AA Top 26-50 SNPs
Single Variant: EA Top 1-25 SNPs
Single Variant: EA Top 26-50 SNPs