BST 775 Lecture PLINK – A Popular Toolset for GWAS Guodong Wu SSG, Department of Biostatistics University of Alabama at Birmingham September 24, 2013
Overview Designed for GWAS and population-based linkage analysis. Developed by Shaun Purcell*; current version v1.07. http://pngu.mgh.harvard.edu/~purcell/plink/ Why is the toolset so popular? It handles GWAS data sets that are too large for SAS, R, or other statistical packages. It offers well-developed guidelines and tools for data set management and quality control. It is a platform for various association methods. * Purcell et al. 2007, AJHG
Overview Data management Summary statistics Quality Control Association Test
PLINK in GWAS workflow Experimental Design & Sample Collection Cell Intensity Files for each chip GeneChip Scanner Summary statistics and quality control Phenotype, sex and other covariates Assessment of population stratification Whole genome SNP-based association Further exploration of ‘hits’ Visualization and follow-up
Data Format
PED and MAP format
MAP file (SNP information, one row per SNP):
  1   snp1  0  1000
  X   snp2  0  1000
  Y   snp3  0  1000
  XY  snp4  0  1000
  MT  snp5  0  1000
PED file (one row per person, SNPs across the columns):
  P1  A A  A C  C G  T T  A A  T T
  P2  A C  A A  C G  G T  A C  T T
  P3  C C  A C  G G  T T  A A  T T
  P4  C C  A A  G G  G T  A A  T T
Transposed format (one row per SNP, people across the columns):
  S1  A A  A C  C C  C C
  S2  A C  A A  A C  A A
  S3  C G  C G  G G  G G
  S4  T T  C G  T T  G T
  S5  A A  G T  A A  A A
  S6  T T  A C  T T  T T
People information: P1 … P2 … P3 … P4 … P5 …
Compact binary format (stored alongside SNP information and people information):
  0101010010101010101
  1010011101010101010
  1101110101001010101
  1101001011101101010
  1101010101010111010
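The relationship between the PED layout (people in rows) and the transposed layout (SNPs in rows) can be sketched in a few lines of Python. This is only an illustration on the toy PED rows from the slide, not PLINK's own code; the helper names `parse_ped_row` and `transpose` are made up for this example, and the six leading columns of a real PED file (family ID, individual ID, father, mother, sex, phenotype) are reduced to a single ID here.

```python
# Toy PED-style rows from the slide: a person ID followed by two alleles per SNP.
# Assumption: a real PED file has six leading columns; only one ID is kept here.
ped_rows = [
    "P1 A A A C C G T T A A T T",
    "P2 A C A A C G G T A C T T",
    "P3 C C A C G G T T A A T T",
    "P4 C C A A G G G T A A T T",
]

def parse_ped_row(row):
    """Split one row into (person_id, [(allele1, allele2), ...])."""
    fields = row.split()
    person, alleles = fields[0], fields[1:]
    if len(alleles) % 2 != 0:
        raise ValueError("each SNP needs exactly two alleles")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return person, genotypes

def transpose(rows):
    """Return one list per SNP holding every person's genotype, in row order."""
    parsed = [parse_ped_row(r) for r in rows]
    n_snps = len(parsed[0][1])
    return [[genos[s] for _, genos in parsed] for s in range(n_snps)]

by_snp = transpose(ped_rows)
for s, genos in enumerate(by_snp, start=1):
    # One line per SNP, people across the columns (the transposed layout).
    print(f"S{s}", "  ".join(a + " " + b for a, b in genos))
```

Both layouts carry the same genotypes; only the orientation of the table changes.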
Data management Recode dataset (A,C,G,T → 1,2) Reorder, reformat dataset Flip DNA strand Extract/remove individuals/SNPs New phenotypes, covariates as extra file Merge 2 or more data sets
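Flipping the DNA strand means replacing every allele with its complement (A with T, C with G). A minimal Python sketch of that operation follows. It illustrates the idea only and is not how PLINK implements it; the function name `flip_strand` and the use of "0" for a missing allele are assumptions of this example.

```python
# Complement table for strand flipping; "0" (missing allele, assumed coding) is left unchanged.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "0": "0"}

def flip_strand(genotype):
    """Return the genotype as read from the opposite DNA strand."""
    return tuple(COMPLEMENT[allele] for allele in genotype)

print(flip_strand(("A", "C")))
```

Flipping twice returns the original genotype, which is a quick sanity check when merging data sets typed on different strands.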
Summary and QC Hardy-Weinberg test Mendel errors Missing genotypes Allele frequencies Tests of non-random missingness by phenotype and by (unobserved) genotype Sex Check Pairwise IBD estimates
Hardy-Weinberg test plink --file data --hardy An exact test by default. In a case-control study, the control group typically needs a more lenient threshold (e.g., P-value < 1e-3).
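To make the idea behind a Hardy-Weinberg test concrete, here is a small Python sketch. Note that it uses the simpler asymptotic chi-square test with 1 degree of freedom, not the exact test that PLINK applies by default; the function names are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def chi_square_p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

stat = hwe_chi_square(25, 50, 25)     # counts exactly at HWE proportions
print(stat, chi_square_p_value_1df(stat))
```

The chi-square approximation becomes unreliable for rare alleles and small samples, which is one reason an exact test is the preferred default.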
Mendel errors
plink --file data --mendel
A Mendel error is a genotyping error in which the child's genotype cannot have been inherited from the parents according to Mendel's laws.
Outputs the error rate for each SNP and each individual.
Code  Pat , Mat -> Offspring
1     AA  , AA  -> AB
2     BB  , BB  -> AB
3     BB  , **  -> AA
4     **  , BB  -> AA
5     BB  , BB  -> AA
6     AA  , **  -> BB
7     **  , AA  -> BB
8     AA  , AA  -> BB
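The rule behind the error codes can be written as a short check: a child's genotype is consistent if one allele can come from the father and the other from the mother. The Python sketch below covers autosomal SNPs only and treats a missing parent (the `**` entries) as able to transmit any allele; the function names and the use of `None` for a missing parent are assumptions of this example, not PLINK's implementation.

```python
def can_transmit(parent, allele):
    """A missing parent (None) is assumed able to transmit any allele."""
    return parent is None or allele in parent

def is_mendel_consistent(father, mother, child):
    """True if the child could inherit one allele from each parent (autosomes only)."""
    c1, c2 = child
    return (can_transmit(father, c1) and can_transmit(mother, c2)) or (
        can_transmit(father, c2) and can_transmit(mother, c1)
    )

# Code 1 from the table: AA , AA -> AB is an error.
print(is_mendel_consistent(("A", "A"), ("A", "A"), ("A", "B")))
# Code 3 from the table: BB , ** -> AA is an error even with the mother missing.
print(is_mendel_consistent(("B", "B"), None, ("A", "A")))
```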
Missingness and Allele Frequency plink --file data --missing Output the missing rate per SNP and per individual. plink --file data --freq Output each SNP's allele frequency.
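As a rough illustration of what these two summaries mean for a single SNP, the Python sketch below computes the missing rate and the allele frequencies from a list of genotypes. It is a toy calculation, not PLINK's output format; the function name `snp_summary` and the "0" missing-allele code are assumptions of this example.

```python
from collections import Counter

def snp_summary(genotypes, missing="0"):
    """Return (missing_rate, allele_frequencies, minor_allele_frequency) for one SNP."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = [g for g in genotypes if missing not in g]
    missing_rate = 1.0 - len(called) / len(genotypes)
    counts = Counter(allele for g in called for allele in g)
    total = sum(counts.values())
    freqs = {a: c / total for a, c in counts.items()} if total else {}
    # A monomorphic (or fully missing) SNP has minor allele frequency 0.
    maf = min(freqs.values()) if len(freqs) > 1 else 0.0
    return missing_rate, freqs, maf

rate, freqs, maf = snp_summary([("A", "A"), ("A", "C"), ("0", "0"), ("A", "A")])
print(rate, freqs, maf)
```

Allele frequencies are computed over called genotypes only, so a high missing rate reduces the effective sample size for that SNP.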
Is the missingness random? plink --file data --test-missing Test whether missingness at a SNP differs between cases and controls. plink --file data --test-mishap Test whether missingness at a SNP depends on the observed genotypes at nearby SNPs. Assumes dense SNP genotyping; uses haplotype and LD information in the tests.
Sex Check plink --file data --check-sex Use X chromosome heterozygosity rates to infer sex, and then compare the inferred sex with the reported sex.
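The logic of a sex check can be sketched as follows: males carry one X chromosome, so their X genotypes look fully homozygous and an inbreeding-style coefficient F is close to 1, while females give F close to 0. The Python code below is a simplified illustration with hypothetical function names; it omits the small-sample corrections a real implementation would use, and the 0.8 and 0.2 cut-offs are assumed values that follow the defaults described in the PLINK documentation.

```python
def x_inbreeding_f(genotypes, allele_freqs):
    """F = (observed - expected homozygotes) / (n - expected homozygotes).

    genotypes: list of (allele1, allele2) tuples, or None when missing.
    allele_freqs: frequency of one allele at each SNP, in the same order.
    """
    n = 0
    observed_hom = 0
    expected_hom = 0.0
    for genotype, p in zip(genotypes, allele_freqs):
        if genotype is None:
            continue
        n += 1
        observed_hom += genotype[0] == genotype[1]
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    if n == expected_hom:
        raise ValueError("no informative (polymorphic) SNPs")
    return (observed_hom - expected_hom) / (n - expected_hom)

def call_sex(f, male_min=0.8, female_max=0.2):
    """Assumed thresholds: F > 0.8 suggests male, F < 0.2 suggests female."""
    if f > male_min:
        return "male"
    if f < female_max:
        return "female"
    return "ambiguous"

f_value = x_inbreeding_f([("A", "A"), ("C", "C"), ("G", "G"), ("T", "T")], [0.5] * 4)
print(f_value, call_sex(f_value))
```

Individuals whose inferred sex disagrees with the recorded sex, or who fall in the ambiguous range, are candidates for sample mix-ups.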
Pairwise IBD sharing (relatedness) IBD is defined relative to the most recent common ancestor in a homogeneous, random-mating population. [Figure: parents AB, AC and AB, AC; offspring AB and AC; IBS = 1, IBD = 0] PLINK tutorial, October 2006; Shaun Purcell, shaun@pngu.mgh.harvard.edu
Relatedness Check plink --file data --genome Estimates genome-wide IBD sharing for each pair of individuals. The full set of genome-wide SNPs is typically not needed; about 100K independent SNPs are usually enough.
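IBD cannot be observed directly; what can be counted from genotypes is identity by state (IBS), and IBD estimates are derived from it. The Python sketch below shows only the IBS counting step for a pair of individuals. It is an illustration with hypothetical function names and does not reproduce PLINK's IBD estimation.

```python
def ibs_count(g1, g2):
    """Number of alleles (0, 1 or 2) that two genotypes share by state."""
    remaining = list(g2)
    shared = 0
    for allele in g1:
        if allele in remaining:
            remaining.remove(allele)   # each allele can be matched only once
            shared += 1
    return shared

def mean_ibs(person1, person2):
    """Average proportion of alleles shared by state over all SNPs."""
    if len(person1) != len(person2) or not person1:
        raise ValueError("need two non-empty genotype lists of equal length")
    shared = sum(ibs_count(a, b) for a, b in zip(person1, person2))
    return shared / (2 * len(person1))

# The pair from the IBD slide: AB and AC share one allele by state.
print(ibs_count(("A", "B"), ("A", "C")))
```

Because unrelated people also share alleles by state, relatedness is judged against the sharing expected from population allele frequencies, which is why a large set of independent SNPs is used.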
Association methods in PLINK
Population-based
  Allelic, trend, genotypic, Fisher's exact
  Stratified tests (Cochran-Mantel-Haenszel, Breslow-Day)
  Linear & logistic regression models: multiple covariates, interactions, joint tests, etc.
Family-based
  Disease traits: TDT / sib-TDT
  Continuous traits: QFAM (between/within model, QTDT)
Permutation procedures
  "adaptive", max(T), gene-dropping, between/within, rank-based, within-cluster
Multilocus tests
  Haplotype estimation, set-based tests, Hotelling's T², epistasis
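The simplest of the population-based tests listed above, the allelic test, compares allele counts between cases and controls in a 2x2 table. A minimal Python sketch of that test (chi-square with 1 degree of freedom, no continuity correction) is given below; the function name and the example counts are made up for illustration.

```python
import math

def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Return (chi_square, p_value, odds_ratio) for a 2x2 table of allele counts."""
    n = case_a + case_b + ctrl_a + ctrl_b
    row_col_product = (
        (case_a + case_b) * (ctrl_a + ctrl_b) * (case_a + ctrl_a) * (case_b + ctrl_b)
    )
    if row_col_product == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / row_col_product
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # chi-square upper tail, 1 df
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a) if case_b and ctrl_a else float("inf")
    return chi2, p_value, odds_ratio

# Hypothetical counts: allele A seen 60 times in cases and 40 times in controls.
chi2, p_value, odds_ratio = allelic_test(60, 40, 40, 60)
print(chi2, p_value, odds_ratio)
```

In a genome-wide scan the same test is repeated for every SNP, so the resulting p-values must be judged against a multiple-testing threshold.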
An Example: Logistic Regression plink --maf 0.05 --exclude nonautosomalSNPs.txt --out AllAssoc --bfile bdata --remove exclusions.txt --logistic --hide-covar --pheno IChipCovs.txt --pheno-name cas_con --covar IChipCovs.txt --covar-name Sex,EurAdmix
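The command on this slide is long, so it can help to read it one option at a time. The Python sketch below only assembles the same command line with a comment per option and prints it; it does not run PLINK. The file names are those from the slide, and the variable names are hypothetical.

```python
import shlex

plink_args = [
    "plink",
    "--bfile", "bdata",                    # binary fileset: bdata.bed / .bim / .fam
    "--remove", "exclusions.txt",          # drop the individuals listed in this file
    "--exclude", "nonautosomalSNPs.txt",   # drop the SNPs listed in this file
    "--maf", "0.05",                       # keep SNPs with minor allele frequency >= 0.05
    "--pheno", "IChipCovs.txt",            # read the phenotype from an external file
    "--pheno-name", "cas_con",             # column of that file to use as phenotype
    "--covar", "IChipCovs.txt",            # read covariates from the same file
    "--covar-name", "Sex,EurAdmix",        # covariates to include in the model
    "--logistic",                          # logistic regression for a case/control trait
    "--hide-covar",                        # report only the SNP term, not covariate rows
    "--out", "AllAssoc",                   # prefix for the output and log files
]

command = shlex.join(plink_args)
print(command)
```

The order of the options does not matter to PLINK, so grouping them by purpose (input, filters, phenotype and covariates, analysis, output) makes the log easier to audit.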
An Example: Logistic Regression Result
Cardinal rules in PLINK Always consult the log file and the console output. Also consult the web documentation regularly. PLINK has no memory: each run loads the data anew, and previous filters are lost. Exact syntax and spelling are important (“minus minus” …). PLINK tutorial, October 2006; Shaun Purcell, shaun@pngu.mgh.harvard.edu