Genome-Wide Pharmacogenomic Study on Methadone Maintenance Treatment
Analysis Flow of the Study
    1. Data preprocessing
    2. Genome-wide single locus association test
    3. Manhattan plot & Q-Q plot
    4. False discovery rate (FDR) correction
    5. Regional association plot
    6. Analysis of the proportion of variation explained by significant SNPs
Data Availability
GSE78098_series_matrix.txt.gz is downloaded from GSE78098:
    wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE78nnn/GSE78098/matrix/GSE78098_series_matrix.txt.gz
    gunzip *gz
GPL21480-513.txt is downloaded from GSE78098:
    download full table from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL21480
GSE78098_MMT_stage_discovery_postqc.txt.gz is downloaded from GSE78098:
    wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE78nnn/GSE78098/suppl/GSE78098_MMT_stage_discovery_postqc.txt.gz
File Documentation
GSE78098_series_matrix.xlsx (phenotype & covariates)
    Row 30 is sample ID
    Row 41 is r_met_plasma_concentration
    Row 42 is s_met_plasma_concentration
    Row 43 is r_eddp_plasma_concentration
    Row 44 is s_eddp_plasma_concentration
    Row 45 is age
    Row 46 is gender
    Row 47 is bmi
    First 344 samples: discovery stage
    Last 76 samples: replication stage
    Data transformation to normality was performed
GPL21480-513.txt (SNP information)
    Rows 31 onward contain SNP information
    Col 1 is ID, Col 3 is SNP ID, Col 5 is chromosome, Col 6 is physical position
GSE78098_MMT_stage_discovery_postqc.txt (genotype)
    Row 1 is sample ID, Col 1 is SNP ID
    Statistical quality control procedures were performed using PLINK software
Preprocessed Data
Covariates: covariates.txt
    Col 1 is sample ID, Col 2 is age, Col 3 is gender, Col 4 is bmi
Phenotype data: phenotype.txt
    Col 1 is sample ID, Col 2 is trait 1, Col 3 is trait 2, Col 4 is trait 3, Col 5 is trait 4
Summary Statistics
Summarize covariates and transformed data of quantitative traits by gender (using R)
    # Read data
    pheno=read.table("phenotype.txt", header=T)
    covar=read.table("covariates.txt", header=T)
    df=data.frame(covar$gender, covar$age, covar$bmi, pheno[,2:5])
    names(df)=c("gender", "age", "bmi", "r_met", "s_met", "r_eddp", "s_eddp")
    # Sample size, mean and standard deviation by gender
    library(plyr)
    ddply(df, ~gender, summarise,
          count=length(r_met[!is.na(r_met)]),
          mean=mean(r_met[!is.na(r_met)]),
          sd=sd(r_met[!is.na(r_met)]))
    # Normality test (Kolmogorov-Smirnov test from the fBasics package)
    install.packages("fBasics")
    library(fBasics)
    ksnormTest(df$r_met[!is.na(df$r_met)])
    # Histogram
    hist(df$r_met[!is.na(df$r_met)], xlab="r_met", ylab="histogram of r_met")
Summary Statistics of Covariates and the Transformed Data of Quantitative Traits by Gender
Characteristics                                | Male: sample size | Male: mean ± SD  | Female: sample size | Female: mean ± SD | Normality test (p value)
Age (years)                                    | 281               | 39.3061 ± 7.6587 | 63                  | 33.0318 ± 5.4506  | -
BMI (kg/m2)                                    | 278               | 23.9176 ± 3.4537 |                     | 22.398 ± 3.5792   |
Transformed plasma R-methadone/dose (ng/ml/mg) |                   | 0.0030 ± 0.9946  |                     | -0.0136 ± 1.0316  | 0.6255
Transformed plasma S-methadone/dose (ng/ml/mg) |                   | 0.0173 ± 1.0227  |                     | -0.0774 ± 0.8953  | 0.0802
Transformed plasma R-EDDP/dose (ng/ml/mg)      | 272               | 0.0212 ± 0.9787  |                     | -0.0916 ± 1.0908  | 0.7903
Transformed plasma S-EDDP/dose (ng/ml/mg)      | 277               | -0.0004 ± 1.0075 |                     | 0.0019 ± 0.9741   | 0.1876
Covariates Adjustments
Replace missing values in phenotype and covariates with mean of the variable (using R)
    # Phenotype
    pheno=read.table("phenotype.txt", header=T)
    pheno0=pheno[,-1]   # drop the sample ID column
    n=dim(pheno0)[1]
    p=dim(pheno0)[2]
    for (i in 1:p){
      pheno0[is.na(pheno0[,i]),i]=mean(pheno0[!is.na(pheno0[,i]),i])
    }
    write.table(pheno0, "phenotype0.txt", row.names=F, col.names=F, quote=F, sep=" ")
    # Covariates
    covar=read.table("covariates.txt", header=T)
    covar0=covar[,-1]   # drop the sample ID column
    n=dim(covar0)[1]
    p=dim(covar0)[2]
    for (i in 1:p){
      covar0[is.na(covar0[,i]),i]=mean(covar0[!is.na(covar0[,i]),i])
    }
    write.table(covar0, "covariates0.txt", row.names=F, col.names=F, quote=F, sep=" ")
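The mean-imputation step can be checked on a small example before running it on the real files. The following is a minimal sketch on made-up values; the data frame `toy` and the helper `impute_mean` are hypothetical names introduced here, not part of the original workflow.

```r
# Toy data standing in for phenotype.txt (hypothetical values)
toy <- data.frame(id = c("s1", "s2", "s3", "s4"),
                  trait1 = c(1.0, NA, 3.0, 2.0),
                  trait2 = c(NA, 0.5, 1.5, NA))

# Replace each NA with the mean of the observed values in the same column
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy0 <- toy
toy0[-1] <- lapply(toy[-1], impute_mean)   # skip the ID column
print(toy0)
```

Mean imputation keeps every sample in the analysis but shrinks the variance of the imputed variable, so it is worth checking how many values were filled in for each trait.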
Covariates Adjustments
Covariates adjustments (using R)
    pheno=read.table("phenotype0.txt")
    covar=read.table("covariates0.txt")
    pheno=as.matrix(pheno)
    covar=as.matrix(covar)
    n=dim(pheno)[1]
    p=dim(pheno)[2]
    fit=list()
    residpheno=matrix(0,n,p)
    # Regress each trait on all covariates and keep the residuals
    for (i in 1:p){
      fit[[i]]=lm(pheno[,i]~covar)
      residpheno[,i]=resid(fit[[i]])
    }
    write.table(residpheno, "resid_phenotype0.txt", row.names=F, col.names=F, quote=F, sep=" ")
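To see what the adjustment achieves, the sketch below repeats the regression on simulated data and confirms that the residuals carry no linear information about the covariates. All values are simulated and the variable names are illustrative assumptions.

```r
set.seed(1)
n <- 50
# Simulated covariates (not study data)
covar <- cbind(age = rnorm(n, 38, 7),
               gender = rbinom(n, 1, 0.2),
               bmi = rnorm(n, 24, 3))
# Simulated trait that depends on age
pheno <- 0.05 * covar[, "age"] + rnorm(n)

fit <- lm(pheno ~ covar)
res <- resid(fit)

# Residuals have mean zero and are uncorrelated with every covariate
max_abs_cor <- max(abs(cor(res, covar)))
print(c(mean_resid = mean(res), max_abs_cor = max_abs_cor))
```

Because the residuals are orthogonal to age, gender and BMI by construction, a SNP effect found in the subsequent association test cannot be explained by a linear effect of those covariates.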
PLINK
PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner, see http://pngu.mgh.harvard.edu/~purcell/plink/.
We will use PLINK to perform genome-wide single-locus association analysis.
PLINK Input Files
PED file
    Col 1: Family ID
    Col 2: Individual ID
    Col 3: Paternal ID
    Col 4: Maternal ID
    Col 5: Sex (1=male; 2=female; other character=unknown)
    Col 6: Phenotype (The missing phenotype value for quantitative traits is, by default, -9)
    Col 7-: Genotypes
Example:
    FAM001 1 0 0 1 3.4 A A G G A C C C
    FAM001 2 0 0 1 2.5 A A A G 0 0 A C
PLINK Input Files
MAP file
    Col 1: Chromosome (1-22, X, Y or 0 if unplaced)
    Col 2: rs# or SNP identifier
    Col 3: Genetic distance (morgans)
    Col 4: Base-pair position (bp units)
Example (a negative base-pair position tells PLINK to exclude that SNP):
    1 rs123456 0 1234555
    1 rs234567 0 1237793
    1 rs224534 0 -1237697
    1 rs233556 0 1337456
PLINK Input Files
Alternate phenotype files (to specify an alternate phenotype for analysis, other than the one in the PED file)
    Col 1: Family ID
    Col 2: Individual ID
    Col 3: Phenotype A
    Col 4: Phenotype B
    Col 5: Phenotype C
    Col 6: Phenotype D
    ……
Example:
    FAM001 1 2.3 22.22 2 19
    FAM002 2 3.2 18.23 1 32
PLINK-Ready Files
    PED file: genotype.ped
    MAP file: genotype.map
    Alternate phenotype file: resid_phenotype.txt
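The alternate phenotype file needs family and individual IDs in its first two columns, whereas the residual matrix written earlier has no ID columns. One possible way to bridge the two is sketched below on toy values. It assumes the residual rows are in the same sample order as the PED file; the IDs, values and output file name are hypothetical.

```r
# Hypothetical stand-ins: in practice the IDs would be the first two columns
# of genotype.ped and the matrix would come from the covariate adjustment step
ids <- data.frame(FID = c("FAM001", "FAM002", "FAM003"),
                  IID = c("1", "2", "3"))
resid_pheno <- matrix(c(0.12, -0.40, 0.28,
                        1.10, -0.65, -0.45), nrow = 3)  # 3 samples x 2 traits

# Assumption: row k of resid_pheno belongs to row k of ids
alt_pheno <- cbind(ids, resid_pheno)

out <- file.path(tempdir(), "resid_phenotype_example.txt")
write.table(alt_pheno, out, row.names = FALSE, col.names = FALSE,
            quote = FALSE, sep = " ")
print(readLines(out))
```

If the sample order differs between the two sources, the rows should be matched on sample ID (for example with `match`) before binding, otherwise phenotypes are silently assigned to the wrong individuals.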
Genome-Wide Single Locus Association Test
See PLINK reference page http://pngu.mgh.harvard.edu/~purcell/plink/anal.shtml#qt.
Run PLINK:
    plink --noweb --file genotype --assoc --adjust --pheno resid_phenotype.txt --all-pheno --out younameit
Usage
    --file specifies .ped and .map files
    --assoc performs case/control or QTL association
    --adjust generates a file of adjusted significance values that correct for all tests performed and other metrics
    --pheno specifies alternate phenotype
    --all-pheno performs association for all phenotypes in file
    --out specifies output filename
Genome-Wide Single Locus Association Test
This will generate the files younameit.P1.qassoc, younameit.P2.qassoc, younameit.P3.qassoc, younameit.P4.qassoc with fields as follows:
    CHR     Chromosome number
    SNP     SNP identifier
    BP      Physical position (base-pair)
    NMISS   Number of non-missing genotypes
    BETA    Regression coefficient
    SE      Standard error
    R2      Regression r-squared
    T       Wald test (based on t-distribution)
    P       Wald test asymptotic p-value
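The fields above correspond to a simple linear regression of the trait on the number of copies of one allele. The sketch below reproduces the same kind of statistics for a single simulated SNP with `lm`; it is meant to illustrate what the columns mean, not to replicate PLINK's exact numerical output. The genotype coding (0, 1, 2 copies of the minor allele) and all values are assumptions.

```r
set.seed(2)
n <- 200
geno <- rbinom(n, 2, 0.3)          # copies of the minor allele (0, 1, 2), simulated
pheno <- 0.4 * geno + rnorm(n)     # simulated quantitative trait

fit <- summary(lm(pheno ~ geno))
coefs <- fit$coefficients["geno", ]

result <- c(NMISS = n,
            BETA  = unname(coefs["Estimate"]),
            SE    = unname(coefs["Std. Error"]),
            R2    = fit$r.squared,
            T     = unname(coefs["t value"]),
            P     = unname(coefs["Pr(>|t|)"]))
print(signif(result, 4))
```

With a single predictor, T is BETA divided by SE, and R2 equals the squared correlation between genotype and trait.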
Genome-Wide Single Locus Association Test
--adjust generates the files younameit.P1.qassoc.adjusted, younameit.P2.qassoc.adjusted, younameit.P3.qassoc.adjusted, younameit.P4.qassoc.adjusted, which contain the following fields:
    CHR        Chromosome number
    SNP        SNP identifier
    UNADJ      Unadjusted p-value
    GC         Genomic-control corrected p-values
    BONF       Bonferroni single-step adjusted p-values
    HOLM       Holm (1979) step-down adjusted p-values
    SIDAK_SS   Sidak single-step adjusted p-values
    SIDAK_SD   Sidak step-down adjusted p-values
    FDR_BH     Benjamini & Hochberg (1995) step-up FDR control
    FDR_BY     Benjamini & Yekutieli (2001) step-up FDR control
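Several of these corrections are also available in base R through `p.adjust`, which can be handy for cross-checking a handful of p-values. The sketch below uses made-up p-values; the Sidak single-step column is computed directly from its formula, 1 - (1 - p)^m, since `p.adjust` does not provide it.

```r
p <- c(0.0001, 0.001, 0.008, 0.02, 0.04, 0.2, 0.5, 0.9)  # illustrative values only
m <- length(p)

adjusted <- data.frame(
  UNADJ    = p,
  BONF     = p.adjust(p, method = "bonferroni"),
  HOLM     = p.adjust(p, method = "holm"),
  SIDAK_SS = 1 - (1 - p)^m,
  FDR_BH   = p.adjust(p, method = "BH"),
  FDR_BY   = p.adjust(p, method = "BY")
)
print(signif(adjusted, 3))
```

For any set of p-values the Benjamini-Hochberg value is never larger than the Holm value, which in turn is never larger than the Bonferroni value, which is why FDR control usually retains more SNPs than family-wise error control.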
qqman R Package
qqman is an R package for creating Q-Q and Manhattan plots from GWAS results. See the reference page http://www.gettinggeneticsdone.com/2014/05/qqman-r-package-for-qq-and-manhattan-plots-for-gwas-results.html.
The qqman R package assumes you have columns named SNP, CHR, BP, and P, corresponding to the SNP name (rs number), chromosome number, base-pair position, and p-value. Here is what the data looks like:
    SNP CHR BP P
    rs10495434 1 235800006 0.62220
    rs6689417 1 46100028 0.06195
    rs3897197 1 143700035 0.10700
    rs2282450 1 202300047 0.47280
    rs11208515 1 64900051 0.53430
Manhattan Plot and Q-Q Plot
Prepare qqman R input files (CHR, SNP, BP, P)
    awk '{print $1,$2,$3,$9}' younameit.P1.qassoc > P1.qassoc
    awk '{print $1,$2,$3,$9}' younameit.P2.qassoc > P2.qassoc
    awk '{print $1,$2,$3,$9}' younameit.P3.qassoc > P3.qassoc
    awk '{print $1,$2,$3,$9}' younameit.P4.qassoc > P4.qassoc
Manhattan Plot and Q-Q Plot
Create Manhattan plots and Q-Q plots (using R)
    traits=c("r_met", "s_met", "r_eddp", "s_eddp")
    traits=as.matrix(traits)
    library(qqman)
    i=1 #i=2/i=3/i=4
    qassoc=read.table(paste0("P", i, ".qassoc"), header=T)
    qassoc=qassoc[qassoc$CHR!=0,]   # drop unplaced SNPs
    # Manhattan plot
    png(filename=paste0("Manhattan_Plot_for_", traits[i], ".png"), type="cairo")
    manhattan(qassoc, col=c("green4", "red"), suggestiveline=F, genomewideline=F)
    dev.off()
    # Q-Q plot
    png(filename=paste0("Q-Q_Plot_for_", traits[i], ".png"), type="cairo")
    qq(qassoc$P)
    dev.off()   # close the device so the Q-Q plot file is written
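A Q-Q plot compares the sorted observed -log10 p-values with the values expected if every SNP were null. The sketch below computes both sets of quantiles by hand in base R and adds the genomic inflation factor (lambda), a common single-number summary of the same plot. The p-values are simulated under the null and are not results from this study.

```r
set.seed(3)
p <- runif(10000)                         # simulated null p-values, not study results

observed <- -log10(sort(p))               # smallest p-value first
expected <- -log10(ppoints(length(p)))    # expected quantiles under a uniform null

# Genomic inflation factor: median observed 1-df chi-square over the null median
chisq <- qchisq(p, df = 1, lower.tail = FALSE)
lambda <- median(chisq) / qchisq(0.5, df = 1)
print(round(lambda, 3))
```

A lambda close to 1 suggests that the bulk of the test statistics follows the null distribution; values well above 1 usually point to population stratification or other systematic inflation.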
Manhattan Plot of Genome-Wide Single Locus Association Test for R-Methadone and S-Methadone
Q-Q Plot of Genome-Wide Single Locus Association Test for R-methadone and S-Methadone
Identify Significant SNPs After a Multiple-Test Correction of a False Discovery Rate (FDR)
Prepare R input files (CHR, SNP, UNADJ, FDR_BH)
    awk '{print $1,$2,$3,$9}' younameit.P1.qassoc.adjusted > P1.qassoc.adjusted
    awk '{print $1,$2,$3,$9}' younameit.P2.qassoc.adjusted > P2.qassoc.adjusted
    awk '{print $1,$2,$3,$9}' younameit.P3.qassoc.adjusted > P3.qassoc.adjusted
    awk '{print $1,$2,$3,$9}' younameit.P4.qassoc.adjusted > P4.qassoc.adjusted
Identify Significant SNPs After a Multiple-Test Correction of a False Discovery Rate (FDR)
Significant SNPs after a multiple-test correction of FDR (using R)
    traits=c("r_met", "s_met", "r_eddp", "s_eddp")
    traits=as.matrix(traits)
    for (i in 1:4){
      qassoc.adjusted=read.table(paste0("P", i, ".qassoc.adjusted"), header=T)
      sigidx=which(qassoc.adjusted$FDR_BH<0.05) #index of significant SNPs
      sigSNP=qassoc.adjusted[sigidx,]
      write.table(sigSNP, paste0("significant_SNPs_for_", traits[i],".txt"), row.names=F, col.names=T, quote=F, sep=" ")
    }
The Significant SNPs Identified by Genome-Wide Single Locus Association Analysis
The genome-wide single locus association analysis identified a single SNP, rs17180299 (Chr 9, position 82944202), as significantly associated with the plasma concentration of R-methadone after false discovery rate (FDR) correction for multiple testing (raw p=4.692e-09).
Usage of make.fancy.locus.plot
make.fancy.locus.plot is an R function for highlighting the statistical strength of an association in the context of the association results for surrounding markers, gene annotations, estimated recombination rates and pairwise correlations between the surrounding markers and the putative associated variant, see the reference page http://www.broadinstitute.org/diabetes/scandinavs/figures.html
You have to provide a file that contains the following data for every SNP across the region of interest: position, p-value, a label to indicate whether a SNP is "typed" or "imputed", and the r-squared between that SNP and the putative associated variant.
All SNPs in this file will be plotted with their corresponding P-values (as -log10 values) as a function of chromosomal position. SNPs that are "typed" are plotted as diamonds; "imputed" SNPs are plotted as circles.
Estimated recombination rates are plotted to reflect the local LD structure around the associated SNP and its correlated proxies (bright red indicating highly correlated, faint red indicating weakly correlated).
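The r-squared column is meant to describe how strongly each regional SNP is correlated with the lead SNP. When only unphased genotypes are available, a common approximation is the squared Pearson correlation between allele counts. The sketch below illustrates this on simulated genotypes; the SNP names and the 0/1/2 coding are assumptions for illustration.

```r
set.seed(4)
n <- 300
lead <- rbinom(n, 2, 0.3)                      # lead SNP as allele counts (simulated)
# Simulated neighbours: one mostly identical to the lead SNP, one independent
near <- ifelse(runif(n) < 0.9, lead, rbinom(n, 2, 0.3))
far  <- rbinom(n, 2, 0.3)
geno <- cbind(lead = lead, near = near, far = far)

# Squared correlation of each SNP with the lead SNP
rsqr <- as.vector(cor(geno, geno[, "lead"], use = "pairwise.complete.obs"))^2
names(rsqr) <- colnames(geno)
print(round(rsqr, 3))
```

The lead SNP has an r-squared of 1 with itself, and SNPs in strong LD with it receive values close to 1, which the regional plot turns into brighter colors.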
Significant SNPs in a Regional Association Plot
Obtain regional SNPs (using R)
    qassoc=read.table("younameit.P1.qassoc", header=T, stringsAsFactors=F)
    # SNPs on chromosome 9 within the window around rs17180299, excluding unnamed SNPs
    idx=which(qassoc$CHR==9 & qassoc$BP>=82500000 & qassoc$BP<=83300000 & qassoc$SNP!="---")
    TYPE=rep("typed", length(idx))
    # RSQR is filled from the R2 column of the .qassoc output
    region=data.frame(qassoc$SNP[idx], qassoc$BP[idx], qassoc$P[idx], TYPE, qassoc$R2[idx])
    write.table(region, "regional_SNPs.txt", row.names=F, col.names=c("SNP", "POS", "PVAL", "TYPE", "RSQR"), quote=F, sep=" ")
The source code "regional_association_plot.r", the estimated recombination rate from HapMap and the gene annotations from the UCSC genome browser (using Build 35 coordinates) should be available in the same folder.
Significant SNPs in a Regional Association Plot
Create regional association plot (using R)
    source("regional_association_plot.r")
    locus=read.table("regional_SNPs.txt", header=T, row.names=1)
    pdf("assocplot_rs17180299.pdf", width=8, height=6)
    make.fancy.locus.plot("rs17180299", "rs17180299", "9", locus, 9, 4.69e-9)
    dev.off()
Regional Association Plot of rs17180299
Distribution of Plasma Concentration of R-Methadone for the Genotypes of rs17180299
Phenotype data: r_met.txt
Genotype data for significant SNPs: rs17180299.txt
Create box plot (using R)
    pheno=read.table("r_met.txt")
    geno=read.table("rs17180299.txt", sep="\t")
    pheno=as.matrix(pheno)
    geno=as.matrix(geno)
    table(geno) #number of individuals having AA, AG and GG
    boxplot(pheno~geno, xlab="Genotype", ylab="R-methadone")
Distribution of Plasma Concentration of R-Methadone for Three Genotypes of rs17180299
Proportion of Variation Explained by Significant SNPs
Starting from the variable(s) or covariate(s) already in a regression model, the next SNP was included if it produced the maximal increment of model R2.
Model R2 is the coefficient of determination of a full regression model that contains one or more SNPs.
In addition, the marginal R2 was calculated for each SNP from the regression model that contained only that SNP.
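The selection rule described above can be sketched as a small forward-selection loop. The example below runs on simulated genotypes and a simulated trait; the SNP names (`snpA`, `snpB`, `snpC`) and effect sizes are hypothetical, and the loop is one plausible reading of the procedure rather than the exact code used in the study.

```r
set.seed(5)
n <- 200
snps <- matrix(rbinom(n * 3, 2, 0.3), n, 3,
               dimnames = list(NULL, c("snpA", "snpB", "snpC")))
y <- 0.6 * snps[, "snpA"] + 0.3 * snps[, "snpB"] + rnorm(n)

# Marginal R2: regression model containing only one SNP
marginal <- apply(snps, 2, function(g) summary(lm(y ~ g))$r.squared)

# Forward selection: at each step add the SNP giving the largest model R2
selected <- character(0)
model_r2 <- numeric(0)
remaining <- colnames(snps)
while (length(remaining) > 0) {
  r2 <- sapply(remaining, function(s) {
    summary(lm(y ~ snps[, c(selected, s), drop = FALSE]))$r.squared
  })
  best <- names(which.max(r2))
  selected <- c(selected, best)
  model_r2 <- c(model_r2, max(r2))
  remaining <- setdiff(remaining, best)
}
names(model_r2) <- selected

print(round(marginal, 4))
print(round(model_r2, 4))
```

Model R2 can only grow as SNPs are added, so the increment at each step, not the absolute value, shows how much new variation a SNP explains. With a single significant SNP, as in this study, model R2 and marginal R2 coincide.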
Analysis of the Proportion of Variation Explained by Significant SNPs
    Significant SNPs: rs17180299
    Genotype data of significant SNPs: rs17180299.txt
    Phenotype data: r_met.txt
Analysis of the Proportion of Variation Explained by Significant SNPs
Calculate marginal R2 (using R)
    pheno=read.table("r_met.txt")
    geno=read.table("rs17180299.txt")
    pheno=as.matrix(pheno)
    geno=as.matrix(geno)
    fit=lm(pheno~geno)
    summary(fit)$r.squared
    [1] 0.09559355
Genome-Wide Case/Control Association Test
PLINK Input Files
We convert the four continuous traits to binary traits based on the sign of the value:
    Values greater than 0 are coded as 1
    Values less than 0 are coded as 2
Phenotype data: binary_phenotype.txt
Genotype data: genotype.ped, genotype.map
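The sign-based recoding can be written in a couple of lines of R. The sketch below uses made-up values. The source does not say how values exactly equal to 0 or missing values are handled; coding them as -9 (PLINK's default missing phenotype code) is an assumption made here.

```r
resid_pheno <- c(0.82, -0.31, 0.05, -1.40, NA)   # illustrative values only

# 1 for positive values, 2 for negative values.
# Assumption: zero or missing values are set to -9, PLINK's default missing code.
binary <- ifelse(is.na(resid_pheno) | resid_pheno == 0, -9,
                 ifelse(resid_pheno > 0, 1, 2))
print(binary)
```

In PLINK's case/control coding, 1 means unaffected and 2 means affected, so under this scheme the individuals with negative transformed values play the role of cases.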
Genome-Wide Case/Control Association Test
Run PLINK:
    plink --noweb --file genotype --assoc --adjust --pheno binary_phenotype.txt --all-pheno --out younameit
This will generate the files younameit.P1.assoc, younameit.P2.assoc, younameit.P3.assoc, younameit.P4.assoc with fields as follows:
    CHR     Chromosome
    SNP     SNP ID
    BP      Physical position (base-pair)
    A1      Minor allele name (based on whole sample)
    F_A     Frequency of this allele in cases
    F_U     Frequency of this allele in controls
    A2      Major allele name
    CHISQ   Basic allelic test chi-square (1df)
    P       Asymptotic p-value for this test
    OR      Estimated odds ratio (for A1, i.e. A2 is reference)
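The allelic test compares allele counts, not genotype counts, between cases and controls in a 2x2 table. The sketch below computes the corresponding frequencies, chi-square statistic and odds ratio for one SNP from hypothetical counts; it illustrates the meaning of the fields and is not guaranteed to match PLINK's output digit for digit.

```r
# Hypothetical allele counts (each individual contributes two alleles)
counts <- matrix(c(60, 140,    # cases:    A1, A2
                   40, 160),   # controls: A1, A2
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("case", "control"), c("A1", "A2")))

F_A <- counts["case", "A1"] / sum(counts["case", ])        # A1 frequency in cases
F_U <- counts["control", "A1"] / sum(counts["control", ])  # A1 frequency in controls

# 1-df allelic chi-square test without continuity correction
test <- chisq.test(counts, correct = FALSE)

# Odds ratio for A1, with A2 as the reference allele
OR <- (counts["case", "A1"] * counts["control", "A2"]) /
      (counts["case", "A2"] * counts["control", "A1"])

print(c(F_A = F_A, F_U = F_U, CHISQ = unname(test$statistic),
        P = test$p.value, OR = OR))
```

Dichotomizing a continuous trait discards information about the size of each value, which is one plausible reason the case/control analysis has less power than the quantitative analysis.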
Genome-Wide Case/Control Association Test
--adjust generates the files younameit.P1.assoc.adjusted, younameit.P2.assoc.adjusted, younameit.P3.assoc.adjusted, younameit.P4.assoc.adjusted, which contain the following fields:
    CHR        Chromosome number
    SNP        SNP identifier
    UNADJ      Unadjusted p-value
    GC         Genomic-control corrected p-values
    BONF       Bonferroni single-step adjusted p-values
    HOLM       Holm (1979) step-down adjusted p-values
    SIDAK_SS   Sidak single-step adjusted p-values
    SIDAK_SD   Sidak step-down adjusted p-values
    FDR_BH     Benjamini & Hochberg (1995) step-up FDR control
    FDR_BY     Benjamini & Yekutieli (2001) step-up FDR control
Manhattan Plot of Genome-Wide Case/Control Association Test for Binary R-Methadone and Binary S-Methadone
Q-Q Plot of Genome-Wide Case/Control Association Test for Binary R-Methadone and Binary S-Methadone
The Significant SNPs Identified by Genome-Wide Case/Control Association Analysis The genome-wide case/control association analysis did not identify any SNP to be significantly associated with any of the four binary traits after a multiple-test correction of a false discovery rate.
Top 10 SNPs for Binary R-Methadone and Continuous R-Methadone

Binary R-Methadone
CHR | SNP        | P
16  | rs11860324 | 1.14E-06
14  | rs17092000 | 1.18E-06
9   | rs17083111 | 3.75E-06
7   | rs3823990  | 6.77E-06
    | rs10282605 | 7.10E-06
9   | rs17180299 | 9.38E-06
17  | rs16959039 |
    | rs7154542  | 1.06E-05
    | rs13235135 | 1.32E-05
18  | rs74873501 | 1.46E-05

Continuous R-Methadone
CHR | SNP        | P
9   | rs17180299 | 4.69E-09
9   | rs17083111 | 3.96E-07
    | rs62572435 | 5.74E-07
5   | rs26411    | 1.14E-06
21  | rs2827540  | 4.38E-06
12  | rs17116144 | 5.28E-06
    | rs79712882 | 6.49E-06
11  | rs77120929 | 6.86E-06
    | rs77254985 | 7.17E-06
    | rs1972039  | 8.70E-06