GxG and GxE
Top 4 SNPs for r_met
The top 4 SNPs for r_met are:
    chr9 rs17180299
    chr9 rs17083111
    chr9 rs62572435
    chr5 rs26411
We want to test SNP-by-SNP epistasis among these four SNPs using PLINK.
PLINK Input Files
    MAP file: top4SNPs.map
    PED file: top4SNPs.ped
    Phenotype data: r_met.txt
SNP by SNP Interaction (GxG)
For each pair of SNPs, A and B, PLINK fits a model based on allele dosage of the form
    Y ~ β0 + β1.A + β2.B + β3.AB + e
The test for interaction is based on the coefficient β3.
Reference: http://pngu.mgh.harvard.edu/~purcell/plink/epi.shtml
PLINK command:
    plink --noweb --file top4SNPs --epistasis --epi1 1 --pheno r_met.txt --out younameit
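To make the interaction model concrete, here is a minimal sketch that fits Y ~ b0 + b1.A + b2.B + b3.AB by ordinary least squares in plain Python. It is not PLINK's implementation: the helper names (`solve_linear`, `fit_interaction`) and the noise-free synthetic data are assumptions made for illustration only, and nothing here uses the r_met data.

```python
# Sketch of the allele-dosage interaction model Y ~ b0 + b1*A + b2*B + b3*A*B.
# Assumption: synthetic, noise-free data; helper names are hypothetical.

def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_interaction(dosage_a, dosage_b, y):
    """Return [b0, b1, b2, b3] from least squares via the normal equations."""
    design = [[1.0, a, b, a * b] for a, b in zip(dosage_a, dosage_b)]
    k = 4
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic example: every combination of dosages 0/1/2 at both SNPs.
dosage_a = [a for a in (0, 1, 2) for _ in (0, 1, 2)]
dosage_b = [b for _ in (0, 1, 2) for b in (0, 1, 2)]
true_beta = [1.0, 0.5, -0.3, 0.2]
y = [true_beta[0] + true_beta[1] * a + true_beta[2] * b + true_beta[3] * a * b
     for a, b in zip(dosage_a, dosage_b)]

beta = fit_interaction(dosage_a, dosage_b, y)
print("b3 (interaction) =", round(beta[3], 6))
```

The last coefficient is the interaction term; a test of epistasis asks whether it differs from zero.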
The output has the following columns:
    CHR1      Chromosome of first SNP
    SNP1      Identifier for first SNP
    CHR2      Chromosome of second SNP
    SNP2      Identifier for second SNP
    OR_INT    Odds ratio for interaction (for a quantitative trait such as r_met, this column is BETA_INT, the regression coefficient for the interaction)
    STAT      Chi-square statistic, 1 df
    P         Asymptotic p-value
Results:
    CHR1  SNP1     CHR2  SNP2        BETA_INT  STAT      P
    5     rs26411  9     rs62572435  -0.02902  0.03577   0.85
    5     rs26411  9     rs17083111  -0.01511  0.01382   0.9064
    5     rs26411  9     rs17180299  -0.01058  0.004594  0.946
                                      0.2549   1.007     0.3157
                                     -0.1009   0.1224    0.7264
                                     -0.05149  0.03492   0.8518
(SNP labels for the last three rows, the chr9-chr9 pairs, are not available.)
The output can be restricted with
    plink --noweb --file top4SNPs --epistasis --epi1 0.0001 --out younameit
which records only pairs with p <= 0.0001. This prevents too much output from being generated.
Covariate File
PLINK provides the ability to test for a difference in association with a quantitative trait between two environments (or, more generally, two groups).
Covariate file: gender.txt
    Column 1: family ID
    Column 2: sample ID
    Column 3: gender (male: 1; female: 2)
Quantitative Trait Interaction (GxE)
PLINK command:
    plink --noweb --file top4SNPs --gxe --covar gender.txt --pheno r_met.txt --out younameit
The output has the following columns:
    CHR       Chromosome number
    SNP       SNP identifier
    NMISS1    Number of non-missing genotypes in first group (1)
    BETA1     Regression coefficient in first group
    SE1       Standard error of coefficient in first group
    NMISS2    As above, second group
    BETA2     As above, second group
    SE2       As above, second group
    Z_GXE     Z score, test for interaction
    P_GXE     Asymptotic p-value for this test
Results:
    CHR  SNP         NMISS1  BETA1    SE1      NMISS2  BETA2    SE2     Z_GXE    P_GXE
    5    rs26411     280     -0.3813  0.08359  63      -0.3518  0.171   -0.1554  0.8765
    9    rs62572435  281      0.5774  0.1412            0.8819  0.2821  -0.9654  0.3344
    9    rs17083111  278      0.5029  0.1111            0.6459  0.2565  -0.5115  0.609
    9    rs17180299           0.7273  0.1418                            -0.4898  0.6243
(Blank cells are values that are not available; the column placement of the two coefficients in the last row is a best guess.)
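As a sanity check on the Z_GXE and P_GXE columns, the sketch below recomputes them for the one complete row (rs26411). It assumes the statistic is the usual difference of the two regression coefficients divided by the square root of the summed squared standard errors, with a two-sided normal p-value; the function name `gxe_z_test` is hypothetical and this is not PLINK's own code.

```python
from math import sqrt
from statistics import NormalDist


def gxe_z_test(beta1, se1, beta2, se2):
    """Z score and two-sided p-value for a difference between two coefficients.

    Assumption: Z = (beta1 - beta2) / sqrt(se1^2 + se2^2), compared with N(0, 1).
    """
    z = (beta1 - beta2) / sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# Values from the rs26411 row: BETA1, SE1, BETA2, SE2.
z, p = gxe_z_test(-0.3813, 0.08359, -0.3518, 0.171)
print(f"Z_GXE = {z:.4f}, P_GXE = {p:.4f}")
```

The result should agree with the tabulated -0.1554 and 0.8765 up to rounding of the printed inputs.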
Population Stratification Correction Using EIGENSTRAT
EIGENSTRAT
The EIGENSTRAT method uses principal components analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation. The resulting correction is specific to a candidate marker’s variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations.
The EIGENSOFT package has a built-in plotting script and supports multiple file formats and quantitative phenotypes. The package is based on ideas from Price et al. 2006.
See https://github.com/DReichLab/EIG.
EIGENSTRAT Input Files (PED Format)
    genotype file: the same as the PLINK PED file (file name MUST end in .ped)
    snp file: the same as the PLINK MAP file (file name MUST end in .pedsnp)
    indiv file: the first six columns of the PLINK PED file (file name MUST end in .pedind)
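Since the indiv file is just the first six columns of the PED file, it can be derived mechanically. The following is a minimal sketch under that assumption; the function name `ped_to_pedind` and the two example PED lines are hypothetical and only illustrate the column slicing.

```python
def ped_to_pedind(ped_lines):
    """Keep the first six whitespace-separated columns of each PED line.

    The six columns are family ID, individual ID, paternal ID, maternal ID,
    sex, and phenotype; everything after them is genotype data.
    """
    rows = []
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 6:
            raise ValueError(f"PED line has fewer than six columns: {line!r}")
        rows.append(" ".join(fields[:6]))
    return rows


# Hypothetical two-individual PED content (two SNPs each).
example_ped = [
    "FAM1 IND1 0 0 1 2.31 A A G T",
    "FAM2 IND2 0 0 2 1.87 A C G G",
]
pedind = ped_to_pedind(example_ped)
print("\n".join(pedind))
```

In practice the same function could be applied to the lines of genotype.ped and the result written to genotype.pedind.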
Run PCA on Input Genotype Data
We call smartpca.pl to run PCA on the input genotype data. Options:
    -i example.ped       genotype file
    -a example.pedsnp    snp file
    -b example.pedind    indiv file
    -k k                 number of principal components to output (default is 10)
    -o example.pca       output file of principal components
    -p example.plot      prefix of output plot files of the top 2 principal components (individuals are labeled according to the labels in the indiv file)
    -e example.eval      output file of all eigenvalues
    -l example.log       output logfile
Command:
    smartpca.pl -i genotype.ped -a genotype.pedsnp -b genotype.pedind -k 10 -o genotype.pca -p genotype.plot -e genotype.eval -l genotype.log
Main outputs:
    genotype.pca
    genotype.plot.pdf
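Test the Significance of PCs
    Phenotype data: r_met.txt
    PC data: pc.txt
R code:
    y = read.table("r_met.txt")   # phenotype values
    pc = read.table("pc.txt")     # principal components, one column per PC
    y = as.matrix(y)
    pc = as.matrix(pc)
    fit = lm(y ~ pc)              # regress the phenotype on all PCs
    summary(fit)                  # per-PC coefficients and p-values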
Genotype Imputation Using IMPUTE2
IMPUTE2
IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009:
    B. N. Howie, P. Donnelly, and J. Marchini (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529.
See https://mathgen.stats.ox.ac.uk/impute/impute_v2.html.
IMPUTE2 Input Files
Genotype file (specified with -g)
Suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are
    SNP 1: AA AA
    SNP 2: GG GT
    SNP 3: CC CT
    SNP 4: CT CT
    SNP 5: AG GG
The correct genotype file would be
    SNP1 rs1 1000 A C 1 0 0 1 0 0
    SNP2 rs2 2000 G T 1 0 0 0 1 0
    SNP3 rs3 3000 C T 1 0 0 0 1 0
    SNP4 rs4 4000 C T 0 1 0 0 1 0
    SNP5 rs5 5000 A G 0 1 0 0 0 1
The first five columns are the SNP ID, rsID, base pair position, and the two alleles (A and B). Each individual then contributes three columns: the probabilities of genotypes AA, AB, and BB.
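The genotype-file encoding can be sketched in a few lines. This assumes hard-called genotypes, so each individual gets a probability of 1 for exactly one of the three genotype classes (AA, AB, BB in terms of the two listed alleles); the function names `genotype_to_probs` and `gen_line` are hypothetical.

```python
def genotype_to_probs(genotype, allele_a, allele_b):
    """Map a two-letter hard call to [P(AA), P(AB), P(BB)]."""
    if len(genotype) != 2 or any(g not in (allele_a, allele_b) for g in genotype):
        raise ValueError(f"genotype {genotype!r} does not match alleles {allele_a}/{allele_b}")
    # The number of copies of the second allele selects the genotype class.
    count_b = sum(1 for g in genotype if g == allele_b)
    probs = [0, 0, 0]
    probs[count_b] = 1
    return probs


def gen_line(snp_id, rs_id, position, allele_a, allele_b, genotypes):
    """Build one genotype-file line: five header columns, then three per individual."""
    fields = [snp_id, rs_id, str(position), allele_a, allele_b]
    for genotype in genotypes:
        fields.extend(str(p) for p in genotype_to_probs(genotype, allele_a, allele_b))
    return " ".join(fields)


# SNP 5 from the example: alleles A/G, genotypes AG and GG.
line = gen_line("SNP5", "rs5", 5000, "A", "G", ["AG", "GG"])
print(line)
```

Applied to the other four SNPs, the same function reproduces the remaining lines of the example file.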
Map file (specified with -m)
This file should have three columns: physical position (in base pairs), recombination rate between the current position and the next position in the map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). All IMPUTE2 reference panel download packages come with appropriate recombination map files.
File of known haplotypes (specified with -h)
The file contains known haplotypes, with one row per SNP and one column per haplotype. All alleles must be coded as 0 or 1, and each -h file must be provided with a corresponding legend file. IMPUTE2 provides formatted haplotypes from the HapMap Project and the 1,000 Genomes Project in the reference panel download packages.
Legend files (specified with -l)
Legend file(s) contain information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). IMPUTE2 provides legend files for data from the HapMap Project and the 1,000 Genomes Project in its reference panel download packages. When using two -h files with IMPUTE2, you must supply the corresponding legend files in the same order, i.e., the file with more SNPs comes first.
Basic Commands
Genomic interval to use for inference
-int <lower> <upper> specifies the genomic interval to use for inference, with the <lower> and <upper> boundaries given in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the IMPUTE2 documentation section on analyzing whole chromosomes.
Effective size of the population
-Ne specifies the "effective size" of the population from which your dataset was sampled. The IMPUTE2 documentation suggests setting -Ne to 20000 in the majority of modern imputation analyses.
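Splitting a region into chunks for -int is simple bookkeeping, sketched below. The function name `chunk_intervals`, the 5 Mb chunk size, and the region boundaries are assumptions chosen for illustration; pick values appropriate to your own analysis.

```python
def chunk_intervals(start, end, chunk_size):
    """Yield consecutive, non-overlapping (lower, upper) base pair intervals.

    The intervals cover start..end inclusive; the last one may be shorter.
    """
    if chunk_size <= 0 or end < start:
        raise ValueError("need chunk_size > 0 and end >= start")
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size - 1, end)
        yield lower, upper
        lower = upper + 1


# Hypothetical region split into 5 Mb chunks, printed as -int arguments.
for lower, upper in chunk_intervals(20_000_001, 35_000_000, 5_000_000):
    print(f"-int {lower} {upper}")
```

Each printed pair can be passed to a separate impute2 run, and the per-chunk outputs concatenated afterwards.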
Strand Alignment Options
-strand_g specifies a file showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file. The columns should be separated by a single space.
Output Files
The main output file follows the same format as the -g file. Use -o to specify the name of the main output file.
Example
This is the most common genotype imputation scenario: we want to impute untyped SNPs in a study dataset from a panel of reference haplotypes. The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
    ./impute2 \
     -m ./Example/example.chr22.map \
     -h ./Example/example.chr22.1kG.haps \
     -l ./Example/example.chr22.1kG.legend \
     -g ./Example/example.chr22.study.gens \
     -strand_g ./Example/example.chr22.study.strand \
     -int 20.4e6 20.5e6 \
     -Ne 20000 \
     -o ./Example/example.chr22.one.phased.impute2
Sample Size and Power Calculation
Power analysis is an important aspect of experimental design. It allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence. Conversely, it allows us to determine the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints. If the probability is unacceptably low, we would be wise to alter or abandon the experiment.
The following four quantities are closely related:
    sample size
    effect size
    significance level = P(Type I error) = probability of finding an effect that is not there
    power = 1 - P(Type II error) = probability of finding an effect that is there
Given any three, we can determine the fourth.
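The "any three determine the fourth" relationship can be illustrated with a two-sided, two-sample comparison of means under a normal approximation. This is a sketch, not a replacement for a power package: it assumes equal group sizes, a standardized effect size d (difference in means divided by the common standard deviation), and a z-test rather than a t-test, so the sample sizes it returns are slightly optimistic. The function names and the example inputs (d = 0.5, alpha = 0.05, power = 0.80) are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sample(d, n_per_group, alpha):
    """Power of a two-sided two-sample z-test for standardized effect size d."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability of rejecting in either tail when the true effect is d.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_for_power(d, alpha, power):
    """Per-group sample size from the usual closed-form approximation."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


# Solve for sample size, then confirm the power it achieves.
n = n_per_group_for_power(d=0.5, alpha=0.05, power=0.80)
achieved = power_two_sample(d=0.5, n_per_group=n, alpha=0.05)
print(f"n per group = {n}, achieved power = {achieved:.3f}")
```

Fixing effect size, significance level, and power gives the sample size; fixing effect size, sample size, and significance level gives the power.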
Power Analysis in R
The pwr package in R implements power analysis. For each function, you enter three of the four quantities (effect size, sample size, significance level, power) and the fourth is calculated.
See reference page: http://www.statmethods.net/stats/power.html.
Example: using a two-tailed test of two proportions, and assuming a significance level of 0.01 and a common sample size of 30 for each proportion, what effect size can be detected with a power of 0.75?
    library(pwr)
    pwr.2p.test(n=30, sig.level=0.01, power=0.75)
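For readers who want to see what such a calculation involves, the sketch below solves the same question numerically in Python. It assumes the standard normal approximation for comparing two proportions with Cohen's effect size h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)) and equal group sizes; it is an independent illustration, not the pwr source code, and the function names are hypothetical.

```python
from math import asin, pi, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cohen_h(p1, p2):
    """Cohen's effect size h for two proportions (arcsine transformation)."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def power_two_proportions(h, n, alpha):
    """Two-sided power for effect size h with n observations per group."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = h * sqrt(n / 2)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def detectable_h(n, alpha, power, tol=1e-10):
    """Smallest positive h reaching the requested power, found by bisection."""
    low, high = 0.0, pi  # h cannot exceed pi; power increases with h on this range
    while high - low > tol:
        mid = (low + high) / 2
        if power_two_proportions(mid, n, alpha) < power:
            low = mid
        else:
            high = mid
    return (low + high) / 2


h = detectable_h(n=30, alpha=0.01, power=0.75)
print(f"detectable effect size h = {h:.4f}")
```

An h of this size is large by Cohen's conventions (0.2 small, 0.5 medium, 0.8 large), which reflects how little 30 observations per group can detect at a 0.01 significance level.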
Sample Size Calculation Using Quanto
Download page: http://biostats.usc.edu/Quanto.html.
Suppose that, in a matched case-control study, DNA samples have been collected to determine the effect of each SNP on the risk of cardiovascular disease. We are interested in calculating the sample size needed to detect an effect size (odds ratio) in the range of 1.5-2.0 with at least 80 percent power under a dominant model. The minor allele frequency is taken to be 10 percent, and the type 1 error level is 0.05.
Under the Parameters option:
    i. Select Outcome/Design > Disease > Case-control (Matched).
    ii. Select Hypothesis > Gene only.
    iii. Click Gene G and type 0.1 in the allele frequency box. Select the dominant inheritance mode. Click OK.
    iv. Under Outcome model, specify the baseline disease risk, which is the disease risk in unexposed, genetically normal subjects. For this study, let’s use a baseline disease risk of 0.1. In the Genetic effect box, specify the effect size. In this case, consider 1.3 to 3.0 in steps of 0.5.
    v. In the Power window, specify power as 0.8 and click OK to calculate sample size. Type 0.05 in the type 1 error rate box. Click OK.
    vi. Click the Calculate button.
The following output will be displayed.
    RG      Gene   kP
    1.3000  10611  0.100522
    1.8000  1897   0.101327
    2.3000  880    0.102060
    2.8000  548    0.102732
The column “Gene” gives the number of case-control pairs needed. P0 is the baseline disease risk specified and kP is the overall disease risk in the general population (calculated by the software). For a range of odds ratios (RG), Quanto provides the number of case-control pairs required for the desired power.
Power software
piface.jar by Lenth (2006)
Link: http://homepage.stat.uiowa.edu/~rlenth/Power/
    Select the two-sample t test.
    sigma1 and sigma2: standard deviation for each group.
    Set the true difference of means.
    Solve for power by setting the sample size.
Microarray power/sample size estimation
Link: http://bioinformatics.mdanderson.org/MicroarraySampleSize/
    Set the acceptable number of false positives and the fold difference (FC).
    Set the estimated standard deviation of the gene intensity measurements on the base-two logarithmic scale (0.7 recommended).
    Solve for sample size and per-gene alpha.
RnaSeqSampleSize
URL: https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/
    Use sample size estimation based on prior data (for example, TCGA data).
    Use a large repNumber for a more precise estimate (50 may be enough).