GxG and GxE.

Top 4 SNPs for r_met

The top 4 SNPs for r_met are:

  chr9  rs17180299
  chr9  rs17083111
  chr9  rs62572435
  chr5  rs26411

We want to test SNP-by-SNP epistasis among these four SNPs using PLINK.

PLINK Input Files

  MAP file:        top4SNPs.map
  PED file:        top4SNPs.ped
  Phenotype data:  r_met.txt

SNP by SNP Interaction (GxG)

PLINK builds a model based on allele dosage for each pair of SNPs, A and B, and fits a model of the form

  $Y = \beta_0 + \beta_1 A + \beta_2 B + \beta_3 AB + e$

The test for interaction is based on the coefficient $\beta_3$.

See reference: http://pngu.mgh.harvard.edu/~purcell/plink/epi.shtml

PLINK command:

  plink --noweb --file top4SNPs --epistasis --epi1 1 --pheno r_met.txt --out younameit
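To make the interaction model concrete, the following minimal Python sketch fits the four coefficients by ordinary least squares on a small made-up data set. It is not PLINK's implementation: the function names `solve` and `fit_interaction`, the dosages, and the noise-free phenotype are all assumptions chosen so that the fitted interaction coefficient can be checked against a known value.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_interaction(dos_a, dos_b, y):
    """Least-squares fit of y = b0 + b1*A + b2*B + b3*A*B via the normal equations."""
    rows = [[1.0, a, b, a * b] for a, b in zip(dos_a, dos_b)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(4)]
    return solve(xtx, xty)


# Made-up allele dosages (0, 1 or 2 copies) for nine hypothetical samples
dos_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
dos_b = [0, 1, 2, 0, 1, 2, 0, 1, 2]
# Made-up, noise-free phenotype with a known interaction effect of 0.2
y = [1.0 + 0.5 * a - 0.3 * b + 0.2 * a * b for a, b in zip(dos_a, dos_b)]

b0, b1, b2, b3 = fit_interaction(dos_a, dos_b, y)
print("interaction coefficient:", round(b3, 6))
```

With real data the phenotype is noisy, and a standard error for the interaction coefficient would also be needed to form the test statistic; this sketch only recovers the coefficients.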

SNP by SNP Interaction (GxG)

The output has the following columns:

  CHR1     Chromosome of first SNP
  SNP1     Identifier for first SNP
  CHR2     Chromosome of second SNP
  SNP2     Identifier for second SNP
  OR_INT   Odds ratio for interaction (for a quantitative trait such as r_met, the column is BETA_INT, the regression coefficient for the interaction)
  STAT     Chi-square statistic, 1 df
  P        Asymptotic p-value

SNP by SNP Interaction (GxG)

Results:

  CHR1  SNP1     CHR2  SNP2        BETA_INT  STAT      P
  5     rs26411  9     rs62572435  -0.02902  0.03577   0.85
  -     -        -     rs17083111  -0.01511  0.01382   0.9064
  -     -        -     rs17180299  -0.01058  0.004594  0.946
  -     -        -     -            0.2549   1.007     0.3157
  -     -        -     -           -0.1009   0.1224    0.7264
  -     -        -     -           -0.05149  0.03492   0.8518

(A dash marks a cell that is empty in the slide text.)

SNP by SNP Interaction (GxG)

The amount of output can be controlled with --epi1:

  plink --noweb --file top4SNPs --epistasis --epi1 0.0001 --out younameit

This records only results with p <= 0.0001, which prevents too much output from being generated.

Covariate File

PLINK can test whether the association with a quantitative trait differs between two environments (or, more generally, two groups).

Covariate file: gender.txt

  Col 1  family ID
  Col 2  sample ID
  Col 3  gender (male: 1; female: 2)

Quantitative Trait Interaction (GxE)

PLINK command:

  plink --noweb --file top4SNPs --gxe --covar gender.txt --pheno r_met.txt --out younameit

The output has the following columns:

  CHR      Chromosome number
  SNP      SNP identifier
  NMISS1   Number of non-missing genotypes in first group (1)
  BETA1    Regression coefficient in first group
  SE1      Standard error of coefficient in first group
  NMISS2   As above, second group
  BETA2    As above, second group
  SE2      As above, second group
  Z_GXE    Z score, test for interaction
  P_GXE    Asymptotic p-value for this test

Quantitative Trait Interaction (GxE)

Results:

  CHR  SNP         NMISS1  BETA1    SE1      NMISS2  BETA2    SE2     Z_GXE    P_GXE
  5    rs26411     280     -0.3813  0.08359  63      -0.3518  0.171   -0.1554  0.8765
  9    rs62572435  281      0.5774  0.1412   -        0.8819  0.2821  -0.9654  0.3344
  -    rs17083111  278      0.5029  0.1111   -        0.6459  0.2565  -0.5115  0.609
  -    rs17180299  -        0.7273  0.1418   -        -       -       -0.4898  0.6243

(A dash marks a cell that is empty in the slide text.)
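The Z_GXE column can be checked by hand. The sketch below assumes the test compares the two group-specific regression coefficients as Z = (BETA1 - BETA2) / sqrt(SE1^2 + SE2^2) with a two-sided normal p-value, which reproduces the rs26411 row above up to rounding of the printed inputs. The function name `gxe_z` is hypothetical.

```python
import math


def gxe_z(beta1, se1, beta2, se2):
    """Z score and two-sided normal p-value for the difference of two slopes."""
    z = (beta1 - beta2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Values taken from the rs26411 row of the results: BETA1, SE1, BETA2, SE2
z, p = gxe_z(-0.3813, 0.08359, -0.3518, 0.171)
print("Z_GXE ~", round(z, 3), " P_GXE ~", round(p, 3))
```

The small remaining difference from the reported -0.1554 and 0.8765 is consistent with the coefficients having been rounded before printing.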

Population Stratification Correction Using EIGENSTRAT

EIGENSTRAT

The EIGENSTRAT method uses principal components analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation. The resulting correction is specific to a candidate marker’s variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations.

The EIGENSOFT package has a built-in plotting script and supports multiple file formats and quantitative phenotypes. The package is based on ideas from Price et al. 2006.

See https://github.com/DReichLab/EIG.

EIGENSTRAT Input Files (PED Format)

  genotype file:  the same as the PLINK PED file                 *** file name MUST end in .ped ***
  snp file:       the same as the PLINK MAP file                 *** file name MUST end in .pedsnp ***
  indiv file:     the first six columns of the PLINK PED file    *** file name MUST end in .pedind ***
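Since the indiv file is just the first six columns of the PED file, it can be derived mechanically. The following is a minimal sketch; the function name `ped_to_pedind` and the two PED records are made up for illustration, and real PED files would be read from disk line by line in the same way.

```python
def ped_to_pedind(ped_lines):
    """Keep the first six whitespace-separated fields of each PED line."""
    out = []
    for line in ped_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        out.append(" ".join(fields[:6]))
    return out


# Two made-up PED records: family ID, sample ID, father, mother, sex,
# phenotype, followed by genotype columns
ped = [
    "FAM1 S1 0 0 1 2 A A G T",
    "FAM2 S2 0 0 2 1 A C G G",
]
for row in ped_to_pedind(ped):
    print(row)
```

The result would be written to a file whose name ends in .pedind, as required above.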

Run PCA on Input Genotype Data

We call smartpca.pl to run PCA on the input genotype data. Options:

  -i example.ped      genotype file
  -a example.pedsnp   snp file
  -b example.pedind   indiv file
  -k k                number of principal components to output (default is 10)
  -o example.pca      output file of principal components
  -p example.plot     prefix of output plot files of the top 2 principal components (individuals are labeled according to the labels in the indiv file)
  -e example.eval     output file of all eigenvalues
  -l example.log      output logfile

Run PCA on Input Genotype Data

Command:

  smartpca.pl -i genotype.ped -a genotype.pedsnp -b genotype.pedind -k 10 -o genotype.pca -p genotype.plot -e genotype.eval -l genotype.log

Main outputs:

  genotype.pca
  genotype.plot.pdf

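Test the Significance of PCs

  Phenotype data:  r_met.txt
  PC data:         pc.txt

R code:

  # Read the phenotype and the principal components as matrices
  y  <- as.matrix(read.table("r_met.txt"))
  pc <- as.matrix(read.table("pc.txt"))
  # Regress the phenotype on all PCs; the summary reports a t-test per PC
  fit <- lm(y ~ pc)
  summary(fit)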

Genotype Imputation Using IMPUTE2

IMPUTE2

IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009:

  B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529

See https://mathgen.stats.ox.ac.uk/impute/impute_v2.html.

IMPUTE2 Input Files

Genotype file (specified in -g)

Suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are

  SNP 1 : AA AA
  SNP 2 : GG GT
  SNP 3 : CC CT
  SNP 4 : CT CT
  SNP 5 : AG GG

The correct genotype file would be

  SNP1 rs1 1000 A C 1 0 0 1 0 0
  SNP2 rs2 2000 G T 1 0 0 0 1 0
  SNP3 rs3 3000 C T 1 0 0 0 1 0
  SNP4 rs4 4000 C T 0 1 0 0 1 0
  SNP5 rs5 5000 A G 0 1 0 0 0 1
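The encoding in the example above can be reproduced programmatically. In the sketch below, each line carries a SNP ID, an rsID, a position and the two alleles, followed by three values per individual for the genotypes AA, AB and BB (A being the first listed allele). The function names `encode_genotype` and `gen_line` are hypothetical, and hard 0/1 calls are assumed rather than genotype probabilities.

```python
def encode_genotype(genotype, allele_a, allele_b):
    """Return the (AA, AB, BB) indicator triple for a two-letter genotype."""
    n_a = sum(1 for allele in genotype if allele == allele_a)
    n_b = sum(1 for allele in genotype if allele == allele_b)
    if len(genotype) != 2 or n_a + n_b != 2:
        raise ValueError(f"genotype {genotype!r} does not match alleles {allele_a}/{allele_b}")
    triple = [0, 0, 0]
    triple[n_b] = 1  # index 0 = AA, 1 = AB, 2 = BB
    return triple


def gen_line(snp_id, rs_id, position, allele_a, allele_b, genotypes):
    """Build one genotype-file line for all individuals at one SNP."""
    fields = [snp_id, rs_id, str(position), allele_a, allele_b]
    for genotype in genotypes:
        fields.extend(str(v) for v in encode_genotype(genotype, allele_a, allele_b))
    return " ".join(fields)


# The five SNPs and two individuals from the example above
snps = [
    ("SNP1", "rs1", 1000, "A", "C", ["AA", "AA"]),
    ("SNP2", "rs2", 2000, "G", "T", ["GG", "GT"]),
    ("SNP3", "rs3", 3000, "C", "T", ["CC", "CT"]),
    ("SNP4", "rs4", 4000, "C", "T", ["CT", "CT"]),
    ("SNP5", "rs5", 5000, "A", "G", ["AG", "GG"]),
]
lines = [gen_line(*snp) for snp in snps]
for line in lines:
    print(line)
```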

IMPUTE2 Input Files

Map file (specified in -m)

This file should have three columns: physical position (in base pairs), recombination rate between the current position and the next position in the map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). All IMPUTE2 reference panel download packages come with appropriate recombination map files.

IMPUTE2 Input Files

File of known haplotypes (specified in -h)

The file contains known haplotypes, with one row per SNP and one column per haplotype. All alleles must be coded as 0 or 1, and each -h file must be provided with a corresponding legend file. IMPUTE2 provides formatted haplotypes from the HapMap Project and the 1,000 Genomes Project in the reference panel download packages.

IMPUTE2 Input Files

Legend files (specified in -l)

Legend file(s) hold information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). IMPUTE2 provides legend files for data from the HapMap Project and the 1,000 Genomes Project in its reference panel download packages. When using two -h files with IMPUTE2, you must supply the corresponding legend files in the same order, i.e., the file with more SNPs comes first.
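To illustrate how a legend file gives meaning to the 0/1 coding of a haplotype file, here is a small sketch that decodes haplotype rows back into alleles. The function names and the two-SNP, four-haplotype data are made up for illustration; only the four-column legend layout and the 0/1 coding come from the description above.

```python
def parse_legend(lines):
    """Parse legend lines (header first) into (rsID, position, a0, a1) tuples."""
    records = [line.split() for line in lines if line.strip()]
    return [(rs, int(pos), a0, a1) for rs, pos, a0, a1 in records[1:]]


def decode_haplotypes(hap_row, a0, a1):
    """Translate one row of 0/1 codes into alleles using the legend entry."""
    alleles = {"0": a0, "1": a1}
    return [alleles[code] for code in hap_row.split()]


# Made-up legend and haplotype rows (one row per SNP, one column per haplotype)
legend_lines = [
    "rsID position a0 a1",
    "rs1 1000 A C",
    "rs2 2000 G T",
]
hap_rows = [
    "0 1 1 0",
    "0 0 1 0",
]

decoded = [
    decode_haplotypes(row, a0, a1)
    for row, (_, _, a0, a1) in zip(hap_rows, parse_legend(legend_lines))
]
for (rs_id, _, _, _), alleles in zip(parse_legend(legend_lines), decoded):
    print(rs_id, " ".join(alleles))
```

Row i of the haplotype file is paired with row i of the legend (after its header), which is why the two files must describe the same SNPs in the same order.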

Basic Commands

Genomic interval to use for inference

  -int <lower> <upper> specifies the genomic interval to use for inference, given by the <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or for splitting whole-chromosome analyses into manageable chunks, as discussed in the IMPUTE2 documentation on analyzing whole chromosomes.

Effective size of the population

  -Ne specifies the "effective size" of the population from which your dataset was sampled. IMPUTE2 suggests setting -Ne to 20000 in the majority of modern imputation analyses.
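Splitting a region into chunks for -int amounts to generating consecutive, non-overlapping intervals. The sketch below shows one way to do this; the function name `chunk_intervals`, the 5 Mb chunk size and the region boundaries are assumptions for illustration, not values taken from the slides.

```python
def chunk_intervals(start, end, chunk_size):
    """Split [start, end] into consecutive closed intervals of at most chunk_size bases."""
    intervals = []
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size - 1, end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


# Hypothetical region of 12 Mb split into chunks of at most 5 Mb
chunks = chunk_intervals(20_000_001, 32_000_000, 5_000_000)
for lower, upper in chunks:
    print(f"-int {lower} {upper}")
```

Each printed line is an -int argument that could be passed to a separate IMPUTE2 run, one per chunk.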

Strand Alignment Options

  -strand_g specifies a file showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file. The columns should be separated by a single space.

Output Files

The main output file follows the same format as the -g file. Use -o to specify the name of the main output file.

Example

This is the most common genotype imputation scenario: we want to impute untyped SNPs in a study dataset from a panel of reference haplotypes. The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

  ./impute2 \
    -m ./Example/example.chr22.map \
    -h ./Example/example.chr22.1kG.haps \
    -l ./Example/example.chr22.1kG.legend \
    -g ./Example/example.chr22.study.gens \
    -strand_g ./Example/example.chr22.study.strand \
    -int 20.4e6 20.5e6 \
    -Ne 20000 \
    -o ./Example/example.chr22.one.phased.impute2

Sample Size and Power Calculation

Sample Size and Power Calculation

Power analysis is an important aspect of experimental design. It allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence. Conversely, it allows us to determine the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints. If the probability is unacceptably low, we would be wise to alter or abandon the experiment.

Sample Size and Power Calculation

The following four quantities have an intimate relationship:

  sample size
  effect size
  significance level = P(Type I error) = probability of finding an effect that is not there
  power = 1 - P(Type II error) = probability of finding an effect that is there

Given any three, we can determine the fourth.

Power Analysis in R

The pwr package in R implements power analysis. For each function, you enter three of the four quantities (effect size, sample size, significance level, power) and the fourth is calculated.

See reference page: http://www.statmethods.net/stats/power.html.

Power Analysis in R

Example: using a two-tailed test of two proportions, and assuming a significance level of 0.01 and a common sample size of 30 for each proportion, what effect size can be detected with a power of 0.75?

  library(pwr)
  pwr.2p.test(n=30, sig.level=0.01, power=0.75)
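The same question can be worked through outside R to show how the four quantities determine one another. The sketch below assumes the usual normal approximation for comparing two proportions with effect size Cohen's h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)), and solves for h by bisection. The function names are hypothetical, and the code is an independent illustration rather than the pwr package's implementation.

```python
import math
from statistics import NormalDist

_norm = NormalDist()


def cohen_h(p1, p2):
    """Cohen's effect size h for two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def power_two_prop(h, n, sig_level):
    """Two-sided power for comparing two proportions with n samples per group."""
    z_crit = _norm.inv_cdf(1 - sig_level / 2)
    shift = h * math.sqrt(n / 2)
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)


def detectable_h(n, sig_level, power, lo=1e-6, hi=10.0):
    """Smallest positive h reaching the target power, found by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if power_two_prop(mid, n, sig_level) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Same inputs as the R example: n = 30 per group, alpha = 0.01, power = 0.75
h = detectable_h(n=30, sig_level=0.01, power=0.75)
print("detectable effect size h ~", round(h, 3))
```

An effect size near 0.84 is large on Cohen's scale; for instance, `cohen_h(0.75, 0.35)` is about 0.83, so with 30 samples per group only very different proportions would be detected reliably.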

Sample Size Calculation Using Quanto

Download page: http://biostats.usc.edu/Quanto.html.

Suppose that, in a matched case-control study, DNA samples have been collected to determine the effect of each SNP on the risk of cardiovascular disease. We are interested in calculating the sample size needed to detect an effect size (odds ratio) in the range of 1.5-2.0 with at least 80 percent power under a dominant model. The minor allele frequency is taken to be 10 percent, and the type 1 error level is 0.05.

Sample Size Calculation Using Quanto

Under the Parameters option:

  i.   Select Outcome/Design > Disease > Case-control (Matched).
  ii.  Select Hypothesis > Gene only.
  iii. Click on Gene G and type 0.1 in the allele frequency box. Select the dominant inheritance mode. Click OK.
  iv.  Under Outcome model, specify the baseline disease risk, which is the disease risk in unexposed, genetically normal subjects. For this study, let’s take the baseline disease risk to be 0.1. In the Genetic effect box, specify the effect size. In this case, consider 1.3 to 3.0 in steps of 0.5.
  v.   In the Power window, specify power as 0.8 and click OK to calculate sample size. Type 0.05 in the type 1 error rate box. Click OK.
  vi.  Click the Calculate button.

Sample Size Calculation Using Quanto

The following output will be displayed.

  RG      Gene   kP
  1.3000  10611  0.100522
  1.8000  1897   0.101327
  2.3000  880    0.102060
  2.8000  548    0.102732

The column “Gene” gives the number of case-control pairs needed. P0 is the baseline disease risk specified and kP is the overall disease risk in the general population (calculated by the software). For a range of odds ratios (RG), Quanto provides the number of case-control pairs required for the desired power.

Power software

piface.jar by Lenth (2006)

Link: http://homepage.stat.uiowa.edu/~rlenth/Power/

  Select the two-sample t test
  sigma1 and sigma2: standard deviation for each group
  Set the true difference of means
  Solve for power by setting the sample size

Microarray power/sample size estimation

Link: http://bioinformatics.mdanderson.org/MicroarraySampleSize/

  Set the accepted number of false positives and the fold difference (FC)
  Set the estimated standard deviation of the gene intensity measurements on the base-two logarithmic scale (0.7 recommended)
  Solve for sample size and per-gene alpha

RnaSeqSampleSize

URL: https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/

  Use sample size estimation based on prior data (say, TCGA data)
  Use a large repNumber to get a more precise estimate (50 may be enough)