PLINK tutorial, December 2006; Shaun Purcell, PLINK gPLINK Haploview Whole genome association software tutorial Shaun Purcell.

Slides:



Advertisements
Similar presentations
What is an association study? Define linkage disequilibrium
Advertisements

PLINK: a toolset for whole genome association analysis
BST 775 Lecture PLINK – A Popular Toolset for GWAS
Association Tests for Rare Variants Using Sequence Data
Single Nucleotide Polymorphism And Association Studies Stat 115 Dec 12, 2006.
Presented by Qing Duan Dr. Yun Li group UNC at Chapel Hill
Multiple Comparisons Measures of LD Jess Paulus, ScD January 29, 2013.
Objectives Cover some of the essential concepts for GWAS that have not yet been covered Hardy-Weinberg equilibrium Meta-analysis SNP Imputation Review.
Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion Authors: Lan Liu, Yonghui Wu, Stefano.
Association Mapping David Evans. Outline Definitions / Terminology What is (genetic) association? How do we test for association? When to use association.
MALD Mapping by Admixture Linkage Disequilibrium.
Ingredients for a successful genome-wide association studies: A statistical view Scott Weiss and Christoph Lange Channing Laboratory Pulmonary and Critical.
Lab 13: Association Genetics. Goals Use a Mixed Model to determine genetic associations. Understand the effect of population structure and kinship on.
Using HapMap.Org A Tutorial Lincoln Stein, Cold Spring Harbor Laboratory.
Gene-gene and gene-environment interactions Manuel Ferreira Massachusetts General Hospital Harvard Medical School Center for Human Genetic Research.
More Powerful Genome-wide Association Methods for Case-control Data Robert C. Elston, PhD Case Western Reserve University Cleveland Ohio.
:NEUROPSYCHIATRIC GENETICS [BIOSTATISTICS|BIOINFORMATICS] CORE BIOSTATISTIC/BIOINFORMATIC TOOLS FOR GENETICS DATA: DATA MANAGEMENT AND ANALYSIS RICHARD.
Gene-gene and gene-environment interactions Manuel Ferreira Massachusetts General Hospital Harvard Medical School Center for Human Genetic Research.
SNPs DNA differs between humans by 0.1%, (1 in 1300 bases) This means that you can map DNA variation to around 10,000,000 sites in the genome Almost all.
Linkage Analysis in Merlin
Statistical Power Calculations Boulder, 2007 Manuel AR Ferreira Massachusetts General Hospital Harvard Medical School Boston.
Analysis of genome-wide association studies
Polymorphism and Variant Analysis Lab
Population Stratification
Factors to Consider in Selecting a Genotyping Platform Elizabeth Pugh June 22, 2007.
Polymorphism & Variant Analysis Lab Saurabh Sinha Polymorphism and Variant Analysis Lab v1 | Saurabh Sinha 1 Powerpoint by Casey Hanson.
Figure S1. Quantile-quantile plot in –log10 scale for the individual studies The red line represents concordance of observed and expected values. The shaded.
Gene Hunting: Linkage and Association
Introduction to the Gramene Genetic Diversity module 5/2010 Build #31.
Genomics Collaboration Senior Scientist
Lecture 19: Association Studies II Date: 10/29/02  Finish case-control  TDT  Relative Risk.
Type 1 Error and Power Calculation for Association Analysis Pak Sham & Shaun Purcell Advanced Workshop Boulder, CO, 2005.
Jianfeng Xu, M.D., Dr.PH Professor of Public Health and Cancer Biology Director, Program for Genetic and Molecular Epidemiology of Cancer Associate Director,
Polymorphism Haixu Tang School of Informatics. Genome variations underlie phenotypic differences cause inherited diseases.
Tutorial #10 by Ma’ayan Fishelson. Classical Method of Linkage Analysis The classical method was parametric linkage analysis  the Lod-score method. This.
Future Directions Pak Sham, HKU Boulder Genetics of Complex Traits Quantitative GeneticsGene Mapping Functional Genomics.
Basics of Biostatistics for Health Research Session 1 – February 7 th, 2013 Dr. Scott Patten, Professor of Epidemiology Department of Community Health.
GVS: Genome Variation Server Materials prepared by: Warren C. Lathe, PhD Updated: Q Version 2.
Statistical Issues in Genetic Association Studies
Mechanisms of Evolution  Lesson goals:  1. Define evolution in terms of genetics.  2. Using mathematics show how evolution cannot occur unless there.
Shaun Purcell MGH, Boston
Errors in Genetic Data Gonçalo Abecasis. Errors in Genetic Data Pedigree Errors Genotyping Errors Phenotyping Errors.
PLINK / Haploview Whole genome association software tutorial
Using geWorkbench: Working with Sets of Data Fan Lin, Ph. D. Molecular Analysis Tools Knowledge Center Columbia University and The Broad Institute of MIT.
The International Consortium. The International HapMap Project.
GenABEL: an R package for Genome Wide Association Analysis
Practical With Merlin Gonçalo Abecasis. MERLIN Website Reference FAQ Source.
Genetic background and population stratification Shaun Purcell 1,2 & Pak Sham 1 1 Social, Genetic & Developmental Psychiatry Research Centre, IoP, KCL,
Copyright OpenHelix. No use or reproduction without express written consent1.
Fast test for multiple locus mapping By Yi Wen Nisha Rajagopal.
Resources at HapMap.Org HapMap3 Tutorial Marcela K. Tello-Ruiz Cold Spring Harbor Laboratory.
Lectures 7 – Oct 19, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall.
Multiple-Locus Genome-Wide Association Testing David Dean CSE280A.
Linkage. Announcements Problem set 1 is available for download. Due April 14. class videos are available from a link on the schedule web page, and at.
Linkage. Announcements Problem set 1 is available for download. Due April 14. class videos are available from a link on the schedule web page, and at.
Statistical Analysis of Candidate Gene Association Studies (Categorical Traits) of Biallelic Single Nucleotide Polymorphisms Maani Beigy MD-MPH Student.
Association Mapping in Families Gonçalo Abecasis University of Oxford.
Population stratification
Date of download: 11/12/2016 Copyright © 2016 American Medical Association. All rights reserved. From: Influence of Child Abuse on Adult DepressionModeration.
Recombination (Crossing Over)
GxG and GxE.
Population stratification
Introduction to Data Formats and tools
Power to detect QTL Association
Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A
A Tutorial Lincoln Stein, Cold Spring Harbor Laboratory
Simultaneous Genotype Calling and Haplotype Phasing Improves Genotype Accuracy and Reduces False-Positive Associations for Genome-wide Association Studies 
Population Stratification Practical
Presentation transcript:

PLINK tutorial, December 2006; Shaun Purcell, PLINK gPLINK Haploview Whole genome association software tutorial Shaun Purcell Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA Broad Institute of Harvard & MIT, Cambridge, MA

PLINK tutorial, December 2006; Shaun Purcell,

GUI for many PLINK analyses Data management Summary statistics Population stratification Association analysis IBD-based analysis

PLINK tutorial, December 2006; Shaun Purcell, Computational efficiency Load PED file, generate binary PED file ~68 minutes Load and filter binary PED file ~11 minutes Basic association analysis ~5 minutes 5000 individuals genotyped on 500,000 SNPs Load, filter and analyze~12 seconds 1 permutation (all SNPs)~1.6 seconds 350 individuals genotyped on 100,000 SNPs

PLINK tutorial, December 2006; Shaun Purcell, gPLINK / PLINK in “remote mode” Server, or cluster head node PLINK, WGAS data & computation gPLINK & Haploview: initiating and viewing jobs W W W Secure Shell networking

PLINK tutorial, December 2006; Shaun Purcell, Summary statistics and quality control Whole genome haplotype-based association Assessment of population stratification Further exploration of ‘hits’ Visualization and follow-up using Haploview Whole genome SNP-based association A simulated WGAS dataset

PLINK tutorial, December 2006; Shaun Purcell, In this practical, we will use gPLINK, PLINK and Haploview to… … examine genotyping rates and look for non-random missing data … determine SNP frequencies and test Hardy-Weinberg equilibrium … assess population stratification via clustering, genomic control … test for allelic, genotypic and haplotypic association … perform stratified analyses, conditioning on population strata … assess between-stratum heterogeneity in association signal … examine linkage disequilibrium patterns around associated SNPs … select tag SNPs for follow-up and replication studies

PLINK tutorial, December 2006; Shaun Purcell, Simulated WGAS dataset Real genotypes, but a simulated “disease” 90 Asian HapMap individuals –10K autosomal SNPs from Affymetrix 500K product Simulated quantitative phenotype; median split to create a disease phenotype Illustrative, not realistic!

PLINK tutorial, December 2006; Shaun Purcell, 1) What is the genotyping rate? 2) How many monomorphic SNPs? 3) Evidence of non-random genotyping failure? 4) What is the single most associated SNP? Does it reach genome-wide significance? What is the most associated haplotype? 5) Is there evidence of population stratification from genomic control? 6) Use genotypes to cluster the sample into 2 subpopulations. How well does the clustering recover the known Chinese/Japanese split? 7) Is there evidence for stratification conditional on the two-cluster solution? 8) What is the best SNP controlling for stratification. Is it genome-wide significant? For the most highly associated SNP: 9) Does this SNP pass the Hardy-Weinberg equilibrium test? 10) Does this SNP differ in frequency between the two populations? 11) Is there evidence that this SNP has a different association between the two populations? 12) What are the allele frequencies in cases and controls? Genotype frequencies? What is the odds ratio? 13) Is the rate of missing data equal between cases and controls for this SNP? 14) Does an additive model well characterize the association? What about genotypic, dominant models, etc? Specific questions asked

PLINK tutorial, December 2006; Shaun Purcell, Data used in this practical Available at example.bed Binary format genotype information (do not attempt to view in a standard text editor) example.bim Map file (6 fields: each row is a SNP: chromosome, RS #, genetic position, physical position, allele 1, allele 2) example.fam Individual information file (first 6 columns of a PED file; disease phenotype is column 6) pop.phe Chinese/Japanese population indicator (FID, IID, population code) qt.phe Alternate quantitative trait phenotype file (Family ID, Individual ID, phenotype)

PLINK tutorial, December 2006; Shaun Purcell, The Truth… ChineseJapanese Case347 Control1138 “11”“12”“22” Case52123 Control16232 Group difference Single common variant rs chr8

PLINK tutorial, December 2006; Shaun Purcell, Right-click on the Desktop to create a project folder… …and rename it “project1” A gPLINK “project” is a folder

PLINK tutorial, December 2006; Shaun Purcell, Copy the relevant files into this folder

PLINK tutorial, December 2006; Shaun Purcell, Start a new gPLINK project

PLINK tutorial, December 2006; Shaun Purcell, Select the folder you previously created

PLINK tutorial, December 2006; Shaun Purcell, Configuring the new project Here, we tell gPLINK… … where the PLINK executable is … specify any PLINK prefixes (advanced option for grid computing) … where the Haploview (version 4.0)executable is … where the Haploview (version 4.0) executable is … which text editor to use to view files, e.g. WordPad (write.exe)

PLINK tutorial, December 2006; Shaun Purcell, Data management Recode dataset (A,C,G,T → 1,2) Reorder dataset Flip DNA strand Extract subsets (individuals, SNPs) Remove subsets (individuals, SNPs) Merge 2 or more filesets Compact binary file format

PLINK tutorial, December 2006; Shaun Purcell, Summarizing the data Hardy-Weinberg Mendel errors Missing genotypes Allele frequencies Tests of non-random missingness –by phenotype and by (unobserved) genotype Individual homozygosity estimates Stretches of homozygosity Pairwise IBD estimates

PLINK tutorial, December 2006; Shaun Purcell, Validating the fileset Need to enter a unique root filename: Doesn’t do anything, except (attempt to) load the data and report basic statistics Then add a description (for logging)

PLINK tutorial, December 2006; Shaun Purcell, Clicking on the tree to expand or contract it; individual input or output files can be selected here The log file always gives a lot of useful information: it is good practice always to check it to confirm that an analysis has run okay. Default filters applied here Q1) What is the genotyping rate? Overall genotyping rate

PLINK tutorial, December 2006; Shaun Purcell, Viewing an output file In this case, a list of individuals excluded due to low genotyping rate (just one person here). (A line contains Family ID and Individual ID) Right-click on a selected file

PLINK tutorial, December 2006; Shaun Purcell, Filters and thresholds Most forms have Filter and Thresholds buttons Thresholds exclude people or SNPs based on genotype data Filters exclude people or SNPs based on prespecified lists, or genomic location

PLINK tutorial, December 2006; Shaun Purcell, Q2) How many monomorphic SNPs? We can use thresholds and the Validate fileset option to answer this:

PLINK tutorial, December 2006; Shaun Purcell,

Q3) Evidence of non-random genotyping failure? The Summary Statistics/Missingness option can answer this:

PLINK tutorial, December 2006; Shaun Purcell, Missing rate in cases (A) and controls (U) and a test for whether rate differs

PLINK tutorial, December 2006; Shaun Purcell, REFERENCE SNP 50% 40% 10% 20% 70% GENOTYPED MISSING FLANKING SNP Non-random genotyping failure “Mishap” test ~10% (30,824) of SNPs with >5 missing genotypes fail mishap test at p < 1e-8 ~10% (30,824) of SNPs with >5 missing genotypes fail mishap test at p < 1e-8 Flanking haplotypes GENOMISSING HOM23400 HET4968 For example: rs has 68 missing genotypes (~2.6% missing)

PLINK tutorial, December 2006; Shaun Purcell, Association analysis Case/control –allelic, trend, genotypic –general Cochran-Mantel-Haenszel Family-based TDT Quantitative traits Haplotype analysis –focus on “multimarker predictors” Multilocus tests, covariates, epistasis, etc

PLINK tutorial, December 2006; Shaun Purcell, Standard association tests Q4) What is the most associated SNP?

PLINK tutorial, December 2006; Shaun Purcell, Q5) Evidence of stratification from genomic control?

PLINK tutorial, December 2006; Shaun Purcell, Genomic control Test locus Unlinked ‘null’ markers 22 No stratification 22 Stratification  adjust test statistic

PLINK tutorial, December 2006; Shaun Purcell,

Haplotype based association Specify a list of specific haplotype tests (*.hlist file) Q4b) What is the most associated haplotype?

PLINK tutorial, December 2006; Shaun Purcell, Specifying haplotype tests i_rs rs rs i_rs rs rs i_rs rs rs i_rs rs rs … etc … Or, specifying a sliding window of fixed SNPs with: e.g. --hap-window 4 * rs rs * rs rs * rs rs rs … etc … Specify specific haplotypes Or, specify the locus (i.e. only specify predicting SNPs) ID chr cM bp alleles ID chr cM bp alleles Haplotype SNPs (in data file) Predictors Predicted

PLINK tutorial, December 2006; Shaun Purcell, Haplotype-based tests Haplotype C/C association results (omnibus & haplotype-specific) List of tests that could not be performed, e.g. if the predictor SNPs were removed in the filtering stage

PLINK tutorial, December 2006; Shaun Purcell, Identity-by-state (IBS) sharing Individual 1 A/C G/T A/G A/A G/G | | | | | | Individual 2 C/C T/T A/G C/C G/G IBS Pair from same population Individual 3 A/C G/G A/A A/A G/G | | Individual 4 C/C T/T G/G C/C A/G IBS Pair from different population

PLINK tutorial, December 2006; Shaun Purcell, Empirical assessment of ancestry Han Chinese Japanese Multidimensional scaling plot: ~10K random SNPs Complete linkage IBS-based hierarchical clustering

PLINK tutorial, December 2006; Shaun Purcell, Q6) Use genotypes to cluster the sample into 2 subpopulations Step 1) Generate IBS distances for all pairs (may take a few minutes)

PLINK tutorial, December 2006; Shaun Purcell, Step 2) Cluster individuals based on IBS distances and other constraints Constrain cluster solution to two classes (K=2) Specify previously-generated IBS file (*.genome)

PLINK tutorial, December 2006; Shaun Purcell,

Stratified analysis Cochran-Mantel-Haenszel test Stratified 2×2×K tables AB CD AB CD AB CD AB CD AB CD

PLINK tutorial, December 2006; Shaun Purcell, Select the previously calculated *.cluster2 file. This “cluster file” has one line per individual

PLINK tutorial, December 2006; Shaun Purcell, Q7) Evidence of stratification conditional on cluster solution?

PLINK tutorial, December 2006; Shaun Purcell, Q8) What is the best SNP controlling for stratification?

PLINK tutorial, December 2006; Shaun Purcell, Making a Haploview fileset Select 200kb region around our “best hit”

PLINK tutorial, December 2006; Shaun Purcell,

In the remaining time (if any…) Extract as a new PLINK fileset just the single best SNP (rs ) Using this new file, attempt questions –Here are some clues 9) Summary statistics → Hardy Weinberg 10) Standard association test, with an alternate phenotype 11) Stratified association with Breslow-Day test 12) You’ve already calculated these (i.e. *.assoc, *.hwe ) 13) This is already calculated also (i.e. *.missing) 14) Use genotypic association test Consult the PLINK documentation (

PLINK tutorial, December 2006; Shaun Purcell, In summary We performed whole genome –summary statistics and QC –stratification analysis –conditional and unconditional association analysis We found a single SNP rs that… –is genome-wide significant –has similar frequencies and effects in Japanese and Chinese subpopulations –shows no missing or HW biases –is consistent with an allelic, dosage effect –has common T allele with strong protective effect ( ~0.05 odds ratio)

PLINK tutorial, December 2006; Shaun Purcell, Acknowledgements Julian Maller Dave Bender Jeff Barrett Mark Daly Shaun Purcell Kathe Todd-Brown Ben Neale Mark Daly Pak Sham (g)PLINK development Haploview development