Principal components analysis


Principal components analysis Gavin Band

Why do PCA?
PCA is good at detecting “directions” of major variation in your data. This might be:
Population structure: subpopulations having different allele frequencies.
Unexpected (“cryptic”) relationships.
Artifacts such as genotyping errors.
Apart from intrinsic interest, these are precisely the factors that need to be controlled for in association tests.

Performing PCA
1. Take genotype data(*): a matrix X with L SNPs in rows and N samples in columns.
2. Form the ‘relatedness matrix’ R, an N × N matrix with entries rij = relatedness(**) between sample i and sample j.
3. Eigen-decompose it. Eigen-decomposition picks out directions in the data along which the variance is maximised. Eigenvalues represent the variance of the data along these directions.
(*) Suitably normalised; see later.
(**) With suitable normalisation: rij ≈ 1 if samples i and j are duplicates (or MZ twins); rij ≈ 0 if samples i and j are unrelated (relative to the sample).
You can do this in R! E.g.:
> R = 1/L * (t(X) %*% X)
> V = eigen(R)$vectors
> plot( V[,1], V[,2] )

Example (simulated data, N=50 individuals, L=1000 SNPs)
Relatedness matrix R:
> R = (1/1000) * (t(X) %*% X)
Eigenvectors v1, v2, ... of R:
> V = eigen(R)$vectors
Plot of v1 against v2:
> plot( V[,1], V[,2] )
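The example above can be run end to end on simulated data. The following is a minimal sketch, not the simulation used for the slides: the two-population model, the frequency ranges, the seed and all variable names are assumptions made for illustration.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
N <- 50                            # individuals (25 per assumed population)
L <- 1000                          # SNPs
pop <- rep(c(1, 2), each = N / 2)  # assumed population labels

# Assumed model: ancestral frequency p, perturbed independently in each population
p  <- runif(L, 0.2, 0.8)
p1 <- pmin(pmax(p + rnorm(L, sd = 0.1), 0.05), 0.95)
p2 <- pmin(pmax(p + rnorm(L, sd = 0.1), 0.05), 0.95)

# Genotype counts (0, 1 or 2): L SNPs in rows, N samples in columns
G <- cbind(matrix(rbinom(L * N / 2, 2, p1), nrow = L),
           matrix(rbinom(L * N / 2, 2, p2), nrow = L))

# Normalise: mean-centre each SNP (row) and divide by its standard deviation
keep <- apply(G, 1, sd) > 0        # drop SNPs that are monomorphic in the sample
X <- t(scale(t(G[keep, ])))        # scale() works on columns, hence the transposes

R <- crossprod(X) / nrow(X)        # same as (1 / L) * t(X) %*% X, an N x N matrix
V <- eigen(R, symmetric = TRUE)$vectors

# Draw the plot only in an interactive session
if (interactive()) plot(V[, 1], V[, 2], col = pop, xlab = "v1", ylab = "v2")

print(cor(V[, 1], pop))            # how strongly v1 tracks the population label
```

Under these assumptions the first eigenvector should separate the two simulated populations, which is the kind of picture the v1 against v2 plot is meant to show.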

Caution! PCA picks up any source of variation.
[Figure: PCA plots with outlying points labelled as a pair of duplicate samples and a badly genotyped sample.]
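The caution can be illustrated with a small simulation. This is a sketch under assumed settings (unstructured genotypes, allele frequencies drawn uniformly between 0.2 and 0.8, sample 2 copied from sample 1), not a reconstruction of the data behind the slide.

```r
set.seed(2)                                  # arbitrary seed
N <- 50; L <- 1000
f <- runif(L, 0.2, 0.8)                      # assumed allele frequencies
G <- matrix(rbinom(L * N, 2, f), nrow = L)   # unstructured genotypes, SNPs in rows
G[, 2] <- G[, 1]                             # make sample 2 a duplicate of sample 1

keep <- apply(G, 1, sd) > 0                  # drop monomorphic SNPs
X <- t(scale(t(G[keep, ])))                  # normalise each SNP (row)
R <- crossprod(X) / nrow(X)                  # relatedness matrix
e <- eigen(R, symmetric = TRUE)

# Samples with the two largest absolute loadings on the first eigenvector
top2 <- order(abs(e$vectors[, 1]), decreasing = TRUE)[1:2]
print(sort(top2))
print(round(R[1, 2], 2))                     # relatedness of the duplicate pair
```

With no population structure in the data, the first principal component is taken over by the duplicate pair, so a striking PC is not by itself evidence of population structure.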

Relatedness
At a SNP with frequency f, the tables below give the joint probabilities of the allele carried by individual 1 (rows) and the allele carried by individual 2 (columns).

“Unrelated” individuals (correlation = 0):

                        Individual 2
                        Allele 1           Allele 2               Total
Individual 1  Allele 1  f^2                f(1-f)                 f
              Allele 2  f(1-f)             (1-f)^2                1-f

Individuals with relatedness r (correlation = r):

                        Individual 2
                        Allele 1           Allele 2               Total
Individual 1  Allele 1  rf + (1-r)f^2      (1-r)f(1-f)            f
              Allele 2  (1-r)f(1-f)        r(1-f) + (1-r)(1-f)^2  1-f
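The claim that the correlation equals r can be checked numerically from the table entries. The covariance between the two allele indicators is rf + (1-r)f^2 - f^2 = rf(1-f), and each indicator has variance f(1-f), so the ratio is r. The function names below are hypothetical helpers introduced for this check.

```r
# Joint distribution of the alleles carried by two individuals at a SNP with
# frequency f, for relatedness r (rows: individual 1, columns: individual 2).
joint_table <- function(f, r) {
  matrix(c(r * f + (1 - r) * f^2,  (1 - r) * f * (1 - f),
           (1 - r) * f * (1 - f),  r * (1 - f) + (1 - r) * (1 - f)^2),
         nrow = 2, byrow = TRUE,
         dimnames = list(ind1 = c("allele1", "allele2"),
                         ind2 = c("allele1", "allele2")))
}

# Correlation between the indicators "individual k carries allele 1"
allele_correlation <- function(P) {
  f1 <- sum(P[1, ])            # marginal frequency for individual 1
  f2 <- sum(P[, 1])            # marginal frequency for individual 2
  (P[1, 1] - f1 * f2) / sqrt(f1 * (1 - f1) * f2 * (1 - f2))
}

P <- joint_table(f = 0.3, r = 0.25)
print(rowSums(P))                 # marginals: f and 1 - f
print(allele_correlation(P))      # equals r up to floating-point rounding
```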

Relatedness and population history: a heuristic explanation
An ancestral population, in which the SNP has frequency p, splits over time into Population 1 (SNP frequency p1) and Population 2 (SNP frequency p2). Along each branch the allele frequency drifts, with variance proportional to p(1-p).
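The p(1-p) scaling can be seen in a quick simulation. This sketch reads "drift proportional to p(1-p)" as a statement about the variance of the change in frequency, and assumes simple binomial resampling of allele copies each generation; the population size, number of generations and function name are illustrative assumptions.

```r
set.seed(3)                                 # arbitrary seed

# One population descending from an ancestor with frequency p:
# binomial resampling of 2 * n_ind allele copies per generation (assumed model).
drift <- function(p, n_ind = 100, generations = 10) {
  for (g in seq_len(generations)) {
    p <- rbinom(length(p), 2 * n_ind, p) / (2 * n_ind)
  }
  p
}

n_snps <- 20000
var_half  <- var(drift(rep(0.5, n_snps)))   # ancestral p = 0.5, p(1-p) = 0.25
var_tenth <- var(drift(rep(0.1, n_snps)))   # ancestral p = 0.1, p(1-p) = 0.09

print(var_half / var_tenth)                 # compare with 0.25 / 0.09
```

Dividing each SNP by a quantity proportional to sqrt(p(1-p)) therefore puts SNPs with different ancestral frequencies on a comparable scale, which motivates the normalisation used when computing relatedness.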

Relatedness
In matrix terms: mean-centre each row of X (one row per SNP), divide it by its standard deviation, and compute R = (1/L) t(X) X as before.
Note: ½rij is almost the same as a kinship coefficient, but is relative to the sample, not to an ancestral population.
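The normalisation described above can be wrapped in a small function. This is a sketch: relatedness_matrix is a hypothetical helper, the simulated genotypes and frequencies are assumptions, and dropping monomorphic SNPs is a choice made here so that the division is defined.

```r
# Hypothetical helper: G is an L x N matrix of genotype counts (0, 1, 2),
# with SNPs in rows and samples in columns.
relatedness_matrix <- function(G) {
  keep <- apply(G, 1, sd) > 0                # monomorphic SNPs cannot be scaled
  G <- G[keep, , drop = FALSE]
  X <- (G - rowMeans(G)) / apply(G, 1, sd)   # mean-centre rows, divide by row sd
  crossprod(X) / nrow(X)                     # R = (1 / L) * t(X) %*% X
}

set.seed(4)                                  # arbitrary seed
L <- 1000; N <- 20
f <- runif(L, 0.2, 0.8)                      # assumed allele frequencies
G <- matrix(rbinom(L * N, 2, f), nrow = L)
G[, 2] <- G[, 1]                             # samples 1 and 2 are duplicates

R <- relatedness_matrix(G)
print(round(R[1, 2], 2))                     # duplicates: close to 1
print(round(R[3, 4], 2))                     # unrelated pair: close to 0
print(round(R[1, 2] / 2, 2))                 # half of rij, cf. a kinship coefficient
```

Because each SNP is centred on the sample mean, the values are relative to the sample: unrelated pairs come out slightly negative on average rather than exactly zero.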

Association testing
Without controlling for structure:
Outcome ~ baseline + genotype
Traditional approaches control for structure using a number of principal components:
Outcome ~ baseline + genotype + PC1 + PC2 + ...
The most recent mixed model approach includes the whole relatedness matrix to control for structure:
Outcome ~ baseline + genotype +
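The first two models can be compared on simulated data in which the outcome depends on population membership but not on any SNP. This is a sketch under stated assumptions: a quantitative outcome fitted with lm(), two populations of 100 samples, and one deliberately differentiated test SNP. It does not cover the mixed model, which needs dedicated software.

```r
set.seed(5)                                  # arbitrary seed
n_per_pop <- 100; L <- 500
pop <- rep(c(0, 1), each = n_per_pop)        # assumed two-population sample

# Assumed population-specific allele frequencies; SNP 1 is strongly differentiated
p1 <- runif(L, 0.2, 0.8)
p2 <- pmin(pmax(p1 + rnorm(L, sd = 0.15), 0.05), 0.95)
p1[1] <- 0.1; p2[1] <- 0.9

G <- cbind(matrix(rbinom(L * n_per_pop, 2, p1), nrow = L),
           matrix(rbinom(L * n_per_pop, 2, p2), nrow = L))

# Outcome depends on population only; no SNP has a causal effect
outcome <- 2 * pop + rnorm(2 * n_per_pop)

# Principal components from the normalised genotypes
keep <- apply(G, 1, sd) > 0
X <- t(scale(t(G[keep, ])))
V <- eigen(crossprod(X) / nrow(X), symmetric = TRUE)$vectors
PC1 <- V[, 1]; PC2 <- V[, 2]

genotype <- G[1, ]
naive    <- lm(outcome ~ genotype)               # Outcome ~ baseline + genotype
adjusted <- lm(outcome ~ genotype + PC1 + PC2)   # ... + PC1 + PC2

print(coef(summary(naive))["genotype", ])
print(coef(summary(adjusted))["genotype", ])
```

In this setup the unadjusted model attributes the population difference to the genotype, while including the leading PCs should pull the genotype coefficient back towards zero.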

Association testing with linear mixed models
Outcome ~ baseline + genotype +
This is a bit like including all the PCs in a single regression, but constrained to explain a proportional amount of residual variation. In some circumstances it has been shown to control for structure better than using principal components directly. For example, see “Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis”, IMSGC & WTCCC2, Nature (2011), or play with it at http://www.well.ox.ac.uk/wtccc2/ms.
However, these are linear models, and some caveats remain in their use for case/control studies.

Summary
PCA is good at picking up sources of variation in datasets, including genetic datasets.
Any form of variation can be picked up: population structure, but also cohort or plate effects, genotyping error and sample duplication.
This is what we want when controlling for structure / unwanted variation in an association test.

Software for performing PCA
Plink (v1.9 or above): http://www.cog-genomics.org/plink2
EIGENSOFT: http://genetics.med.harvard.edu/reich/Reich_Lab/Software.html
Or use R!

Software for mixed model analysis
GCTA: http://genetics.med.harvard.edu/reich/Reich_Lab/Software.html
FastLMM: http://research.microsoft.com/en-us/um/redmond/projects/mscompbio/fastlmm/
MMM: http://www.helsinki.fi/~mjxpirin/download.html
GEMMA: http://www.xzlab.org/software.html

Recommended reading
“Population Structure and Eigenanalysis”, Patterson N, Price AL, Reich D, PLoS Genetics (2006). (The “SmartPCA” paper.)
“Population Structure and Cryptic Relatedness in Genetic Association Studies”, Astle W, Balding DJ, Statistical Science (2009).
“Reconciling the analysis of IBD and IBS in complex trait studies”, Powell JE, Visscher PM, Goddard ME, Nat. Rev. Genetics (2010).
“A Genealogical Interpretation of Principal Components Analysis”, McVean G, PLoS Genetics (2009).
“Interpreting principal component analyses of spatial population genetic variation”, Novembre J, Stephens M, Nature Genetics (2008).
“Advantages and pitfalls in the application of mixed-model association methods”, Yang et al, Nature Genetics (2014).