Preparing data for GWAS analysis

Slides:



Advertisements
Similar presentations
Statistical methods for genetic association studies
Advertisements

Population Genetics 1 Chapter 23 in Purves 7 th edition, or more detail in Chapter 15 of Genetics by Hartl & Jones (in library) Evolution is a change in.
BST 775 Lecture PLINK – A Popular Toolset for GWAS
Note that the genetic map is different for men and women Recombination frequency is higher in meiosis in women.
Alleles = A, a Genotypes = AA, Aa, aa
Lab 4: Inbreeding and Kinship. Inbreeding Causes departure from Hardy-Weinburg Equilibrium Reduces heterozygosity Changes genotype frequencies Does not.
Multiple Comparisons Measures of LD Jess Paulus, ScD January 29, 2013.
CSS 650 Advanced Plant Breeding Module 2: Inbreeding Small Populations –Random drift –Changes in variance, genotypes Mating Systems –Inbreeding coefficient.
Objectives Cover some of the essential concepts for GWAS that have not yet been covered Hardy-Weinberg equilibrium Meta-analysis SNP Imputation Review.
Basics of Linkage Analysis
Chapter 2: Hardy-Weinberg Gene frequency Genotype frequency Gene counting method Square root method Hardy-Weinberg low Sex-linked inheritance Linkage and.
Lab 13: Association Genetics. Goals Use a Mixed Model to determine genetic associations. Understand the effect of population structure and kinship on.
Hardy Weinberg: Population Genetics
Brachydactyly and evolutionary change
Hardy Weinberg: Population Genetics
Chapter 23 Population Genetics © John Wiley & Sons, Inc.
Analysis of genome-wide association studies
Biodiversity IV: genetics and conservation
14 Population Genetics and Evolution. Population Genetics Population genetics involves the application of genetic principles to entire populations of.
Population Genetics Studying the Distribution of Alleles and Genotypes in a Population.
Population Genetics: Chapter 3 Epidemiology 217 January 16, 2011.
Deviations from HWE I. Mutation II. Migration III. Non-Random Mating IV. Genetic Drift A. Sampling Error.
Changing Allele Frequency Chapter 23. What you need to know! The conditions for Hardy-Weinberg Equilibrium How to use the Hardy-Weinberg equation to calculate.
1 Population Genetics Basics. 2 Terminology review Allele Locus Diploid SNP.
Bottlenecks reduce genetic variation – Genetic Drift Northern Elephant Seals were reduced to ~30 individuals in the 1800s.
Addressing cryptic relatedness in candidate samples for 1KG James Nemesh Steve McCarroll 02/13/2012.
Lecture 6: Inbreeding September 4, Last Time uCalculations  Measures of diversity and Merle patterning in dogs  Excel sheet posted uFirst Violation.
Linkage Disequilibrium and Recent Studies of Haplotypes and SNPs
Data Quality Control Suzanne M. Leal Baylor College of Medicine Copyrighted © S.M. Leal 2015.
Lab 4: Inbreeding and Kinship. Inbreeding Reduces heterozygosity Does not change allele frequencies.
Linkage. Announcements Problem set 1 is available for download. Due April 14. class videos are available from a link on the schedule web page, and at.
Schematic of the single variant polymorphism (SNP) genotyping assay.
Biometrical Genetics Shaun Purcell Twin Workshop, March 2004.
8 and 11 April, 2005 Chapter 17 Population Genetics Genes in natural populations.
Please feel free to chat amongst yourselves until we begin at the top of the hour.
Population stratification
Date of download: 7/2/2016 Copyright © 2016 American Medical Association. All rights reserved. From: How to Interpret a Genome-wide Association Study JAMA.
Multifactorial Inheritance
Bottlenecks reduce genetic variation – Genetic Drift
HARDY-WEINBERG EQUILIBRIUM
Common variation, GWAS & PLINK
Quality control for GWAS
Genes within Populations
Do we really need theory?
Evolution and Populations –Essential Questions p
KA 4: Ante- and postnatal screening
Hardy-Weinberg Theorem
Principal components analysis
Genome Wide Association Studies using SNP
Deviations from HWE I. Mutation II. Migration III. Non-Random Mating
Measuring Evolution of Populations
Measuring Evolution of Populations
Measuring Evolution of Populations
Come my friend. Let’s forget the cares of tomorrow
Quantitative Traits in Populations
PLANT BIOTECHNOLOGY & GENETIC ENGINEERING (3 CREDIT HOURS)
Population stratification
Evolution: Hardy-Weinberg Equilibrium
Basic concepts on population genetics
Lecture 4: Testing for Departures from Hardy-Weinberg Equilibrium
Hardy Weinberg: Population Genetics
Identification of Paralogs in RADseq data
Association Analysis Spotted history
Lecture 9: QTL Mapping II: Outbred Populations
Modern Evolutionary Biology I. Population Genetics
Hardy-Weinberg Lab Data
Presentation transcript:

Preparing data for GWAS analysis Tommy Carstensen

AA AB BB Quality Control QC ? Manhattan Plot GWAS

Bad QC > Bad Data > Bad Results Nice Manhattan Questioning Retraction due to inadequate QC Entebbe, Uganda

Genotype data SNP A SNP B SNP C SNP D SNP E Female 1 00 AG GG GA AA CC Female 2 AC Female 3 GC Male 2 CA 00 = missing data September 2013 Entebbe, Uganda

How did we get the genotype data? Genotype Calling Good data SNP1 Bad data SNP2 AA or AB? 00! AA AB basic problem affy uses BRLMM we use chiamo++ BB September 2013 Entebbe, Uganda

Sample QC and SNP QC SNP QC SNP A SNP B SNP C SNP D SNP E Female 1 00 AG GG GA Male 1 AA CC Female 2 AC Female 3 GC Male 2 CA sample QC 00 = missing data September 2013 Entebbe, Uganda

Quality Control Steps Sample QC SNP QC Sample Call Rate/Proportion SNP Call Rate/Proportion Autosomal Heterozygosity Hardy Weinberg Equilibrium (HWE) Sex / Gender X Chromosome Heterozygosity Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA – need high quality data Too Much Relatedness Identity By Descent (IBD) Too Little Relatedness / Confounding Principal Component Analysis (PCA) September 2013 Entebbe, Uganda

Quality Control Steps Sample QC SNP QC Sample Call Rate/Proportion SNP Call Rate/Proportion Autosomal Heterozygosity Hardy Weinberg Equilibrium (HWE) Sex / Gender X Chromosome Heterozygosity Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA Too Much Relatedness Identity By Descent (IBD) Too Little Relatedness / Confounding Principal Component Analysis (PCA) September 2013 Entebbe, Uganda

How did we get the data? Genotype Calling Good data SNP1 Bad data SNP2 AA or AB? 00! AA AB basic problem affy uses BRLMM we use chiamo++ BB September 2013 Entebbe, Uganda

Sample Call Rate/Proportion SNP1 SNP2 SNP3 SNP4 SNP5 Sample Call Rate Sample1 00 AG GG GA 60% Sample2 AA CC 80% Sample3 AC Sample4 GC 100% Sample5 CA SNP1 SNP2 SNP3 SNP4 SNP5 Sample1 00 AG GG GA Sample2 AA CC Sample3 AC Sample4 GC Sample5 CA Conservative approach would be to do sample call rate and SNP call rate simultaneously QC is an art and not a science The sequence of QC steps matters Thresholds and procedures should depend on the quality of the data and the questions being asked September 2013 Entebbe, Uganda

Quality Control Steps Sample QC SNP QC Sample Call Rate/Proportion SNP Call Rate/Proportion Autosomal Heterozygosity Hardy Weinberg Equilibrium (HWE) Sex / Gender X Chromosome Heterozygosity Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA Too Much Relatedness Identity By Descent (IBD) Too Little Relatedness / Confounding Principal Component Analysis (PCA) September 2013 Entebbe, Uganda

Definition of Heterozygosity Rate Copy 1 Copy 2 SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 A A homozygous C C heterozygous A T heterozygosity = 2/8 = 0.25 C C C G A A A A T T chromosome 1 September 2013 Entebbe, Uganda

Heterozygosity Remove samples deviating from average Deviations could arise due to several reasons Contamination of samples (high heterozygosity) Inbreeding (low heterozygosity) Ancestral differences Data quality / Poor genotype calling Heterozygotes more likely to be missing Sample 1 Sample 2 heterozygous homozygous populations heterozygosity September 2013 Entebbe, Uganda

Correlation between quality metrics sequence matters – black line vs blue line September 2013 Entebbe, Uganda

Quality Control Steps Sample QC SNP QC Sample Call Rate/Proportion SNP Call Rate/Proportion Autosomal Heterozygosity Hardy Weinberg Equilibrium (HWE) Sex / Gender X Chromosome Heterozygosity Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA Too Much Relatedness Identity By Descent (IBD) Too Little Relatedness / Confounding Principal Component Analysis (PCA) September 2013 Entebbe, Uganda

Sex check Looking for mislabelled samples Females Bad Good Males Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls Don’t use SNPs in pseudoautosomal regions for sex check Good Male 1 allele Crossover between the X and Y chromosome happens between pseudoautosomal regions. SNPs in PARs are thus excluded from analysis. Female 2 alleles Bad September 2013 Entebbe, Uganda

Quality Control Steps Sample QC SNP QC Sample Call Rate/Proportion SNP Call Rate/Proportion Autosomal Heterozygosity Hardy Weinberg Equilibrium (HWE) Sex / Gender X Chromosome Heterozygosity Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA Too Much Relatedness Identity By Descent (IBD) Too Little Relatedness / Confounding Principal Component Analysis (PCA) September 2013 Entebbe, Uganda

Relatedness Relatedness is a problem because of overrepresentation of selected alleles, which will bias any multivariate analysis (correlated data!); e.g. PCA or multivariate regression Related samples need to be excluded or taken into account during subsequent analyses One metric of relatedness is Identity By Descent (IBD), which involves calculation of proportion of common alleles between two individuals. Prior to the calculation of IBD, SNPs with a low call rate are permanently excluded and rare SNPs (MAF<5%) and SNPs in Linkage Disequilibrium (LD) are temporarily excluded. EITHER exclude relateds OR take them into account September 2013 Entebbe, Uganda

Relatedness / IBD Completely identical Half-identical Not identical Relationship category Relatedness Monozygotic twins 1 Parent-Offspring 1/2 Full siblings Grandparent-grandchild 1/4 Uncle/Aunt-Nephew/Niece First cousins 1/8 Unrelated Half-identical Not identical Me and my mom Me and my sister Pedigree shows why some regions are IBD September 2013 Entebbe, Uganda

Relatedness / IBD A Ugandan cohort study as an example threshold for temporary exclusion prior to HWE check Count of sample pairs first cousins (12.5%) etc. siblings parent-child uncle-niece aunt-nephew etc. duplicates identical twins Relatedness either needs to be taken into account in analysis, or related individuals need to be removed Maximum IBD for each sample September 2013 Entebbe, Uganda

Quality Control Steps Sample QC SNP QC Sample Call Rate/Proportion SNP Call Rate/Proportion Autosomal Heterozygosity Hardy Weinberg Equilibrium (HWE) Sex / Gender X Chromosome Heterozygosity Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA Too Much Relatedness Identity By Descent (IBD) Too Little Relatedness / Confounding Principal Component Analysis (PCA) September 2013 Entebbe, Uganda

How did we get the data? Genotype Calling Good data SNP1 Bad data SNP2 AA or AB? 00! AA AB basic problem affy uses BRLMM we use chiamo++ BB September 2013 Entebbe, Uganda

SNP Call Rate/Proportion Sample1 00 AG GG GA Sample2 AA CC Sample3 AC Sample4 GC Sample5 CA SNP1 SNP2 SNP3 SNP4 SNP5 Sample1 00 AG GG GA Sample2 AA CC Sample3 AC Sample4 GC Sample5 CA SNP Call Rate 60% 80% 100% Conservative approach would be to do sample call rate and SNP call rate simultaneously QC is an art and not a science The sequence of QC steps matters Thresholds and procedures should depend on the quality of the data and the questions being asked September 2013 Entebbe, Uganda

Quality Control Steps Sample QC SNP QC Sample Call Rate/Proportion SNP Call Rate/Proportion Autosomal Heterozygosity Hardy Weinberg Equilibrium (HWE) Sex / Gender X Chromosome Heterozygosity Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA Too Much Relatedness Identity By Descent (IBD) Too Little Relatedness Principal Component Analysis (PCA) September 2013 Entebbe, Uganda

Hardy Weinberg Equilibrium SNP1 SNP2 SNP3 Sample1 AC AA Sample2 Sample3 CC Sample4 f (A)=p 4/8 f(C)=q=1-p fe(AA)=p2 1/4 fe(AC)=2pq 2/4 fe(CC)=q2 fo(AA) 0/4 fo(AC) 4/4 fo(CC) SNP1 SNP2 SNP3 Sample1 AC AA Sample2 Sample3 CC Sample4 f (A)=p 4/8 f(C)=q=1-p SNP1 SNP2 SNP3 Sample1 AC AA Sample2 Sample3 CC Sample4 SNP1 SNP2 SNP3 Sample1 AC AA Sample2 Sample3 CC Sample4 f (A)=p 4/8 f(C)=q=1-p fe(AA)=p2 1/4 fe(AC)=2pq 2/4 fe(CC)=q2 allele frequencies random mating Females A (p) C (q) Males AA (p2) AC (pq) CC (q2) expected genotype frequencies Random mating! observed genotype frequencies September 2013 Entebbe, Uganda

When HWE does not apply Non-random mating Selection forces Alleles in disease causing loci Apply HWE only to controls in a case-control study Migration Data quality September 2013 Entebbe, Uganda

Quality Control Steps Sample QC SNP QC Sample Call Rate/Proportion SNP Call Rate/Proportion Autosomal Heterozygosity Hardy Weinberg Equilibrium (HWE) Sex / Gender X Chromosome Heterozygosity Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA Too Much Relatedness Identity By Descent (IBD) Too Little Relatedness / Confounding Principal Component Analysis (PCA) September 2013 Entebbe, Uganda

Principal Component Analysis A statistical technique for summarizing many variables with minimal loss of information PCA requires clean non-correlated data September 2013 Entebbe, Uganda

Principal Component Analysis A statistical technique for summarizing many variables with minimal loss of information PCA can reveal Population outliers September 2013 Entebbe, Uganda

PCA and Population Outliers Africa Europe Study samples We will already have gotten rid of some outliers during heterozygosity check PCA - take multiple variables and find dimensions with greatest variability Quality of data very important when doing multivariate statistics such as PCA and hence often last step of QC Asia September 2013 Entebbe, Uganda

Principal Component Analysis A statistical technique for summarizing many variables with minimal loss of information PCA can reveal Population outliers Population structure / Confounding September 2013 Entebbe, Uganda

Why population structure / confounding is a problem in a case-control study “Causal” allele “Non-causal” allele 1) Population 1 over represented in study. 2) Blue allele overrepresented among population 1. 3) Blue allele overrepresented among cases and appears as causal allele. “Causal” allele overrepresented among cases because of higher frequency in overrepresented population among cases Confounding occurs when the population substructure is not equally distributed between case and control groups. Population / Chip / Lab Balding, Nature Reviews Genetics (2006) September 2013 Entebbe, Uganda

Confounding due to chip effects EAST QUAD EAST OCTO September 2013 Entebbe, Uganda

Confounding due to plate effects PC 1 Plate Number September 2013 Entebbe, Uganda

Population structure -  Inflated QQ-plot Observed test statistic λ is a measure of the deviation from the diagonal Expected test statistic September 2013 Entebbe, Uganda

Simple But Effective QC Common Thresholds Sample QC SNP QC Sample Call Rate >97% SNP Call Rate/Proportion >97% Autosomal Heterozygosity mean±3SD Hardy Weinberg Equilibrium (HWE) p>10-4 Sex / Gender PLINK default thresholds Conservative approach would be to do sample and SNP call rate simultaneously Do not start with PCA Identity By Descent (IBD) <0.05 Principal Component Analysis (PCA) EIGENSTRAT, 6SD, PCs 1-10 September 2013 Entebbe, Uganda

Useful references September 2013

Useful references September 2013 Entebbe, Uganda

Useful software PLINK (QC) http://pngu.mgh.harvard.edu/~purcell/plink EIGENSTRAT (PCA) http://genetics.med.harvard.edu/reich/Reich_Lab/Software.html GEMMA (Association) http://home.uchicago.edu/xz7/software SNPTEST (Association) https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html shellfish : Parallel PCA and data processing for genome-wide SNP data http://www.stats.ox.ac.uk/~davidson/software/shellfish/shellfish.php September 2013 Entebbe, Uganda

Thank you for your attention! Deepti Gurdasani Liz Young Katherine Ripullone September 2013 Entebbe, Uganda