Imputation Sarah Medland Boulder 2015.


What is imputation? (Marchini & Howie 2010)

3 main reasons for imputation: meta-analysis; fine mapping; combining data from different chips. Other, less common uses: sporadic missing data imputation; correction of genotyping errors; imputation of non-SNP variation.

Combining data from different chips. Example: 750 individuals typed on the 370K and 550 individuals typed on the 610K. Power for a SNP with MAF .2 explaining 1% of total variance at alpha 5e-08:
N=1300, NCP 13.07, power .0331
N=750, NCP 7.54, power .0034
N=550, NCP 5.53, power .0009
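The NCP and power figures in this example can be reproduced with a short calculation. This is a sketch under two assumptions that the slide does not state: the NCP is taken to be -N ln(1 - R2) for a 1 df test, and power is the upper tail of a 1 df noncentral chi-square (computed here through the standard normal distribution). The function names are illustrative only.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def ncp(n, variance_explained):
    """Non-centrality parameter of a 1 df test (assumed form: -N * ln(1 - R^2))."""
    return -n * log(1.0 - variance_explained)

def power(n, variance_explained, alpha=5e-8):
    """Power of a two-sided 1 df test at significance level alpha."""
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2.0)  # about 5.45 for alpha = 5e-8
    shift = sqrt(ncp(n, variance_explained))
    # For 1 df, P(noncentral chi-square > z_crit^2) equals P(|Z + shift| > z_crit).
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)

for n in (1300, 750, 550):
    print(n, round(ncp(n, 0.01), 2), round(power(n, 0.01), 4))
```

Under these assumptions the three sample sizes give values close to those quoted on the slide, which illustrates how much power is lost by analysing each chip separately.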

Another way of looking at this

QQ-plot

Solution: impute all individuals to a single reference, based on the SNPs that overlap between the chips. This gives a single distribution of NCP and power across all SNPs, so the QQ plot and Manhattan plot describe the full sample with the same degree of accuracy.

Ways to approach imputation 1000 Genomes Phase II haplotypes Imputed GWAS genotypes

Imputation programs: minimac3; Impute2; Beagle (not frequently used). Never use plink for imputation!

How do they compare? Similar accuracy. Similar features. Different data formats: minimac3 -> custom vcf format, individual=row, snp=column; Impute2 -> snp=row, individual=column. Different philosophies: frequentist vs Bayesian.

minimac3: http://genome.sph.umich.edu/wiki/Minimac3 Built by Gonçalo Abecasis, Yun Li, Christian Fuchsberger and colleagues. Analysis options: raremetalworker (continuous phenotypes & family/twin samples). Format converter: http://genome.sph.umich.edu/wiki/DosageConvertor Mach2qtl (continuous phenotypes). Mach2dat (binary phenotypes).

Impute2: built by Jonathan Marchini, Bryan Howie and colleagues. https://mathgen.stats.ox.ac.uk/impute/impute_v2.html http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook Downstream analysis options: SNPtest; Quicktest.

Files to practice with http://genome.sph.umich.edu/wiki/Minimac3_Imputation_Cookbook

Today – discuss the 2 step approach 1000 Genomes Phase II haplotypes Step 1 Step 2 Imputed GWAS genotypes

Options for imputation. DIY: use a cookbook! http://genome.sph.umich.edu/wiki/Minimac3_Imputation_Cookbook OR http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook UMich Imputation Server: https://imputationserver.sph.umich.edu/ Sanger Imputation Server: https://imputation.sanger.ac.uk/

UMich imputation Server

Sanger imputation server

Step 0: choose your reference. Current publicly available references: HapMapII (no phased X data officially released); HapMapIII; 1KGP phase 1 version v3; 1KGP phase 3 version v5. Future non-public references, only available via custom imputation servers: HRC (64,976 haplotypes, 39,235,157 SNPs); CAAPA (African American/Caribbean). All ethnicities vs specific.

References are in vcf format (more about this format tomorrow)

Not all references are equal!! The Beagle and IMPUTE versions of the references contain variants that do not appear in the publicly available 1KGP data! The 1KGP references still contain multiple locations with more than 1 variant, and multiple variants in more than one place!

Step 0: re-QC your data.
Convert to PLINK binary format.
Exclude SNPs with excessive missingness (>5%), low MAF (<1%), HWE violations (~P < 10^-4), or Mendelian errors.
Drop all strand-ambiguous (palindromic) SNPs, i.e. A/T or C/G SNPs.
Update build and alignment (b37).
Check strand!
Output your data in the expected format for the phasing program you will use.
Check the naming convention for the program and reference you want to use: rs278405739 OR 22:395704.
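The variant-level exclusions in this step can be written down as a small filter. This is a minimal sketch, not a replacement for running the QC in PLINK: the record layout (a dict with the keys `missing_rate`, `maf`, `hwe_p`, `a1`, `a2`) is a hypothetical choice made for illustration, and Mendelian-error checks are left out because they need family data.

```python
def is_strand_ambiguous(allele1, allele2):
    """True for palindromic SNPs (A/T or C/G), whose strand cannot be read from the alleles."""
    return {allele1.upper(), allele2.upper()} in ({"A", "T"}, {"C", "G"})

def passes_qc(snp, max_missing=0.05, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the missingness, MAF, HWE and palindromic-SNP exclusions to one SNP record.

    `snp` is a dict with hypothetical keys: missing_rate, maf, hwe_p, a1, a2.
    """
    return (
        snp["missing_rate"] <= max_missing
        and snp["maf"] >= min_maf
        and snp["hwe_p"] >= min_hwe_p
        and not is_strand_ambiguous(snp["a1"], snp["a2"])
    )

example = {"missing_rate": 0.01, "maf": 0.20, "hwe_p": 0.3, "a1": "A", "a2": "C"}
print(passes_qc(example))
```

The thresholds are passed as keyword arguments so that a cohort with different QC conventions can change them without editing the function.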

Strand, strand, strand. DNA is a double helix: A pairs with T and C pairs with G. There are two strands:
Strand 1: ATCTGGTACTCCAT
Strand 2: TAGACCATGAGGTA

Strand, strand, strand. What about SNPs?
Strand 1: ATCTGGT[A/C]CTCCAT
Strand 2: TAGACCA[T/G]GAGGTA

Strand, strand, strand. What’s the big/annoying problem?
Strand 1: ATCTGGT[A/T]CTCCAT
Strand 2: TAGACCA[T/A]GAGGTA

Strand, strand, strand. Two ambiguous SNP types: A/T and G/C. All others are resolved. How to check? Allele frequencies [know your population]; LD [if you have raw data]. PLINK and METAL can re-orient strand. Remember the ambiguous ones!
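The strand logic can be made concrete by comparing a target allele pair with the reference allele pair. This is an illustrative sketch only: the function name and the four status labels are assumptions, and it deliberately does not attempt the frequency- or LD-based checks that ambiguous SNPs need.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def strand_status(target, reference):
    """Classify a target allele pair against the reference allele pair."""
    target_set = set(target)
    reference_set = set(reference)
    flipped_set = {COMPLEMENT[allele] for allele in target}
    if target_set == flipped_set:
        return "ambiguous"    # A/T or C/G: flipping gives back the same pair
    if target_set == reference_set:
        return "same strand"
    if flipped_set == reference_set:
        return "flip"         # alleles match only after taking the complement
    return "mismatch"         # alleles disagree on either strand

print(strand_status(("T", "G"), ("A", "C")))
```

Checking for the ambiguous case first matters: an A/T SNP would otherwise be reported as "same strand" whichever strand it was actually typed on.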

Before we move on to talking about phasing Questions?

What is phasing? In this context it is really haplotype estimation: we take genotype data and try to reconstruct the haplotypes. Reference data can be used to improve this estimation.

Step 1 - Phasing. Input: a diploid target sample and a library of reference haplotypes. Selection of conditioning haplotypes: Eagle2 first identifies a subset of 10,000 conditioning haplotypes by ranking reference haplotypes according to the number of discrepancies between each reference haplotype and the homozygous genotypes of the target sample.
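The ranking idea in this selection step can be sketched in a few lines. This is a toy illustration of the description above, not Eagle2's actual implementation: the 0/1 haplotype coding, the 0/1/2 genotype coding and both function names are assumptions.

```python
def count_discrepancies(haplotype, genotypes):
    """Count sites where the target is homozygous and the haplotype carries the other allele.

    haplotype: sequence of 0/1 alleles.
    genotypes: sequence of 0/1/2 alternate-allele counts (None for missing).
    Heterozygous and missing sites are uninformative here and are skipped.
    """
    mismatches = 0
    for allele, genotype in zip(haplotype, genotypes):
        if genotype == 0 and allele == 1:
            mismatches += 1
        elif genotype == 2 and allele == 0:
            mismatches += 1
    return mismatches

def select_conditioning_haplotypes(reference_haplotypes, genotypes, k=10_000):
    """Return the k reference haplotypes with the fewest discrepancies."""
    ranked = sorted(reference_haplotypes, key=lambda h: count_discrepancies(h, genotypes))
    return ranked[:k]

target = [0, 2, 1, 0]
reference = [(1, 0, 0, 1), (0, 1, 1, 0), (0, 0, 0, 0)]
print(select_conditioning_haplotypes(reference, target, k=2))
```

Only homozygous sites are counted because a heterozygous genotype is compatible with either allele on a given haplotype.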

Generation of HapHedge data structure. Eagle2 next generates a HapHedge data structure on the selected conditioning haplotypes. The HapHedge encodes a sequence of haplotype prefix trees (i.e., binary trees on haplotype prefixes) rooted at a sequence of starting positions along the chromosome, thus enabling fast lookup of haplotype frequencies.

Exploration of the diplotype space. Having prepared a HapHedge of conditioning haplotypes, Eagle2 performs phasing using an HMM. The HapHedge consolidates reference haplotypes sharing common prefixes, reducing computation.

Step 1 - How to phase. Data is usually broken into manageable chunks (~20Mb), each phased independently:
./eagle --vcfRef HRC.r1-1.GRCh37.chr20.shapeit3.mac5.aa.genotypes.bcf \
  --vcfTarget chunk_20_0000000001_0020000000.vcf.gz \
  --geneticMapFile genetic_map_chr20_combined_b37.txt \
  --outPrefix chunk_20_0000000001_0020000000.phased \
  --bpStart 1 --bpEnd 25000000 \
  --chrom 20 \
  --allowRefAltSwap
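The chunk file names in this command follow a zero-padded pattern, which makes the chunk boundaries easy to generate. This is a sketch under assumptions: the function name is made up, the chromosome length used in the example is a placeholder rather than a real length, and no flanking buffer is modelled (the Eagle command passes a --bpEnd beyond the chunk end, which suggests one is used in practice).

```python
def chunk_regions(chrom, chrom_length, chunk_size=20_000_000):
    """Yield (name, start, end) for consecutive 1-based, inclusive chunks."""
    start = 1
    while start <= chrom_length:
        end = min(start + chunk_size - 1, chrom_length)
        # Positions are zero-padded to 10 digits, as in the file names above.
        yield f"chunk_{chrom}_{start:010d}_{end:010d}", start, end
        start = end + 1

# 45,000,000 is a placeholder length chosen only to show a short final chunk.
for name, start, end in chunk_regions(20, 45_000_000):
    print(name, start, end)
```

Each yielded name can then be used to build the per-chunk input and output paths for the phasing and imputation commands.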

Step 2 - Imputation. Compares the phased data to the references and infers the missing genotypes. Calculates accuracy metrics.
./Minimac3 --refHaps HRC.r1-1.GRCh37.chr1.shapeit3.mac5.aa.genotypes.m3vcf.gz \
  --haps chunk_1_0000000001_0020000000.phased.vcf \
  --start 1 --end 20000000 --window 500000 \
  --prefix chunk_1_0000000001_0020000000 \
  --chr 1 \
  --noPhoneHome \
  --format GT,DS,GP \
  --allTypedSites

Imputing in minimac

Output Info files

Output: 3 main genotype output formats. Probs format (probability of AA, AB and BB genotypes for each SNP). Hard call or best guess (output as A, C, T or G allele codes). Dosage data (most common: 1 number per SNP, ranging from 0 to 2).
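The three formats are closely related: both the dosage and the hard call can be derived from the genotype probabilities. A minimal sketch, assuming the dosage counts copies of the B allele and the hard call is simply the most probable genotype (the function names are illustrative):

```python
def dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles: 0*P(AA) + 1*P(AB) + 2*P(BB)."""
    return p_ab + 2.0 * p_bb

def hard_call(p_aa, p_ab, p_bb, labels=("AA", "AB", "BB")):
    """Best-guess genotype: the one with the highest probability."""
    probs = (p_aa, p_ab, p_bb)
    return labels[probs.index(max(probs))]

print(dosage(0.1, 0.2, 0.7), hard_call(0.1, 0.2, 0.7))
```

The dosage keeps the uncertainty that the hard call throws away, which is one reason it is the most commonly analysed format.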

Assessing accuracy of phasing. All phasing methods will make errors in the estimation of haplotypes; the probability of error increases with the length of the imputed region. Problem: some programs are designed to run on small segments, others on whole chromosomes. EAGLE2 is currently considered the best. More work is needed that compares like with like.

Timing & Memory from Das et al 2016 Prior to EAGLE2

Accuracy of imputation methods

Before we move on to talking about post imputation QC… Questions?

Post imputation QC. After imputation you need to check that it worked and that the data look OK. Things to check: plot r2 across each chromosome and look to see where it drops off; plot MAF-reference MAF.

Post imputation QC (see meta-analysis session). For each chromosome check N and % of SNPs: with MAF <.5%; with r2 0-.3, .3-.6, .6-1; and, if you have hard calls or probs data, with HWE P < 10E-6. If you have families, convert to hard calls and check for Mendelian errors (should be ~.2%). The % should be roughly constant across chromosomes.
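The per-chromosome tallies can be produced with a small summary function. This is a sketch only: it assumes the MAF and r2 values have already been read from the imputation info files into (maf, r2) pairs, and the function names and output keys are hypothetical.

```python
from collections import Counter

def r2_bin(r2):
    """Assign an imputation r2 value to one of the three bins 0-.3, .3-.6, .6-1."""
    if r2 < 0.3:
        return "0-.3"
    if r2 < 0.6:
        return ".3-.6"
    return ".6-1"

def summarise_chromosome(variants, maf_cutoff=0.005):
    """Summarise (maf, r2) pairs for one chromosome as a count and percentages."""
    variants = list(variants)
    n = len(variants)
    if n == 0:
        return {"n": 0}
    summary = {
        "n": n,
        "pct_maf_below_cutoff": 100.0 * sum(maf < maf_cutoff for maf, _ in variants) / n,
    }
    bins = Counter(r2_bin(r2) for _, r2 in variants)
    for label in ("0-.3", ".3-.6", ".6-1"):
        summary["pct_r2_" + label] = 100.0 * bins[label] / n
    return summary

print(summarise_chromosome([(0.001, 0.2), (0.2, 0.5), (0.3, 0.9), (0.4, 0.95)]))
```

Running this once per chromosome and comparing the percentages side by side is a quick way to spot a chromosome where imputation went wrong.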

Post imputation QC (see meta-analysis session). Next run GWAS for a trait, ideally continuous; calculate lambda and plot: QQ; Manhattan; SE vs N; P vs Z. Run the same trait on the observed genotypes and plot imputed vs observed.
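Lambda is usually computed as the median observed test statistic divided by its expected median under the null. A minimal sketch, assuming the GWAS output provides one Z score per variant and that the median-based definition of lambda is the one intended (the function name is illustrative):

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_lambda(z_scores):
    """Genomic inflation factor: median observed chi-square over the null median."""
    return median(z * z for z in z_scores) / CHI2_1DF_MEDIAN

print(genomic_lambda([0.1, -0.6744898, 2.0]))
```

A value near 1 suggests the test statistics are not inflated; comparing lambda between the imputed and observed-genotype runs is a simple consistency check.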

However, if you are running analyses for a consortium they will probably ask you to analyse all variants regardless of whether they pass QC or not… (If you are setting up a meta-analysis consider allowing cohorts to ignore variants with MAF <.5% and low r2 – it will save you a lot of time)

Issue – the r2 metrics differ between imputation programs

In general there is fairly close correlation between rsq / ProperInfo / allelic Rsq. 1 = no uncertainty; 0 = complete uncertainty. .8 on 1000 individuals = the amount of data at the SNP is equivalent to a set of perfectly observed genotype data in a sample size of 800 individuals. Note: Mach uses an empirical Rsq (observed var/exp var), which can go above 1.
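The "observed var/exp var" ratio and the effective sample size interpretation can both be illustrated directly. This is a sketch under an assumption: the expected variance is taken to be the Hardy-Weinberg value 2p(1-p), with p estimated from the mean dosage. The function names are illustrative and this is not the code used by any of the imputation programs.

```python
from statistics import fmean, pvariance

def empirical_rsq(dosages):
    """Observed dosage variance divided by the variance expected under HWE, 2p(1-p)."""
    p = fmean(dosages) / 2.0
    expected = 2.0 * p * (1.0 - p)
    if expected == 0.0:
        return float("nan")  # monomorphic SNP: the ratio is undefined
    return pvariance(dosages) / expected

def effective_sample_size(n, rsq):
    """Sample size of perfectly observed genotypes carrying the same information."""
    return n * rsq

print(empirical_rsq([0.0, 1.0, 1.0, 2.0]))   # genotypes known with certainty
print(empirical_rsq([1.0, 1.0, 1.0, 1.0]))   # every dosage equals the mean
print(empirical_rsq([0.0, 0.0, 2.0, 2.0]))   # excess homozygotes push the ratio above 1
print(effective_sample_size(1000, 0.8))
```

The third example shows why an empirical ratio is not bounded by 1: nothing forces the observed dosage variance to stay below its Hardy-Weinberg expectation.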

Bad Imputation Better Imputation Good Imputation

Choices of analysis methods