Statistical Genomics Zhiwu Zhang Washington State University Lecture 7: Impute.



Administration
- Homework 2 posted; due Wednesday, Feb 17, 3:10 PM
- Midterm exam: Friday, February 26, 50 minutes (3:35-4:25 PM), 25 questions
- Final exam: May 3, 120 minutes (3:10-5:10 PM), 50 questions

Outline
- Why imputation
- How to impute
- Stochastic imputation
- KNN
- BEAGLE

Why imputation
- Most analyses do not allow missing data
- Increase marker density
- Enable meta-analyses across multiple studies
- Improve GWAS and genomic selection (GS)

Imputation improves density
- Coverage: 1X
- Missing rate: 38
- Imputed by KNN
- Filling rate: 97%
- Accuracy: 98%
- 3M SNPs remain
Huang et al. 2010, Nature Genetics

Example of meta-analysis
Fig. 5. Missing rate of SNPs. There were 21,455 SNPs on the Illumina array that was used to derive the predictive formula. About 40% of these SNPs were not present on the Affymetrix array that was used to genotype the dogs for independent validation (including the first and the third most influential SNPs on the Illumina array). The cumulative missing rates of SNPs are plotted against their order (descending log scale) based on their scaling factor.
Guo et al. Osteoarthritis Cartilage. 2011, 19(4): 420–429

Boost statistical power
Marchini et al. Nat Rev Genet Jul;11(7):

How to impute
- Fill with mean
- By major allele
- Stochastic imputation with allele frequency
- KNN
- Haplotype
- Many more
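The first two options in the list are simple enough to sketch in a few lines of R. The block below is a minimal illustration, not code from the lecture: the toy matrix G, the helper names impute_mean and impute_major, and the 0/1/2 coding with allele frequency taken as half the column mean are all assumptions made for the example. It also assumes every marker has at least one observed genotype.

```r
# Toy genotype matrix (hypothetical data): 5 individuals x 3 markers, coded 0/1/2
G <- matrix(c(0, 2, NA, 2, 2,
              2, NA, 0, 0, 0,
              1, 1, 2, NA, 2), nrow = 5, ncol = 3)

# Fill with mean: replace each NA by the mean of its marker (column)
impute_mean <- function(G) {
  for (j in seq_len(ncol(G))) {
    miss <- is.na(G[, j])
    G[miss, j] <- mean(G[, j], na.rm = TRUE)
  }
  G
}

# By major allele: replace each NA by the homozygote of the more frequent allele
impute_major <- function(G) {
  for (j in seq_len(ncol(G))) {
    miss <- is.na(G[, j])
    freq2 <- mean(G[, j], na.rm = TRUE) / 2  # frequency of the allele coded as "2"
    G[miss, j] <- if (freq2 >= 0.5) 2 else 0
  }
  G
}

G.mean <- impute_mean(G)
G.major <- impute_major(G)
```

Mean filling produces fractional genotypes, which is fine for regression-based analyses but not for methods that expect discrete codes; major-allele filling keeps the codes discrete at the cost of biasing allele frequencies toward the major allele.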

Stochastic imputation with allele frequency
For an inbred line with alleles A or B, let the frequency of A be f(A). If x follows the uniform distribution U(0,1), a missing allele N can be imputed as: N = A if x < f(A), otherwise N = B.
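A quick way to see why this works is to simulate it. The sketch below assumes the missing allele is set to A when x < f(A) and to B otherwise, which is the assignment the StochasticImpute function later in the lecture uses for genotypes 2 and 0. The value of fA and the number of missing alleles are hypothetical.

```r
set.seed(1)            # reproducible illustration
fA <- 0.7              # hypothetical frequency of allele A
n.missing <- 10000     # hypothetical number of missing alleles
x <- runif(n.missing)  # x ~ U(0,1)

# Assign A with probability f(A), B otherwise
imputed <- ifelse(x < fA, "A", "B")

# The share of A among imputed alleles should be close to f(A)
prop.A <- mean(imputed == "A")
prop.A
```

Because each draw is independent, the imputed alleles reproduce the allele frequency on average but carry no information about the individual, so per-genotype accuracy is expected to be low.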

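Data and uniform distribution

#Import data
myGD=read.table(file=" APIT/data/mdp_numeric.txt",header=TRUE)
X.raw=myGD[,-1] #drop the first column (individual IDs)
X=X.raw
#Set missing values
mr=.2 #missing rate
n=nrow(X) #number of individuals
m=ncol(X) #number of markers
dp=m*n #total data points
uv=runif(dp) #one uniform random number per data point
hist(uv)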

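Missing value simulation

missing=uv<mr #TRUE for a proportion of about mr of the data points
length(missing)
missing[1:10]
index.m=matrix(missing,n,m) #same shape as X
dim(index.m)
X[index.m]=NA #mask the selected genotypes
X.raw[1:5,1:5]
X[1:5,1:5]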

Missing value imputation

#Define StochasticImpute function
StochasticImpute=function(X){
  n=nrow(X)
  m=ncol(X)
  fn=colSums(X, na.rm=TRUE) #sum of genotypes for all individuals
  fc=colSums(floor(X/3+1),na.rm=TRUE) #count of non-missing individuals (0, 1 and 2 all map to 1)
  fa=fn/(2*fc) #frequency of allele "2"
  for(i in 1:m){
    index.a=runif(n)<fa[i] #TRUE with probability fa[i]
    index.na=is.na(X[,i])
    index.m2=index.a & index.na
    index.m0=!index.a & index.na
    X[index.m2,i]=2
    X[index.m0,i]=0
  }
  return(X)
}

Two types of imputation accuracy

#Impute
XI=StochasticImpute(X)
#Correlation between true and imputed genotypes at the masked positions
accuracy.r=cor(X.raw[index.m], XI[index.m])
#Proportion of match at the masked positions
index.match=X.raw==XI
index.mm=index.match&index.m
accuracy.m=length(X[index.mm])/length(X[index.m])
accuracy.r
accuracy.m
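The two measures answer different questions, and a small hand-made example shows them side by side. The vectors below are hypothetical and only serve to illustrate the calculation; `mean(truth == imputed)` is used as a shorthand for the proportion of matches.

```r
# Hypothetical true and imputed genotypes at eight masked positions
truth   <- c(0, 2, 2, 0, 2, 0, 1, 2)
imputed <- c(0, 2, 0, 0, 2, 0, 2, 2)

# Correlation: how well the imputed values track the true values
accuracy.r <- cor(truth, imputed)

# Proportion of match: share of positions imputed exactly right
accuracy.m <- mean(truth == imputed)

accuracy.r
accuracy.m
```

Six of the eight positions agree, so the proportion of match is 0.75. An imputation that only ever assigns 0 or 2, as StochasticImpute does, can never match a true heterozygote (1), yet such a position still contributes to the correlation.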

Replication

nrep=100
myimp=replicate(nrep,{
  X=X.raw #start each replicate from the complete data
  uv=runif(dp)
  missing=uv<mr
  index.m=matrix(missing,n,m)
  X[index.m]=NA
  #Impute with StochasticImpute
  XI=StochasticImpute(X)
  #Calculate accuracy
  accuracy.r=cor(X.raw[index.m], XI[index.m])
  index.match=X.raw==XI
  index.mm=index.match&index.m
  accuracy.m=length(X[index.mm])/length(X[index.m])
  acc=c(accuracy.r, accuracy.m)
})
#Row 1: correlation, row 2: proportion of match
plot(myimp[1,],myimp[2,])

K Nearest Neighbors: vote
- One neighbor: green goes to blue
- Five neighbors: green goes to red
(Figure axes: Income, Education)

Predict income by regression
- One neighbor: income is estimated by the nearest neighbor
- Two neighbors: income is estimated as the average of the two nearest neighbors
- Regression is better than average
(Figure axes: Income, Education)
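The one-neighbor and two-neighbor estimates can be reproduced with a few lines of R. This is only a sketch: the education and income values are invented for illustration, and knn_predict is a hypothetical helper, not a function from any package. A linear regression fit is included for comparison with the neighbor averages.

```r
# Hypothetical training data: years of education and income (arbitrary units)
education <- c(8, 10, 12, 14, 16, 18)
income    <- c(20, 25, 32, 40, 55, 70)

# Predict income for a new person from the k nearest neighbors in education
knn_predict <- function(x.new, x, y, k) {
  d <- abs(x - x.new)              # distance along the single predictor
  nearest <- order(d)[seq_len(k)]  # indices of the k closest observations
  mean(y[nearest])                 # average of the neighbors' incomes
}

pred.k1 <- knn_predict(13.5, education, income, k = 1)  # nearest neighbor only
pred.k2 <- knn_predict(13.5, education, income, k = 2)  # average of two neighbors

# Regression alternative: fit a line and predict at the same point
fit <- lm(income ~ education)
pred.lm <- predict(fit, newdata = data.frame(education = 13.5))
```

With these numbers the nearest neighbor (14 years) gives 40, and averaging the two nearest (14 and 12 years) gives 36. The regression prediction uses all six observations rather than only the closest ones.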

Euclidean distance
- Vote: n=2 for education and income
- Predict income by education: n=2 for education and income
- Impute missing genotypes: n is the number of markers
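For two points p and q measured on n variables, the Euclidean distance is the square root of the sum over i = 1..n of (p_i - q_i)^2. For genotypes, the n variables are the markers. The sketch below applies this to a toy genotype matrix and imputes one missing genotype from its nearest neighbors. The data and the helpers euclid and knn_impute_one are hypothetical, and this is not the algorithm implemented in the "impute" package; it only illustrates the idea.

```r
# Euclidean distance between two individuals over the markers both have observed
euclid <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  sqrt(sum((a[ok] - b[ok])^2))
}

# Toy genotypes (hypothetical): 5 individuals (rows) x 4 markers (columns)
G <- rbind(c(0, 0, 2, NA),
           c(0, 0, 2, 2),
           c(0, 2, 2, 2),
           c(2, 2, 0, 0),
           c(2, 2, 0, 0))

# Impute G[i, j] from the k individuals closest to individual i
# among those with marker j observed
knn_impute_one <- function(G, i, j, k) {
  candidates <- setdiff(which(!is.na(G[, j])), i)
  d <- sapply(candidates, function(r) euclid(G[i, ], G[r, ]))
  nearest <- candidates[order(d)[seq_len(min(k, length(candidates)))]]
  round(mean(G[nearest, j]))  # neighbor average, rounded to a genotype code
}

g.hat <- knn_impute_one(G, i = 1, j = 4, k = 2)
g.hat
```

Individuals 2 and 3 are closest to individual 1 over the three shared markers, and both carry genotype 2 at marker 4, so the imputed value is 2.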

"impute" R package

#impute is a Bioconductor package; biocLite has been replaced by BiocManager
#install.packages("BiocManager")
#BiocManager::install("impute")
library(impute)
#Impute and calculate correlation
XI=StochasticImpute(X)
X.knn=impute.knn(as.matrix(t(X)), k=10) #after transposing, rows are markers
accuracy.r.si=cor(X.raw[index.m], XI[index.m])
accuracy.r.knn=cor(X.raw[index.m], t(X.knn$data)[index.m])
accuracy.r.si
accuracy.r.knn

BEAGLE  Java package  JDK required  First release: 2006  Current version: 4.1  Version used in class:  Multiple papers Brian Browning University of Washington Department of Medicine, Division of Medical Genetics Health Sciences Building, K-253 Box Seattle, WA Phone: (206) Fax: (206)

Input file

Output file

#Convert to BEAGLE input format
index0=X==0
index1=X==1
index2=X==2
indexna=is.na(X)
X2=X
X2[index0]="A\tA"
X2[index1]="A\tB"
X2[index2]="B\tB"
X2[indexna]="?\t?" #missing genotypes
myGD2=cbind("M",myGD[,1],X2)
setwd("/Users/Zhiwu/Dropbox/Current/ZZLab/WSUCourse/CROPS545/Demo")
write.table(myGD2,file="test.bgl",quote=FALSE,sep="\t",col.names=FALSE,row.names=FALSE)

Run BEAGLE
- Command line
- From R

#Impute with BEAGLE
system("java -Xmx12g -jar /Users/Zhiwu/Dropbox/Current/ZZLab/WSUCourse/CROPS545/Demo/Beagle/beagle.jar unphased=test.bgl missing=? out=test1")

Output of BEAGLE

Format conversion

#Convert output format
genotype.full <- read.delim("test1.test.bgl.phased.gz",sep=" ",header=TRUE)
genotype.c=as.matrix(genotype.full[,-(1:2)]) #drop the two leading label columns
index.A=genotype.c=="A"
index.B=genotype.c=="B"
nr=nrow(genotype.c)
nc=ncol(genotype.c)
genotype.n=matrix(0,nr,nc)
genotype.n[index.A]=0
genotype.n[index.B]=1
n2=ncol(genotype.n)
odd=seq(1,n2-1,2) #first allele of each pair
even=seq(2,n2,2) #second allele of each pair
g0=genotype.n[,odd]
g1=genotype.n[,even]
X.bgl=g0+g1 #numeric genotype: number of B alleles (0, 1 or 2)

Accuracy of BEAGLE

#Calculate correlation and proportion of match
accuracy.r=cor(X.raw[index.m], X.bgl[index.m])
index.match=X.raw==X.bgl
index.mm=index.match&index.m
accuracy.m=length(X[index.mm])/length(X[index.m])
accuracy.r
accuracy.m

Highlight
- Why imputation
- How to impute
- Stochastic imputation
- KNN
- BEAGLE