Fast Imputation Using Medium- or Low-Coverage Sequence Data

Paul VanRaden and Chuanyu Sun
Animal Genomics and Improvement Lab, USDA-ARS, Beltsville, MD, USA
National Association of Animal Breeders, Columbia, MO, USA

Paul VanRaden, 10th World Congress Genetics Applied Livest. Prod., Vancouver, Canada, August 19, 2014

Topics

- Cost of chip vs. sequence data
  - Chips: Nonlinear increase with SNP density
  - Sequence: Linear increase with read depth
- Imputation methods for sequence data
  - Few programs designed for low read depth
- Value of including HD chip in sequence data

Analysis of chip vs. sequence data

Chip data                   Sequence data
Genotypes are observed      Genotype probabilities
AA, AB, BB (2, 1, 0)        Counts of A, counts of B
Exact data, SNP subset      Approximate data, all SNP
Impute only missing data    Impute all genotypes
3K, 6K, 50K, 77K, 777K      30 million SNPs + CNVs
Error rate < 0.05%          Error rate 0.5% to 10%
Computation important       Computation is crucial

Imputation algorithm (findhap v4)

- Prior allele probabilities = population frequency
- Compute Prob(nA, nB | genotypes, errate)
- Test ancestor haplotype likelihoods first
- Find most likely 2 haplotypes from library
- Compute haplotype posteriors from priors
- Test long, then medium, then short segments
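
The step Prob(nA, nB | genotypes, errate) can be illustrated for a single SNP. The sketch below is not findhap's code: it assumes a binomial read model, a symmetric error rate, and Hardy-Weinberg genotype priors derived from the population allele frequency, and it works on genotypes rather than on the haplotype library that findhap uses. The function names are hypothetical.

```python
from math import comb


def count_likelihoods(n_a, n_b, err):
    """Prob(n_a, n_b | genotype) for genotypes coded 2 (AA), 1 (AB), 0 (BB).

    Assumption: each read shows allele A with probability 1 - err for AA,
    0.5 for AB, and err for BB (symmetric read error).
    """
    n = n_a + n_b
    like = {}
    for geno, p_a in ((2, 1.0 - err), (1, 0.5), (0, err)):
        like[geno] = comb(n, n_a) * p_a ** n_a * (1.0 - p_a) ** n_b
    return like


def genotype_posteriors(n_a, n_b, freq_a, err):
    """Combine priors from allele frequency with the count likelihoods.

    Assumption: Hardy-Weinberg proportions give the genotype priors.
    """
    prior = {
        2: freq_a ** 2,
        1: 2.0 * freq_a * (1.0 - freq_a),
        0: (1.0 - freq_a) ** 2,
    }
    like = count_likelihoods(n_a, n_b, err)
    joint = {g: prior[g] * like[g] for g in prior}
    total = sum(joint.values())
    if total == 0.0:
        # Counts are impossible under the priors and error rate; keep priors.
        return prior
    return {g: joint[g] / total for g in joint}


if __name__ == "__main__":
    # Three A reads and no B reads at a SNP with allele frequency 0.5.
    print(genotype_posteriors(3, 0, 0.5, 0.01))
```

With only three reads, the heterozygote keeps a sizeable posterior probability even though no B allele was observed, which is why low read depth needs haplotype information from relatives and the population to resolve genotypes.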

Data sets and imputation tests

Data category / parameter      Levels tested
Simulated sequenced bulls      250, 500, 1,000, 10,000
Read depths                    1, 2, 4, 8, 16
Error rates                    0%, 1%, 4%, 16%
Include HD chip in sequence    Yes or no
SNPs in sequence and HD        30 million and 600,000
Human chromosome 22            1,102 actual genomes
SNPs in sequence and HD        394,724 and 39,440

Computation required

- Bulls: 250 sequenced HD, 1 chromosome
- Time (10 processors): findhap 10 min, BeagleV4 3 days
- Memory: findhap 5 Gbytes, Beagle <5 Gbytes
- Input data: findhap 0.5 Gbytes, Beagle 5 Gbytes
  - findhap: 2 bytes / SNP [A, B counts stored as hexadecimal]
  - Beagle: 20 bytes / SNP [Prob(AA), Prob(AB), Prob(BB)]
- Output data: findhap 1 byte vs. Beagle 20 bytes / SNP
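
The 2 bytes / SNP input format can be sketched as one hexadecimal character per allele count. This is an illustration rather than the actual findhap file layout: capping each count at 15 so it fits one hex digit is an assumption, and the function names are hypothetical. The size check assumes about 1 million SNPs on the chromosome, a figure chosen only because it reproduces the 0.5 and 5 Gbyte input sizes quoted above for 250 bulls.

```python
def encode_counts(n_a, n_b):
    """Pack A and B read counts into 2 characters, one hex digit each.

    Assumption: counts above 15 are capped at 15 ('f').
    """
    return format(min(n_a, 15), "x") + format(min(n_b, 15), "x")


def decode_counts(code):
    """Recover the (possibly capped) A and B counts from a 2-character code."""
    return int(code[0], 16), int(code[1], 16)


def input_bytes(n_animals, n_snps, bytes_per_snp):
    """Approximate input size, ignoring separators and animal identifiers."""
    return n_animals * n_snps * bytes_per_snp


if __name__ == "__main__":
    print(encode_counts(3, 12))                  # 3 A reads, 12 B reads
    print(decode_counts("3c"))
    print(input_bytes(250, 1_000_000, 2) / 1e9)  # Gbytes at 2 bytes / SNP
    print(input_bytes(250, 1_000_000, 20) / 1e9) # Gbytes at 20 bytes / SNP
```

Storing raw counts instead of three genotype probabilities is what gives the tenfold reduction in input size; the probabilities can be recomputed from the counts and the error rate when needed.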

Accuracy of Findhap vs. Beagle

                   Sequence + HD         Impute from HD
Program   Depth    Correct   Corr’n      Correct   Corr’n
Findhap   8X X X
Beagle    8X X X

… bulls had sequence + HD, 250 others were imputed from HD
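
The two accuracy measures used in these tables, percentage of correct genotypes and correlation of estimated with true genotypes, could be computed along the following lines. This is a minimal sketch with hypothetical function names; it assumes genotypes are coded 0, 1, 2, that an estimated genotype counts as correct when it rounds to the true value, and that the correlation is an ordinary Pearson correlation. The slides do not state the exact definitions used.

```python
from math import sqrt


def percent_correct(true_genos, est_genos):
    """Percentage of estimated genotypes that round to the true genotype."""
    matches = sum(1 for t, e in zip(true_genos, est_genos) if round(e) == t)
    return 100.0 * matches / len(true_genos)


def genotype_correlation(true_genos, est_genos):
    """Pearson correlation between true and estimated genotypes."""
    n = len(true_genos)
    mean_t = sum(true_genos) / n
    mean_e = sum(est_genos) / n
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_genos, est_genos))
    var_t = sum((t - mean_t) ** 2 for t in true_genos)
    var_e = sum((e - mean_e) ** 2 for e in est_genos)
    return cov / sqrt(var_t * var_e)


if __name__ == "__main__":
    true_genos = [0, 1, 2, 2]
    est_genos = [0.1, 1.0, 1.9, 1.2]   # made-up imputed dosages
    print(percent_correct(true_genos, est_genos))
    print(genotype_correlation(true_genos, est_genos))
```

The correlation rewards estimates that are close but not exact, so it is often the more informative of the two measures when imputed genotypes are fractional.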

Accuracy from HD for bulls * depth

Sequenced bulls   Depth   Total depth   Correct   Corr’n
250               8X      2,000X
…                 …X      2,000X
…,000             2X      2,000X
…,000             1X      10,000X

Sequences had 1% error, HD imputed using findhap

Accuracy including HD in sequence

            Sequenced bulls     Bulls with HD only
Read               HD in sequence?
depth       No       Yes        No       Yes
16X X X X X

Correlations of estimated with true genotypes for 500 bulls sequenced with 1% error and 250 bulls with HD only

Imputation from 10K, 60K, 1X, or 2X

Reference population is 500 bulls, 8X read depth, 1% error

Sequenced human read depth * error

         Correct genotypes %        Genotype correlation
Read     Error rate                 Error rate
depth    0%   1%   4%   16%         0%   1%   4%   16%
16X X X X X

… humans sequenced for 394,724 SNPs on chromosome 22

Software at

- Simulate genotypes (programs written 2007)
  - pedsim.f90, markersim.f90, genosim.f90
- Simulate A and B counts, Poisson plus error
  - geno2seq.f90
- Impute using haplotype likelihood ratios
  - findhap.f90 version 4
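
Simulating A and B counts as "Poisson plus error" might look like the sketch below. It is not a port of geno2seq.f90: it assumes the total number of reads at a SNP is Poisson with mean equal to the read depth, that each read samples one of the animal's two alleles at random, and that each read is then flipped to the other allele with probability equal to the error rate. The function names are hypothetical.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for small means such as read depths of 1 to 16.
    """
    limit = math.exp(-mean)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(genotype, depth, err, rng):
    """Simulate (n_a, n_b) read counts for one SNP.

    genotype is the number of A alleles: 2 (AA), 1 (AB), or 0 (BB).
    """
    n_a = n_b = 0
    for _ in range(poisson(depth, rng)):
        is_a = rng.random() < genotype / 2.0   # sample one of the two alleles
        if rng.random() < err:                 # read error flips the allele
            is_a = not is_a
        if is_a:
            n_a += 1
        else:
            n_b += 1
    return n_a, n_b


if __name__ == "__main__":
    rng = random.Random(2014)
    for genotype in (2, 1, 0):
        print(genotype, simulate_counts(genotype, 8, 0.01, rng))
```

At a read depth of 1, many SNPs receive zero reads and most heterozygotes show only one allele, which is the situation the low-coverage tests are designed to examine.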

Actual HD genotype correlations 2

Simulated HD correlations 2

Conclusions

- High read depth is expensive (linear cost)
- Low read depth requires additional math
  - Haplotype probabilities | (A B counts, error)
- Imputation improved with findhap version 4
  - Up to 400 times faster than Beagle
  - findhap more accurate for low coverage
- Some gain from including HD in sequence

Acknowledgments

- Jeff O’Connell and Derek Bickhart provided helpful advice on sequence analysis and software design and testing