P.M. VanRaden and D.M. Bickhart Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA

Presentation transcript:

(1) Fast Single-Pass Alignment and Variant Calling Using Sequencing Data
P.M. VanRaden and D.M. Bickhart
Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA
Plant & Animal Genome, San Diego, California, January 9-11, 2016

(2) Motivation
- Genomic methods require much computation
  - Genotype models replaced pedigree models
  - Sequence variants replacing chip genotypes
  - Both increased data by orders of magnitude
- Fast methods are available for imputation
- Slow methods for alignment, variant calling

(3) Sequence computation
- Alignment reports the chromosome location where a short (150-base) DNA segment best matches the reference map (2.7 billion bases).
- Often both ends of a longer segment are read, and these paired ends are located together.
- Variant calling reports whether each mapped segment contains the reference or the alternate allele at any site. These variants could be previously known or newly discovered.
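The variant-calling step for known sites can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `call_alleles`, the toy reference, and the site list are assumptions made for illustration, and the read is assumed to be already aligned without gaps.

```python
def call_alleles(reference, read, start, known_sites):
    """Classify a read's base at each known variant site it covers.

    reference   : reference sequence (string); kept only for context here
    read        : read sequence, assumed aligned without gaps
    start       : 0-based reference position of the read's first base
    known_sites : dict of 0-based position -> (ref_allele, alt_allele)
    """
    calls = {}
    for pos, (ref_allele, alt_allele) in known_sites.items():
        offset = pos - start
        if 0 <= offset < len(read):          # site is covered by this read
            base = read[offset]
            if base == ref_allele:
                calls[pos] = "reference"
            elif base == alt_allele:
                calls[pos] = "alternate"
            else:
                calls[pos] = "error"         # neither allele: likely a read error
    return calls


# Hypothetical toy data: a 20-base reference and two known SNP sites.
reference = "TGGATTCTTTATCACTGAGC"
known_sites = {8: ("T", "C"), 15: ("T", "G")}

ref_read = reference[4:14]      # carries the reference allele at position 8
alt_read = "TTCTCTATCA"         # same segment with the alternate allele C
err_read = "TTCTGTATCA"         # same segment with a base matching neither allele

print(call_alleles(reference, ref_read, 4, known_sites))
print(call_alleles(reference, alt_read, 4, known_sites))
print(call_alleles(reference, err_read, 4, known_sites))
```

Site 15 is never reported because none of the 10-base reads starting at position 4 reach it.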

(4) Previous strategies
- Almost all programs do alignment, and then variant calling, instead of both together.
- Program BWA was examined as a popular alignment strategy, and GATK for calling.
  - “Mapping reads to the reference is a first critical computational challenge whose cost necessitates that each read be aligned independently, guaranteeing that many reads spanning indels will be misaligned.” DePristo et al. (2011), GATK paper

(5) Benefits of new strategy
- Most programs align only to Dominette’s DNA
- Findmap can align using all known DNA differences among and within breeds
- Error rate is reduced by separating known SNPs and indels from machine read errors
- Locations are mapped back to the same common reference (UMD3.1)
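One way to picture aligning with known variants while still reporting reference coordinates is sketched below. This is an illustrative assumption, not findmap's actual data structure: the toy k-mer length, the function `build_index`, and the SNP at position 10 are hypothetical, and only SNPs (no indels, one alternate allele per k-mer) are handled.

```python
from collections import defaultdict

K = 8  # toy k-mer length chosen for this example


def build_index(reference, known_snps, k=K):
    """Map each k-mer to the set of reference positions where it starts.

    known_snps : dict of 0-based position -> alternate base.
    A k-mer carrying an alternate allele is stored under the same
    reference coordinate as the reference k-mer it replaces, so a read
    with a known variant still maps back to the common reference.
    """
    index = defaultdict(set)
    for start in range(len(reference) - k + 1):
        kmer = reference[start:start + k]
        index[kmer].add(start)
        for pos, alt in known_snps.items():
            if start <= pos < start + k:
                offset = pos - start
                alt_kmer = kmer[:offset] + alt + kmer[offset + 1:]
                index[alt_kmer].add(start)
    return index


# The short example sequence is taken from the hashing slide; the SNP is hypothetical.
reference = "TGGATTCTTTATCACTGAGCTACCTGGGAAGCCAAGTAAGC"
known_snps = {10: "G"}          # reference base at position 10 is A

reference_only = build_index(reference, {})
with_variants = build_index(reference, known_snps)

alt_read_kmer = "CTTTGTCA"      # reference[6:14] with G instead of A at position 10
print(alt_read_kmer in reference_only)       # exact lookup fails without known variants
print(sorted(with_variants[alt_read_kmer]))  # found at the reference coordinate
```

With the variant-aware index, a read carrying the known alternate allele matches exactly, so the difference does not have to be treated as a machine read error.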

(6) Algorithm used in findmap.f90
- Read reference map, store in hash table
- Read and hash known variants (SNPs & indels)
- Process batches of 1 million paired end reads, send to multiple processors sharing memory
  - Find location where both ends match map
  - Count alleles (reference, alternate) & errors
- Output alignment and variant call files
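The batch-and-place part of these steps can be sketched in a few lines. This is a toy under stated assumptions, not a port of findmap.f90: all names (`build_index`, `place_pair`, `align_reads`), the 8-base seed, and the 40-base fragment limit are hypothetical; matching is exact only; and Python threads stand in for processors sharing memory.

```python
from concurrent.futures import ThreadPoolExecutor

K = 8               # toy seed length for this example
MAX_FRAGMENT = 40   # toy limit on the span of a read pair

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def build_index(reference, k=K):
    """Store every reference k-mer with its start positions (a dict as hash table)."""
    index = {}
    for start in range(len(reference) - k + 1):
        index.setdefault(reference[start:start + k], []).append(start)
    return index


def place_pair(pair, reference, index):
    """Return (pos1, pos2) where both ends match the reference exactly, else None."""
    read1, read2 = pair
    mate = reverse_complement(read2)   # second end is read from the opposite strand
    for pos1 in index.get(read1[:K], []):
        if reference[pos1:pos1 + len(read1)] != read1:
            continue
        for pos2 in index.get(mate[:K], []):
            if reference[pos2:pos2 + len(mate)] != mate:
                continue
            if pos2 >= pos1 and pos2 + len(mate) - pos1 <= MAX_FRAGMENT:
                return pos1, pos2
    return None


def process_batch(batch, reference, index):
    return [place_pair(pair, reference, index) for pair in batch]


def align_reads(pairs, reference, batch_size=2, workers=2):
    index = build_index(reference)     # shared, read-only after construction
    batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: process_batch(b, reference, index), batches)
    return [placement for batch in results for placement in batch]


reference = "TGGATTCTTTATCACTGAGCTACCTGGGAAGCCAAGTAAGC"
pairs = [
    (reference[2:14], reverse_complement(reference[25:37])),
    ("ACGTACGTACGT", "ACGTACGTACGT"),                        # matches nowhere
    (reference[0:12], reverse_complement(reference[20:32])),
]
print(align_reads(pairs, reference))
```

Requiring both ends to match within a bounded distance is what lets paired ends be "located together"; a real aligner would also tolerate mismatches and indels, which this sketch does not.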

(7) Gaps, k-mers, and hashing strategy
- Identify long gaps between the same base (A, C, G, or T)
  TGGATTCTTTATCACTGAGCTACCTGGGAAGCCAAGTAAGC
- Extend each gap to a 16-base k-mer, convert to an 8-byte integer:
  Basenum (1, 2, 3, 4) = Base (A, C, G, T)
  Hashnum = 4 * Hashnum + Basenum, loop across 16-base k-mer
- Apply hash function (written by George Wiggans, 1988, USDA)
- Hash map, then hash each read (or its reverse complement)
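The k-mer-to-integer conversion on this slide can be written out directly. The sketch below follows the stated recurrence, Hashnum = 4 * Hashnum + Basenum; everything else is an assumption for illustration: the gap-selection step and the 1988 hash function are not reproduced, every reference position is indexed instead, and a Python dict stands in for the hash table.

```python
BASENUM = {"A": 1, "C": 2, "G": 3, "T": 4}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def encode_kmer(kmer):
    """Convert a k-mer to an integer: Hashnum = 4 * Hashnum + Basenum."""
    hashnum = 0
    for base in kmer:
        hashnum = 4 * hashnum + BASENUM[base]
    return hashnum


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def index_reference(reference, k=16):
    """Map each encoded k-mer to the reference positions where it starts."""
    table = {}
    for start in range(len(reference) - k + 1):
        table.setdefault(encode_kmer(reference[start:start + k]), []).append(start)
    return table


def lookup(read, table, k=16):
    """Try the read's first k bases as given, then its reverse complement."""
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        hits = table.get(encode_kmer(seq[:k]), [])
        if hits:
            return strand, hits
    return None


reference = "TGGATTCTTTATCACTGAGCTACCTGGGAAGCCAAGTAAGC"
table = index_reference(reference)

print(encode_kmer("A" * 16))    # 1431655765
print(encode_kmer("T" * 16))    # 5726623060, well within an 8-byte integer
print(lookup(reference[5:25], table))
print(lookup(reverse_complement(reference[5:25]), table))
```

Because each base contributes a digit from 1 to 4, distinct 16-base k-mers always receive distinct integers, and the largest value still fits comfortably in 8 bytes.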

(8) Semi-simulated data
- Simulated from UMD3.1 reference map
- Variant file from run5 of 1,000 bull genomes
  - 38,062,190 SNPs, 532,179 insertions, and 1,127,620 deletions
- Paired ends, length 150, fragment size 1,000
- Advantage of semi-simulated: true locations and true variants are known
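A toy version of semi-simulation shows why the truth is available for checking. This sketch is not map2seq.f90: the function `simulate_pair`, the 0.5 probability of carrying an alternate allele, and the toy read and fragment lengths are assumptions, and neither indels nor read errors are simulated.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pair(reference, snps, read_len, fragment_len, rng):
    """Draw one paired-end read and keep its true location and alleles.

    snps : dict of 0-based position -> alternate base. Each SNP inside the
    fragment receives the alternate allele with probability 0.5, which is
    an arbitrary choice for this example.
    """
    start = rng.randrange(len(reference) - fragment_len + 1)
    fragment = list(reference[start:start + fragment_len])
    true_alleles = {}
    for pos, alt in snps.items():
        if start <= pos < start + fragment_len:
            carries_alt = rng.random() < 0.5
            true_alleles[pos] = "alternate" if carries_alt else "reference"
            if carries_alt:
                fragment[pos - start] = alt
    fragment = "".join(fragment)
    return {
        "read1": fragment[:read_len],                        # forward end
        "read2": reverse_complement(fragment[-read_len:]),   # opposite-strand end
        "start": start,                                      # true location
        "true_alleles": true_alleles,                        # true variants
    }


reference = "TGGATTCTTTATCACTGAGCTACCTGGGAAGCCAAGTAAGC"
rng = random.Random(2016)       # fixed seed so the example is repeatable
pair = simulate_pair(reference, {10: "G"}, read_len=10, fragment_len=30, rng=rng)
print(pair)
```

An aligner's reported position can then be compared with `start`, and its allele calls with `true_alleles`, which is the advantage the slide attributes to semi-simulated data.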

(9) Compare BWA and findmap

Computation required                  BWA / GATK    findmap / findvar
CPU minutes / 1X, 1 thread            629           11
CPU minutes / 1X, 10 threads          63            2
Memory (Gbytes), 1 thread             [values missing]
Memory (Gbytes), 10 threads           46 [values run together]
Variant calling CPU / 1X, 1 thread    2011 [values run together]

Accuracy                              BWA / GATK    findmap / findvar
Correctly placed segments overall     90.5%         92.9%
Both ends of pair correctly placed    87.2%         87.6%
Both ends wrong                       6.2%          1.8%

(10) Parallel processing speedup
[Figure]

(11) Program series – example resources

Task (per animal)                           Program         Threads   Minutes
Simulate 10X data                           map2seq.f90     10        5
Align 10X data and call previous variants   findmap.f       [values missing]
Sum new variants                            findvar.f90     1         8
Imputation (39 million)                     findhap4.f90    10        1

(12) Accuracy of variant calling / discovery

Known variants (%)            SNP        Insertion   Deletion
Correct reference allele      [values missing]
Correct alternate allele      [values missing]
Call rate (paired ends ok)    [values missing]

New variants (Homo / het)     SNP        Insertion   Deletion
Correctly detected (10X)      91 / 63    54 / 37     41 / 27
Falsely detected (10X)        10178 [values run together]

(13) Other alignment tests
- Perfectly random genome, non-repetitive
  - Over 99.9% correctly aligned
- RepeatMasker and BWA
  - Took 4.4 instead of 14.1 hours / 1X
  - Only 45% correctly aligned instead of 91%
- Human genome gave results similar to cattle

(14) File format sizes (Mbytes)

Unzipped / zipped file sizes                      BWA, GATK           Findmap, Findvar
Input data: Sequence reads / 1X (fastq)           6000 / 1800
Output data: Binary alignment file / 1X (.bam)    3200 / [missing]    [missing] / 360
Called genotypes / animal (.vcf)                  1000 / 38           79 / 13

(15) USA use of 1,000 bull genomes
- Sequence genotypes from 440 Holsteins
- Imputed for 27,000 reference bulls
- 700,000 candidate loci + 300,000 HD SNPs
- Largest 17K added to 60K routinely used
- Average gain of 2.7% reliability across traits
- Largest 5K added to next Zoetis chip

(16) Largest genomic databases

Columns: 23andMe, Ancestry.com, CDCB / USDA
Genotypes: >1 million, 1.2 million
Species: Human, Cattle
Countries: 5549 [values run together]
Genotyping cost: $199, $99, $
Delivery (weeks): [values missing]
DNA generations: Few, Many
EBV reliability: Low, High
Reference: Web sites:

(17) Conclusions
- Program findmap.f90 uses known variants
  - Alignment is 50X faster than BWA with 1 processor, 30X faster with 10 processors
  - 2% more segments are mapped correctly
  - Output files are simpler and 3-10X smaller
- Simulation, alignment, variant calling, and imputation programs available from:

(18) Acknowledgments
- This research was part of USDA-ARS project , “Improving Genetic Predictions in Dairy Animals Using Phenotypic and Genomic Information.”
- Jeff O’Connell provided much advice on alignment and variant calling methods.
- The reference map was from U. Maryland.
- The variant list was from Daetwyler et al.