P.M. VanRaden and D.M. Bickhart Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA

(1) Fast Single-Pass Alignment and Variant Calling Using Sequencing Data
paul.vanraden@ars.usda.gov
Plant & Animal Genome, San Diego, California, January 9-11, 2016

(2) Motivation
- Genomic methods require much computation
  - Genotype models replaced pedigree models
  - Sequence variants are replacing chip genotypes
  - Both increased data by orders of magnitude
- Fast methods are available for imputation
- Slow methods for alignment, variant calling

(3) Sequence computation
- Alignment reports the chromosome location in the reference map (2.7 billion bases) that best matches each short (150-base) DNA segment.
- Often both ends of a longer segment are read, and these paired ends are located together.
- Variant calling reports whether each mapped segment contains a reference or alternate allele at any site. These variants could be previously known or newly discovered.

(4) Previous strategies
- Almost all programs do alignment and then variant calling, instead of both together.
- Program BWA was examined as a popular alignment strategy, and GATK for calling.
  - “Mapping reads to the reference is a first critical computational challenge whose cost necessitates that each read be aligned independently, guaranteeing that many reads spanning indels will be misaligned.” DePristo et al. (2011), GATK paper

(5) Benefits of new strategy
- Most programs align only to Dominette’s DNA
- Findmap can align using all known DNA differences among and within breeds
- Error rate is reduced by separating known SNPs and indels from machine read errors
- Locations are mapped back to the same common reference (UMD3.1)

(6) Algorithm used in findmap.f90
- Read reference map, store in hash table
- Read and hash known variants (SNPs & indels)
- Process batches of 1 million paired-end reads, send to multiple processors sharing memory
  - Find location where both ends match map
  - Count alleles (reference, alternate) & errors
- Output alignment and variant call files
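The steps on this slide can be illustrated with a toy sketch in Python. It is not findmap.f90 and makes several simplifying assumptions: an 8-base k-mer instead of 16, a Python dict in place of the hash table, SNPs only (no indels), both ends of a pair given in forward orientation, and serial processing rather than batches of 1 million reads spread across processors. All function names are hypothetical.

```python
from collections import defaultdict

K = 8  # toy k-mer length (assumption); the slides describe 16-base k-mers


def build_index(reference, snps):
    """Index every reference k-mer, plus k-mers carrying a known alternate allele."""
    index = defaultdict(set)
    last_start = len(reference) - K
    for start in range(last_start + 1):
        index[reference[start:start + K]].add(start)
    for pos, (_ref, alt) in snps.items():
        variant = reference[:pos] + alt + reference[pos + 1:]
        # every k-mer window that covers the SNP position
        for start in range(max(0, pos - K + 1), min(pos, last_start) + 1):
            index[variant[start:start + K]].add(start)
    return index


def place_read(read, reference, index):
    """Candidate starts: hits for the read's first k-mer where the whole read fits."""
    hits = index.get(read[:K], ())
    return sorted(s for s in hits if s + len(read) <= len(reference))


def align_pair(end1, end2, reference, index, max_fragment=1000):
    """Return the first (start1, start2) where both ends fall within one fragment."""
    for s1 in place_read(end1, reference, index):
        for s2 in place_read(end2, reference, index):
            span = max(s1 + len(end1), s2 + len(end2)) - min(s1, s2)
            if span <= max_fragment:
                return s1, s2
    return None


def count_alleles(read, start, reference, snps, counts):
    """Tally ref/alt alleles at known SNPs; return other mismatches as read errors."""
    errors = 0
    for offset, base in enumerate(read):
        pos = start + offset
        if pos in snps:
            ref, alt = snps[pos]
            if base == ref:
                counts[pos]["ref"] += 1
            elif base == alt:
                counts[pos]["alt"] += 1
            else:
                errors += 1
        elif base != reference[pos]:
            errors += 1
    return errors


def process_pairs(pairs, reference, snps):
    """Align each pair and count alleles in the same pass."""
    index = build_index(reference, snps)
    counts = defaultdict(lambda: {"ref": 0, "alt": 0})
    placed, errors = [], 0
    for end1, end2 in pairs:
        hit = align_pair(end1, end2, reference, index)
        placed.append(hit)
        if hit is not None:
            for read, start in zip((end1, end2), hit):
                errors += count_alleles(read, start, reference, snps, counts)
    return placed, dict(counts), errors


if __name__ == "__main__":
    reference = "TGGATTCTTTATCACTGAGCTACCTGGGAAGCCAAGTAAGC"  # sequence shown on slide 7
    snps = {12: ("C", "G")}  # hypothetical known SNP: position -> (ref, alt)
    pairs = [("TTATGACTGA", "GGGAAGCCAA"), ("TTATCACTGC", "GGGAAGCCAA")]
    print(process_pairs(pairs, reference, snps))
```

Indexing the alternate-allele k-mers alongside the reference k-mers is what lets a read that carries a known variant still find its location, and it is also what allows a mismatch at a known site to be counted as an allele rather than as a read error.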

(7) Gaps, k-mers, and hashing strategy
Identify long gaps between the same base (A, C, G, or T):
  TGGATTCTTTATCACTGAGCTACCTGGGAAGCCAAGTAAGC
Extend each gap to a 16-base k-mer and convert it to an 8-byte integer:
  Basenum (1, 2, 3, 4) = Base (A, C, G, T)
  Hashnum = 4 * Hashnum + Basenum, looping across the 16-base k-mer
Apply hash function (written by George Wiggans, 1988, USDA)
Hash the map, then hash each read (or its reverse complement)
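The k-mer encoding on this slide is simple enough to sketch in Python. The Basenum mapping and the Hashnum recurrence come from the slide. Everything else is an assumption for illustration: the 1988 hash function is not shown in the slides, so a plain modulo (`bucket`) stands in for it, and the gap-based choice of which k-mers to hash is not reproduced, so every k-mer is indexed. Function names are hypothetical.

```python
BASENUM = {"A": 1, "C": 2, "G": 3, "T": 4}  # Basenum (1, 2, 3, 4) = Base (A, C, G, T)
COMPLEMENT = str.maketrans("ACGT", "TGCA")
KMER_LENGTH = 16


def kmer_to_hashnum(kmer):
    """Hashnum = 4 * Hashnum + Basenum, looped across the k-mer."""
    hashnum = 0
    for base in kmer:
        hashnum = 4 * hashnum + BASENUM[base]
    return hashnum


def reverse_complement(seq):
    """Complement each base and reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def bucket(hashnum, table_size=1_000_003):
    """Placeholder hash function (assumption); the function used by findmap is not shown."""
    return hashnum % table_size


def index_kmers(sequence, k=KMER_LENGTH):
    """Map bucket -> list of (hashnum, start) for every k-mer in the sequence."""
    table = {}
    for start in range(len(sequence) - k + 1):
        hashnum = kmer_to_hashnum(sequence[start:start + k])
        table.setdefault(bucket(hashnum), []).append((hashnum, start))
    return table


def lookup(table, kmer):
    """Return start positions whose stored Hashnum equals that of the query k-mer."""
    hashnum = kmer_to_hashnum(kmer)
    return [start for stored, start in table.get(bucket(hashnum), []) if stored == hashnum]


if __name__ == "__main__":
    seq = "TGGATTCTTTATCACTGAGCTACCTGGGAAGCCAAGTAAGC"  # sequence shown on the slide
    table = index_kmers(seq)
    read = reverse_complement(seq[5:21])  # a read taken from the opposite strand
    # Try the read as given, then its reverse complement, as the slide describes.
    print(lookup(table, read) or lookup(table, reverse_complement(read)))
```

With base values 1 to 4, the largest 16-base Hashnum (sixteen Ts) is 5,726,623,060, which exceeds the range of a 4-byte integer but fits easily in 8 bytes, consistent with the 8-byte integer mentioned on the slide.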

(8) Semi-simulated data
- Simulated from UMD3.1 reference map
- Variant file from run5 of 1,000 bull genomes
  - 38,062,190 SNPs, 532,179 insertions, and 1,127,620 deletions
- Paired ends, length 150, fragment size 1,000
- Advantage of semi-simulated: true locations and true variants are known
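A minimal Python sketch of the semi-simulation idea follows. It is not the authors' simulator (map2seq.f90, listed on a later slide); it assumes SNPs only so that coordinates do not shift, draws fragments uniformly at random, and adds no machine read errors. The read length of 150 and fragment size of 1,000 come from the slide; the function names and the random toy reference are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base and reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def apply_snps(reference, snps):
    """Return a haplotype carrying the alternate allele at each listed position."""
    bases = list(reference)
    for pos, alt in snps.items():
        bases[pos] = alt
    return "".join(bases)


def simulate_pair(haplotype, read_len, fragment_size, rng):
    """Draw one fragment; return its true start and the two paired ends."""
    start = rng.randrange(len(haplotype) - fragment_size + 1)
    fragment = haplotype[start:start + fragment_size]
    end1 = fragment[:read_len]
    end2 = reverse_complement(fragment[-read_len:])  # second end read from the other strand
    return start, end1, end2


if __name__ == "__main__":
    rng = random.Random(2016)
    reference = "".join(rng.choice("ACGT") for _ in range(5000))  # toy stand-in for UMD3.1
    haplotype = apply_snps(reference, {100: "A", 2500: "T"})      # hypothetical known SNPs
    start, end1, end2 = simulate_pair(haplotype, 150, 1000, rng)
    print(start, end1[:20], end2[:20])
```

Because the true start of every fragment is recorded when it is drawn, any aligner's output can be scored directly against it, which is the advantage the slide points out.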

(9) Compare BWA and findmap

Computation required                  BWA / GATK   findmap / findvar
CPU minutes / 1X, 1 thread            629          11
CPU minutes / 1X, 10 threads          63           2
Memory (Gbytes), 1 thread             4.6          46
Memory (Gbytes), 10 threads           46 [column unclear in transcript]
Variant calling CPU / 1X, 1 thread    201          1

Accuracy                              BWA / GATK   findmap / findvar
Correctly placed segments overall     90.5%        92.9%
Both ends of pair correctly placed    87.2%        87.6%
Both ends wrong                       6.2%         1.8%
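The alignment figures on this slide can be turned into speed ratios with a few lines of Python. The numbers assume that the run-together cell "62911" in the transcript splits into 629 (BWA) and 11 (findmap), a reading that is consistent with the "50X faster" figure in the conclusions; the dictionary and function names are hypothetical.

```python
# CPU minutes per 1X of coverage, as read from the comparison slide (assumed cell splits).
CPU_MINUTES = {
    "1 thread": {"BWA": 629, "findmap": 11},
    "10 threads": {"BWA": 63, "findmap": 2},
}

# Percentage of segments placed correctly overall.
CORRECT_PCT = {"BWA": 90.5, "findmap": 92.9}


def ratio(slower, faster):
    """How many times larger the first figure is than the second."""
    return slower / faster


for threads, minutes in CPU_MINUTES.items():
    speedup = ratio(minutes["BWA"], minutes["findmap"])
    print(f"{threads}: findmap about {speedup:.1f}X faster than BWA")

gain = round(CORRECT_PCT["findmap"] - CORRECT_PCT["BWA"], 1)
print(f"Correctly placed segments: +{gain} percentage points")
```

Under that reading the ratios come to roughly 57 and 31.5, which the conclusions round to 50X and 30X.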

(10) Parallel processing speedup
[Figure not included in the transcript]

(11) Program series – example resources

Task (per animal)                           Program        Threads   Minutes
Simulate 10X data                           map2seq.f90    10        5
Align 10X data and call previous variants   findmap.f90    10        20
Sum new variants                            findvar.f90    1         8
Imputation (39 million)                     findhap4.f90   10        1

(12) Accuracy of variant calling / discovery

Known variants (%)            SNP       Insertion   Deletion
Correct reference allele      99.8      98.6        99.8
Correct alternate allele      99.8 99.9 [only two values survive; columns unclear]
Call rate (paired ends ok)    86.6      82.2        83.7

New variants (Homo / het)     SNP       Insertion   Deletion
Correctly detected (10X)      91 / 63   54 / 37     41 / 27
Falsely detected (10X)        10178 [column breaks lost in transcript]

(13) Other alignment tests
- Perfectly random genome, non-repetitive
  - Over 99.9% correctly aligned
- RepeatMasker and BWA
  - Took 4.4 instead of 14.1 hours / 1X
  - Only 45% correctly aligned instead of 91%
- Human genome gave results similar to cattle

(14) File format sizes (Mbytes)

Unzipped / zipped file sizes            BWA, GATK     Findmap, Findvar
Input data:
  Sequence reads / 1X (fastq)           6000 / 1800
Output data:
  Binary alignment file / 1X (.bam)     3200 / 3200   1200 / 360
  Called genotypes / animal (.vcf)      1000 / 38     79 / 13
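The size differences on this slide can be expressed as ratios in Python. The values assume that the run-together cells in the transcript split as 3200 / 3200 versus 1200 / 360 for alignment files and 1000 / 38 versus 79 / 13 for genotype files; that split is an inference, and the names below are hypothetical.

```python
# (unzipped, zipped) sizes in Mbytes, as read from the slide (assumed cell splits).
SIZES = {
    "alignment file / 1X": {"BWA, GATK": (3200, 3200), "Findmap, Findvar": (1200, 360)},
    "called genotypes / animal": {"BWA, GATK": (1000, 38), "Findmap, Findvar": (79, 13)},
}


def size_ratio(larger, smaller):
    """How many times larger the first file is than the second, to two decimals."""
    return round(larger / smaller, 2)


for name, by_program in SIZES.items():
    old_unzipped, old_zipped = by_program["BWA, GATK"]
    new_unzipped, new_zipped = by_program["Findmap, Findvar"]
    print(name,
          "unzipped:", size_ratio(old_unzipped, new_unzipped),
          "zipped:", size_ratio(old_zipped, new_zipped))
```

Under that reading the ratios fall between about 3 and 13, in line with the "3-10X smaller" summary given in the conclusions.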

(15) USA use of 1,000 bull genomes
- Sequence genotypes from 440 Holsteins
- Imputed for 27,000 reference bulls
- 700,000 candidate loci + 300,000 HD SNPs
- Largest 17K added to 60K routinely used
- Average gain of 2.7% reliability across traits
- Largest 5K added to next Zoetis chip

(16) Largest genomic databases

                     23andMe    Ancestry.com    CDCB / USDA
Genotypes                 >1 million            1.2 million
Species                     Human               Cattle
Countries            5549 [column breaks lost in transcript]
Genotyping cost      $199       $99             $37-135
Delivery (weeks)             6-8                1-2
DNA generations              Few                Many
EBV reliability              Low                High

Reference: http://genomemag.com/davies-23andme/#.VdY722zosY1
Web sites: https://www.23andme.com/ http://dna.ancestry.com/ https://www.cdcb.us/ http://aipl.arsusda.gov/Main/site_main.htm

(17) Conclusions
- Program findmap.f90 uses known variants
  - Alignment is 50X faster than BWA with 1 processor, 30X faster with 10 processors
  - 2% more segments are mapped correctly
  - Output files are simpler and 3-10X smaller
- Simulation, alignment, variant calling, and imputation programs are available from: http://aipl.arsusda.gov/software

(18) Acknowledgments
- This research was part of USDA-ARS project 1265-31000-096-00, “Improving Genetic Predictions in Dairy Animals Using Phenotypic and Genomic Information.”
- Jeff O’Connell provided much advice on alignment and variant calling methods.
- The reference map was from U. Maryland.
- The variant list was from Daetwyler et al.
