Wiggans ARS Big Data Workshop – July 16, 2015 (1)
George R. Wiggans
Animal Genomics and Improvement Laboratory
Agricultural Research Service, USDA
Beltsville, MD, USA
Big data in support of genetic improvement of dairy cattle
(2) Mission
• Genetic improvement of dairy cattle for economically important traits
  ◦ Yield (milk, fat, and protein)
  ◦ Conformation (overall and individual traits)
  ◦ Longevity (productive life)
  ◦ Fertility (conception and pregnancy rates)
  ◦ Calving (dystocia and stillbirth)
  ◦ Disease resistance (mastitis)
(3) Data types
• Identification information for animal, sire, and dam:
  ◦ Name
  ◦ ID number
  ◦ Birth date
  ◦ Breed
  ◦ Herd
  ◦ Country
• Animal genotypes from marker panels that range from 2,900 to 777,962 markers
[Image courtesy of Illumina, Inc.]
(4) Data types (continued)
• Records for milk yield, fat percentage, protein percentage, and somatic cell count (1/month)
• Appraiser-assigned scores for 16 body and udder characteristics related to conformation (e.g., stature)
• Breeding records that include indicator for conception success
• Calving difficulty scores and stillbirth occurrences
(5) Data amounts
• Pedigree records: 71,974,045
• Animal genotypes: 1,035,590
• Lactation records (since 1960): 132,629,200
• Daily yield records (since 1990): 641,864,015
• Reproduction event records: 176,559,035
• Calving difficulty scores: 29,528,607
• Stillbirth scores: 19,567,198
(6) Computing environment
• Computation server
  ◦ 2.27 GHz CPU (32 cores, 64 threads)
  ◦ 660 GB RAM
  ◦ 2.7 TB local storage
• Database server
  ◦ 3.4 GHz CPU (12 cores, 24 threads)
  ◦ 264 GB RAM
  ◦ 1.3 TB local storage
• Shared storage
  ◦ 38 TB
(7) Data management
• Variable-length segments for database rows to minimize space and overhead in identifying data
• All marker genotypes for an animal stored in a character large object (CLOB), one byte per marker
• All breedings and monthly milk yield and component information for a cow's lactation stored in variable character data types
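The one-byte-per-marker layout can be illustrated with a minimal sketch. The coding used here ('0', '1', '2' for allele counts and '5' for a missing call) and the function names are assumptions made for illustration; the slides do not state the actual coding or database interface.

```python
# Hypothetical genotype coding (an assumption, not taken from the slides):
# '0', '1', '2' = copies of one allele; '5' = missing call.
MISSING = "5"


def encode_genotypes(calls):
    """Pack allele counts (0, 1, 2, or None) into a string,
    one character (one byte in ASCII) per marker."""
    return "".join(MISSING if c is None else str(c) for c in calls)


def decode_genotypes(packed):
    """Unpack the string back into a list of allele counts."""
    return [None if ch == MISSING else int(ch) for ch in packed]


def marker_at(packed, index):
    """Read one marker by position without decoding the whole string."""
    ch = packed[index]
    return None if ch == MISSING else int(ch)


calls = [0, 1, 2, None, 2, 1]
packed = encode_genotypes(calls)  # value that would be stored in the CLOB column
```

With fixed one-character codes, the genotype for any marker sits at a known offset, so a single marker can be read by position without parsing the rest of the string.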
(8) Programming languages
• C
  ◦ Database interface including data editing
• FORTRAN
  ◦ Calculation of genetic merit estimates
• SAS
  ◦ Data preparation, checking, and delivery
(9) Calculation schedule
• Triannual genetic merit estimates from processed phenotypic data
• Monthly genomic evaluations based on estimates of marker effects using genotypic data and triannual phenotype-based evaluations
• Weekly evaluations using marker effect estimates from monthly evaluations
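The weekly step, which reuses marker effects estimated in the monthly run, can be sketched as a simple linear prediction. This is a simplified illustration under the assumption of an additive model; the function name, the `mean` term, and the example numbers are hypothetical, and the production evaluations include components not shown here.

```python
def genomic_prediction(allele_counts, marker_effects, mean=0.0):
    """Sum of (allele count x estimated marker effect) over all markers.

    Assumption: a purely additive model with effects already estimated;
    this is an illustration, not the evaluation software itself.
    """
    if len(allele_counts) != len(marker_effects):
        raise ValueError("one effect is needed per marker")
    return mean + sum(g * a for g, a in zip(allele_counts, marker_effects))


# Made-up numbers for a newly genotyped animal with three markers.
counts = [0, 1, 2]
effects = [0.5, -1.0, 2.0]
value = genomic_prediction(counts, effects)
```

Because the effects are fixed between monthly runs, scoring a new animal costs only one pass over its markers, which is consistent with a weekly turnaround.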
(10) Transition to industry
• Council on Dairy Cattle Breeding
  ◦ Database maintenance
  ◦ Calculation and distribution of genetic merit estimates
  ◦ Interface with evaluation users and data suppliers
• ARS
  ◦ Research and development using data made available by Council
(11) Research resource
• Massive amount of genomic data
  → Location of causal genetic variants
• Investigation of haplotypes never found in a homozygous state
  → Discovery of chromosomal abnormalities resulting in early embryonic death
• Investigation of sons of heterozygous sires
  → Specific markers associated with differences between sons by haplotype
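Why a haplotype that is never seen in homozygous form stands out can be shown with a rough calculation. The sketch below assumes random mating and a Poisson approximation, and the frequency and animal count are made-up numbers; it is a simplification for illustration, not the method used in the research described.

```python
import math


def expected_homozygotes(n_animals, haplotype_freq):
    """Expected homozygous animals under random mating: n * p^2."""
    return n_animals * haplotype_freq ** 2


def prob_zero_observed(expected):
    """Poisson approximation to the chance of seeing no homozygotes."""
    return math.exp(-expected)


# Made-up example: haplotype frequency 2% among 100,000 genotyped animals.
expected = expected_homozygotes(100_000, 0.02)
p_none = prob_zero_observed(expected)
```

When dozens of homozygotes are expected and none are observed, chance is an implausible explanation, which points toward loss of homozygous embryos.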
(12) Genetic merit of marketed Holstein bulls
[Figure; plot annotations:]
• Average gain: $19.42/year
• Average gain: $47.95/year
• Average gain: $87.49/year
(13) Working with sequence data
• Sequence available from 1000 Bull Genomes Project hosted in Australia
• Project funded by industry to sequence over 200 bulls to create a haplotype library
• A posteriori granddaughter design to locate chromosomal segments of interest from 71 bulls, each with over 100 genotyped and progeny-tested sons
(14) Imputing sequence data
• Haplotype library supports imputation
• Genotypes from genotyping chips can be imputed to full sequence
• Lower accuracy of sequence data compared with chip genotypes is accommodated by using dosages to represent allele content
• Findhap v4 (VanRaden) is fast and more accurate than Beagle at low × coverage (read depth)
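A dosage replaces a hard genotype call with the expected number of copies of an allele. The sketch below shows the standard expectation computed from three genotype probabilities; the function and parameter names are hypothetical, and nothing here reflects how Findhap computes its values internally.

```python
def allele_dosage(p_hom_ref, p_het, p_hom_alt):
    """Expected copies of the alternate allele (between 0 and 2).

    Inputs are probabilities (or unnormalized weights) for the three
    genotypes; they are rescaled to sum to 1 before use.
    """
    total = p_hom_ref + p_het + p_hom_alt
    if total <= 0:
        raise ValueError("at least one genotype must have positive weight")
    return (p_het + 2.0 * p_hom_alt) / total


# An uncertain call from low-coverage sequence (made-up probabilities).
dosage = allele_dosage(0.1, 0.3, 0.6)
```

A confident call gives 0, 1, or 2, while an uncertain one lands in between, so the uncertainty of low-coverage sequence carries through to later calculations instead of being discarded.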
(15) Alignment of sequence data
• Alignment: determining the location of chromosomal segments provided by the sequencer
• Findmap: matches each segment against a library of haplotypes
• Preserves low-frequency variants
• Does not identify new variants
• Uses a hash table to find variants, enabling rapid processing
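The general idea of hash-based lookup against a haplotype library can be sketched as follows. This is a generic k-mer index written for illustration; the function names, the value of `k`, and the toy sequences are assumptions, and the sketch does not reproduce Findmap's actual algorithm.

```python
from collections import defaultdict


def build_index(haplotypes, k):
    """Map every k-base substring to the (haplotype, position) pairs
    where it occurs in the library."""
    index = defaultdict(list)
    for name, seq in haplotypes.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def locate(read, index, k):
    """Look up the first k bases of a read; one hash probe per read."""
    return index.get(read[:k], [])


# Toy library: two haplotypes that differ at a single base.
library = {"hapA": "ACGTACGGTA", "hapB": "ACGTTCGGTA"}
index = build_index(library, k=4)
shared = locate("CGGTA", index, k=4)   # present in both haplotypes
unique = locate("TACGG", index, k=4)   # carries the hapA-specific base
```

Lookup time does not grow with library size, and a read carrying a rare library variant still finds its haplotype. A read with a variant absent from the library finds no exact match, in line with the slide's point that new variants are not identified.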
(16) Accuracy of Findhap vs. Beagle*
                   Sequence + HD            Imputed from HD
Program   Depth    Correct   Correlation    Correct   Correlation
Findhap   8×
          ×
          ×
Beagle    8×
          ×
          ×
*250 bulls had sequence + HD; 250 others were imputed from HD
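The two column headings in the table, "Correct" and "Correlation", can be read as the percentage of genotypes called correctly and the correlation between true and imputed genotypes. The sketch below computes both under that assumption; the function names and genotype values are made up, and the slide does not define the measures precisely.

```python
import math


def percent_correct(true_calls, imputed_calls):
    """Percentage of markers where the imputed call equals the true call."""
    matches = sum(t == i for t, i in zip(true_calls, imputed_calls))
    return 100.0 * matches / len(true_calls)


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up allele counts for six markers; one imputed call is wrong.
true_calls = [0, 1, 2, 1, 0, 2]
imputed_calls = [0, 1, 2, 2, 0, 2]
pct = percent_correct(true_calls, imputed_calls)
r = pearson(true_calls, imputed_calls)
```

The correlation is the more informative of the two when one genotype class dominates, because calling every animal as the common genotype can score a high percentage correct while carrying no information.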
(17) Data storage and backup
• Disk storage being added
  ◦ Compression option being investigated
• Backup to tape with weekly submission to off-site storage
• Internet2 connection expected
  ◦ Will facilitate sharing of sequence data
(18) Summary
• Highly successful program leading to annual increases in genetic merit for production efficiency
• Large database of phenotypic and genomic data provided by industry
• Research projects to determine mechanism of genetic control of economically important traits
• Data processing techniques developed so that rapid turnaround could be realized