BIG Data: Knowledge for Improving Vaccine Virus Selection Richard H. Scheuermann, Ph.D. Director of Informatics JCVI
Big Data BIG DATA
Big Data Volumes
Big Data in Biology
Big Data 3 V’s
Biological data types and analysis objectives
Genomics – Data: nucleotide genome sequences, metagenomic sequences. Analyses: gene finding, functional annotation, sequence alignment, homology determination, comparative analysis, phylogenetic inferencing, association analysis, mutation functional prediction, species distribution analysis
Transcriptomics – Data: RNA expression levels, transcription factor binding, chromatin structure information. Analyses: differential expression, clustering, functional enrichment, transcriptional regulation/causal reasoning
Proteomics – Data: protein levels, protein structures, protein interactions. Analyses: protein identification, protein functional predictions, structural predictions, structural comparison, molecular dynamic simulation, mutation functional prediction, docking predictions, network analysis
Metabolomics – Data: metabolite/small molecule levels. Analyses: pathway/network analysis
Imaging – Data: microscopy images, MRI images, CT scans. Analyses: feature extraction, high content screening
Cytometry – Data: cell levels, cell phenotypes. Analyses: cell population clustering, cell biomarker discovery
Systems biology – Data: all of the above. Analyses: network analysis, causal reasoning, reverse causal reasoning, drug target prediction, regulatory network analysis, information flow, population dynamics, modeling and simulation
Variety
No Variety
Big Data Volume + Variety = Value Variety = Metadata
DMID Genomics Courtesy of Alison Yao, DMID
Bioinformatics Resource Centers (BRCs)
IRD Home Page: comprehensive collection of flu-related data and analysis tools; free use without restrictions; standardization and integration
IRD Data Summary
Protein Structures: 412 structure files, 379 Influenza A, 9 PB2, 6 PB1, 1 PB1-F2, 25 PA, 162 HA, 20 NP, 110 NA, 6 M1, 15 M2, 27 NS1, 2 NS2
Host Factor Data: 55 experiments, 35 transcriptomics, 16 proteomics, 4 lipidomics, 2968 experiment samples, 544 host factor biosets, host factor responses
Sequence Features: 4794 Sequence Features, 321 Structural, 176 Functional, 122 Sequence alterations, 4175 Epitopes
Variant Types
Data in IRD
GSC-BRC Metadata Working Group Collaboration between U.S. Genome Sequence Centers for Infectious Diseases and Bioinformatics Resource Centers What kind of data should be collected for a sequencing specimen? How should the information be represented? Decisions driven by usage
[Diagram: Specimen Isolation metadata model. Entities: specimen source (organism or environmental material, with gender, age, health status, pathogenic disposition), specimen collector (person, affiliation, name), equipment, temporal-spatial region (GPS location, geographic location, date/time), specimen X (ID, specimen type), specimen isolation procedure X, isolation protocol, hypothesis, IRB/IACUC approval, environment. Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality, has_disposition. Nodes carry field labels CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]
Metadata Processes
[Diagram: chain of processes, each located_in a temporal-spatial region and linked by has_input/has_output: Specimen Isolation (specimen source, specimen collector, isolation protocol, specimen), Material Processing (sample processing, microorganism enriched NA sample, microorganism genomic NA), Sequencing Assay (input sample, reagents, technician, equipment, primary data), Quality Assessment (quality assessment assay), Data Processing (data transformations: image processing, assembly, variant detection, serotype marker detection, gene detection; outputs: sequence data, genotype/serotype/gene data), and data archiving (sequence data record, GenBank ID), all part of an Investigation.]
Data Standards Dugan V, et al. PLOS One 2014, submitted.
Can we monitor influenza genetic drift and predict when a new variant has escaped protective immunity? Genetic Drift and Escape from Protective Immunity
Evolutionary drivers
Viruses experience two main drivers of evolution:
– Functional constraint (purifying selection): selection against deleterious amino acid substitutions in order to maintain important structural and functional elements
– Immune pressure (diversifying selection): selection for amino acid mutations that result in viruses that evade pre-existing immunity or gain other characteristics of enhanced fitness
Selective Pressures on HA Hemagglutinin (HA) protein is: Responsible for virus attachment and entrance into the host cell A major antigenic component of the virus If we can determine which regions of HA are targets of protective immunity, we can monitor genetic drift in those regions to predict escape. Regions undergoing diversifying selection as HA naturally evolves would correspond to the relevant epitopes for protective immunity This information could be used to help predict when new vaccine strains are warranted
Approach
1. Map all experimentally defined immune epitopes on the H1 HA protein.
2. Identify sites that have experienced diversifying selection in pre-pandemic H1N1 strains and use them to select immune epitopes likely to be targets of protective immunity.
3. Determine whether these regions are being targeted for mutation during the ongoing evolution of the pandemic H1N1 lineage.
Pre-pandemic HA Pandemic HA
B-cell Epitopes from Immune Epitope Database (IEDB)
Identifying Sites Experiencing Diversifying Selection
Selection pressure is estimated using Fast Unconstrained Bayesian Approximation (FUBAR) – Murrell B, et al. (2013) Mol. Biol. Evol. 30(5):1196–1205:
dN: rate of non-synonymous substitutions
dS: rate of synonymous substitutions
Non-synonymous substitution: CTA (Leu) → CCA (Pro)
Synonymous substitution: CTA (Leu) → CTG (Leu)
The non-synonymous and synonymous rates are estimated for each site by calculating the posterior probability, Prob(dN_site, dS_site | Data_site, Tree, Codon Substitution Rate, Codon Freq).
A site is considered to be under diversifying selection if the Bayesian score for (dN/dS) observed > (dN/dS) expected is greater than 0.9.
Calculated using all H1 NA sequences prior to the 2009 pandemic (pre-pandemic) – 2105 full length HA protein sequences
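The two codon examples on this slide can be checked with a small sketch. This is only an illustration of the synonymous/non-synonymous distinction, not of FUBAR itself; the function name `classify_substitution` and the three-entry codon table are assumptions made for this example, and a real analysis would use the full 64-codon genetic code.

```python
# Minimal codon table: only the codons used in the slide's two examples.
# (Assumption for illustration; a real analysis needs all 64 codons.)
CODON_TO_AA = {"CTA": "Leu", "CTG": "Leu", "CCA": "Pro"}


def classify_substitution(before, after):
    """Classify a codon change by whether the encoded amino acid changes."""
    aa_before = CODON_TO_AA[before]
    aa_after = CODON_TO_AA[after]
    if aa_before == aa_after:
        return "synonymous"
    return "non-synonymous"


# The two substitutions shown on the slide.
for before, after in (("CTA", "CCA"), ("CTA", "CTG")):
    print(before, "->", after, ":", classify_substitution(before, after))
```

Counting such changes per site, relative to the number of possible changes of each kind, is what the dN and dS rates summarize.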
Sites Experiencing Diversifying Selection Found 7 sites experiencing diversifying selection in pre-pandemic H1 HA Threshold = 0.9 Bayesian Score
B-cell Epitopes with Diversified Sites (p = 0.02)
Relevant B-cell Epitopes Sa Sb Caton et al
/7 diversifying sites correspond to two well-characterized B-cell/antibody epitopes that may be targets of protective immunity
2/7 sites do not correspond to any previously characterized B-cell/antibody epitope
Highlight “evolutionary regions of interest”
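The step of matching diversifying sites to known epitopes can be sketched as follows. The 0.9 cutoff is the Bayesian score threshold stated on the slides; everything else (the function names, the per-site scores, and the epitope names and coordinate ranges) is hypothetical and chosen only to make the example runnable. It is not the data from this study.

```python
THRESHOLD = 0.9  # Bayesian score cutoff stated on the slides


def diversifying_sites(scores, threshold=THRESHOLD):
    """Return the sites whose score is strictly above the threshold."""
    return sorted(site for site, score in scores.items() if score > threshold)


def sites_by_epitope(sites, epitopes):
    """Group sites by the epitope range (inclusive) that contains them.

    Returns (hits, unmatched): hits maps epitope name to its sites,
    unmatched lists sites that fall in no epitope.
    """
    hits = {
        name: [s for s in sites if start <= s <= end]
        for name, (start, end) in epitopes.items()
    }
    unmatched = [
        s for s in sites
        if not any(start <= s <= end for start, end in epitopes.values())
    ]
    return hits, unmatched


# Hypothetical per-site scores and epitope coordinates (not study data).
example_scores = {10: 0.95, 25: 0.40, 31: 0.99, 70: 0.91, 88: 0.90}
example_epitopes = {"epitope_A": (5, 15), "epitope_B": (28, 35)}

sites = diversifying_sites(example_scores)
hits, unmatched = sites_by_epitope(sites, example_epitopes)
print("diversifying sites:", sites)
print("sites per epitope:", hits)
print("sites outside known epitopes:", unmatched)
```

Sites that land outside every known epitope are the kind of result the slide flags as "evolutionary regions of interest".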
Test Predictions on Pandemic Drift
Meta-CATS (Pickett BE, et al. (2013) Virology, 447:45-51) is a statistical tool that determines whether the nucleotide or amino acid residues at each position in a multiple sequence alignment differ significantly between groups of sequences, using a chi-squared statistic.
Group 1 (Early Pandemic Isolates):
– Original outbreak sequences (21 earliest 2009 pandemic North American sequences)
Group 2 (Late Pandemic Isolates):
– California and season (15 sequences)
– Florida and season (21 sequences)
– New York and season (13 sequences)
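The per-position comparison described above can be illustrated with a plain Pearson chi-squared test on one alignment column. This is a minimal sketch of the general technique, not the Meta-CATS implementation; the function names `chi2_sf` and `position_chi2` are made up for this example, both groups are assumed to be non-empty, and the residues in the usage lines are hypothetical (only the group sizes 21 and 15 echo the slide).

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared distribution with df >= 1.

    Uses the recurrence Q(s + 1, z) = Q(s, z) + z**s * exp(-z) / gamma(s + 1)
    starting from Q(1, z) = exp(-z) or Q(1/2, z) = erfc(sqrt(z)).
    """
    z = x / 2.0
    if df % 2 == 0:
        s, q = 1.0, math.exp(-z)
    else:
        s, q = 0.5, math.erfc(math.sqrt(z))
    while s < df / 2.0:
        q += z ** s * math.exp(-z) / math.gamma(s + 1.0)
        s += 1.0
    return q


def position_chi2(group1, group2):
    """Test residue-by-group independence at one alignment column.

    group1 and group2 are non-empty sequences of residues, one per isolate.
    Returns (statistic, degrees_of_freedom, p_value).
    """
    c1, c2 = Counter(group1), Counter(group2)
    residues = sorted(set(c1) | set(c2))
    if len(residues) < 2:
        return 0.0, 0, 1.0  # column is invariant, nothing to test
    n1, n2 = len(group1), len(group2)
    total = n1 + n2
    stat = 0.0
    for r in residues:
        row_total = c1[r] + c2[r]
        for observed, n in ((c1[r], n1), (c2[r], n2)):
            expected = row_total * n / total
            stat += (observed - expected) ** 2 / expected
    df = len(residues) - 1  # (rows - 1) * (2 groups - 1)
    return stat, df, chi2_sf(stat, df)


# Hypothetical column: every early isolate has S, every late isolate has T.
early = ["S"] * 21
late = ["T"] * 15
print(position_chi2(early, late))
```

In practice one such test is run per alignment position, so some correction for multiple testing is usually appropriate, and small expected counts weaken the chi-squared approximation.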
Meta-CATS Results (California) Group 1: Early pandemic Group 2: Late CA pandemic (season and 13-14)
Results
Columns: Site | Diversifying Sites from Pre-Pandemic | Diversifying Sites from Pandemic | Meta-CATS (CA season 12-13, 13-14) | Meta-CATS (FL season , 13-14) | Meta-CATS (NY season 12-13, 13-14) | Diversified Epitopes (# epitopes) | T-cell Epitope (# epitopes)
Row values as extracted: 52++(6) 101++(5) (6) (8)+ (6) (4)+ (5) (4)+ (5) (5)+ (4) (2) (6)+ (5) (3)+ (6) (2) (2) (1)+(4) (1)+(5) (5) (1) (4) (6) (3) 389+(5) (5) (4) (1) 544++(4)
Legend: Sa, Sb, T-cell
Test Relevant B-cell Epitopes Sa Sb
Tree Analysis
[Phylogenetic tree figures, annotated by flu season. Legend: dominant residue in outbreak strains; dominant residue in late pandemic strains; remaining amino acids.]
Tree Analysis Summary Flu Season S220T E391K S468N S202T D114N E516K K300E K180Q A273T
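The labels on this slide follow the usual substitution notation: original residue, position, new residue (for example, S220T is serine at position 220 replaced by threonine). A small sketch for parsing them is given below; the function name `parse_substitution` is hypothetical, and the numbering scheme behind the positions is not stated in the slides.

```python
import re

# One-letter amino acid, position, one-letter amino acid (e.g. "S220T").
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label):
    """Split a label such as 'S220T' into (original, position, replacement)."""
    match = SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return match.group(1), int(match.group(2)), match.group(3)


# Labels as listed on the slide.
labels = ["S220T", "E391K", "S468N", "S202T", "D114N",
          "E516K", "K300E", "K180Q", "A273T"]
for original, position, replacement in sorted(
        (parse_substitution(x) for x in labels), key=lambda t: t[1]):
    print(position, original, "->", replacement)
```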
Big Data to Knowledge Volume + Variety = Value Variety = Metadata Data + Metadata + Integration + Interpretation = Knowledge
Big Data for Vaccine Selection
Large-scale statistical genomic analysis can identify sites experiencing diversifying selection
– Help determine how much sequence data is needed
When integrated with immune epitope data, it could pinpoint those regions important for protective immunity and predict relevant antigenic drift
– Natural experiment to identify correlates of protective immunity
Monitoring genetic drift in these regions could augment approaches like antigenic cartography/landscape analysis to determine when vaccine candidates should be adjusted
Acknowledgments
U.T. Southwestern/JCVI: Richard Scheuermann (PI); Burke Squires; Jyothi Noronha; Alex Lee; Brian Aevermann; Brett Pickett; Yun Zhang
MSSM: Adolfo Garcia-Sastre; Eric Bortz; Gina Conenello; Peter Palese
Vecna: Chris Larsen; Al Ramsey
LANL: Catherine Macken; Mira Dimitrijevic
U.C. Davis: Nicole Baumgarth
Northrop Grumman: Ed Klem; Mike Atassi; Kevin Biersack; Jon Dietrich; Wenjie Hua; Wei Jen; Sanjeev Kumar; Xiaomei Li; Zaigang Liu; Jason Lucas; Michelle Lu; Bruce Quesenberry; Barbara Rotchford; Hongbo Su; Bryan Walters; Jianjun Wang; Sam Zaremba; Liwei Zhou; Zhiping Gu
IRD SWG: Gillian Air, OMRF; Carol Cardona, Univ. Minnesota; Adolfo Garcia-Sastre, Mt Sinai; Elodie Ghedin, Univ. Pittsburgh; Martha Nelson, Fogarty; Daniel Perez, Univ. Maryland; Gavin Smith, Duke Singapore; David Spiro, JCVI; Dave Stallknecht, Univ. Georgia; David Topham, Rochester; Richard Webby, St Jude
USDA: David Suarez
Sage Analytica: Robert Taylor; Lone Simonsen
CEIRS Centers
N01AI40041 HHSN C