BIG Data: Knowledge for Improving Vaccine Virus Selection Richard H. Scheuermann, Ph.D. Director of Informatics JCVI.


Big Data

Big Data Volumes

Big Data in Biology

Big Data 3 V’s

Biological data types and analysis objectives
Genomics
– Nucleotide genome sequences, metagenomic sequences
– Gene finding, functional annotation, sequence alignment, homology determination, comparative analysis, phylogenetic inferencing, association analysis, mutation functional prediction, species distribution analysis
Transcriptomics
– RNA expression levels, transcription factor binding, chromatin structure information
– Differential expression, clustering, functional enrichment, transcriptional regulation/causal reasoning
Proteomics
– Protein levels, protein structures, protein interactions
– Protein identification, protein functional predictions, structural predictions, structural comparison, molecular dynamic simulation, mutation functional prediction, docking predictions, network analysis
Metabolomics
– Metabolite/small molecule levels
– Pathway/network analysis
Imaging
– Microscopy images, MRI images, CT scans
– Feature extraction, high content screening
Cytometry
– Cell levels, cell phenotypes
– Cell population clustering, cell biomarker discovery
Systems biology
– All of the above
– Network analysis, causal reasoning, reverse causal reasoning, drug target prediction, regulatory network analysis, information flow, population dynamics, modeling and simulation

Variety

No Variety

Big Data Volume + Variety = Value Variety = Metadata

DMID Genomics Courtesy of Alison Yao, DMID

Bioinformatics Resource Centers (BRCs)

IRD Home Page
Comprehensive collection of flu-related data and analysis tools
Free use without restrictions
Standardization and integration

IRD Data Summary
Protein Structures: 412 structure files, 379 Influenza A, 9 PB2, 6 PB1, 1 PB1-F2, 25 PA, 162 HA, 20 NP, 110 NA, 6 M1, 15 M2, 27 NS1, 2 NS2
Host Factor Data: 55 experiments, 35 transcriptomics, 16 proteomics, 4 lipidomics, 2968 experiment samples, 544 host factor biosets, host factor responses
Sequence Features: 4794 Sequence Features, 321 Structural, 176 Functional, 122 Sequence alterations, 4175 Epitopes, Variant Types
Data in IRD

GSC-BRC Metadata Working Group Collaboration between U.S. Genome Sequence Centers for Infectious Diseases and Bioinformatics Resource Centers What kind of data should be collected for a sequencing specimen? How should the information be represented? Decisions driven by usage

[Diagram: Specimen Isolation metadata model. Entities: specimen, specimen source (organism, organism part, or environmental material), specimen collector (person, affiliation), equipment, isolation protocol, temporal-spatial region (GPS location, geographic location, date/time), hypothesis, IRB/IACUC approval, and organism qualities (gender, age, health status, pathogenic disposition). Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, instance_of, is_about, has_quality, has_authorization, has_affiliation, has disposition. Fields are labeled CS1 through CS18.]

Metadata Processes
[Diagram: process chain Specimen Isolation, Material Processing, Sequencing Assay, Quality Assessment, Data Processing, within an Investigation. Each process is located_in a temporal-spatial region and linked by has_input/has_output. Inputs and outputs shown: specimen source (organism or environmental), specimen collector, isolation protocol, specimen, microorganism-enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, primary data, sequence data, genotype/serotype/gene data, and the archived sequence data record denoted by a GenBank ID. Data transformations shown: image processing, assembly, variant detection, serotype marker detection, gene detection.]

Data Standards Dugan V, et al. PLOS One 2014, submitted.

Can we monitor influenza genetic drift and predict when a new variant has escaped protective immunity? Genetic Drift and Escape from Protective Immunity

Evolutionary drivers
Viruses experience two main drivers of evolution:
Functional Constraint (purifying selection): selection against deleterious amino acid substitutions in order to maintain important structural and functional elements
Immune Pressure (diversifying selection): selection for amino acid mutations that result in viruses that evade pre-existing immunity, and for other characteristics of enhanced fitness

Selective Pressures on HA Hemagglutinin (HA) protein is: Responsible for virus attachment and entrance into the host cell A major antigenic component of the virus If we can determine which regions of HA are targets of protective immunity, we can monitor genetic drift in those regions to predict escape. Regions undergoing diversifying selection as HA naturally evolves would correspond to the relevant epitopes for protective immunity This information could be used to help predict when new vaccine strains are warranted

Approach
1. Map all experimentally defined immune epitopes on the H1 HA protein
2. Identify sites that have experienced diversifying selection in pre-pandemic H1N1 strains and use them to select immune epitopes likely to be targets of protective immunity
3. Determine whether these regions are being targeted for mutation during the ongoing evolution of the pandemic H1N1 lineage
Pre-pandemic HA Pandemic HA
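Step 2 of the approach amounts to intersecting two sets of positions. The sketch below is only an illustration of that idea: the function name, the epitope names, and all coordinates are hypothetical, and it assumes each epitope can be treated as a contiguous range of HA positions, which is not true of conformational epitopes.

```python
def epitopes_with_selected_sites(epitopes, selected_sites):
    """Return the epitopes that contain at least one selected site.

    epitopes: dict mapping an epitope name to an inclusive (start, end)
              range of protein positions (hypothetical representation).
    selected_sites: iterable of positions under diversifying selection.
    """
    hits = {}
    for name, (start, end) in epitopes.items():
        inside = sorted(s for s in selected_sites if start <= s <= end)
        if inside:
            hits[name] = inside
    return hits


# Made-up coordinates, for illustration only.
example_epitopes = {"epitope_A": (10, 20), "epitope_B": (30, 40)}
example_sites = {12, 18, 50}
print(epitopes_with_selected_sites(example_epitopes, example_sites))
```

Epitopes that come back from such a filter would be the candidates carried forward to step 3.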

B-cell Epitopes from Immune Epitope Database (IEDB)

Identifying Sites Experiencing Diversifying Selection
Selection pressure estimated using Fast Unconstrained Bayesian Approximation (FUBAR) – Murrell B, et al. (2013) Mol. Biol. Evol. 30(5):1196–1205
dN: rate of non-synonymous substitutions
dS: rate of synonymous substitutions
Non-synonymous substitution: CTA (Leu) → CCA (Pro)
Synonymous substitution: CTA (Leu) → CTG (Leu)
The non-synonymous and synonymous rates are estimated for each site by calculating the posterior probability, Prob(dN site, dS site | Data site, Tree, Codon Substitution Rate, Codon Freq).
Sites are considered to be under diversifying selection if (dN/dS) observed > (dN/dS) expected with a Bayesian score > 0.9.
Calculated using all H1 NA sequences prior to the 2009 pandemic (pre-pandemic) – 2105 full length HA protein sequences
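The distinction between synonymous and non-synonymous substitutions can be checked directly against the standard genetic code. The following is a minimal sketch, not part of FUBAR; the function name is an assumption made for illustration, and only the standard codon table is used.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (third position varies fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_from, codon_to):
    """Label a codon change as 'synonymous' or 'non-synonymous'."""
    aa_from = CODON_TABLE[codon_from.upper()]
    aa_to = CODON_TABLE[codon_to.upper()]
    return "synonymous" if aa_from == aa_to else "non-synonymous"


print(classify_substitution("CTA", "CCA"))  # Leu -> Pro
print(classify_substitution("CTA", "CTG"))  # Leu -> Leu
```

Counting these two classes is only the raw ingredient; dN and dS are rates, so a method such as FUBAR also accounts for the phylogeny and the codon substitution model.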

Sites Experiencing Diversifying Selection Found 7 sites experiencing diversifying selection in pre-pandemic H1 HA Threshold = 0.9 Bayesian Score
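Applying the threshold is a simple filter once a per-site score is available. The sketch below assumes the scores have already been exported as a mapping from site position to Bayesian score; the function name and the example values are hypothetical and are not results from the analysis.

```python
def diversifying_sites(scores, threshold=0.9):
    """Return the sorted site positions whose score exceeds the threshold.

    scores: dict mapping site position -> Bayesian score in [0, 1]
            (assumed input format, for illustration).
    """
    return sorted(site for site, score in scores.items() if score > threshold)


# Made-up scores; a site scoring exactly 0.90 is not selected.
example_scores = {10: 0.95, 11: 0.40, 12: 0.91, 13: 0.90}
print(diversifying_sites(example_scores))
```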

B-cell Epitopes with Diversified Sites p = 0.02

Relevant B-cell Epitopes
Sa Sb Caton et al.
5/7 diversifying sites correspond to two well characterized B cell/antibody epitopes that may be targets of protective immunity
2/7 sites do not correspond to any previously characterized B cell/antibody epitope
Highlight “evolutionary regions of interest”

Test Predictions on Pandemic Drift
Meta-CATS (Pickett BE, et al. (2013) Virology, 447:45-51) is a statistical tool that determines if nucleotide or amino acid residues at each position in a multiple sequence alignment are significantly different between groups of sequences using a chi-squared statistic
Group 1 (Early Pandemic Isolates):
– Original outbreak sequences (21 earliest 2009 pandemic North American sequences)
Group 2 (Late Pandemic Isolates):
– California and season (15 sequences)
– Florida and season (21 sequences)
– New York and season (13 sequences)
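The per-position comparison can be illustrated with a plain Pearson chi-squared statistic on residue counts. This is a sketch of the general idea under stated assumptions, not the Meta-CATS implementation: the function name is hypothetical, sequences are assumed to be pre-aligned strings of equal length, both groups are assumed non-empty, and no p-value or multiple-testing correction is computed.

```python
from collections import Counter


def chi_squared_at_position(group1, group2, position):
    """Pearson chi-squared statistic for residue counts at one alignment column.

    group1, group2: non-empty lists of aligned sequences (equal-length strings).
    Returns (statistic, degrees_of_freedom) for the 2 x k contingency table,
    where k is the number of distinct residues seen in the column.
    """
    counts1 = Counter(seq[position] for seq in group1)
    counts2 = Counter(seq[position] for seq in group2)
    residues = sorted(set(counts1) | set(counts2))
    n1, n2 = len(group1), len(group2)
    total = n1 + n2

    statistic = 0.0
    for residue in residues:
        column_total = counts1[residue] + counts2[residue]
        for observed, group_size in ((counts1[residue], n1), (counts2[residue], n2)):
            expected = column_total * group_size / total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = len(residues) - 1  # (2 - 1) * (k - 1)
    return statistic, degrees_of_freedom


# Toy alignment: column 0 is identical in both groups, column 1 differs completely.
early = ["AS", "AS", "AS", "AS"]
late = ["AT", "AT", "AT", "AT"]
print(chi_squared_at_position(early, late, 0))
print(chi_squared_at_position(early, late, 1))
```

A column with a single residue in both groups gives zero degrees of freedom and carries no signal; in practice such invariant columns would be skipped before any significance test.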

Meta-CATS Results (California) Group 1: Early pandemic Group 2: Late CA pandemic (season and 13-14)

Results
[Table. Columns: Site; Diversifying Sites from Pre-Pandemic; Diversifying Sites from Pandemic; Meta-CATS (CA season 12-13, 13-14); Meta-CATS (FL season , 13-14); Meta-CATS (NY season 12-13, 13-14); Diversified Epitopes (# epitopes); T-cell Epitope (# epitopes). Legend: Sa, Sb, T-cell.]

Test Relevant B-cell Epitopes Sa Sb

Tree Analysis Flu Season Legend Dominant residue in outbreak strains Dominant residue in late pandemic strains Remaining amino acids

Flu Season Legend Dominant residue in outbreak strains Dominant residue in late pandemic strains Remaining amino acids Tree Analysis

Tree Analysis Summary Flu Season S220T E391K S468N S202T D114N E516K K300E K180Q A273T

Big Data to Knowledge Volume + Variety = Value Variety = Metadata Data + Metadata + Integration + Interpretation = Knowledge

Big Data for Vaccine Selection
Large scale statistical genomic analysis can identify sites experiencing diversifying selection
– Help determine how much sequence data is needed
When integrated with immune epitope data, could pinpoint those regions important for protective immunity and predict relevant antigenic drift
– Natural experiment to identify correlates of protective immunity
Monitoring genetic drift in these regions could augment approaches like antigenic cartography/landscape analysis to determine when vaccine candidates should be adjusted

Acknowledgments
U.T. Southwestern/JCVI – Richard Scheuermann (PI) – Burke Squires – Jyothi Noronha – Alex Lee – Brian Aevermann – Brett Pickett – Yun Zhang
MSSM – Adolfo Garcia-Sastre – Eric Bortz – Gina Conenello – Peter Palese
Vecna – Chris Larsen – Al Ramsey
LANL – Catherine Macken – Mira Dimitrijevic
U.C. Davis – Nicole Baumgarth
Northrop Grumman – Ed Klem – Mike Atassi – Kevin Biersack – Jon Dietrich – Wenjie Hua – Wei Jen – Sanjeev Kumar – Xiaomei Li – Zaigang Liu – Jason Lucas – Michelle Lu – Bruce Quesenberry – Barbara Rotchford – Hongbo Su – Bryan Walters – Jianjun Wang – Sam Zaremba – Liwei Zhou – Zhiping Gu
IRD SWG – Gillian Air, OMRF – Carol Cardona, Univ. Minnesota – Adolfo Garcia-Sastre, Mt Sinai – Elodie Ghedin, Univ. Pittsburgh – Martha Nelson, Fogarty – Daniel Perez, Univ. Maryland – Gavin Smith, Duke Singapore – David Spiro, JCVI – Dave Stallknecht, Univ. Georgia – David Topham, Rochester – Richard Webby, St Jude
USDA – David Suarez
Sage Analytica – Robert Taylor – Lone Simonsen
CEIRS Centers
N01AI40041 HHSN C