Bioinformatics of Disease: immune epitope prediction

Presentation transcript:

Bioinformatics of Disease: immune epitope prediction. Shoba Ranganathan. Professor and Chair – Bioinformatics, Dept. of Chemistry and Biomolecular Sciences & Biotechnology Research Institute, Macquarie University, Sydney, Australia (shoba.ranganathan@mq.edu.au); Adjunct Professor, Dept. of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore (shoba@bic.nus.edu.sg); Visiting scientist @ Institute for Infocomm Research (I2R), Singapore

Bioinformatics is … the study of living systems through computation

Data in Bioinformatics (in the main) and their management and analysis: Genetics and populations; Networks, pathways and systems; Sequences; Structures; Genomes; Transcriptomes; Databases, ontologies; Data & text mining; Algorithms; Maths/Stats; Physics/Chemistry; Evolution and phylogenetics

Overview of my research Genome analysis Transcriptome analysis Protein/Proteome analysis Systems Biology Immunoinformatics Genome-phenome mapping Biodiversity Informatics

What is Immunoinformatics? Using Bioinformatics to address problems in Immunology. Application of bioinformatics to accelerate immune system research has the potential to deliver vaccines and address immunotherapeutics. Computational systems biology of immune response

Immunoinformatics: Immunology, Computer Science, Structural Biology. Structural immunoinformatics is the convergence of three scientific disciplines (immunology, computer science and structural biology) to facilitate research in immunology. Research in immunology provides insights into how our body responds to infection and disease, while structural biology provides knowledge of how the various receptor-ligand systems interact. Computer science, in turn, provides an effective means to store and analyze large volumes of complex data. Combining the three fields increases the efficiency of biological research and offers the potential for major advances in the study of biological systems. This project focuses on an important component of the immune system: the T-cell mediated adaptive immune response.

IMMUNOINFORMATICS: Basic immunology; Clinical immunology; -omics; Networks, pathways, and systems; Artificial intelligence; Cell biology; Physics/Chemistry; Databases; Algorithms; Maths/Stats

Disease alleviation Genome screening - marker detection Proteomics/genomics of diseased state Sequence analysis of antigens/markers Structure analysis of antigens T cell epitope analysis Antibody epitope analysis Vaccine design

Summary: Introduction; Structural Immunoinformatic Database development; Data Analysis; Computational models; Applications. Diagram labels: Networks, pathways and systems; Genetics and populations; Clinical immunology; Basic immunology; -omics

The immune system Composed of many interdependent cell types, organs, and tissues to protect the body from infections (bacterial, parasitic, fungal, or viral) and arrest abnormal growth and differentiation Inappropriate immune responses lead to allergies and autoimmunity 2nd most complex system in the human body

Genomics vs. Immunomics. Genomics: solving the genome puzzle; 10^4 genes coding for 10^6 products. Immunomics: understanding immune response; 10^2-10^3 genes leading to >10^12 products. Enormous diversity in immunomics has implications for immune function and modulation

It is a numbers game…. >10^13 MHC class I haplotypes (IMGT-HLA); 10^7-10^15 T cell receptors (Arstila et al., 1999); >10^9 combinatorial antibodies (Jerne, 1993); 10^12 B cell clonotypes (Jerne, 1993); 10^11 linear epitopes composed of nine amino acids; >>10^11 conformational epitopes

T cell mediated adaptive immune response Specific peptide residues critical for stimulating cellular immune responses Major histocompatibility complex (MHC) molecules (Human Leukocyte Antigen or HLA in humans) bind and present short antigenic peptides to T cell receptors, for inspection Antigen presentation is by two classes of MHC (class I and class II) Those peptides that bind to specific MHC and trigger T cell recognition (T cell epitopes) are targets for vaccine and immunotherapy development

How to generate a T cell-mediated immune response: 1. Epitope 2. MHC 3. T cell receptor

Major histocompatibility complex Gene structure of the human MHC 3D structure of the human MHC MHC Class II MHC Class I

MHC Class I for endogenous peptides Figure by Eric A.J. Reits

MHC class II for exogenous peptides Figure by Eric A.J. Reits

Antigen processing pathway: peptides, MHC, T-cells (Yewdell et al. Ann. Rev Immunol (1999)). Degradation of antigen (20% processed); peptide binding to MHC (0.5% bind MHC); recognition of peptide-MHC complex by T-cells (50% CTL response); overall 0.05% chance of immunogenicity. The major histocompatibility complex (MHC) is a series of genes on chromosome 6 that code for cell surface proteins which control the adaptive immune response. There are two major groups, the Class I and Class II MHC genes, which encode the human leukocyte antigens (HLA) and perform antigen presentation. This diagram illustrates the antigen processing pathway. Of particular importance are the binding of antigenic peptides to MHC and the presentation of MHC-peptide complexes to T-cells. T-cells, upon recognizing an MHC-peptide complex, secrete cytokines which stimulate the proliferation of T-cells, B-cells and macrophages and the production of antibodies by B-cells. How well a peptide binds to the MHC is determined by its binding affinity, which is represented by the dissociation constant Kd. Several experimental techniques are used to estimate Kd, such as the competitive binding assay (IC50) and the peptide binding stabilization assay (BL50). In addition to the binding affinity, the stability of the complex is reflected by the Gibbs free energy, which is related to the dissociation constant by $\Delta G = RT \ln K_d$, where T is the absolute temperature and R is the gas constant, i.e. the Boltzmann constant expressed per mole. The Boltzmann constant is the physical constant relating temperature to energy, named after Ludwig Boltzmann. Its experimentally determined value is k = 1.380 6503 x 10^-23 J K^-1.
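The relation between the dissociation constant and the free energy on this slide can be illustrated with a short Python sketch. It assumes the molar form of the equation (gas constant R rather than the per-molecule Boltzmann constant) and a temperature of 298.15 K; the function names are hypothetical and not part of the presentation. The second helper simply multiplies the three fractions quoted on the slide.

```python
import math

R_GAS = 8.314462618  # molar gas constant in J mol^-1 K^-1 (Avogadro constant x Boltzmann constant)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Return dG = R*T*ln(Kd) in kJ/mol; a smaller Kd gives a more negative dG."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_GAS * temperature_k * math.log(kd_molar) / 1000.0


def immunogenicity_fraction(processed=0.20, bind_mhc=0.005, ctl_response=0.50):
    """Multiply the per-step fractions quoted on the slide."""
    return processed * bind_mhc * ctl_response


if __name__ == "__main__":
    for kd in (500e-9, 5000e-9):  # assumed example Kd values in mol/L
        print(f"Kd = {kd:.1e} M -> dG = {binding_free_energy(kd):.2f} kJ/mol")
    print(f"Overall fraction: {immunogenicity_fraction():.4%}")
```

The product of the three quoted fractions is 0.0005, which matches the 0.05% figure on the slide.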

Physico-chemical properties affect MHC-peptide binding. Many factors affect the binding of peptides to MHC, and some of these are shown in this diagram. Physically, five or more factors affect the MHC-peptide interaction: residue size, residue position, the orientation of side-chains, peptide length and peptide backbone conformation. Chemically, four or more factors affect binding: the chemical properties of the amino acids, the overall chemistry of the binding groove, the overall chemistry of the peptide and the chemistry of the environment. Other factors include the presence or absence of anchor residues and uncertainty about which peptide residues form the core in the binding groove.

Epitope prediction ≈ “Fishing”

Computational models can help identify T cell epitopes Suggest candidate epitopes by in silico screening of entire proteins and even proteomes with specificity at: the allele level the supertype level disease-implicated alleles alone. Minimize the number of wet-lab experiments Cut down the lead time involved in epitope discovery and vaccine design

Predicting MHC-binding peptides (Tong, Tan and Ranganathan (2007) Briefings in Bioinformatics 8: 96-108). Sequence-based approach: pattern recognition techniques (binding motif, matrices, ANN, HMM, SVM). Main limitations: require large amount of data for training; preclude data with limited sequence conservation. Structure-based approach: rigid backbone modeling techniques; flexible docking techniques. Main advantage: large training datasets unnecessary

Our aim: Structure-based prediction of MHC-binding peptides

Why structure? Great potential to: generate biologically meaningful data for analysis; predict candidate peptides for alleles that have not been widely studied, where sequence-based approaches fail or are not attempted; predict binding affinity of peptides; predict non-contiguous epitopes. Structure determination through experimental methods is both expensive and time-consuming. Has not been extensively studied due to high computational costs and development complexity. There are several reasons why structure-based prediction is adopted in this research. One is that I am interested in understanding the molecular basis of what constitutes a 'binding peptide' and how it differs from non-binding peptides; in particular, what the selection mechanism is and how it differs among alleles. This approach can be used to generate biologically meaningful data for analysis, such as the interacting residues and the important bonds and contacts. To date, structure-based approaches have not been extensively studied because the techniques are complex to develop and computation times are long, so one of my aims is to investigate how well structure-based prediction can be applied in this context. Another reason is that structure-based prediction offers the potential to predict candidate peptides for alleles that have not been widely studied. This is particularly attractive when there is too little data to train machine-learning techniques. Moreover, structure-based prediction offers the potential to predict the absolute binding affinity of peptides. Lastly, determination of crystal structures through experimental methods is very expensive and time-consuming, and structure-based prediction can provide a reliable estimate of the crystal structure when a high quality template is available.

Existing Structure-based Prediction Techniques: Protein Threading [Altuvia et al. 1995; Schueler-Furman et al. 2000]; Homology Modeling [Michielin et al. 2000]; Rigid/Flexible Docking [Rosenfeld et al. 1993; Sezerman et al. 1996; Rognan et al. 1999; Desmet et al. 2000; Michielin et al. 2003]. In general, structure-based techniques can be broadly classified into three categories: protein threading, homology modeling and docking. (1) Protein threading involves substituting the amino acid sequence of a known peptide bound to a given MHC with the target peptide while retaining the backbone conformation. (2) In homology modeling, the amino acid sequence of a peptide is adapted to the structure of a homologous protein with known 3D structure. (3) Docking attempts to find the best-fit conformation of the peptide within the binding groove. There are two types of docking techniques: rigid docking does not consider flexibility of the ligand and receptor, while flexible docking considers the flexibility of the ligand, the receptor, or both.

Will existing structure-based techniques suffice? Quality of predicted structures: Protein Threading, Homology Modeling and Rigid Docking cannot handle peptide flexibility; available flexible docking techniques have poor accuracy and are too slow. Usability of models to predict binding: existing free energy scoring functions were tested only on small datasets and show poor correlation with experimental data. Two issues must be addressed for structure-based prediction, namely the quality of the generated models and whether these models can be successfully applied to predict binding. Existing structure-based prediction techniques have difficulty with both. For the first issue, the main problem faced by protein threading, homology modeling and rigid docking techniques is that they cannot handle peptide flexibility. This problem is critical because MHC-bound peptides are generally short, so an inaccurate structure has a large impact on the usability of the modeled structure. In addition, existing flexible docking techniques are mostly too slow and inaccurate for application. Concerning the second issue, the majority of existing free energy scoring functions have only been tested on a small set of up to 5 MHC-peptide complexes, and their application to modeled structures is hypothetical. In addition, none of them correlated well with experimental binding data.

Hypothesis for epitope selection Peptides bound to MHC alleles are similar to substrates bound to enzymes “Lock-and-key” mechanism for peptide selection Shape Size Electrostatic characteristics

Structural Immunoinformatic Database development. Outline: Introduction; Structural Immunoinformatic Database development; Data Analysis; Computational models; Applications. Diagram labels: Databases, ontologies; Sequences; Structures; Genetics and populations; Basic immunology

MPID: MHC-Peptide Interaction Database. Govindarajan et al. (2003) Bioinformatics, 19: 309-310. RDB of 82 curated pMHC complexes (Class I: 64 & Class II: 18)

Peptide/MHC interaction characteristics: Length; Gap volume; Interface area; Interacting residues; Intermolecular hydrogen bonds. Gap index = Gap volume / Interface area

MPID-T: MHC-Peptide-T Cell Receptor Interaction Database. Tong et al. (2006) Applied Bioinformatics, 5: 111-114. 187 curated pMHC, 16 with TCR. Human: 110, Murine: 74 and Rat: 3. Alleles: 40. (interface area, H bonds, gap volume and gap index)
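As a small worked example of the gap index defined above, the Python sketch below divides gap volume by interface area. The function name is hypothetical; the numbers are the HLA-A2 superfamily means reported later in this presentation, used here only for illustration.

```python
def gap_index(gap_volume_a3, interface_area_a2):
    """Gap index = gap volume (cubic angstroms) / interface area (square angstroms)."""
    if interface_area_a2 <= 0:
        raise ValueError("interface area must be positive")
    return gap_volume_a3 / interface_area_a2


if __name__ == "__main__":
    # HLA-A2 superfamily means quoted in this presentation (gap volume, interface area).
    print(round(gap_index(799.8, 846.3), 2))
```

Note that a ratio of superfamily means is not the same quantity as the mean of per-complex gap indices, so it need not reproduce a reported mean gap index exactly.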

Distribution of MHC by allele 101 new entries 187 entries (Human: 110; Murine: 74; Rat: 3) 134 non-redundant entries (class I: 100; class II: 34) 121 class I and 41 class II entries 26 HLA alleles (class I: 18; class II: 8) 14 rodent alleles (class I: 8; class II: 6) 16 TCR/peptide/MHC complexes

Peptide/MHC binding motifs Polar Amide Basic Acidic Hydrophobic Conserved peptide properties in solution structures Classified according to Alleles Peptide length

How to obtain structures of experimentally unsolved alleles? There were only 36 crystal structures of unique MHC alleles (2006) vs. 1765 unique MHC alleles identified in the IMGT/HLA database. Structure determination through experimental methods is both expensive and time-consuming. Homology model building for alleles with no structural data!

Data Analysis of pMHC Class I complexes. Outline: Introduction; Structural Immunoinformatic Database development; Data Analysis of pMHC Class I complexes; Computational models; Applications. Diagram labels: Structures; Data & text mining; Maths/Stats

MHC Class I superfamilies have different interaction characteristics. Single linkage cluster analysis of 68 pMHC Class I complexes from 13 alleles (all available A and B).
Superfamily            HLA-A2 (36 entries)   HLA-B7 (12 entries)   HLA-B27 (18 entries)
Interface area (Å^2)   846.3±48.9            876.7±72.4            934.0±136.0
Gap volume (Å^3)       799.8±195.2           870.2±198.0           985.1±101.5
Gap index              0.9±0.2               1.0±0.1               1.0±0.3
Hydrogen bonds         11.1±1.9              14.3±2.3              17.9±2.8
Annotations on the hydrogen bond row: "Concentrated at pockets A, B, F"; "Well distributed".

Data: 68 peptide–HLA complexes spanning 13 class I alleles from MPID-T. Hierarchical clustering: agglomerative algorithm, with the distance between structures computed by the single-linkage method (MATLAB version 7.0) based on the separation between each pair of data points. Nearest neighbors were merged into clusters. Smaller clusters were then merged into larger clusters based on inter-cluster distances, until all structures were combined. The last 3 levels were considered for defining HLA class I supertypes. Interaction parameters significant for the characterization of the peptide/MHC interface: intermolecular hydrogen bonds; pMHC interface area; gap volume; gap index. Binding characteristics of HLA supertypes analyzed.
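The agglomerative single-linkage procedure described above can be sketched in plain Python. This is a minimal illustration, not the MATLAB routine used in the study: the function names are hypothetical, the distance is assumed to be Euclidean, and the descriptor vectors in the usage example are made-up values rather than MPID-T data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points, n_clusters):
    """Merge the two nearest clusters until n_clusters remain.

    The distance between two clusters is the smallest distance between
    any member of one and any member of the other (single linkage).
    Returns lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Made-up two-dimensional descriptors, for illustration only.
    demo = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
    print(single_linkage(demo, 3))
```

In practice the descriptors (interface area, gap volume, gap index, hydrogen bonds) have different units and scales, so some standardisation before computing distances would probably be needed; that step is left out here.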

Do the Class I alleles aggregate into “superfamilies” using receptor-ligand interaction patterns? Legend: B7, black; B27, green; B44, orange; B62, blue; Outlier/B8, red.

MHC Class I superfamilies from receptor-ligand interactions: 80 HLA class I complexes; 13 class I alleles; five descriptors; hierarchical clustering using nearest neighbor algorithm; 77% consensus with data from other groups. Supertype definition: receptor structure, ligand binding motifs, or receptor-ligand interaction patterns. Tong, Tan and Ranganathan (2007) Bioinformatics, 23: 177-183. Legend: B7, black; B27, green; B44, orange; B62, blue; Outlier/B8, red.

Outline: Introduction; Structural Immunoinformatic Database development; Data Analysis; Computational models; Applications. Diagram labels: Sequences; Structures; Physics/Chemistry; Maths/Stats

Two-step approach to predict MHC-binding peptides Finding the best fit conformation (docking) of peptides within the MHC binding groove Screening potential binders from the background

Docking is a computationally exhaustive procedure. Large number of possible peptide conformations: 3 global translational degrees of freedom; 3 global rotational degrees of freedom; 1 conformational degree of freedom for each rotatable bond; >10^10 possible conformations for a 10-residue peptide. [Figure: peptide backbone atoms (N, Cα, C, O) and side chain R in an x, y, z frame.] The backbone torsion angles φ and ψ define a space of (2k)^n possible conformations for a protein with n amino acids (assuming each phi and psi pair is allowed to assume k distinct values).

Conservation of nonamer peptide backbone conformation. Class I peptides: N-termini residues 0.02 – 0.29 Å; C-termini residues 0.00 – 0.25 Å. Class II binding registers (only 9 residues fit in the binding groove): 0.01 – 0.22 Å; 0.02 – 0.27 Å

Rapid docking of peptide to MHC. Tong, Tan & Ranganathan (2004) Protein Sci. 13:2523-2532. 1. Anchoring root fragments to reduce search space (pseudo-Brownian rigid body docking). 2. Loop modeling (loop closure of central backbone by satisfaction of spatial restraints). 3. Ligand backbone and side-chain refinement (entire backbone and interacting side-chains).

Benchmarking with experimental structural data. Docking peptides from 40 non-redundant complexes back into the original allele structure. 85% (34/40) of peptides within RMSD of 1.00 Å from experimental structure – min 0.09 Å, max 1.53 Å. [Figure: original peptide docked into original allele.]

Benchmarking using a single template. Docking peptides from 13 non-redundant complexes into a single template structure. 77% (10/13) of peptides within RMSD of 1.00 Å from experimental structure – min 0.38 Å, max 1.48 Å. [Figure: original peptide and original allele vs. single template.]
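The benchmark above is summarised by the RMSD between docked and experimental peptide backbones. A minimal Python sketch of that measure follows. It assumes both coordinate sets are already in the same receptor frame, so no superposition step is applied, and that atoms are listed in matching order; the function names and the example RMSD values are hypothetical.

```python
import math


def backbone_rmsd(model, reference):
    """Root-mean-square deviation between two equally long lists of (x, y, z) atoms."""
    if not model or len(model) != len(reference):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(model, reference)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(model))


def fraction_within(rmsds, cutoff=1.0):
    """Fraction of docked peptides whose RMSD is at or below the cutoff (in angstroms)."""
    return sum(r <= cutoff for r in rmsds) / len(rmsds)


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(backbone_rmsd(model, reference))
    # Made-up RMSD values, for illustration only.
    print(fraction_within([0.09, 0.50, 1.00, 1.53]))
```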

Benchmarking with existing techniques (Author, Technique: Peptide, RMSD values as listed on the slide, in the column order RMSDa, RMSDb).
Rognan et al., Simulated Annealing: TLTSCNTSV 1.04, 0.46; FLPSDFFPSV 1.59, 1.10; GILGFVFTL 0.32; ILKEPVHGV 0.87; LLFGYPVYV 0.78, 0.33
Desmet et al., Combinatorial Buildup Algorithm: RGYVYQGL 0.56
Rosenfeld et al., Multiple Copy Algorithm: FAPGNYPAL 2.70, 0.40, 1.40
Sezerman et al.: ILKGPVHGV 1.30, 1.60, 2.20
a RMSD of peptide backbone obtained from respective authors. b RMSD of peptide backbone obtained in our work from redocking bound complexes and single template respectively.

Quantitative separation of binders from non-binders: empirical free energy scoring function. $\Delta G_{bind} = \alpha \Delta G_{H} + \beta \Delta G_{S} + \gamma \Delta G_{EL} + C$, where ΔG_bind = binding free energy; ΔG_H = hydrophobic term; ΔG_S = decrease in side chain entropy; ΔG_EL = electrostatic term; C = entropy change in system due to external factors. α, β, γ optimized by least-square multivariate regression with experimental binding affinities (IC50) of MHC-peptides in training dataset (Rognan et al., 1999)
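The least-squares fit of the coefficients α, β, γ and the constant C can be sketched as follows. This is only an illustration under stated assumptions: it solves the normal equations in plain Python rather than using whatever regression software the authors used, the function names are hypothetical, and the energy terms in the usage example are synthetic numbers generated from known coefficients, not experimental data.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the energy terms are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_scoring_function(terms, observed):
    """Fit (alpha, beta, gamma, C) from rows of (dG_H, dG_S, dG_EL) and observed dG."""
    rows = [list(t) + [1.0] for t in terms]  # trailing 1.0 carries the constant C
    k = 4
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, observed)) for i in range(k)]
    return tuple(solve_linear(ata, aty))


def predict(coeffs, term):
    """Evaluate dG_bind = alpha*dG_H + beta*dG_S + gamma*dG_EL + C."""
    alpha, beta, gamma, c = coeffs
    return alpha * term[0] + beta * term[1] + gamma * term[2] + c


if __name__ == "__main__":
    # Synthetic terms and coefficients, for illustration only.
    true_coeffs = (0.5, 1.5, 0.2, -3.0)
    terms = [(-10.0, 2.0, -3.0), (-8.0, 1.0, -5.5), (-12.5, 3.0, -1.0),
             (-6.0, 2.5, -4.0), (-9.0, 4.0, -2.0), (-11.0, 1.0, -6.0),
             (-7.5, 3.5, -0.5)]
    observed = [predict(true_coeffs, t) for t in terms]
    print(fit_scoring_function(terms, observed))
```

With noise-free synthetic data the fit recovers the generating coefficients; with real binding data the residual error and a cross-validated error would be the quantities of interest.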

Test case: MHC Class II DQ8. DQ3.2β (DQA1*0301/DQB1*0302) is involved in several autoimmune diseases: Celiac disease; insulin-dependent diabetes mellitus; IDDM-associated periodontal disease; autoimmune polyendocrine syndrome type II

Data used. Structure: 1JK8 - DQ3.2β–insulin B9-23 complex. Dataset I: 127 peptides with experimentally determined IC50 values [70 high-affinity (IC50 < 500 nM), 13 medium-affinity (500 nM < IC50 < 1500 nM) and 23 low-affinity (1500 nM < IC50 < 5000 nM) binders, and 21 non-binders (IC50 > 5000 nM)] derived from biochemical studies; 87 with known binding registers. Dataset II: 12 Dermatophagoides pteronyssinus (Der p 2) peptides with experimental T-cell proliferation values from functional studies, with 7 peptides eliciting DQ3.2β-restricted T-cell proliferation.

Scoring: Training & testing datasets. Training: 56 binding conformations with known registers; 30 non-binding conformations from 3 non-binders. Testing: Test set 1 – 68 peptides from biochemical studies (16 strong; 13 medium; 21 weak; 18 non-binders); Test set 2 – 12 peptides from functional studies (7 elicit T-cell proliferation)

Screening class II binding register: a sliding window approach
E285B 112-126 peptide: Y Q T I E E N I K I F E E D A
Core sequence    Binding energy
YQTIEENIK        -23.12
QTIEENIKI        -21.34
TIEENIKIF        -25.32
IEENIKIFE        -29.53
EENIKIFEE        -32.27
ENIKIFEED        -21.72
NIKIFEEDA        -22.95
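The sliding-window enumeration of candidate binding registers can be written in a few lines of Python. The sketch below uses the peptide and the binding energies shown on this slide; the function names are hypothetical, and selecting the register with the lowest energy is an assumption about how the scores are read, not a statement of the authors' exact selection rule.

```python
def sliding_windows(sequence, width=9):
    """Return (1-based start position, core) for every window of the given width."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [(start + 1, sequence[start:start + width])
            for start in range(len(sequence) - width + 1)]


def best_register(energies):
    """Return the core with the lowest (most favourable) binding energy."""
    return min(energies, key=energies.get)


if __name__ == "__main__":
    peptide = "YQTIEENIKIFEEDA"  # E285B 112-126 peptide from the slide
    slide_energies = {
        "YQTIEENIK": -23.12, "QTIEENIKI": -21.34, "TIEENIKIF": -25.32,
        "IEENIKIFE": -29.53, "EENIKIFEE": -32.27, "ENIKIFEED": -21.72,
        "NIKIFEEDA": -22.95,
    }
    for start, core in sliding_windows(peptide):
        print(start, core, slide_energies[core])
    print("Lowest-energy register:", best_register(slide_energies))
```

A 15-residue peptide yields 15 - 9 + 1 = 7 candidate nonamer cores, which matches the seven rows listed on the slide.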

Training and test sets Training of the DQ3.2β prediction model was performed by sampling the bound conformations of binding peptides with experimentally determined registers that can be recognized by MHC, and the best conformations of non-binding peptides without any preferred register in the binding groove. Dataset I was divided into training and test datasets. Training set: 59 peptides with 56 binding conformations with known registers and 30 non-binding conformations generated from the 3 non-binding peptides without any binding registers. Test set 1: 68 peptides (the rest of Dataset I) with experimental IC50 values (16 high-affinity binders, 13 medium affinity binders, 21 low affinity binders and 18 non-binders) from biochemical studies (with 31 binding registers) and Test set 2: all 12 peptides from Dataset II, with known T-cell proliferation values.

Binding energy determination. ICM software (Abagyan and Totrov, 1999). Hydrophobic energy: computed as the product of solvent accessible surface area. Entropic contribution from the protein side-chains: computed from the maximal burial entropies for each type of amino acid and their relative accessibilities. Electrostatic term: composed of receptor-ligand coulombic interactions and the desolvation of partial charges transferred from an aqueous medium to a protein core environment; numeric solution of the Poisson equation using an implementation of the boundary element algorithm. Entropy change in the system: due to the decrease of free molecular concentration and the loss of rotational/translational degrees of freedom upon binding.

4-step protocol used. Docking: A. Anchoring root fragments (probes) to reduce search space. B. Loop modeling. C. Refinement of binding register. D. Extension of flanking residues for MHC Class II.

Parameters optimized. Default ICM coefficients (α = β = γ = 1; C = 0) resulted in poor correlation (r^2 = 0.43, s = 2.91 kJ/mol). The optimal scoring function was obtained after 10-fold cross-validation (q^2 = 0.85, s_press = 2.20 kJ/mol).

Accuracy estimates: sensitivity (SE), specificity (SP) and receiver operating characteristic (ROC) analysis. Fraction of binders correctly predicted: SE = TP/(TP+FN); fraction of non-binders correctly predicted: SP = TN/(TN+FP). The ROC curve is generated by plotting SE as a function of (1-SP) for various classification thresholds. The area under the ROC curve (AROC) provides a measure of overall prediction accuracy: AROC<70% for poor, AROC>80% for good and AROC>90% for excellent predictions. We consider values of SP≥80% useful in practice and assessed SE for three values of SP (80%, 90% and 95%).

Accuracy estimates. Sensitivity (SE) = fraction of binders correctly predicted = TP/AP, where AP = TP+FN. Specificity (SP) = fraction of non-binders correctly predicted = TN/AN, where AN = TN+FP. Area under ROC (receiver operating characteristic) curve: >90% excellent; >80% good
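The accuracy measures defined above can be computed with a short Python sketch. It assumes each peptide has a score where a higher value means "more likely a binder" (for binding energies, the negated energy could serve), and it estimates the area under the ROC curve as the fraction of binder/non-binder pairs that are ranked correctly, counting ties as half. The function names and the example scores are hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (SE, SP) when peptides scoring >= threshold are predicted binders.

    labels holds True for experimental binders and False for non-binders.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def area_under_roc(scores, labels):
    """Area under the ROC curve via pairwise ranking of binders vs. non-binders."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up scores and labels, for illustration only.
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
    labels = [True, True, False, True, False, False]
    print(sensitivity_specificity(scores, labels, threshold=0.5))
    print(area_under_roc(scores, labels))
```

Sweeping the threshold and recording SE against (1-SP) at each value traces out the ROC curve itself; the pairwise estimate above gives the same area without building the curve.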

Results for Training set High SE (good for most predictions) Very few FPs, but also fewer predictions

Screening class II binding register: HLA-DQ8 prediction accuracy for Test Set 1. Group: LMH, MH, H; AROC: 0.88, 0.93. Classification of binding peptides: High-affinity binders (H): IC50 ≤ 500 nM; Medium-affinity binders (M): 500 nM < IC50 ≤ 1500 nM; Low-affinity binders (L): 1500 nM < IC50 ≤ 5000 nM

Test Set 1: Improved detection of binders lacking position specific binding motifs

Binding registers and T-cell proliferation. 20/23 (87%) binding registers. Only register (aa 4-12) from Test Set 2 (Der p 2: 1-20) (SE=0.80; SP(LMH)=0.90). Top 5 predictions are experimental positives at very stringent threshold criteria (SE=0.95; SP(H)=0.63).

Multiple registers (SP=0.95, SE(LMH)=0.81): 58% of Test Set 1. Mainly for medium and high binders. Experimental support: Sinha et al. for DRB1*0402. Is this why binding motifs are unsuccessful?

Introduction Structural Immunoinformatic Database development Data Analysis Computational models developed Applications

Pemphigus vulgaris (PV): Autoimmune blistering skin disorder. Characterized by autoantibodies targeting desmoglein-3 (Dsg3). Strong association with DR4 and DR6 alleles. Image sources: www.aafp.org, adam.about.com, http://www.medscape.com

Who are the major players in PV? DR4 PV implicated alleles (for Semitic) DRB1*0401 DRB1*0402 DRB1*0404 DRB1*0406 DR6 PV implicated alleles (for Caucasians) DRB1*1401 DRB1*1404 DRB1*1405 DQB1*0503

What is known about DR4? DR4 PV implicated alleles (DRB1*0401, *0402, *0404, *0406) High sequence conservation 97.9 – 99.0% identity 98.4 – 99.5% similarity High structural conservation Cα RMSD <0.22 Å for all key binding pockets 7 polymorphic residues within binding cleft Pocket 1 (β86), Pocket 4 (β70, 71, 74) Pocket 6 (β11) Pocket 7 (β71) Pocket 9 (β37)

What is known about DR6? DR6 PV implicated alleles (DRB1*1401, *1404, *1405, DQB1*0503). High sequence conservation: 85.8 – 94.1% identity, 83.2 – 97.3% similarity. High structural conservation: Cα RMSD <0.22 Å for all key binding pockets. 14 polymorphic residues within binding clefts: Pocket 1 (β86), Pocket 4 (β13, 70, 71, 74, 78), Pocket 6 (β11), Pocket 7 (β28, 30, 67, 71), Pocket 9 (β9, 37, 57, 60).
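
The Cα RMSD quoted for the binding pockets is a root-mean-square distance over matched Cα atoms. A minimal sketch is given below, assuming the two structures have already been superimposed and that the coordinates are supplied in the same residue order; the superposition step itself (for example a least-squares fit) is not shown, and the function name and coordinates are illustrative only.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD (in the coordinate units, e.g. Å) between matched Cα positions.

    coords_a, coords_b: equal-length lists of (x, y, z) tuples that are
    assumed to be superimposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy coordinates: every atom is displaced by 0.2 Å along z.
pocket_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pocket_b = [(0.0, 0.0, 0.2), (1.0, 0.0, 0.2)]
rmsd = ca_rmsd(pocket_a, pocket_b)
```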

Clues… 9 stimulatory Dsg3 peptides tested on PV patients possessing DR4 and DR6 PV implicated alleles: Dsg3 96-112 (DR4, DR6); Dsg3 191-205 (DR4, DR6); Dsg3 206-220 (DR4, DR6); Dsg3 252-266 (DR4, DR6); Dsg3 342-356 (DR4, DR6); Dsg3 380-394 (DR4, DR6); Dsg3 763-777 (DR4, DR6); Dsg3 810-824 (DR4); Dsg3 963-977 (DR4).

Disease associated alleles vs. innocent bystanders. DR4 PV: 8/9 investigated Dsg3 peptides fit perfectly into DRB1*0402; atomic clashes with all other investigated DR4 subtypes. DR6 PV: 6/9 investigated Dsg3 peptides fit perfectly into DQB1*0503; atomic clashes with all other investigated DR6 subtypes. HLA association in DR6 PV more likely to be at DQ than DR locus. Consistent with experimental work done by Sinha et al. (2002, 2005, 2006). Tong et al. (2006) Immunome Research, 2: 1.

Whither sequence motifs (again!)? 1/9 investigated Dsg3 peptides fits existing binding motifs. Flanking residues – clashes in fitting binding register. Register-shift for Peptide V (Dsg3 342-356). Detected binding register: Dsg3 346-354. Binding motifs: Dsg3 347-355 (Veldman et al., 2003); Dsg3 345-353 (Sinha et al., 2006). Veldman: P1 hydrophobic; P4 positive; P6 small, hydrophobic or hydrophilic. Sinha: P4 positive, neutral; P6 large, small.

Large-scale screening of Dsg3 peptides. Tong et al. (2006) BMC Bioinformatics, 7(Suppl 5): S7. Docking of 936 15mer Dsg3 peptides generated using a sliding window of size 15 across the entire Dsg3 glycoprotein. Figure: Dsg3 peptide (sliding window width 15, N to C terminus) containing a binding register (sliding window width 9) and flanking residues. Training set: 8 peptides each, with exp. IC50 values and known binding registers (5 binders and 3 non-binders).
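
The two nested sliding windows described on this slide can be sketched as follows. This is only an illustration of the windowing idea, not the authors' docking pipeline: the function names are hypothetical and the 20-residue sequence is a toy stand-in, not Dsg3. A sequence of length L yields L − w + 1 windows of width w, so every 15mer contains 7 candidate 9-residue binding registers.

```python
def sliding_windows(sequence, width):
    """Return (1-based start, window) for every window of the given width."""
    return [(i + 1, sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]


def registers_with_flanks(peptide, core_width=9):
    """Split a peptide into (N-terminal flank, binding register, C-terminal flank)
    for every possible register position."""
    return [(peptide[:i], peptide[i:i + core_width], peptide[i + core_width:])
            for i in range(len(peptide) - core_width + 1)]


# Toy 20-residue sequence used only to show the bookkeeping.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
peptides = sliding_windows(toy_protein, 15)          # 6 overlapping 15mers
first_start, first_peptide = peptides[0]
registers = registers_with_flanks(first_peptide)      # 7 registers per 15mer
```

In a screening setting each (flank, register, flank) triple would then be passed to the scoring model, so the flanking residues remain available for clash checks.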

Common epitopes possibly responsible for inducing disease in DR4 & DR6 patients. Significant level of cross reactivity observed between DRB1*0402 and DQB1*0503 (AROC=0.93). 57% of peptides investigated in this study predicted to bind to both alleles with high affinity. 90% of known Dsg3 peptides predicted to bind to both alleles. 12/20 top predicted DQB1*0503-specific Dsg3 peptides from transmembrane region. All top predicted DRB1*0402-specific Dsg3 peptides from extracellular regions. Disease initiation implications: DR4 from ECD; DR6 from TM.

Multiple binding registers revisited. 76% (410/539) predicted high-affinity binders to DRB1*0402 possess > 2 binding registers. 57% (384/673) predicted high-affinity binders to DQB1*0503 possess > 2 binding registers. 66% (354/539) bind both alleles at different registers. Similar proportion (70%) detected in known binders to both alleles. Both alleles bind similar peptides via different binding registers.
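
Counts like those above could be tallied from per-register predictions along the following lines. Everything here is a sketch under stated assumptions: the data layout (peptide → {register start: score}), the score convention (higher = stronger), the threshold, and the reading of "different registers" as "the sets of passing registers are not identical" are all hypothetical choices, not the method used in the study.

```python
def passing_registers(register_scores, threshold):
    """Start positions whose predicted score passes the threshold."""
    return {start for start, score in register_scores.items()
            if score >= threshold}


def summarize_registers(allele_a, allele_b, threshold):
    """Return (peptides with > 2 registers for allele A,
               peptides binding both alleles via different register sets)."""
    multi_a = sorted(p for p, regs in allele_a.items()
                     if len(passing_registers(regs, threshold)) > 2)
    different = []
    for peptide in sorted(allele_a.keys() & allele_b.keys()):
        regs_a = passing_registers(allele_a[peptide], threshold)
        regs_b = passing_registers(allele_b[peptide], threshold)
        if regs_a and regs_b and regs_a != regs_b:
            different.append(peptide)
    return multi_a, different


# Invented toy predictions for two alleles.
scores_a = {"pep1": {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.1}, "pep2": {1: 0.9, 2: 0.2}}
scores_b = {"pep1": {1: 0.1, 5: 0.9}, "pep2": {1: 0.95}}
multi, different = summarize_registers(scores_a, scores_b, threshold=0.5)
```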

What next? We have developed a predictive model for HLA-C (Cw*0401) with very limited (only six) experimental binding values. The model yields excellent results for test data (AROC=0.93). Application to determine immunological hot spots for HIV-1 p24gag and gp160env glycoproteins shows binding energies similar to HLA-A and –B.

Conclusions Computational models for immunogenic epitope prediction can be successfully developed, even for alleles with limited experimental data. While computations can never completely replace “wet-lab” experiments, in silico predictions can significantly cut down the development time of therapeutic vaccines.

1. Genome analysis. Approaches: EST analysis; annotation pipeline using workflow strategies. Applications: parasitic nematodes; cancer EST data. Outcomes: comprehensive annotation at the gene and protein levels; novel and/or pathogen-specific genes; immune response evasion strategies.

2. Transcriptome analysis. Approaches: graph formalism for alternative splicing; genome-wide analysis. Applications: Drosophila genome; chicken compared to human and mouse; kallikrein variants as markers. Outcomes: new mRNA-gDNA alignment method, MGAlign & MGAlignIt; first splicing graph database, DEDB; web server for splicing graphs, ASGS; sub-graph elements for alternative splicing; multi-species splicing graph database, GraphDB.

3. Protein/Proteome research: Origin and evolution of structural domains. Approaches: intron mapping to domain boundary; all eukaryotic proteins analyzed. Applications: domain prediction in EST/genome data; effect of splice variants on domains. Outcomes: new database of protein coding genes, XPro; visualization of intronic locations on protein structural domains, XDomView; analysis tool, Go Module Viewer.

3. Protein/Proteome research: Small disulfide-rich proteins (<100 aa per domain; ≥ 2 SS bonds). Approaches: multiple structure alignment and hierarchical classification; comparative modeling rules; sequence, structure and evolutionary analysis of Potato II inhibitor family. Outcomes: new database, DSFD; server for model building, SDPMOD; understanding of wound-induced protease inhibitor folding. Applications: design of protease inhibitors, channel modulators, growth regulators.

3. Protein/Proteome research: Protease cleavage site prediction. Approaches: detailed structural modeling and docking of signal peptide moiety to signal peptidase I; SVM for caspases. Applications: enhanced production of therapeutic and commercial heterologous proteins; apoptosis initiation. Outcomes: new databases, SPdb, CasBase; server for caspase cleavage prediction, CASVM; signal peptide cleavage prediction (under development).

4. Systems Biology. Approaches: holistic computational, molecular biology and FRET study to locate secretion roadblocks; EST analysis of host-parasite interactions. Applications: Trichoderma reesei as fungal bioreactor; parasites that lead to liver cancer (food-borne trematode Opisthorchis viverrini) and bladder cancer (Schistosoma haematobium). Outcomes: improved heterologous protein production using filamentous fungi; understanding of how parasites evade host immune activation.

6. Genome-Phenome mapping. Approaches: mutation data for non-laboratory animals; mapping to OMIM; mapping to structure. Applications: OMIA-OMIM mapping to structure; correlation between genotype and disease phenotype. Outcomes: OMIA database, with links to OMIM (courtesy NCBI); mutations linked to severity of disease for α-D-mannosidosis; predictions of new human disease mutations from known mutation sites in cow, cat and guinea pig.

7. Biodiversity Informatics: Customary medicinal plants. Approaches: integrating, visualizing and analyzing ethnobotanical, phytochemical and pharmacological data on customary medicinal plants; data from Australian Aboriginal elders and Indian Siddha doctors. Applications: novel antimicrobial, anti-inflammatory and anti-cancer lead compounds. Outcomes: CMkb, an integrated knowledgebase.

Dedications Prof. Bernard Pullman Mme. Alberte Pullman My brother, a CML survivor

Acknowledgements Dr. (Victor) J.C. Tong, NUS&I2R, Singapore A/Prof. Tin Wee Tan, NUS Dr. Animesh Sinha, Weill Medical College of Cornell University & Michigan State University, USA Drs. J. Tom August (JHU) and Vladimir Brusic (DFCI) (NIAID-NIH Grant #5 U19 AI56541 & Contract #HHSN266200400085C). All of you!