Published by Ferdinand Joseph Barton. Modified over 9 years ago.
2
Predicting interactions between small molecules and proteins
› Vital to the drug discovery process
› Key to understanding biological processes
3 classes of drug targets
› G-protein-coupled receptors (GPCRs)
› Enzymes
› Ion channels
3
Consider each target independently from other proteins
Ligand-based approach
› Compare to known ligands of the target
› Requires knowledge about other ligands of a given target
Structure-based or docking approaches
› Use the 3D structure of the target to determine how well a ligand can bind
› Require the 3D structure of the target
› Very time consuming
Cannot apply if no ligand or 3D structure is known for a given target
4
Chemical space:
› Set of all small molecules
Biological space:
› Set of all proteins or protein families
Mine the entire chemical space for interactions with the biological space
Knowledge of some ligands for a target can help to predict ligands for similar targets
5
Ligand-based chemogenomics
› Look at families or subfamilies of proteins
› Model ligands at the level of a family
Target-based chemogenomics
› Cluster receptors based on ligand binding site similarity
› Use known ligands for each cluster to infer shared ligands
Target-ligand approach
› Use binding information for targets to predict ligands for another target in a single step
6
Bock and Gough (2005)
› Describe ligand-receptor complexes by merging ligand and target descriptors
› Use machine learning methods to predict if a ligand-receptor pair forms a complex
Erhan et al. (2006)
› Merge a set of ligand descriptors with a set of receptor descriptors in a framework of neural networks and support vector machines
› Offers great flexibility in the choice of descriptors
7
Investigates different types of descriptors
Builds upon recent developments in kernel methods
› In bio- and cheminformatics
Tests different methods for prediction of ligands
› For 3 major classes of targets
Shows that the choice of representation greatly affects accuracy
New kernel based on hierarchies of receptors outperforms all other descriptors
› Performs especially well for targets with few or no known ligands
8
Given n target/molecule pairs $(t_1, c_1), \dots, (t_n, c_n)$ known to form complexes or not
› Each pair is represented by a vector $\Phi(t,c)$
Estimate a linear function
› $f(t,c) = w^\top \Phi(t,c)$
Whose sign is used to predict if a chemical c can bind to a target t
The vector w is estimated from the training set
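The decision rule on this slide can be sketched in a few lines. This is only an illustration: the weight vector and pair vector below are made-up numbers, the function names are hypothetical, and treating a score of exactly zero as "does not bind" is an assumption the slide does not state.

```python
def decision_value(w, phi):
    """Return f(t, c) = w^T Phi(t, c) for a pair already encoded as a vector."""
    if len(w) != len(phi):
        raise ValueError("w and phi must have the same length")
    return sum(wi * xi for wi, xi in zip(w, phi))


def predicts_binding(w, phi):
    """A positive sign is read as 'predicted to bind' (zero counts as no binding)."""
    return decision_value(w, phi) > 0


# Hypothetical values, for illustration only.
w = [0.5, -1.0, 0.25]        # weights that would come from training
phi_pair = [1.0, 0.0, 2.0]   # joint vector for one (target, molecule) pair

score = decision_value(w, phi_pair)
binds = predicts_binding(w, phi_pair)
```

In the approach described on the following slides, w is never built explicitly; the same decision value is obtained through kernels.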
9
Represent a molecule c by a vector $\Phi_{lig}(c) \in \mathbb{R}^{d_c}$
› Encodes physicochemical and structural properties
› Commonly used to model interactions between small molecules and a single target
Represent a protein t by a vector $\Phi_{tar}(t) \in \mathbb{R}^{d_t}$
› Captures properties of the protein's sequence or structure
› Commonly used to infer models that predict the structural or functional class of a protein
Need to represent a pair (c,t) in a single vector
› Capture interactions between features of the molecule and protein that can be useful predictors
› Multiply a descriptor of c with a descriptor of t
10
$\Phi(c,t) = \Phi_{lig}(c) \otimes \Phi_{tar}(t)$
Represents the set of all possible products of features of c and t
A vector of size $d_c \times d_t$
› The (i,j)-th entry is the product of the i-th entry of $\Phi_{lig}(c)$ by the j-th entry of $\Phi_{tar}(t)$
Size may be prohibitively large
Use kernel methods
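To make the size issue concrete, here is a minimal sketch of the tensor-product pair descriptor. The two input vectors are invented toy descriptors and `tensor_product` is a hypothetical helper name.

```python
def tensor_product(phi_lig, phi_tar):
    """Flattened outer product: entry (i, j) is phi_lig[i] * phi_tar[j]."""
    return [a * b for a in phi_lig for b in phi_tar]


# Toy descriptors, for illustration only.
phi_lig = [1.0, 0.0, 2.0]   # ligand descriptor, d_c = 3
phi_tar = [3.0, 4.0]        # target descriptor, d_t = 2

phi_pair = tensor_product(phi_lig, phi_tar)
pair_dim = len(phi_pair)    # d_c * d_t entries
```

With descriptors of a few thousand entries each, the pair vector already has millions of entries, which is why the explicit form is avoided.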
11
Can process large- or infinite-dimensional patterns if the inner product between any two patterns can be computed
Can factorize the inner product between two tensor product vectors
› $(\Phi_{lig}(c) \otimes \Phi_{tar}(t))^\top (\Phi_{lig}(c') \otimes \Phi_{tar}(t')) = \Phi_{lig}(c)^\top \Phi_{lig}(c') \times \Phi_{tar}(t)^\top \Phi_{tar}(t')$
Obtain the inner product between two tensor products
› $K((c,t),(c',t')) = K_{ligand}(c,c') \times K_{target}(t,t')$
› $K_{ligand}(c,c') = \Phi_{lig}(c)^\top \Phi_{lig}(c')$
› $K_{target}(t,t') = \Phi_{tar}(t)^\top \Phi_{tar}(t')$
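The factorization can be checked numerically on small vectors. The sketch below uses made-up integer descriptors and hypothetical helper names; it compares the inner product computed on the explicit tensor products with the product of the two small inner products.

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tensor_product(x, y):
    """Flattened outer product of x and y."""
    return [a * b for a in x for b in y]


# Toy descriptors for two (molecule, target) pairs, for illustration only.
lig_c, tar_t = [1, 0, 2], [3, 4]
lig_c2, tar_t2 = [2, 1, 1], [1, 2]

# Left-hand side: inner product in the d_c * d_t dimensional space.
explicit = dot(tensor_product(lig_c, tar_t), tensor_product(lig_c2, tar_t2))

# Right-hand side: product of two low-dimensional inner products.
factorized = dot(lig_c, lig_c2) * dot(tar_t, tar_t2)
```

Both values agree, so the pair kernel can be evaluated without ever forming the large vector.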
12
There have been impressive advances in the use of SVMs in chemoinformatics
Kernels have been designed using:
› Physicochemical properties of molecules
› 2D or 3D fingerprints
› Comparison of 2D and 3D structures of molecules
  › Detection of common substructures in 2D graphs
  › Encoding various properties of 3D structures
Used in single-target virtual screening and prediction of pharmacokinetics and toxicity
13
Classical choice
State-of-the-art performance
$K_{ligand}(c,c') = \frac{\Phi_{lig}(c)^\top \Phi_{lig}(c')}{\Phi_{lig}(c)^\top \Phi_{lig}(c) + \Phi_{lig}(c')^\top \Phi_{lig}(c') - \Phi_{lig}(c)^\top \Phi_{lig}(c')}$
$\Phi_{lig}(c)$ is a binary vector
Each bit indicates whether a given linear path of length l or less occurs as a subgraph of the 2D structure of c
› Choose l=8
Used ChemCPP software to compute
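The formula on this slide is the Tanimoto kernel on binary fingerprints. A minimal sketch follows, assuming the fingerprints are already available as 0/1 lists; the path enumeration done by ChemCPP is not reproduced here. Returning 0.0 for two all-zero fingerprints is a convention chosen for this example, not something the slide specifies.

```python
def tanimoto_kernel(x, y):
    """Tanimoto similarity between two equal-length 0/1 fingerprint vectors."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    xy = sum(a * b for a, b in zip(x, y))   # bits set in both
    xx = sum(a * a for a in x)              # bits set in x
    yy = sum(b * b for b in y)              # bits set in y
    denom = xx + yy - xy                    # bits set in x or y
    return xy / denom if denom else 0.0     # assumption: empty vs empty gives 0.0


# Toy fingerprints, for illustration only.
fp_a = [1, 1, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1]
similarity = tanimoto_kernel(fp_a, fp_b)
```

For binary vectors this is the number of shared bits divided by the number of bits set in either fingerprint.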
14
SVMs and kernel methods are widely used in bioinformatics
Various kernels have been proposed based on:
› Amino-acid sequence of proteins
› 3D structures of proteins
› Pattern of occurrences of proteins in multiple sequenced genomes
Used for various tasks related to structural or functional classification of proteins
15
$K_{Dirac}(t,t')$
› = 1 if t = t'
› = 0 otherwise
Represents different targets as orthonormal vectors
Orthogonality between two proteins t and t' implies orthogonality between all pairs (c,t) and (c',t') for any two molecules c and c'
› Learning is performed independently for each target protein
› Does not share any information of known ligands between different targets
16
$K_{multitask}(t,t') = 1 + K_{Dirac}(t,t')$
Removes the orthogonality
Combines target-specific properties of the ligands and general properties across all targets
Allows sharing of information during learning
Preserves the specificities of the ligands for each target
Does not weight the contribution of known interactions by how similar the targets are
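The difference between the Dirac and multitask target kernels shows up directly in the pair kernel. The sketch below uses hypothetical target identifiers and a made-up ligand kernel value of 0.5; only the two target kernel definitions come from the slides.

```python
def k_dirac(t, t2):
    """1 for the same target, 0 otherwise."""
    return 1.0 if t == t2 else 0.0


def k_multitask(t, t2):
    """Dirac kernel plus a constant shared by all targets."""
    return 1.0 + k_dirac(t, t2)


def pair_kernel(k_target, k_ligand_value, t, t2):
    """K((c,t),(c',t')) = K_ligand(c,c') * K_target(t,t')."""
    return k_ligand_value * k_target(t, t2)


# Made-up ligand similarity and hypothetical target names, for illustration only.
k_lig = 0.5
dirac_other = pair_kernel(k_dirac, k_lig, "T1", "T2")          # different targets
multitask_other = pair_kernel(k_multitask, k_lig, "T1", "T2")  # different targets
multitask_same = pair_kernel(k_multitask, k_lig, "T1", "T1")   # same target
```

Under the Dirac kernel, pairs on different targets never interact, so each target is learned alone. Under the multitask kernel they interact with the same strength whichever two targets are involved, which is the limitation noted in the last bullet.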
17
Empirical observations suggest that molecules that bind to t are only likely to bind to t' if they are similar in terms of structure or evolutionary history
› Can be detected by comparing protein sequences
Mismatch kernel:
› Compares short sequences of amino acids up to some number of mismatches
› Choose 3-mers with a maximum of one mismatch
Local alignment kernel:
› Uses the alignment score between the primary sequences of proteins to measure their similarity
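A rough sketch of a mismatch kernel with 3-mers and at most one mismatch is given below. It is an unnormalized, brute-force version written for readability. The 20-letter alphabet and all function names are assumptions of this example, and the implementation used in the study may differ, for instance in normalization.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: the 20 standard residues


def neighbors(kmer, alphabet=AMINO_ACIDS):
    """All k-mers within Hamming distance 1 of kmer, including kmer itself."""
    result = {kmer}
    for i in range(len(kmer)):
        for letter in alphabet:
            result.add(kmer[:i] + letter + kmer[i + 1:])
    return result


def mismatch_features(seq, k=3):
    """Count, for each k-mer, how many k-mers of seq lie within one mismatch of it."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for nb in neighbors(seq[i:i + k]):
            feats[nb] += 1
    return feats


def mismatch_kernel(s1, s2, k=3):
    """Inner product of the two mismatch feature maps."""
    f1, f2 = mismatch_features(s1, k), mismatch_features(s2, k)
    return sum(count * f2[kmer] for kmer, count in f1.items())
```

Two sequences score highly when they share many 3-mers that are identical or differ in a single residue, so the kernel rewards local sequence similarity without requiring an alignment.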
18
$K_{hierarchy}(t,t') = \langle \Phi_h(t), \Phi_h(t') \rangle$
$\Phi_h(t)$ has a feature for each node in the hierarchy
› Is set to 1 if the node is part of t's hierarchy
› Is set to 0 otherwise
› Plus one feature is constantly set to 1
Uses data from the target and, with a smaller weight, data from other targets
Performed the best in the experiments
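Because the features are 0/1 indicators, the hierarchy kernel reduces to counting shared hierarchy nodes, plus one for the constant feature. The sketch below assumes each target is identified by a dotted label such as an EC number, and that the most specific label counts as a node; both are choices made for this example.

```python
def hierarchy_nodes(label):
    """Nodes on the path from the top level to the target, e.g. '1.2.2.1' gives
    {'1', '1.2', '1.2.2', '1.2.2.1'}."""
    parts = label.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def k_hierarchy(t, t2):
    """Number of shared hierarchy nodes, plus 1 for the constant feature."""
    return len(hierarchy_nodes(t) & hierarchy_nodes(t2)) + 1


same = k_hierarchy("1.2.2.1", "1.2.2.1")      # all four levels shared
siblings = k_hierarchy("1.2.2.1", "1.2.1.3")  # share '1' and '1.2'
distant = k_hierarchy("1.2.2.1", "3.4.1.1")   # only the constant feature
```

Targets that sit close together in the hierarchy get a larger kernel value, so their known ligands count more for each other than those of distant targets.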
19
Enzyme Commission numbers
› International Union of Biochemistry and Molecular Biology (1992)
› Classify enzymes by the chemical reactions they catalyze
› Four-level hierarchy
For example,
› EC 1 includes oxidoreductases
› EC 1.2 includes oxidoreductases that act on the aldehyde or oxo group of donors
› EC 1.2.2 has NAD+ or NADP+ as an acceptor
› EC 1.2.2.1 enzymes catalyze the oxidation of formate to bicarbonate
Enzymes that are close in the hierarchy should have similar ligands
20
GPCRs are grouped into four classes
› Group A: rhodopsin family
› Group B: secretin family
› Group C: metabotropic family
› Group D: groups more diverse receptors
KEGG database subdivides the rhodopsin family into three subgroups
› Amine receptors
› Peptide receptors
› Other receptors
And adds a second level of classification based on the type of ligands or known subdivisions
21
The KEGG database divides ion channels into 8 classes
› Cys-loop superfamily
› Glutamate-gated cation channels
› Epithelial and related Na+ channels
› Voltage-gated cation channels
› Related to voltage-gated cation channels
› Related to inward rectifier K+ channels
› Chloride channels
› Related to ATPase-linked transporters
Each class is further subdivided
› By, for example, the type of ligands or type of ion passing through the channel
22
Extracted compound interaction data from the KEGG BRITE database
› Known compounds for each target
› Type of interaction
  › Enzymes: inhibitor, cofactor, effector
  › GPCRs: antagonist, full/partial agonist
  › Ion channels: pore blocker, positive/negative allosteric modulator, agonist, antagonist
Did not take into account
› Orthologs of targets
› Enzymes with same EC number
› Compounds with no molecular descriptor
  › Primarily peptides
› Targets with no known compounds
23
Generated as many negative ligand-target pairs as known ligand-target pairs
› Randomly chose ligands
› Can produce false negatives
› Ideally would use experimentally confirmed negative pairs
2436 data points for enzymes
› 675 enzymes, 524 compounds
798 data points for GPCRs
› 100 receptors, 219 compounds
2230 data points for ion channels
› 114 channels, 462 compounds
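One way to read the negative-sampling step is sketched below. The exact procedure is not spelled out on the slide, so this is an assumption: for each known pair, the target is combined with a randomly drawn ligand that is not known to bind it. The function name, the toy data, and the choice to skip a target once no unused ligand remains are all hypothetical.

```python
import random


def sample_negatives(positives, seed=0):
    """Draw up to one presumed-negative (target, ligand) pair per known positive pair."""
    rng = random.Random(seed)
    known = set(positives)
    ligands = sorted({ligand for _, ligand in positives})
    negatives = set()
    for target, _ in positives:
        candidates = [c for c in ligands
                      if (target, c) not in known and (target, c) not in negatives]
        if candidates:  # skip when every ligand is already used for this target
            negatives.add((target, rng.choice(candidates)))
    return sorted(negatives)


# Toy interaction list, for illustration only.
positives = [("T1", "a"), ("T1", "b"), ("T2", "c"), ("T3", "a")]
negatives = sample_negatives(positives)
```

Pairs drawn this way are only presumed negative: an untested ligand may in fact bind the target, which is the false-negative risk noted on the slide.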
24
Distribution of the number of known ligands per target for enzymes, GPCR, and ion channel datasets
Each bar indicates the proportion of targets for which a given number of training points are available
Few compounds are known for most targets
Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409
25
Experiment 1
› Trained an SVM classifier on all points involving other targets of the family plus a fraction of points involving t
› Tested on the remaining data points for t
› Assesses the accuracy for a given target when using ligands for other targets for training
Experiment 2
› Trained an SVM classifier using only interactions that did not involve t
› Tested on data points that did involve t
› Simulated making predictions for targets with no known ligands
Measured performance using the area under the ROC curve (AUC)
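The split used in Experiment 2 and the AUC measure can be sketched as follows. The data layout (triples of target, ligand, label) and both function names are assumptions for this example, and the SVM training itself is left out. The AUC is computed from its pairwise definition: the fraction of positive/negative pairs ranked correctly, with ties counted as half.

```python
def leave_target_out(pairs, held_out):
    """Train on every pair not involving held_out; test on the pairs that do."""
    train = [p for p in pairs if p[0] != held_out]
    test = [p for p in pairs if p[0] == held_out]
    return train, test


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, for illustration only: (target, ligand, label).
pairs = [("T1", "a", 1), ("T1", "b", 0), ("T2", "a", 1), ("T2", "c", 0)]
train, test = leave_target_out(pairs, "T1")

auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
auc_constant = roc_auc([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0])
```

A classifier that gives every test pair the same score gets an AUC of 0.5, which is the chance level to compare the reported values against.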
26
Mean AUC on each dataset with various target kernels
Hierarchy kernel shows significant improvements
› Sharing information for known ligands of different targets
› Incorporating prior information into the kernels

K tar \ Target     Enzymes        GPCR           Channels
Dirac              0.646±0.009    0.750±0.023    0.770±0.020
Multitask          0.931±0.006    0.749±0.022    0.873±0.015
Hierarchy          0.955±0.005    0.926±0.015    0.925±0.012
Mismatch           0.725±0.009    0.805±0.023    0.875±0.015
Local alignment    0.676±0.009    0.824±0.021    0.901±0.013
27
Target kernel Gram matrices (K tar) for ion channels with multitask, hierarchy, and local alignment kernels
Hierarchy kernel adds structure information
Local alignment kernel retains some substructures
For GPCR and enzymes, almost no structure is found by the sequence kernels
Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409
28
Relative improvement of the hierarchy kernel against the Dirac kernel as a function of the number of known ligands for enzymes, GPCR, and ion channel datasets
Strong improvement when few ligands are known
Decreases when enough training points become available
After a certain point, performance is impaired
Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409
29
Mean AUC on each dataset with various target kernels
Dirac kernel showed random behavior
› Learning with no training data
Hierarchy kernel still gives reasonable results
› AUC drops by 1.7, 5.1, and 7.2 percentage points for enzymes, GPCR, and ion channels compared to the first experiment

K tar \ Target     Enzymes        GPCR           Channels
Dirac              0.500±0.000
Multitask          0.902±0.008    0.576±0.026    0.704±0.026
Hierarchy          0.938±0.006    0.875±0.020    0.853±0.019
Mismatch           0.602±0.008    0.703±0.027    0.729±0.024
Local alignment    0.535±0.005    0.751±0.025    0.772±0.023
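The quoted losses for the hierarchy kernel are absolute differences between its mean AUC in the first and second experiments, expressed in points. The short check below copies the hierarchy rows from the two result tables; the variable names are illustrative.

```python
# Mean AUC of the hierarchy kernel, copied from the two experiments.
auc_exp1 = {"Enzymes": 0.955, "GPCR": 0.926, "Channels": 0.925}
auc_exp2 = {"Enzymes": 0.938, "GPCR": 0.875, "Channels": 0.853}

# Absolute drop in AUC, in percentage points, rounded to one decimal.
loss_points = {
    name: round(100 * (auc_exp1[name] - auc_exp2[name]), 1)
    for name in auc_exp1
}
```

These are differences in AUC rather than relative losses; relative to the first-experiment values the drops would be slightly larger.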
30
1. Rognan D: Chemogenomic approaches to rational drug design. Br J Pharmacol 2007, 152:38-52.
2. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Res 2002, 30:42-46.
3. Jacob L, Vert JP: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24:2149-2156.
4. Erhan D, L'Heureux P, Yue SY, Bengio Y: Collaborative Filtering on a Family of Biological Targets. J Chem Inf Model 2006, 46:626-635.
5. Bock JR, Gough DA: Virtual Screen for Ligands of Orphan G Protein-Coupled Receptors. J Chem Inf Model 2005, 45:1402-1414.