 Predicting interactions between small molecules and proteins › Vital to the drug discovery process › Key to understanding biological processes  3 classes of drug targets › G-protein-coupled receptors (GPCRs) › Enzymes › Ion channels

 Consider each target independently from other proteins  Ligand-based approach › Compare to known ligands of the target › Requires knowledge about other ligands of a given target  Structure-based or docking approaches › Uses 3D structure of the target to determine how well a ligand can bind › Requires 3D structure of the target › Very time consuming  Cannot apply if no ligand or 3D structure is known for a given target

 Chemical space: › set of all small molecules  Biological space: › set of all proteins or protein families  Mine the entire chemical space for interactions with the biological space  Knowledge of some ligands for a target can help to predict ligands for similar targets

 Ligand-based chemogenomics › Look at families or subfamilies of proteins › Model ligands at the level of a family  Target-based chemogenomics › Cluster receptors based on ligand binding site similarity › Use known ligands for each cluster to infer shared ligands  Target-ligand approach › Use binding information for targets to predict ligands for another target in a single step

 Bock and Gough (2005) › Describe ligand-receptor complexes by merging ligand and target descriptors › Use machine learning methods to predict whether a ligand-receptor pair forms a complex  Erhan et al. (2006) › Merge a set of ligand descriptors with a set of receptor descriptors in a framework of neural networks and support vector machines › Offers great flexibility in the choice of descriptors

 Investigates different types of descriptors  Builds upon recent developments in kernel methods › In bio- and cheminformatics  Tests different methods for prediction of ligands › For 3 major classes of targets  Shows that the choice of representation greatly affects accuracy  New kernel based on hierarchies of receptors outperforms all other descriptors › Performs especially well for targets with few or no known ligands

 Given n target/molecule pairs $(t_1, c_1), \dots, (t_n, c_n)$ known to form complexes or not › Each pair is represented by a vector $\Phi(t,c)$  Estimate a linear function › $f(t,c) = w^\top \Phi(t,c)$  Whose sign is used to predict whether a chemical c can bind to a target t  The vector w is estimated from the training set
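
The decision rule on this slide can be sketched in a few lines. This is only an illustration: it assumes the joint descriptor of a pair is already available as a plain list of numbers, the weight and descriptor values are made up, and `score` and `predict_binds` are hypothetical helper names.

```python
def score(w, phi):
    """Linear score f(t, c) = w . Phi(t, c) for one target/molecule pair."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def predict_binds(w, phi):
    """Predict binding when the sign of the score is positive."""
    return score(w, phi) > 0


# Illustrative values only (not taken from the slides).
w = [0.5, -1.0, 2.0]
phi_pair_a = [1.0, 0.0, 1.0]
phi_pair_b = [0.0, 1.0, 0.0]

print(score(w, phi_pair_a), predict_binds(w, phi_pair_a))
print(score(w, phi_pair_b), predict_binds(w, phi_pair_b))
```

In practice w would come from a trained classifier such as an SVM rather than being set by hand.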

 Represent a molecule c by a vector $\Phi_{lig}(c) \in \mathbb{R}^{d_c}$ › Encode physicochemical and structural properties › Model interactions between small molecules and a single target  Represent a protein t by a vector $\Phi_{tar}(t) \in \mathbb{R}^{d_t}$ › Capture properties of the protein's sequence or structure › Infer models that predict the structural or functional class of a protein  Need to represent a pair (c,t) in a single vector › Capture interactions between features of the molecule and protein that can be useful predictors › Multiply a descriptor of c with a descriptor of t

 $\Phi(c,t) = \Phi_{lig}(c) \otimes \Phi_{tar}(t)$  Represents the set of all possible products of features of c and t  Vector of dimension $d_c \times d_t$ › The (i,j)-th entry is the product of the i-th entry of $\Phi_{lig}(c)$ and the j-th entry of $\Phi_{tar}(t)$  Size may be prohibitively large  Use kernel methods
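
To make the tensor product concrete, here is a small sketch that builds the joint vector explicitly. The descriptor values are invented for illustration, and `tensor_product` is a hypothetical helper, not something named in the slides.

```python
def tensor_product(phi_lig, phi_tar):
    """Flattened tensor product: entry (i, j) is phi_lig[i] * phi_tar[j].

    The result has len(phi_lig) * len(phi_tar) entries, stored row by row,
    so entry (i, j) sits at index i * len(phi_tar) + j.
    """
    return [a * b for a in phi_lig for b in phi_tar]


phi_lig = [1, 0, 2]   # illustrative ligand descriptor, d_c = 3
phi_tar = [3, 4]      # illustrative target descriptor, d_t = 2

phi_pair = tensor_product(phi_lig, phi_tar)
print(phi_pair)       # 6 entries: d_c * d_t
```

With thousands of features on each side, the explicit vector quickly becomes too large to store, which is the motivation for working with kernels instead.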

 Can process large- or infinite-dimensional patterns if the inner product between any two patterns can be computed  Can factorize the inner product between two tensor product vectors › $(\Phi_{lig}(c) \otimes \Phi_{tar}(t))^\top (\Phi_{lig}(c') \otimes \Phi_{tar}(t'))$ › $= \Phi_{lig}(c)^\top \Phi_{lig}(c') \times \Phi_{tar}(t)^\top \Phi_{tar}(t')$  Obtain the inner product between two tensor products › $K((c,t),(c',t')) = K_{ligand}(c,c') \times K_{target}(t,t')$  $K_{ligand}(c,c') = \Phi_{lig}(c)^\top \Phi_{lig}(c')$  $K_{target}(t,t') = \Phi_{tar}(t)^\top \Phi_{tar}(t')$
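
The factorization can be checked numerically on a toy case. The sketch below compares the inner product of two explicit tensor product vectors with the product of the two smaller inner products. All vectors are made-up examples and the function names are hypothetical.

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tensor_product(phi_lig, phi_tar):
    """Flattened tensor product of a ligand and a target descriptor."""
    return [a * b for a in phi_lig for b in phi_tar]


def pair_kernel(phi_c, phi_c2, phi_t, phi_t2):
    """Kernel between pairs, computed without forming the tensor product."""
    return dot(phi_c, phi_c2) * dot(phi_t, phi_t2)


# Illustrative descriptors for two molecules and two targets.
phi_c, phi_c2 = [1, 0, 2], [2, 1, 1]
phi_t, phi_t2 = [3, 4], [1, -1]

explicit = dot(tensor_product(phi_c, phi_t), tensor_product(phi_c2, phi_t2))
factorized = pair_kernel(phi_c, phi_c2, phi_t, phi_t2)
print(explicit, factorized)
```

The factorized form needs d_c + d_t multiplications instead of d_c x d_t, and it still works when the feature maps are only defined implicitly through a kernel.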

 There have been impressive advances in the use of SVMs in chemoinformatics  Kernels have been designed using: › Physicochemical properties of molecules › 2D or 3D fingerprints › Comparison of 2D and 3D structures of molecules  Detection of common substructures in 2D graphs  Encoding various properties of 3D structures  Used in single-target virtual screening and prediction of pharmacokinetics and toxicity

 Classical choice  State-of-the-art performance  $K_{ligand}(c,c') = \dfrac{\Phi_{lig}(c)^\top \Phi_{lig}(c')}{\Phi_{lig}(c)^\top \Phi_{lig}(c) + \Phi_{lig}(c')^\top \Phi_{lig}(c') - \Phi_{lig}(c)^\top \Phi_{lig}(c')}$  $\Phi_{lig}(c)$ is a binary vector  Each bit indicates whether the 2D structure of c contains a given linear path of length l or less as a subgraph › Choose l=8  Computed with the ChemCPP software
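
The formula on this slide is commonly called the Tanimoto kernel. For binary fingerprints the inner products reduce to set sizes, so a minimal sketch can store each fingerprint as the set of its "on" bit positions. The fingerprints below are invented, `tanimoto_kernel` is a hypothetical name, and returning 0.0 for two empty fingerprints is an assumption made here to avoid dividing by zero.

```python
def tanimoto_kernel(bits_a, bits_b):
    """Tanimoto kernel between two binary fingerprints.

    Each fingerprint is given as the set of indices whose bit is 1, so
    phi(a).phi(b) = |A & B| and phi(a).phi(a) = |A|.
    """
    shared = len(bits_a & bits_b)
    denominator = len(bits_a) + len(bits_b) - shared
    if denominator == 0:
        return 0.0  # assumption: two all-zero fingerprints get similarity 0
    return shared / denominator


# Illustrative fingerprints (bit positions are arbitrary).
mol_a = {0, 2, 5, 7}
mol_b = {2, 5, 9}

print(tanimoto_kernel(mol_a, mol_b))
print(tanimoto_kernel(mol_a, mol_a))
```

Computing real path-based fingerprints requires a cheminformatics toolkit; this sketch only covers the kernel once the bits are known.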

 SVMs and kernel methods are widely used in bioinformatics  Various kernels have been proposed based on: › Amino-acid sequence of proteins › 3D structures of proteins › Pattern of occurrences of proteins in multiple sequenced genomes  Used for various tasks related to structural or functional classification of proteins

 $K_{Dirac}(t,t')$ › = 1 if t = t' › = 0 otherwise  Represents different targets as orthonormal vectors  Orthogonality between two proteins t and t' implies orthogonality between all pairs (c,t) and (c',t') for any two molecules c and c' › Learning is performed independently for each target protein › Does not share any information of known ligands between different targets
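
A short sketch shows why the Dirac kernel decouples the targets: the kernel between two pairs vanishes whenever the targets differ, whatever the ligand similarity. Target identifiers and the ligand kernel value are illustrative, and the function names are hypothetical.

```python
def dirac_kernel(target_a, target_b):
    """1 if both targets are the same protein, 0 otherwise."""
    return 1.0 if target_a == target_b else 0.0


def pair_kernel_dirac(k_ligand_value, target_a, target_b):
    """Pair kernel = ligand kernel value times the Dirac target kernel."""
    return k_ligand_value * dirac_kernel(target_a, target_b)


# Two very similar ligands (illustrative ligand kernel value 0.8).
same_target = pair_kernel_dirac(0.8, "T1", "T1")
other_target = pair_kernel_dirac(0.8, "T1", "T2")
print(same_target, other_target)
```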

 $K_{multitask}(t,t') = 1 + K_{Dirac}(t,t')$  Removes the orthogonality  Combines target-specific properties of the ligands and general properties across all targets  Allows sharing of information during learning  Preserves the specificities of the ligands for each target  Weights known interactions with every other target equally, regardless of how similar that target is
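
The effect of adding 1 to the Dirac kernel can be seen in a small sketch: pairs on the same target count double, while pairs on any other target still contribute with the same fixed weight. Values and names are illustrative assumptions.

```python
def multitask_kernel(target_a, target_b):
    """K_multitask = 1 + K_Dirac: 2 for the same target, 1 for any other."""
    return 1.0 + (1.0 if target_a == target_b else 0.0)


def pair_kernel_multitask(k_ligand_value, target_a, target_b):
    """Pair kernel = ligand kernel value times the multitask target kernel."""
    return k_ligand_value * multitask_kernel(target_a, target_b)


# Illustrative ligand kernel value 0.8 and made-up target identifiers.
same_target = pair_kernel_multitask(0.8, "T1", "T1")
close_target = pair_kernel_multitask(0.8, "T1", "T2")
distant_target = pair_kernel_multitask(0.8, "T1", "T9")
print(same_target, close_target, distant_target)
```

Because the kernel only tests identity, "T2" and "T9" are treated alike no matter how related they are to "T1".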

 Empirical observations suggest that molecules that bind to t are only likely to bind to t' if t and t' are similar in terms of structure or evolutionary history › Can be detected by comparing protein sequences  Mismatch kernel: › Compares short sequences of amino acids up to some number of mismatches › Choose 3-mers with a maximum of one mismatch  Local alignment kernel: › Uses the alignment score between the primary sequences of proteins to measure their similarity
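
As a rough sketch of the mismatch kernel with 3-mers and at most one mismatch, the code below builds the feature map by brute force: every 3-mer in a sequence adds one count to each 3-mer within one substitution of it. This is a naive illustration, not an efficient implementation, and the function names and toy sequences are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_mismatch_neighbors(kmer):
    """All k-mers that differ from `kmer` in at most one position."""
    neighbors = {kmer}
    for i in range(len(kmer)):
        for residue in AMINO_ACIDS:
            neighbors.add(kmer[:i] + residue + kmer[i + 1:])
    return neighbors


def mismatch_features(sequence, k=3):
    """Count, for each k-mer, how many k-mers of the sequence lie within one mismatch."""
    features = Counter()
    for start in range(len(sequence) - k + 1):
        for neighbor in one_mismatch_neighbors(sequence[start:start + k]):
            features[neighbor] += 1
    return features


def mismatch_kernel(seq_a, seq_b, k=3):
    """Inner product of the two mismatch feature maps."""
    feat_a = mismatch_features(seq_a, k)
    feat_b = mismatch_features(seq_b, k)
    return sum(count * feat_b[kmer] for kmer, count in feat_a.items())


print(mismatch_kernel("AAA", "AAC"))
print(mismatch_kernel("AAA", "CCC"))
```

"AAA" and "AAC" share the 20 neighbors of the form "AA?" and so get a positive value, while "AAA" and "CCC" are three substitutions apart and share none.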

 $K_{hierarchy}(t,t') = \langle \Phi_h(t), \Phi_h(t') \rangle$  $\Phi_h(t)$ has a feature for each node in the hierarchy › Is set to 1 if the node is part of t's hierarchy › Is set to 0 otherwise › Plus one feature that is constantly set to 1  Uses data from the target and data from other targets, giving the latter a smaller weight  Performed the best in the experiments
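
With binary node features plus one constant feature, the hierarchy kernel reduces to one plus the number of hierarchy nodes two targets share. The sketch below assumes each target is identified by a dotted label whose prefixes name its ancestors (in the style of a four-level classification). The labels are hypothetical examples and the function names are not from the slides.

```python
def hierarchy_nodes(label):
    """Nodes on the path from the top level down to the target itself.

    "1.2.3.4" gives {"1", "1.2", "1.2.3", "1.2.3.4"}.
    """
    parts = label.split(".")
    return {".".join(parts[:depth]) for depth in range(1, len(parts) + 1)}


def hierarchy_kernel(label_a, label_b):
    """Inner product of the node indicator vectors, plus 1 for the constant feature."""
    shared_nodes = hierarchy_nodes(label_a) & hierarchy_nodes(label_b)
    return len(shared_nodes) + 1


# Hypothetical four-level labels.
print(hierarchy_kernel("1.2.3.4", "1.2.3.4"))  # same target
print(hierarchy_kernel("1.2.3.4", "1.2.5.6"))  # same first two levels
print(hierarchy_kernel("1.2.3.4", "2.2.3.4"))  # different top-level class
```

Targets that sit close together in the hierarchy get larger kernel values, so their known ligands carry more weight than those of distant targets.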

 Enzyme Commission numbers › International Union of Biochemistry and Molecular Biology (1992) › Classify enzymes by the chemical reactions they catalyze › Four-level hierarchy  For example, › EC 1 includes oxidoreductases › EC 1.2 includes oxidoreductases that act on the aldehyde or oxo group of donors › EC has NAD+ or NADP+ as an acceptor › EC catalyze the oxidation of formate to bicarbonate  Enzymes that are close in the hierarchy should have similar ligands

 GPCRs are grouped into four classes › Group A: rhodopsin family › Group B: secretin family › Group C: metabotropic family › Group D: groups more diverse receptors  KEGG database subdivides the rhodopsin family into three subgroups › Amine receptors › Peptide receptors › Other receptors  And adds a second level of classification based on the type of ligands or known subdivisions

 The KEGG database divides ion channels into 8 classes › Cys-loop superfamily › Glutamate-gated cation channels › Epithelial and related Na+ channels › Voltage-gated cation channels › Related to voltage-gated cation channels › Related to inward rectifier K+ channels › Chloride channels › Related to ATPase-linked transporters  Each class is further subdivided › By, for example, the type of ligands or type of ion passing through the channel

 Extracted compound interaction data from KEGG BRITE database › Known compounds for each target › Type of interaction  Enzymes: inhibitor, cofactor, effector  GPCR: antagonist, full/partial agonist  Ion Channels: pore blocker, positive/negative allosteric modulator, agonist, antagonist  Did not take into account › Orthologs of targets › Enzymes with same EC number › Compounds with no molecular descriptor  Primarily peptides › Targets with no known compounds

 Generated as many negative ligand-target pairs as known ligand-target pairs › Randomly chose ligands › May include false negatives, since a randomly chosen pair can still bind › Experimentally confirmed negative pairs would be needed to avoid this  2436 data points for enzymes › 675 enzymes, 524 compounds  798 data points for GPCRs › 100 receptors, 219 compounds  2230 data points for ion channels › 114 channels, 462 compounds
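
One way to read the negative-sampling step is sketched below: draw as many random (target, ligand) pairs as there are known pairs, skipping pairs already known to interact. The exact procedure used for the slides is not specified, so the sampling scheme, the function name, and the toy identifiers are all assumptions.

```python
import random


def sample_negative_pairs(positive_pairs, seed=0):
    """Draw as many unobserved (target, ligand) pairs as there are positives."""
    positives = set(positive_pairs)
    targets = sorted({target for target, _ in positives})
    ligands = sorted({ligand for _, ligand in positives})
    # Every combination that is not a known interaction is a candidate negative.
    candidates = [
        (target, ligand)
        for target in targets
        for ligand in ligands
        if (target, ligand) not in positives
    ]
    if len(candidates) < len(positives):
        raise ValueError("not enough unobserved pairs to match the positives")
    return random.Random(seed).sample(candidates, len(positives))


# Toy data with made-up identifiers.
positives = [("T1", "a"), ("T1", "b"), ("T2", "c"), ("T3", "a")]
negatives = sample_negative_pairs(positives)
print(negatives)
```

Any pair drawn this way is only presumed negative, which is exactly why some of the sampled pairs may be false negatives.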

Distribution of the number of known ligands per target for the enzyme, GPCR, and ion channel datasets  Each bar indicates the proportion of targets for which a given number of training points are available  Few compounds are known for most targets Jacob, L. et al. Bioinformatics 2008, 24; doi: /bioinformatics/btn409

 Experiment 1 › Trained an SVM classifier on  all points involving other targets of the family  plus a fraction of points involving t › Tested on the remaining data points for t › Assesses the accuracy for a given target when using ligands for other targets for training  Experiment 2 › Trained an SVM classifier using only interactions that did not involve t › Tested on data points that did involve t › Simulated making predictions for targets with no known ligands  Measured performance using the area under the ROC curve (AUC)
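
The evaluation in Experiment 2 can be sketched as a hold-out split by target followed by an AUC computation. The AUC below is the plain pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as one half). Data, scores, and function names are illustrative assumptions, and no SVM is trained here.

```python
def split_without_target(pairs, held_out_target):
    """Train on pairs that do not involve the target, test on those that do."""
    train = [pair for pair in pairs if pair[0] != held_out_target]
    test = [pair for pair in pairs if pair[0] == held_out_target]
    return train, test


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy (target, ligand, label) triples with made-up identifiers.
pairs = [("T1", "a", 1), ("T1", "b", 0), ("T2", "a", 1), ("T2", "c", 0)]
train, test = split_without_target(pairs, "T1")

example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(len(train), len(test), example_auc)
```

Experiment 1 differs only in the split: a fraction of the pairs involving the held-out target would be moved into the training set.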

Mean AUC on each dataset with various target kernels  Hierarchy kernel shows significant improvements › Sharing information for known ligands of different targets › Incorporating prior information into the kernels

K_tar \ Target     Enzymes     GPCR     Channels
Dirac              0.646±      ±        ±0.020
Multitask          0.931±      ±        ±0.015
Hierarchy          0.955±      ±        ±0.012
Mismatch           0.725±      ±        ±0.015
Local alignment    0.676±      ±        ±0.013

Target kernel Gram matrices ($K_{tar}$) for ion channels with multitask, hierarchy, and local alignment kernels  Hierarchy kernel adds structure information  Local alignment kernel retains some substructures  For GPCRs and enzymes, almost no structure is found by the sequence kernels Jacob, L. et al. Bioinformatics 2008, 24; doi: /bioinformatics/btn409

Relative improvement of the hierarchy kernel against the Dirac kernel as a function of the number of known ligands for the enzyme, GPCR, and ion channel datasets  Strong improvement when few ligands are known  Improvement decreases when enough training points become available  After a certain point, performance is impaired relative to the Dirac kernel Jacob, L. et al. Bioinformatics 2008, 24; doi: /bioinformatics/btn409

Mean AUC on each dataset with various target kernels  Dirac kernel showed random behavior › Learning with no training data  Hierarchy kernel still gives reasonable results › 1.7%, 5.1%, 7.2% loss for enzymes, GPCR, and ion channels compared to the first experiment

K_tar \ Target     Enzymes     GPCR     Channels
Dirac              0.500±0.000
Multitask          0.902±      ±        ±0.026
Hierarchy          0.938±      ±        ±0.019
Mismatch           0.602±      ±        ±0.024
Local alignment    0.535±      ±        ±0.023

1. Rognan D: Chemogenomic approaches to rational drug design. Br J Pharmacol 2007, 152.
2. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucl. Acids Res. 2002, 30.
3. Jacob L, Vert J: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24.
4. Erhan D, L'Heureux P, Yue SY, Bengio Y: Collaborative Filtering on a Family of Biological Targets. Journal of Chemical Information and Modeling 2006, 46.
5. Bock JR, Gough DA: Virtual Screen for Ligands of Orphan G Protein-Coupled Receptors. Journal of Chemical Information and Modeling 2005, 45.