Exploring Chemical Space with Computers—Challenges and Opportunities Pierre Baldi UCI.



Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)

Chemical Space (table comparing Stars and Small Mol.): rows Existing, Virtual (?), Access (Difficult vs. “Easy”), Mode (Individual vs. Combinatorial)

Chemical Space

Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Predict physical, chemical, biological properties (classification/regression)
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

Methods Spectrum:
Schrödinger Equation
Molecular Dynamics
Machine Learning (e.g. SS prediction)

Chemical Informatics
Informatics must be able to deal with variable-size structured data:
Graphical Models
(Recursive) Neural Networks
ILP
GA
SGs
Kernels

Two Essential Ingredients
1. Data
2. Similarity Measures
Bioinformatics analogy and differences:
Data (GenBank, Swissprot, PDB)
Similarity (BLAST)

Data
Mutag (mutagenicity): 200 compounds (125/63), mutagenicity in Salmonella
PTC (Predictive Toxicity Challenge): a few hundred compounds, carcinogenicity (FM, MM, FR, MR)
NCI (anti-cancer activity): 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
Alkanes (boiling points): all 150 non-cyclic alkanes (C_nH_{2n+2}) with n < 11 and their boiling points (range [-164, 174] °C)
Benzodiazepines (QSAR): 79 1,4-benzodiazepin-2-ones, affinity towards GABA_A
ChemDB: 7M compounds

Similarity
Rapid Searches of Large Databases
Predictive Methods (Kernel Methods)
Why is it not hopeless?

Similarity
Rapid Search of Large Databases
Protein Receptor (Docking)
Small Molecule/Ligand (Similarity)
Predictive Methods (Kernel Methods)
Why is it not hopeless?
Organic Chemicals

Linear Classifiers

Classification
Learning to Classify
Limited number of training examples (molecules, patients, sequences, etc.)
Learning algorithm (how to build the classifier?)
Generalization: should correctly classify test data.
Formalization
X is the input space
Y (e.g. toxic/non-toxic, or {1, -1}) is the target class
f: X → Y is the classifier.

Classification
Fundamental point: f is entirely determined by the dot products $\langle x_i, x_j \rangle$ measuring the similarity between pairs of data points.
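The point above can be illustrated with a perceptron written in dual form. This is a minimal sketch, not the classifier used in the talk: the function names and the toy data in the usage note are illustrative assumptions. Training and prediction touch the inputs only through dot products between pairs of points.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_dual_perceptron(xs, ys, epochs=20):
    """Return one dual coefficient per training point.

    The inputs are used only to fill the matrix of pairwise dot products.
    There is no bias term, so the data must be separable through the origin.
    """
    n = len(xs)
    gram = [[dot(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * ys[j] * gram[j][i] for j in range(n))
            if ys[i] * score <= 0:  # misclassified (or undecided) point
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(x, xs, ys, alpha):
    """Classify x as +1 or -1 using dot products with the training points."""
    score = sum(a * y * dot(xi, x) for a, y, xi in zip(alpha, ys, xs))
    return 1 if score > 0 else -1
```

For example, with the toy points `xs = [(2, 1), (1, 2), (-1, -2), (-2, -1)]` and labels `ys = [1, 1, -1, -1]`, `train_dual_perceptron(xs, ys)` yields coefficients that `predict` can apply to a new point such as `(3, 3)`.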

Non Linear Classification (Kernel Methods) We can transform a nonlinear problem into a linear one using a kernel.

Non Linear Classification (Kernel Methods)
We can transform a nonlinear problem into a linear one using a kernel K.
Fundamental property: the linear decision surface depends on $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
All we need is the Gram similarity matrix K.
K defines the local metric of the embedding space.
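As a small worked example of this property, the sketch below checks that a degree-2 polynomial kernel on 2D inputs equals a dot product in an explicit feature space, and builds a Gram matrix from it. The choice of kernel, the feature map `phi`, and all function names are assumptions made for illustration; the talk does not specify them.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map for poly2_kernel on 2D inputs."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(a, b) for b in points] for a in points]


# The kernel value and the feature-space dot product agree up to rounding.
x, z = (1.0, 2.0), (3.0, -1.0)
same = math.isclose(poly2_kernel(x, z), dot(phi(x), phi(z)))
```

A linear classifier trained on the Gram matrix never needs to evaluate `phi` directly, which is what makes the approach usable for structured inputs such as molecules.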

Similarity: Data Representations NC(O)C(=O)O

Molecular Representations
1D: SMILES strings
2D: Graph of bonds
2.5D: Surfaces
3D: Atomic coordinates
4D: Temporal evolution

1D SMILES Kernel
CCCCCCc1ccc(cc1O)O
CCCCCc1ccc(cc1)CO
Total: 15
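One common way to compare strings such as these is a spectrum kernel, which counts shared substrings of a fixed length. The sketch below is an assumption about what a 1D SMILES kernel could look like, not a reproduction of the kernel in the talk; the substring length `k` and the function names are hypothetical.

```python
import math
from collections import Counter


def spectrum(smiles, k):
    """Count every length-k substring of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Dot product of the two substring-count vectors."""
    counts_a, counts_b = spectrum(a, k), spectrum(b, k)
    return sum(count * counts_b[sub] for sub, count in counts_a.items())


def normalized_kernel(a, b, k=3):
    """Cosine-normalized spectrum kernel, between 0 and 1."""
    denom = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# The two SMILES strings shown on the slide.
similarity = normalized_kernel("CCCCCCc1ccc(cc1O)O", "CCCCCc1ccc(cc1)CO")
```

Normalizing keeps long molecules from dominating simply because they contain more substrings.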

2D Molecule Graph Kernel
For chemical compounds:
atom/node labels: A = {C, N, O, H, …}
bond/edge labels: B = {s, d, t, ar, …}
Count labeled paths
Fingerprints (CsNsCdO)
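Counting labeled paths can be sketched as a depth-first walk over the molecular graph. The code below is a minimal illustration under stated assumptions: the graph encoding (a list of atom labels plus `(i, j, bond_label)` triples), the `max_bonds` depth limit, and the example molecule are all hypothetical choices, and each path is counted once per direction.

```python
from collections import Counter


def labeled_paths(atoms, bonds, max_bonds=3):
    """Count labeled simple paths with at most max_bonds bonds.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: list of (i, j, label) with label in {"s", "d", "t", "ar"}
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))

    counts = Counter()

    def walk(atom, visited, text, depth):
        counts[text] += 1
        if depth == max_bonds:
            return
        for nxt, label in neighbors[atom]:
            if nxt not in visited:  # simple paths only, no atom revisited
                walk(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        walk(start, {start}, atoms[start], 0)
    return counts


# Example graph for a C-N-C=O chain, which contains the path CsNsCdO.
atoms = ["C", "N", "C", "O"]
bonds = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")]
paths = labeled_paths(atoms, bonds)
```

The resulting counts (or just the set of path strings) play the role of a fingerprint that can be fed to a similarity measure.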

Similarity Measures
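Two standard measures for comparing fingerprints are the Tanimoto coefficient (on sets of features) and the MinMax coefficient (on feature counts). The sketch below gives their usual definitions; the function names and the convention of returning 0.0 for two empty fingerprints are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minmax(counts_a, counts_b):
    """MinMax coefficient of two feature-count dicts.

    Sum of per-feature minima divided by sum of per-feature maxima.
    """
    keys = set(counts_a) | set(counts_b)
    num = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    den = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    return num / den if den else 0.0
```

When every count is 0 or 1, MinMax reduces to Tanimoto, so the two differ only when a feature occurs more than once in a molecule.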

3D Coordinate Kernel (figure; distance labels: 1.4 Å, 2.0 Å, 2.8 Å, 3.4 Å, 4.2 Å)
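One way a kernel can be built from 3D coordinates is to histogram the pairwise interatomic distances and compare the histograms. The sketch below assumes that approach; the bin width, the function names, and the toy coordinates are hypothetical and are not taken from the talk.

```python
import math
from collections import Counter
from itertools import combinations


def distance_histogram(coords, bin_width=1.0):
    """Histogram of pairwise distances (in Å), keyed by bin index."""
    return Counter(
        int(math.dist(p, q) // bin_width) for p, q in combinations(coords, 2)
    )


def histogram_kernel(hist_a, hist_b):
    """Dot product of two distance histograms."""
    return sum(count * hist_b[b] for b, count in hist_a.items())


# Toy coordinates in Å for three atoms (illustrative only).
coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.0, 2.0, 0.0)]
hist = distance_histogram(coords)
```

Because it depends only on distances, this representation is unchanged by rotating or translating the molecule.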

Example of Results

Results

Example of Results

Summary
Derived a variety of kernels for small molecules
State-of-the-art performance on several benchmark datasets
2D kernels slightly better than 1D and 3D kernels
Many possible extensions: 2.5D kernels, isomers, etc.
Need for larger data sets and new models of cooperation in the chemistry community
Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life)
Chemistry version of the Turing test

ChemDB
7M compounds (3.5M unique)
Commercially available
PostgreSQL/Oracle
Annotation (Experimental, Computational)
Searchable
Web interface
Similarity, in silico reactions

Acknowledgements
Informatics: Liva Ralaivola, J. Chen, S. J. Swamidass, Yimeng Dou, Peter Phung, Jocelyne Bruand
Funding: NIH, NSF, IGB
Pharmacology: Daniele Piomelli
Chemistry: G. Weiss, J. S. Nowick, R. Chamberlin

New Questions
Predict drug-like molecules? Toxicity?
New Strategies
How can we search efficiently? Intelligently?
New data structures and algorithms
Optimizing old structures
How can we understand this much data?
Cluster and visualize millions of data points
Define commercially accessible space.
Are there other useful things we can do with this?
Discover new polymers, etc.
Wonder about the origin of life.
Combinatorially combine all known chemicals.

Acknowledgements
Jocelyne Bruand, Peter Phung, Liva Ralaivola, S. Joshua Swamidass, Yimeng Dou
NIH/NSF/IGB
Questions

Docking
Database of potential drugs: 6 million small molecules …
Query: Binding Site of Protein
Scoring Function & Efficient Minimizer

Some Targets
P53 (Luecke)
ACCD5 (Tsai)
IMPDH, PPAR, etc. (Luecke)
HIV Integrase (Robinson)

P53

Drug Rescue of P53 Mutants

Docking → ChemDB
~6 million commercially available compounds
Searchable, annotated, downloadable.
Other Databases: Cambridge Structural Database, ChemBank, PubChem

Chemical Toxicity Prediction By Kernel Methods
Jonathan Chen, S. Joshua Swamidass
The Baldi Lab

Data Flow (diagram): Toxicity State List (ID / Toxic?: 1 No, 2 No, 3 Yes, 4 Yes), Kernel, Gram Matrix, Linear Classifier, Predictions
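The data flow on this slide can be sketched end to end: a kernel turns a list of labeled molecules into a Gram matrix, a linear classifier is trained on that matrix, and the classifier then predicts toxicity for new molecules. The code below is only an illustration under loud assumptions: the fingerprints are invented, the kernel is Tanimoto, and the classifier is a simple kernel perceptron rather than whatever was used in the talk. Only the four ID/label pairs come from the slide.

```python
def tanimoto(a, b):
    """Tanimoto kernel on two sets of fingerprint features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gram_matrix(items, kernel):
    """Kernel value for every pair of training molecules."""
    return [[kernel(a, b) for b in items] for a in items]


def train(gram, ys, epochs=50):
    """Kernel perceptron: learns one coefficient per training molecule."""
    n = len(ys)
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * ys[j] * gram[j][i] for j in range(n))
            if ys[i] * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if not mistakes:
            break
    return alpha


def predict(fp, train_fps, ys, alpha, kernel):
    """Return "Yes" (toxic) or "No" for a new fingerprint."""
    score = sum(a * y * kernel(t, fp) for a, y, t in zip(alpha, ys, train_fps))
    return "Yes" if score > 0 else "No"


# Hypothetical fingerprints (sets of labeled paths); not real data.
fingerprints = {
    1: {"CsC", "CsO"},
    2: {"CsC", "CsO", "CdO"},
    3: {"NdO", "CarC"},
    4: {"NdO", "CarC", "CsN"},
}
toxic = {1: "No", 2: "No", 3: "Yes", 4: "Yes"}  # state list from the slide

ids = sorted(fingerprints)
train_fps = [fingerprints[i] for i in ids]
ys = [1 if toxic[i] == "Yes" else -1 for i in ids]
gram = gram_matrix(train_fps, tanimoto)
alpha = train(gram, ys)
```

Swapping in a different kernel (for example a spectrum or MinMax kernel) changes only the `kernel` argument; the classifier code stays the same.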

Results

Example of Results (table). Columns: Kernel/Method, Mutag, MM, FM, MR, FR. Rows: Kashima (2003); SMILES spec; Tanimoto; MinMax; Tanimoto and Hybrid with l = 1024, l = 512, and l = MI (each with a parameter b); Histogram.

Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Catalog
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

Datasets

Small Molecules as Undirected Labeled Graphs of Bonds
atom/node labels: A = {C, N, O, H, …}
bond/edge labels: B = {s, d, t, ar, …}

Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Bioinformatics analogy: Catalog (GenBank), Search (BLAST)
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
