Exploring Chemical Space with Computers—Challenges and Opportunities Pierre Baldi UCI.

Chemical Informatics Historical perspective: physics, chemistry and biology Understanding chemical space Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)

Chemical Space
            Stars        Small Mol.
Existing    10^22        10^7
Virtual     0            10^60 (?)
Access      Difficult    “Easy”
Mode        Individual   Combinatorial

Chemical Space

Chemical Informatics Historical perspective: physics, chemistry and biology Understanding chemical space Small molecules (systems biology, chemical synthesis, drug design, nanotechnology) Predict physical, chemical, biological properties (classification/regression) Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

Methods Spectrum: Schrödinger Equation, Molecular Dynamics, Machine Learning (e.g. SS prediction)

Chemical Informatics Informatics must be able to deal with variable-size structured data Graphical Models (Recursive) Neural Networks ILP GA SGs Kernels

Two Essential Ingredients 1. Data 2. Similarity Measures Bioinformatics analogy and differences: Data (GenBank, Swissprot, PDB) Similarity (BLAST)

Data
Mutag (Mutagenicity): 200 compounds (125/63), mutagenicity in Salmonella
PTC (Predictive Toxicity Challenge): a few hundred compounds, carcinogenicity (FM, MM, FR, MR)
NCI (Anti-cancer activity): 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
Alkanes (Boiling points): all 150 non-cyclic alkanes (C_nH_{2n+2}) with n < 11 and their boiling points ([-164, 174] °C)
Benzodiazepines (QSAR): 79 1,4-benzodiazepine-2-ones, affinity towards GABA_A
ChemDB: 7M compounds

Similarity Rapid Searches of Large Databases Predictive Methods (Kernel Methods) Why it is not hopeless?

Similarity Rapid Search of Large Databases Protein Receptor (Docking) Small Molecule/Ligand (Similarity) Small Molecule/Ligand (Similarity) Predictive Methods (Kernel Methods) Why it is not hopeless Organic Chemicals

Linear Classifiers

Classification
Learning to Classify
Limited number of training examples (molecules, patients, sequences, etc.)
Learning algorithm (how to build the classifier?)
Generalization: should correctly classify test data.
Formalization
X is the input space
Y (e.g. toxic/non-toxic, or {1, -1}) is the target class
f: X → Y is the classifier.

Classification Fundamental Point: f is entirely determined by the dot products $\langle x_i, x_j \rangle$ measuring the similarity between pairs of data points.
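One way to see why only dot products matter is to write a linear classifier in its dual form. The derivation below is a sketch under a stated assumption: the weight vector is a linear combination of the training points, as is the case for the perceptron and for support vector machines. The coefficients $\alpha_i$, the bias $b$, and the labels $y_i \in \{1, -1\}$ are notation introduced here, not symbols from the slides.

```latex
\begin{aligned}
w &= \sum_{i=1}^{n} \alpha_i \, y_i \, x_i \\
f(x) &= \operatorname{sign}\bigl(\langle w, x \rangle + b\bigr)
      = \operatorname{sign}\Bigl(\sum_{i=1}^{n} \alpha_i \, y_i \, \langle x_i, x \rangle + b\Bigr)
\end{aligned}
```

Under that assumption, the classifier needs the dot products between a new point and the training points rather than the coordinates themselves.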

Non Linear Classification (Kernel Methods) We can transform a nonlinear problem into a linear one using a kernel.

Non Linear Classification (Kernel Methods) We can transform a nonlinear problem into a linear one using a kernel K. Fundamental property: the linear decision surface depends on $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. All we need is the Gram similarity matrix K. K defines the local metric of the embedding space.
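As a small numerical check of this property, the sketch below builds a Gram matrix twice: once with a kernel function and once with an explicit feature map followed by an ordinary dot product. The degree-2 polynomial kernel, the map `phi`, and the sample points are illustrative choices made here, not taken from the slides.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel on 2D points: K(x, z) = <x, z>^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(a, b) for b in points] for a in points]

# Made-up sample points, for illustration only.
points = [(1.0, 2.0), (-1.0, 0.5), (0.0, 3.0)]

K = gram_matrix(points, poly2_kernel)
K_explicit = gram_matrix(points, lambda a, b: dot(phi(a), phi(b)))

# Largest entrywise difference between the two constructions.
max_gap = max(abs(K[i][j] - K_explicit[i][j])
              for i in range(len(points)) for j in range(len(points)))
```

The two matrices agree up to rounding, so a learner that only reads `K` never needs to compute `phi` explicitly.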

Similarity: Data Representations NC(O)C(=O)O

Molecular Representations 1D: SMILES strings 2D: Graph of bonds 2.5D: Surfaces 3D: Atomic coordinates 4D: Temporal evolution

15 Total: 1D SMILES Kernel CCCCCCc1ccc(cc1O)O CCCCCc1ccc(cc1)CO
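A 1D kernel of this kind can be sketched as a spectrum kernel: count every substring of length k in each SMILES string and take the dot product of the counts. This is an assumption about the general technique, not the exact kernel from the talk. The substring length `k=3` and the cosine normalization are choices made here, and the strings are treated as raw characters, so multi-character atoms such as `Cl` would be split.

```python
import math
from collections import Counter

def spectrum(smiles, k):
    """Counts of all length-k substrings of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Dot product of the two k-mer count vectors."""
    cs, ct = spectrum(s, k), spectrum(t, k)
    return sum(count * ct[sub] for sub, count in cs.items())

def normalized_kernel(s, t, k=3):
    """Cosine-normalized kernel, so that a string compared to itself gives 1."""
    return spectrum_kernel(s, t, k) / math.sqrt(
        spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))

# The two SMILES strings shown on the slide.
a = "CCCCCCc1ccc(cc1O)O"
b = "CCCCCc1ccc(cc1)CO"

similarity = normalized_kernel(a, b)
```

Because the kernel is a dot product of count vectors, it is symmetric and positive semidefinite, which is what a kernel-based classifier requires.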

2D Molecule Graph Kernel For chemical compounds atom/node labels: A = {C,N,O,H, … } bond/edge labels: B = {s, d, t, ar, … } Count labeled paths Fingerprints (CsNsCdO)
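The path-counting idea can be sketched as follows: walk every simple path up to a fixed number of bonds, write it as a string of alternating atom and bond labels (such as `NsCsCdO`), and compare the resulting path collections with Tanimoto or MinMax similarity. The graph encoding, the depth limit `max_bonds`, and the two hand-built molecules are assumptions made for illustration. The sketch also skips the hashing of paths into fixed-length bit vectors that real fingerprints use.

```python
from collections import Counter

def labeled_paths(atoms, bonds, max_bonds=3):
    """Count labeled simple paths with at most max_bonds bonds.

    Paths are walked from every atom, so each path is seen in both directions.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for (i, j), label in bonds.items():
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))
    counts = Counter()

    def extend(path_atoms, text):
        counts[text] += 1
        if len(path_atoms) - 1 == max_bonds:
            return
        for nxt, label in neighbors[path_atoms[-1]]:
            if nxt not in path_atoms:  # simple paths only
                extend(path_atoms + [nxt], text + label + atoms[nxt])

    for start in range(len(atoms)):
        extend([start], atoms[start])
    return counts

def tanimoto(p, q):
    """Shared distinct paths divided by all distinct paths."""
    return len(p.keys() & q.keys()) / len(p.keys() | q.keys())

def minmax(p, q):
    """Count-aware variant: sum of minima over sum of maxima."""
    keys = p.keys() | q.keys()
    return sum(min(p[k], q[k]) for k in keys) / sum(max(p[k], q[k]) for k in keys)

# NC(O)C(=O)O with implicit hydrogens; s = single bond, d = double bond.
mol_a = (["N", "C", "O", "C", "O", "O"],
         {(0, 1): "s", (1, 2): "s", (1, 3): "s", (3, 4): "d", (3, 5): "s"})
# Glycine, NCC(=O)O, chosen here as a second molecule for comparison.
mol_b = (["N", "C", "C", "O", "O"],
         {(0, 1): "s", (1, 2): "s", (2, 3): "d", (2, 4): "s"})

paths_a = labeled_paths(*mol_a)
paths_b = labeled_paths(*mol_b)
sim_tanimoto = tanimoto(paths_a, paths_b)
sim_minmax = minmax(paths_a, paths_b)
```

Tanimoto looks only at which paths occur, while MinMax also uses how often each one occurs, which is one plausible reading of the difference between the two 2D kernels in the results table.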

Similarity Measures

3D Coordinate Kernel 1.4 Å 2.0 Å 2.8 Å 3.4 Å 4.2 Å
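One plausible way to turn atomic coordinates into a kernel is to histogram all pairwise interatomic distances and take the dot product of the histograms. This is only a sketch: the slides do not spell out the construction, so the bin width, the number of bins, and the two toy geometries below are assumptions made here.

```python
import math
from itertools import combinations

def distance_histogram(coords, bin_width=0.5, n_bins=12):
    """Histogram of pairwise distances (in Å); the last bin collects overflow."""
    hist = [0] * n_bins
    for a, b in combinations(coords, 2):
        idx = min(int(math.dist(a, b) / bin_width), n_bins - 1)
        hist[idx] += 1
    return hist

def histogram_kernel(h1, h2):
    """Dot product of two distance histograms."""
    return sum(a * b for a, b in zip(h1, h2))

# Toy geometry 1: a planar six-membered ring with 1.4 Å sides.
ring = [(1.4 * math.cos(k * math.pi / 3), 1.4 * math.sin(k * math.pi / 3), 0.0)
        for k in range(6)]
# Toy geometry 2: three atoms on a line, 1.4 Å apart.
chain = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.8, 0.0, 0.0)]

h_ring = distance_histogram(ring)
h_chain = distance_histogram(chain)
k_ring_chain = histogram_kernel(h_ring, h_chain)
```

A distance histogram does not change when the molecule is rotated or translated, so no structural alignment is needed before comparing two molecules.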

Example of Results

Results

Example of Results

Summary
Derived a variety of kernels for small molecules
State-of-the-art performance on several benchmark datasets
2D kernels slightly better than 1D and 3D kernels
Many possible extensions: 2.5D kernels, isomers, etc.
Need for larger data sets and new models of cooperation in the chemistry community
Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life)
Chemistry version of the Turing test

ChemDB 7M compounds (3.5M unique) Commercially available PostgreSQL/Oracle Annotation (Experimental, Computational) Searchable Web interface Similarity, in silico reactions

Acknowledgements
Informatics: Liva Ralaivola, J. Chen, S. J. Swamidass, Yimeng Dou, Peter Phung, Jocelyne Bruand
Pharmacology: Daniele Piomelli
Chemistry: G. Weiss, J. S. Nowick, R. Chamberlin
Funding: NIH, NSF, IGB

New Questions Predict drug-like molecules? toxicity? New Strategies How can we search efficiently? Intelligently? New data structures and algorithms Optimizing old structures How can we understand this much data? Cluster and visualize millions of data points Define commercially accessible space. Are there other useful things we can do with this? Discover new polymers, etc. Wonder about the origin of life. Combinatorially combine all known chemicals.

Acknowledgements Jocelyne Bruand Peter Phung Liva Ralaivola S. Joshua Swamidass Yimeng Dou NIH/NSF/IGB Questions

Docking Database of potential drugs 6 million small molecules … Query: Binding Site of Protein Scoring Function & Efficient Minimizer

Some Targets P53 (Luecke) ACCD5 (Tsai) IMPDH, PPAR, etc. (Luecke) HIV Integrase (Robinson)

Drug Rescue of P53 Mutants

Docking → ChemDB ~6 million commercially available compounds Searchable, annotated, downloadable. Other Databases: Cambridge Structural Database ChemBank PubChem

Chemical Toxicity Prediction By Kernel Methods Jonathan Chen S Joshua Swamidass The Baldi Lab

Data Flow: Toxicity State List (ID, Toxic?: 1 No, 2 No, 3 Yes, 4 Yes) → Kernel → Gram Matrix → Linear Classifier → Predictions
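The data flow can be sketched end to end on a toy example: a labeled compound list goes through a kernel to produce a Gram matrix, a linear classifier is trained on that matrix alone, and predictions come out. The IDs and labels mirror the four-row list on the slide. The feature sets, the Tanimoto kernel on sets, and the kernel perceptron standing in for the linear classifier are assumptions made here for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    return len(a & b) / len(a | b)

# ID -> (hypothetical path features, label); +1 = toxic, -1 = not toxic.
compounds = {
    1: ({"CsC", "CsO", "CdO"}, -1),
    2: ({"CsC", "CsO"}, -1),
    3: ({"CsN", "NdO", "CarC"}, +1),
    4: ({"CsN", "NdO", "CsC"}, +1),
}
ids = sorted(compounds)
features = [compounds[i][0] for i in ids]
labels = [compounds[i][1] for i in ids]

# Step 1: kernel -> Gram matrix.
gram = [[tanimoto(a, b) for b in features] for a in features]

def train_kernel_perceptron(gram, labels, epochs=20):
    """Mistake counts alpha; training reads only the Gram matrix."""
    n = len(labels)
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * labels[j] * gram[j][i] for j in range(n))
            if labels[i] * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(alpha, labels, kernel_row):
    """kernel_row[j] is the kernel between the query and training compound j."""
    score = sum(a * y * k for a, y, k in zip(alpha, labels, kernel_row))
    return 1 if score > 0 else -1

# Step 2: Gram matrix -> linear classifier.
alpha = train_kernel_perceptron(gram, labels)

# Step 3: classifier -> predictions, for the training list and one new query.
train_predictions = [predict(alpha, labels, gram[i]) for i in range(len(ids))]
query = {"CsN", "NdO"}
query_prediction = predict(alpha, labels, [tanimoto(query, f) for f in features])
```

Swapping in a different kernel changes only the line that builds `gram`; the classifier code stays the same. That is the practical appeal of separating the kernel from the learner.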

Results

Example of Results

Kernel/Method                  Mutag   MM     FM     MR     FR
Kashima (2003)                 89.1    61.0   61.0   62.8   66.7
Kashima (2003)                 85.1    64.3   63.4   58.4   66.1
1D SMILES spec.                84.0    66.1   61.3   57.3   66.1
1D SMILES spec+                85.6    66.4   63.0   57.6   67.0
2D Tanimoto                    87.8    66.4   64.2   63.7   66.7
2D MinMax                      86.2    64.0   64.5   64.5   66.4
2D Tanimoto, l = 1024, b = 1   87.2    66.1   62.4   65.7   66.9
2D Hybrid l = 1024, b = 1      87.2    65.2   61.9   64.2   65.8
2D Tanimoto, l = 512, b = 1    84.6    66.4   59.9   59.9   66.1
2D Hybrid l = 512, b = 1       86.7    65.2   61.0   60.7   64.7
2D Tanimoto, l = 1024 + MI     84.6    63.1   63.0   61.9   66.7
2D Hybrid l = 1024 + MI        84.6    62.8   63.7   61.9   65.5
2D Tanimoto, l = 512 + MI      85.6    60.1   61.0   61.3   62.4
2D Hybrid l = 512 + MI         86.2    63.7   62.7   62.2   64.4
3D Histogram                   81.9    59.8   61.0   60.8   64.4

Chemical Informatics Historical perspective: physics, chemistry and biology Understanding chemical space Small molecules (systems biology, chemical synthesis, drug design, nanotechnology) Catalog Predict physical, chemical, biological properties Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

Datasets

Small Molecules as Undirected Labeled Graphs of Bonds atom/node labels: A = {C,N,O,H, … } bond/edge labels: B = {s, d, t, ar, … }

Chemical Informatics Historical perspective: physics, chemistry and biology Understanding chemical space Small molecules (systems biology, chemical synthesis, drug design, nanotechnology) Bioinformatics analogy: Catalog (GenBank) Search (BLAST) Predict physical, chemical, biological properties Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
