© 2001, Boehringer, Inc. - All Rights Reserved. SCA: New Cluster Algorithm for Structural Diversity Analysis and Applications Presented at Spotfire Users.

SCA: New Cluster Algorithm for Structural Diversity Analysis and Applications
Presented at Spotfire Users Conference
Jun Xu, Boehringer Ingelheim Pharmaceuticals, Inc.
May 3, 2001

Introduction: Diversity & Drug Design
Lead Screening
– Select compounds for UHTS
– Select compounds for acquisition
Combinatorial Library Design
– Compare virtual libraries
– Compare virtual libraries against existing inventory
– Select a sub-library to make

Importance of Data Visualization
Graphically review structural diversity
Graphically filter unwanted compounds
Graphically select a subset
Graphically study the relations between structure and activity

Challenge!
Chemical structures are graphs
The number of compounds in a library can be very large

Solutions for studying the diversity of a large compound library: conventional methods

Mapping
Principal Component Analysis (PCA)
– Transforms a matrix M(m,n) to M'(m,n')
– The n' dimensions are sorted by their eigenvalues
– If the top three dimensions explain >85% of the variance in the data, M'(m,3) is a fair approximation of M(m,n); otherwise PCA cannot be used for mapping
Multi-Dimensional Scaling (MDS)
– Based on a distance matrix
– Converts M(m,n) to M'(m,2) by an irrational method
One of the problems
– The new dimensions have no chemical/physical meaning
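The PCA rule of thumb above can be sketched in a few lines. This is a minimal illustration assuming NumPy is available; the function name `pca_map`, the synthetic descriptor matrix, and the reading of "explain >85% of the data" as explained variance are assumptions made for the example, not part of the original slides.

```python
import numpy as np

def pca_map(M, n_keep=3, threshold=0.85):
    """Project the rows of M (m objects x n descriptors) onto the top components.

    Returns the mapped coordinates M'(m, n_keep), the fraction of variance
    the kept components explain, and whether that fraction passes the threshold.
    """
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=0)             # remove column means
    cov = np.cov(centered, rowvar=False)      # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # sort dimensions by eigenvalue, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals[:n_keep].sum() / eigvals.sum()
    coords = centered @ eigvecs[:, :n_keep]   # M'(m, n_keep)
    return coords, explained, explained > threshold

# Synthetic example (assumption): 200 objects, 6 descriptors driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
M = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(200, 6))

coords, explained, usable = pca_map(M)
print(coords.shape, round(float(explained), 3), bool(usable))
```

When `usable` is false, the slide's advice applies: the three-dimensional map loses too much of the data to be trusted.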

An example of mapping

Clustering
Divide n objects into m bins (n ≥ m)
Clustering is pattern recognition
Clustering can be unsupervised learning

General steps for clustering
Select the data describing the objects
Extract patterns from the data
– normalizing rows
– normalizing columns
– normalizing methods
Measure similarity
Select a proper and robust clustering method
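As a rough illustration of the normalization and similarity steps, the sketch below z-scores each descriptor column and then computes pairwise Euclidean distances. The choice of z-scoring and Euclidean distance, and all function names, are assumptions for the example; the slides do not say which normalization or similarity measure was intended.

```python
import math

def normalize_columns(rows):
    """Z-score each column so descriptors on different scales become comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def euclidean(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(rows):
    """Full pairwise distance matrix for a list of descriptor vectors."""
    return [[euclidean(u, v) for v in rows] for u in rows]

# Two descriptors on very different scales (made-up numbers).
raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = normalize_columns(raw)
dist = distance_matrix(scaled)
```

Without the column normalization, the second descriptor would dominate every distance simply because its numbers are larger.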

Problems in conventional methods
Selecting and computing “correct” descriptors is difficult and time-consuming
Hierarchical algorithms force “dogs” and “cats” to be together
Non-hierarchical algorithms ask for the “number of clusters” and other settings
The SOM method asks you to set at least eight irrational parameters

How many do you want? How many clusters are in my library?... K-means clustering:

K-means and K-nearest Neighbor Approaches
Assume the number of clusters is known
Computing complexity:
N_j represents the number of jth combinations in k clusters (groups)
n represents the number of objects
n_i represents the number of objects in the ith cluster
k represents the number of clusters
It is an NP-complete problem
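The complexity formula from this slide did not survive in the transcript, so the sketch below does not reproduce it. Instead it uses a standard combinatorial result, the Stirling number of the second kind S(n, k), which counts the ways to split n objects into k non-empty clusters, to show how quickly exhaustive search becomes infeasible. The function name `stirling2` is chosen for the example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n labelled objects into k non-empty clusters."""
    if n == k:
        return 1            # every object in its own cluster (also covers n = k = 0)
    if k == 0 or k > n:
        return 0            # impossible partitions
    # Object n either joins one of the k existing clusters or starts a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

for n, k in [(4, 2), (10, 3), (50, 5)]:
    print(n, k, stirling2(n, k))
```

Even 50 objects in 5 clusters already gives an astronomically large number of candidate partitions, which is why practical methods rely on heuristics rather than enumeration.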

Self Organization Map (SOM) Approach
To run SOM, 8 parameters have to be set up properly, as follows:
– Data initialization: random or ordered
– Neighborhood function: bubble or Gaussian
– Neuron topology: hexagonal or rectangular
– Neural dimensions: X and Y (how many cells/neurons)
– Number of training steps: such as 10,000
– Initial learning rate: such as 0.03
– Initial radius of training area: such as 10
– Monitoring parameter: number of steps for generating 2D points on a plane, such as 100
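To make the size of this configuration burden concrete, the eight settings can be written down as a single record. This is only a sketch: the class name `SomSettings`, the field names, and the validation rules are hypothetical, and the 10 x 10 map size is a placeholder because the slide gives no example value for X and Y. The other defaults are the example values quoted above.

```python
from dataclasses import dataclass, fields
from typing import Tuple

@dataclass(frozen=True)
class SomSettings:
    """Hypothetical container for the eight SOM parameters listed on the slide."""
    initialization: str = "random"           # "random" or "ordered"
    neighborhood: str = "gaussian"           # "bubble" or "gaussian"
    topology: str = "hexagonal"              # "hexagonal" or "rectangular"
    dimensions: Tuple[int, int] = (10, 10)   # X and Y neuron counts (placeholder value)
    training_steps: int = 10_000
    initial_learning_rate: float = 0.03
    initial_radius: float = 10.0
    monitoring_interval: int = 100           # steps between 2D point snapshots

    def __post_init__(self):
        # Reject values outside the choices named on the slide.
        if self.initialization not in ("random", "ordered"):
            raise ValueError("initialization must be 'random' or 'ordered'")
        if self.neighborhood not in ("bubble", "gaussian"):
            raise ValueError("neighborhood must be 'bubble' or 'gaussian'")
        if self.topology not in ("hexagonal", "rectangular"):
            raise ValueError("topology must be 'hexagonal' or 'rectangular'")
        if min(self.dimensions) < 1:
            raise ValueError("map dimensions must be positive")
        if self.training_steps < 1 or self.monitoring_interval < 1:
            raise ValueError("step counts must be positive")
        if self.initial_learning_rate <= 0 or self.initial_radius <= 0:
            raise ValueError("learning rate and radius must be positive")

settings = SomSettings(topology="rectangular")
print([f.name for f in fields(settings)])
```

Every one of these fields changes the resulting map, and none of them has an obvious chemical interpretation, which is the point the slide is making.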

S-Cluster: New approach
No need to compute descriptors
No need to give the number of clusters
Faster
Rational parameters
Results are explained chemically

S-Cluster Algorithm (1)
Extract scaffolds
Reference scaffold (S_v):
– number of rings in the smallest set of smallest rings (sssrs)
– number of non-H atoms (atoms)
– number of bonds, excluding bonds to H (bonds)
– sum of non-H atomic numbers (zs)
– V_v = { sssrs, atoms, bonds, zs } for S_v

Deriving Scaffolds

S-Cluster Algorithm (2)
The complexity of a structure:
V_i = { sssrs, atoms, bonds, zs } for structure S_i
V_v = { sssrs, atoms, bonds, zs } from a reference scaffold
P_i = || V_v + V_i ||
M_i = || V_v - V_i ||
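The complexity vectors and the two norms can be computed from a plain heavy-atom graph. The sketch below is an illustration under stated assumptions, not the author's implementation: molecules are given as a list of atomic numbers plus a list of bonds, the ring count uses the standard relation rings = bonds - atoms + connected components, and || · || is taken to be the Euclidean norm, which the slides do not specify. All function names are hypothetical.

```python
import math

def count_components(n_atoms, bonds):
    """Count connected components with a small union-find."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in bonds:
        parent[find(i)] = find(j)
    return len({find(x) for x in range(n_atoms)})

def complexity_vector(atomic_numbers, bonds):
    """Return (sssrs, atoms, bonds, zs) for a hydrogen-free molecular graph."""
    n_atoms, n_bonds = len(atomic_numbers), len(bonds)
    sssrs = n_bonds - n_atoms + count_components(n_atoms, bonds)
    return (sssrs, n_atoms, n_bonds, sum(atomic_numbers))

def norm(v):
    """Euclidean norm (assumed; the slides only write || . ||)."""
    return math.sqrt(sum(x * x for x in v))

def p_and_m(v_ref, v_i):
    """P_i = ||V_v + V_i|| and M_i = ||V_v - V_i||."""
    p = norm([a + b for a, b in zip(v_ref, v_i)])
    m = norm([a - b for a, b in zip(v_ref, v_i)])
    return p, m

# Benzene ring as the reference scaffold, toluene as the structure.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
v_ref = complexity_vector([6] * 6, ring)
v_tol = complexity_vector([6] * 7, ring + [(0, 6)])
p_i, m_i = p_and_m(v_ref, v_tol)
print(v_ref, v_tol, round(p_i, 3), round(m_i, 3))
```

A structure identical to the reference scaffold gives M_i = 0, and M_i grows as the structure departs from the scaffold.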

S-Cluster Algorithm (3)
The “Cyclicity” of a structure
– The sum of heavy atomic numbers (a)
– The number of rotating bonds (r)
– The number of 1-degree nodes (d1)
– The number of double bonds (db)
– The number of triple bonds (tb)
– The number of 2-degree nodes (d2)
– V_s = { a, r, d1, db, tb, d2 } for the scaffold
– V_i = { a, r, d1, db, tb, d2 } for structure i
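The six components of this vector can be read off a heavy-atom graph with bond orders. The sketch below only builds the vector; the slides do not show how V_s and V_i are combined into a single cyclicity value, so that step is left out. The definition of a rotating bond used here (a single bond outside any ring whose two ends both have other heavy-atom neighbors) is an assumption, and the function names are hypothetical.

```python
def in_ring(i, j, adjacency):
    """True if j can be reached from i without crossing the bond i-j itself."""
    stack, seen = [i], {i}
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if node == i and nxt == j:
                continue                      # skip the bond under test
            if nxt == j:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def cyclicity_vector(atomic_numbers, bonds):
    """Return (a, r, d1, db, tb, d2); bonds are (i, j, order) tuples without H."""
    adjacency = {k: [] for k in range(len(atomic_numbers))}
    for i, j, _ in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    degree = {k: len(v) for k, v in adjacency.items()}

    a = sum(atomic_numbers)
    d1 = sum(1 for d in degree.values() if d == 1)
    d2 = sum(1 for d in degree.values() if d == 2)
    db = sum(1 for _, _, order in bonds if order == 2)
    tb = sum(1 for _, _, order in bonds if order == 3)
    # Assumed definition: single, acyclic, and neither end is a terminal atom.
    r = sum(1 for i, j, order in bonds
            if order == 1 and degree[i] > 1 and degree[j] > 1
            and not in_ring(i, j, adjacency))
    return (a, r, d1, db, tb, d2)

butane = cyclicity_vector([6] * 4, [(0, 1, 1), (1, 2, 1), (2, 3, 1)])
kekule_ring = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1)]
toluene = cyclicity_vector([6] * 7, kekule_ring + [(0, 6, 1)])
print(butane, toluene)
```

The chain-like butane has a rotating bond and two terminal atoms, while the ring-dominated toluene has none of the former and only one of the latter, which matches the intuition behind the name.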

Results and discussions
Cluster the following libraries together:
– ACD (250,468 structures)
– NCI (126,554 structures, MDL 1994)
– CMC (4591 oral drugs)
– MDDR (6347 launched or pre-clinical drugs or compounds)
Clustered all 387,960 structures on an NT laptop (Compaq Armada E700)
Running time: 1 h 42 min

Cyclicity vs Complexity

Most complicated structure is on the upper-right

Most chain-like structure is on the bottom-left

Zoom-in: Substituent Patterns

Diversity “Island” and “Density”
A: Single O substituents
B: Single F substituents

“Cyclicity” vs Average Electronegativity

“Cyclicity” vs H-Bond Donors

Reagent Selector (R) Clustering Result (Jarvis-Patrick Method):
Input 116 compounds, ask for 26 clusters
This is cluster 2

Result from the S-Cluster Algorithm:
Input 116 compounds, 26 clusters were found
This is cluster 2

Applications
Evaluate libraries
Compare libraries
Design a focused library

Blue: Virtual Library
Red: Target Library

The optimized sub-library to be made from the virtual library

But if you still want to cluster molecules (genes or small molecules) based upon their property/activity arrays...
We have V-Cluster (Vector Cluster Algorithm) for this requirement; it will be presented later

Conclusions
We emphasize finding natural clusters
There must be chemical/physical explanations for computational results
Before a software “button” is pushed, the mathematical/chemical/physical/biological meaning should be understood
A good algorithm should be robust

Acknowledgements
Cheminformatics/Medicinal Chemistry
– Dr. Qiang Zhang
– Dr. Hans Briem
– Dr. Ron Magolda