Fusing database rankings in similarity-based virtual screening Peter Willett, University of Sheffield.



Overview
- Similarity-based virtual screening
- Combination of similarity rankings
- Similarity fusion
- Group fusion
- Comparison of fusion rules

Drug discovery
- The pharmaceutical industry has been one of the great success stories of scientific research, discovering a range of novel drugs for important therapeutic areas
- The computer has revolutionised how the industry uses chemical (and increasingly biological) information
- Many of these developments are within the discipline we now know as chemoinformatics
- “Chem(o)informatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information” (G. Paris at a 1999 ACS meeting, quoted at ...)
- Focus on structural information (2D or 3D), cf. bioinformatics

Virtual screening
- Chemoinformatics covers a wide range of techniques
- Here, the focus is on virtual screening of existing public and in-house databases
- Tools to rank compounds in order of decreasing probability of activity
- The top-ranked molecules are then prioritised for biological screening
- A range of virtual screening methods is available, with similarity searching being one of the best established and most widely used

Similarity searching
- Use of a similarity measure to quantify the resemblance between an active reference (or target) structure and each database structure
- Given a reference structure, find the molecules in a database that are most similar to it (“give me ten more like this”)
  - Compare the reference structure with each database structure and measure the similarity
  - Sort the database in order of decreasing similarity
  - Display the top-ranked structures (“nearest neighbours”) to the searcher

2D similarity searching

Rationale for similarity searching
- The similar property principle states that structurally similar molecules tend to have similar properties
- Given a known active reference structure, a similarity search of a database can be used to identify further molecules for testing
- NB: there are many exceptions to the similar property principle

Similarity measures
- A similarity measure has two principal components
  - A structure representation: characterises the reference and database structures to enable rapid comparison
  - A similarity coefficient to compare two representations: a quantitative measure of the resemblance of these characterisations
- The most common measure is based on the use of 2D fingerprints and the Tanimoto coefficient (as in the previous example)

Fingerprints
- A simple, but approximate, representation that encodes the presence of fragment substructures in a bit-string or fingerprint
  - Cf. keywords indexing textual documents
- Each bit in the bit-string (binary vector) records the presence (“1”) or absence (“0”) of a particular fragment in the molecule
- Typical length is a few hundred or a few thousand bits
- Two fingerprints are regarded as similar if they have many common bits set

Tanimoto coefficient
- Tanimoto coefficient for binary bit strings
  - C bits set in common between Reference and Database structures
  - R bits set in Reference structure
  - D bits set in Database structure
- More complex form for use with non-binary data, e.g., physicochemical property vectors
- Many other similarity coefficients exist
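The formula itself does not survive in the transcript. For binary fingerprints the Tanimoto coefficient is conventionally written, in the slide's notation, as C / (R + D - C). The following is a minimal sketch of that calculation and of the resulting nearest-neighbour ranking. It assumes each fingerprint is stored as a Python set of the bit positions that are set; the function names and the toy molecules are illustrative and not taken from the talk.

```python
def tanimoto(reference_bits, database_bits):
    """Tanimoto coefficient C / (R + D - C) for two binary fingerprints.

    Each fingerprint is a set holding the positions of the bits set to 1
    (an assumed representation, chosen for brevity).
    """
    c = len(reference_bits & database_bits)  # C: bits set in common
    r = len(reference_bits)                  # R: bits set in the reference
    d = len(database_bits)                   # D: bits set in the database structure
    if r + d - c == 0:                       # both fingerprints empty
        return 0.0
    return c / (r + d - c)


def rank_by_similarity(reference_bits, database):
    """Return (name, score) pairs sorted by decreasing similarity."""
    scored = [(name, tanimoto(reference_bits, bits))
              for name, bits in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: three database structures and one reference.
reference = {1, 4, 7, 9}
database = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 4, 8},
    "mol_c": {2, 3},
}
ranking = rank_by_similarity(reference, database)
```

Real fingerprints would normally be packed bit vectors produced by a cheminformatics toolkit; sets are used here only to keep the arithmetic visible.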

Data fusion: I
- Many comparisons of effectiveness using different screening methods (e.g., different coefficients, different fingerprints, 2D or 3D methods)
- Sheridan and Kearsley, Drug Discov. Today, 7, 2002, 903: “We have come to regard looking for ‘the best’ way of searching chemical databases as a futile exercise. In both retrospective and prospective studies, different methods select different subsets of actives for the same biological activity and the same method might work better on some activities than others”
- Different types of coefficient and different types of representation reflect different molecular characteristics, so search performance may be enhanced by using more than one similarity measure

Data fusion: II
- Use of ideas from textual information retrieval (IR), given the analogies between the two domains
  - Documents, keywords with highly skewed frequency distributions, and relevance to a query
  - Molecules, fragments with highly skewed frequency distributions, and activity against a specific biological target
- IR-like fusion first studied in the late Nineties
  - Generate multiple rankings from the same reference structure using different similarity measures (similarity fusion)
  - Found to give improved performance over use of a single similarity measure (more consistent, or even better than the best individual measure)
- Later work in chemoinformatics
  - Generate multiple rankings from different reference structures using the same similarity measure (group fusion)

Similarity fusion
- Conventional similarity searching yields a single database ranking
- Work in IR on the “Authority Effect”
  - Experiments in TREC show that documents retrieved by multiple search engines are more likely to be relevant to a query than those retrieved by a single search engine
- Does the Effect also apply in chemoinformatics?
  - Extensive virtual screening experiments to investigate whether structures retrieved by multiple virtual screening methods are more likely to be active than those retrieved by a single method

Experimental details: I
- Test collection methodology analogous to that used in IR
- Use of MDDR (ca. 102K structures) and WOMBAT (ca. 130K structures) databases
- Sets of molecules with known biological activities (several hundred known actives in each class)
- Simulated virtual screening using an active as the reference structure
  - How many of the top-ranked molecules from a search are also active?

Experimental details: II
- Sets of 25 searches for a reference structure:
  - 5 different similarity coefficients (Tanimoto, cosine, Euclidean distance, Forbes, Russell-Rao)
  - 5 different fingerprints (MDL, BCI, Daylight, Unity and ECFP_4)
- Apply a cut-off to take, e.g., the top 1% of a ranking
- Numbers of molecules, and numbers of active molecules, retrieved by 1, 2, ..., 24, 25 searches
- Average over different reference structures for each activity class, and over different activity classes
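The counting step in this protocol can be sketched as follows. This is an illustrative reconstruction rather than the code used in the study: it assumes each search result is a list of molecule identifiers in decreasing order of similarity, and the function names, toy rankings and set of actives are hypothetical.

```python
from collections import Counter


def retrieval_counts(rankings, cutoff_fraction):
    """For each molecule, count how many rankings place it above the cut-off."""
    counts = Counter()
    for ranking in rankings:
        # Keep at least one structure so that tiny toy rankings still work.
        top_n = max(1, int(len(ranking) * cutoff_fraction))
        counts.update(ranking[:top_n])
    return counts


def summarise(counts, actives):
    """Map k -> (molecules retrieved by exactly k searches, number of those that are active)."""
    summary = {}
    for molecule, k in counts.items():
        total, active = summary.get(k, (0, 0))
        summary[k] = (total + 1, active + int(molecule in actives))
    return summary


# Hypothetical toy data: three searches over four molecules, top 50% cut-off.
rankings = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["b", "a", "d", "c"],
]
counts = retrieval_counts(rankings, cutoff_fraction=0.5)
summary = summarise(counts, actives={"a"})
```

In the experiments described above there would be 25 rankings per reference structure and a 1% cut-off, with the summaries then averaged over reference structures and activity classes.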

Retrieval of molecules: WOMBAT top-1% searches

Retrieval of molecules: WOMBAT top-1% searches (average over classes)
- Zipf-like distribution

Retrieval of active molecules: WOMBAT top-1% searches

Retrieval of active molecules: WOMBAT top-1% searches (average over classes)

Similarity fusion: conclusions
- Using multiple searches hence results in:
  - A rapid decrease in the number of molecules retrieved
  - A rapid increase in the percentage of those retrieved molecules that are active
- Multiple searches could hence increase the effectiveness of similarity-based virtual screening
- Provides an empirical basis for similarity fusion (but with a very simple fusion rule)
- What about group fusion?

Use of group fusion: I
[Figure: database rankings for Reference 1, Reference 2 and Reference 3]

After truncation to required rank
[Figure: truncated rankings for Reference 1, Reference 2 and Reference 3]

Group fusion
- Use of the MDDR database (ca. 102K structures)
- Measured the numbers of actives retrieved in the top 5% of the ranking
  - Group fusion searches using ten actives picked at random
  - Comparison with the average of all the individual actives for each activity class
  - Comparison with the best single active for each activity class
- Use of Unity and ECFP_4 fingerprints
- Group fusion markedly out-performs the use of individual reference structures
- Best results obtained using the combination of scores and the MAX rule (see later)
- Hert et al., J. Chem. Inf. Comput. Sci., 44, 2004, 1177
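Group fusion with the MAX rule on scores can be sketched as below: each database structure is scored by its highest similarity to any of the reference structures. This is a minimal illustration, not the implementation of Hert et al.; it assumes set-based fingerprints and Tanimoto similarity, and all names and toy data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of set-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def group_fusion_max(references, database):
    """Rank database structures by their maximum similarity to any reference."""
    fused = {
        name: max(tanimoto(ref, bits) for ref in references)
        for name, bits in database.items()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two reference actives and three database structures.
references = [{1, 2, 3}, {7, 8, 9}]
database = {
    "mol_a": {1, 2, 4},
    "mol_b": {7, 8, 9},
    "mol_c": {5, 6},
}
ranking = group_fusion_max(references, database)
```

With the MAX rule a structure only needs to resemble one of the references to rank highly, which suits activity classes whose known actives are structurally diverse.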

Group fusion: average over 11 activity classes
[Chart: recall at 5% (%) for Unity and ECFP_4 fingerprints; series: Single Similarity - Average, Single Similarity - Maximum, Data Fusion (Scores - Max)]

Fusion rules
- Given multiple input rankings, a fusion rule outputs a single, combined ranking
- The rankings can be either the computed similarity values or the resulting rank positions
- Work in IR and chemoinformatics has used simple arithmetical operations to combine rankings (though many other, more complex types of rule are available):
  - CombMAX for similarity data
  - CombSUM for rank data
- Detailed comparison of a range of rules

Fusion rules for the x-th database structure
- $\mathrm{CombMAX}(x) = \max\{S_1(x), S_2(x), \ldots, S_i(x), \ldots, S_n(x)\}$
  - Also CombMIN
- $\mathrm{CombSUM}(x) = \sum_{i=1}^{n} S_i(x)$
  - Also CombMED and other averages, using all or just some of the rankings
- $\mathrm{CombRKP}(x) = \sum_{i=1}^{n} 1/R_i(x)$
  - Used only with rank data
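A minimal sketch of these three rules follows, where S_i(x) is the score (or rank) of structure x in the i-th list and R_i(x) is its rank position. It assumes every structure appears in every input list and that each list is a dict keyed by structure identifier; the function names and toy values are illustrative only.

```python
def comb_max(values):
    """CombMAX: the largest of the input values for one structure."""
    return max(values)


def comb_sum(values):
    """CombSUM: the sum of the input values for one structure."""
    return sum(values)


def comb_rkp(ranks):
    """CombRKP: the sum of reciprocal rank positions (rank data only)."""
    return sum(1.0 / rank for rank in ranks)


def fuse(per_search_values, rule):
    """Apply a fusion rule to every structure.

    per_search_values is a list of dicts, one per search, each mapping a
    structure to its score or rank. Assumes all dicts share the same keys.
    """
    structures = per_search_values[0].keys()
    return {x: rule([values[x] for values in per_search_values])
            for x in structures}


# Hypothetical toy data: two searches over two structures.
scores = [{"a": 0.75, "b": 0.25}, {"a": 0.25, "b": 0.5}]
ranks = [{"a": 1, "b": 2}, {"a": 1, "b": 4}]

fused_max = fuse(scores, comb_max)
fused_sum = fuse(scores, comb_sum)
fused_rkp = fuse(ranks, comb_rkp)
```

For the score-based rules and for CombRKP a larger fused value means a better structure, so the combined ranking is obtained by sorting in decreasing order of the fused value. If CombSUM is applied to raw rank positions instead, smaller sums are better and the sort order is reversed.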

Very simple rules!
- Other studies use supervised rules (logistic regression, belief theory etc.)
- But there are normally very limited training data (i.e., structures and bioactivity information) at the stage when you want to use data fusion
- If such data are available, other chemoinformatics approaches are preferable

Experimental details
- Searches carried out using
  - Similarity fusion and group fusion
  - Various percentages of the ranked database
  - 15 different fusion rules
- Results show conclusively that the best results (for both similarity fusion and group fusion) are obtained when:
  - Using just the top 1-5% of each ranked list in the fusion
  - Using the CombRKP fusion rule on the ranked lists
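The recommended combination, truncating each ranked list and then applying CombRKP, can be sketched as follows. One detail is an assumption that the slides do not state: a structure that falls outside a truncated list simply contributes nothing from that list. Function names and toy rankings are hypothetical.

```python
def truncate(ranking, fraction):
    """Keep the top `fraction` of a ranked list (at least one structure)."""
    keep = max(1, int(len(ranking) * fraction))
    return ranking[:keep]


def comb_rkp_truncated(rankings, fraction):
    """Fuse truncated rankings with CombRKP (sum of reciprocal rank positions).

    Assumption: structures missing from a truncated list add 0 for that list.
    """
    fused = {}
    for ranking in rankings:
        for position, structure in enumerate(truncate(ranking, fraction), start=1):
            fused[structure] = fused.get(structure, 0.0) + 1.0 / position
    # Larger reciprocal-rank sums indicate better structures.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three rankings of four structures, top 50% kept.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["c", "b", "a", "d"],
]
fused = comb_rkp_truncated(rankings, fraction=0.5)
```

In this toy run structure "b" comes first because it appears in all three truncated lists, while "d" never reaches the top half of any list and is dropped from the fused ranking.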

Use of CombRKP: I
- Virtual screening seeks to rank molecules in decreasing order of probability of activity
- MDDR searches (J. Med. Chem., 48, 2005, 7049) show a hyperbola-like plot

Use of CombRKP: II
- Fusion scores for CombRKP best approximate the probability of activity, and hence CombRKP is likely to perform well
- Results averaged over 200 MDDR searches

Conclusions
- Similarity-based virtual screening using fingerprints is well established
- Screening effectiveness can be enhanced by the use of data fusion:
  - Combining the rankings from different similarity measures
  - Combining the rankings from different reference structures
- A range of simple fusion rules is available for this purpose

Acknowledgments
- Organisations: Accelrys, Daylight Chemical Information Systems, Digital Chemistry, EPSRC, Government of Malaysia, Sunset Molecular, Royal Society, Tripos, Wolfson Foundation
- People: Claire Ginn, Jerome Hert, John Holliday, Evangelos Kanoulas, Nurul Malim, Christoph Mueller, Naomie Salim