Indiana University School of David Wild – I533 2006. Page 1 David Wild Chemical Informatics.

Presentation transcript:

Page 1: David Wild. Chemical Informatics tools, services and workflows

Page 2: Outline
Chemical Informatics software packages available at IU
Open source software
The need for integration & innovation
Pipelines, workflows and web services

Page 3: Software at IUB Informatics
Spotfire DecisionSite
ChemTK
ArgusLab
BCI software – cluster analysis, fingerprints, Markush
OpenEye software – 3D conformer, docking
ChemAxon
gNova CHORD
Chemoinformatics programming toolkits
  – Daylight, BCI, OpenEye

Page 4: Open Source / Free Software
Blue Obelisk - http://wiki.cubic.uni-koeln.de/dokuwiki/doku.php
InChI
JMOL
FROWNS
OpenBabel
CML
CDK
MMTK

Page 5: The need for integration
Research computing is currently very fragmented
Existing approaches do not scale up to the amount of data now common
Many chemical informatics tools are obscure, difficult to use and access
Scientists’ questions are not that complex, but finding the answers is currently very time consuming and/or complex (for a human)
  – “has anybody patented this chemical structure I just made?”
  – “can I get hold of a compound that might bind to the active site of this protein I just resolved?”
  – “which compounds in this series are least likely to exhibit toxic effects?”
Answers are often “stale” after a short period of time – questions need to be re-answered as new information is generated
Almost all available systems are passive, and follow the (web) browsing model
There tends to be one interface for every data source (or encompassing just a few)

Page 6
SCIENTIST: “These compounds look promising from their HTS results. Should I commit some chemistry resources to following them up?”
Oracle Database (HTS): Compounds were tested against related assays and showed activity, including selectivity within target families
Oracle Database (Genomics): None of these compounds have been tested in a microarray assay
Computation: The information in the structures and known activity data is good enough to create a QSAR model with a confidence of 75%
External Database (Patent): Some structures with a similarity > 0.75 to these appear to be covered by a patent held by a competitor
Computation: All the compounds pass the Lipinski Rule of Five and toxicity filters
Excel Spreadsheet (Toxicity): One of the compounds was previously tested for toxicology and was found to have no liver toxicity
Word Document (Chemistry): Several of the compounds had been followed up in a previous project, and solubility problems prevented further development
Journal Article: A recent journal article reported the effectiveness of some compounds in a related series against a target in the same family
Word Document (Marketing): A report by a team in Marketing casts doubt on whether the market for this target is big enough to make development cost-effective
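
The "Rule of Five" filter named on this slide can be sketched in a few lines. This is a minimal illustration that takes precomputed descriptors as input rather than calculating them from a structure. The names `Descriptors`, `rule_of_five_violations` and `passes_rule_of_five` are hypothetical, and the thresholds are the commonly cited ones (molecular weight ≤ 500, logP ≤ 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). Implementations differ on whether one violation is tolerated, so that is left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one compound (hypothetical container)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def rule_of_five_violations(d):
    """Count how many of the four commonly cited thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d, max_violations=0):
    """True if the compound exceeds no more than max_violations thresholds."""
    return rule_of_five_violations(d) <= max_violations
```

A compound with illustrative values such as `Descriptors(180.2, 1.2, 1, 4)` passes with no violations; passing `max_violations=1` gives the more lenient variant.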

Page 7: Pipelining and workflow tools
These tools permit applications to be “piped” together or connected in “workflows” where the output of one program can be given as input to another program (or script)
Graphical front ends are replacing scripting (e.g. Perl, Python, etc.)
Available graphical tools
  – Scitegic Pipeline Pilot
  – Inforsense KDE
  – Taverna
  – IO-Informatics Sentient
These tools find their real power in a web services environment
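
The "piping" idea, where each program's output becomes the next program's input, can be shown with a small scripted sketch. It is not how any of the graphical tools above work internally; `pipeline` and the stage functions are hypothetical names, and the input format (one SMILES string followed by a name per line) is assumed for illustration.

```python
from functools import reduce


def pipeline(*stages):
    """Return a function that feeds data through the stages from left to right."""
    def run(data):
        return reduce(lambda value, stage: stage(value), stages, data)
    return run


def parse_records(text):
    """Keep the first whitespace-separated field (the SMILES) of each non-empty line."""
    return [line.split()[0] for line in text.splitlines() if line.strip()]


def drop_duplicates(smiles):
    """Remove repeated SMILES strings while keeping first-seen order."""
    return list(dict.fromkeys(smiles))


# Three stages chained together; the built-in sorted() serves as the last stage.
workflow = pipeline(parse_records, drop_duplicates, sorted)
```

Calling `workflow(text)` on a block of SMILES records returns the distinct SMILES in sorted order. Replacing a stage with a call to a remote service leaves the rest of the chain unchanged, which is the property the slide points to.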


Page 11: Web Services
Semantic Web – “Next Big Thing”
  – Encode semantics in web pages (XML)
  – Describes services as well as information (SOAP, WSDL, UDDI)
  – Computation detached from interface
  – Note seeping through to general web usage
eScience (UK)
  – £200m over period
Cyber Infrastructure / Grid (US)
  – Semantic Web Health Care & Life Sciences Research Group

Page 12: CICC-related projects
Formal CICC projects
1. Innovative cross-screen analysis of NIH DTP Human Tumor Cell Line Data – innovative scientific analysis of NIH HTS data
2. Development of cheminformatics web services and use cases in Taverna – web service & workflow infrastructure
3. Development of a novel interface for the analysis of PubChem HTS data – tools for interacting with lots of complex data
4. A structure storage and searching system for Distributed Drug Discovery – innovative kinds of chemical databases
Other, related projects
  – Fast clustering of very large datasets using Linux clusters
  – Smart client for mining drug discovery data (Microsoft supported)

Page 13
PROJECT 4: Experimental Databases
PROJECT 2: Web services & workflows
PROJECT 1: Innovative cross-screen analysis of HTS data
PROJECT 3: Visualization, navigation & analysis tools for HTS data
SMART CLIENT: Smart interfaces (incl. NLP, RSS, agents, etc)
SMART CLIENT: General drug discovery web services & workflows
FAST PARALLEL CLUSTERING: Using DivKmeans & AVIDD

Page 14: Desired outcomes by Summer 2006
A chemical informatics web service infrastructure running at IU
Several Taverna workflows that use these and other web services, and which demonstrate that the infrastructure can be used to perform complex, relevant operations on PubChem data
Demonstrated scientific results with the NIH DTP data
An established Distributed Drug Discovery database linked with PubChem, that shows that our techniques together with PubChem can be employed in ways which benefit humanity in general
A sandbox PubChem copy with improved functionality and architecture
One or more novel visualization tools for PubChem data
Demonstrate the feasibility of fast, accurate clustering of very large datasets (including the whole of PubChem) using the AVIDD Linux Cluster and a parallelized clustering algorithm (DivKmeans)
Show that .NET and Java-based web services can work well together in a common infrastructure
Demonstrate the feasibility of a natural language or other straightforward interface for scientists to express their information needs

Page 15
Cluster the compounds in the NIH DTP database by chemical structure, then choose representative compounds from the clusters and dock them into PDB protein files of interest
Services in the workflow
  – NIH Database Service: PostgreSQL CHORD
  – Fingerprint Generator: BCI Makebits
  – Cluster Analysis: BCI Divkmeans
  – Table Management: VoTables
  – Plot Visualizer: VoPlot
  – Docking Selector: Script
  – 2D-3D: OpenEye OMEGA
  – Docking: OpenEye FRED
  – 3D Visualizer: JMOL
  – PDB Database Service
Data passed between services
  – SMILES + ID
  – Fingerprints
  – SMILES + ID + Data
  – Cluster Membership
  – SMILES + ID + Cluster # + Data
  – SMILES + ID
  – MOL File
  – PDB Structure + Box
  – Docked Complex
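
The slide does not say how the "Docking Selector" script chooses representative compounds from each cluster. One common choice, assumed here purely for illustration, is the medoid: the member with the highest mean similarity to the rest of its cluster. In the sketch below, fingerprints are modelled as Python sets of "on" bit positions, and `tanimoto`, `medoid` and `pick_representatives` are hypothetical names.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def medoid(ids, fps):
    """Return the id whose mean similarity to the other members is highest."""
    def mean_similarity(i):
        others = [j for j in ids if j != i]
        if not others:
            return 1.0  # a singleton represents itself
        return sum(tanimoto(fps[i], fps[j]) for j in others) / len(others)
    return max(ids, key=mean_similarity)


def pick_representatives(membership, fps):
    """membership maps compound id -> cluster number; returns cluster -> representative id."""
    clusters = {}
    for compound_id, cluster in membership.items():
        clusters.setdefault(cluster, []).append(compound_id)
    return {cluster: medoid(ids, fps) for cluster, ids in clusters.items()}
```

The representatives returned would then be passed on to the 2D-3D conversion and docking steps in place of the full cluster.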

Page 16
“However large an array of facts, however rapidly they accumulate, it is possible to keep them in order and to extract from time to time digests containing the most generally significant information, while indicating how to find those items of specialized interest. To do so, however, requires the will and the means”
“[we need to] get the best information in the minimum quantity in the shortest time, from the people who are producing the information to the people who want it, whether they know they want it or not”
J.D. Bernal, quoted in Murray-Rust et al., Org. Biomol. Chem., 2004, 2,

Page 17: “Smart Client” for drug discovery
An open-source prototype that implements a new model of data mining that would, on request, “push” relevant information to pharmaceutical scientists in response to previously-defined straightforward expressions of needs, rather than relying on them stumbling upon the right information using traditional “browsing” models.
… using workflows and web services


Page 20
Request from Human Interface
AGENT / SMART CLIENT
  – Parse request
  – Select appropriate use cases and/or web service(s)
  – Schedule as necessary
USE-CASE SCRIPT
  – Invoke New Structure Service
  – Convert structures to 3D
  – Dock results & protein file
  – Extract any hits
  – Return links for visualization
New Structure Service
  – Search online databases for recent structures
  – Search local databases for recent structures
  – Merge Results
Services shown: Online database (e.g. PubChem), Local database, 3D Docking Tool, 2D-3D converter, 3D visualizer
Other labels: UDDI, WSDL, SOAP, atomic services, aggregate services
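
The relationship between the agent, a use-case script and the underlying services can be sketched as a registry of callables plus an ordered list of step names. Everything below is a stand-in: the service functions return canned data, the docking "score" is just the length of the SMILES string, and the names `SERVICES`, `USE_CASES` and `run_use_case` are hypothetical. In the architecture on the slide each step would be a web service call.

```python
def new_structure_service(state):
    """Stand-in for searching online and local databases and merging the results."""
    online = ["CCO", "CCN"]    # placeholder for an online database search
    local = ["CCN", "CCCC"]    # placeholder for a local database search
    merged = list(dict.fromkeys(online + local))
    return {**state, "structures": merged}


def convert_to_3d(state):
    """Stand-in for a 2D-3D converter; simply tags each structure."""
    return {**state, "conformers": {s: f"3D({s})" for s in state["structures"]}}


def dock(state):
    """Stand-in for docking; the score is a meaningless placeholder."""
    return {**state, "scores": {s: len(s) for s in state["conformers"]}}


def extract_hits(state):
    """Keep structures whose placeholder score reaches the requested cutoff."""
    hits = [s for s, score in state["scores"].items() if score >= state["min_score"]]
    return {**state, "hits": hits}


SERVICES = {
    "new_structures": new_structure_service,
    "to_3d": convert_to_3d,
    "dock": dock,
    "extract_hits": extract_hits,
}

# A use-case script is an ordered list of service names.
USE_CASES = {"ligands_for_target": ["new_structures", "to_3d", "dock", "extract_hits"]}


def run_use_case(name, request):
    """Agent step: look up the script for a use case and invoke each service in turn."""
    state = dict(request)
    for step in USE_CASES[name]:
        state = SERVICES[step](state)
    return state
```

Keeping the script as data rather than code is one way to let an agent select and schedule use cases without knowing what each service does.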

Page 21: Prototype development plan
Develop a handful of use-cases based around industry/academia scientists
Build 5-6 data / computation sources (e.g. enumeration, property calculation, structure database) that can fulfill the use cases
Build WSDL and SOAP web services around the data sources that can be accessed from Taverna
Develop workflows in Taverna (see taverna.sourceforge.net)
Publish web services in UDDI
Encode use-cases into scripts
Build Intelligent Agent / Smart Client node that can match user needs with scripts & web services using workflows
Develop browser interface through Contextual Inquiry/Usability Studies
Consider mapping to a Natural Language Interface

Page 22: Use Case #1
Are there any good ligands for my target?
A chemist is working on a project involving a particular protein target, and wants to know:
  – Any newly published compounds which might fit the protein receptor site
  – Any published 3D structures of the protein or of protein-ligand complexes
  – Any interactions of compounds with other proteins
  – Any information published on the protein target

Page 23: Use Case #1
Are there any good ligands for my target?
A chemist is working on a project involving a particular protein target, and wants to know:
  – Any newly published compounds which might fit the protein receptor site: gNova / PostgreSQL, PubChem search, FRED Docking
  – Any published 3D structures of the protein or of protein-ligand complexes: PDB search
  – Any interactions of compounds with other proteins: gNova / PostgreSQL, PubChem search
  – Any information published on the protein target: Journal text search

Page 24: Use Case #2
Who else is working on these structures?
A chemist is working on a chemical series for a particular project and wants to know:
  – If anyone publishes anything using the same or related compounds
  – Any new compounds added to the corporate collection which are similar or related
  – If any patents are submitted that might overlap the compounds he is working on
  – Any pharmacological or toxicological results for those or related compounds
  – The results for any other projects for which those compounds were screened

Page 25: Use Case #2
Who else is working on these structures?
A chemist is working on a chemical series for a particular project and wants to know:
  – If anyone publishes anything using the same or related compounds: ~ PubChem search
  – Any new compounds added to the corporate collection which are similar or related: gNova CHORD / PostgreSQL
  – If any patents are submitted that might overlap the compounds he is working on: ~ BCI Markush handling software
  – Any pharmacological or toxicological results for those or related compounds: gNova CHORD / PostgreSQL, MiToolkit
  – The results for any other projects for which those compounds were screened: gNova CHORD / PostgreSQL, PubChem search

Page 26: Priorities for web service development
Search of PubChem
  – Wrap around HTTP or SOAP request
Search of local gNova / PostgreSQL database
  – Wrap around application
Molecular docking with OpenEye FRED
  – Wrap around application
Property calculation with Molinspiration MiTools
  – Wrap around application
PDB Search
  – Already implemented as EMBL web service
BCI Markush search
  – Wrap around application
Fast clustering of large datasets
  – Wrap around grid-based application
Visualizations of datasets
  – Client and service development – VisualiSAR, Spotfire

Page 27: Use Case - CICC
Which of these hits should I follow up?
An MLI HTS experiment has produced 10,000 possible hits out of a screening set of 2m compounds. A chemist at another laboratory wants to know if there are any interesting active series she might want to pursue, based on:
  – Structure-activity relationships
  – Chemical and pharmacokinetic properties
  – Compound history
  – Patentability
  – Toxicity
  – Synthetic feasibility

Page 28: Use Case – ECCR
Which of these hits should I follow up?
An HTS experiment has produced 10,000 possible hits out of a screening set of 2m compounds. A chemist on the project wants to know what the most promising series of compounds for follow-up are, based on:
  – Series selection: cluster analysis
  – Structure-activity relationships: modal fingerprints/stigmata
  – Chemical and pharmacokinetic properties: MiTools, ChemAxon
  – Compound history: gNova / PostgreSQL
  – Patentability: BCI Markush handling software
  – Toxicity
  – Synthetic feasibility
  – + requires visualization tools!

Page 29: Technology
Perl SOAP::Lite
  – Will be used for initial web service development
  – Doesn’t really implement WSDL & UDDI
Apache Axis & Tomcat
  – Deploy WSDL for web services
BPEL4WS – Business Process Execution Language for Web Services
  – For aggregation of web services
Microsoft .NET & C#
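
As a rough illustration of what a SOAP request to one of these services might look like on the wire, the sketch below builds a SOAP 1.1 envelope using only the Python standard library. Python is used purely for illustration; the slide names Perl SOAP::Lite for the actual development. The operation name `searchBySmiles`, the namespace `urn:example:cheminfo` and the parameter names are invented placeholders, not part of any real service.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"


def build_soap_request(operation, service_ns, params):
    """Return a SOAP 1.1 envelope (as text) calling one operation with simple parameters."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element lives in the (hypothetical) service namespace.
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")
```

The resulting text would be sent in an HTTP POST to the service endpoint. A WSDL file describes which operations and parameters the endpoint accepts, so toolkits can usually generate this envelope automatically.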

Page 30: Current activities
Core activities
  – Development of use-cases
  – Development of initial web services (Perl SOAP::Lite)
  – Use of Taverna to prototype use-case scripts
Basic research on future components
  – Organizing large amounts of chemical information for human consumption
      Development of very fast parallel clustering techniques – to be exposed as web services
  – Selection of interface-level tools for basic interaction
      Chemical structure drawing, display
      Investigation of NLP, RSS, and browser interfaces
  – Interface-level tools for visualization, navigation and analysis
      Cluster and dataset visualization, natural language interfaces

Page 31: Cluster Analysis and Chemical Informatics
Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
Organizational usage has not been as well studied as the other two, but see
  – Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Ward’s Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40,
Essentially helping large datasets become manageable
Methods used:
  – Jarvis-Patrick and variants: O(N²), single partition
  – Ward’s method: hierarchical, regarded as best, but at least O(N²)
  – K-means: < O(N²), requires a set number of clusters, a little “messy”
  – Sphere-exclusion (Butina): fast, simple, similar to JP
  – Kohonen network: clusters arranged in 2D grid, ideal for visualization
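
To make one of the listed methods concrete, here is a minimal sketch of sphere-exclusion clustering in the style usually attributed to Butina. It assumes fingerprints are represented as sets of "on" bit positions and uses the Tanimoto coefficient with a similarity threshold. The function names and the default threshold of 0.75 are illustrative choices. This simple version computes all pairwise similarities, so it is only suitable for small sets.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(fps, threshold=0.75):
    """Cluster fingerprints; returns lists of indices, each starting with its centroid."""
    n = len(fps)
    # Neighbour list: every other compound at or above the similarity threshold.
    neighbours = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    ]
    # Visit compounds with the most neighbours first (ties keep input order).
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        # The centroid claims all of its neighbours not already taken.
        members = [i] + sorted(neighbours[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters
```

Compounds with no neighbours above the threshold end up as singleton clusters, and the result is a single partition rather than a hierarchy.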

Page 32: Limitations of Ward’s method for large datasets (>1m)
Best algorithms have O(N²) time requirement (RNN)
Requires random access to fingerprints
  – hence substantial memory requirements (O(N))
Problem of selection of best partition
  – can select desired number of clusters
Easily hit 4GB memory addressing limit on 32 bit machines
  – Approximately 2m compounds
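
The last two figures on this slide imply a rough per-compound memory budget. The short calculation below is only arithmetic on those figures (a 32-bit address space of 2^32 bytes and roughly 2 million compounds). It is not a measurement of any particular implementation, and the function names are illustrative.

```python
# A 32-bit machine can address 2**32 bytes (4 GB).
ADDRESS_LIMIT_BYTES = 2 ** 32


def implied_bytes_per_compound(n_compounds):
    """Memory available per compound if n_compounds exactly fill the address space."""
    return ADDRESS_LIMIT_BYTES / n_compounds


def max_compounds(bytes_per_compound):
    """Largest dataset that fits if each compound needs bytes_per_compound bytes."""
    return ADDRESS_LIMIT_BYTES // bytes_per_compound
```

With 2,000,000 compounds the implied budget is a little over 2 KB per compound, which has to cover the fingerprint and any per-compound bookkeeping.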

Page 33: Scaling up clustering methods
Parallelisation
  – Clustering algorithms can be adapted for multiple processors
  – Some algorithms more appropriate than others for particular architectures
  – Ward’s has been parallelized for shared memory machines, but overhead considerable
New methods and algorithms
  – Divisive (“bisecting”) K-means method
  – Hierarchical Divisive
  – Approx. O(N log N)

Page 34: Divisive K-means Clustering
New hierarchical divisive method
  – Hierarchy built from top down, instead of bottom up
  – Divide complete dataset into two clusters
  – Continue dividing until all items are singletons
  – Each binary division done using K-means method
  – Originally proposed for document clustering
“Bisecting K-means”
  – Steinbach, Karypis and Kumar (Univ. Minnesota) users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
  – Found to be more effective than agglomerative methods
  – Forms more uniformly-sized clusters at given level
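
The top-down procedure described above can be sketched in pure Python. This is a simplified illustration, not the BCI implementation: it works on numeric points with Euclidean distance rather than fingerprints, always splits the largest remaining cluster, stops at a requested number of clusters instead of continuing to singletons, and seeds each split with the first point and the point farthest from it. All of those choices are assumptions made to keep the example short and deterministic.

```python
def _dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def two_means(points, max_iter=20):
    """Split points into two groups with K-means (k=2); right is empty if no split exists."""
    c0 = points[0]
    c1 = max(points, key=lambda p: _dist2(p, c0))  # farthest point from the first seed
    left, right = list(points), []
    for _ in range(max_iter):
        left = [p for p in points if _dist2(p, c0) <= _dist2(p, c1)]
        right = [p for p in points if _dist2(p, c0) > _dist2(p, c1)]
        if not left or not right:
            break
        new0, new1 = _mean(left), _mean(right)
        if (new0, new1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new0, new1
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until there are k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        target = max(candidates, key=len)
        left, right = two_means(target)
        if not left or not right:
            break  # largest cluster cannot be split (e.g. identical points)
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters
```

Recording each split as a parent-child relation, and continuing until every cluster is a singleton, would give the full hierarchy the slide describes. Swapping the "largest cluster" rule for variance or diameter changes which partitions are read off the hierarchy.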

Page 35: BCI Divkmeans
Several options for detailed operation
  – Selection of next cluster for division
      – size, variance, diameter
      – affects selection of partitions from hierarchy, not shape of hierarchy
Options within each K-means division step
  – distance measure
  – choice of seeds
  – batch-mode or continuous update of centroids
  – termination criterion
Have developed parallel version for Linux clusters / grids in conjunction with BCI
For more information, see Barnard and Engels talks at:

Page 36: Comparative execution times (NCI subsets, 2.2 GHz Intel Celeron processor)
[Chart not reproduced; values shown: 7h 27m, 3h 06m, 2h 25m, 44m]

Page 37: Clustering a 1 million compound dataset on a 2.2 GHz Celeron Desktop Machine
Method | Time* | Memory Usage
K-means (10,000 clusters) | 3½ days | 95 MB
Divisive K-means | 7 days | 65 MB
Divisive K-means (Parallel, 4 machines incl. 1.7 GHz Pentium M) | 16½ hours | ~50 MB
* Time for a single run may vary due to different selection of seeds. Runtimes can be shortened e.g. by using a max. number of iterations or a % relocation cutoff.
Results from AVIDD clusters & Teragrid coming soon…
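
Taking the table at face value, the parallel run can be compared with the single-machine Divisive K-means run using the usual definitions of speedup and parallel efficiency. The sketch below is plain arithmetic on the two reported times; the function names are illustrative. As the footnote says, single-run times vary with seed selection, and the four machines were not identical, so the result is indicative only.

```python
def speedup(serial_hours, parallel_hours):
    """How many times faster the parallel run was than the serial run."""
    return serial_hours / parallel_hours


def efficiency(speedup_value, n_machines):
    """Speedup per machine; 1.0 would be ideal linear scaling."""
    return speedup_value / n_machines


serial_hours = 7 * 24     # Divisive K-means on one machine: 7 days
parallel_hours = 16.5     # Divisive K-means on 4 machines: 16½ hours
observed = speedup(serial_hours, parallel_hours)
per_machine = efficiency(observed, 4)
```

The observed speedup is roughly tenfold on four machines, which is more than linear scaling would give. That points to run-to-run variation or differences in how the two runs were set up, rather than to parallelism alone.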

Page 38: Divisive K-means: Conclusions
Much faster than Ward’s, speed comparable to K-means, suitable for very large datasets (millions)
  – Time requirements approximately O(N log N)
  – Current implementation can cluster 1m compounds in under a week on a low-power desktop PC
  – Cluster 1m compounds in a few hours with a 4-node parallel Linux cluster
Better balance of cluster sizes than Ward’s or K-means
Visual inspection of clusters suggests better assembly of compound series than other methods
Better clustering of actives together than previously-studied methods
Memory requirements minimal
Experiments using AVIDD cluster and Teragrid forthcoming (50+ nodes)

Page 39: Visualization & interface level tools
No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists’ interaction with the system
Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics [collaboration with HCI?]
Possibility of multiple interfaces for different people groups (Cooper’s “primary personas”)
Don’t assume the browser interface – NLP?
Start with the basics
  – 2D chemical structure drawing (input)
  – Visualization of large numbers of chemical structures in 2D
  – 3D chemical structure visualization
Planning on evaluation of NLP, RSS, etc. as well as browser-based interfaces

Page 40: Usability of 2D structure drawing tools
Key difference between “sequential” and “random” drawers
Huge difference in intuitiveness
Key factor: how badly you can mess things up
Marvin Sketch ≈ JME > ChemDraw >> ISIS Draw

Page 41: Visualization methods for datasets & clusters
Partitions
  – Spreadsheets
  – Enhanced Spreadsheets
  – 2D or 3D plots
Hierarchies
  – Dendrograms
  – Tree Maps
  – Hyperbolic Maps


Page 44: VisualiSAR – with a nod to Edward Tufte. See

Page 45: Tree Maps – very Tufte-esque

Page 46: 3D Visualization - JMOL
Open Source, very flexible, works in a web service environment: jmol.sourceforge.net

Page 47: Conclusions so far
Effective exploitation of large volumes and diverse sources of chemical information is a critical problem to solve, with a potentially huge impact on the drug discovery process
Most information needs of chemists and drug discovery scientists are conceptually straightforward, but complex (for them) to implement
All of the technology is now in place to implement many of these information need “use-cases”: the four level model using service-oriented architectures together with smart clients looks like a neat way of doing this
The aggregation and interface levels offer the most challenges
In conjunction with grid computing, rapid and effective organization and visualization of large chemical datasets is feasible in a web service environment
Some pieces are missing:
  – Chemical structure search of journals (wait for InChI)
  – Automated patent searching
  – Effective dataset organization
  – Effective interfaces, especially visualization of large numbers of 2D structures (we’re working on it!)