1 Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk II July.

1 Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk II July 18 2006 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 gcf@indiana.edu http://www.infomall.org http://www.chembiogrid.org

2 Chemical Informatics and Cyberinfrastructure Collaboratory Collaboration between School of Informatics (Cheminformatics, Bioinformatics, Computer Science), departments of Biology and Chemistry at Indiana University Bloomington and Indianapolis (IUPUI) Thrusts are Education, use of Cyberinfrastructure for Cheminformatics and Computational Chemistry and Tool research NSF has an Office of Cyberinfrastructure running (roughly) TeraGrid (100 TF distributed supercomputers) and eScience eScience describes “modern Science as a team sport” with distributed Computers, Databases, Instruments, Sensors and People (>100 such projects worldwide) eScience builds applications as Grids using large scale managed Web services

3 Training People for your Centers! Cheminformatics Education at IU Linked to bioinformatics in an Indiana University’s School of Informatics http://www.informatics.indiana.edu School of Informatics degree programs BS, MS, PhD Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses Bioinformatics MS and track on PhD Chemical Informatics MS and track on PhD Informatics Undergraduates can choose a chemistry cognate PhD in Informatics started in August 2005 and offers tracks in bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics; more to come!

4 Formal Cheminformatics Courses I571 Chemical Information Technology (3 cr.) Distance Ed section had 10 students in Fall 2005, from California to Connecticut I572 Computational Chemistry and Molecular Modeling (3 cr.) I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.) I553 Independent Study in Chemical Informatics (3 cr.) Above courses required for the new Graduate Certificate Program in Chemical Informatics I533 Seminar in Chemical Informatics Spring 2006 Topic: Molecular Informatics, the Data Grid, and an Introduction to eScience http://www.indiana.edu/~cheminfo/I533/533home.html I647 Seminar in Chemical Informatics Fall 2006 Topic: Bridging Bioinformatics and Chemical Informatics http://www.indiana.edu/~cheminfo/I647/647home.html

5 Related Courses L519 Bioinformatics: Theory and Application (3 cr.) (at IUPUI: CSCI 548) L529 Bioinformatics in Molecular Biology and Genetics: Practical Applications (4 cr.) (not offered at IUPUI) I619 Structural Bioinformatics (3 cr.) I617 Informatics in Life Sciences and Chemistry (3 cr.) (for non-majors) B649 Topics in Systems: Service Architectures and Science (3 cr.) I590 Topics in Informatics: Scientific Applications of XML (IUPUI)

6 Other Educational Activities Graduate Certificate Program in Chemical Informatics (4 courses by Distance Education) Required courses: I571, I572, I573, I553 Enrollees pay in-state graduate fees regardless of location Special section of I571 will be taught as CIC CourseShare offering with Michigan, Fall 2006 University of Michigan School of Pharmaceutical Engineering ChE531 Introduction to Chemoinformatics Experiments with teleconferencing as a distance education tool (Raindance, Macromedia Breeze) Mesa Analytics Cheminformatics Virtual Classroom http://www.chemvc.com:8020/ Workshop, July 30, 2006 at the Biennial Conference on Chemical Education

7 Some Grid Concepts I Services are “just” (distributed) programs sending and receiving messages with well defined syntax Interfaces (input-output) must be open; innards can be open source (allowing you to modify) or proprietary Services can be any language from Fortran, Shell scripts, C, C#, C++, Java, Python, Perl – your choice!! Web Services supported by all vendors (IBM, Microsoft …) Service overhead will be just a few milliseconds (more now) which is < typical network transit time Any program that is distributed can be a Web service Any program taking execution time ≥ 20ms can be an efficient Web service

8 Some Grid Concepts II Systems are built from contributions from many different groups – you do not need one “vendor” for all components as Web services allow interoperability between components One reason DoD likes Grids (called Net-Centric computing) Grids are distributed in services and data allowing anybody to store their data and to produce “their” view Some think that University Library of future will curate/store data of their faculty “2 level programming model”: Classic programming of services and services are composed using workflow consistent with industry standards (BPEL) Grid of Grids: (System of Systems) Realistically Grid-like systems will be built using multiple technologies and “standards” – use wrapping to integrate Pipeline Pilot, CBIS, Chembank, ChemBench, ChemModLab, PowerMV etc. with OGSA (Open Grid Service Architecture from OGF) systems into a single Grid

9 Grid Capabilities for Science (Cheminformatics) Open technologies for any large scale distributed system that is adopted by industry, many sciences and many countries (including UK, EU, USA, Asia) Security, Reliability, Management and state standards Many bioinformatics grids including BIRN, caBIG, MyGrid Also computational chemistry and related (materials) grids Service and messaging specifications User interfaces via portals and portlets virtualizing to desktops, email, PDA’s etc. ~20 TeraGrid Science Gateways including RENCI Bio portal OGCE Portal technology effort led by Indiana Uniform approach to access distributed (super)computers supporting single (large) jobs and spawning lots of related jobs Data and meta-data architecture supporting real-time and archives as well as federation Links to Semantic web and annotation Grid (Web service) workflow with standards and several successful instantiations (such as Taverna and MyLead)

10 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg Taverna Taverna is typical Grid workflow developed in UK for bioinformatics in MyGrid project Not maybe better than well known tools like Pipeline Pilot but links to “all Grid services” Taverna being robustified and extended by UK eScience program

11 Grid Workflow Datamining in Earth Science Work with Scripps Institute Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California Streaming Data Support Transformations Data Checking Hidden Markov Datamining (JPL) Display (GIS) NASA GPS Earthquake

12 Grid Workflow Data Assimilation in Earth Science Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts Use a Portlet-based user portal to access and control services and workflow

CICC Prototype Web Services Molecular weights Molecular formulae Tanimoto similarity 2D Structure diagrams Molecular descriptors 3D structures InChi generation/search CMLRSS Basic cheminformatics Application based services Compare (NIH) Toxicity predictions (ToxTree) Literature extraction (OSCAR3) Clustering (BCI Toolkit) Docking, filtering,... (OpenEye) Varuna simulation Define WSDL interfaces to enable global production of compatible Web services; refine CML Ready to try “Prototype Production” Develop more training material Refine/go into production with key services including both tools, workflows and TeraGrid style simulations in capacity and capability modes In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies Next steps? Key Ideas Add value to PubChem with additional distributed services and databases Wrapping existing code in web services is not difficult Provide “core” (CDK) services and exemplars of typical tools Provide access to key databases via a web service interface Provide access to major Compute Grids

Web Service Locations Indiana University Clustering VOTables OSCAR3 Toxicity classification Database services Penn State University CDK based services Fingerprints Similarity calculations 2D structure diagrams Molecular descriptors Cambridge University InChi generation / search CMLRSS OpenBabel InfoChem SPRESI database SDSC Typical TeraGrid Site NIH PubChem ….. Compare …..

Workflows Using Chemical Literature OSCAR3 program All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red SMILES NAME Pubmed ID CCC propane 1425356 CC ethane 3546453............................... Bulk download of Pubmed abstracts Extract chemical structures OSCAR3 Service Find similar molecules Searchable (structure/similarity) Grid database Local DTP database PubChem PDBBind Find similar document s Clustering of documents linked to clustering of chemicals

Large Scale Calculations on “All of PubChem/Med” TeraGrid: 100 Teraflop now to 1000 Teraflop next year IU 2048 node Big Red supercomputer: 20 Teraflop today The CDK can currently calculate approx. 107 Descriptors  Whole of PubChem (6M compounds) – 276 hours, 1 CPU  On IU's Big Red, 2048 CPU's, 20 TF: < 7 minutes  Even increasing the descriptor count by 5 times gives us < 35 minutes of compute time on Big Red OSCAR3 takes a few seconds per abstract to text-mine all compounds in it  All of PubMed would take < a day on Big Red  Cleanup and Iteration would take some time Can pre-calculate properties of smaller compounds using CDK (logP, BCUT, CPSA, …) and programs likes GAMESS  100,000 compounds take < a week each on a single CPU and would be a practical computation over next year

17 TeraGrid Supercomputers “Flocks” Prototype CICC Project: Controlling the TGF  pathway Collaboration between Baik & Zhang at IU PDB 1IAS Inactive TGF  VARUNA Experiments in the Zhang Lab Active TGF With inhibitor PubChem in-house Molecules in Varuna QM Database Conceptual Understanding of TGF  Inhibition Simulations AutoGeFF Questions: - What molecular feature controls inhibitor binding? - How do mutations impact binding? Web Service to generate custom force fields Can afford few ms overhead!

18 Simulating the Structure and Reactivity of Cu-A  Complex One of the speculation about the pathogenesis of Alzheimer’s Disease involves complexes of Cu-ions and  -Amyloid plaques. We will test the hypothesis that a Cu-A  complex can catalytically activate dioxygen to give hydrogen peroxide in a molecular modeling study:  Unfortunately, the structure of the Cu-A  complex is currently not known. (a) We will carry out a combined QM/QM-MM/MD study to propose a structure (b) We will evaluate the plausibility of the catalysis by constructing a reaction profile (c) We will use PubChem in combination with our in-house Quantum Chemical Database to identify small molecules and molecule classes that may inhibit the catalysis. This study serves as a Prototype application where  an unknown protein fragment structure is computed a priori and saved in an in- house database.  the diversity of small molecule targets are derived from clustering PubChem and other federated databases, including an in-house structural database

19 Other Chemical Projects Utilizing the Cyberinfrastructure Mechanistic studies on  how the anticancer drug cisplatin interacts with DNA to kill cancer cells.  how xanthine oxidase catalyzes the oxidation of xanthine.  how the bacterial enzyme methane monooxygenase catalyzes methane oxidation.  electrocyclization of small organic molecules that are relevant for natural product synthesis.  stereoselective carbocyclizations that are catalyzed by Rhodium complexes.  how molecular probes for in vivo detection of Zinc, Mercury and Lead can be designed in a rational fashion. Utilizing a molecular modeling database  By registering and saving the structures, charge distributions and molecular orbitals of computer simulations, we can conduct a new kind of similarity searches and recognize trends.  The molecular modeling database will allow for curating the structural information of other databases, such as PubChem, by providing more detailed simulated information.

20 MLSCN Post-HTS Biology Decision Support Percent Inhibition or IC 50 data is retrieved from HTS Question: Was this screen successful? Question: What should the active/inactive cutoffs be? Question: What can we learn about the target protein or cell line from this screen? Compounds submitted to PubChem Workflows encoding distribution analysis of screening results Grids can link data analysis ( e.g image processing developed in existing Grids), traditional Chem- informatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis A Grid of Grids linking collections of services at PubChem ECCR centers MLSCN centers Workflows encoding plate & control well statistics, distribution analysis, etc Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding, with activity, literature search of active compounds, etc CHEMINFORMATICSPROCESSGRIDS

21 MLSCN Data - How services and workflows are used MLSCN submits HTS data to Pubchem and/or sends directly to workflow for real-time feedback Data is stored in Pubchem Workflows perform different kinds of analysis on the MLSCN data, including SAR, clustering, literature searching, protein searching, toxicity testing, etc… End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis PubChem interfaces to workflows via SOAP

22 Example HTS workflow: finding cell-protein relationships A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex) SImilar structures to the ligand can be browsed using client portlets. Once docking is complete, the user visualizes the high- scoring docked structures in a portlet using the JMOL applet. Similar structures are filtered for drugability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein. The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand. Docking results and activity patterns fed into R services for building of activity models and correlations Least Squares Regression Random Forests Neural Nets

23 Next steps in workflows Expansion of HTS Workflows  Inclusion of ToxTree for toxicity flagging  Prediction of protein binding through PDB ligand similarity search  Inclusion of literature text mining (OSCAR)  Using PubChem data instead of tumor cell dataset More workflows  Incorporating VARUNA, PubChem, PDBBind and other services  Workflows from Cambridge collaboration Making workflows available in other systems  Taverna SCUFL BPEL conversion  Use of workflows in other execution environments (starting with myLEAD supporting triggering)

24 Methods Development at the CICC Tagging methods for web-based annotation exploiting del.icio.us and Connotea Development of QSAR model interpretability and applicability methods RNN-Profiles for exploration of chemical spaces VisualiSAR - SAR through visual analysis  See http://www.daylight.com/meetings/mug99/Wild/Mug99.htmlhttp://www.daylight.com/meetings/mug99/Wild/Mug99.html Visual Similarity Matrices for High Volume Datasets  See http://www.osl.iu.edu/~chemuell/new/bioinformatics.phphttp://www.osl.iu.edu/~chemuell/new/bioinformatics.php Fast, accurate clustering using parallel Divisive K-means Mapping of Natural Language queries to use cases and workflows Advanced data mining models for drug discovery information

25 MPI Parallel Divkmeans clustering of PubChem AVIDD Linux cluster, 5,273,852 structures (Pubchem compound collection, Nov 2005)

Exploring Chemical Spaces The problem  Thousands of compounds  10's to 100's of descriptors Requirements  In my chemistry space what are the outliers? Which compounds are in the dense regions of space?  I don't want to / can't do descriptor selection  I don't want to squash things into a lower dimensional space  I want a simple way to view all this Our approach (Guha, R. et al;, J. Chem. Inf. Model., 2006) : Use the R-NN profile technique

R-NN Profiles & Exploring Chemical Space 4337 molecules = 240, 5 descriptors 2 known outliers Molecules at the top are in sparse regions Molecules at the bottom are in dense regions Drill down into specific regions (GGgobi, VOPlot...), annotate with activity,... Simple & intuitive, can be very fast

R-NN Profiles & HPC R-NN profiles require a pairwise distance matrix Can be sped up with approximate NN methods R-NN profiles can be trivially parallelized  1000 x 100 data matrix => 1000 x 1000 distance matrix -> 2.1 sec (P4 1GhZ laptop)  Evaluating R-NN profiles for 1000 compounds -> 43 sec  The current parameters allow a 100x speedup if we use 100 CPU's

Measuring Model Applicability We have many ways to build multiple models We perform validation But can we use a stored model for a new molecule(s)?  Trivially, yes  But does it really make sense to do this? Depends on similarities to the training set Also depends on a global chemistry space We can provide a component that attempts to answer this question for arbitrary model types Guha, R. et al., J. Chem. Inf. Model., 2005, 45, 65-73

Measuring Model Applicability Our initial approach  Considers regression models  Considers similarity to the TSET We and P&G are working on more robust methods that try and take into account a global chemistry space Alternate methods can be easily included in workflows Stored OLS/CNN/SVM/... Model Auxillary classification model Choose cutoff Training set residuals New (unseen) molecule Predict property Obtain applicability Decide whether it makes sense to go with this prediction

31 More detailed Slides not used

32 Load Workflow Run Workflow Current Process Result Output Result Output URL

33 Preliminary Results Shown is a fully equilibrated structure (highest population in a 10 ns MD @ room temp) of the Cu-A  structure. The Cu-(peptide) bonds require special attention, as standard force-fields do not allow simulations of this type. We use a new tool AutoGeFF (to be implented as a Webservice) that recognizes bonds for which no force field parameters exist. AutoGeFF can generate appropriate force fields by automatically carrying out a QM calculation on a small model system and fitting new forces to computed vibrational frequencies. We use a guided Monte Carlo approach to iteratively derive these forces. Currently, AutoGeFF recognizes most of the transition metals. Future work will include organic moieties (such as drug candidates).

34 Example HTS workflow: organization & flagging A biological screen is selected. The activity results for all the compounds is extracted from the database (currently using DTP Tumor Cell Line database) The compounds are clustered on chemical structure similarity, to group similar compounds together The compounds along with property and cluster information are converted to VOTABLES format and displayed in VOPLOT OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs

35 Example of workflow output - LogP vs GI 50 Plotting XLogP against GI 50 can help identify highly active compounds with good logP profiles (1 - 4 range)

36 Example of workflow output - Cluster # vs GI 50 Plotting Cluster against GI 50 can help identify groups of highly active, structurally similar compounds, and also clusters which might yield good QSAR information

37 Example workflow output - docked complexes NSC_ID 685478 Docking score -29.74 NSC_ID 685477 Docking score -35.51 NSC_ID 719175 Docking score -30.78 NSC_ID 725806 Docking score -32.15 Example output of most similar compounds to PDB 1Y4 complex ligands docked into the target protein using OpenEye FRED

1 Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk II July.

Similar presentations

Presentation on theme: "1 Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk II July."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk II July.

Similar presentations

Presentation on theme: "1 Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk II July."— Presentation transcript:

Similar presentations

About project

Feedback