Published by Gabriel Smith. Modified over 9 years ago.
1
Joint meeting of the Molecular Libraries Screening Centers Network (MLSCN) and the Exploratory Centers for Cheminformatics Research (ECCR): Talk I
July 17, 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org
http://www.chembiogrid.org
With apologies for my credentials: I have written a few papers on Biology, Chemistry and Crystallography while at Cambridge, Caltech and Syracuse, mostly on applications of parallel computing.
2
Start-up and Organization
- Local teams, successful prototypes and international collaboration set up in 3 major focus areas:
  - "Tool and Data" Cyberinfrastructure
  - "Archival Database and Simulation" Cyberinfrastructure
  - Education
- Wiki chosen to support the project as a shared editable web space
- Web site: http://www.chembiogrid.org
- Building a Collaboratory involving PubChem: a Global Information System accessible anywhere and at any time; enhance PubChem with distributed tools (clustering, simulation, annotation etc.) and data
- Initial results discussed at conferences, at workshops and in papers: Gordon Conferences, ACS, SDSC tutorial
- First new Cheminformatics courses offered
- Advisory board set up and met
- Videoconferencing-based meetings with Peter Murray-Rust and group at Cambridge roughly every 2-3 weeks
- Good interactions with NIH DTP, Lilly and Michigan ECCR
3
http://www.chembiogrid.org
4
CICC Senior Personnel
Geoffrey C. Fox, Mu-Hyun (Mookie) Baik, Dennis B. Gannon, Marlon Pierce, Beth A. Plale, Gary D. Wiggins, David J. Wild, Yuqing (Melanie) Wu, Peter T. Cherbas, Mehmet M. Dalkilic, Charles H. Davis, A. Keith Dunker, Kelsey M. Forsythe, Kevin E. Gilbert, John C. Huffman, Malika Mahoui, Daniel J. Mindiola, Santiago D. Schnell, William Scott, Craig A. Stewart, David R. Williams
From Biology, Chemistry, Computer Science and Informatics at IU Bloomington and IUPUI (Indianapolis)
5
CICC Advisory Board
- Alan D. Palkowitz (Eli Lilly)
- Andrew Martin (Kalypsys)
- David Spellmeyer (IBM)
- Dimitris K. Agrafiotis (Johnson & Johnson)
- Horst Hemmerle (Eli Lilly)
- James M. Caruthers (Purdue University)
- Jeremy G. Frey (University of Southampton)
- Joel Saltz (Ohio State University/University of Maryland/Johns Hopkins University)
- John M. Barnard (Digital Chemistry)
- John Reynders (Eli Lilly)
- Peter Murray-Rust (University of Cambridge)
- Peter Willett (University of Sheffield)
- Thompson Doman (Eli Lilly)
- Val Gillet (University of Sheffield)
Members come from industry and academia. The board met in October 2005 and will meet again this fall.
6
Publications
Baik says he is especially productive due to Cyberinfrastructure
7
Our Meetings are on the Web
8
Varuna environment for molecular modeling (Baik, IU)
[Diagram. Labels: Researcher; Chemical Concepts; Experiments; Papers etc.; Simulation Service (FORTRAN code, scripts); QM Database; QM/MM Database; Reaction DB; DB Service (queries, clustering, curation, etc.); PubChem, PDB, NCI, etc.; ChemBioGrid; Condor "Flocks"; TeraGrid supercomputers]
9
Cyberinfrastructure and Grids
- These support eScience or distributed Computers, Databases, Instruments, Sensors and People
- Grids use large scale managed Web services, the current major technology building on modern Industry enterprise and Internet systems
- W3C, OASIS and OGF (Open Grid Forum; Fox is VP for eScience) develop standards ensuring that distributed resources interoperate
- Cheminformatics benefits from 2 styles of Grids:
  - TeraGrid typifies Grid support of large scale computation of parallel simulations
  - Bioinformatics (BIRN, caBIG, MyGrid ...), Earth Science and Astronomy Grids illustrate integration of real-time and archival data(bases) and computation
- Well designed Grids run faster than older approaches
10
Cheminformatics Grids Need
- Broad System standards such as WSDL, SOAP, WSRM, JSDL, BPEL
- Domain specific data structures:
  - CML: Cheminformatics
  - GML: Earth Science
  - CellML, SBML: Biology
  - VOQL: Astronomy
- Use of specific Grid/Web service technologies such as:
  - Web services directly for tools
  - Web service proxies for large simulation codes; ANYTHING can be made a Web service efficiently if execution/network access time ≥ 20ms
  - Portals/Portlets for user interfaces
  - Workflow for composition
  - Access to data and compute resources
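As a rough illustration of the "Web service proxy" idea, the sketch below wraps an ordinary function behind an HTTP endpoint using only Python's standard `wsgiref` module. Everything specific here is an assumption made for illustration: the `molecular_weight` stand-in, the `/weight` path, the `formula` query parameter and the JSON reply. The services described in this talk are defined through WSDL/SOAP interfaces, which this plain HTTP sketch does not attempt to reproduce.

```python
import json
import re
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults

# Small illustrative mass table; a real service would use a full toolkit.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula):
    """Stand-in for existing code: weight of a simple formula such as 'C2H6'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * int(count or 1)
    return total


def app(environ, start_response):
    """WSGI wrapper exposing molecular_weight at /weight?formula=... (hypothetical)."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    formula = query.get("formula", [""])[0]
    if environ.get("PATH_INFO") != "/weight" or not formula:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown service or missing formula"]
    try:
        body = json.dumps({"formula": formula, "weight": molecular_weight(formula)})
    except KeyError:
        # Element not present in the illustrative mass table.
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"unsupported element"]
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body.encode("utf-8")]


def call_locally(path, query):
    """Invoke the WSGI app in-process, without opening a network socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    environ["QUERY_STRING"] = query
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

To serve the same `app` over a real socket, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` is enough for experimentation. The wrapper adds only request parsing and response encoding around the unchanged function, which is the sense in which the wrapped code itself needs no modification.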
11
TeraGrid: Integrating NSF Cyberinfrastructure
TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.
[Map. Site labels: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo, UNC-RENCI, Wisc]
12
Top500 Supercomputers in the world
Indiana University has the highest performance U.S. academic computer system: 20 Teraflops peak
13
Products and Demonstrations
www.chembiogrid.org
Note mixture of: In-house and Out of House; Commercial and Academic
14
CICC Prototype Web Services
Basic cheminformatics:
- Molecular weights
- Molecular formulae
- Tanimoto similarity
- 2D structure diagrams
- Molecular descriptors
- 3D structures
- InChI generation/search
- CMLRSS
Application based services:
- Compare (NIH)
- Toxicity predictions (ToxTree)
- Literature extraction (OSCAR3)
- Clustering (BCI Toolkit)
- Docking, filtering, ... (OpenEye)
- Varuna simulation
Key Ideas:
- Add value to PubChem with additional distributed services and databases
- Wrapping existing code in web services is not difficult
- Provide "core" (CDK) services and exemplars of typical tools
- Provide access to key databases via a web service interface
- Provide access to major Compute Grids
Next steps?
- Define WSDL interfaces to enable global production of compatible Web services; refine CML
- Ready to try "Prototype Production"
- Develop more training material
- Refine/go into production with key services, including tools, workflows and TeraGrid style simulations in capacity and capability modes
- In-house algorithm work for new services in clustering, diversity analysis, QSAR methodologies
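Tanimoto similarity, one of the basic services listed on this slide, is small enough to sketch directly. The version below is a minimal illustration, not the CDK or CICC implementation: it assumes each fingerprint is given as the set of its "on" bit positions, and it returns 0.0 for two empty fingerprints, which is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: |A ∩ B| / |A ∪ B| over sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


# Hypothetical fingerprints: 3 shared bits out of 6 distinct bits.
example = tanimoto({1, 4, 7, 9}, {1, 4, 8, 9, 12})  # 0.5
```

A service built on this would typically compute fingerprints with a toolkit and expose only the comparison, so callers never need the toolkit installed locally.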
15
Web Service Locations
- Indiana University: Clustering, VOTables, OSCAR3, Toxicity classification, Database services
- Penn State University: CDK based services (Fingerprints, Similarity calculations, 2D structure diagrams, Molecular descriptors)
- Cambridge University: InChI generation/search, CMLRSS, OpenBabel
- InfoChem: SPRESI database
- SDSC: Typical TeraGrid Site
- NIH: PubChem ....., Compare .....
16
Usage of Open Source Projects
A number of open source projects are used in our infrastructure:
- CDK provides the underlying cheminformatics toolkit
- R provides the back-end modeling capabilities
- OSCAR is used for literature mining
- ToxTree is used to provide toxicity classification
- Open data and standards as promoted by the Blue Obelisk project
17
Contributions to Open Source Projects
We also contribute functionality to these projects:
- Molecular descriptor development for the CDK
- Modifications of various CDK functionality to make it suitable for web service usage
- Infrastructure for accessing R from the CDK
- Packages to use the CDK from within R
- Quality control, testing and documentation
References:
- Steinbeck, C. et al.; Curr. Pharm. Des., 2006, 12(17), 2110-2120
- Guha, R.; CDK News, 2005, 2(1), 7-13
18
Workflows Using Chemical Literature
- OSCAR3 program: all of PubMed "just" takes about a day to run through OSCAR3 on 2048 node Big Red
- Workflow steps:
  1. Bulk download of PubMed abstracts
  2. Extract chemical structures (OSCAR3 Service)
  3. Find similar molecules in a searchable (structure/similarity) Grid database: Local DTP database, PubChem, PDBBind
  4. Find similar documents
- Example of extracted records:
  SMILES  NAME     PubMed ID
  CCC     propane  1425356
  CC      ethane   3546453
  ...
- Clustering of documents linked to clustering of chemicals
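The link between extracted structures and documents can be sketched as a small index over (SMILES, name, PubMed ID) records. This is only an illustration under stated assumptions: the first two rows come from the slide, the third row is a hypothetical placeholder added so the lookup has something to find, and compounds are matched by exact SMILES string rather than by the structure/similarity search the slide describes.

```python
from collections import defaultdict

RECORDS = [
    ("CCC", "propane", "1425356"),  # from the slide
    ("CC", "ethane", "3546453"),    # from the slide
    ("CC", "ethane", "0000001"),    # hypothetical extra row, for illustration only
]


def build_index(records):
    """Map each SMILES string to the set of PubMed IDs that mention it."""
    docs_by_smiles = defaultdict(set)
    for smiles, _name, pmid in records:
        docs_by_smiles[smiles].add(pmid)
    return docs_by_smiles


def documents_sharing_compound(records, pmid):
    """Return the other documents that mention any compound found in `pmid`."""
    index = build_index(records)
    related = set()
    for smiles, _name, doc in records:
        if doc == pmid:
            related |= index[smiles]
    related.discard(pmid)
    return related
```

Replacing the exact-match lookup with a similarity search over fingerprints would move this sketch closer to the "find similar molecules, then find similar documents" flow on the slide.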
19
Document-enhanced Cyberinfrastructure
[Diagram. Labels:
- Existing Document-based Research Tools (Existing User Interface): Google Scholar, Manuscript Central, Science.gov, Windows Live Academic Search, Citeseer, CMT Conference Management, etc.
- Web service Wrappers
- New Document-enhanced Research Tools including Web2.0, Mashups, Annotation (Integration/Enhancement User Interface)
- Community Tools: CiteULike, Connotea, Del.icio.us, Bibsonomy, Biolicious
- Generic Document Tools: MyResearch Database; Bibliographic Database; Export: RSS, Bibtex, Endnote etc.
- Traditional Cyberinfrastructure: PubChem, PubMed]
20
Products and Demonstrations II
21
Example HTS workflow: organization & flagging
(Slide footer: David Wild, Research Overview, July 2006, Indiana University)
1. A biological screen is selected.
2. The activity results for all the compounds are extracted from the database (currently the DTP Tumor Cell Line database).
3. The compounds are clustered on chemical structure similarity, to group similar compounds together.
4. The compounds, along with property and cluster information, are converted to VOTABLES format and displayed in VOPLOT.
5. OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs.
[Figure: Taverna Workflow]
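The clustering step of this workflow can be illustrated with a deliberately simple stand-in. The sketch below uses leader clustering with a Tanimoto threshold, which is not the BCI Toolkit method named elsewhere in this talk; the compound IDs, fingerprints (sets of "on" bit positions) and the 0.5 threshold are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient over two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_cluster(fingerprints, threshold=0.5):
    """Assign each compound to the first cluster leader it is similar enough to.

    Compounds are visited in dictionary order; a compound that matches no
    existing leader becomes the leader of a new cluster.
    """
    leaders = []      # list of (leader_id, leader_fingerprint)
    assignment = {}   # compound_id -> leader_id
    for cid, fp in fingerprints.items():
        for leader_id, leader_fp in leaders:
            if tanimoto(fp, leader_fp) >= threshold:
                assignment[cid] = leader_id
                break
        else:
            leaders.append((cid, fp))
            assignment[cid] = cid
    return assignment


# Hypothetical fingerprints for four compounds.
EXAMPLE_FPS = {
    "cmpd1": {1, 2, 3, 4},
    "cmpd2": {1, 2, 3, 5},
    "cmpd3": {10, 11, 12},
    "cmpd4": {10, 11, 12, 13},
}
clusters = leader_cluster(EXAMPLE_FPS, threshold=0.5)
```

Leader clustering depends on input order and gives coarser groups than hierarchical methods, but it needs only one pass, which is why it makes a convenient placeholder for a workflow step.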
22
[Screenshot. Labels: Load Workflow; Run Workflow; Current Process; Result Output; Result Output URL]
23
Lilly is very interested in our new educational programs
24
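Total Grad Enrollment: Chem-, Lab, Bio-, Health Informatics, Fall 2005
Red = Expected, Chem, Fall 2006 (Chem entries are written as two figures separated by "/"; the red highlighting is lost in this transcript)

MS       Chem   Lab   Bio   Health
IUB      3/3    0     38    0
IUPUI    6/3    15    34    36
TOTAL    9/6    15    72    36

PhD      Chem   Lab   Bio   Health
IUB      1/3    0     3     0
IUPUI    0/1    0     4     3
TOTAL    1/4    0     7     3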
25
Formal Cheminformatics Courses
- I571 Chemical Information Technology (3 cr.)
  - Distance Ed section had 10 students in Fall 2005, from California to Connecticut
- I572 Computational Chemistry and Molecular Modeling (3 cr.)
- I573 Programming Techniques for Chemical and Life Science Informatics (3 cr.)
- I553 Independent Study in Chemical Informatics (3 cr.)
- Above courses are required for the new Graduate Certificate Program in Chemical Informatics
- Also I533 (Cheminformatics seminar)
26
More detailed Slides not used
27
TeraGrid Hardware and Software
- TeraGrid is coordinated at the University of Chicago and includes 8 partner facilities: NCSA, SDSC, PSC, ORNL, IU, PU, TACC, UC/ANL
- TeraGrid hardware totals > 102 teraflops of computing power
- Comprehensive information is available from http://www.teragrid.org/userinfo/hardware/overview.php
- Systems are primarily Linux clusters
- Grid software and services (Globus, MyProxy, etc.) provide a uniform means for accessing TeraGrid resources:
  - Scheduling, running and monitoring jobs
  - Monitoring resources
  - Moving and managing remote files
- Common service APIs simplify the process for building remote tools
28
Prototype CICC Project: Controlling the TGF pathway
- Collaboration between Baik & Zhang at IU
- Questions:
  - What molecular feature controls inhibitor binding?
  - How do mutations impact binding?
- AutoGeFF: Web Service to generate custom force fields
[Diagram. Labels: PDB 1IAS; Inactive TGF; Active TGF; With inhibitor; VARUNA; Experiments in the Zhang Lab; PubChem; in-house Molecules in Varuna; Simulations; Conceptual Understanding of TGF Inhibition]
29
MLSCN Data: How services and workflows are used
- MLSCN submits HTS data to PubChem and/or sends it directly to a workflow for real-time feedback
- Data is stored in PubChem
- Workflows perform different kinds of analysis on the MLSCN data; the variety of workflows is limitless
- End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis
- PubChem interfaces to workflows via SOAP
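To make "interfaces to workflows via SOAP" concrete, the sketch below builds a minimal SOAP 1.1 request envelope with Python's standard `xml.etree.ElementTree`. Only the envelope namespace is standard; the service namespace `urn:example:hts-workflow`, the operation name `submitAssay` and the `assayId` parameter are hypothetical and do not describe the actual PubChem or CICC interfaces.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:hts-workflow"                # hypothetical service namespace


def build_request(operation, params):
    """Return a SOAP 1.1 envelope (as text) calling `operation` with `params`."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical call: ask a workflow to process one assay.
request_xml = build_request("submitAssay", {"assayId": 1234})
```

In practice the operation names and parameter types would come from the service's WSDL, and a SOAP client library would generate this envelope automatically; building it by hand only shows what travels over the wire.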