1 Grids/CI for Scholarly Research and application to Chemical Informatics HPC 2006 in Cetraro – Italy July 4 2006 Geoffrey Fox Computer Science, Informatics,

Presentation transcript:

1 Grids/CI for Scholarly Research and application to Chemical Informatics
HPC 2006 in Cetraro – Italy, July 4 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN

2 Motivation
Build Cyberinfrastructure (Grids) that:
- Supports science from beginning (planning, instruments) through middle (analysis) and end (refereed publications, follow-on work)
- Integrates with the popular Web 2.0 (community) tools whose successes point to interesting ways of working together
- Integrates with Digital Library technology
- Does not redo previous work but rather augments it
- Assumes a heterogeneous fragmented world with multiple platforms
- Allows one to specify and manage all the services and data that a project needs with a mix of synchronous, asynchronous, close (classic workflow) and loose (including zero) coupling

3 Application Drivers
- Chemical Informatics, as this has very precise naming rules for compounds that allow accurate searches in documents
  - Suggesting how to tag scientific documents, either when writing them or after the fact
- "Global Information Grid" (Military Net-Centric systems), as these inevitably need Grid of Grids to support "systems of systems"
- Journal web site of the future, as illustrated by Nature building the social bookmarking tool Connotea
- Conference support tools, as these can benefit from features needed by journals

4 The Science Drivers
From Workshop on Challenges of Scientific Workflows:
- Workflow is underlying support for current science model
- Distributed interdisciplinary data deluged scientific methodology as an end (instrument, conjecture) to end (paper, Nobel prize) process is a transformative approach
- Reproducibility is core to scientific method and requires rich provenance, interoperable persistent repositories with linkage of open data and publication as well as distributed simulations, data analysis and new algorithms
- Distributed Science Methodology publishes all steps in a new electronic logbook capturing scientific process (data analysis) as a rich cloud of resources including emails, PPT, Wikis as well as databases, compiler options, build time/runtime configuration…

5 Community (? VO) Tools
- E-mail and list-serves are oldest and best used
- Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P Collaboration – text, audio-video conferencing, files
- del.icio.us, Connotea, Citeulike, Bibsonomy, Biolicious manage shared bookmarks (later)
- MySpace, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn create networks
- Writely, Wikis and Blogs are powerful specialized shared document systems
- ConferenceXP and WebEx share general applications
- Google Scholar (Citeseer) tells you who has cited your papers while publisher sites tell you about co-authors
  - Windows Live Academic Search has similar goals (later)
- Note sharing resources creates (implicit) communities
  - Social network tools study graphs to both define communities and extract their properties

6 How to use Web2.0 Community tools in CI
- Nearly all of them have "profiles", "users", "groups", "friends" etc.; need to integrate these
- P2P File Sharing: maybe this is useful for sharing files in research groups (virtual organizations)
  - Will modify Maze – popular Chinese social P2P system with 2.5 million users (http://maze.pku.edu.cn)
- BitTorrent: more popular than FTP – why not use for higher performance fault tolerant cached file sharing?
- MySpace etc.: could consider MyGridSpace or MyScienceSpace that supports a similar document sharing model with users uploading pictures, papers and even data/services of interest
  - Could include uploaded material in workflows
  - Can impose different policies
- Social Bookmarking and linking: discuss later

7 Integration Framework of Tools (SSG = Semantic Scholars' Grid)
Figure labels: SSG Domain-1 Web service ... SSG Domain-N Web service; Integrated User Interface (UI); Gateway WS-1, Gateway WS-2, Gateway WS-3 ... Gateway WS-N; Tool-1 Del.icio.us, Tool-2 Connotea, Tool-3 MySpace ... Tool-N e.g. CiteSeer; Native UI-1, Native UI-3, Native UI-4 ... Native UI-N; SSG MD Store

8 Strategy
- It doesn't seem useful to build the 251st community tool
- In fact a major barrier to use of existing tools is the question of what happens when a better tool comes along and/or the chosen tool disappears (unsupported/removed from Web)
- So assume we use existing tools but wrap them all as web services, so we can transfer information to new tools and integrate information between tools
- Need some "glue" logic, a "unification" database and a minimal user interface
- Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites)
- Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry (later), Science.gov
- Journals: Manuscript Central
- Conferences: CMT from Microsoft or ?
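
The wrapping strategy on this slide can be sketched roughly as below. This is only an illustration under stated assumptions, not the SSG implementation: the `Bookmark` record, the `BookmarkTool` interface, the `InMemoryTool` stand-in and the `migrate` helper are all hypothetical names, and no real del.icio.us, Connotea or CiteULike API is called.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Bookmark:
    """Tool-neutral record for the 'unification' database (hypothetical schema)."""
    url: str
    title: str
    tags: frozenset = frozenset()


class BookmarkTool(ABC):
    """Common wrapper interface that every wrapped tool would expose."""

    @abstractmethod
    def export_all(self) -> list: ...

    @abstractmethod
    def import_one(self, bookmark: Bookmark) -> None: ...


class InMemoryTool(BookmarkTool):
    """Stand-in for a wrapped tool; a real wrapper would call the tool's own service."""

    def __init__(self, name: str):
        self.name = name
        self._items = {}

    def export_all(self) -> list:
        return list(self._items.values())

    def import_one(self, bookmark: Bookmark) -> None:
        # Merge tags when the URL is already known, so repeated transfers are safe.
        old = self._items.get(bookmark.url)
        tags = bookmark.tags | (old.tags if old else frozenset())
        self._items[bookmark.url] = Bookmark(bookmark.url, bookmark.title, tags)


def migrate(source: BookmarkTool, target: BookmarkTool) -> int:
    """Glue logic: copy every bookmark from one wrapped tool to another."""
    count = 0
    for bookmark in source.export_all():
        target.import_one(bookmark)
        count += 1
    return count


old_tool = InMemoryTool("tool-that-disappears")
new_tool = InMemoryTool("better-tool")
old_tool.import_one(Bookmark("http://example.org/paper", "A paper", frozenset({"grid"})))
new_tool.import_one(Bookmark("http://example.org/paper", "A paper", frozenset({"chemistry"})))
moved = migrate(old_tool, new_tool)
```

Because every tool sits behind the same small interface, moving to a better tool reduces to one `migrate` call, which is the point of wrapping rather than rebuilding.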

9 Connotea

10 Connotea queried by SERVOGrid

11 Delicious Semantic Web/Grid
- del.icio.us purchased by Yahoo for ~$30M (Nature)
- Associate metadata with Bookmarks specified by URLs, DOIs (Digital Object Identifiers)
- Users add comments and keywords (called tags)
- Users are linked together into groups (communities)
- Information such as title and authors extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
- Bibtex-like additional information in CiteULike
- This is perhaps de facto Semantic Web – remarkable for its simplicity

12 Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid I
- Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata
  - Title, author and institution of documents
  - Citations with their own metadata allowing one to match to other documents
- Science.gov extracts metadata from lots of US Government databases
- These capabilities are sure to become more powerful and to be extended
  - Give "Citation Index" in real time
  - Tell you all authors of all papers that cite a paper that cites you etc. (Note it's a small world so don't go too far in link analysis)
  - Tell you all citations of all papers in a workshop
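
As a small illustration of the link analysis mentioned on this slide, the sketch below finds the authors of all papers that cite a paper that cites "you". The citation map, author lists and function names are made-up toy data and hypothetical helpers, not output or APIs from Citeseer or Google Scholar.

```python
# Hypothetical toy data: paper id -> list of paper ids it cites.
cites = {
    "A": ["you"],
    "B": ["A"],
    "C": ["A", "B"],
    "D": ["C"],
}
authors = {"A": ["Ng"], "B": ["Ito", "Ng"], "C": ["Roy"], "D": ["Lee"], "you": ["Me"]}


def cited_by(paper, cites):
    """Invert the citation map: which papers cite `paper`?"""
    return {p for p, refs in cites.items() if paper in refs}


def authors_at_hop(paper, cites, authors, hops=2):
    """Authors of papers first reached after exactly `hops` citation links."""
    frontier, seen = {paper}, set()
    for _ in range(hops):
        # Step one link outward, skipping papers already reached at a shorter distance.
        frontier = {c for p in frontier for c in cited_by(p, cites)} - seen
        seen |= frontier
    return sorted({a for p in frontier for a in authors.get(p, [])})


one_hop = authors_at_hop("you", cites, authors, hops=1)
two_hop = authors_at_hop("you", cites, authors, hops=2)
```

The `hops` parameter is where the "small world" caution applies: each extra hop can multiply the frontier, so a real service would likely cap it at a small value.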

13 Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid II
- It is natural to develop core document Services such as those used in Citeseer/Google Scholar but applied to "your" documents of interest that may not have been processed yet
  - As just submitted to a conference perhaps
- These tools can help form useful lists such as authors of all cited or submitted papers to a journal
- OSCAR2/3 (from Peter Murray-Rust's group at Cambridge) augment the application independent "core" metadata (Title, authors, institutions, Citations) with a list of all chemical terms
  - This tool is a Service that can be applied to "your" document or to a set of documents harvested in some fashion
- Other fields have natural application specific metadata and OSCAR-like tools can be developed for them
- Such high value tools could appear on "publisher" sites of future (or else publishers will disappear)

14 Document-enhanced Cyberinfrastructure
Figure labels: Existing User Interface; Google Scholar, Manuscript Central, Science.gov, Windows Live Academic Search, Citeseer, CMT Conference Management, etc.; Existing Document based Research Tools; Web service Wrappers; New Document-enhanced Research Tools; Integration/Enhancement User Interface; Community Tools; Generic Document Tools; MyResearch Database; Bibliographic Database; Export: RSS, Bibtex, Endnote etc.; CiteULike, Connotea, Del.icio.us, Bibsonomy, Biolicious; PubChem, PubMed; Traditional Cyberinfrastructure

15 Chemical Informatics as a Grid Application
- Chemical Informatics is the application of information technology to problems in chemistry
  - Example problems: managing data in large scale drug discovery and molecular modeling
- Building Blocks: Chemical Informatics Resources
  - Chemical databases maintained by various groups: NIH PubChem, NIH DTP
  - Application codes (both commercial and open source)
    - Data mining such as clustering
    - Quantum chemistry and molecular modeling
  - Screening centers (with HTS, High Throughput Screening, devices) measuring interaction of chemicals with biological samples
  - Visualization tools
  - Web resources: journal articles, etc.
- A Chemical Informatics Grid needs to integrate these into a common, loosely coupled, distributed computing environment

16 Document, Simulation and Data rich CI for Chemical Informatics
SCIENTIST: "These compounds look promising from their HTS results. Should I commit some chemistry resources to following them up?"
- Oracle Database (HTS): Compounds were tested against related assays and showed activity, including selectivity within target families
- Oracle Database (Genomics): None of these compounds have been tested in a microarray assay
- Computation: The information in the structures and known activity data is good enough to create a QSAR model with a confidence of 75%
- External Database (Patent): Some structures with a similarity > 0.75 to these appear to be covered by a patent held by a competitor
- Computation: All the compounds pass the Lipinski Rule of Five and toxicity filters
- Excel Spreadsheet (Toxicity): One of the compounds was previously tested for toxicology and was found to have no liver toxicity
- Word Document (Chemistry): Several of the compounds had been followed up in a previous project, and solubility problems prevented further development
- Journal Article: A recent journal article reported the effectiveness of some compounds in a related series against a target in the same family
- Word Document (Marketing): A report by a team in Marketing casts doubt on whether the market for this target is big enough to make development cost-effective
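
The Rule of Five check named among the computation steps can be sketched as a simple filter. The thresholds are the standard Lipinski criteria (molecular weight at most 500 daltons, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). The `Descriptors` record, the descriptor values and the choice to tolerate one violation are assumptions made for illustration; the slide does not say which tool or cut-off was used.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (illustrative values, not real assay data)."""
    mol_weight: float        # daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed policy: accept compounds with at most `max_violations` violations."""
    return lipinski_violations(d) <= max_violations


small = Descriptors(mol_weight=180.2, log_p=1.2, h_bond_donors=1, h_bond_acceptors=4)
bulky = Descriptors(mol_weight=812.0, log_p=6.3, h_bond_donors=7, h_bond_acceptors=12)
```

In a Grid setting a filter like this would sit behind a service interface, with the descriptors supplied by whatever property calculator the workflow already runs.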

17 HTS results and COMPARE Web service
Positive results (red bar to right of vertical line) indicate greater than average toxicity of cell line to tested agent.

18 HTS data organization & flagging
- A tumor cell line is selected. The activity results for all the compounds in the DTP database in the given range are extracted from the PostgreSQL database
- The compounds are clustered on chemical structure similarity, to group similar compounds together
- The compounds along with property and cluster information are converted to VOTable format and displayed in VOPlot
- OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs
- Use Taverna for Workflow and VOTable (from astronomy) as basic data structure; VOTable of compounds and properties with Excel-like spreadsheet services
Figure labels: VOPlot, Taverna
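
The clustering step above can be illustrated with a minimal sketch. It assumes each compound is represented by a set of fingerprint "on" bits, uses Tanimoto similarity, and applies a simple leader-style clustering with a 0.75 threshold. The slide does not name the clustering method, fingerprint or threshold actually used, so all of these, along with the compound ids, are assumptions.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(fingerprints: dict, threshold: float = 0.75) -> list:
    """Put each compound in the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader_id, [member_ids])
    for cid, fp in fingerprints.items():
        for leader, members in clusters:
            if tanimoto(fp, fingerprints[leader]) >= threshold:
                members.append(cid)
                break
        else:
            # No existing leader is close enough: this compound starts a new cluster.
            clusters.append((cid, [cid]))
    return [members for _, members in clusters]


# Hypothetical fingerprints: each compound is a set of "on" bit positions.
fps = {
    "cmpd-1": frozenset({1, 2, 3, 4}),
    "cmpd-2": frozenset({1, 2, 3, 4, 5}),   # 4/5 = 0.8 similar to cmpd-1
    "cmpd-3": frozenset({7, 8, 9}),
    "cmpd-4": frozenset({7, 8, 9, 10}),     # 3/4 = 0.75 similar to cmpd-3
}
groups = leader_cluster(fps)
```

The resulting cluster ids could then be attached to each compound as one more property column before conversion to VOTable.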

19 Varuna environment for molecular modeling (Baik, IU)
Figure labels: QM Database; Supercomputer; Researcher; Simulation Service (FORTRAN Code, Scripts); Chemical Concepts; Experiments; QM/MM Database; PubChem, PDB, NCI, etc.; ChemBioGrid; Reaction DB; DB Service (Queries, Clustering, Curation, etc.); Papers etc.; Condor

20 OSCAR3 Service from Cambridge UK
Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles). It identifies (or attempts to identify):
- Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms
- Chemical data: spectra, melting/boiling point, yield etc. in experimental sections
- Other entities: things like N(5)-C(3) and so on
Uses SMILES, InChI and CML
There is a larger effort, SciBorg, in this area

21 OSCAR2 Chemistry Document analysis
It detects "magic" chemical strings in text and then:
- Stores them as metadata associated with document
- Queries ChemInformatics repositories to tell you lots of information about identified compounds
- Tells you which other documents have this compound
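
The "which other documents have this compound" capability amounts to an inverted index from compound to documents. The sketch below shows that idea only: it spots compounds with a tiny hard-coded word list, which is a deliberate simplification and not how OSCAR recognizes chemical names. The lexicon, document texts and function names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon; real chemical name recognition is far richer than a word list.
KNOWN_COMPOUNDS = {"benzene", "toluene", "ethanol", "acetone"}


def find_compounds(text: str) -> set:
    """Return known compound names that occur as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & KNOWN_COMPOUNDS


def build_index(documents: dict) -> dict:
    """Map each compound to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for compound in find_compounds(text):
            index[compound].add(doc_id)
    return index


docs = {
    "doc-1": "The product was recrystallised from ethanol and washed with acetone.",
    "doc-2": "Toluene was removed under vacuum; the residue dissolved in ethanol.",
    "doc-3": "No solvents were used in this step.",
}
index = build_index(docs)
shared = index["ethanol"]
```

The per-document result of `find_compounds` corresponds to the metadata stored with each document, and `index` answers the cross-document question.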

22 Clustering Documents from chemical properties

23 Provenance and Delicious CI
- We can use del.icio.us style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data or a keyword relating different data etc.)
  - All data should be labeled by a URI to enable this
  - One has in addition Citeseer/OSCAR metadata
- Current major tagging systems support flat list of tags without name=value (RDF triple) or schema organization
  - RDF Triples << Full Semantic Web
  - Delicious << RDF
  - Tradeoff between features and pervasive deployment
- Some extra features are easy to add as a custom service
- Features not supported by del.icio.us can be uploaded as comments
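
One way to carry name=value provenance through a flat tagging system is to serialize the pairs into the comment field and parse them back out. The sketch below assumes a one-pair-per-line convention invented for this example; the function names, the URI and the annotation values are hypothetical and nothing here calls a real del.icio.us interface.

```python
def encode_comment(free_text: str, pairs: dict) -> str:
    """Append name=value pairs to a free-text comment, one pair per line."""
    lines = [free_text] + [f"{name}={value}" for name, value in sorted(pairs.items())]
    return "\n".join(lines)


def decode_comment(comment: str):
    """Split a comment back into free text and name=value pairs."""
    text_lines, pairs = [], {}
    for line in comment.split("\n"):
        name, sep, value = line.partition("=")
        # Treat a line as a pair only if the name is a single token (no spaces).
        if sep and name.strip() and " " not in name.strip():
            pairs[name.strip()] = value.strip()
        else:
            text_lines.append(line)
    return "\n".join(text_lines), pairs


def to_triples(uri: str, pairs: dict) -> list:
    """View the pairs as (subject, predicate, object) triples about the URI."""
    return [(uri, name, value) for name, value in sorted(pairs.items())]


# Hypothetical data item and annotation.
uri = "http://example.org/data/run-42"
comment = encode_comment("Looks noisy after t=300s", {"quality": "low", "instrument": "HTS-3"})
text, pairs = decode_comment(comment)
triples = to_triples(uri, pairs)
```

This keeps the pervasive flat-tag tool in use while a custom service recovers the triple-like structure, which is one reading of the features-versus-deployment tradeoff on the slide.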

24 Current Status
- Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike, OSCAR3 are Web Services
- Debugging on 500 presentations and papers from my CGL research group
- Experiment with GGF Presentations, Broad collection of Chemical Informatics resources (explore science document CI link) and Concurrency&Computation: Practice&Experience Web site (?business model for journals)

25 Collection (Grid) Builder Tool
- This can perhaps be built on top of workflow systems
- Unlike ordinary workflow, this is a tool to manage collections of Grids and the key metadata adorning Grids and Services
- It instantiates needed mediation between Grids (systems) to convert:
  - JMS to MQSeries
  - GT4 to WS-I+
  - WS-Eventing to WS-Notification
- It supports conventional workflow as tightly coupled services
- It supports system wide "management" (configuration)
  - We are using WS-Management – see CLADE paper
- Deploy services and mediation brokers on demand to deliver real-time performance
  - DoD can't pause the battle while WS-RM and TCP catch up if data saturated

26 Grids of Grids of Simple Services
- Grids are managed collections of one or more services
  - A simple service is the smallest Grid
- Services and Grids are linked by messages
- Internally to service, functionalities are linked by methods
- Link services via methods → messages → streams
- We are familiar with method-linked hierarchy: Lines of Code → Methods → Objects → Programs → Packages
Figure: Overlay and Compose Grids of Grids. Labels: Methods, Services, Component Grids; CPUs, Clusters, Compute Resource Grids, MPPs; Databases, Federated Databases; Sensor, Sensor Nets; Data Resource Grids

27 Component Grids?
- So we build collections of Web Services which we package as component Grids:
  - Visualization Grid
  - Sensor Grid
  - Utility Computing Grid
  - Collaboration Grid
  - Earthquake Simulation Grid
  - Control Room Grid
  - Crisis Management Grid
  - Drug Discovery Grid
  - Bioinformatics Sequence Analysis Grid
  - Intelligence Data-mining Grid
- We build bigger Grids by composing component Grids

28 Mediation and Transformation in a Grid of Grids and Simple Services
Figure labels: three instances of "Grid or Service", each with Internal Interfaces and a Port; External facing Interfaces; Distributed Brokers between distributed ports; Mediation and Transformation Services (Listen, Queue, Transform, Send); 1-10 ms Overhead
Use "OGSA" to Federate?
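
The listen, queue, transform, send cycle of a mediation service can be sketched as below. The `MediationBroker` class and the field names in `source_to_target` are hypothetical, chosen only to show the shape of one mediation hop; they are not real JMS, MQSeries or NaradaBrokering message formats or APIs.

```python
import queue


class MediationBroker:
    """Listen, queue, transform, send: one hop between two message formats."""

    def __init__(self, transform, send):
        self._inbox = queue.Queue()
        self._transform = transform
        self._send = send

    def listen(self, message: dict) -> None:
        """Accept a message from the source side and queue it."""
        self._inbox.put(message)

    def drain(self) -> int:
        """Transform and forward everything currently queued, in arrival order."""
        sent = 0
        while True:
            try:
                message = self._inbox.get_nowait()
            except queue.Empty:
                return sent
            self._send(self._transform(message))
            sent += 1


def source_to_target(message: dict) -> dict:
    """Hypothetical mapping between two header conventions."""
    return {"msgId": message["id"], "payload": message["body"].encode("utf-8")}


delivered = []
broker = MediationBroker(source_to_target, delivered.append)
broker.listen({"id": 1, "body": "hello"})
broker.listen({"id": 2, "body": "world"})
count = broker.drain()
```

Keeping the transform as a plug-in function is what would let the same broker shell mediate different pairs of systems on demand.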

29 [Figure: message throughput by processor configuration. Labels: 2 Chips, 2 Core/chip; 2 Chips, 1 Core/chip; 1 Chip, 8 Core/chip; 1 Chip, 6 Core/chip; Xeon; Opteron]
4 Cores is 3000 messages per second; about one message per millisecond per core for Opteron; one message per 2 ms for Sun Niagara core
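
The per-core figures quoted here follow from simple arithmetic, sketched below. The example reads the 3000 messages per second as a total across 4 cores, which is an interpretation of the flattened slide text rather than something it states unambiguously; the function names are illustrative.

```python
def ms_per_message_per_core(messages_per_second: float, cores: int) -> float:
    """Average time each core spends per message, in milliseconds."""
    per_core_rate = messages_per_second / cores
    return 1000.0 / per_core_rate


def per_core_rate_from_latency(ms_per_message: float) -> float:
    """Messages per second one core sustains at a given per-message time."""
    return 1000.0 / ms_per_message


# 3000 messages/s over 4 cores is 750 messages/s per core, a bit over 1 ms each.
four_core = ms_per_message_per_core(3000, 4)

# One message per 2 ms corresponds to 500 messages/s per core.
niagara_core_rate = per_core_rate_from_latency(2.0)
```

Under that reading the 4-core number works out to roughly 1.3 ms per message per core, consistent with the slide's "about one message per millisecond".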

30 [Table: message rates through the Message Bridge]
Columns: Message Size; NaradaBrokering (JMS) to IBM MQ: In-order Messages/second, No Ordering Messages/second; IBM MQ to NaradaBrokering (JMS): In-Order Messages/second, No Ordering Messages/second
Rows (message size): 100 Bytes; Kbytes; Kbytes
Test machine: Pentium 4 (3.4GHz) with 1GB of RAM, with IBM MQ Series, NaradaBrokering and the Message Bridge all running on it. NaradaBrokering running in JMS emulation mode

31 [Figure: Grid of sensor, filter and other services linked by SOAP Messages. Legend: SS = Sensor Service, FS = Filter Service, OS = Other Service, MD = MetaData. Other labels: Database, Portal, Another Service, Another Grid. Flow: Raw Data → Data → Information → Knowledge → Wisdom, Decisions]