Grids for Chemical Informatics
Randall Bramley, Geoffrey Fox, Dennis Gannon, Beth Plale
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington, IN 47401
What is a Grid?
- Name borrowed from the power grid.
- The concept: a ubiquitous information and computation resource.
- A definition: a network of compute and data resources that has been supplemented with a layer of services that provide uniform and secure access to a set of applications of interest to a distributed community of users.
- Grids may be wide-area or enterprise.
Scientific Challenges
The current and future generations of scientific problems are:
- Data oriented: increasingly stream based, often needing petabyte archives
- In need of on-demand computing resources
- Conducted by geographically distributed teams of specialists who don’t want to become experts in grid computing
[Figure: on-demand storm predictions. Labels: Streaming Observations, Forecast Model, Data Mining, Storms Forming]
Information/Knowledge Grids
- Tens to thousands of distributed data sources (instruments, file systems, curated databases …)
- Data Deluge: from 1 petabyte/year (now) to hundreds of petabytes/year (2012)
- Moore’s law for sensors
- Possible filters assigned dynamically (on-demand)
  - Run image processing algorithm on telescope image
  - Run gene sequencing algorithm on compiled data
- Needs decision support front end with “what-if” simulations
- Metadata (provenance) critical to annotate data
- Integrate across experiments, as in multi-wavelength astronomy
- Data Deluge comes from pixels/year available
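The idea of filters assigned on demand, with provenance metadata recorded alongside the data, can be sketched as follows. This is a minimal illustration only: `Record`, `FilterRegistry`, `threshold`, `normalize`, and the source names are hypothetical and are not part of any Grid toolkit mentioned in these slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# A filter maps a list of samples to a new list of samples.
Filter = Callable[[List[float]], List[float]]


@dataclass
class Record:
    """One item from a data source, carrying its own provenance trail."""
    source: str                      # hypothetical source label, e.g. "telescope"
    payload: List[float]
    provenance: List[str] = field(default_factory=list)


class FilterRegistry:
    """Holds the filters currently assigned to each data source."""

    def __init__(self) -> None:
        self._filters: Dict[str, List[Tuple[str, Filter]]] = {}

    def assign(self, source: str, name: str, fn: Filter) -> None:
        # Filters can be added at any time, i.e. assigned on demand.
        self._filters.setdefault(source, []).append((name, fn))

    def process(self, stream: Iterable[Record]) -> Iterator[Record]:
        # Apply each assigned filter in order and note it in the provenance.
        for rec in stream:
            for name, fn in self._filters.get(rec.source, []):
                rec.payload = fn(rec.payload)
                rec.provenance.append(name)
            yield rec


def threshold(values: List[float]) -> List[float]:
    """Toy stand-in for an image-processing step: drop weak samples."""
    return [v for v in values if v >= 0.5]


def normalize(values: List[float]) -> List[float]:
    """Scale samples so the largest becomes 1.0 (no-op if empty or all zero)."""
    peak = max(values, default=0.0)
    return [v / peak for v in values] if peak else values
```

A caller would build a `FilterRegistry`, call `assign("telescope", "threshold", threshold)`, and iterate over `process(stream)`; records from sources with no assigned filter pass through unchanged, and each processed record lists the filters applied to it.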
Internet Scale Distributed Services
- Grids use Internet technology to manage sets of network-connected resources
  - Classic Web: independent one-to-one access to individual resources
  - Grids integrate together and manage multiple Internet-connected resources: people, sensors, computers, data systems
- Grids are built on top of commodity web service technology with broad industry support
- Organization can be explicit, as in TeraGrid, which federates many supercomputers, or CrisisGrid, which federates first responders, commanders, sensors, GIS, (tsunami) simulations, science/public data
- Organization can be implicit, such as curated databases and simulation resources that “harmonize a community”
The Architecture of Gateway Grids
[Figure: layered architecture diagram]
- Layers: The User's Desktop; Gateway Services; Core Grid Services (OGSA-like Layer); Physical Resource Layer
- Components shown: Grid Portal Server; Proxy Certificate Server / vault; Application Events; Resource Broker; User Metadata Catalog; Replica Mgmt; Application Workflow; App. Resource catalogs; Application Deployment; Execution Management; Information Services; Self Management; Data Services; Resource Management; Security Services
Let’s look at a few real examples (about a dozen … many more exist!)
BIRN – Biomedical Information
Mesoscale Meteorology
NSF LEAD project
- Making the tools that are needed to make accurate predictions of tornados and hurricanes
- Data exploration and Grid workflow
Workflow in the LEAD Grid
[Figure: Katrina output]
Renci Bio Gateway
Providing access to biotechnology tools running on a back-end Grid.
- Leverage state-wide investment in bioinformatics
- Undergraduate & graduate education, faculty research
- Another portal soon: National Evolutionary Synthesis Center
Flood Modeling
Large-scale flooding along Brays Bayou in central Houston, triggered by heavy rainfall during Tropical Storm Allison (June 9, 2001), caused more than $2 billion of damage.
University of Texas; TACC; Center for Research in Water Resources; ORNL; Purdue
Gordon Wells, UT; David Maidment, UT; Budhu Bhaduri, ORNL; Gilbert Rochon, Purdue
X-Ray Crystallography
SERVOGrid
SERVOGrid Requirements
- Seamless access to data repositories and large-scale computers
- Integration of multiple data sources, including sensors, databases, and file systems, with analysis system
  - Including filtered OGSA-DAI (Grid database access)
- Rich meta-data generation and access with SERVOGrid-specific schema extending openGIS (Geography as a Web service) standards and using Semantic Grid
- Portals with component model for user interfaces and web control of all capabilities
- Collaboration to support world-wide work
- Basic Grid tools: workflow and notification
- NOT metacomputing
Grid of Grids: Research Grid and Education Grid
[Figure: Grid of Grids diagram. Labels: Database; Analysis and Visualization Portal; Repositories, Federated Databases; Data Filter Services; Field Trip Data; Streaming Data; Sensors; Discovery Services; SERVOGrid; Research Simulations; Research; Education; Customization Services; From Research to Education; Education Grid; Computer Farm; GIS Grid; Sensor Grid; Database Grid; Compute Grid]
Integrating Archived Web Feature Services and Google Maps
Google Maps can be integrated with Web Feature Service archives to filter and browse seismic records.
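A map front end typically asks a Web Feature Service for the features inside the current viewport and then filters them further on the client. The sketch below builds such a request using the standard OGC WFS 1.1.0 key-value `GetFeature` parameters. The endpoint, the feature type name, and the `magnitude` property are hypothetical, and the longitude/latitude axis order of the bounding box is an assumption that depends on the server and coordinate reference system.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat), assumed order


def wfs_getfeature_url(endpoint: str, type_name: str, bbox: BBox,
                       max_features: int = 100) -> str:
    """Build a WFS 1.1.0 GetFeature URL restricted to a bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,              # hypothetical feature type
        "bbox": f"{min_lon},{min_lat},{max_lon},{max_lat}",
        "maxFeatures": max_features,
    }
    return f"{endpoint}?{urlencode(params)}"


def filter_by_magnitude(features: List[Dict], min_magnitude: float) -> List[Dict]:
    """Keep features whose (hypothetical) 'magnitude' property is large enough."""
    return [
        f for f in features
        if f.get("properties", {}).get("magnitude", 0.0) >= min_magnitude
    ]
```

For example, `wfs_getfeature_url("http://example.org/wfs", "seismic:events", (-120.0, 32.0, -114.0, 36.0))` yields a URL the map page could fetch whenever the viewport changes; `filter_by_magnitude` would then narrow the returned records before they are drawn as markers.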
MyGrid - Bioinformatics
The Williams Workflows
- A: Identification of overlapping sequence
- B: Characterisation of nucleotide sequence
- C: Characterisation of protein sequence
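How a workflow engine might chain three such stages can be sketched with a small dependency-ordered runner. This is not the myGrid or Taverna implementation: the step functions are toy stand-ins for the real bioinformatics services, and the assumption that A feeds B and B feeds C is made here only for illustration. It uses `graphlib`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

Step = Callable[[Dict[str, object]], object]


def run_workflow(steps: Dict[str, Step],
                 depends_on: Dict[str, List[str]],
                 inputs: Dict[str, object]) -> Dict[str, object]:
    """Run each step after its dependencies; results are keyed by step name."""
    results: Dict[str, object] = dict(inputs)
    for name in TopologicalSorter(depends_on).static_order():
        results[name] = steps[name](results)
    return results


def longest_overlap(a: str, b: str) -> str:
    """Longest suffix of `a` that is also a prefix of `b` (toy version of A)."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return b[:n]
    return ""


def gc_content(seq: str) -> float:
    """Fraction of G and C bases (toy version of B)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


# Toy stand-ins for the three stages; the real services are far richer.
SKETCH_STEPS: Dict[str, Step] = {
    "A": lambda r: longest_overlap(str(r["seq1"]), str(r["seq2"])),
    "B": lambda r: gc_content(str(r["A"])),
    "C": lambda r: len(str(r["A"])) // 3,   # complete codons available (toy version of C)
}

# Assumed ordering: B needs A's output, C runs after B.
SKETCH_DEPENDENCIES: Dict[str, List[str]] = {"A": [], "B": ["A"], "C": ["B"]}
```

Calling `run_workflow(SKETCH_STEPS, SKETCH_DEPENDENCIES, {"seq1": ..., "seq2": ...})` returns a dictionary holding the inputs plus one entry per stage. Declaring dependencies separately from the step functions lets the same runner handle other orderings.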
[Figure: layered Grid architecture diagram. Labels: Physical Network; Core Low Level Grid Services; Discovery; Metadata; Data Access/Storage; Security; Workflow; Messaging; Management; Policy; Information/Knowledge; Instrument/Sensor; Compute/Supercomputer; Portals; Collaboration Services; Application Services; HTS Tools; Quantum Calculations; Sequencing Tools; Biocomplexity Simulations; MIS; CIS; BIS; Domain Specific Grids/Services; BioInformatics Grid; Chemical Informatics Grid; …]
M(B,C)IS is Molecular (Bio, Chem) Information System supporting specific metadata (CML, CellML, SBML) and physical representations.
Comments on Grid Components
- Support GT4 and WS-I+(+); support Java and .NET
- Portals: all services will have a portlet interface
- Compute Grid: this is some sort of Condor Grid (as used by Cambridge)
- Supercomputer Grid: (extended) TeraGrid
- Workflow, Metadata, Information Management: learn from Taverna, link with BPEL-style workflow, link with other Semantic Grid/metadata services
- Instruments: learn from CIMA/Reciprocal Net, compare with sensors in LEAD/SERVOGrid
- MIS/CIS: see if idea sensible; in any case need CML, LSID, molecular visualization
- Application Services: need a wizard. Support “filters” (Wild) and loosely coupled simulations (Baik)
- Data: link to PubChem and Bioinformatics; link to Baik database
- Discovery: extended UDDI
- Security: review any special requirements and status of PubChem, caBIG, myGrid etc.
- Collaboration, Management, Messaging, Policy: nothing special needed