CyberInfrastructure to Support Scientific Exploration and Collaboration Dennis Gannon (based on work with many collaborators, most notably Beth Plale) School of Informatics, Indiana University
Overview CyberInfrastructure for Virtual Organizations in Computational Science –Science Portals and The Gateway Concept Automating the Trail From Data to Scientific Discovery –An Example in Depth: Mesoscale Storm Prediction The challenge for the Individual Researcher –Connecting tools to on-line services The Promise of the Future –Multicore, personal petabyte and gigabit bandwidth
The Realities of Science in the U.S. “Big Science” dominates the funding hierarchy. –Why? It's important and easy to sell to Congress. The NSF is investing in a vast network of supercomputers to support big science –The results are empowering a broad range of scientific communities. Where is the single investigator? –The Web has enabled democratization of information access –Is there a similar path for access to advanced computational resources?
Democratizing Access to Science What is needed for the individual or small team to do large scale science? –Access to data and the tools to analyze and transform it. –A means to publish not just the results of a study but also the path to discovery. Where are the resources? –What we have now: TeraGrid –What is emerging?
The TeraGrid The US National Supercomputer Grid –CyberInfrastructure composed of a set of resources (compute and data) that provide common services for wide-area data management, single sign-on user authentication, and distributed job scheduling and management (in the works). Collectively –1 Petaflop –20 Petabytes Soon to triple, and will add a petaflop each year –But at a slower rate than Google, eBay, and Amazon add resources.
TeraGrid Wide: Science Gateways Science Portals –A Portal = a web-based home + personal workspace + personal tools. –Web Portal Technology + Grid Middleware Gives a community of researchers: –Access to shared resources (both data and computation) –A forum for collaboration on shared problem solving TeraGrid Science Gateways –Allow the TeraGrid to be the back-end resource.
NEESGrid Real-time access to earthquake shake-table experiments at remote sites.
BIRN – Biomedical Informatics Research Network
Geological Information Grid Portal
RENCI Bio Portal Providing access to biotechnology tools running on a back-end Grid. –Leverages state-wide investment in bioinformatics –Supports undergraduate & graduate education and faculty research –Another portal soon: National Evolutionary Synthesis Center
Nanohub - nanotechnology
X-Ray Crystallography
ServoGrid Portal
The LEAD Project
Predicting Storms Hurricanes and tornadoes cause massive loss of life and damage to property Underlying physical systems involve highly non-linear dynamics, so forecasts are computationally intense Data comes from multiple sources –“Real time” data derived from streams of sensor observations –Archived databases of past storms Infrastructure challenges: –Data mine radar instrument data for storms –Allocate supercomputer resources automatically to run forecast simulations –Monitor results and retarget instruments –Log provenance and metadata about experiments for auditing
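The adaptive loop behind these challenges can be sketched in a few lines. The sketch below is not the LEAD implementation: the detection rule, the allocation step, and every name (ProvenanceLog, detect_storm, adaptive_forecast_loop) are hypothetical stand-ins for the real services.

```python
# A minimal sketch, assuming hypothetical service calls, of the adaptive loop:
# watch a radar stream, and when a storm signature is detected, reserve compute,
# launch a forecast, retarget the instrument, and record provenance.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ProvenanceLog:
    events: List[str] = field(default_factory=list)

    def record(self, message: str) -> None:
        self.events.append(f"{datetime.utcnow().isoformat()} {message}")


def detect_storm(radar_scan: dict) -> bool:
    # Placeholder rule: a real system would run a data-mining algorithm
    # over the full reflectivity field, not a single threshold.
    return radar_scan.get("max_reflectivity_dbz", 0) > 55


def adaptive_forecast_loop(radar_scans, log: ProvenanceLog) -> None:
    for scan in radar_scans:
        if detect_storm(scan):
            log.record(f"storm detected near {scan['site']}")
            log.record("reserved processors (hypothetical allocation call)")
            log.record("launched forecast workflow for the detected region")
            log.record("retargeted radar to follow the storm cell")


if __name__ == "__main__":
    scans = [{"site": "KTLX", "max_reflectivity_dbz": 62},
             {"site": "KICT", "max_reflectivity_dbz": 30}]
    log = ProvenanceLog()
    adaptive_forecast_loop(scans, log)
    print("\n".join(log.events))
```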
Traditional Methodology Static observations (radar data, mobile mesonets, surface observations, upper-air balloons, commercial aircraft, geostationary and polar-orbiting satellites, wind profilers, GPS satellites) flow through analysis/assimilation (quality control, retrieval of unobserved quantities, creation of gridded fields), prediction/detection (PCs to teraflop systems), and product generation, display, and dissemination to end users (NWS, private companies, students). The process is entirely serial and static (pre-scheduled): no response to the weather!
The LEAD Vision: Enabling a new paradigm of scientific exploration. The same pipeline (analysis/assimilation, prediction/detection, product generation, display, and dissemination to end users) is driven by dynamic observations, with models and algorithms driving the sensors. The CS challenge: –Build cyberinfrastructure services that provide adaptability, scalability, availability, and usability –Create a new paradigm of meteorology research
Building Experiments that Respond to the Future Can we pose a scientific search and discovery query that the cyberinfrastructure executes as our agent? In the LEAD case it is Data Driven, Persistent, and Agile –Weather data streams define the nature of the computation –Mine the data streams, detect “interesting” features; an event triggers a workflow scenario that may have been waiting for months.
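A hedged sketch of this "standing query" idea follows: a condition and a workflow name are registered once, and the agent fires the workflow whenever an observation satisfies the condition. The CAPE threshold, the field name, and the class names are illustrative assumptions, not LEAD interfaces.

```python
# Sketch of a standing query the cyberinfrastructure could execute as the
# scientist's agent: register once, fire whenever the data stream matches.

from typing import Callable, Dict, List


class StandingQueryAgent:
    def __init__(self) -> None:
        self._queries: List[Dict] = []

    def register(self, condition: Callable[[dict], bool], workflow: str) -> None:
        # The query may sit idle for months before any observation matches.
        self._queries.append({"condition": condition, "workflow": workflow})

    def observe(self, observation: dict) -> List[str]:
        fired = []
        for q in self._queries:
            if q["condition"](observation):
                fired.append(q["workflow"])  # a real agent would launch it here
        return fired


agent = StandingQueryAgent()
agent.register(lambda obs: obs.get("cape_j_per_kg", 0) > 2500,
               workflow="2km-tornado-forecast")
print(agent.observe({"cape_j_per_kg": 3100}))   # -> ['2km-tornado-forecast']
```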
The LEAD Gateway Portal To support three classes of users –Meteorology research scientists & grad students. –Undergrads in meteorology classes –People who want easy access to weather data. Go to:
Gateway Components A Framework for Discovery –Four basic components Data Discovery –Catalogs and index services The experiment –Computational workflow managing on-demand resources Data analysis and visualization Data product preservation –Automatic metadata generation and experimental data provenance
Data Search Select a region, a time range, and the desired attributes
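One way to picture the query this page builds is as a bounding box, a time range, and a list of required attributes applied against a catalog. The catalog records and field names below are hypothetical.

```python
# A minimal sketch of a catalog search by region, time range, and attributes.

from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class CatalogQuery:
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    start: datetime
    end: datetime
    attributes: List[str]


def search(catalog: List[dict], q: CatalogQuery) -> List[dict]:
    return [rec for rec in catalog
            if q.lat_min <= rec["lat"] <= q.lat_max
            and q.lon_min <= rec["lon"] <= q.lon_max
            and q.start <= rec["time"] <= q.end
            and all(a in rec["attributes"] for a in q.attributes)]


catalog = [{"id": "nexrad-KTLX-001", "lat": 35.2, "lon": -97.4,
            "time": datetime(2007, 5, 4, 18),
            "attributes": ["reflectivity", "velocity"]}]
q = CatalogQuery(30, 40, -100, -90, datetime(2007, 5, 4), datetime(2007, 5, 5),
                 ["reflectivity"])
print([r["id"] for r in search(catalog, q)])
```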
Building Experiments As users interact with the portal they create “experiments” An experiment is –A collection of data (or desired data) –A set of analysis, transformation, or prediction tasks defined by a workflow or a high-level query –A provenance document that encodes a repeatable history of the experiment.
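A minimal sketch of how such an experiment object might be represented, following the three parts listed above; the field names and the record_step helper are illustrative, not the LEAD schema.

```python
# Sketch of an "experiment": input data, a workflow of tasks, and provenance.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    name: str
    input_data: List[str]          # URIs of data products (or queries for desired data)
    workflow: List[str]            # ordered analysis / prediction tasks
    provenance: List[Dict] = field(default_factory=list)   # repeatable history

    def record_step(self, task: str, inputs: List[str], outputs: List[str]) -> None:
        self.provenance.append({"task": task, "inputs": inputs, "outputs": outputs})


exp = Experiment(name="okc-2km-forecast",
                 input_data=["lead:data/nexrad/KTLX/2007-05-04"],   # hypothetical URI
                 workflow=["ADAS assimilation", "WRF forecast", "arpsplot"])
exp.record_step("ADAS assimilation", exp.input_data, ["lead:data/adas/analysis-001"])
print(exp.provenance)
```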
Portal: Experimental Data & Metadata Space CyberInfrastructure extends the user’s desktop to incorporate a vast data analysis space. As users go about doing scientific experiments, the CI manages back-end storage and compute resources. –The portal provides ways to explore, search, and discover this data. Metadata about experiments is largely automatically generated, and highly searchable. –Describes the data object (the file) in application-rich terms, and provides a URI to a data service that can resolve an abstract unique identifier to the real, on-line data “file”.
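The resolution step can be pictured as a small lookup: an abstract identifier carries application-rich metadata and can be handed to a data service that returns the on-line location. The registry, endpoint URL, and identifiers below are invented for illustration.

```python
# Sketch: metadata describes a data product in application terms, and a data
# service resolves its abstract identifier to a real, on-line file location.

from typing import Dict, Optional

# Hypothetical metadata catalog: abstract ID -> application-rich description
METADATA: Dict[str, Dict] = {
    "lead:resource/forecast/2007-05-04/okc": {
        "description": "2 km WRF forecast, Oklahoma City region",
        "variables": ["temperature", "reflectivity"],
        "resolver": "https://datasvc.example.org/resolve",   # assumed endpoint
    }
}

# Hypothetical table standing in for the data service itself
LOCATIONS: Dict[str, str] = {
    "lead:resource/forecast/2007-05-04/okc":
        "gridftp://storage.example.org/lead/forecasts/okc-20070504.nc"
}


def resolve(abstract_id: str) -> Optional[str]:
    """Return the on-line location for an abstract identifier, if known."""
    return LOCATIONS.get(abstract_id)


print(METADATA["lead:resource/forecast/2007-05-04/okc"]["description"])
print(resolve("lead:resource/forecast/2007-05-04/okc"))
```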
A typical weather forecast runs as a workflow with four stages: pre-processing (arpssfc, arpstrn, Ext2arps-ibc, Ext2arps-lbc, 88d2arps, mci2arps, nids2arps), assimilation (ADAS), forecast (arps2wrf, WRF), and visualization (wrf2arps, arpsplot, IDV viz). Inputs include terrain data files, surface data files, ETA, RUC, and GFS data, radar data (levels II and III), satellite data, and surface, upper-air, mesonet & wind profiler data. ~400 data products are consumed, produced, and transformed during the workflow lifecycle.
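The same workflow can be expressed as a dependency graph over a subset of the components named above. The edges below are an illustrative guess at the wiring, not the authoritative LEAD/CAPS workflow definition.

```python
# Sketch of the forecast workflow as a dependency graph; print a valid run order.

from graphlib import TopologicalSorter   # Python 3.9+

# task -> set of tasks it depends on (assumed wiring for illustration)
workflow = {
    "arpssfc": set(),                               # surface characteristics pre-processing
    "arpstrn": set(),                               # terrain pre-processing
    "88d2arps": set(),                              # level II radar ingest
    "ADAS": {"arpssfc", "arpstrn", "88d2arps"},     # data assimilation
    "arps2wrf": {"ADAS"},                           # convert analysis to WRF input
    "WRF": {"arps2wrf"},                            # forecast model
    "wrf2arps": {"WRF"},
    "arpsplot": {"wrf2arps"},                       # visualization products
}

print(list(TopologicalSorter(workflow).static_order()))
```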
The Experiment Builder A portal “wizard” that leads the user through the set-up of a workflow Asks the user: –“Which workflow do you want to run?” Once this is known, it can prompt the user for the required input data sources Then it “launches” the workflow.
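A sketch of the wizard's three steps, with made-up workflow names and input lists: pick a workflow, supply its required inputs, then launch.

```python
# Sketch of the experiment-builder flow; workflow names and inputs are hypothetical.

WORKFLOWS = {
    "2km-wrf-forecast": ["forecast region", "radar data source", "start time"],
    "adas-analysis-only": ["radar data source", "surface observations"],
}


def build_experiment(choice: str, answers: dict) -> dict:
    required = WORKFLOWS[choice]
    missing = [r for r in required if r not in answers]
    if missing:
        raise ValueError(f"wizard still needs: {missing}")
    return {"workflow": choice, "inputs": answers, "status": "launched"}


print(build_experiment("2km-wrf-forecast",
                       {"forecast region": "Oklahoma",
                        "radar data source": "KTLX",
                        "start time": "2007-05-04T18:00Z"}))
```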
Parameter Selection
Selecting the forecast region
Experience so far First release to support “WxChallenge: the new collegiate weather forecast challenge” –The goal: “forecast the maximum and minimum temperatures, precipitation, and maximum sustained wind speeds for select U.S. cities” and “provide students with an opportunity to compete against their peers and faculty meteorologists at 64 institutions for honors as the top weather forecaster in the nation.” –79 “users” ran 1,232 forecast workflows, generating 2.6 TBytes of data. Over 160 processors were reserved on Tungsten from 10am to 8pm EDT (EST), five days each week National Spring Forecast –First use of user-initiated 2 km forecasts as part of that program. Generated serious interest from the National Severe Storm Center. Integration with the CASA project is scheduled for the final year of the LEAD ITR.
Is TeraGrid the Only Enabler? The web has evolved a set of information and service “super nodes” –Directories & indexes (Google, MS, Yahoo) –Transactional mosh pits (eBay, Facebook, Wikipedia) –Raw data and compute services (Amazon …) Can we build the tools for scientific discovery on this “private sector” grid? –Yes. –One CS student + one bio-informatician + Amazon Storage Service + Amazon Compute Cloud =..
A Virtual Lab for Evolutionary Genomics Data and databases live on S3 Computational tools run (on-demand) as services on EC2. Users compose workflows. Result data and metadata are visible to the user through a desktop client.
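A present-day sketch of this pattern using boto3 (which did not exist at the time of this work): data sets live in S3 and an analysis machine is started on demand on EC2. The bucket, key, file name, and AMI are hypothetical, and running this requires AWS credentials and would incur cost.

```python
# Sketch only: store a data set in the storage service, then start an
# on-demand compute instance that runs the analysis tools.

import boto3

s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

# Upload a genomics data set to a (hypothetical) bucket.
s3.upload_file("alignments.fasta", "virtual-genomics-lab", "data/alignments.fasta")

# Launch one on-demand instance from a (hypothetical) machine image
# that has the analysis services pre-installed.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)
```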
Validating Scientific Discovery The Gateway is becoming part of the process of science by being an active repository of data provenance The system records each computational experiment that a user initiates –A complete audit trail of the experiment or computation –Published results can include link to provenance information for repeatability and transparency. The Scientific Method is all about repeatability of experiments –Are we there yet?
Almost The provenance contains the workflow, and if we publish it, it can be re-run –Are the same resources still available? Not a necessary condition for validation –Has the data changed? Another user can modify the experiment: –Replace an analysis step with another –Test it on different data.
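The validation idea can be sketched as re-running the workflow recorded in a published provenance document, optionally with one step swapped or different input data. All names below are illustrative.

```python
# Sketch: reproduce a published experiment from its provenance, or run a variant.

from typing import Dict, List


def rerun(provenance: Dict, replace_step: Dict[str, str] = None,
          new_data: List[str] = None) -> List[str]:
    steps = [replace_step.get(s, s) if replace_step else s
             for s in provenance["workflow"]]
    data = new_data if new_data else provenance["input_data"]
    return [f"run {step} on {data}" for step in steps]


published = {"workflow": ["ADAS assimilation", "WRF forecast", "arpsplot"],
             "input_data": ["lead:data/nexrad/KTLX/2007-05-04"]}

# Reproduce exactly, then test a variant with a different assimilation step and data.
print(rerun(published))
print(rerun(published,
            replace_step={"ADAS assimilation": "3DVAR assimilation"},
            new_data=["lead:data/nexrad/KICT/2007-05-05"]))
```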
The Future Experimental Testbed In five years: multicore processors, personal petabytes, and ubiquitous gigabit bandwidth –Much richer experimental capability on the desktop; more of the computational work can be moved there Do we no longer need the massive remote data/compute center? –Demand scales with capability. But there is more.
Last Thought Vastly improved capability for interactive experimentation –Data exploration and visualization: interacting with hundreds of incoming data streams –Tracking our path and exploring 100 possible experimental scenarios concurrently –Deep search agents that discover new data and new tools –Grab data: automatically fetch and analyze its provenance and set up the workflow to be re-run.
Questions
The Realization in Software User Portal & Portal server, Data Storage, Application services, and a Compute Engine, tied together by an Event Notification Bus. Services: Data Catalog service, myLEAD User Metadata catalog, myLEAD Agent service, Data Management Service, Workflow Engine (executing the workflow graph), Provenance Collection service, and Fault Tolerance & scheduler.
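A minimal publish/subscribe sketch of the event notification bus that ties these services together: services publish events and the metadata and provenance services subscribe. The topic names and handlers are illustrative, not the actual LEAD service interfaces.

```python
# Sketch of an event notification bus: publishers announce events,
# subscribed services react (catalog the product, record provenance).

from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()
bus.subscribe("data.product.created",
              lambda e: print(f"metadata catalog records {e['uri']}"))
bus.subscribe("data.product.created",
              lambda e: print(f"provenance service records {e['uri']}"))
bus.publish("data.product.created", {"uri": "lead:data/wrf/forecast-001"})
```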