
May 29, 2007 Metadata, Provenance, and Search in e-Science Beth Plale Director, Center for Data and Search Informatics School of Informatics Indiana University

Credits: PhD students Yogesh Simmhan, Nithya Vijayakumar, and Scott Jensen. Dennis Gannon, IU, key collaborator on discovery cyberinfrastructure.

Nature of Computational Science Discovery
- Extract data from heterogeneous databases,
- Execute task sequences ("workflows") on your behalf,
- Mine data from sensors and instruments and respond,
- Try out new algorithms,
- Explore data through visualization, and
- Go back and repeat the steps: with new data, answering new questions, or with new algorithms.
How is this discovery process supported today? Through cyberinfrastructure: cyberinfrastructure that supports on-demand knowledge discovery, automated experiment management (data and workflow), data protection, and automated data product provenance tracking.

CyberInfrastructure: framework for discovery
Plug-and-play data sources and analysis tools. Complex what-if scenarios. Provided through:
- User portal
- Personal metadata catalog of data exploration results
- Data product index/catalog
- Data provenance service
- Workflow engine and composition tools
Tied together with an Internet-scale event bus. Results publishable to a digital library.

Cyberinfrastructure for computing: the DSI DataCenter. Supports analysis, use, visualization, and search research. Supports multiple datasets.

Distributed services provide functional capability.

Vision for Data Handling
- Capturing metadata about data sets as they are generated is key. Syntactic: file size, date of creation. Semantic or domain-specific: spatial region, logical time.
- The context of a file is a key search parameter.
- Provenance, the history of a data product, is needed to assess quality.
- The volume of data used in computational science is too large for hand curation: manage it on behalf of the user.
- Indexes help efficiency.
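
A minimal sketch of the capture idea in Python, with hypothetical file and attribute names: the syntactic metadata comes from the file system, the semantic metadata only the generating application can supply.

```python
import os
from datetime import datetime, timezone

def capture_metadata(path, domain_attrs):
    """Capture syntactic metadata from the file system and merge in
    semantic, domain-specific attributes supplied by the application."""
    stat = os.stat(path)
    return {
        # Syntactic: derivable from the file itself
        "file_size_bytes": stat.st_size,
        "created": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        # Semantic: only the generating application knows these
        **domain_attrs,
    }

# Hypothetical forecast output file and domain attributes
record = capture_metadata(
    "wrf_forecast_d01.nc",
    {"spatial_region": (35.0, -98.0, 37.0, -95.0),  # lat/lon bounding box
     "logical_time": "2007-05-29T18:00Z"},          # forecast valid time
)
```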

The Realization in Software (architecture diagram): the user's browser talks to a portal server; behind it sit a data catalog service, the myLEAD user metadata catalog and myLEAD agent service, a data management service, a workflow engine driven by a workflow graph, a provenance collection service, and an application factory, all tied together by an event notification bus over data storage, application services, and a compute engine.

Infrastructure is portal based; that is, all services are available through a web server.

e-Science Gateway Architecture (diagram): a grid portal server and the user's grid desktop sit atop gateway services (proxy certificate server/vault, events & messaging, resource broker, community & user metadata catalog, workflow engine, resource registry, application deployment), which in turn use core grid services (execution management, information services, self management, data services, resource management, security services) over resource virtualization (OGSA) of compute resources, data resources, and instruments & sensors. [1]
[1] Service Oriented Architectures for Science Gateways on Grid Systems, Gannon, D., et al., ICSOC, 2005.

LEAD-CI Cyberinfrastructure: workflows run on the LEADgrid and on TeraGrid. The portal and persistent back-end web services run on the LEADgrid. Data storage resources for user-generated data products are provided by Indiana University.

A typical weather forecast runs as a workflow (diagram): pre-processing (arpssfc, arpstrn, Ext2arps-ibc, Ext2arps-lbc, 88d2arps, mci2arps, nids2arps, drawing on terrain data files, surface data files, ETA/RUC/GFS data, level II and level III radar data, satellite data, and surface, upper air, mesonet & wind profiler data), assimilation (ADAS), forecast (arps2wrf, WRF), and visualization (wrf2arps, arpsplot, IDV). ~400 data products are consumed, produced, and transformed during the workflow lifecycle.
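
To make the stage structure concrete, a hedged Python sketch of the pipeline as a simple staged graph; the component names come from the slide, but the wiring of the ~400 individual data products is elided and the edges shown are illustrative only.

```python
# Stages run in order; every component in a stage may consume
# products of the previous stage (the real edges are finer-grained).
FORECAST_WORKFLOW = [
    ("pre-processing", ["arpssfc", "arpstrn", "Ext2arps-ibc", "Ext2arps-lbc",
                        "88d2arps", "mci2arps", "nids2arps"]),
    ("assimilation",   ["ADAS"]),
    ("forecast",       ["arps2wrf", "WRF"]),
    ("visualization",  ["wrf2arps", "arpsplot", "IDV"]),
]

def run(workflow, execute):
    """execute(component, inputs) -> list of output data products."""
    products = []  # external inputs (terrain, radar, ...) elided
    for stage, components in workflow:
        products = [p for c in components for p in execute(c, products)]
    return products
```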

To set up a workflow experiment, we select a workflow (not shown), then set model parameters here.

Supported community data collections.

Data Integration (diagram): data sources at sites in Oklahoma, Indiana, and Colorado include a CASA radar collection (months via ftp, latest 3 days), the Unidata IDD distribution of level II and III radar data, latest 3 days (XML web server), and ETA, NCEP, NAM, METAR, etc. (XML web server).
- Local view: a crosswalk point of presence at each source supports crawling and publishes a difference list as LEAD Metadata Schema (LMS) documents.
- Globally integrated view: the Data Catalog Service. A crawler crawls the catalogs and builds an index of the results (XMLDB native XML database, with Lucene for the index). A web service API answers Boolean search queries with spatial/temporal support, returning a list of results as LEAD Metadata Schema documents.
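
A hedged sketch of what a client-side Boolean query with spatial/temporal constraints might look like; the method and field names here are hypothetical, and the real service exchanges LMS documents over a web service API.

```python
from datetime import datetime

# Hypothetical query structure: Boolean AND of attribute,
# spatial, and temporal constraints.
query = {
    "all_of": [
        {"attribute": "product_type", "equals": "level II radar"},
        {"spatial": {"bbox": [33.0, -100.0, 37.0, -94.0]}},  # lat/lon box
        {"temporal": {"start": datetime(2007, 3, 27, 13, 0),
                      "end": datetime(2007, 3, 27, 18, 0)}},
    ],
}

def search(catalog_client, query):
    """Returns matching results as LEAD Metadata Schema documents."""
    return catalog_client.boolean_search(query)  # hypothetical API call
```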

LEAD Personal Workspace
- Cyberinfrastructure extends the user's desktop to incorporate a vast data analysis space.
- As users carry out scientific experiments, the CI manages back-end storage and compute resources.
- The portal provides ways to explore this data and to search and discover it.
- Metadata about experiments is largely automatically generated, and highly searchable: it describes the data object (the file) in application-rich terms, and provides a URI to a data service that can resolve an abstract unique identifier to a real, on-line data "file".

Searching for experiments using model configuration parameters: 2 attributes selected.

Searching for experiments based on model parameters: 4 experiments returned; one displayed.

How forecast model configuration parameters are stored in the personal catalog: the forecast model configuration file is handed off to a plugin that shreds the XML document into queryable attributes associated with the experiment.
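
A minimal sketch of that shredding step, assuming a flat XML configuration document; element tags become attribute names and element text becomes values. The config fragment is invented for illustration.

```python
import xml.etree.ElementTree as ET

def shred(xml_text, experiment_id):
    """Flatten an XML configuration document into (experiment,
    attribute, value) rows that a catalog can index and query."""
    root = ET.fromstring(xml_text)
    return [(experiment_id, elem.tag, elem.text.strip())
            for elem in root.iter()
            if elem.text and elem.text.strip()]

config = """<wrf_config>
  <grid_spacing_km>2</grid_spacing_km>
  <forecast_hours>6</forecast_hours>
  <microphysics>lin</microphysics>
</wrf_config>"""

for row in shred(config, "exp-42"):
    print(row)  # ('exp-42', 'grid_spacing_km', '2'), ...
```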

What & Why of Provenance
- Derivation history of a data product: what application created the data (and when, and where), its parameters & configuration, and the other input data used by the application.
- A workflow is composed from building blocks like these, so provenance for the data used in a workflow gives a workflow trace.
Example (diagram): Application A reads Data.In.1, Data.In.2, and Config.A, and writes Data.Out.1. Provenance::Data.Out.1 — Process: Application_A; Timestamp: T12:45:23; Host: tyr20.cs.indiana.edu; Input: Data.In.1, Data.In.2; Config: Config.A.
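
The record on the slide maps naturally onto a small data structure; a sketch using the slide's field names, with types assumed.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProvenanceRecord:
    """Derivation history of one output data product."""
    product: str               # e.g. "Data.Out.1"
    process: str               # the application that created it
    timestamp: str             # when it ran
    host: str                  # where it ran
    inputs: List[str] = field(default_factory=list)  # other data used
    config: str = ""           # parameters & configuration

rec = ProvenanceRecord(product="Data.Out.1", process="Application_A",
                       timestamp="T12:45:23", host="tyr20.cs.indiana.edu",
                       inputs=["Data.In.1", "Data.In.2"], config="Config.A")
# Chaining records like this across applications yields the workflow trace.
```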

The What & Why of Provenance
- Trace workflow execution: What services were used during workflow execution? Were all steps of the execution successful?
- Audit trail: What resources were used during workflow execution?
- Data quality & reuse: What applications were used to derive data products? Which workflows use a certain data product?
- Attribution: Who performed the experiment? Who owns the workflow & data products?
- Discovery: Locate data generated by a workflow. Locate workflows containing App-X that succeeded.

Karma Provenance Service (diagram): a workflow instance of ten services, each consuming and producing ten data products, is driven by the workflow engine; the engine and services publish provenance activities as notifications on the message bus (WS-Eventing, with the WS-Messenger notification broker): Workflow-Started & -Finished, Application-Started & -Finished, and Data-Produced & -Consumed activities. In the collection framework, a provenance listener subscribes and listens to the activity notifications and stores them in an activity DB; a provenance query API lets a provenance browser client query for workflow, process, & data provenance. [A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al., ICWS Conference, 2006]

Generating Karma Provenance Activities
- Instrument applications to publish provenance.
- A simple Java library is available to create provenance activities and publish activities as messages.
- Jython "wrapper" scripts use the library to publish provenance & invoke the application.
- The Generic Factory toolkit easily converts applications to web services, with built-in provenance instrumentation.

Sample Sequence of Activities
appStarted( App1 )
info( 'App1 starting' )
fileReceiveStarted( File1 )
-- do gridftp get to stage input file File1 --
fileReceiveFinished( File1 )
fileConsumed( File1 )
computationStarted( Code1 )
-- call Fortran code Code1 to process input files --
computationFinished( Code1 )
fileProduced( File2 )
fileSendStarted( File2 )
-- do gridftp put to save output file File2 --
fileSendFinished( File2 )
publishURL( File2 )
appFinishedSuccess( App1, File2 ) | appFinishedFailed( App1, ERR )
flush()
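
A hedged Python sketch of the wrapper pattern this sequence implies; the karma object is a stand-in for the real provenance client (the actual wrappers are Jython scripts over the Java library), and stage_in/compute/stage_out are hypothetical callables.

```python
def run_with_provenance(karma, app, code, infile, outfile,
                        stage_in, compute, stage_out):
    """Run one application, emitting the slide's activity sequence
    around each phase via the (stand-in) karma provenance client."""
    karma.appStarted(app)
    karma.info("%s starting" % app)
    try:
        karma.fileReceiveStarted(infile)
        stage_in(infile)                      # gridftp get, staging input
        karma.fileReceiveFinished(infile)
        karma.fileConsumed(infile)

        karma.computationStarted(code)
        compute(infile, outfile)              # e.g. the Fortran code
        karma.computationFinished(code)
        karma.fileProduced(outfile)

        karma.fileSendStarted(outfile)
        stage_out(outfile)                    # gridftp put, saving output
        karma.fileSendFinished(outfile)
        karma.publishURL(outfile)
        karma.appFinishedSuccess(app, outfile)
    except Exception as err:
        karma.appFinishedFailed(app, str(err))
    finally:
        karma.flush()
```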

Performance perturbation.

Standalone tool for provenance collection and experience reuse: a future direction.

Forecast start time can also be set to trigger on severe weather conditions (not shown here).

Weather-triggered workflows
The goal is cyberinfrastructure that allows scientists and students to run weather models dynamically and adaptively in response to weather events. This is accomplished by coupling events processing with triggered forecast workflows; Vijayakumar et al. (2006) presented a framework for this purpose:
- The events-processing system does temporal and spatial filtering.
- A storm detection algorithm (SDA) detects storm events in the remaining streams and returns the detected storm events.
- The events-processing system generates a trigger to the workflow engine.
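
A minimal sketch of that coupling, with hypothetical callables standing in for the filtering stage, the SDA, and the workflow engine.

```python
def on_radar_scan(scan, filters, detect_storms, trigger_workflow):
    """One step of the coupling: filter the stream, run the storm
    detection algorithm, and trigger the forecast workflow."""
    if not all(f(scan) for f in filters):   # temporal & spatial filtering
        return
    events = detect_storms(scan)            # SDA over the remaining stream
    if events:
        trigger_workflow("forecast", storm_events=events)  # to the engine
```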

Continuous stream mining (diagram: query computation nodes and data generation sources across the LEAD grid)
- In stream mining of weather, the events of interest are anomalies.
- Event-processing queries can be deployed to sites in the LEAD grid.
- Data streams are delivered to each site through the Unidata Internet Data Dissemination (IDD) system.
- CEP enables real-time response to the weather.

Example CEP query
- A scientist sets up a 6-hour weather forecast over a region, say a 700 sq. mile bounding box, and submits a workflow that will run sometime in the future.
- The CEP query detects severe storm conditions developing in the region.
- The forecast workflow is started at a future point in time as determined by the CEP query.

Stream Provenance Tracking
- Data stream provenance: the derivation history of a data product, where the data product is a derived, time-bounded stream.
- Stream provenance can establish correlations between significant events (e.g., storm occurrences).
- Anticipate resource needs by examining provenance data and discovering trends in weather forecast model output: determine when the next wave of users will arrive, and where their resources might need to be allocated.

Stream processing as part of cyberinfrastructure (diagram: the Calder stream mining service joins the event notification bus alongside the portal server, data catalog service, myLEAD user metadata catalog and agent service, data management service, and workflow engine; mining queries run over NEXRAD streams from Doppler radars)
- SQL-based queries respond to input streams event-by-event within a stream and concurrently across streams.
- Each query generates a time-bounded output stream.
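
A sketch of the query model in plain Python, with events as dictionaries carrying a timestamp "t"; Calder's real queries are SQL, compiled and distributed by its planner (next slide), so this only illustrates the event-by-event, time-bounded semantics.

```python
def continuous_query(stream, predicate, window_s):
    """Respond to events one at a time; emit a time-bounded output
    stream (one list of matching events per window)."""
    window, window_end = [], None
    for event in stream:                # events assumed ordered by "t"
        if window_end is None:
            window_end = event["t"] + window_s
        elif event["t"] >= window_end:  # window closes: emit and reset
            yield window
            window, window_end = [], event["t"] + window_s
        if predicate(event):            # the SQL WHERE clause, in effect
            window.append(event)
    if window:
        yield window

# e.g. reflectivity above 50 dBZ inside a latitude band, 15-min windows
severe = lambda e: e["dbz"] > 50 and 33.0 <= e["lat"] <= 37.0
# for batch in continuous_query(radar_stream, severe, window_s=900): ...
```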

Provenance Service in Calder (diagram): a user query arrives; the planner service obtains the continuous query, compiles the SQL to a TCL query, distributes the query, creates a ring buffer, and deploys the queries to the query execution engines on the computational mesh; a rowset service aggregates the derived streams and sets up a buffer to aggregate results. The provenance service receives updates (stream rates, approximations, etc.; query start/stop and distribution plan changes; and results, if any) via WS-Messenger notifications alongside Calder's internal messaging, processes the updates, and stores them in a DB.

Provenance Update Handling Scalability
- Update processing time: the time taken from the instant a user sends a notification to the instant the provenance service completes the corresponding update.
- Experiment: bombard the provenance service at different update rates by simulating many clients sending provenance updates simultaneously; measure the incoming rate at the provenance service and the overall time taken to handle each update. The overhead includes the time to create the message, send and receive it through WS-Messenger, process it, and store it in the DB.
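
A sketch of the experiment's measurement loop, with a hypothetical send_update standing in for the client library; the measured time spans message creation, WS-Messenger transport, processing, and the DB store.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def bombard(send_update, n_clients, updates_per_client):
    """Simulate many concurrent clients; return mean update handling time."""
    def client(cid):
        times = []
        for seq in range(updates_per_client):
            t0 = time.perf_counter()
            send_update({"client": cid, "seq": seq})  # create, send, store
            times.append(time.perf_counter() - t0)
        return times
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        samples = [t for ts in pool.map(client, range(n_clients)) for t in ts]
    return sum(samples) / len(samples)
```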

Problem: severe weather can bring many storms over a local region of interest, and it is infeasible and unnecessary to run the weather model in response to each of them.
Solution: group storm events into spatial clusters, and trigger model runs in response to clusters of storms.

Spatial Clustering: DBSCAN algorithm*
DBSCAN is a density-based clustering algorithm; it can do spatial clustering when the location parameters are treated as features. The algorithm has two parameters:
- ε: the radius within which a point is considered to be a neighbor of another point
- minPt: the minimum number of neighboring points a point must have to be considered a core point
Together, the two parameters determine the clustering result.
* Mining work done by Xiang Li, University of Alabama in Huntsville
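
For illustration only (the project's mining was done by Xiang Li at UAH), a sketch using scikit-learn's DBSCAN, whose eps and min_samples arguments correspond to ε and minPt; the storm event coordinates are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical storm event locations as (lat, lon) pairs
events = np.array([[35.2, -97.4], [35.3, -97.5], [35.1, -97.3],
                   [39.0, -86.5], [39.1, -86.6]])

# eps plays the role of ε (here in degrees), min_samples of minPt
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(events)
print(labels)  # e.g. [0 0 0 1 1]: two clusters -> two triggered model runs
# a label of -1 would mark noise: isolated storms that trigger nothing
```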

Data
- WSR-88D radar data from 3/27/2007, from a total of 134 radar sites covering CONUS.
- The time period examined is 1:00 pm to 6:00 pm EST; the 5-hour period is divided into 20 intervals of 15 min each.
- Storm events within the same time interval are clustered. (Figure: storm events detected at 1:00 pm - 1:15 pm.)
* Mining work done by Xiang Li, University of Alabama in Huntsville

Algorithm comparison: DBSCAN and k-means
Number of clusters: 3. Time period: 1:00 pm - 1:15 pm. (Figures: k-means result and DBSCAN result.)
Conclusion: the DBSCAN algorithm performs better than the k-means algorithm.

Future Work
- Publication of provenance to a digital library
- Generalized support for metadata systems
- Enhanced support for mining triggers
- Personal weather predictor: the LEAD framework packaged onto a single 8-16 core multicore machine; expands educational opportunities (suitable for small schools) and engages communities beyond meteorologists

Thank you for your interest. Thanks to my many domain science and CS collaborators, to my students, and to the funding agencies. Please feel free to contact me at