1 Artemis: Integrating Scientific Data on the Grid Rattapoom Tuchinda Snehal Thakkar Yolanda Gil Ewa Deelman.

Presentation transcript:

1 Artemis: Integrating Scientific Data on the Grid
Rattapoom Tuchinda, Snehal Thakkar, Yolanda Gil, Ewa Deelman

2 Outline
- Motivation
  - Data integration needs in scientific applications
  - Distributed computing in grids
- Problem statement
- Artemis architecture
- Evaluation
- Related Work
- Conclusions and future work

3 Scientific Data Integration
- Large-scale, cross-disciplinary scientific data collection, storage, and analysis exacerbates heterogeneity and dynamics
  - National Virtual Observatory (NVO)
  - Earth System Grid (ESG)

4 Grid Computing [Foster & Kesselman 04]
- Grids provide middleware services for distributed computing:
  - Seamless integration and management of resources – OGSA
  - Job submission and execution management – Condor
  - Resource availability & performance – Monitoring and Discovery Service (MDS)
  - Data replication for robustness and efficiency – Replica Location Service (RLS)
  - Descriptions of data sources – Metadata Catalog Service (MCS)
Figure (from [Kesselman 04]):
- Discovery: many sources of data, services, computation; registries (R) organize services of interest to a community
- Access: data integration activities may require access to, & exploration/analysis of, data at many locations; exploration & analysis may involve complex, multi-step workflows
- Resource management (RM) is needed to ensure progress & arbitrate competing demands
- Security service and policy service: security & policy must underlie access & management decisions

5 Scientific Data Storage and Access
- Data sources are very heterogeneous
  - Data that results from various instruments, disciplines, and types of analyses
  - Wide variety of data storage systems (files, DBs, servers, etc.)
- Data sources are highly distributed
  - Data stored in different locations on the grid
  - Data is replicated in multiple locations
- Data sources are highly dynamic
  - Data grows continuously, new data models are routine
  - New data sources regularly appear
  - Data sources may become unavailable sporadically
- Data available at unprecedented scale
  - Very soon petabytes
These challenges are in the way of scientific progress in many disciplines

6 Data Storage and Access in Grids
- Data described with metadata attributes
  - Attribute names may not be consistent across different sources
  - Metadata descriptions often stored separately from the data itself
- Metadata Catalog Service (MCS) [Moore et al 01, Singh et al 03]
  - Stores descriptive metadata and allows users to query based on desired attributes
  - Addresses heterogeneity of data source implementations and access

7 Sample Query
search constraints:
    keywords = "atmospheric data" or "climate data" or "climate model"
    model type = "CCSM" or "PCM"
    period = 2001
search results (files, collections, or views):
    /CCSM2/b20.007/atm
    /PCM/B06.62/atm
    /PCM/B06.20/atm
    /PCM/B06.21/atm
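The sample query above can be read as an attribute filter over metadata records. The following is a minimal sketch of that reading, not the MCS API: the in-memory `CATALOG`, the `search` function, the keyword and period values assigned to each record, and the two non-matching paths (`/PCM/B05.10/ocn`, `/CCSM2/b19.001/atm`) are all hypothetical and exist only for illustration.

```python
# Hypothetical in-memory metadata records. A real MCS is a remote service;
# the attribute values attached to each path here are invented for the sketch.
CATALOG = [
    {"path": "/CCSM2/b20.007/atm", "keywords": {"climate model"},
     "model_type": "CCSM", "period": 2001},
    {"path": "/PCM/B06.62/atm", "keywords": {"atmospheric data"},
     "model_type": "PCM", "period": 2001},
    {"path": "/PCM/B05.10/ocn", "keywords": {"ocean data"},
     "model_type": "PCM", "period": 2001},
    {"path": "/CCSM2/b19.001/atm", "keywords": {"climate data"},
     "model_type": "CCSM", "period": 1999},
]


def search(catalog, keywords, model_types, period):
    """Return paths of records matching any keyword, any model type, and the period."""
    return [
        rec["path"]
        for rec in catalog
        if rec["keywords"] & keywords          # at least one keyword in common
        and rec["model_type"] in model_types   # "CCSM" or "PCM"
        and rec["period"] == period
    ]


results = search(
    CATALOG,
    keywords={"atmospheric data", "climate data", "climate model"},
    model_types={"CCSM", "PCM"},
    period=2001,
)
print(results)
```

The "or" lists in the slide become set membership tests, and the separate constraint lines are combined with "and".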

8 Problem Statement
- Users should have seamless single point access
  - Should not have to formulate a different query for each source
  - Should not manage the unavailability of data sources
- Users need assistance formulating the queries
  - Data models may have different attribute names and representations (even from the same source)
  - New data models/metadata attributes created all the time
[Figure: queries q1, q2, q3 are sent to MCS1, MCS2, MCS3, which describe DB1, DB2, DB3. Attribute names shown for the catalogs: stime, etime; starttime, endtime; descr, sub. A source is marked "currently unavailable".]

9 Artemis
- A mixed-initiative data integration system that:
  - Shields users from diversity in attribute representations
  - Helps users formulate queries step by step
  - Manages the access and availability of dynamic collections of data sources
- Integrates and extends various AI techniques:
  - Data Integration
  - Ontology
  - Dialogue wizards

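10 Approach
[Figure: source attributes shown are stime, etime, …; starttime, endtime, …; description, subject. An ontology relates stime and starttime to "Start time" and etime and endtime to "End time", both under "Time". The Query Formulation Wizard builds constraints over ontology terms (Start time > … ^ End time < …; the values did not survive extraction). The Query Mediator sits above Metadata Catalog1, Metadata Catalog2, and Metadata Catalog3, each shown with a Data Source.]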
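The idea on this slide, that an ontology term stands in for several source-specific attribute names, can be sketched as a lookup table. This is an illustrative sketch only: `ONTOLOGY`, `CATALOG_ATTRS`, and `translate` are hypothetical names, and the assignment of attribute names to particular catalogs is an assumption, since the slide does not state which catalog uses which names.

```python
# Ontology term -> attribute names that express it (names taken from the slide).
ONTOLOGY = {
    "Start time": {"stime", "starttime"},
    "End time": {"etime", "endtime"},
}

# ASSUMPTION: which attributes each catalog exposes. The slide does not say
# which catalog uses which spelling; this split is for illustration only.
CATALOG_ATTRS = {
    "MCS1": {"stime", "etime"},
    "MCS2": {"starttime", "endtime"},
    "MCS3": {"description", "subject"},
}


def translate(term, op, value, catalog):
    """Rewrite an ontology-level filter using the catalog's own attribute name.

    Returns None when the catalog has no attribute for the ontology term.
    """
    local = ONTOLOGY[term] & CATALOG_ATTRS[catalog]
    if not local:
        return None
    # sorted() keeps the choice deterministic if several names match.
    return (sorted(local)[0], op, value)


print(translate("Start time", ">", 50000, "MCS1"))
print(translate("End time", "<", 60000, "MCS2"))
print(translate("Start time", ">", 50000, "MCS3"))
```

With this table in place, the user states one filter on "Start time" and each catalog receives it under its own attribute name.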

11 Artemis Architecture
[Figure: architecture diagram. Components shown: MCS Wizard (entity selection, filters), Ontology, Model Mappings, Dynamic Model Generator, Models, Prometheus Query Mediator, three Metadata Catalog Services, Data Source.]

12 MCS Wizard
- Based on the Agent Wizard [Tuchinda 2003]
- Domain experts create mappings between ontologies and metadata attributes
  - Users can then pick the ontology and the mappings relevant to their domain
- Guides the user through available operations and filters consistent with the models of the data

13 Prometheus Query Mediator
- Data integration system from earlier research [Thakkar et al. 2004] [Knoblock et al 2003]
  - Provides unified query interface to a wide variety of data sources
  - Relational model
  - Requires pre-defined domain model relating sources to domain relations
- Extended in Artemis to support:
  - Source relations: various MCSs
  - Domain relations: File, View, Collection
  - Dynamic domain model based on availability of data sources

14 Dynamic Model Generation
- Generate mediator model dynamically by querying MCSs
  - Convert object-oriented model of MCSs to relational model of the mediator
  - Handles dynamic nature of data by generating new domain models at query time
- Intuitive idea
  - Query MCSs one at a time for all possible attributes of different objects
  - Create domain relation for each object type with all possible attributes
  - Create rules defining each MCS as data source
  - Relate various data sources to domain relations

15 Dynamic Model Generator (Cont’d)
Example
- MCS 1:
    File1(starttime, endtime, frequency)
    File2(starttime, endtime, frequency, amplitude)
- MCS 2:
    File3(starttime, endtime, lat, lon, temp)
    File4(starttime, endtime, lat, lon, windspeed)
- Domain relation:
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name)
- Source relations:
    MCS1File(starttime, endtime, frequency, amplitude, name)
    MCS2File(starttime, endtime, lat, lon, temp, windspeed, name)
- Domain rules:
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS1File(starttime, endtime, frequency, amplitude, name) ^
        (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
        (frequency = '') ^ (amplitude = '')
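The generation steps and the example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Artemis implementation: `generate_model` and `EXAMPLE` are hypothetical names, the catalogs are plain dictionaries rather than live MCS queries, only the File object type is handled, and every source relation is assumed to carry a trailing `name` attribute as in the slide.

```python
def union(attr_lists):
    """Union of attribute lists, keeping first-seen order."""
    seen = []
    for attrs in attr_lists:
        for a in attrs:
            if a not in seen:
                seen.append(a)
    return seen


def generate_model(catalogs):
    """Build a domain relation, source relations, and domain rules.

    catalogs: {mcs_name: [attribute list per file object]} (hypothetical input
    standing in for the answers obtained by querying each MCS).
    """
    # One source relation per MCS: all attributes it knows, plus the file name.
    sources = {mcs: union(objs) + ["name"] for mcs, objs in catalogs.items()}
    # Domain relation: every attribute from every source, plus the file name.
    domain = union(attrs[:-1] for attrs in sources.values()) + ["name"]
    rules = []
    for mcs, attrs in sources.items():
        # Attributes a source lacks are bound to the empty string in its rule.
        missing = [a for a in domain if a not in attrs]
        body = [f"{mcs}File({', '.join(attrs)})"]
        body += [f"({a} = '')" for a in missing]
        rules.append(f"File({', '.join(domain)}) :- " + " ^ ".join(body))
    return domain, sources, rules


EXAMPLE = {
    "MCS1": [["starttime", "endtime", "frequency"],
             ["starttime", "endtime", "frequency", "amplitude"]],
    "MCS2": [["starttime", "endtime", "lat", "lon", "temp"],
             ["starttime", "endtime", "lat", "lon", "windspeed"]],
}

domain, sources, rules = generate_model(EXAMPLE)
for rule in rules:
    print(rule)
```

Because the model is rebuilt from whatever the catalogs report, rerunning the function after a catalog gains an attribute or drops out yields a new domain relation without manual edits.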

16 Query Processing
- When Prometheus receives a query, it determines which MCSs are relevant
  - Relevant MCSs are determined by comparing the constraints of the query with the constraints of the MCSs
  - MCSs that do not satisfy the constraints of the query are not used in the query
  - For example, if the query asks for files that contain data for some lat, lon, then MCS1 is not queried
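A rough sketch of this source selection step follows. It makes a simplifying assumption that is not from the slides: a source is treated as irrelevant when the query places an order constraint on an attribute the source does not provide (its domain rule would bind that attribute to the empty string). The names `SOURCES` and `relevant_sources` are hypothetical, and the actual mediator compares constraints rather than only checking attribute presence.

```python
# Attributes each source relation provides (from the MCS1/MCS2 example).
SOURCES = {
    "MCS1": {"starttime", "endtime", "frequency", "amplitude", "name"},
    "MCS2": {"starttime", "endtime", "lat", "lon", "temp", "windspeed", "name"},
}


def relevant_sources(constraints, sources):
    """Return the sources that provide every attribute constrained by the query.

    constraints: list of (attribute, operator, value) tuples.
    SIMPLIFICATION: only attribute presence is checked, not the values.
    """
    constrained = {attr for attr, _op, _value in constraints}
    return [name for name, attrs in sources.items() if constrained <= attrs]


# A query that restricts lat, as in the slide's example.
query = [
    ("lat", ">", 33),
    ("lat", "<", 34),
    ("starttime", ">", 50000),
    ("endtime", "<", 60000),
]

selected = relevant_sources(query, SOURCES)
print(selected)
```

A query that constrains only starttime and endtime would select both catalogs, since both provide those attributes.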

17 Query Processing: Example
- Suppose the user uses the MCS Wizard to form the following query (the comparison operator on lon did not survive extraction):
    Q(name) :-
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) ^
        (lat > 33) ^ (lat < 34) ^ (lon … -119) ^
        (starttime > 50000) ^ (endtime < 60000)
- The Prometheus mediator would generate a datalog program with the query and domain rules:
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS1File(starttime, endtime, frequency, amplitude, name) ^
        (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
        (frequency = '') ^ (amplitude = '')

18 Query Processing: Example
- Query and datalog program: same as on slide 17
- The mediator determines that the first rule binds lat and lon to '' for MCS1, which is incompatible with the order constraints on lat and lon in the query, so only MCS2 is queried

19 Artemis: Top level Selection

20 Artemis: Filtering

21 Evaluation
- Enabled users to query 12 different MCSs
  - Covering information from three different applications: LIGO, ESG, and Geo-spatial data warehouse
  - Covering 17,000 different files
  - Metadata consisted of about 300 different attributes
- Simulated addition of metadata to MCSs and failure of several MCSs while system was running

22 Related Work
- MCS [Singh et al 03]
  - Organize metadata about objects on the data grid
  - Object-oriented schema to support user-defined metadata attributes
  - Difficult for users to keep track of diverse attribute names
  - No semantic information is attached to the attributes
- Agent Wizard [Tuchinda et al. 2003]
  - Interactive application that guides the user by dividing complex tasks into a series of simpler question-answering tasks
  - Challenge is to model a complex task as a set of simpler subtasks
- Prometheus Mediator [Thakkar et al. 2004]
  - Data integration system that can efficiently integrate data from a wide variety of data sources
  - Key restriction is that the relational schema for data sources and domain must be known in advance

23 Related Work (Cont’d)
- myGrid [Wroe 2003]
  - Model data sources as semantic web services
  - Integration of data sources is represented as a workflow
  - Requires that data sources have fixed schema and associated semantics
- Model-based mediator system for scientific data management [Ludascher 2003]
  - Data sources provide semantic information regarding their data
  - The provided information is used to generate domain model for a mediator system
  - Assumption is that semantic information is provided by different data sources of interest

24 Conclusions
- Contributions:
  - Mixed-initiative approach to help scientists query objects on the data grid
  - Isolate users from heterogeneity of data sources
  - Manage distributed dynamic data
- Future Work:
  - Algorithm to determine when to dynamically generate domain model
  - Better support for specifying model mappings
  - Artemis available as a grid service
  - More extensive testing and usability studies

25 ?