Enabling Ad Hoc Queries over Low-Level Scientific Data Sets
David Chiu and Gagan Agrawal
Data Grid Research Group
Dept. of Computer Science and Engineering
The Ohio State University
Columbus, Ohio 43210, USA
Enabling Ad Hoc Queries over Low Level Scientific Data Sets
SSDBM ’09: New Orleans, LA. Jun 2-4
Scientific Data Sets
The collection of scientific data has increased tremendously over the years with new instruments, simulations, etc.
Data sets are stored in repositories around the globe.
Just within U.S. entities in the geospatial domain:
‣ NOAA: oceanic, climate, water quality,...
‣ NASA: ozone, air quality, tropical,...
‣ NRCS: land quality, watershed,...
Data Repositories
‣ Mass Storage Systems (MSS)
‣ Web or Data Grid Infrastructure
Scientific Data Sets
Data sets are typically low level, i.e.,
‣ Unstructured or semi-structured
However, the data is well documented
‣ Accompanying XML-based metadata describing data sets is typically required in today’s repositories
Data Repositories
‣ Mass Storage Systems (MSS)
‣ Grid/Web Services & portals
‣ Web or Data Grid Infrastructure
Data Repositories on a Global Scale
US, EU, AU,...
What Do the Users Want?
(Repositories: US, EU, AU,...)
High level queries...
- Keywords
- Natural language
Don’t just give me the data, but...
- Transform it
- Manipulate it
- Compose it with other processes and data sets
And do this with the least amount of work required from me!
System Goals
To enable queries over low level data sets, which involves:
‣ identification of relevant data sets
‣ automatic planning for the composition of dependent services (processes) for derivation
...while being non-intrusive to existing schemes, i.e., the system
‣ avoids imposing a standardized format for storing data sets
‣ accommodates heterogeneous metadata
‣ fits into existing MSS and scientific computing infrastructures (Data Grid & the Web)
Challenges
That’s good and all, but not without challenges...
‣ supporting high level user queries
‣ dealing with metadata from multiple entities
‣ efficiently identifying relevant data sets
‣ planning and executing accurate service compositions on the spot
And without question, the need for DOMAIN KNOWLEDGE & SEMANTICS
Proposed System Overview
The Semantics Layer
A Need for Domain Level Knowledge
Assume the following service retrieves a satellite image pertaining to (x, y) at the resolution given by r:
    get_sat_image(double x, double y, double r)
    inputsTo: longitude, latitude, grid_size
    outputsTo: satellite image
Questions to ask the system:
‣ How to deduce that this service can be used?
‣ How to determine what information is needed for input?
‣ Did the user provide enough information to invoke this service?
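The three questions above can be made concrete with a small sketch. It assumes, hypothetically, that each service is described by a record tying its parameters to domain concepts; the ServiceDescription class and the helper names are illustrative and not part of the presented system. The pairing of x, y, r with longitude, latitude, grid_size is read off the slide's diagram in order and is itself an assumption.

```python
from dataclasses import dataclass


@dataclass
class ServiceDescription:
    """Hypothetical record tying a service's parameters to domain concepts."""
    name: str
    inputs: dict   # parameter name -> domain concept it expects
    output: str    # domain concept the service produces


# Assumed pairing: x -> longitude, y -> latitude, r -> grid_size.
GET_SAT_IMAGE = ServiceDescription(
    name="get_sat_image",
    inputs={"x": "longitude", "y": "latitude", "r": "grid_size"},
    output="satellite image",
)


def can_derive(service, wanted_concept):
    """Question 1: can this service be used to obtain the wanted concept?"""
    return service.output == wanted_concept


def missing_concepts(service, provided):
    """Questions 2 and 3: which required input concepts were not supplied?"""
    return sorted(set(service.inputs.values()) - set(provided))


# The user supplied a location but no resolution.
user_values = {"longitude": -83.0, "latitude": 40.0}
print(can_derive(GET_SAT_IMAGE, "satellite image"))
print(missing_concepts(GET_SAT_IMAGE, user_values))
```

An empty list from missing_concepts would mean the user provided enough information to invoke the service.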
In the Semantics Layer
Applying Domain Information
‣ Domain concepts can be derived from executing a service
‣ Domain concepts can also be derived from retrieving an existing data set
‣ Service parameters represent different domain concepts
Data Registration Service
Indexing Data Sets
Handling heterogeneous metadata
For instance, just within the geospatial domain:
    Country    Metadata Standards
    US         CSDGM
    AU, NZ     ANZLIC
    EU         ???
    CDN        ???
    ...
Data Registration Service
Indexing Data Sets
Metadata to DB transformations...
‣ transform to spatial index
‣ insert
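As an illustration of the metadata-to-DB step, here is a minimal sketch using Python's standard xml.etree.ElementTree and sqlite3 modules. The element names (westbc, eastbc, northbc, southbc) follow CSDGM's bounding-coordinate elements, but the sample record, the file path, the table layout, and the function names are all hypothetical. A plain range comparison stands in for the spatial index mentioned on the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed metadata record in the style of CSDGM.
SAMPLE_METADATA = """
<metadata>
  <idinfo>
    <citation><citeinfo><title>Example coastline survey</title></citeinfo></citation>
    <spdom>
      <bounding>
        <westbc>-83.5</westbc>
        <eastbc>-82.5</eastbc>
        <northbc>42.0</northbc>
        <southbc>41.0</southbc>
      </bounding>
    </spdom>
  </idinfo>
</metadata>
"""


def extract_record(xml_text, location):
    """Pull the title and bounding box out of one metadata document."""
    root = ET.fromstring(xml_text.strip())
    box = root.find("./idinfo/spdom/bounding")
    return (
        root.findtext("./idinfo/citation/citeinfo/title"),
        location,
        float(box.findtext("westbc")),
        float(box.findtext("eastbc")),
        float(box.findtext("northbc")),
        float(box.findtext("southbc")),
    )


def register(conn, record):
    """Insert one extracted record into the (assumed) datasets table."""
    conn.execute("INSERT INTO datasets VALUES (?, ?, ?, ?, ?, ?)", record)


def datasets_covering(conn, lon, lat):
    """Return locations of data sets whose bounding box contains the point."""
    rows = conn.execute(
        "SELECT location FROM datasets "
        "WHERE west <= ? AND east >= ? AND south <= ? AND north >= ?",
        (lon, lon, lat, lat),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets "
    "(title TEXT, location TEXT, west REAL, east REAL, north REAL, south REAL)"
)
register(conn, extract_record(SAMPLE_METADATA, "/hypothetical/path/coastline.dat"))
print(datasets_covering(conn, -83.0, 41.5))
```

Supporting a second metadata standard would, under this sketch, only require another extract function that yields the same tuple shape.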
Indexing Services
Services (inputs, outputs) are also registered in much the same way
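One way to picture service registration is an index keyed by the concept a service produces, so that a lookup by wanted concept returns candidate services. The ServiceIndex class below is a hypothetical sketch, not the system's actual registry.

```python
from collections import defaultdict


class ServiceIndex:
    """Hypothetical registry: output concept -> services that produce it."""

    def __init__(self):
        self._by_output = defaultdict(list)

    def register(self, name, inputs, output):
        # Store the service name with the concepts its inputs represent.
        self._by_output[output].append((name, tuple(inputs)))

    def producers_of(self, concept):
        # Return a copy so callers cannot mutate the index.
        return list(self._by_output.get(concept, []))


index = ServiceIndex()
index.register(
    "get_sat_image",
    ["longitude", "latitude", "grid_size"],
    "satellite image",
)
print(index.producers_of("satellite image"))
```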
System Overview
Supporting High Level Queries
In supporting high level queries, recall our ontology for modeling domain semantics
‣ Entire system is domain-concept-driven
‣ So, we should decompose queries into concepts first
Supporting High Level Queries
Original query:
‣ “return water level from station=32125 on 10/31/2008”
The elements of our query have been parsed against the ontology
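A rough sketch of what decomposing this query into concepts could look like follows. The vocabulary in CONCEPT_TERMS, the decompose function, and the month/day/year reading of the date are assumptions for illustration; the actual system parses against its ontology, which is not reproduced here.

```python
import re
from datetime import datetime

# Hypothetical vocabulary: phrases a user may type -> ontology concept names.
CONCEPT_TERMS = {
    "water level": "water_level",
    "station": "station",
}


def decompose(query):
    """Split a query into the concept asked for and the values supplied."""
    text = query.lower()

    # Target concept: the known phrase that follows "return".
    target = None
    for term, concept in CONCEPT_TERMS.items():
        if re.search(r"return\s+" + re.escape(term), text):
            target = concept

    # Supplied values: key=value pairs whose key is a known concept term.
    given = {}
    for key, value in re.findall(r"(\w+)\s*=\s*(\w+)", text):
        if key in CONCEPT_TERMS:
            given[CONCEPT_TERMS[key]] = value

    # A bare date, assumed to be written month/day/year.
    date_match = re.search(r"\b(\d{1,2}/\d{1,2}/\d{4})\b", text)
    if date_match:
        given["date"] = datetime.strptime(date_match.group(1), "%m/%d/%Y").date()

    return target, given


QUERY = "return water level from station=32125 on 10/31/2008"
target, given = decompose(QUERY)
print(target, given)
```

The target concept then becomes the starting point for planning, and the supplied values constrain which data sets are relevant.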
Proposed System Overview
The Planning Layer
Service Composition: An Example
A subset of the ontology (unrolled)
The Planning Layer
Service Composition
begin compSrvc(concept, Q[...])
    W := ()
    // perform DFS starting from concept
    let v := concept be the currently visited node
    if v is a data type then
        W := (W, index.getData(v, Q))
    else // v is a service
        let (p_1, ..., p_n) be v’s params
        // recursive call on each p_i
        W := (W, (v, compSrvc(p_1, Q), ..., compSrvc(p_n, Q)))
    end if
    return W
end
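The pseudocode above can be sketched as runnable Python under a few assumptions. The ontology fragment in SERVICE_PARAMS, the DATA_INDEX contents, and all service, data type, and path names are hypothetical. get_data stands in for index.getData(v, Q), and the accumulator W is dropped because each call visits a single node. The sketch assumes the ontology fragment is acyclic and does no cycle detection.

```python
# Hypothetical ontology fragment: service node -> its parameter nodes.
SERVICE_PARAMS = {
    "extract_water_level": ["interpolate_readings", "station_location"],
    "interpolate_readings": ["gauge_readings"],
}

# Hypothetical index of registered data sets, keyed by data type.
DATA_INDEX = {
    "gauge_readings": [
        {"station": "32125", "path": "/hypothetical/gauges/32125.dat"},
        {"station": "40001", "path": "/hypothetical/gauges/40001.dat"},
    ],
    "station_location": [
        {"station": "32125", "path": "/hypothetical/stations/32125.xml"},
    ],
}


def get_data(data_type, query):
    """Stand-in for index.getData(v, Q): paths of matching data sets."""
    return [
        entry["path"]
        for entry in DATA_INDEX.get(data_type, [])
        if entry["station"] == query["station"]
    ]


def comp_srvc(node, query):
    """Depth-first composition: a service expands into its params' plans."""
    if node in SERVICE_PARAMS:
        # v is a service: recurse on each parameter, keeping their order.
        return (node, *(comp_srvc(p, query) for p in SERVICE_PARAMS[node]))
    # v is a data type: look up registered data sets for the query.
    return get_data(node, query)


plan = comp_srvc("extract_water_level", {"station": "32125"})
print(plan)
```

The nested tuple mirrors the shape of a derived execution plan: each service is paired with the sub-plans or data sets that feed its parameters.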
The Planning Layer
Service Composition: An Example
Ontology (unrolled)
A Derived Execution Plan
This is what data registration provides
Planning Times
Conclusion
Our system...
‣ proposes to unify heterogeneous metadata
‣ extracts certain metadata attributes and indexes low level data sets and services for fast access from distributed repositories
‣ automatically composes these services and data sets to answer user queries
Questions - Comments?
‣ David Chiu
‣ Gagan Agrawal