CYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Towards a Generic Framework for Semantic Data Registration and Integration in Geosciences
Kai Lin, Chaitan Baru
San Diego Supercomputer Center
University of California, San Diego

Data Integration Goal
Query heterogeneous data sources as a single resource
- Query: pose queries instead of writing a program (“ad hoc, non-procedural query languages”)
- Heterogeneous: each local resource controls the definition of its own data
- Single resource: remove the burden of accessing each data source individually

Data Integration Challenges: Heterogeneities
Syntactic heterogeneity
- Heterogeneous data formats, e.g. the same date written in another notation vs. 02/04/04
Structural heterogeneity
- Heterogeneous data models and schemas, e.g. the same value saved as three columns or as one column
Semantic heterogeneity
- Fuzzy metadata, terminology, “hidden” semantics, implicit assumptions
GEON solution
- Data must first be semantically registered to GEON
- Heterogeneities are resolved by registration

Levels of Registration
Metadata-level registration
- Register metadata associated with a resource
- Submit the required metadata, which has predefined semantics
“Item”-level registration
- Register the “schema” of a resource, e.g. relational databases, shapefiles, …
- Record the semantics of schema elements, e.g. table names, column names
“Item-detail”-level registration
- Register individual values in a dataset
- Record the semantics of each item in a record or column

Registering Structured Data
- Relational databases
- Shapefiles → database tables
- Excel spreadsheets → database tables
- Delimited ASCII files → database tables
- Headers of scientific data files, e.g. netCDF

Item Level Database Registration and Access
[Figure: original database (tables, views); “select tables and views to register”; published database (table definitions, view definitions); GEON Mediator; GEON JDBC Driver; application]

How to Connect to GEON Databases
- Download the GEON JDBC Driver
- Use the following code to create a connection:

    import java.sql.Connection;
    import java.sql.DriverManager;

    // load the driver
    Class.forName("org.geongrid.jdbc.driver.Driver");
    // set the mediator URL
    String url = "jdbc:geon://geon01.sdsc.edu:2532/GEON-63cb404c d9-a69f";
    // open the connection
    Connection conn = DriverManager.getConnection(url, "geonuser", "geongrid");

- The URL consists of the GEON JDBC protocol (jdbc:geon), the host name and port number of the GEON Mediator (geon01.sdsc.edu:2532), and the GEON ID (the final path segment)
- Note: the original account information is not accessible to end users

GEON Mediator Enables Write Protection
- The mediator only accepts SELECT statements
- It rejects any request other than SELECT
[Figure: an UPDATE on B is sent to the mediator; labels: Mediator, Database, A, B, C, B]

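The SELECT-only rule can be illustrated with a small gate placed in front of the database. This is a minimal sketch under stated assumptions, not the GEON Mediator's actual implementation: the class SelectOnlyFilter and the method isAccepted are hypothetical names, and a real mediator would presumably parse the SQL instead of inspecting its first keyword.

```java
import java.util.Locale;

/** Hypothetical sketch of a SELECT-only gate; not the GEON Mediator's code. */
final class SelectOnlyFilter {

    /** Returns true only for a single statement whose first keyword is SELECT. */
    static boolean isAccepted(String sql) {
        if (sql == null) {
            return false;
        }
        String s = sql.trim();
        // Drop one trailing semicolon, then refuse anything that still contains
        // one, so "SELECT ...; UPDATE ..." cannot pass as a single request.
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        if (s.isEmpty() || s.contains(";")) {
            return false;
        }
        String firstWord = s.split("\\s+", 2)[0].toUpperCase(Locale.ROOT);
        return firstWord.equals("SELECT");
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("SELECT * FROM B"));    // true
        System.out.println(isAccepted("UPDATE B SET x = 1")); // false
    }
}
```

The check is deliberately conservative: a SELECT whose string literal contains a semicolon is also refused, which errs on the side of protecting the backend.
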
Read Protection for Unregistered Tables and Views
- An unregistered table or view is invisible to end users
- Its data cannot be viewed with a SELECT statement
- Its schema cannot be fetched
[Figure: SELECT * FROM A is sent to the mediator; labels: Mediator, Database, A, B, C, B]

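One way to picture read protection is a catalog that only knows the registered tables and views, so anything else can neither be queried nor listed. The sketch below is illustrative only: PublishedCatalog and its methods are hypothetical names, and case-insensitive table names are an assumption made for the example.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical catalog of registered tables and views. */
final class PublishedCatalog {
    private final Set<String> registered = new TreeSet<>();

    /** Publishes a table or view so that end users can see it. */
    void register(String tableOrView) {
        registered.add(normalize(tableOrView));
    }

    /** A name that was never registered is treated as if it did not exist. */
    boolean isVisible(String name) {
        return registered.contains(normalize(name));
    }

    /** Schema listing: only registered names are ever returned. */
    Set<String> listSchema() {
        return Collections.unmodifiableSet(registered);
    }

    // Assumption for this sketch: table names are compared case-insensitively.
    private static String normalize(String name) {
        return name.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        PublishedCatalog catalog = new PublishedCatalog();
        catalog.register("B");                      // only B is published
        System.out.println(catalog.isVisible("A")); // false
        System.out.println(catalog.isVisible("B")); // true
    }
}
```
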
GEON Database Integration
The GEON Mediator supports integration at three levels:
Level 1: Federation-Based Integration
- End users need to be knowledgeable about each database
Level 2: View-Based Integration
- End users see “integrated views”
- An intermediary designs these views
Level 3: Ontology-Based Integration
- End users can query using familiar concepts
- Requires middleware and a formal representation of domain knowledge

Level 1: Federation-Based Integration
- Use SQL to query the federated database, e.g. SELECT * FROM A, E WHERE ……
- Users must resolve structural and semantic heterogeneity themselves
[Figure: backend tables A to G appear unchanged in the GEON Mediator]

Level 2: View-Based Integration
- Views can be defined on top of the federated databases
- The original backend schemas can be hidden
- Integration results can be shared and reused
- Example query: SELECT * FROM V, W WHERE ……
[Figure: views V and W in the GEON Mediator, above backend tables A to G]

Level 3: Ontology-Based Integration
- Requires ontology annotations for the backend databases
- Uses a simple ontology query language to query the integrated database
- End users do not need to know the backend schemas or local semantics
[Figure: an ontology-based query sent to the GEON Mediator, above backend tables A to G]

GEON Ontology Based Data Integration
Ontology Enabled Semantic Integration
Challenges for Computer Scientists and Domain Scientists
- Computer Scientists: build an integration system based on the ontological registration of datasets
- Domain Scientists: create domain ontologies
- Data Providers: register datasets to ontologies
[Figure: dataset1, dataset2, dataset3 and dataset4 registered to Ontology1, Ontology2 and Ontology3]

Ontological Data Registration for Data Integration
- Registering a dataset to an ontology for data integration is a procedure that generates a partial model of the ontology from the dataset itself
- Not all the constraints in the ontology are satisfied by the generated individuals
[Figure: registration maps a dataset to individuals of an ontology]

Registering Relational Tables to Ontology Classes
- Associate one or more columns, under an optional SQL condition, with a selected class in the ontology
- Provide a mapping method if no explicit names of individuals should be generated
Example: columns Latitude and Longitude registered to the class Location
- Location (23.5, 47.9) is the name of an individual of the class Location
- The same name indicates the same location
[Figure: a table with columns RockSample and GeologicAge (values such as Jurassic/Triassic and Precambrian) next to the class GeologicalAge, shown with Precambrian, Cenozoic and Paleozoic]

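The naming rule in the Location example (an individual is named by its class and its column values, and equal names denote the same individual) can be sketched as follows. This is an illustration only: IndividualNamer, nameOf and individuals are hypothetical names, and keeping column values as strings is a simplification chosen for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: derive individual names from registered column values. */
final class IndividualNamer {

    /** Builds a name such as "Location (23.5, 47.9)" from a class name and column values. */
    static String nameOf(String className, List<String> columnValues) {
        return className + " (" + String.join(", ", columnValues) + ")";
    }

    /** Maps each row to an individual name; rows with equal values yield one individual. */
    static Set<String> individuals(String className, List<List<String>> rows) {
        Set<String> names = new LinkedHashSet<>();
        for (List<String> row : rows) {
            names.add(nameOf(className, row));
        }
        return names;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("23.5", "47.9"),
                Arrays.asList("10.0", "20.0"),
                Arrays.asList("23.5", "47.9")); // same location as the first row
        // Three rows, but only two distinct Location individuals.
        System.out.println(individuals("Location", rows));
    }
}
```
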
Registering Relational Tables to Ontology Object Properties
- Associate two entities that are already registered to the domain class and the range class of a selected object property in the ontology
[Figure: table columns RockSampleID and PERIOD linked to the object property hasAge between the classes Rock and GeologicAge]

ODAL and SOQL
- ODAL (Ontological Database Annotation Language): register items and item details to an ontology
- SOQL (Simple Ontology Query Language): express user queries

ODAL (Ontological Database Annotation Language)

    <odal:NamedIndividuals odal:id="RockSample" odal:database="VTDatabase">
      [child elements with the values Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry, Images, ssID]
    </odal:NamedIndividuals>

- The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry and Images represent instances of RockSample
- Creates a partial model of ontologies from databases
- Independent of the end interface
- Independent of specific database implementations
- The ODAL mapping is itself a “first-class” object
[Figure: a GUI generates ODAL, which is passed to the ODAL processor]

ODAL: Import Ontologies
The ontologies used for annotating a database can be imported as follows:

    <odal:ODAL xmlns:rdf="…" xmlns:owl="…" xmlns:odal="…">
      ……

ODAL: Database Connection Declaration
The target databases for annotation are declared as follows:

    <odal:ODAL xmlns:rdf="…" xmlns:owl="…" xmlns:odal="…">
      ……
      [database declaration with the values Oracle, oracle.sdsc.edu, 3456, Publications]
      ……

ODAL: Simple Named Individuals

    <odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
      [child elements with the values Collections, book-price, ISBN]
    </odal:NamedIndividuals>

- Suppose the Book ontology contains a class Book, and the schema Collections contains a table book-price with a column ISBN
- odal:id gives a name to the declaration and represents the set of individuals generated by the statement
- The statement says that each value in the column ISBN represents a Book individual

ODAL: Named Individuals from Multiple Columns

    <odal:NamedIndividuals …>
      [child elements with the values California, Rock-Sample, Latitude, Longitude]
    </odal:NamedIndividuals>

- Suppose an ontology contains a class Location and a database contains a table Rock-Sample with two columns, Latitude and Longitude
- The statement says that each pair of latitude and longitude values gives a location

ODAL: Named Individuals with Conditions
[ODAL fragment with the values employee and EmployeeId and a condition whose text is not preserved]
- A condition in an odal:Condition element must be a boolean expression that is valid in the WHERE clause of an SQL query

ODAL: Data Type Property Declaration
[Figure: a table person with columns SSN and age; Person individuals identified by ssn; datatype property hasAge from Person to double]

Conditions for Joining Individuals from Different Resources
- To join data across independent resources, we need to know the correspondence between entities
- For example, does “10001” represent the same rock in two resources? By default, we assume it does not
- A set of datatype properties can be declared as a key for a class in the ontology; joins across multiple resources are based on keys
- E.g. {hasLatitude, hasLongitude} can be declared as a key of Location: two locations from different resources are the same if they have the same latitude and longitude
[Figure: class Rock, with a column RockSampleID in one resource and a column RockID in another]

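The key rule can be sketched as a hash join over the declared key properties. This is a minimal illustration, not the mediator's join algorithm: KeyJoin, individual, keyOf and join are hypothetical names, individuals are modelled as plain property maps, and key values are compared with exact equality, which is a simplification for floating-point coordinates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: join individuals from two resources on a declared key. */
final class KeyJoin {

    /** Convenience builder: individual("hasLatitude", 23.5, "hasLongitude", 47.9). */
    static Map<String, Object> individual(Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put((String) kv[i], kv[i + 1]);
        }
        return m;
    }

    /** Key values in declared order, or null if any key property is missing. */
    static List<Object> keyOf(Map<String, Object> ind, List<String> keyProps) {
        List<Object> key = new ArrayList<>();
        for (String p : keyProps) {
            Object v = ind.get(p);
            if (v == null) {
                return null; // no key, so the individual matches nothing
            }
            key.add(v);
        }
        return key;
    }

    /** Pairs individuals whose key property values are all equal. */
    static List<List<Map<String, Object>>> join(List<Map<String, Object>> left,
                                                List<Map<String, Object>> right,
                                                List<String> keyProps) {
        Map<List<Object>, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> r : right) {
            List<Object> k = keyOf(r, keyProps);
            if (k != null) {
                index.computeIfAbsent(k, x -> new ArrayList<>()).add(r);
            }
        }
        List<List<Map<String, Object>>> pairs = new ArrayList<>();
        for (Map<String, Object> l : left) {
            List<Object> k = keyOf(l, keyProps);
            if (k == null) {
                continue;
            }
            for (Map<String, Object> r : index.getOrDefault(k, Collections.emptyList())) {
                pairs.add(Arrays.asList(l, r));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> key = Arrays.asList("hasLatitude", "hasLongitude");
        List<Map<String, Object>> a = Arrays.asList(
                individual("id", "10001", "hasLatitude", 23.5, "hasLongitude", 47.9));
        List<Map<String, Object>> b = Arrays.asList(
                individual("id", "10001", "hasLatitude", 1.0, "hasLongitude", 2.0),
                individual("id", "77", "hasLatitude", 23.5, "hasLongitude", 47.9));
        // The shared local id "10001" does not match; the shared key values do.
        System.out.println(join(a, b, key));
    }
}
```
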
SOQL (Simple Ontology Query Language)
- Query single or integrated resources via ontologies (i.e., high-level logical views), independent of the schema-level representation

    SELECT X.location.*
    FROM RockSample X
    WHERE X.location.lat > 60
      AND X.location.long > 100
      AND X.hasSiO2.value < 30
      AND X.hasSiO2.unit = 'weightPercentage'

[Figure: ontology fragment with the classes RockSample, Location and ValueWithUnit, the properties location, hasSiO2, lat, long, value and unit, and the datatypes float and string]
[Figure: a GUI generates SOQL, which is passed to the SOQL processor]

The Architecture of GEON Semantic Mediator
[Figure: architecture diagram with the following components]
- Clients: portal or application, connected through the GUI (SOQL) or the Mediator JDBC Driver
- SOQL Processor: SOQL Parser, Semantic Query Rewriter, Ontology Reasoner; uses OWL and ODAL (ODAL Processor); produces spatial SQL against federated schemas
- SQL processing: SQL Parser, Query Planning, Query Optimization, Query Execution, Internal Database
- Backends: Oracle, DB2, MySQL, SQL Server, PostgreSQL, PostGIS

Question: find all seismic stations within 1 mile of railroads

SOQL query (GEON SOQL GUI):

    SELECT X.code, X.location.*
    FROM SeismicStation X, Railroad Y
    WHERE distance(X.location, Y.geometry) < 1

SQL produced by the SOQL Processor:

    SELECT X2.stationcode, X2.lat, X2.lon
    FROM railroads_of_the_united_states X1, stationdatatable X2
    WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1

Execution by the Mediator:
- Railroad shapefile: SELECT X1.the_geom FROM railroads X1
- Seismic Stations Schema: SELECT X2.stationcode, X2.lat, X2.lon FROM stationdatatable X2 WHERE bounding box condition
- Mediator: distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
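
The split between a cheap bounding box condition at the source and an exact distance test at the mediator can be sketched as below. This is only an illustration of the filter-then-refine idea, not GEON's query plan: BoundingBoxPrefilter and its methods are hypothetical names, coordinates are assumed to be planar and in the same unit as the distance threshold, and a railroad is simplified to a single polyline of {x, y} vertices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: bounding-box prefilter followed by an exact distance test. */
final class BoundingBoxPrefilter {

    /** Distance from point (px, py) to the segment (ax, ay)-(bx, by). */
    static double pointToSegment(double px, double py,
                                 double ax, double ay, double bx, double by) {
        double dx = bx - ax, dy = by - ay;
        double len2 = dx * dx + dy * dy;
        // Projection parameter of the point onto the segment, clamped to [0, 1].
        double t = len2 == 0 ? 0 : ((px - ax) * dx + (py - ay) * dy) / len2;
        t = Math.max(0, Math.min(1, t));
        return Math.hypot(px - (ax + t * dx), py - (ay + t * dy));
    }

    /** Distance from a point to a polyline given as {x, y} vertices. */
    static double pointToLine(double[] p, double[][] line) {
        if (line.length == 1) {
            return Math.hypot(p[0] - line[0][0], p[1] - line[0][1]);
        }
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < line.length; i++) {
            best = Math.min(best, pointToSegment(p[0], p[1],
                    line[i][0], line[i][1], line[i + 1][0], line[i + 1][1]));
        }
        return best;
    }

    /** Returns the indexes of the stations closer than maxDist to the line. */
    static List<Integer> stationsNear(double[][] stations, double[][] line, double maxDist) {
        // Step 1: bounding box of the line, expanded by maxDist on every side.
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] v : line) {
            minX = Math.min(minX, v[0]);
            maxX = Math.max(maxX, v[0]);
            minY = Math.min(minY, v[1]);
            maxY = Math.max(maxY, v[1]);
        }
        minX -= maxDist;
        minY -= maxDist;
        maxX += maxDist;
        maxY += maxDist;

        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < stations.length; i++) {
            double[] s = stations[i];
            // Step 2: cheap prefilter, the kind of condition a source could evaluate.
            if (s[0] < minX || s[0] > maxX || s[1] < minY || s[1] > maxY) {
                continue;
            }
            // Step 3: exact distance test on the surviving candidates.
            if (pointToLine(s, line) < maxDist) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        double[][] railroad = {{0, 0}, {10, 0}};
        double[][] stations = {{5, 0.5}, {5, 2}, {20, 0}, {10.5, 0}, {10.9, 0.9}};
        System.out.println(stationsNear(stations, railroad, 1.0)); // [0, 3]
    }
}
```

The last station in the usage example passes the bounding box but fails the exact test, which is why the distance condition still has to be evaluated after the prefilter.
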