1 Dvoy Database Ideas

2 Heterogeneous to homogeneous
Homogenization by applying a uniform schema: the multidimensional data model.
User queries are directed toward the nDim data cubes, generally for extraction of data slices along specific dimensions. The slices represent data views, e.g. map view, time view.
–For relational data, homogenization is by adding view-specific queries
–For unstructured data, metadata is added to describe the structure, so the agent can retrieve the data for views

3 Multi-dimensional Data Access
In array notation, the granule ‘value’ is accessed as
–MyGranule = My1DArray(i)
–MyGranule = My2DArray(i,j)
–MyGranule = MynDArray(i,j,…,n)
In order to select a data granule, a controller (selector) is assigned to each data dimension:
–1D Dataset: selector i, e.g. Time
–2D Dataset: selectors i, j, e.g. Param & Time
–3D Dataset: selectors i, j, k, e.g. Param, Time & Location
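
A minimal sketch of granule selection with one selector per dimension. The dictionary `cube`, the function `select_granule` and the sample parameter, time and site names are illustrative assumptions, not part of Dvoy.

```python
# Hypothetical 3D dataset: one value per (parameter, time, location) index triple.
params = ["SO4", "NO3"]
times = ["2001-07-01", "2001-07-02"]
sites = ["SiteA", "SiteB"]

cube = {
    (i, j, k): 100 * i + 10 * j + k  # made-up values that encode their own indices
    for i in range(len(params))
    for j in range(len(times))
    for k in range(len(sites))
}


def select_granule(cube, i, j, k):
    """Return the granule addressed by one selector per dimension."""
    return cube[(i, j, k)]


# One controller (selector) per dimension: parameter i, time j, location k.
granule = select_granule(cube, i=1, j=0, k=1)
print(params[1], times[0], sites[1], granule)
```

A lower-dimensional dataset would use the same pattern with a shorter index tuple.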

4 Multi-Dimensional Data Model
Data can be distributed over 1, 2, …, n dimensions:
–1 Dimensional, e.g. Time
–2 Dimensional, e.g. Location & Time
–3 Dimensional, e.g. Location, Time & Parameter
Views are orthogonal slices through multidimensional data cubes. Spatial and temporal slices through the data are most common.
[Figure: a data granule at indices (i, j, k) in the data space, with View 1 and View 2 as slices]
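
The idea of views as orthogonal slices can be sketched as follows, assuming a small 3D data space held as nested Python lists. The function names `map_view` and `time_view` and all sample values are hypothetical.

```python
# Hypothetical 3D data space indexed as data[location][time][parameter].
locations = ["SiteA", "SiteB", "SiteC"]
times = ["day1", "day2"]
parameters = ["PM25", "O3"]

data = [
    [[100 * loc + 10 * t + p for p in range(len(parameters))]
     for t in range(len(times))]
    for loc in range(len(locations))
]


def map_view(data, t, p):
    """Spatial slice: fix time and parameter, vary location."""
    return [data[loc][t][p] for loc in range(len(data))]


def time_view(data, loc, p):
    """Temporal slice: fix location and parameter, vary time."""
    return [data[loc][t][p] for t in range(len(data[loc]))]


print(map_view(data, t=1, p=0))     # values at all locations for one time
print(time_view(data, loc=2, p=1))  # values at all times for one location
```

The two views share exactly one granule, the one where the fixed indices of both slices coincide.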

5 Database Schema Design
Fact Table: A fact table (yellow) contains the main data of interest, i.e. the pollutant concentration by location, day, pollutant and measurement method.
Star Schema: consists of a central fact table surrounded by de-normalized dimensional tables (blue) describing the sites, parameters, methods, etc.
Snowflake Schema: an extension of the star schema where each point of the star ‘explodes’ into further fully normalized tables, expanding the description of each dimension. A snowflake schema can capture all the key data content and relationships in full detail. It is well suited for capturing and encoding complex monitoring data into a robust relational database.
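
The difference between a de-normalized (star) and a normalized (snowflake) dimension can be illustrated with a small sketch using Python's sqlite3 module. All table names, column names and rows here are made up for illustration and are not taken from any of the databases discussed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: de-normalized dimension, network attributes repeated on every site row.
CREATE TABLE SiteStar (
    Site_ID INTEGER PRIMARY KEY, SiteName TEXT,
    NetworkName TEXT, NetworkAgency TEXT
);
-- Snowflake: the same dimension normalized into Site -> Network.
CREATE TABLE Network (
    Network_ID INTEGER PRIMARY KEY, NetworkName TEXT, NetworkAgency TEXT
);
CREATE TABLE SiteSnow (
    Site_ID INTEGER PRIMARY KEY, SiteName TEXT,
    Network_ID INTEGER REFERENCES Network(Network_ID)
);
INSERT INTO SiteStar VALUES (1, 'Site one', 'NetA', 'AgencyX');
INSERT INTO SiteStar VALUES (2, 'Site two', 'NetA', 'AgencyX');
INSERT INTO Network VALUES (10, 'NetA', 'AgencyX');
INSERT INTO SiteSnow VALUES (1, 'Site one', 10);
INSERT INTO SiteSnow VALUES (2, 'Site two', 10);
""")

# The star dimension is read directly; the snowflake needs one more join.
star = conn.execute(
    "SELECT Site_ID, SiteName, NetworkName, NetworkAgency "
    "FROM SiteStar ORDER BY Site_ID"
).fetchall()
snow = conn.execute(
    "SELECT s.Site_ID, s.SiteName, n.NetworkName, n.NetworkAgency "
    "FROM SiteSnow AS s JOIN Network AS n ON n.Network_ID = s.Network_ID "
    "ORDER BY s.Site_ID"
).fetchall()
print(star == snow)
```

Both forms return the same rows; the snowflake form stores each network attribute once, at the cost of an extra join per normalized branch.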

6 IMPROVE Relational Database Schema (Tentative)
For the IMPROVE data, an extended star schema captures the needed information. The dimensional tables (Site, Parameter, Method) are partly normalized (further broken into relational tables).

7 Snowflake Example: Central Calif. AQ Study, CCAQS
The CCAQS schema incorporates a rich set of parameters needed for QA/QC (e.g. sample tracking) as well as for data analysis. The fully relational CCAQS schema permits the enforcing of integrity constraints, and it has been demonstrated to be useful for data entry/verification.
However, no two snowflakes are identical. Similarly, the rich snowflake schema of one sampling/analysis environment cannot be easily transplanted elsewhere. More importantly, many of the recorded parameters ‘on the fringes’ are not particularly useful for integrative, cross-Supersite, regional analyses. Hence the shared (exposed) portion of the entire data set may consist of only a small subset of the ‘snowflake’.

8 Catalog of Multidimensional Datasets
Designed to Publish, Find (Select), Bind (Access) and Render distributed datasets.
Publishing is through an open web interface for user/broker registration of datasets.
Finding a dataset is aided by metadata on the Provider and Dataset.
Binding (data access) information is contained in the Dimensional Tables.
Rendering parameters are also contained in the Dimensional Tables.

9 Minimal Star Schema for Spatio-Temporal Data
The minimal Site table includes SiteID, Name and Lat/Lon.
The minimal Parameter table consists of ParameterID, Description and Unit.
The time dimensional table is usually skipped since time is self-describing.
The minimal Fact (Data) table consists of the Observation value and the three dimensional codes for DateTime, Site and Parameter.
The above minimal multidimensional schema was used in the CAPITA data exploration software, Voyager. In order to respond to data queries by time, location and parameter, the database has to have time, location and parameter as dimensions.
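
A minimal sketch of such a schema, written with Python's sqlite3 module. The exact table and column names, the data types and the sample rows are assumptions chosen for illustration; the slide only names the fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Site (
    SiteID INTEGER PRIMARY KEY,
    Name   TEXT NOT NULL,
    Lat    REAL NOT NULL,
    Lon    REAL NOT NULL
);
CREATE TABLE Parameter (
    ParameterID INTEGER PRIMARY KEY,
    Description TEXT NOT NULL,
    Unit        TEXT NOT NULL
);
-- Fact table: one observation per (DateTime, Site, Parameter).
-- No time dimension table: the ISO 8601 date string is self-describing.
CREATE TABLE Fact (
    DateTime    TEXT NOT NULL,
    SiteID      INTEGER NOT NULL REFERENCES Site(SiteID),
    ParameterID INTEGER NOT NULL REFERENCES Parameter(ParameterID),
    ObsValue    REAL,
    PRIMARY KEY (DateTime, SiteID, ParameterID)
);
""")
conn.execute("INSERT INTO Site VALUES (1, 'Example Site', 38.6, -90.2)")
conn.execute("INSERT INTO Parameter VALUES (1, 'Fine particle mass', 'ug/m3')")
conn.executemany(
    "INSERT INTO Fact VALUES (?, ?, ?, ?)",
    [("2001-07-01", 1, 1, 12.5), ("2001-07-02", 1, 1, 18.0)],
)

# Query by time, location and parameter: the three dimensions of the schema.
rows = conn.execute(
    """
    SELECT f.DateTime, s.Name, p.Description, f.ObsValue, p.Unit
    FROM Fact AS f
    JOIN Site AS s ON s.SiteID = f.SiteID
    JOIN Parameter AS p ON p.ParameterID = f.ParameterID
    WHERE f.DateTime BETWEEN ? AND ? AND s.SiteID = ? AND p.ParameterID = ?
    ORDER BY f.DateTime
    """,
    ("2001-07-01", "2001-07-02", 1, 1),
).fetchall()
print(rows)
```

Narrowing the date range to a single day gives a spatial (map) slice; fixing the site and widening the range gives a time series.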

10 From Heterogeneous to Homogeneous Schema
Individual Supersite SQL databases can be queried along spatial, temporal and parameter dimensions. However, the query to retrieve the same information depends on the schema of the particular database.
A way to homogenize the distributed data is to access all the data through a Data Adapter using only a subset of the tables/fields from any particular database (red).
The proposed extracted uniform (abstract) schema is the Minimal Star Schema (possibly expanded). The final form of the uniformly extracted data schema will be arrived at by consensus.
[Figure: Extraction of homogeneous data from heterogeneous sources. Labels: Subset used; Data Wrapper; Uniform Schema Fact Table]
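
The adapter idea can be sketched in a few lines: each source keeps its own field names, and a per-source field map exposes only the subset needed by the uniform schema. The source names, field names and records below are invented for illustration and do not describe any actual Supersite database.

```python
# Two hypothetical source databases holding the same kind of observation
# under different field names, each with extra fields of local interest.
source_a_rows = [
    {"sample_date": "2001-07-01", "station": "A1", "species": "SO4",
     "conc": 3.2, "lab_batch": 17},
]
source_b_rows = [
    {"obs_time": "2001-07-01", "site_code": "B7", "param": "SO4",
     "value": 2.9, "qc_flag": "V"},
]

# Per-source field maps: uniform field name -> source field name.
# Fields not listed here (lab_batch, qc_flag) are simply not exposed.
FIELD_MAPS = {
    "source_a": {"DateTime": "sample_date", "Site": "station",
                 "Parameter": "species", "Value": "conc"},
    "source_b": {"DateTime": "obs_time", "Site": "site_code",
                 "Parameter": "param", "Value": "value"},
}


def adapt(source_name, rows):
    """Project source rows onto the uniform (DateTime, Site, Parameter, Value) schema."""
    field_map = FIELD_MAPS[source_name]
    return [{uniform: row[src] for uniform, src in field_map.items()}
            for row in rows]


uniform_rows = adapt("source_a", source_a_rows) + adapt("source_b", source_b_rows)
for row in uniform_rows:
    print(row)
```

Adding a new source then means adding one field map, while the client code that consumes the uniform rows stays unchanged.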

11 Remote data access configuration for DATAFED
The connection between DATAFED and the remote server can be either through Web Services or a CGI call.
[Figure labels: Dvoy; SQL query template with [[param_list]] and [[filter]]; CGI or WS call; Firewall; Wrapper Software (fill template, call SQL, return data); SQL Server; Dataset as XML or CSV]

12 Spatial Query Template for DVoy
Items in red brackets [[item]] are replaced by the code at run time. The SiteTable and DataFactTable names are to be replaced by the actual table names. The JOIN can be an inner join for some tables.
    SELECT SiteTable.Longitude * (180.0/PI()) AS Lon,
           SiteTable.Latitude * (180.0/PI()) AS Lat,
           SiteTable.Site_ID,
           [[param_list]]
    FROM DataFactTable RIGHT OUTER(??) JOIN SiteTable
         ON DataFactTable.Site_ID = SiteTable.Site_ID
    WHERE [[filter]]
    ORDER BY SiteTable.Lon ASC
Default [[param_list]] (list of fields to be extracted from the table DataFactTable):
    [[param_abbr]] AS VALUE
Default [[filter]] (for a spatial query, datetime_min = datetime_max):
    (DataFactTable.Date_Time BETWEEN '[[datetime_min]]' AND '[[datetime_max]]')
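
A minimal sketch of how such a template might be filled at run time. The template text follows the slide, with the author's "(??)" note on the join type dropped; the functions `checked_values` and `fill_template`, the validation rules and the parameter name "SO4f" are assumptions for illustration, not the actual wrapper code.

```python
import re
from datetime import date

SPATIAL_QUERY_TEMPLATE = (
    "SELECT SiteTable.Longitude * (180.0/PI()) AS Lon, "
    "SiteTable.Latitude * (180.0/PI()) AS Lat, "
    "SiteTable.Site_ID, [[param_list]] "
    "FROM DataFactTable RIGHT OUTER JOIN SiteTable "
    "ON DataFactTable.Site_ID = SiteTable.Site_ID "
    "WHERE [[filter]] ORDER BY SiteTable.Lon ASC"
)

# Default expansions from the slide; they contain further placeholders.
DEFAULTS = {
    "param_list": "[[param_abbr]] AS VALUE",
    "filter": "(DataFactTable.Date_Time BETWEEN "
              "'[[datetime_min]]' AND '[[datetime_max]]')",
}

PLACEHOLDER = re.compile(r"\[\[(\w+)\]\]")


def checked_values(param_abbr, day):
    """Validate user input before it is pasted into SQL text (assumed rules)."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", param_abbr):
        raise ValueError("unexpected parameter name")
    iso_day = date.fromisoformat(day).isoformat()  # rejects malformed dates
    # For a spatial query, datetime_min equals datetime_max.
    return {"param_abbr": param_abbr,
            "datetime_min": iso_day, "datetime_max": iso_day}


def fill_template(template, values, max_passes=5):
    """Replace [[name]] placeholders repeatedly until none are left."""
    lookup = {**DEFAULTS, **values}
    for _ in range(max_passes):
        filled = PLACEHOLDER.sub(lambda m: lookup[m.group(1)], template)
        if filled == template:
            return filled
        template = filled
    raise ValueError("placeholders did not resolve")


sql = fill_template(SPATIAL_QUERY_TEMPLATE, checked_values("SO4f", "2001-07-01"))
print(sql)
```

Because the values are substituted as text rather than bound as query parameters, the sketch restricts them to an identifier pattern and a parsed date before substitution.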

13 Purpose of the Supersite Relational Database System
Design, populate and maintain a database which:
–Includes monitoring data from Supersites and auxiliary projects
–Facilitates cross-Supersite [regional or comparative] data analyses
–Supports the analyses by a variety of research groups

14 Stated Features of Relational Data System
Data Input:
–Data input electronically through FTP, Web browser (CD, if necessary)
–Modest amount of metadata on sites, instruments, data sources/version, contacts etc.
–Data structures, formats and submission procedures simple for the submitters
Data Storage and Maintenance:
–Data stored in relational database(s), possibly distributed over multiple servers
–Maintenance of data holdings catalog and request logs
–Data updates quarterly
Data Access:
–Access method: User-friendly web access by multiple authorized users
–Data finding: Metadata catalog of datasets
–Data query: by parameter, method, location, date/time, or other metadata
–Data output format: ASCII, spreadsheet, other (dbf, XML)

15 Snowflake Example: Central Calif. AQ Study, CCAQS
The CCAQS provides a good example of a snowflake relational schema. It incorporates a rich set of parameters needed for QA/QC as well as for data analysis.
The ‘branches’ and ‘leaves’ contain tables for QA/QC and for detailed dimensional metadata.
The fully relational snowflake schema permits the enforcing of integrity constraints.
However, it is hard to apply an elaborate snowflake schema to other sampling/analysis conditions.

16 Data Preparation Procedures:
Data gathering, QA/QC and standard formatting are to be done by individual projects. The data exchange standards, data ingest and archives are by ORNL and NASA.
Data ingest is to be automated, aided by tools and procedures supplied by this project:
–NARSTO DES-SQL translator
–Web submission tools and procedures
–Metadata Catalog and I/O facilities
Data submissions and access will be password protected as set by the community. Submitted data will be retained in a temporary buffer space and, following verification, transferred to the shared SQL database. The data access, submissions etc. will be automatically recorded and summarized in human-readable reports.
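
The buffer, verify and transfer sequence can be sketched as below. The verification rule, the field names and the in-memory lists standing in for the buffer space, the shared database and the log are all hypothetical; the slide does not specify them.

```python
REQUIRED_FIELDS = ("DateTime", "Site", "Parameter", "Value")


def verify(record):
    """Hypothetical check: all uniform fields present and the value numeric."""
    return (all(name in record for name in REQUIRED_FIELDS)
            and isinstance(record["Value"], (int, float)))


def ingest(buffer, shared_db, log):
    """Move verified records from the buffer to the shared store, logging each decision."""
    rejected = []
    for record in buffer:
        if verify(record):
            shared_db.append(record)
            log.append(("accepted", record.get("Site")))
        else:
            rejected.append(record)
            log.append(("rejected", record.get("Site")))
    buffer[:] = rejected  # rejected records stay in the buffer for correction


good = {"DateTime": "2001-07-01", "Site": "A1", "Parameter": "SO4", "Value": 3.2}
bad = {"DateTime": "2001-07-01", "Site": "B7", "Parameter": "SO4"}  # no Value

buffer, shared_db, log = [good, bad], [], []
ingest(buffer, shared_db, log)
print(len(shared_db), len(buffer), log)
```

The log list is the raw material from which a human-readable submission report could be summarized.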

17 Database Schema Design for the Federated Data Warehouse Nov 27, 2001, RBH

18 Data Warehouse Features
As much as possible, data should reside in their respective home environment. ‘Uprooted’ data in decoupled databases tend to decay, i.e. they cannot be easily updated, maintained or enriched.
Abstract (universal) query/retrieval facilitates integration and comparison along the key dimensions (space, time, parameter, method).
The open architecture data warehouse (based on Web Services) promotes the building of further value chains: Data Viewers, Data Integration Programs, Automatic Report Generators etc.

19 Supersite Relational Database System (SRDS)
Rudolf Husar, PI, Center for Air Pollution Impact and Trend Analysis (CAPITA), Washington University, St. Louis, MO
Proposal Presentation to the Supersite Program, Nov 30, 2001
A sub-project of the St. Louis Midwest Supersite Project, Jay Turner, PI
[Figure labels: SQL Data Servers; Subset used; Data Adapter; Uniform Schema Fact Table; User]

20 Purpose of the Project:
Design, Populate and Maintain a Supersite Relational Database System
–Facilitate cross-Supersite, regional, comparative data analyses
–Support analyses by a variety of research groups
–Include monitoring data from Supersites and auxiliary projects

21 EPA Specs of the Supersite Relational Data System (from RFP)
Data Input:
–Data input electronically
–Modest amount of metadata on sites, instruments, data sources/version, contacts etc.
–Simple data structures, formats and convenient submission procedures
Data Storage and Maintenance:
–Data storage in relational database(s), possibly distributed over multiple servers
–A catalog of data holdings and request logs
–Supersite data updates quarterly
Data Access:
–User-friendly web access by multiple authorized users
–Data query by parameter, method, location, date/time, or other metadata
–Multiple data output formats (ASCII, spreadsheet, other (dbf, XML))

22 General Approach to SRDS Design Based on consensus, adopt a uniform relational data structure, suitable for regional and cross-Supersite data integration and analysis. We propose a star schema with spatial, temporal, parameter and method dimensions. The ‘original’ data are to be maintained at the respective providers or custodians (Supersites, CIRA, CAPITA...). We propose the creation of flexible ‘adapters’ and web-submission forms for the transfer of data subsets into the uniformly formatted ‘Federated Data Warehouse’. Data users would access the data warehouse manually or through software. We propose data access using modern ‘web services’ protocol, suitable for adding data viewers, processors (filtering, aggregation and fusion) and other value-adding processes.

23 From Heterogeneous to Homogeneous Schema
Individual SQL databases have varied designs, usually following a more elaborate ‘snowflake’ pattern (see Database Schema Design for the Federated Data Warehouse). Though they have more complicated schemata, these Supersite SQL servers can be queried along spatial, temporal, parameter and method dimensions. However, the query to retrieve the same information depends on the particular database schema.
A way to homogenize the distributed data is by accessing all the data through a Data Wrapper which accesses only a subset of the tables/fields from any particular database (shown red in schemata below).
The proposed extracted uniform (abstract) schema is the Minimal Star Schema (possibly expanded). The final form of the uniformly extracted data schema will be arrived at by Supersite consensus.
[Figure: Extraction of homogeneous data from heterogeneous sources. Labels: Subset used; Data Adapter; Uniform Schema Fact Table]

24 On-line Analytical Processing: OLAP
A multidimensional data model making it easy to select, navigate, integrate and explore the data.
An analytical query language providing power to filter, aggregate and merge data as well as explore complex data relationships.
Ability to create calculated variables from expressions based on other variables in the database.
Pre-calculation of frequently queried aggregated values, e.g. monthly averages, enables fast response time to ad hoc queries.

25 Fast Analysis of Shared Multidimensional Information (FASMI) (Pendse, N. “The OLAP Report”)
An OLAP system is characterized as:
being Fast – The system is designed to deliver relevant data to users quickly and efficiently; suitable for ‘real-time’ analysis.
facilitating Analysis – The capability to have users extract not only “raw” data but also data that they “calculate” on the fly.
being Shared – The data and its access are distributed.
being Multidimensional – The key feature. The system provides a multidimensional view of the data.
exchanging Information – The ability to disseminate large quantities of various forms of data and information.

26 Multi-Dimensional Data Cubes
Multi-dimensional data models use inherent relationships in data to populate multidimensional matrices called data cubes. A cube's data can be queried using any combination of dimensions.
Hierarchical data structures are created by aggregating the data along successively larger ranges of a given dimension, e.g. the time dimension can contain the aggregates year, season, month and day.
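
A minimal sketch of aggregation along the time hierarchy, assuming daily values and a three-month season definition (DJF, MAM, JJA, SON) in which December is grouped with its own calendar year. The function names, the season convention and the sample values are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

# Assumed season convention: three-month meteorological seasons.
SEASON_OF_MONTH = {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
                   6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"}


def time_key(day, level):
    """Map an ISO date string to its bucket at the given hierarchy level."""
    d = date.fromisoformat(day)
    if level == "year":
        return (d.year,)
    if level == "season":
        return (d.year, SEASON_OF_MONTH[d.month])
    if level == "month":
        return (d.year, d.month)
    if level == "day":
        return (d.year, d.month, d.day)
    raise ValueError(f"unknown level: {level}")


def roll_up(daily_values, level):
    """Average the daily values within each bucket of the chosen level."""
    groups = defaultdict(list)
    for day, value in daily_values:
        groups[time_key(day, level)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


daily_values = [("2001-06-30", 4.0), ("2001-07-01", 6.0),
                ("2001-07-02", 8.0), ("2001-09-01", 2.0)]

for level in ("month", "season", "year"):
    print(level, roll_up(daily_values, level))
```

Storing the output of `roll_up` for the commonly requested levels is one way to obtain the pre-calculated aggregates that OLAP systems rely on for fast ad hoc queries.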

27 Data Measure, Data Points and Data Cube
Measure: A measure (in OLAP terminology) represents numerical values for a specific entity to be analyzed (e.g. temperature, wind speed, pollutant). A collection of measures forms a special dimension, ‘Measures’ (??Can Measures be Dimensions??).
Data Granules: A measure has a set of discrete data granules, atomic data entities that cannot be further broken down. All data points in a measure represent the same measured parameter, e.g. temperature. Hence, they share the same units and dimensionality.
Conceptual Data Cube: The data points of a measure are enclosed in a conceptual multidimensional data cube; each data point occupies a volume (slice or point) in the data cube. Data points in a measure share the same dimensions; conversely, each data point has to have the dimensional coordinates in the data cube of the measure that it belongs to.
[Figure: conceptual data cube of a measure with axes Dimension X, Dimension Y and Dimension Z, containing data granules]

28 DVoy Data Space
[Figure: data space for a measure on axes X, Y, Z, bounded by Xmin to Xmax, Ymin to Ymax and Zmin to Zmax, with data views as slices. Labels: Measure Data Space; View Data Space; Data Granule]

29 [Figure labels: View Data Space; Data Granule; Measure Data Space]

30 DVoy: Components and Image Data Flow
Dvoy Components: Distributed Data, Image Registration, Data Catalog, Data Query, Image Delivery, Image Viewer
[Figure labels: Distributed Image Data (Measure: Elevation; Measure: TOMS; Measure: SeaWiFS; via HTTP, FTP, Web Service); Image Registration; Data Catalog; Data Query for a Measure (default cube X, Y, Z, T); Image Delivery Web Service; Image Data Browser (XY MAP: Z, T fixed)]

31 Distributed Voyager: Safe Connection for Remote Data (SQL) Servers R. Husar and K. Hoijarvi CAPITA October 24, 2003

32 DVoy Data Space
[Figure labels: Data Set; Data Instance]

33 Dvoy: Components and Image Data Flow
Dvoy Components: Distributed Data, Image Registration, Data Catalog, Data Query, Image Delivery, Image Viewer
[Figure labels: Distributed Image Data (Measure: Elevation; Measure: TOMS; Measure: SeaWiFS); Image Registration; Data Catalog; Data Query (Data Selection: Measure, X, Y, Z, T); Image Delivery Web Service; Image Data Browser (XY MAP: Z, T fixed)]

34 DVoy Data Space, DataSet, Data
The parameter (or ‘variable’, or the OLAP ‘measure’) is NOT a dimension. Rather, each parameter covers a multidimensional space.
Data Set: X, Y, Z, T. A subset of the Data Space, e.g. TOMS Images, 96-01. A Data Set is a collection of Data Instances.
Data Instance: X, Y, … A subset of the Data Set, e.g. a single TOMS image. A Data Instance is an atomic data entity that cannot be broken down into smaller entities. It is a record in the database, e.g.:
- data point: Temperature(x_i, y_i, z_i, t_i)
- image: Temperature(x range, y range, t_i)
[Figure: a Data Set and a Data Instance inside the Data Space of the measures, on axes X, Y, Z with bounds Xmin, Xmax, Ymin, Ymax, Zmin]
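
The relationship between a Data Set and its Data Instances can be sketched with two small Python classes. The class layout, field names, coordinates and values are assumptions made for illustration; they are not the DVoy data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Range = Tuple[float, float]  # (min, max); a point has min == max


@dataclass(frozen=True)
class DataInstance:
    """Atomic record of one measure: a point value or an image over a range."""
    measure: str
    x: Range
    y: Range
    t: str
    value: object


@dataclass
class DataSet:
    """Collection of Data Instances of the same measure."""
    measure: str
    instances: List[DataInstance] = field(default_factory=list)

    def bounds(self):
        """Extent (Xmin, Xmax, Ymin, Ymax) of the data set within the data space."""
        xs = [v for inst in self.instances for v in inst.x]
        ys = [v for inst in self.instances for v in inst.y]
        return min(xs), max(xs), min(ys), max(ys)


# A data point: Temperature(x_i, y_i, t_i)
point = DataInstance("Temperature", x=(-90.2, -90.2), y=(38.6, 38.6),
                     t="2001-07-01", value=25.0)
# An image: Temperature(x range, y range, t_i); the payload is a placeholder.
image = DataInstance("Temperature", x=(-125.0, -65.0), y=(25.0, 50.0),
                     t="2001-07-01", value="<image payload>")

dataset = DataSet("Temperature", [point, image])
print(dataset.bounds())
```

The measure is a label on the data set rather than a coordinate, which mirrors the statement that the parameter is not a dimension.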

35 Service Chaining in Spatio-Temporal Data Browser
[Figure labels: Data Sources (Vector GIS Data, XDim Data, SQL Table, OLAP, Satellite Images); Homogenizer; Catalog; Wrapper; Mediator; Find/Bind Data; Data Cube; Spatial Slice; Time Slice; Portray (Spatial Portrayal, Time Portrayal); Overlay (Spatial Overlay, Time Overlay); Render; OGC-Compliant GIS Services; Time-Series Services; Client Browser; Cursor/Controller; Maintain Data]

36 Data Entry to the Supersite Relational Data System:
1. Batch transfer of large Supersite and other datasets to the SRDS SQL server
2. Web-submission of relational tables by the data producers/custodians
3. Automatic translation and transfer of NARSTO-archived DES data to SQL
[Figure labels: EPA Supersite Data; Supersite & other SQL Data; 1. DataAdapter; 2. Direct Web Data Input; 3. DES-SQL Transformer; NARSTO ORNL DES, Data Ingest; EOSDIS Data Archive; Coordinated Supersite Relational Tables; Supersite SQL Server; Data Query; Table Output]

37 Summary of Proposed Database Schema Design
The starting point for the design of the Supersite Relational Database schema will be the Minimal Star Schema for fixed-location monitoring data. Extensions will be made if they clearly benefit regional analysis and cross-Supersite comparisons.
The possible extensions, based on user needs, may include the addition of:
–A ‘Methods’ dimension table to identify the sampling/analysis method of each observation
–Additional attributes (columns) in the Site and Parameter tables
The Supersite data are not yet ready for submission to the NARSTO archive. Thus, there is still time to develop an agreed-upon schema for the Supersite data in SRDS. The schema modifications and the consensus-building will be conducted through the SRDS website.

38 The RDMS Schema: ‘Minimal’ Star Schema
The minimal Sites table includes SiteID, Name and Lat/Lon.
The minimal Parameter table consists of ParameterID, Description and Unit.
The time dimensional table is usually skipped since the time code is self-describing.
The minimal Fact (Data) table consists of the Obs_Value and the three dimensional codes for Obs_DateTime, Site_ID and Parameter_ID.
Additional dimension tables may include Method and Data Quality.
For integrative, cross-Supersite analysis, the database has to have, at the minimum, a ‘fact table’ and associated time, location, parameter and method tables as dimensions.
The CAPITA data exploration software, Voyager, uses this minimal schema. Voyager has been in use for the past 12 years, successfully encoding and browsing 1000+ datasets worldwide. The state of California still formats and distributes their AQ data on CDs using Voyager.

