Download presentation
Presentation is loading. Please wait.
Published bySheena Boone Modified over 8 years ago
1
SEMDIG supported by:funded by: Providing Data Access and Data Related Monitoring Information for Data Integration on the Grid Alexander Wöhrer and Peter Brezany Institute of Scientific Computing University of Vienna {woehrer|brezany}@par.univie.ac.at
2
SemDIG supported by: funded by: Contents context of SemDIG starting scenario information needed for query optimization and adaptive query processing (AQP) continuous data statistics with D³G overall strategie for metadata about data sources future work and conclusions
3
SemDIG supported by: funded by: Context of this work SemDIG: Semantic Data Integration on the Grid –2 years project –focus on: Query Optimization –e.g. early exclusion of data sources –which source to take? Adaptive Query Processing –e.g. changes on available data source indexes Pilot applications: –ecological (via AustrianGrid) –GridMiner project
4
SemDIG supported by: funded by: Starting scenario I ecological application need to query measurement data from water, air and soil –various replicas definied MAIN_1 MAIN_2 REP_2 REP_1 MAIN_1MAIN_2 REP_2 REP_1 MAIN_1 MAIN_2 REP_1 REP_2 WATER AIRSOIL
5
SemDIG supported by: funded by: Starting scenario II Questions for DAI: –which sources can provide data to answer a query with various conditions? –take main source or replica? –Data distribution and volume (important for query optimisation)? „Normal“ answers: –all main sources –take main source if available –normal distribution of the values
6
SemDIG supported by: funded by: Starting scenario III An example query plan could look like this: MAIN_1 MAIN_2MAIN_1MAIN_2 MAIN_1 MAIN_2 U J UU J Host 1 Host 2 Host 3
7
SemDIG supported by: funded by: Needed information for further DAI optimisations Data access related: –Available indexes provided by OGSA-DAI on request –Connection time indicator for current database workload Data related: –available histograms –exact data statistics (for columns often used in conditions!) General idea: provide more information for better initial query plans and support AQP
8
SemDIG supported by: funded by: Envisioned Solution RDBMS HOST Data Access related Connection Time Indexes Data related Histograms Data statistics Data Source Monitoring Web Service independent from the actual data access technology Supporting/using SOA features –e.g. subscribe to index changes
9
SemDIG supported by: funded by: Histograms http://www.dba-oracle.com/art_builder_histo.htm important for cost based optimiser available from system tables of a DBMS
10
SemDIG supported by: funded by: Exact Data Statistics expensive to query each time when needed Idea: –gather once –include the effect of the delta (increment) for various database operations (insert, delete, update) Advantage: –Low running costs –use to refute data sources from a query plan early
11
SemDIG supported by: funded by: D³G RDBMS-side architecture Maintainance: –row trigger after delete/insert/update to update the following values of a table: mean, standard deviation (numerical) missing and total frequency –statement trigger to keep min/max for columns up- to-date RDBMS side Stored procedure Triggers create Data statstics update init Tables monitor All Triggers are dynamically (according to the table structure) generated after initializing the data statistics
12
SemDIG supported by: funded by: D³G RDBMS-side features customisable: –for certain tables (exposed once) certain columns (the once often used in queries) gathered exact statistics for each column: –min, max, stddev, mean for numerical columns –total frequency, missing frequency
13
SemDIG supported by: funded by: D³G RDBMS-side performance Setup: –table with 11 columns (9 numerical) –Oracle 10g on a AMD 1 GHz, 768 MB RAM init just once per table RT independent of the table size no updates to min/max => ST returns immediately Performance of RDBMS side functionality in msec
14
SemDIG supported by: funded by: Target DAI scenario I The following information is available: –Water REP_1 has an index on a column used MAIN_2 exposes 1 < WATER_ID < 5000 –Soil MAIN_2 has a very bad connection time –Air MAIN_1 exposes 1 < AIR_ID < 100.000 Let the query be: select * from water, soil, air where.... WATER_ID > 10000 and AIR_ID > 150000
15
SemDIG supported by: funded by: Target DAI scenario: Starting query plan MAIN_1 MAIN_2MAIN_1MAIN_2 MAIN_1 MAIN_2 U J UU J Host 1 Host 2 Host 3
16
SemDIG supported by: funded by: Target DAI scenario II refute data sources early Histograms and information about row numbers could be used to change operator distribution MAIN_1REP_2MAIN_2 J U J Host 1 Host 2 Host 3 REP_1
17
SemDIG supported by: funded by: Conclusions Efficient DAI needs more metadata about a data source –Data related histograms data statistics –Data access related indexes connection time Additionally: info about main source + info about replicas = more knowledge about one source (combine it) D³G promising first results Query optimisation as well as AQP could profit –QO: better initial query plans –AQP: react to index changes, more information used during adaption More information on this and future work http://www.par.univie.ac.at/project/semdig
18
SemDIG supported by: funded by: References Jim Gray, “Distributed Computing Economics”,TR, 2003 Alexander Wöhrer, Lenka Novakova, Peter Brezany and A Min Tjoa, „D3G: Novel Approaches to Data Statistics, Understanding and Preprocessing on the Grid“, Accepted for IEEE AINA, Vienna, 2006 SemDIG, http://www.par.univie.ac.at/project/semdig PMML, http://www.dmg.org
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.