Providing Data Access and Data Related Monitoring Information for Data Integration on the Grid

Alexander Wöhrer and Peter Brezany
Institute of Scientific Computing, University of Vienna

Contents
- context of SemDIG
- starting scenario
- information needed for query optimization and adaptive query processing (AQP)
- continuous data statistics with D³G
- overall strategy for metadata about data sources
- future work and conclusions

Context of this work
SemDIG: Semantic Data Integration on the Grid
- 2-year project
- focus on:
  - Query Optimization, e.g. early exclusion of data sources, which source to take?
  - Adaptive Query Processing, e.g. changes to the available data source indexes
Pilot applications:
- ecological (via AustrianGrid)
- GridMiner project

Starting scenario I
An ecological application needs to query measurement data from water, air and soil.
- various replicas defined
[Figure: for each of WATER, AIR and SOIL, the sources MAIN_1 and MAIN_2 and the replicas REP_1 and REP_2.]

Starting scenario II
Questions for DAI:
- Which sources can provide data to answer a query with various conditions?
- Take the main source or a replica?
- What are the data distribution and volume (important for query optimisation)?
"Normal" answers:
- all main sources
- take the main source if available
- normal distribution of the values

Starting scenario III
An example query plan could look like this:
[Figure: query plan distributed over Host 1, Host 2 and Host 3; the MAIN_1 and MAIN_2 sources of each data set are combined by U operators, and the results are combined by J operators.]

Needed information for further DAI optimisations
Data access related:
- available indexes: provided by OGSA-DAI on request
- connection time: an indicator of the current database workload
Data related:
- available histograms
- exact data statistics (for columns often used in conditions!)
General idea: provide more information for better initial query plans and to support AQP

Envisioned Solution
Data Source Monitoring Web Service
- independent of the actual data access technology
- supporting/using SOA features, e.g. subscribe to index changes
[Figure labels: RDBMS, host; data access related: connection time, indexes; data related: histograms, data statistics.]

Histograms
- important for the cost-based optimiser
- available from the system tables of a DBMS
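
To illustrate why histograms matter to a cost-based optimiser, here is a minimal sketch of estimating the selectivity of a `column > threshold` condition from bucket counts. The function name, the bucket layout and the uniform-spread assumption inside a bucket are illustrative assumptions, not the format of any particular DBMS system table.

```python
from bisect import bisect_right

def estimate_selectivity_gt(bucket_bounds, bucket_counts, threshold):
    """Estimate the fraction of rows with value > threshold from a histogram.

    bucket_bounds holds len(bucket_counts) + 1 ascending bucket edges.
    Assumption: rows are spread uniformly inside each bucket.
    """
    total = sum(bucket_counts)
    if total == 0:
        return 0.0
    if threshold < bucket_bounds[0]:
        return 1.0
    if threshold >= bucket_bounds[-1]:
        return 0.0
    i = bisect_right(bucket_bounds, threshold) - 1  # bucket containing threshold
    lo, hi = bucket_bounds[i], bucket_bounds[i + 1]
    partial = bucket_counts[i] * (hi - threshold) / (hi - lo)
    above = sum(bucket_counts[i + 1:])
    return (partial + above) / total

# Hypothetical histogram: three buckets over the value range 0..30.
selectivity = estimate_selectivity_gt([0, 10, 20, 30], [10, 20, 30], 15)
```

An optimiser could multiply such an estimate by the row count of a source to guess how many rows an operator will have to process.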

Exact Data Statistics
- expensive to query each time they are needed
Idea:
- gather once
- include the effect of the delta (increment) for the various database operations (insert, delete, update)
Advantage:
- low running costs
- can be used to rule out data sources from a query plan early
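
The "gather once, then apply deltas" idea can be sketched as follows. This is a minimal illustration that keeps a running sum and sum of squares; the class and attribute names are assumptions, and the slides do not say which formulas D³G actually uses.

```python
import math

class ColumnStats:
    """Running statistics for one numerical column, maintained by deltas."""

    def __init__(self, values=()):
        # "Gather once": a single pass over the existing column values.
        self.total = 0      # total frequency (rows, including missing values)
        self.missing = 0    # missing frequency (None stands in for NULL)
        self.sum = 0.0
        self.sum_sq = 0.0
        for v in values:
            self.insert(v)

    def insert(self, v):
        self.total += 1
        if v is None:
            self.missing += 1
        else:
            self.sum += v
            self.sum_sq += v * v

    def delete(self, v):
        self.total -= 1
        if v is None:
            self.missing -= 1
        else:
            self.sum -= v
            self.sum_sq -= v * v

    def update(self, old, new):
        # An update is the delta of removing the old value and adding the new one.
        self.delete(old)
        self.insert(new)

    @property
    def count(self):
        return self.total - self.missing

    @property
    def mean(self):
        return self.sum / self.count if self.count else None

    @property
    def stddev(self):
        # Population standard deviation derived from the running sums.
        if not self.count:
            return None
        variance = self.sum_sq / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))

stats = ColumnStats([2, 4, 4, 4, 5, 5, 7, 9])
stats.insert(None)   # a row with a missing value
stats.update(9, 1)   # constant-time maintenance, no rescan of the column
```

Each operation costs constant time regardless of the table size. The sum-of-squares form is simple but can lose precision for large values, so a production version might prefer a numerically more stable update.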

D³G RDBMS-side architecture
Maintenance:
- a row trigger after delete/insert/update updates the following values of a table:
  - mean and standard deviation (numerical columns)
  - missing and total frequency
- a statement trigger keeps min/max for columns up to date
All triggers are generated dynamically (according to the table structure) after the data statistics have been initialised.
[Figure labels, RDBMS side: stored procedure (init, create), triggers (monitor, update), data statistics, tables.]
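
A rough sketch of trigger-maintained statistics, generated from the table structure, is shown below. It swaps in SQLite (via Python's `sqlite3` module) for the RDBMS used by D³G, so it is not the authors' stored procedure. The `col_stats` table, its column names and the `water` example table are hypothetical. SQLite only offers row-level triggers, so the statement trigger for min/max is left out.

```python
import math
import sqlite3

NUMERIC_TYPES = {"INTEGER", "REAL"}

def install_statistics(conn, table):
    """Initialise col_stats for `table` and generate triggers per numeric column."""
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    cols = [name for _, name, ctype, *_ in info if ctype.upper() in NUMERIC_TYPES]
    conn.execute(
        "CREATE TABLE col_stats (col_name TEXT PRIMARY KEY, total_freq INTEGER, "
        "missing_freq INTEGER, val_sum REAL, val_sum_sq REAL)"
    )
    for c in cols:
        # init: one full pass over the column, done just once.
        conn.execute(
            f"INSERT INTO col_stats SELECT '{c}', COUNT(*), COUNT(*) - COUNT({c}), "
            f"COALESCE(SUM({c}), 0), COALESCE(SUM({c} * {c}), 0) FROM {table}"
        )
        # Row triggers for insert/delete add or subtract the changed row's delta.
        for event, ref, sign in (("INSERT", "NEW", "+"), ("DELETE", "OLD", "-")):
            conn.execute(
                f"CREATE TRIGGER {table}_{c}_{event.lower()} AFTER {event} ON {table} "
                f"BEGIN UPDATE col_stats SET "
                f"total_freq = total_freq {sign} 1, "
                f"missing_freq = missing_freq {sign} ({ref}.{c} IS NULL), "
                f"val_sum = val_sum {sign} COALESCE({ref}.{c}, 0), "
                f"val_sum_sq = val_sum_sq {sign} COALESCE({ref}.{c} * {ref}.{c}, 0) "
                f"WHERE col_name = '{c}'; END"
            )
        # An update removes the old value and adds the new one.
        conn.execute(
            f"CREATE TRIGGER {table}_{c}_update AFTER UPDATE ON {table} "
            f"BEGIN UPDATE col_stats SET "
            f"missing_freq = missing_freq - (OLD.{c} IS NULL) + (NEW.{c} IS NULL), "
            f"val_sum = val_sum - COALESCE(OLD.{c}, 0) + COALESCE(NEW.{c}, 0), "
            f"val_sum_sq = val_sum_sq - COALESCE(OLD.{c} * OLD.{c}, 0) "
            f"+ COALESCE(NEW.{c} * NEW.{c}, 0) "
            f"WHERE col_name = '{c}'; END"
        )

def read_stats(conn, col):
    """Return (total, missing, mean, stddev) without scanning the data table."""
    total, missing, s, sq = conn.execute(
        "SELECT total_freq, missing_freq, val_sum, val_sum_sq FROM col_stats "
        "WHERE col_name = ?", (col,)
    ).fetchone()
    n = total - missing
    if n == 0:
        return total, missing, None, None
    mean = s / n
    return total, missing, mean, math.sqrt(max(sq / n - mean * mean, 0.0))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE water (water_id INTEGER, temperature REAL)")
conn.executemany("INSERT INTO water VALUES (?, ?)", [(1, 10.0), (2, 14.0)])
install_statistics(conn, "water")
conn.execute("INSERT INTO water VALUES (3, NULL)")
conn.execute("UPDATE water SET temperature = 12.0 WHERE water_id = 1")
conn.execute("DELETE FROM water WHERE water_id = 2")
temperature_stats = read_stats(conn, "temperature")
```

The table and column names are interpolated into the SQL text because identifiers cannot be bound as parameters; in this sketch they come from the schema itself, not from user input.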

D³G RDBMS-side features
customisable:
- for certain tables (the exposed ones)
- for certain columns (the ones often used in queries)
exact statistics gathered for each column:
- min, max, stddev, mean for numerical columns
- total frequency, missing frequency

D³G RDBMS-side performance
Setup:
- table with 11 columns (9 numerical)
- Oracle 10g on an AMD 1 GHz, 768 MB RAM
- init runs just once per table
- row trigger (RT) cost is independent of the table size
- no updates to min/max => statement trigger (ST) returns immediately
[Table: performance of RDBMS-side functionality in msec]

Target DAI scenario I
The following information is available:
- Water: REP_1 has an index on a column used, MAIN_2 exposes 1 < WATER_ID < 5000
- Soil: MAIN_2 has a very bad connection time
- Air: MAIN_1 exposes 1 < AIR_ID <
Let the query be: select * from water, soil, air where.... WATER_ID > and AIR_ID >

Target DAI scenario: Starting query plan
[Figure: query plan distributed over Host 1, Host 2 and Host 3; the MAIN_1 and MAIN_2 sources of each data set are combined by U operators, and the results are combined by J operators.]

Target DAI scenario II
- rule out data sources early
- histograms and information about row counts could be used to change the operator distribution
[Figure: revised query plan over Host 1, Host 2 and Host 3 using MAIN_1, REP_1, MAIN_2 and REP_2, with one U and two J operators.]
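
Ruling out data sources early can be sketched as a range check against the value bounds each source exposes, followed by a choice between copies of the same data. Everything below is a hypothetical illustration: the `SourceInfo` fields, the bounds for MAIN_1, the threshold 6000 and the connection times are assumptions, because the slides elide the concrete query values.

```python
from dataclasses import dataclass

@dataclass
class SourceInfo:
    name: str
    lower: float               # exposed lower bound of the filtered column
    upper: float               # exposed upper bound of the filtered column
    has_index: bool = False    # index available on the filtered column
    connection_ms: float = 0.0 # measured connection time

def can_match_greater_than(source, threshold):
    """Conservative test: only sources whose upper bound exceeds the threshold
    can contribute rows to `column > threshold`."""
    return source.upper > threshold

def prune_sources(sources, threshold):
    """Drop sources that cannot contribute any row to the query."""
    return [s for s in sources if can_match_greater_than(s, threshold)]

def pick_copy(copies):
    """Among copies of the same data, prefer an index, then a lower connection time."""
    return min(copies, key=lambda s: (not s.has_index, s.connection_ms))

# MAIN_2 exposes 1 < WATER_ID < 5000 (from the slides); MAIN_1's bounds are made up.
water = [SourceInfo("MAIN_1", 5000, 10000), SourceInfo("MAIN_2", 1, 5000)]
kept = prune_sources(water, 6000)  # hypothetical condition WATER_ID > 6000

# Hypothetical main source and replica holding the same rows.
copies = [
    SourceInfo("MAIN_1", 5000, 10000, has_index=False, connection_ms=20.0),
    SourceInfo("REP_1", 5000, 10000, has_index=True, connection_ms=35.0),
]
chosen = pick_copy(copies)
```

The check is deliberately conservative: it may keep a source that turns out to return no rows, but it never excludes a source that could contribute.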

Conclusions
Efficient DAI needs more metadata about a data source:
- data related: histograms, data statistics
- data access related: indexes, connection time
Additionally: info about the main source + info about its replicas = more knowledge about one source (combine it)
D³G: promising first results
Query optimisation as well as AQP could profit:
- QO: better initial query plans
- AQP: react to index changes, more information used during adaptation
More information on this and future work

References
- Jim Gray, "Distributed Computing Economics", TR, 2003
- Alexander Wöhrer, Lenka Novakova, Peter Brezany and A Min Tjoa, "D3G: Novel Approaches to Data Statistics, Understanding and Preprocessing on the Grid", accepted for IEEE AINA, Vienna, 2006
- SemDIG, PMML,