Lesley Charles November 23, 2009.


• Query optimization is the process of devising a query execution plan with the minimum response time.
• Response time is minimized by reducing the number of blocks that must be read from or written to memory to complete the query.
• Query optimization is especially vital when numerous transactions are executed every second.

• A SPARQL query uses multiple triple patterns to match conditions and extract data based on them.
• Two factors play an important role in the response time of a SPARQL query:
  ◦ the order in which the triples are accessed,
  ◦ the necessity of each triple.
• SPARQL optimization also depends on the platform on which it is implemented.

• There are two types of query optimization:
  ◦ Logical Optimization
  ◦ Physical Optimization
• Logical optimization generates the order in which the triples are processed so as to minimize the response time.
• Physical optimization determines how each individual operation is carried out.

• The aim of logical optimization is to find the execution plan expected to return the result set fastest, without actually executing the query or any subset of it.
• The technique is selectivity-based optimization of Basic Graph Patterns: we estimate which triple pattern has the minimum selectivity by examining the graph patterns, and decide the execution plan on that basis.

• The selectivity of a triple pattern is the fraction of triples in the dataset that match the pattern. It guides the choice of execution plan.
• Consider the following query:
  ?x NS:type NS:animal .
  ?x NS:species "zebra" .
• Changing the order in which the two patterns are executed can save a lot of time, because evaluating the more selective pattern first leaves fewer intermediate bindings for the other.
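To make the definition concrete, here is a minimal Python sketch that computes selectivity by counting matches over a tiny in-memory list of triples and then orders the two patterns of the query. The dataset, the resource names (`a1`, `p1`, `NS:plant`, and so on), and the helpers `matches` and `selectivity` are illustrative assumptions, not part of the original presentation; a real optimizer estimates selectivity rather than counting matches.

```python
# Toy dataset of (subject, predicate, object) triples.
# All values are made up for illustration.
TRIPLES = [
    ("a1", "NS:type", "NS:animal"),
    ("a2", "NS:type", "NS:animal"),
    ("a3", "NS:type", "NS:animal"),
    ("a1", "NS:species", "zebra"),
    ("a2", "NS:species", "lion"),
    ("a3", "NS:species", "lion"),
    ("p1", "NS:type", "NS:plant"),
    ("p1", "NS:species", "oak"),
]


def is_var(term):
    """A term starting with '?' is an unbound variable."""
    return term.startswith("?")


def matches(pattern, triple):
    """A triple matches if every bound component of the pattern is equal."""
    return all(is_var(p) or p == t for p, t in zip(pattern, triple))


def selectivity(pattern, triples):
    """Fraction of triples in the dataset that match the pattern."""
    matched = sum(1 for t in triples if matches(pattern, t))
    return matched / len(triples)


type_pattern = ("?x", "NS:type", "NS:animal")
species_pattern = ("?x", "NS:species", "zebra")

# Evaluate the pattern with the smallest selectivity value first.
plan = sorted([type_pattern, species_pattern],
              key=lambda p: selectivity(p, TRIPLES))
```

In this toy dataset the species pattern matches one triple out of eight and the type pattern matches three, so the species pattern is placed first in `plan`.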

• The triple patterns are treated as nodes of a directed graph, in which each directed edge denotes a pair of triple patterns.
• The node with the minimum selectivity is visited first and added to the execution plan.
• Each remaining node is then checked against the following two conditions before being added to the final execution plan:
  1. minimum selectivity,
  2. visited or not.
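The node-selection procedure can be sketched in Python as a greedy loop. This is a simplified illustration under stated assumptions, not the algorithm from the referenced paper: selectivities are supplied as a plain dictionary, two patterns are considered connected when they share a variable, and the next node is preferably taken from patterns connected to the plan so far. The function names, the `NS:livesIn` and `NS:country` terms, and the selectivity values are hypothetical.

```python
def shared_variable(a, b):
    """True if triple patterns a and b have at least one variable in common."""
    vars_a = {term for term in a if term.startswith("?")}
    return any(term in vars_a for term in b)


def greedy_plan(patterns, sel):
    """Order patterns greedily by estimated selectivity.

    patterns: list of (s, p, o) tuples; sel: dict mapping pattern -> selectivity.
    """
    unvisited = list(patterns)
    # Start with the node that has the minimum selectivity.
    first = min(unvisited, key=sel.get)
    plan = [first]
    unvisited.remove(first)
    while unvisited:
        # Assumption: prefer unvisited nodes joined to something already planned,
        # falling back to all unvisited nodes if the graph is disconnected.
        joined = [p for p in unvisited
                  if any(shared_variable(p, q) for q in plan)]
        candidates = joined or unvisited
        nxt = min(candidates, key=sel.get)
        plan.append(nxt)
        unvisited.remove(nxt)
    return plan


# Hypothetical patterns and selectivity estimates.
p1 = ("?x", "NS:type", "NS:animal")
p2 = ("?x", "NS:species", "zebra")
p3 = ("?x", "NS:livesIn", "?y")
p4 = ("?y", "NS:type", "NS:country")
estimates = {p1: 0.375, p2: 0.125, p3: 0.5, p4: 0.01}

order = greedy_plan([p1, p2, p3, p4], estimates)
```

Here `p4` is chosen first because it has the smallest estimate; `p3` follows because it is the only pattern sharing a variable with `p4`, and `p2` and `p1` are then added in order of increasing selectivity.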

• The optimizer can use various heuristics to estimate the selectivity of graph patterns.
• These heuristics fall into two types:
  1. heuristics without pre-computed statistics,
  2. heuristics with pre-computed statistics.
• The estimate also depends on whether the subject, the predicate, or the object is the most selective.

• Heuristics without pre-computed statistics do not require any statistical data.
• Variable Counting: the selectivity of a triple pattern is computed from the type and number of its unbound components, and is characterized by the ranking sel(S) < sel(O) < sel(P).
• Variable Counting Predicates: the selectivity of bound joins is set to 1.0 by default.
• Graph Statistics Handler: lets the optimizer look up exact size information for any triple pattern component, and the selectivity is determined from that size. However, it does not support joins of any kind.
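A rough Python sketch of the variable-counting idea follows. The numeric weights are hypothetical and chosen only so that a bound subject scores lower (more selective) than a bound object, which in turn scores lower than a bound predicate, matching the ranking sel(S) < sel(O) < sel(P). Unbound components are assumed to contribute 1.0. The weights actually used by any given implementation may differ.

```python
# Hypothetical weights for bound components; unbound components count as 1.0.
BOUND_WEIGHT = {"subject": 0.25, "object": 0.5, "predicate": 0.75}


def variable_counting_score(pattern):
    """Estimate selectivity from which components of the pattern are bound."""
    s, p, o = pattern
    score = 1.0
    for position, term in (("subject", s), ("predicate", p), ("object", o)):
        if not term.startswith("?"):  # bound component
            score *= BOUND_WEIGHT[position]
    return score


# Patterns with more bound components receive smaller scores.
patterns = [
    ("?x", "NS:type", "NS:animal"),
    ("?x", "NS:species", "?y"),
    ("?s", "?p", "?o"),
]
ranked = sorted(patterns, key=variable_counting_score)
```

No statistics about the data are consulted: the score depends only on the shape of the pattern, which is what makes this kind of heuristic cheap.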

• Heuristics with pre-computed statistics are more accurate, but they require pre-computed statistics about the RDF data.
• Probabilistic Framework (PF): a standalone framework for the selectivity estimation of RDF graph patterns.
• Probabilistic Framework Join (PFJ): differs from PF in that it includes the selectivity of the more selective triple pattern when estimating the selectivity of joined triple patterns.
• PFN: another variation of PF, which does not limit the lower bound of the selectivity estimate.

• It is traditional practice to keep metadata, i.e. data about the data, in order to calculate cardinalities, build indices, etc.
• This metadata can be used to create summary statistics that facilitate estimating the result-set size of any query.
• It also helps in calculating the selectivity of a particular component of a triple pattern.
• Histograms can be used to represent the distribution of the data.

• We need to consider the subject, the predicate, and the object, and for each whether it is a bound or an unbound component.
• The size of a bound object can be approximated by means of equal-width histograms: for each distinct predicate we compute a histogram representing the corresponding object-value distribution.
• For joined triple patterns we consider the join and determine whether both triples contain components of the same class.
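As an illustration of per-predicate equal-width histograms, the Python sketch below builds one histogram for each distinct predicate and looks up the count of the class into which a given object value falls. It assumes numeric object values so that equal-width classes are well defined; how non-numeric objects are mapped to classes is not covered by the slides. The data and the names `build_histograms` and `bucket_count` are hypothetical.

```python
from collections import defaultdict


def build_histograms(triples, bins=4):
    """Return {predicate: (lowest value, class width, counts per class)}."""
    values = defaultdict(list)
    for _, predicate, obj in triples:
        values[predicate].append(obj)
    histograms = {}
    for predicate, objs in values.items():
        lo, hi = min(objs), max(objs)
        # Guard against a zero width when all values are equal.
        width = (hi - lo) / bins or 1.0
        counts = [0] * bins
        for obj in objs:
            # The maximum value is clamped into the last class.
            idx = min(int((obj - lo) / width), bins - 1)
            counts[idx] += 1
        histograms[predicate] = (lo, width, counts)
    return histograms


def bucket_count(histograms, predicate, obj):
    """Number of recorded object values in the class that obj falls into."""
    lo, width, counts = histograms[predicate]
    idx = min(max(int((obj - lo) / width), 0), len(counts) - 1)
    return counts[idx]


# Made-up numeric data for two predicates.
DATA = [
    ("a1", "NS:age", 1), ("a2", "NS:age", 2), ("a3", "NS:age", 3),
    ("a4", "NS:age", 10), ("a5", "NS:age", 20),
    ("a1", "NS:weight", 100), ("a2", "NS:weight", 110),
]
hists = build_histograms(DATA, bins=4)
```

With this data, the `NS:age` histogram has three values in its first class, so `bucket_count(hists, "NS:age", 2)` approximates the size of any bound object in that range as 3.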

• The selectivity is the ratio of the estimated number of triples matching a pattern to the total number of triples in the dataset.
• sel(t) = sel(s) × sel(p) × sel(o)
• sel(s) = 1/R, where R is the number of resources.
• sel(p) = T_p / T, where T is the total number of triples and T_p is the number of triples matching predicate p.
• sel(o) = h_c(p, o_c) / T_p, where (p, o_c) is the class of the histogram for predicate p in which object o falls.
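The formulas can be combined into a small worked example in Python. The slide gives formulas only for bound components; treating an unbound component as contributing a selectivity of 1.0 is an assumption made here, and all the counts in the example are invented.

```python
def sel_subject(bound, num_resources):
    """sel(s) = 1/R for a bound subject; 1.0 assumed when unbound."""
    return 1.0 / num_resources if bound else 1.0


def sel_predicate(bound, triples_with_p, total_triples):
    """sel(p) = T_p / T for a bound predicate; 1.0 assumed when unbound."""
    return triples_with_p / total_triples if bound else 1.0


def sel_object(bound, class_count, triples_with_p):
    """sel(o) = h_c(p, o_c) / T_p for a bound object; 1.0 assumed when unbound."""
    return class_count / triples_with_p if bound else 1.0


def sel_triple(s_bound, p_bound, o_bound, *,
               num_resources, total_triples, triples_with_p, class_count):
    """sel(t) = sel(s) * sel(p) * sel(o)."""
    return (sel_subject(s_bound, num_resources)
            * sel_predicate(p_bound, triples_with_p, total_triples)
            * sel_object(o_bound, class_count, triples_with_p))


# Invented statistics: R = 1000 resources, T = 10000 triples,
# T_p = 500 triples with the predicate, h_c(p, o_c) = 50.
stats = dict(num_resources=1000, total_triples=10000,
             triples_with_p=500, class_count=50)

# Pattern of the form (?x, p, o): subject unbound, predicate and object bound.
estimate = sel_triple(False, True, True, **stats)
```

For these numbers sel(p) = 500/10000 and sel(o) = 50/500, so the estimate is approximately 0.005, i.e. about 50 of the 10000 triples are expected to match.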

• Scalability is the basic limitation of selectivity-based query optimization: it is not feasible to compute selectivity over a dataset containing millions of triples.
• The major limitation of logical optimization is the use of special modifiers such as OPTIONAL, UNION, and FILTER. These modifiers affect the selectivity and undermine the algorithm presented for logical optimization.
• Another important issue is the type of ontology used and the pattern in which the data is stored. These cannot be generalized, as they vary with each dataset.

• Physical optimization is a customized solution for a specific ontology or framework: we decide how each query is to be implemented based on the ontology and the data.
• Usually the queries are rewritten, sometimes eliminating certain triples while still obtaining the same result:
  ?x NS:type NS:animal .
  ?x NS:species "zebra" .
• For each query, we analyse the triples to understand what data is required, and then determine the most beneficial way of extracting that data from the data store.

Reference
M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, D. Reynolds, "SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation", in WWW '08: Proceedings of the 17th International Conference on World Wide Web, New York, NY, USA, 2008, ACM.