CS742 – Distributed & Parallel DBMSPage 3. 1M. Tamer Özsu Outline Introduction & architectural issues Data distribution  Distributed query processing.

Slides:



Advertisements
Similar presentations
Choosing an Order for Joins
Advertisements

Outline  Introduction  Background  Distributed DBMS Architecture  Distributed Database Design  Semantic Data Control ➠ View Management ➠ Data Security.
Distributed DBMSPage 6. 1© 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database Design.
Distributed DBMSPage © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database.
Distributed Query Processing –An Overview
Distributed DBMSPage © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database.
Distributed DBMS© M. T. Özsu & P. Valduriez Ch.6/1 Outline Introduction Background Distributed Database Design Database Integration Semantic Data Control.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.7/1 Outline Introduction Background Distributed Database Design Database Integration Semantic Data Control.
Distributed databases
Distributed Database Systems Dr. Mohamed Osman Hegazi.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
CS263 Lecture 19 Query Optimisation.  Motivation for Query Optimisation  Phases of Query Processing  Query Trees  RA Transformation Rules  Heuristic.
Session – 10 QUERY OPTIMIZATION Matakuliah: M0184 / Pengolahan Data Distribusi Tahun: 2005 Versi:
1 Distributed Databases CS347 Lecture 14 May 30, 2001.
Distributed DBMSPage 4. 1© 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background  Distributed DBMS Architecture  Datalogical Architecture.
Distributed DBMSPage 5. 1 © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture  Distributed Database.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Institut für Scientific Computing – Universität WienP.Brezany Optimization of Distributed Queries Univ.-Prof. Dr. Peter Brezany Institut für Scientific.
Distributed DBMSPage 5. 1 © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture  Distributed Database.
Query Optimization. General Overview Relational model - SQL  Formal & commercial query languages Functional Dependencies Normalization Physical Design.
L Distributed Query Optimization Algorithms -- 1 Distributed Query Optimization Algorithms v System R and R* v Hill Climbing and SDD-1.
©Silberschatz, Korth and Sudarshan14.1Database System Concepts 3 rd Edition Chapter 14: Query Optimization Overview Catalog Information for Cost Estimation.
Distributed DBMSPage © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database.
Distributed Databases
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.7/1 Outline Introduction Background Distributed Database Design Database Integration Semantic Data Control.
Distributed Databases and DBMSs: Concepts and Design
Query Processing Presented by Aung S. Win.
III. Current Trends: 1 - Distributed DBMSsSlide 1/32 III. Current Trends Part 1: Distributed DBMSs: Concepts and Design Lecture 12 (2 hours) Lecturer:
low level data manipulation
DISTRIBUTED DATABASE DESIGN
Access Path Selection in a Relational Database Management System Selinger et al.
Session-9 Data Management for Decision Support
1 6. Distributed Query Optimization Chapter 9 Optimization of Distributed Queries.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.7/1 Οι διαφάνειες καλύπτουν μέρος των Κεφαλαίων 7&8: Distributed Database QueryProcessing and Optimization.
Query Optimization. Query Optimization Query Optimization The execution cost is expressed as weighted combination of I/O, CPU and communication cost.
Overview of Query Processing
PMIT-6102 Advanced Database Systems By- Jesmin Akhter Assistant Professor, IIT, Jahangirnagar University.
Query Optimization Chap. 19. Evaluation of SQL Conceptual order of evaluation – Cartesian product of all tables in from clause – Rows not satisfying where.
M Taimoor Khan Course Objectives 1) Basic Concepts 2) Tools 3) Database architecture and design 4) Flow of data (DFDs)
PMIT-6102 Advanced Database Systems By- Jesmin Akhter Assistant Professor, IIT, Jahangirnagar University.
Query Processor  A query processor is a module in the DBMS that performs the tasks to process, to optimize, and to generate execution strategy for a high-level.
PMIT-6102 Advanced Database Systems By- Jesmin Akhter Assistant Professor, IIT, Jahangirnagar University.
Distributed DBMS© M. T. Özsu & P. Valduriez Ch.8/1 Outline Introduction Background Distributed Database Design Database Integration Semantic Data Control.
DDBMS Distributed Database Management Systems Fragmentation
Distributed DBMSs- Concept and Design Jing Luo CS 157B Dr. Lee Fall, 2003.
Query Processing Bayu Adhi Tama, MTI. 1 ownerNoclient © Pearson Education Limited 1995, 2005.
PMIT-6101 Advanced Database Systems By- Jesmin Akhter Assistant Professor, IIT, Jahangirnagar University.
Distributed DBMSPage © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database.
Chapter 18 Query Processing. 2 Chapter - Objectives u Objectives of query processing and optimization. u Static versus dynamic query optimization. u How.
PMIT-6101 Advanced Database Systems By- Jesmin Akhter Assistant Professor, IIT, Jahangirnagar University.
1 ICS 214B: Transaction Processing and Distributed Data Management Lecture 9: Fragmentation and Distributed Query Processing Professor Chen Li.
Query Processing – Query Trees. Evaluation of SQL Conceptual order of evaluation – Cartesian product of all tables in from clause – Rows not satisfying.
1 Chapter 22 Distributed DBMS Concepts and Design CS 157B Edward Chen.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution.
Distributed Database Design Bayu Adhi Tama, MTI Fasilkom-Unsri Adapted from Connolly, et al., Database Systems 4 th Edition, Pearson Education Limited,
1 Semijoin Reduction in Query Processors Stocker, Kossman, Braumandl, Kemper Integrating Semi-Join-Reducers into State-of-the-Art Query Processors ICDE.
L4: Query Optimization (1) - 1 L4: Query Processing and Optimization v 4.1 Query Processing  Query Decomposition  Data Localization v 4.1 Query Optimization.
Distributed DBMS© 2001 M. Tamer Özsu & Patrick Valduriez Page 1.1 Outline n Introduction Background Distributed DBMS Architecture Distributed Database.
CS742 – Distributed & Parallel DBMSPage 2. 1M. Tamer Özsu Outline Introduction & architectural issues  Data distribution  Fragmentation  Data Allocation.
Outline Background Introduction Distributed DBMS Architecture
DISTRIBUTED DATABASE ARCHITECTURE
Outline Introduction Background Distributed DBMS Architecture
Outline Introduction Background Distributed DBMS Architecture
Outline Introduction Background Distributed DBMS Architecture
Distributed Database Management Systems
Advance Database Systems
Query Optimization.
Distributed Database Management Systems
Distributed Database Management Systems
Outline Introduction Background Distributed DBMS Architecture
Presentation transcript:

CS742 – Distributed & Parallel DBMSPage 3. 1M. Tamer Özsu Outline Introduction & architectural issues Data distribution  Distributed query processing  Query Processing Methodology  Localization  Distributed query optimization  Distributed transactions & concurrency control  Distributed reliability  Data replication  Parallel database systems  Database integration & querying  Advanced topics

CS742 – Distributed & Parallel DBMSPage 3. 2M. Tamer Özsu Query Processing high level user query query processor low level data manipulation commands

CS742 – Distributed & Parallel DBMSPage 3. 3M. Tamer Özsu Query Processing Components Query language that is used SQL: “intergalactic dataspeak” Query execution methodology The steps that one goes through in executing high-level (declarative) user queries. Query optimization How do we determine the “best” execution plan?

CS742 – Distributed & Parallel DBMSPage 3. 4M. Tamer Özsu SELECTENAME FROMEMP,ASG WHEREEMP.ENO = ASG.ENO ANDRESP = “Manager” Strategy 1 Π ENAME (  RESP=“Manager”  EMP.ENO=ASG.ENO (EMP  ASG)) Strategy 2 Π ENAME (EMP ⋈ ENO (  RESP=“Manager” (ASG))) Strategy 2 avoids Cartesian product, so is “better” Selecting Alternatives

CS742 – Distributed & Parallel DBMSPage 3. 5M. Tamer Özsu What is the Problem? Site 1Site 2Site 3Site 4Site 5 EMP 1 =  ENO≤“E3” (EMP)EMP 2 =  ENO>“E3” (EMP) ASG 2 =  ENO>“E3” (ASG) ASG 1 =  ENO≤“E3” (ASG) Result Site 5 Site 1Site 2Site 3Site 4 ASG 1 EMP 1 EMP 2 ASG 2 Site 4Site 3 Site 1Site 2 Site 5 result= (EMP 1  EMP 2 ) ENO σ RESP=“Manager” (ASG 1  ASG 2 )

CS742 – Distributed & Parallel DBMSPage 3. 6M. Tamer Özsu Assume: size (EMP) = 400, size (ASG) = 1000 tuple access cost = 1 unit; tuple transfer cost = 10 units Strategy 1  produce ASG': (10+10) ∗ tuple access cost 20  transfer ASG' to the sites of EMP: (10+10) ∗ tuple transfer cost 200  produce EMP': (10+10) ∗ tuple access cost ∗ 2 40  transfer EMP' to result site: (10+10) ∗ tuple transfer cost 200 Total cost 460 Strategy 2  transfer EMP to site 5: 400 ∗ tuple transfer cost 4,000  transfer ASG to site 5 :1000 ∗ tuple transfer cost 10,000  produce ASG':1000 ∗ tuple access cost 1,000  join EMP and ASG':400 ∗ 20 ∗ tuple access cost 8,000 Total cost23,000 Cost of Alternatives

CS742 – Distributed & Parallel DBMSPage 3. 7M. Tamer Özsu Minimize a cost function I/O cost + CPU cost + communication cost These might have different weights in different distributed environments Wide area networks communication cost will dominate  low bandwidth  low speed  high protocol overhead most algorithms ignore all other cost components Local area networks communication cost not that dominant total cost function should be considered Can also maximize throughput Query Optimization Objectives

CS742 – Distributed & Parallel DBMSPage 3. 8M. Tamer Özsu Query Optimization Issues – Types Of Optimizers Exhaustive search Cost-based Optimal Combinatorial complexity in the number of relations Heuristics Not optimal Regroup common sub-expressions Perform selection, projection first Replace a join by a series of semijoins Reorder operations to reduce intermediate relation size Optimize individual operations

CS742 – Distributed & Parallel DBMSPage 3. 9M. Tamer Özsu Query Optimization Issues – Optimization Granularity Single query at a time Cannot use common intermediate results Multiple queries at a time Efficient if many similar queries Decision space is much larger

CS742 – Distributed & Parallel DBMSPage 3. 10M. Tamer Özsu Query Optimization Issues – Optimization Timing Static Compilation  optimize prior to the execution Difficult to estimate the size of the intermediate results  error propagation Can amortize over many executions R* Dynamic Run time optimization Exact information on the intermediate relation sizes Have to reoptimize for multiple executions Distributed INGRES Hybrid Compile using a static algorithm If the error in estimate sizes > threshold, reoptimize at run time Mermaid

CS742 – Distributed & Parallel DBMSPage 3. 11M. Tamer Özsu Query Optimization Issues – Statistics Relation Cardinality Size of a tuple Fraction of tuples participating in a join with another relation Attribute Cardinality of domain Actual number of distinct values Common assumptions Independence between different attribute values Uniform distribution of attribute values within their domain

CS742 – Distributed & Parallel DBMSPage 3. 12M. Tamer Özsu Query Optimization Issues – Decision Sites Centralized Single site determines the “best” schedule Simple Need knowledge about the entire distributed database Distributed Cooperation among sites to determine the schedule Need only local information Cost of cooperation Hybrid One site determines the global schedule Each site optimizes the local subqueries

CS742 – Distributed & Parallel DBMSPage 3. 13M. Tamer Özsu Query Optimization Issues – Network Topology Wide area networks (WAN) – point-to-point Characteristics  Low bandwidth  Low speed  High protocol overhead Communication cost will dominate; ignore all other cost factors Global schedule to minimize communication cost Local schedules according to centralized query optimization Local area networks (LAN) Communication cost not that dominant Total cost function should be considered Broadcasting can be exploited (joins) Special algorithms exist for star networks

CS742 – Distributed & Parallel DBMSPage 3. 14M. Tamer Özsu Distributed Query Processing Methodology Calculus Query on Distributed Relations CONTROL SITE LOCAL SITES Query Decomposition Query Decomposition Data Localization Data Localization Algebraic Query on Distributed Relations Global Optimization Global Optimization Fragment Query Local Optimization Local Optimization Optimized Fragment Query with Communication Operations Optimized Local Queries GLOBAL SCHEMA GLOBAL SCHEMA FRAGMENT SCHEMA FRAGMENT SCHEMA STATS ON FRAGMENTS STATS ON FRAGMENTS LOCAL SCHEMAS LOCAL SCHEMAS

CS742 – Distributed & Parallel DBMSPage 3. 15M. Tamer Özsu Step 1 – Query Decomposition Input : calculus query on global relations Normalization Manipulate query quantifiers and qualification Analysis Detect and reject “incorrect” queries Possible for only a subset of relational calculus Simplification Eliminate redundant predicates Restructuring Calculus query ⇒ algebraic query More than one translation is possible Use transformation rules

CS742 – Distributed & Parallel DBMSPage 3. 16M. Tamer Özsu Convert relational calculus to relational algebra Make use of query trees Example Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years. SELECTENAME FROMEMP, ASG, PROJ WHEREEMP.ENO = ASG.ENO ANDASG.PNO = PROJ.PNO ANDENAME ≠ “J. Doe” ANDPNAME = “CAD/CAM” ANDDUR = 12 or DUR = 24) Restructuring  ENAME  DUR=12  DUR=24  PNAME=“CAD/CAM”  ENAME≠“J. DOE” PROJASGEMP Project Select Join ⋈ PNO ⋈ ENO

CS742 – Distributed & Parallel DBMSPage 3. 17M. Tamer Özsu Restructuring – Transformation Rules Commutativity of binary operations R ×  S  S  × R R ⋈ S  S ⋈ R R  S  S  R Associativity of binary operations ( R  × S )  × T  R  × ( S  × T ) ( R ⋈ S ) ⋈ T  R ⋈ ( S ⋈ T ) Idempotence of unary operations  A ’ (  A ’ ( R ))  A ’ ( R )  p 1 ( A 1 ) (  p 2 ( A 2 ) ( R ))  p 1 ( A 1 )  p 2 ( A 2 ) ( R ) where R [ A ] and A'  A, A"  A and A'  A" Commuting selection with projection

CS742 – Distributed & Parallel DBMSPage 3. 18M. Tamer Özsu Previous Example – Equivalent Query  ENAME  PNAME=“CAD/CAM”  (DUR=12  DUR=24)  ENAME≠“J. Doe” × PROJ ASG EMP ⋈ PNO,ENO  ENAME  DUR=12  DUR=24  PNAME=“CAD/CAM”  ENAME≠“J. DOE” PROJASGEMP ⋈ PNO ⋈ ENO

CS742 – Distributed & Parallel DBMSPage 3. 19M. Tamer Özsu EMP  ENAME  ENAME ≠ "J. Doe" ASGPROJ  PNO,ENAME  PNAME = "CAD/CAM"  PNO  DUR=24  DUR=24  PNO,ENO  PNO,ENAME Restructuring ⋈ PNO ⋈ ENO

CS742 – Distributed & Parallel DBMSPage 3. 20M. Tamer Özsu Step 2 – Data Localization Input: Algebraic query on distributed relations Determine which fragments are involved Localization program substitute for each global query its materialization program optimize

CS742 – Distributed & Parallel DBMSPage 3. 21M. Tamer Özsu Example Assume EMP is fragmented into EMP 1, EMP 2, EMP 3 as follows:  EMP 1 =  ENO≤“E3” (EMP)  EMP 2 =  “E3”<ENO≤“E6” (EMP)  EMP 3 =  ENO≥“E6 ” (EMP) ASG fragmented into ASG 1 and ASG 2 as follows:  ASG 1 =  ENO≤“E3” (ASG)  ASG 2 =  ENO>“E3” (ASG) Replace EMP by (EMP 1  EMP 2  EMP 3 ) and ASG by (ASG 1  ASG 2 ) in any query  ENAME  DUR=12  DUR=24  PNAME=“CAD/CAM”  ENAME≠“J. DOE” PROJ   EMP 1 EMP 2 EMP 3 ASG 1 ASG 2 ⋈ PNO ⋈ ENO

CS742 – Distributed & Parallel DBMSPage 3. 22M. Tamer Özsu Provides Parallellism EMP 3 ASG 1 EMP 2 ASG 2 EMP 1 ASG 1  EMP 3 ASG 2 ⋈ ENO

CS742 – Distributed & Parallel DBMSPage 3. 23M. Tamer Özsu Eliminates Unnecessary Work EMP 2 ASG 2 EMP 1 ASG 1 EMP 3 ASG 2  ⋈ ENO

CS742 – Distributed & Parallel DBMSPage 3. 24M. Tamer Özsu Reduction for PHF Reduction with selection Relation R and F R ={ R 1, R 2, …, R w } where R j =  p j ( R )  p i ( R j )=   if  x in R : ¬( p i ( x )  p j ( x )) Example SELECT* FROMEMP WHEREENO="E5"  ENO=“E5” EMP 1 EMP 2 EMP 3 EMP 2  ENO=“E5” 

CS742 – Distributed & Parallel DBMSPage 3. 25M. Tamer Özsu Reduction for PHF Reduction with join Possible if fragmentation is done on join attribute Distribute join over union ( R 1  R 2 ) ⋈ S  ( R 1 ⋈ S )  ( R 2 ⋈ S ) Given R i =  p i ( R ) and R j =  p j ( R ) R i ⋈ R j =  if  x in R i,  y in R j : ¬( p i ( x )  p j ( y ))

CS742 – Distributed & Parallel DBMSPage 3. 26M. Tamer Özsu Reduction for PHF Assume EMP is fragmented as before and ASG 1 :  ENO ≤ "E3" (ASG) ASG 2 :  ENO > "E3" (ASG) Consider the query SELECT* FROMEMP,ASG WHEREEMP.ENO=ASG.ENO Distribute join over unions Apply the reduction rule  EMP 1 EMP 2 EMP 3 ASG 1 ASG 2 ⋈ ENO  EMP 1 ASG 1 EMP 2 ASG 2 EMP 3 ASG 2 ⋈ ENO

CS742 – Distributed & Parallel DBMSPage 3. 27M. Tamer Özsu Reduction for VF Find useless (not empty) intermediate relations Relation R defined over attributes A = { A 1,..., A n } vertically fragmented as R i =  A ' ( R ) where A '  A :  D,K ( R i ) is useless if the set of projection attributes D is not in A ' Example: EMP 1 =  ENO,ENAME (EMP); EMP 2 =  ENO,TITLE (EMP) SELECTENAME FROMEMP EMP 1 EMP 2  ENAME ⋈ ENO  ENAME

CS742 – Distributed & Parallel DBMSPage 3. 28M. Tamer Özsu Reduction for DHF Rule : Distribute joins over unions Apply the join reduction for horizontal fragmentation Example ASG 1 : ASG ⋉ ENO EMP 1 ASG 2 : ASG ⋉ ENO EMP 2 EMP 1 :  TITLE=“Programmer” (EMP) EMP 2 :  TITLE=“Programmer” (EMP) Query SELECT * FROMEMP, ASG WHEREASG.ENO = EMP.ENO ANDEMP.TITLE = "Mech. Eng."

CS742 – Distributed & Parallel DBMSPage 3. 29M. Tamer Özsu Generic query Selections first Reduction for DHF   ASG 1  TITLE=“Mech. Eng.” ASG 2 EMP 1 EMP 2  ASG 1 ASG 2 EMP 2  TITLE=“Mech. Eng.” ⋈ ENO

CS742 – Distributed & Parallel DBMSPage 3. 30M. Tamer Özsu Joins over unions Reduction for DHF Elimination of the empty intermediate relations (left sub-tree)  ASG 1 EMP 2  TITLE=“Mech. Eng.” ASG 2  TITLE=“Mech. Eng.” ASG 2 EMP 2  TITLE=“Mech. Eng.” ⋈ ENO