CENG 553 Database Management Systems1 Distributed Databases.

Slides:



Advertisements
Similar presentations
Database Systems: Design, Implementation, and Management
Advertisements

V. Megalooikonomou Distributed Databases (based on notes by Silberchatz,Korth, and Sudarshan and notes by C. Faloutsos at CMU) Temple University – CIS.
Enterprise Systems Distributed databases and systems - DT
Distributed Databases John Ortiz. Lecture 24Distributed Databases2  Distributed Database (DDB) is a collection of interrelated databases interconnected.
Transaction.
MIS 385/MBA 664 Systems Implementation with DBMS/ Database Management Dave Salisbury ( )
Chapter 13 (Web): Distributed Databases
Advanced Database Systems September 2013 Dr. Fatemeh Ahmadi-Abkenari 1.
Chapter 25 Distributed Databases and Client-Server Architectures Copyright © 2004 Pearson Education, Inc.
1 Distributed Databases Chapter Two Types of Applications that Access Distributed Databases The application accesses data at the level of SQL statements.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Overview Distributed vs. decentralized Why distributed databases
1 © Prentice Hall, 2002 Chapter 13: Distributed Databases Modern Database Management 6 th Edition Jeffrey A. Hoffer, Mary B. Prescott, Fred R. McFadden.
1 Distributed Databases Chapter What is a Distributed Database? Database whose relations reside on different sites Database some of whose relations.
©Silberschatz, Korth and Sudarshan19.1Database System Concepts Lecture-10 Distributed Database System A distributed database system consists of loosely.
Chapter 12 Distributed Database Management Systems
Chapter 13 (Web): Distributed Databases
©Silberschatz, Korth and Sudarshan18.1Database System Concepts Centralized Systems Run on a single computer system and do not interact with other computer.
Definition of terms Definition of terms Explain business conditions driving distributed databases Explain business conditions driving distributed databases.
Distributed Databases
Distributed databases
Distributed Databases
Distributed Databases and DBMSs: Concepts and Design
Distributed Databases Dr. Lee By Alex Genadinik. Distributed Databases? What is that!?? Distributed Database - a collection of multiple logically interrelated.
1 Distributed and Parallel Databases. 2 Distributed Databases Distributed Systems goal: –to offer local DB autonomy at geographically distributed locations.
Database Design – Lecture 16
DISTRIBUTED DATABASES IN ADBMS Shilpa Seth
04/18/2005Yan Huang - CSCI5330 Database Implementation – Distributed Database Systems Distributed Database Systems.
DISTRIBUTED DATABASE SYSTEM.  A distributed database system consists of loosely coupled sites that share no physical component  Database systems that.
Session-9 Data Management for Decision Support
Massively Distributed Database Systems - Distributed DBS Spring 2014 Ki-Joune Li Pusan National University.
Lecture 5: Sun: 1/5/ Distributed Algorithms - Distributed Databases Lecturer/ Kawther Abas CS- 492 : Distributed system &
Session-8 Data Management for Decision Support
Lecture 16- Distributed Databases Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch
10 1 Chapter 10 Distributed Database Management Systems Database Systems: Design, Implementation, and Management, Sixth Edition, Rob and Coronel.
Database Systems: Design, Implementation, and Management Tenth Edition Chapter 12 Distributed Database Management Systems.
Database Systems: Design, Implementation, and Management Ninth Edition Chapter 12 Distributed Database Management Systems.
Overview – Chapter 11 SQL 710 Overview of Replication
Distributed Databases Reference Books: An introduction to Database Systems - By C.J. Database Systems and Concepts – Silberchatz, Korth and Sudarshan Lecture.
Distributed Databases DBMS Textbook, Chapter 22, Part II.
Kjell Orsborn UU - DIS - UDBL DATABASE SYSTEMS - 10p Course No. 2AD235 Spring 2002 A second course on development of database systems Kjell.
Distributed Databases
ASMA AHMAD 28 TH APRIL, 2011 Database Systems Distributed Databases I.
1 Distributed Databases BUAD/American University Distributed Databases.
Databases Illuminated
Distributed Database. Introduction A major motivation behind the development of database systems is the desire to integrate the operational data of an.
1 Distributed Databases Chapter 21, Part B. 2 Introduction v Data is stored at several sites, each managed by a DBMS that can run independently. v Distributed.
Topic Distributed DBMS Database Management Systems Fall 2012 Presented by: Osama Ben Omran.
MBA 664 Database Management Systems Dave Salisbury ( )
Chapter 19 Distributed Databases. 2 Distributed Database System n A distributed DBS consists of loosely coupled sites that share no physical component.
Chapter 12 Distributed Data Bases. Learning Objectives What a distributed database management system (DDBMS) is and what its components are How database.
Introduction to Distributed Databases Yiwei Wu. Introduction A distributed database is a database in which portions of the database are stored on multiple.
 Distributed Database Concepts  Parallel Vs Distributed Technology  Advantages  Additional Functions  Distribution Database Design  Data Fragmentation.
Distributed DBMS, Query Processing and Optimization
Distributed Database Design Bayu Adhi Tama, MTI Fasilkom-Unsri Adapted from Connolly, et al., Database Systems 4 th Edition, Pearson Education Limited,
1 Information Retrieval and Use De-normalisation and Distributed database systems Geoff Leese September 2008, revised October 2009.
Topics in Distributed Databases Database System Implementation CSE 507 Some slides adapted from Navathe et. Al and Silberchatz et. Al.
© 2009 Pearson Education, Inc. Publishing as Prentice Hall 1 Lecture 11 Distributed Databases Modern Database Management 9 th Edition Jeffrey A. Hoffer,
CMS Advanced Database and Client-Server Applications Distributed Databases slides by Martin Beer and Paul Crowther Connolly and Begg Chapter 22.
Distributed Databases
1 Chapter 22 Distributed DBMSs - Concepts and Design Simplified Transparencies © Pearson Education Limited 1995, 2005.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 22: Distributed.
Distributed Databases “Fundamentals”
Distributed Databases and Client-Server Architectures
Database System Implementation CSE 507
Chapter 19: Distributed Databases
Distributed Databases
Distributed Databases
Distributed Databases
Presentation transcript:

CENG 553 Database Management Systems1 Distributed Databases

CENG 553 Database Management Systems2 What is a Distributed Database? Database whose relations reside on different sites Database some of whose relations are replicated at different sites Database whose relations are split between different sites

CENG 553 Database Management Systems3 Two Types of Applications that Access Distributed Databases The application accesses data at the level of SQL statements –Example: company has nationwide network of warehouses, each with its own database; a transaction can access all databases using their schemas The application accesses data at a database using only stored procedures provided by that database. –Example: purchase transaction involving a merchant and a credit card company, each providing stored subroutines for its subtransactions

CENG 553 Database Management Systems4 Database and Query Design Issues How should a distributed database be designed? At what site should each item be stored? Which items should be replicated and at which sites? How should queries that access multiple databases be processed? How do issues of query optimization affect query design?

CENG 553 Database Management Systems5 Application Designer’s View of a Distributed Database Designer might see the individual schemas of each local database -- called a multidatabase -- in which case distribution is visible –Can be homogeneous (all databases from one vendor) or heterogeneous (databases from different vendors) Designer might see a single global schema that integrates all local schemas (is a view) in which case distribution is hidden

CENG 553 Database Management Systems6 Views of Distributed Data (a) Multidatabase with local schemas (b) Integrated distributed database with global schema

CENG 553 Database Management Systems7 Multidatabases Application must explicitly connect to each site Application accesses data at a site using SQL statements based on that site’s schema Application may have to do reformatting in order to integrate data from different sites Application must manage replication –Know where replicas are stored and decide which replica to access

CENG 553 Database Management Systems8 Global Schemas Middleware provides integration of local schemas into a global schema –Application need not connect to each site –Application accesses data using global schema Need not know where data is stored – location transparency –Global joins are supported –Middleware performs necessary data reformatting –Middleware manages replication – replication transparency

Location Transparency –User does not have to know the location of the data. –Data requests automatically forwarded to appropriate sites Local Autonomy –Local site can operate with its database when network connections fail –Each site controls its own data, security, logging, recovery Objectives : Distributed Architecture

Distributed Data Storage Fragmentation: –Relation is partitioned into several fragments stored in distinct sites Replication: –System maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance. Replication and fragmentation can be combined: –Relation is partitioned into several fragments: System maintains several identical replicas of each such fragment.

Data Fragmentation Division of relation r into fragments r 1, r 2, …, r n which contain sufficient information to reconstruct relation r. Horizontal fragmentation: each tuple of r is assigned to one or more fragments. Vertical fragmentation: the schema for relation r is split into several smaller schemas.

12 CENG 553 Database Management Systems12 Storing Data v Fragmentation –Horizontal: Usually disjoint. –Vertical: Lossless-join; tids. v Replication –Gives increased availability. –Faster query evaluation. –Synchronous vs. Asynchronous. u Vary in how current copies are. TID t1 t2 t3 t4 R1 R2 R3 SITE A SITE B

CENG 553 Database Management Systems13 Horizontal Fragmentation Each fragment, T i, of table T contains a subset of the rows and each row is in exactly one fragment: T i =  C i (T) T =  T i –Horizontal fragmentation is lossless T4T4 T3T3 T2T2 T1T1 T

CENG 553 Database Management Systems14 Horizontal Fragmentation Example: An Internet grocer has a relation describing inventory at each warehouse Inventory Inventory(StockNum, Amount, Price, Location) It fragments the relation by location and stores each fragment locally: rows with Location = ‘Chicago’ are stored in the Chicago warehouse in a fragment Inventory_ch Inventory_ch(StockNum, Amount, Price, Location) Alternatively, it can use the schema Inventory_ch Inventory_ch(StockNum, Amount, Price)

CENG 553 Database Management Systems15 Vertical Fragmentation Each fragment, T i, of T contains a subset of the columns, each column is in at least one fragment, and each fragment includes the key: T i =  attr_list i (T) T = T 1 T 2 ….. T n –Vertical fragmentation is lossless Example: The Internet grocer has a relation Employee Employee(SSnum, Name, Salary, Title, Location) –It fragments the relation to put some information at headquarters and some elsewhere: Emp1 Emp1(SSnum, Name, Salary) – at headquarters Emp2 Emp2(SSnum, Name, Title, Location) – elsewhere

Data Replication A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites. Full replication of a relation is the case where the relation is stored at all sites. Fully redundant databases are those in which every site contains a copy of the entire database.

Data Replication (Cont.) Advantages of Replication: –Availability: failure of site containing relation r does not result in unavailability of r if replicas exist. –Parallelism: queries on r may be processed by several nodes in parallel. –Reduced data transfer: relation r is available locally at each site containing a replica of r.

Data Replication (Cont.) Disadvantages of Replication –Increased cost of updates: each replica of relation r must be updated. –Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented. One solution: choose one copy as primary copy and apply concurrency control operations on primary copy.

CENG 553 Database Management Systems19 Replication Example Internet grocer might have relation Customer Customer(CustNum, Address, Location) –Queries are executed At headquarters to produce monthly mailings At a warehouse to obtain information about deliveries –Updates are executed At headquarters when new customer registers and when information about a customer changes

CENG 553 Database Management Systems20 Example (con’t) Intuitively it seems appropriate to either or both: –Store complete relation at headquarters –Horizontally fragment a replica of the relation and store a fragment at the corresponding warehouse site Each row is replicated: one copy at headquarters, one copy at a warehouse The relation can be both distributed and replicated

CENG 553 Database Management Systems21 Example (con’t): Performance Analysis We consider three alternatives: –Store the entire relation at the headquarters site and nothing at the warehouses (no replication) –Store the fragments at the warehouses and nothing at the headquarters (no replication) –Store entire relation at headquarters and a fragment at each warehouse (replication)

CENG 553 Database Management Systems22 Example (con’t): Performance Analysis - Assumptions To evaluate the alternatives, we estimate the amount of information that must be sent between sites. Assumptions: Customer –The Customer relation has 100,000 rows –The headquarters mailing application sends each customer 1 mailing a month –500 deliveries are made each day; a single row is read for each delivery –100 new customers/day –Changes to customer information occur infrequently

CENG 553 Database Management Systems23 Example: The Evaluation Entire relation at headquarters, nothing at warehouses –500 tuples per day from headquarters to warehouses for deliveries Fragments at warehouses, nothing at headquarters –100,000 tuples per month from warehouses to headquarters for mailings (3,300 tuples per day, amortized) –100 tuples per day from headquarters to warehouses for new customer registration Entire relation at headquarters, fragments at warehouses –100 tuples per day from headquarters to warehouses for new customer registration

CENG 553 Database Management Systems24 Example: Conclusion Replication (case 3) seems best, if we count the number of transmissions. Let us look at another measure: –If we replicate, the time to register a new customer might suffer because of the remote update But this update can be done by a separate transaction after the registration transaction commits (asynchronous update)

CENG 553 Database Management Systems25 Query Planning Systems that support a global schema contain a global query optimizer, which analyzes each global query and translates it into an appropriate sequence of steps to be executed at each site In a multidatabase system, the query designer must manually decompose each global query into a sequence of SQL statements to be executed at each site –Thus a query designer must be her own query optimizer

26 CENG 553 Database Management Systems26 Distributed Query Processing v Cost-based approach: consider all plans, pick cheapest; similar to centralized optimization. –Difference 1: Communication costs must be considered. –Difference 2: Local site autonomy must be respected. –Difference 3: New distributed join methods. v Query site constructs global plan, with suggested local plans describing processing at each site. –If a site can improve suggested local plan, free to do so.

CENG 553 Database Management Systems27 Planning Global Joins Suppose an application at site A wants to join tables at sites B and C. Two straightforward approaches –Transmit both tables to site A and do the join there The application explicitly tests the join condition This approach must be used in multidatabase systems –Transmit the smaller of the tables, e.g. the table at site B, to site C; execute the join there; transmit the result to site A This approach might be used in a homogenous distributed database system

CENG 553 Database Management Systems28 Global Join Example Site B Student Student(Id, Major) Site C Transcript Transcript(StudId, CrsCode) Application at Site A wants to compute join with join condition StudentTranscript Student.Id = Transcript.StudId

CENG 553 Database Management Systems29 Assumptions Lengths of attributes –Id and StudId: 9 bytes –Major: 3 bytes –CrsCode: 6 bytes Student:Student: 15,000 tuples, each of length 12 bytes Transcript:Transcript: 20,000 tuples, each of length 15 bytes –5000 students are registered for at least 1 course (10,000 students are not registered – summer session) –Each student is registered for 4 courses on the average

CENG 553 Database Management Systems30 Comparison of Alternatives Send both tables to site A, do join there: –have to send 15,000* ,000*15 = 480,000 bytes StudentSend the smaller table, Student, from site B to site C, compute the join there. Then send result to Site A: –have to send 15,000* ,000*18 = 540,000 bytes Alternative 1 is better

CENG 553 Database Management Systems31 Another Alternative: Semijoin Step1: PTranscript At site C: Compute P =  StudId (Transcript) P Send P to site B –P contains Ids of students registered for at least 1 course –Student –Student tuples having Ids in P contribute to join Step 2: QStudentP At site B: Compute Q = Student Id = StudId P Q Send Q, to site A Student –Q contains tuples of Student corresponding to students registered for at least 1 course (i.e., 5,000 students out of 15,000) semijoinStudent –Q is a semijoin – the set of all Student tuples that will participate in the join Step 3: Transcript Send Transcript to site A Transcript Q At site A: Compute Transcript Id = StudId Q

CENG 553 Database Management Systems32 Comparing Semijoin with Previous Alternatives In step 1: 45,000 = 5,000*9 bytes sent In step 2: 60,000 = 5,000*12 bytes sent In step 3: 300,000 = 20,000*15 bytes sent In total: 405,000 = 45, , ,000 bytes sent Semijoin is the best of the three alternatives

CENG 553 Database Management Systems33 Definition of Semijoin The semijoin of two relations, T 1 and T 2, is defined as: T 1 join_cond T 2 =  attributes(T 1 ) ( T 1 join_cond T 2 ) –In other words, the semijoin consists of the tuples in T 1 that participate in the join with T 2

CENG 553 Database Management Systems34 Using the Semijoin To compute T 1 join_cond T 2 using a semijoin, first compute T 1 join_cond T 2 then join it with T 2 :  attributes(T 1 ) ( T 1 join_cond T 2 ) join_cond T 2

CENG 553 Database Management Systems35 Queries that Involve Joins and Selections Suppose the Internet grocer relation Employee is vertically fragmented as Emp1 Emp1(SSnum, Name, Salary) at Site B Emp2 Emp2(SSnum, Title, Location) at Site C A query at site A wants the names of all employees with Title = ‘manager’ and Salary > ‘20000’ Solution 1: First do join then selection: –Semijoin not helpful here: all tuples of each table must be brought together to form join (due to vertical fragmentation) Emp1Emp2  Name (  Title=‘manager’ AND Salary>’20000’ (Emp1 Emp2))

CENG 553 Database Management Systems36 Queries that Involve Joins and Selections Solution 2: Do selections before the join: Emp1 R1At site B, select all tuples from Emp1 satisfying Salary > ‘20000’; call the result R1 Emp2 R2At site C, select all tuples from Emp2 satisfying Title = ‘manager’; call the result R2 R1 R2)At some site to be determined by minimizing communication costs, compute  Name (R1 R2); Send result to site A –In a multidatabase, join must be performed at Site A, but communication costs are reduced because only “selected” data needs to be sent Emp1Emp2  Name ((  Salary>’20000’ (Emp1)) (  Title=‘manager’ (Emp2)))

CENG 553 Database Management Systems37 Exercise Consider the following global and fragmentation schemas in a distributed database: Global schema: DOCTOR(DNO, DNAME, DEPT) PATIENT(PNO, PNAME, DEPT, TREAT, DNO) CARE(PNO, DRUG, QUAN) Fragmentation schema: D1 =  DEPT = ‘SURGERY’ (DOCTOR) D2 =  DEPT = ‘PEDIATRICS’ (DOCTOR) D3 =  DEPT  ‘SURGERY’  DEPT  ‘PEDIATRICS’ (DOCTOR) P1 =  DEPT = ‘SURGERY’  TREAT = ‘INTENSIVE’ (PATIENT) P2 =  DEPT = ‘SURGERY’  TREAT  ‘INTENSIVE’ (PATIENT) P3 =  DEPT  ‘SURGERY’ (PATIENT) C1 = CARE P1 C2 = CARE P2 C3 = CARE P3

CENG 553 Database Management Systems38 Exercise (con’t) Translate the following global queries into fragment queries and simplify them. 1.List names of patients who use aspirin in their care. 2.List names of doctors who have prescribed aspirin to patients undergoing intensive treatment.

CENG 553 Database Management Systems39 Choices to be Made by a Distributed Database Application Designer Place tables at different sites Fragment tables in different ways and place fragments at different sites Replicate tables or data within tables and place replicas at different sites In multidatabase systems, do manual “query optimization”: choose an optimal sequence of SQL statements to be executed at each site