Distributed Databases


Section 2: Distributed Databases

Section Content
2.1 Concepts
2.2 Advantages
2.3 Classification of Distributed Systems
2.4 Database Design
2.5 Distributed Query Processing
CA306 Introduction

2.1 Concepts
A Distributed Database (DDB) is a collection of nodes connected via a communication network. Each site is autonomous, but a partnership exists among a set of independent but co-operating centralised systems.
A Distributed Database Management System (DDBMS) is the software that permits the management of the DDBs and makes distribution transparent to users.
There are three basic architectures: networked with a single centralised database, shared memory, and shared nothing.

[Figure: Centralised in a Networked Architecture. Several client interfaces connect through a network to a single DBMS interface.]

Centralised in a Networked Architecture
Storage exists at a single site (with a shared disk architecture).
The architecture resembles a typical client-server architecture, although DDB transparencies exist.
This architecture is suited to a (conceptually) fully replicated environment, in which each client site sees the same data as all other sites.
It also suits a (conceptually) fully fragmented environment, in which each client sees a different view of the overall schema.

[Figure: Shared Memory Architecture. Three workstations running the NT2000 O/S, each with SQL Server and a DBMS interface, connected through a network.]

Shared Memory Architecture
Each node on the network operates in an autonomous fashion, with its own selected hardware and operating system setup.
However, each system runs (for example) distributed Oracle, and the systems share a common memory space in which transactions are processed.
Each site may hold copies of data which 'belong' to other sites, so updates must be synchronised.

[Figure: Shared Nothing Network. A UNIX cluster, an NT2000 O/S node and a VMS mainframe, each running Oracle with a DBMS interface, connected through a network.]

Shared Nothing Architecture
Each processor has its own autonomous processing and storage capabilities.
Each node is homogeneous with respect to operating system, database management system protocols and storage.
Communication is (typically) through a high-speed interconnection network.


2.2 Advantages
Management of distributed data with different levels of transparency. Transparencies:
Distribution: location transparency ensures that the user need not worry about the location or local name of data objects.
Replication: the user is unaware of data copies. These copies provide better availability, performance and reliability.
Fragmentation: horizontal and vertical fragmentation details are hidden from the user.
Increased reliability and availability.
Reliability is improved through a decrease in downtime, which is due to replication. Availability is the probability that the DDB runs for a predetermined time interval.

Advantages (ii)
Improved Performance
A distributed DBMS fragments the database so that data is stored at the site where it is needed most.
Fragmentation also implies that each database is smaller: instead of a single CPU processing one large database, multiple CPUs process many smaller databases.
Inter-query and intra-query parallelism can be achieved, as multiple queries (or parts of one query) can be run in parallel at separate sites.
Easier Expansion
Expansion is easier as it may involve no more than adding a new site. Expansion can be planned to suit the current distribution scheme.

System Overheads (i)
Controlling Data. It is necessary to monitor data distribution, fragmentation and replication by expanding the system catalog.
Distributed Query Processing. It is necessary to access multiple sites during the execution of global queries.
Optimisation. It is necessary to devise execution strategies based on factors such as the movement of data between sites and the speed of network connections between those sites.
Replicated Data Management. It is necessary to propagate changes from one site to all copies. This requires an ability to decide which copy is the master, and to maintain consistency among replicated sites.
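
The catalog and replication overheads above can be made concrete with a small sketch. It is only an illustration: the fragment names, site names and the `FragmentEntry` record are hypothetical and not part of the course material. It shows the kind of metadata an expanded catalog might record (which fragments make up a relation, which site holds the master copy, which sites hold replicas) and how that metadata answers "which sites must see this change?".

```python
from dataclasses import dataclass, field


@dataclass
class FragmentEntry:
    """Catalog record for one fragment of a global relation (hypothetical layout)."""
    relation: str                 # global relation the fragment belongs to
    master_site: str              # site holding the master copy
    replica_sites: list = field(default_factory=list)  # sites holding other copies

    def all_sites(self):
        """Master site first, followed by every replica site."""
        return [self.master_site] + list(self.replica_sites)


# Hypothetical catalog: fragment name -> catalog record.
catalog = {
    "EMPLOYEE_1": FragmentEntry("EMPLOYEE", "site1", ["site3"]),
    "EMPLOYEE_2": FragmentEntry("EMPLOYEE", "site2"),
}


def sites_to_update(fragment):
    """Sites that must receive a change made to the given fragment."""
    return catalog[fragment].all_sites()


def fragments_of(relation):
    """Fragments a global query on `relation` may have to visit."""
    return sorted(name for name, entry in catalog.items()
                  if entry.relation == relation)
```

For example, `sites_to_update("EMPLOYEE_1")` lists the master site followed by its replica, and `fragments_of("EMPLOYEE")` lists both fragments. A real DDBMS catalog would hold far more (fragment predicates, statistics, access rights).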

System Overheads (ii)
Distributed Database Recovery. A requirement to handle new types of failure (based on communication), and to recover from individual site crashes.
Security. Global transactions require the negotiation of different security systems. Authorisation and access privileges must be maintained.
Distributed Catalog Management. The catalog holds metadata for the entire DDBMS. A decision must be made at design time as to the fragmentation or replication (or both) of the system catalog.
Discuss catalog management further. What are the options? Centralised? Distributed? Or some combination of both?


2.3 Classification of Distributed Systems
Distributed databases have design alternatives along three dimensions: Autonomy, Distribution, Heterogeneity.
Autonomy refers to the distribution of control, and indicates the degree to which individual DBMSs can operate independently.
The distribution dimension deals with data. There are only two possibilities: data is distributed across multiple sites, or is stored at a single site.
Heterogeneity can occur in various forms: hardware, networking protocols, variations in database managers. The important ones relate to data models, query languages, and transaction management protocols.

[Figure: classification of distributed systems along three axes (Distribution, Autonomy, Heterogeneity). Labelled points: distributed homogeneous "federated" DBMSs; distributed homogeneous DBMSs; distributed heterogeneous DBMSs; logically integrated homogeneous multiple DBMSs; single-site homogeneous federated DBMSs; heterogeneous integrated DBMSs.]
Notes: Discuss with students the effects of moving along each axis. We will not cover federated databases in this course, so explain them only briefly (for the sake of completeness). A typical exam question might be "Along what dimensions can a distributed db ….."


2.4 Database Design
Early research into DDBSs suggests the organisation of distributed systems along three orthogonal dimensions: level of sharing; behaviour of access patterns; level of knowledge on access pattern behaviour.
The first property looks at how data is shared between users; the second looks at issues such as static and dynamic access patterns; and the third looks at how much information is available regarding access patterns.
Notes:
2. For example, A uses Ri, Rj and Rk, and B uses Rk and Rl.
3. A only uses Rk on Fridays. A's usage of Ri is intensive at month end.

Top-down design
Top-down design is suited to a "green-field" type of application, whereas bottom-up design is generally employed where systems already exist.
Requirements Analysis → Objectives
Conceptual Design → the Global Conceptual Schema
View Design → Access Information and External Schema Definitions
Distributed Design → Local Conceptual Schemas
Physical Design → Physical Schema
Observation & Monitoring → Feedback
Notes:
1. What are the apps?
2. Build the global schema.
3. Define all views.
4. Distribute the global schema.
5. Determine the low-level distribution/indexes/clusters.
6. Continue to tweak and watch for app evolution and migration.


Issues
Why fragment?
How should fragmentation be performed? (horizontally v vertically)
How much should be fragmented? This is an important issue as it affects the performance of query execution; aim to find a balance between large and small units.
Can we test the correctness of decomposition? (Observe rules)
How is allocation performed? (choose sites, replication required?)
What is the necessary information for fragmentation and allocation? (database information, application information, communication network information and computer system information)

Correctness Rules of Fragmentation
The following three rules should be enforced during fragmentation; together they ensure that the database does not undergo semantic change during fragmentation.
Completeness. If a relation instance R is decomposed into fragments R1, R2, …, Rn, each data item that can be found in R can also be found in one or more of the Ri. This property is identical to the lossless decomposition property of normalisation.
Reconstruction. If a relation R is decomposed into fragments R1, R2, …, Rn, it should be possible to define a relational operator ∇ such that R = ∇ Ri, ∀ Ri ∈ FR, where FR = {R1, R2, …, Rn}. The operator ∇ will be different for different fragmentations, but the operation must be identified.

Rules
Disjointness. If a relation instance R is decomposed into fragments R1, R2, …, Rn, and data item di resides in Rj, it cannot reside in any other fragment Rk (k ≠ j). This criterion ensures that the horizontal fragments are disjoint.
Note that the primary key is often repeated in all fragments for vertical partitioning; thus, disjointness is defined only on the non-primary key attributes of a relation.
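
The three correctness rules can be checked mechanically for a horizontal fragmentation. The sketch below is a minimal illustration, assuming relations are modelled as Python sets of tuples and that the reconstruction operator for horizontal fragments is set union; the toy relation and the function names are invented for the example.

```python
def is_complete(relation, fragments):
    """Completeness: every tuple of the relation appears in at least one fragment."""
    return all(any(t in f for f in fragments) for t in relation)


def reconstruct(fragments):
    """Reconstruction operator for horizontal fragments: set union."""
    result = set()
    for f in fragments:
        result |= f
    return result


def is_disjoint(fragments):
    """Disjointness: no tuple appears in more than one fragment."""
    total = sum(len(f) for f in fragments)
    return total == len(reconstruct(fragments))


# Toy relation of (name, dept_no) tuples, fragmented horizontally on dept_no.
R = {("Ann", 1), ("Bob", 2), ("Cat", 1)}
R1 = {t for t in R if t[1] == 1}
R2 = {t for t in R if t[1] != 1}

print(is_complete(R, [R1, R2]))      # completeness holds
print(reconstruct([R1, R2]) == R)    # union rebuilds R
print(is_disjoint([R1, R2]))         # fragments do not overlap
```

For a vertical fragmentation the reconstruction operator would instead be a join on the repeated primary key, and disjointness would be checked on the non-key attributes only.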


2.5 Query Processing
The main function of a relational query processor is to transform a high-level query into an equivalent lower-level query. The low-level query implements the execution strategy for the query; that is, it contains the information required to carry it out.
The transformation must achieve correctness and efficiency. The well-defined mapping between relational calculus and algebra makes the correctness issue easy. However, producing an execution strategy that is efficient is more complex.
A relational calculus query may have many equivalent transformations in relational algebra. The issue is to select the execution strategy that minimises resource consumption.
In a distributed system, relational algebra is not enough to express execution strategies. It must be supplemented with operations for exchanging data between sites. For example, the distributed query processor must select the best sites to process data.

Sample DB
Site 1 (containing a table called EMPLOYEE)
  {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo}
  10,000 tuples (each 100 bytes in length)
  RSI is 9 bytes; DeptNo is 4 bytes; Fname is 15 bytes; Lname is 15 bytes
Site 2 (containing a table called DEPARTMENT)
  {Dname, Dnumber, MgrRSI, MGRStartdate}
  100 tuples (each 35 bytes in length)
  Dnumber is 4 bytes; Dname is 10 bytes; MgrRSI is 9 bytes
Properties
  Size of EMPLOYEE is 10,000 * 100 = 1,000,000 bytes
  Size of DEPARTMENT is 100 * 35 = 3,500 bytes
  EMPLOYEE.DeptNo = DEPARTMENT.Dnumber

Sample Query 1
For each employee, retrieve the employee name and the department in which that employee works.
The result of the query will include 10,000 tuples (assuming that every employee has a valid department). Each tuple in the result requires 40 bytes (Fname 15 + Lname 15 + Dname 10).
The query is executed at Site 3 (the result site). Three strategies exist for execution of the distributed query.
If minimising the amount of data transfer is the optimisation criterion, which strategy is selected?

Strategy 1
Transfer both the EMPLOYEE and DEPARTMENT relations to the result site, and perform the join there.
Site 1 (Employee) sends E = {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo} to Site 3.
Site 2 (Dept) sends D = {Dname, Dnumber, MgrRSI, MGRStartdate} to Site 3.
Transfer amount = 1,000,000 + 3,500 = 1,003,500 bytes

Strategy 2
Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3.
Site 1 (Employee) sends E = {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo} to Site 2.
Site 2 (Dept) sends R = {Fname, Lname, Dname} to Site 3.
Transfer 1,000,000 bytes to Site 2; Query result size = 40 * 10,000 = 400,000 bytes; Transfer amount = 1,000,000 + 400,000 = 1,400,000 bytes.

Strategy 3
Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and transfer the result to site 3.
Site 2 (Dept) sends D = {Dname, Dnumber, MgrRSI, MGRStartdate} to Site 1.
Site 1 (Employee) sends R = {Fname, Lname, Dname} to Site 3.
Transfer 3,500 bytes to Site 1; Query result size = 40 * 10,000 = 400,000 bytes; Transfer amount = 3,500 + 400,000 = 403,500 bytes.

Sample Query 2
For each department, retrieve the department name, and the name of the department manager.
Assume the query is again submitted at site 3, and that the result contains 100 tuples (of 40 bytes).

Strategy 1
Transfer both EMPLOYEE and DEPARTMENT to site 3, and perform the join there.
Site 1 (Employee) sends E = {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo} to Site 3.
Site 2 (Dept) sends D = {Dname, Dnumber, MgrRSI, MGRStartdate} to Site 3.
Transfer amount = 1,000,000 + 3,500 = 1,003,500 bytes

Strategy 2
Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3.
Site 1 (Employee) sends E = {Fname, Lname, RSI, DOB, Address, Sex, Salary, DeptNo} to Site 2.
Site 2 (Dept) sends R = {Fname, Lname, Dname} to Site 3.
Transfer 1,000,000 bytes to Site 2; Query result size = 40 * 100 = 4,000 bytes; Transfer amount = 1,000,000 + 4,000 = 1,004,000 bytes.

Strategy 3
Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and transfer the result to site 3.
Site 2 (Dept) sends D = {Dname, Dnumber, MgrRSI, MGRStartdate} to Site 1.
Site 1 (Employee) sends R = {Fname, Lname, Dname} to Site 3.
Transfer 3,500 bytes to Site 1; Query result size = 40 * 100 = 4,000 bytes; Transfer amount = 3,500 + 4,000 = 7,500 bytes.
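
The transfer arithmetic for both sample queries can be reproduced in a few lines. This is a sketch under the slides' own assumptions (the result site holds no data, and bytes transferred is the only cost); the function and variable names are made up for the illustration.

```python
def transfer_costs(employee_bytes, department_bytes, result_bytes):
    """Bytes shipped by each strategy when the result site (site 3) holds no data."""
    return {
        1: employee_bytes + department_bytes,  # ship both relations to site 3
        2: employee_bytes + result_bytes,      # EMPLOYEE to site 2, result to site 3
        3: department_bytes + result_bytes,    # DEPARTMENT to site 1, result to site 3
    }


EMPLOYEE = 10_000 * 100   # 1,000,000 bytes at site 1
DEPARTMENT = 100 * 35     # 3,500 bytes at site 2

query1 = transfer_costs(EMPLOYEE, DEPARTMENT, 10_000 * 40)  # 10,000 result tuples
query2 = transfer_costs(EMPLOYEE, DEPARTMENT, 100 * 40)     # 100 result tuples

# Pick the strategy that ships the fewest bytes for each query.
best1 = min(query1, key=query1.get)
best2 = min(query2, key=query2.get)

print(query1, "best:", best1)
print(query2, "best:", best2)
```

With these figures strategy 3 ships the least data for both queries, because DEPARTMENT is far smaller than EMPLOYEE and so is the cheaper relation to move.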

Exercises
Determine what the result would be if the projection of each table was executed before they left the site (e.g. π Dnumber(DEPARTMENT) and π <DeptNo, Fname, Lname>(EMPLOYEE) for query 1).
Determine the best strategy if the query is executed at site 2.

Processing Layers
[Figure: the query processing layers, showing the information used for processing at each layer and the output passed from each step to the next.]

Query Decomposition
The first layer decomposes the distributed calculus query into an algebraic query. Query decomposition can be viewed as four successive steps:
1. rewrite the calculus query in a normalised form (suitable for subsequent manipulations);
2. analyse the normalised query to detect incorrect queries (reject them early);
3. simplify the correct query (e.g. eliminate redundant predicates);
4. transform the calculus query into an algebraic query.
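
As a small illustration of the simplification step, the sketch below removes repeated predicates from a conjunction, using the idempotency rule p AND p = p. It assumes a hypothetical representation in which a WHERE clause is a list of (attribute, operator, value) tuples joined by AND; a real query processor would apply many more rewrite rules.

```python
def simplify_conjunction(predicates):
    """Drop repeated predicates from a conjunction (p AND p == p), keeping order."""
    seen = set()
    simplified = []
    for p in predicates:
        if p not in seen:
            seen.add(p)
            simplified.append(p)
    return simplified


# Hypothetical WHERE clause: DeptNo = 5 AND Salary > 30000 AND DeptNo = 5
where = [("DeptNo", "=", 5), ("Salary", ">", 30000), ("DeptNo", "=", 5)]

print(simplify_conjunction(where))
```

The duplicate `DeptNo = 5` predicate is dropped and the remaining two predicates keep their original order.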

Data Localisation
The input into this layer is the algebraic transformation of the query. The main role of this layer is to localise the query's data using data distribution information: determine which fragments are involved in the query and transform the distributed query into fragment queries. There are two steps:
1. The distributed query is mapped into a fragment query by substituting each distributed relation by its materialisation program.
2. The fragment query is simplified and restructured to another correct query.
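
The two localisation steps can be sketched for a simple selection. The example assumes a hypothetical horizontal fragmentation of EMPLOYEE by ranges of DeptNo (the fragment names and ranges are invented): step 1 replaces the relation with the union of its fragments, and step 2 drops fragments that cannot contain matching tuples.

```python
# Hypothetical horizontal fragmentation of EMPLOYEE on DeptNo:
# each fragment holds the tuples with low <= DeptNo <= high.
FRAGMENTS = {
    "EMPLOYEE_1": (1, 50),
    "EMPLOYEE_2": (51, 100),
}


def materialisation_program(fragments):
    """Step 1: the relation expressed as the union of its horizontal fragments."""
    return " UNION ".join(sorted(fragments))


def localise_selection(fragments, dept_no):
    """Step 2 for a selection 'DeptNo = dept_no':
    keep only the fragments whose range can contain matching tuples."""
    return sorted(name for name, (low, high) in fragments.items()
                  if low <= dept_no <= high)


print(materialisation_program(FRAGMENTS))
print(localise_selection(FRAGMENTS, 7))
```

A selection on `DeptNo = 7` therefore only needs to be sent to the site holding `EMPLOYEE_1`; the other fragment is eliminated before any data moves.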

Global Query Optimisation
The input to this layer is an algebraic fragment query. The goal of the query optimiser is to locate an execution strategy for the query that is close to optimal. This consists of finding the best ordering of operations in the fragment query.
An important aspect of query optimisation is join ordering, since permutations of joins within the query may lead to improvements of orders of magnitude.
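
To see why join ordering matters, the sketch below enumerates every left-deep join order for three relations and scores each by the total size of its intermediate results. The relation names, cardinalities, selectivities and the cost model itself are all assumptions chosen for illustration; real optimisers use richer statistics and also account for communication cost.

```python
from itertools import permutations

# Hypothetical cardinalities and pairwise join selectivities.
CARD = {"A": 1000, "B": 100, "C": 10}
SEL = {
    frozenset(("A", "B")): 0.01,
    frozenset(("B", "C")): 0.1,
    frozenset(("A", "C")): 1.0,   # no join predicate: Cartesian product
}


def plan_cost(order):
    """Sum of intermediate result sizes for a left-deep join in this order."""
    joined = [order[0]]
    size = CARD[order[0]]
    cost = 0.0
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            size *= SEL[frozenset((prev, rel))]
        joined.append(rel)
        cost += size
    return cost


def best_order(relations):
    """Exhaustively pick the cheapest left-deep order under the toy cost model."""
    return min(permutations(relations), key=plan_cost)


for order in permutations("ABC"):
    print(order, plan_cost(order))
print("best:", best_order("ABC"))
```

Under these assumed numbers, orders that join the two small relations B and C first are roughly ten times cheaper than orders that start with the A and C Cartesian product. Exhaustive enumeration is only feasible for a handful of relations, which is why the slide speaks of a strategy that is "close to optimal".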

Local Query Optimisation
The final layer is performed by all sites having fragments involved in the query. Each sub-query executing at a local site is optimised using the local schema of that site.