Parallel Databases (Fall 2008)

Presentation transcript:

Slide 1: Parallel Databases (Fall 2008)

Slide 2: Ideal Parallel Systems
Two key properties:
- Linear speedup: twice as much hardware can perform the task in half the elapsed time (i.e., speedup = number of processors).
- Linear scaleup: twice as much hardware can perform twice as large a task in the same elapsed time (i.e., scaleup = 1).
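The two metrics can be written as simple ratios of elapsed times. The sketch below is only an illustration: the function names and the timing values are hypothetical, not measurements from the slides.

```python
def speedup(small_system_time: float, big_system_time: float) -> float:
    """Same task on both systems: elapsed time on the small system
    divided by elapsed time on the big system."""
    return small_system_time / big_system_time


def scaleup(small_system_small_task: float, big_system_big_task: float) -> float:
    """Task grows with the system: elapsed time of the small task on the
    small system divided by elapsed time of the big task on the big system."""
    return small_system_small_task / big_system_big_task


# Hypothetical timings (seconds).
# Doubling the hardware halves the elapsed time: linear speedup for 2x hardware.
print(speedup(100.0, 50.0))   # 2.0
# Doubling hardware and task size leaves elapsed time unchanged: linear scaleup.
print(scaleup(100.0, 100.0))  # 1.0
```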

Slide 3: Barriers to Parallelism
- Startup: the time needed to start a parallel operation (thread creation/connection overhead) may dominate the actual computation time.
- Interference: when accessing shared resources, each new process slows down the others (hot spot problem).
- Skew: the response time of a set of parallel processes is the time of the slowest one.
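A toy cost model makes the startup and skew barriers concrete. It rests on assumptions that are not in the slides: processes are started one after another at a fixed cost each, and interference is ignored entirely. The function name and the numbers are hypothetical.

```python
def elapsed_time(partition_work, startup_per_process):
    """Toy model: sequential startup cost for every process, then the
    response time is set by the slowest process (skew)."""
    n_processes = len(partition_work)
    return startup_per_process * n_processes + max(partition_work)


serial_time = 120.0  # hypothetical time for one processor to do all the work

# Perfectly balanced over 4 processes: 4 * 1.0 startup + 30.0 work.
balanced = elapsed_time([30.0, 30.0, 30.0, 30.0], startup_per_process=1.0)
# Same total work, but one process received half of it.
skewed = elapsed_time([60.0, 20.0, 20.0, 20.0], startup_per_process=1.0)

print(balanced, serial_time / balanced)  # 34.0 and a speedup below 4
print(skewed, serial_time / skewed)      # 64.0 and a speedup below 2
```

Even in the balanced case the speedup falls short of 4 because of startup cost; with skew, the slowest process alone determines the response time.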

Slide 4: The Challenges
The ideal database machine has:
- A single, infinitely fast processor.
- An infinitely large memory with infinite bandwidth.
Unfortunately, technology is not delivering such machines.
The challenges are:
- To build an infinitely fast processor out of infinitely many processors of finite speed.
- To build an infinitely large memory out of infinitely many storage units of finite speed.

Slide 5: Why Parallel Databases?
- High-performance, low-cost commodity components have recently become available.
- Microprocessor-based systems are much cheaper than traditional mainframes.
- Widespread adoption of the relational data model.
- The relational data model is ideally suited to parallel execution.
- Terabyte online databases are becoming common as the price of online storage decreases.
- It is difficult to build mainframes powerful enough to meet the I/O demands of large relational databases.

Slide 6: Hardware Architecture
[Figure: two architectures. Shared-Everything (SE): processors P0, P1, ..., Pn reach shared memory and shared disks through a communication network. Shared-Disk (SD): processors P0, P1, ..., Pn each have their own memory and reach shared disks through a communication network.]
Example systems named on the slide: IBM 3090 series, Digital VAX, Sequent Symmetry, nCUBE/2, original Digital VAX cluster, Sun Fire (72), IBM pSeries (32).

Slide 7: Shared-Nothing Architecture
Consensus: the shared-nothing architecture is the most scalable for supporting very large databases.
[Figure: Shared-Nothing (SN): each processing node (PN) pairs a processor P0, P1, ..., Pn with its own memory and disk; the nodes are linked only by a communication network.]
Example systems named on the slide: IBM SP/2, Teradata DBC/1012, Tandem.

Slide 8: IBM RS/6000 SP
- It allows up to 8,192 individual processors to be combined and managed as a single system.
- Processors are packaged in shared-memory nodes of up to 16 processors each.
- IBM's well-planned roadmap for the SP allows customers to start small and scale up to larger, more powerful systems.
- This may entail adding nodes without having to replace existing hardware, ensuring long-term investment protection as operating needs grow.

Slide 9: Parallel Database Servers
- Tandem NonStop SQL.
- Informix: Online 7.0 supported the SE environment with the Informix Parallel Data Query (PDQ). Version 8.0 supports SN computers.
- Oracle: two products, Parallel Server and Parallel Query Option (PQO).
- Sybase: Navigation Server.
- AT&T Global Information Solutions (GIS).
- IBM DB2 Parallel Edition: supports the IBM SP2 SN multiprocessor.

Slide 10: Process Structure for PDB
[Figure: SQL enters and results return through a stack of layers: Query Optimizer, Executor, Storage Manager, Hardware.]
Issues listed alongside the figure:
- SE, SD, SN architectures
- Data placement
- Query scheduling
- Query optimization

Slide 11: Parallelism in Relational Data Model
- Pipeline parallelism: if one operator sends its output to another, the two operators can execute in parallel.
[Figure: an operator pipeline of SCAN, JOIN, and INSERT over Table A and Table B, producing C.]
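Pipeline parallelism can be sketched with one thread per operator and bounded queues between them, so a downstream operator starts consuming rows before the upstream one has finished. This is a minimal illustration under assumptions of our own: rows are `(key, value)` pairs, Table B fits in memory with unique keys, and all names (`scan`, `join`, `insert`, `run_pipeline`) are hypothetical. In CPython, threads show the structure of the pipeline rather than a true CPU speedup.

```python
import queue
import threading

DONE = object()  # sentinel that marks the end of a row stream


def scan(table, out_q):
    """SCAN: emit the rows of a table one at a time."""
    for row in table:
        out_q.put(row)
    out_q.put(DONE)


def join(in_q, out_q, b_by_key):
    """JOIN: match each incoming A row against an in-memory lookup of B."""
    while (row := in_q.get()) is not DONE:
        key, a_value = row
        if key in b_by_key:
            out_q.put((key, a_value, b_by_key[key]))
    out_q.put(DONE)


def insert(in_q, target):
    """INSERT: append each joined row to the result table."""
    while (row := in_q.get()) is not DONE:
        target.append(row)


def run_pipeline(table_a, table_b):
    b_by_key = dict(table_b)  # assumes unique keys in Table B
    q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    table_c = []
    stages = [
        threading.Thread(target=scan, args=(table_a, q1)),
        threading.Thread(target=join, args=(q1, q2, b_by_key)),
        threading.Thread(target=insert, args=(q2, table_c)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return table_c


print(run_pipeline([(1, "a"), (2, "b"), (3, "c")], [(1, "x"), (3, "z")]))
# [(1, 'a', 'x'), (3, 'c', 'z')]
```

The small `maxsize` on the queues forces the three stages to overlap: the scan cannot run far ahead of the join, and the join cannot run far ahead of the insert.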

Slide 12: Parallelism in Relational Data Model (continued)
- Partitioned parallelism: by taking the large relational operators and partitioning their inputs and outputs, it is possible to turn one big job into many concurrent, independent little ones.
[Figure: SCAN, JOIN, and INSERT operators replicated over input partitions A0, A1, A2 and B0, B1, producing output partitions C0, C1, C2.]
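Partitioned parallelism can be sketched by splitting both join inputs on the join key and joining each pair of partitions independently. The code below is an assumption-laden illustration, not the slide's design: keys are integers, partitioning is by `key % n`, Table B has unique keys, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n):
    """Split (key, value) rows into n partitions by key modulo n."""
    parts = [[] for _ in range(n)]
    for key, value in rows:
        parts[key % n].append((key, value))
    return parts


def join_partition(a_part, b_part):
    """Join one partition of A with the matching partition of B."""
    b_by_key = dict(b_part)  # assumes unique keys in B
    return [(k, a, b_by_key[k]) for k, a in a_part if k in b_by_key]


def partitioned_join(table_a, table_b, n):
    a_parts, b_parts = partition(table_a, n), partition(table_b, n)
    # Each partition pair is an independent little job.
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(join_partition, a_parts, b_parts))
    return sorted(row for part in results for row in part)


print(partitioned_join(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
    [(2, "x"), (4, "y"), (5, "z")],
    n=3,
))
# [(2, 'b', 'x'), (4, 'd', 'y')]
```

Because matching keys always land in the same partition number on both sides, no partition job needs data from any other.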

Slide 13: Data Partitioning Strategies
There are two problems for the SN architecture:
- The degree of parallelism is determined by the physical layout of the data across the PNs.
- Its performance is very sensitive to skew in the data distribution.
Partitioned data is the key to partitioned execution:
- Round-robin partitioning
- Hash partitioning
- Range partitioning

Slide 14: Round-Robin Partitioning
- It maps the ith tuple to disk i mod n, where n is the number of disks.
- Advantage: it is simple.
- Disadvantage: it does not support associative search.
[Figure: records dealt in turn to disks D0, D1, D2, D3.]
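The `i mod n` rule above can be sketched in a few lines. The function name is hypothetical, and disks are modeled as plain Python lists.

```python
def round_robin_partition(tuples, n):
    """Place the i-th tuple on disk i mod n."""
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)
    return disks


print(round_robin_partition(list(range(10)), 4))
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The placement depends only on arrival order, not on any attribute value, which is why a search for a specific value has to visit every disk.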

Slide 15: Hash Partitioning
- It maps each tuple to a disk location based on a hash function.
- Advantage: associative access to the tuples with a specific attribute value can be directed to a single disk.
- Disadvantage: it tends to randomize data rather than cluster it.
[Figure: a hash function directs each record to one of the disks D0, D1, D2, D3.]
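A minimal sketch of hash partitioning and of the single-disk lookup it enables. The choice of CRC-32 as the hash function is an assumption made here for repeatability (Python's built-in `hash` for strings varies between runs); the slides do not name a hash function, and the helper names are hypothetical.

```python
import zlib


def disk_for(key, n):
    """Map an attribute value to a disk number with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % n


def hash_partition(tuples, key_of, n):
    """Place each tuple on the disk chosen by hashing its partitioning attribute."""
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[disk_for(key_of(t), n)].append(t)
    return disks


def lookup(disks, key, key_of):
    """Associative search: only the one disk the key hashes to is read."""
    target = disks[disk_for(key, len(disks))]
    return [t for t in target if key_of(t) == key]


# Hypothetical student records: (name, major).
students = [("Ann", "CS"), ("Bob", "EE"), ("Cho", "CS"), ("Dee", "ME")]
disks = hash_partition(students, key_of=lambda s: s[1], n=4)
print(lookup(disks, "CS", key_of=lambda s: s[1]))
# [('Ann', 'CS'), ('Cho', 'CS')]
```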

Slide 16: Range Partitioning
- It maps contiguous attribute ranges of a relation to various disks.
- Advantage: it is good for associative search and for clustering data.
- Disadvantage: it risks execution skew, in which all the execution occurs in one partition.
[Figure: disks D0, D1, D2, D3 hold the ranges A~F, G~L, M~R, S~Z, respectively.]
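The A~F, G~L, M~R, S~Z layout from the figure can be sketched with a sorted list of range boundaries and a binary search. Partitioning on the first letter of a name is an assumption for illustration, and the function names are hypothetical.

```python
import bisect

# Upper bound (inclusive) of every range except the last:
# D0 holds A~F, D1 holds G~L, D2 holds M~R, D3 holds S~Z.
UPPER_BOUNDS = ["F", "L", "R"]


def disk_for_name(name):
    """Disk whose range contains the first letter of the name."""
    return bisect.bisect_left(UPPER_BOUNDS, name[0].upper())


def disks_for_range(low, high):
    """Disks that must be read for a range query low..high (inclusive)."""
    return list(range(disk_for_name(low), disk_for_name(high) + 1))


print(disk_for_name("Gray"))      # 1
print(disks_for_range("H", "K"))  # [1]
print(disks_for_range("E", "N"))  # [0, 1, 2]
```

The query for H through K shows both sides of the trade-off: only one disk is read, which is efficient for the query itself, but all of its execution lands on a single partition.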

Slide 17: Horizontal Data Partitioning

Slide 18: Problems for Horizontal Partitioning
- Query 1: Retrieve the names of students who have a GPA better than 2.0.
  - Only P2 and P3 can participate.
  - In a multi-user environment, the system can effectively use all the remaining PNs for other queries (generally not achievable).
- Query 2: Retrieve the names of students who major in Computer Science.
  - The whole file must be searched.
  - This problem cannot be easily addressed.

Slide 19: To Address the Problem
- The relation is horizontally partitioned and distributed across the PNs. Locally, each partition is organized as a grid file.
- The relation is partitioned using multiple attributes. Locally, each partition can be organized as a grid file (the approach investigated in most research).

Slide 20: Multidimensional Data Partitioning
[Figure: a two-dimensional grid over Salary (K) and Age, with query regions labeled "attribute query" and "1-attribute query".]

Slide 21: Advantage of MDP
- Degree of parallelism is maximized (using as many processing nodes as possible).
- Search space is minimized (searching only relevant data blocks).

Slide 22: Query Types
- Query shape: the shape of the data sub-space accessed by a range query.
- Square query: the query shape is a square.
- Row query: the query shape is a rectangle containing a number of rows.
- Column query: the query shape is a rectangle containing a number of columns.

Slide 23: Disk Modulo (DM) Allocation
[Figure: a 16×16 DM example.]

Slide 24: Disk Modulo
- Advantage: optimal for row and column queries.
- Disadvantage: poor for square queries.
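The behavior above can be checked with a small sketch. It assumes the usual Disk Modulo rule, in which grid cell (i, j) goes to disk (i + j) mod N; the slide's figure is not preserved here, so treat that formula and the helper names as assumptions.

```python
def dm_disk(i, j, n_disks):
    """Disk Modulo: cell (i, j) is stored on disk (i + j) mod N."""
    return (i + j) % n_disks


def disks_touched(cells, n_disks):
    """Set of distinct disks a query over the given grid cells reads from."""
    return {dm_disk(i, j, n_disks) for i, j in cells}


N = 16  # a 16 x 16 grid spread over 16 disks

# Row query: all 16 cells of row 3.
row_query = [(3, j) for j in range(16)]
# Square query: the 4 x 4 block in one corner (also 16 cells).
square_query = [(i, j) for i in range(4) for j in range(4)]

print(len(disks_touched(row_query, N)))     # 16
print(len(disks_touched(square_query, N)))  # 7
```

Both queries read 16 cells. The row query spreads them over all 16 disks, one cell each, while the square query reaches only 7 disks because i + j ranges over 0..6, so some disks must serve several cells.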

Slide 25: Hilbert Curve Allocation Method (HCAM)
[Figure: a Hilbert curve, and a 16×16 HCAM example.]

Slide 26: HCAM
- HCAM is based on the idea of space-filling curves.
- A space-filling curve visits all points in a k-dimensional space grid exactly once and never crosses itself.
- Advantage: good for square range queries.
- Disadvantage: poor for row and column queries.
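A sketch of the idea for the 2-D case: compute each cell's position along a Hilbert curve, then hand out disks round-robin in that order. The index computation is the widely used iterative coordinate-to-index conversion for a Hilbert curve on a power-of-two grid; the round-robin assignment along the curve is our reading of HCAM and should be treated as an assumption, as should the function names.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve covering an
    n x n grid, where n is a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hcam_disk(n, x, y, n_disks):
    """Assign disks round-robin along the curve (assumed HCAM rule)."""
    return hilbert_index(n, x, y) % n_disks


# The curve visits every cell of the 16 x 16 grid exactly once.
positions = sorted(hilbert_index(16, x, y) for x in range(16) for y in range(16))
print(positions == list(range(256)))  # True

# A 4 x 4 square query in one corner is spread over all 16 disks.
square = {hcam_disk(16, x, y, 16) for x in range(4) for y in range(4)}
print(len(square))  # 16
```

Cells that are close in the grid tend to be close along the curve, so a compact square query covers a nearly contiguous run of curve positions and therefore many different disks.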

Slide 27: General Multidimensional Data Allocation (GeMDA)
[Figure: 2-D GeMDA, and a 16×16 GeMDA example.]

Slide 28: 2-D GeMDA
- Regular rows: circular left shift by […] positions.
- Check rows: circular left shift by […] + 1 positions.
- Number of check rows: GCD([…], N) - 1.
Advantages: optimal for row, column, and small square range queries (|Q| < […]^2).
N is the number of PNs.

Slide 29: 3-D GeMDA

Slide 30: Mapping Function for GeMDA

Slide 31: Optimality Comparison
Optimal with respect to:

Allocation scheme   Row queries   Column queries   Small square queries
HCAM                No            No               No
DM                  Yes           Yes              No
GeMDA               Yes           Yes              Yes