CMSC724: Database Management Systems Instructor: Amol Deshpande

Slides:



Advertisements
Similar presentations
Choosing an Order for Joins
Advertisements

LIBRA: Lightweight Data Skew Mitigation in MapReduce
Parallel Databases By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) DIRECTOR ARUNAI ENGINEERING COLLEGE TIRUVANNAMALAI.
Parallel Databases Michael French, Spencer Steele, Jill Rochelle When Parallel Lines Meet by Ken Rudin (BYTE, May 98)
Fall 2008Parallel Query Optimization1. Fall 2008Parallel Query Optimization2 Bucket Sizes and I/O Costs Bucket B does not fit in the memory in its entirety,
Parallel Database Systems
Parallel Database Systems The Future Of High Performance Database Systems David Dewitt and Jim Gray 1992 Presented By – Ajith Karimpana.
Parallel Database Systems
Data Warehousing 1 Lecture-25 Need for Speed: Parallelism Methodologies Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
1 Minggu 12, Pertemuan 23 Introduction to Distributed DBMS (Chapter , 22.6, 3rd ed.) Matakuliah: T0206-Sistem Basisdata Tahun: 2005 Versi: 1.0/0.0.
Parallel DBMS Slides adapted from textbook; from Joe Hellerstein; and from Jim Gray, Microsoft Research. DBMS Textbook Chapter 22.
VLDB Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu and Elke A. Rundensteiner Worcester Polytechnic Institute
Chapter 1 Introduction 1.1A Brief Overview - Parallel Databases and Grid Databases 1.2Parallel Query Processing: Motivations 1.3Parallel Query Processing:
Parallel DBMS Chapter 22, Part A
1 Parallel DBMS Slides by Joe Hellerstein, UCB, with some material from Jim Gray, Microsoft Research. See also:
McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 17 Client-Server Processing, Parallel Database Processing,
1 CSE SUNY New Paltz Chapter Nine Multiprocessors.
TDD: Topics in Distributed Databases
Fall 2008Parallel Databases1. Fall 2008Parallel Databases2 Ideal Parallel Systems Two key properties:  Linear Speedup: Twice as much hardware can perform.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts Centralized Systems Run on a single computer system and do not interact with other computer.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #12.
Chapter 5 Parallel Join 5.1Join Operations 5.2Serial Join Algorithms 5.3Parallel Join Algorithms 5.4Cost Models 5.5Parallel Join Optimization 5.6Summary.
Database System Architectures  Client-server Database System  Parallel Database System  Distributed Database System Wei Jiang.
Parallel & Distributed databases Agenda –The problem domain of design parallel & distributed databases (chp 18-20) –The data allocation problem –The data.
PMIT-6102 Advanced Database Systems
Redundant Array of Independent Disks
1 Distributed and Parallel Databases. 2 Distributed Databases Distributed Systems goal: –to offer local DB autonomy at geographically distributed locations.
Parallel DBMS Instructor : Marina Gavrilova
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Data Warehousing 1 Lecture-24 Need for Speed: Parallelism Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Querying Large Databases Rukmini Kaushik. Purpose Research for efficient algorithms and software architectures of query engines.
Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.
Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many.
Distributed Databases DBMS Textbook, Chapter 22, Part II.
1 Distributed Databases BUAD/American University Distributed Databases.
Databases Illuminated
Lecture 15- Parallel Databases (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch
CS338Parallel and Distributed Databases11-1 Parallel and Distributed Databases Lecture Topics Multi-CPU and distributed systems Monolithic system Client–server.
CS4432: Database Systems II Query Processing- Part 2.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
Mapping the Data Warehouse to a Multiprocessor Architecture
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 6 th Edition Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism.
6340 DBMS Components. DBMS OS, application, middleware Components: storage, query optimizer, recovery manager, transaction processor, security.
Query Processing CS 405G Introduction to Database Systems.
Chapter 1 Database Access from Client Applications.
Lecture 3 - Query Processing (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad university- Mashhad Branch
Spring EE 437 Lillevik 437s06-l22 University of Portland School of Engineering Advanced Computer Architecture Lecture 22 Distributed computer Interconnection.
CS 440 Database Management Systems Parallel DB & Map/Reduce Some slides due to Kevin Chang 1.
Handling Data Skew in Parallel Joins in Shared-Nothing Systems Yu Xu, Pekka Kostamaa, XinZhou (Teradata) Liang Chen (University of California) SIGMOD’08.
Implementation of Database Systems, Jarek Gryz 1 Parallel DBMS Chapter 22, Part A.
Querying the Internet with PIER CS294-4 Paul Burstein 11/10/2003.
Introduction Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency Job-level (process-level)
CS4432: Database Systems II Query Processing- Part 1 1.
CPT-S Advanced Databases 11 Yinghui Wu EME 49.
INTRODUCTION TO HIGH PERFORMANCE COMPUTING AND TERMINOLOGY.
CS 540 Database Management Systems
Interquery Parallelism
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Join Processing in Database Systems with Large Main Memories (part 2)
Mapping the Data Warehouse to a Multiprocessor Architecture
April 30th – Scheduling / parallel
Cse 344 May 2nd – Map/reduce.
Akshay Tomar Prateek Singh Lohchubh
Parallel DBMS Chapter 22, Part A
Parallel DBMS Chapter 22, Sections 22.1–22.6
Database System Architectures
The Gamma Database Machine Project
Parallel DBMS DBMS Textbook Chapter 22
Distributed Databases
Presentation transcript:

CMSC724: Database Management Systems Instructor: Amol Deshpande

Database Machines Proposals in late 70’s, early 80’s for specialized hardware “Database Machines: An idea whose time has passed ?” DeWitt Processor-per-track: Database specific storage Evaluate selections directly on the CPs etc… Processor-per-head Need parallel readout Combined with indexes, gives good performance Off-the-track Something like a shared-memory machine Used special DB-specific processors

Database Machines Didn’t work Processor-per-track: Not cost-effective Based on fixed-head disks, or solid-state storage Processor-per-head Parallel readouts are hard to do ?? Off-the-track Disk bandwidth is actually decreasing ??? (maybe true then) Generally, specialized hardware is hard to make work Doesn’t help too much with sorts/joins anyway General-purpose hardware improves faster Soon catches up IDISKs ? (late 90’s)

Parallelism

Shared memory Direct mapping from uni-processor Shared nothing Horizontal data partitioning, partial failure Query processing, optimization challenging Shared disk Distributed lock managers, cache-coherency etc..

Concepts… Speedup vs Scaleup Pipelined vs partitioned Database operations “embarrassingly parallel” Round-robin vs Hash-based vs Range-based Bubba used “heat” to partition New operators: Hash joins, replicate-all strategy etc… Barriers to linearity Startup overheads Interference If the interference just 1%, the maximum speedup < 37 Skew

Engineering issues Re-use existing code Split/merge operators Volcano: Exchange operator Allows arbitrary interleavings An operator can directly call another operator (within the process), across processes or across network Data-driven vs Demand-driven dataflows (pull vs push) Semaphores used for controlling producer vs consumer rates Allowed the exchange operator to live in the middle Also, rule-based extensible optimizer

Query execution Left-deep trees good for pipelining Selections: Indexes etc (individual at each site) Joins: Hash joins, replicate-all Symmetric hash join operator ? Aggregation: Do separately, and combine Can all aggregates be done like this ?

Other issues Concurrency/locking ? Deadlock detection Recovery Failures ? Chained declustering Similar to RAID

Query optimization Much larger plan space 2-phase optimization (XPRS) How to partition ? Cost metric: Communication cost ? Load balancing/skew Recursive partitioning for hash joins Memory management

Distributed Databases R* etc… Communication costs much higher Use semi-joins/bloom filters etc More autonomy per machine Typically different administrative domains Different schemas, even different machines Federated ? Mariposa – Used the economic paradigm For query processing, replication etc.