CPT-S 580-06 Advanced Databases
Yinghui Wu, EME 49

Parallel machine models
(adapted from David Rodriguez et al.)

A Parallel Machine Model
What is a machine model?
- Describes a "machine"
- Assigns a cost to the operations on the machine
Why do we need a model?
- Makes it easier to reason about algorithms
- Lets us derive complexity bounds
- Helps analyze the maximum achievable parallelism

Parallel Computer Models
Performance attributes:
- Machine size: number of processors
- Clock rate: speed of the processors (MHz)
- Workload: number of computation operations (Mflop)
- Speedup, efficiency, utilization
- Startup time
Three abstract machine models: PRAM, BSP, LogP
One programming model: MapReduce

PRAM model
(adapted from Michael C. Scherger)

RAM (Random Access Machine)
- Unbounded number of local memory cells
- Each memory cell can hold an integer of unbounded size
- Instruction set includes simple operations (add, multiply, compare, etc.), data operations, comparisons, and branches
- All operations take unit time
- Time complexity = number of instructions executed
- Space complexity = number of memory cells used

PRAM (Parallel Random Access Machine)
Definition: an abstract machine for designing algorithms applicable to parallel computers.
A PRAM M' is a system <M, X, Y, A> of infinitely many RAMs M1, M2, …; each Mi is called a processor of M'. All processors are assumed to be identical, and each can recognize its own index i.
- Input cells X(1), X(2), …
- Output cells Y(1), Y(2), …
- Shared memory cells A(1), A(2), …
A PRAM is a synchronous, MIMD, shared-memory parallel computer: processors share a common clock but may execute different instructions in each cycle.

PRAM (Parallel RAM)
- Unbounded collection of RAM processors P0, P1, …
- Processors have no tape; each has unbounded registers
- Unbounded collection of shared memory cells
- All processors can access any memory cell in unit time
- All communication goes through shared memory

PRAM: one step of a computation
A step consists of 5 phases, carried out in parallel by all processors. Each processor:
1. Reads a value from one of the input cells X(1), …, X(N)
2. Reads one of the shared memory cells A(1), A(2), …
3. Performs some internal computation
4. May write into one of the output cells Y(1), Y(2), …
5. May write into one of the shared memory cells A(1), A(2), …
Example: for all i, do A[i] = A[i-1] + 1. The read of A[i-1], the addition, and the write of A[i] happen synchronously across all processors (see the sketch below).
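To make the synchrony concrete, here is a minimal Python sketch (not part of the original slides) of one PRAM step executing A[i] = A[i-1] + 1 on every processor: all reads happen before any write, so every processor sees the old value of its neighbor.

```python
# Hypothetical sketch of one synchronous PRAM step for "A[i] = A[i-1] + 1".
# All processors read in the same phase and write in the same phase, so every
# processor sees the OLD value of A[i-1]; a sequential loop would not.
def pram_step(A):
    reads = [A[i - 1] for i in range(1, len(A))]   # phase: read shared memory
    results = [r + 1 for r in reads]               # phase: local computation
    for i in range(1, len(A)):                     # phase: write shared memory
        A[i] = results[i - 1]
    return A

print(pram_step([0, 0, 0, 0]))  # [0, 1, 1, 1], not [0, 1, 2, 3]
```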

PRAM (Parallel RAM)
[Figure: processors P0 … PN connected to shared memory cells]
- Some subset of the processors can remain idle
- Two or more processors may read simultaneously from the same cell
- A write conflict occurs when two or more processors try to write simultaneously into the same cell

Shared Memory Access Conflicts
PRAMs are classified by their read/write abilities (realistic and useful):
- Exclusive Read (ER): all processors can simultaneously read from distinct memory locations
- Exclusive Write (EW): all processors can simultaneously write to distinct memory locations
- Concurrent Read (CR): all processors can simultaneously read from any memory location
- Concurrent Write (CW): all processors can simultaneously write to any memory location
Common combinations: EREW, CREW, CRCW

Concurrent Write (CW): which value gets written in the end?
- Priority CW: processors have priorities; only the highest-priority writer is allowed to complete its WRITE
- Common CW: all processors are allowed to complete the WRITE iff all the values to be written are equal
- Arbitrary/Random CW: one arbitrarily (or randomly) chosen processor is allowed to complete its WRITE
A small sketch of the three rules follows.
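A minimal sketch, assuming write attempts arrive as (processor_id, value) pairs with lower ids having higher priority; the function name and encoding are illustrative only.

```python
# Hypothetical sketch of the three CW conflict-resolution rules for a set of
# (processor_id, value) write attempts to one cell; lower id = higher priority.
import random

def resolve_cw(writes, policy):
    if policy == "priority":
        return min(writes)[1]                    # highest-priority processor wins
    if policy == "common":
        values = {v for _, v in writes}
        assert len(values) == 1, "common CW requires all written values to be equal"
        return values.pop()
    if policy == "arbitrary":
        return random.choice(writes)[1]          # any one attempt succeeds

print(resolve_cw([(3, 7), (1, 9), (2, 5)], "priority"))  # 9
```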

Strengths of PRAM
PRAM is an attractive and important model for designers of parallel algorithms. Why?
- It is natural: the number of operations executed per cycle on p processors is at most p
- It is strong: any processor can read/write any shared memory cell in unit time
- It is simple: it abstracts away communication and synchronization overhead, which makes the complexity and correctness analysis of PRAM algorithms easier
- It can be used as a benchmark: if a problem has no feasible/efficient solution on a PRAM, it has no feasible/efficient solution on any parallel machine

An initial example
How do you add N numbers residing in memory locations M[0], M[1], …, M[N]?
- Serial algorithm: O(N)
- PRAM algorithm using N processors P0, P1, …, PN: pairwise addition in a balanced tree
  - log(n) steps needed
  - n/2 processors needed
  - Speedup = n / log(n); efficiency = 1 / log(n)
- The same scheme applies to other operations: +, *, <, >, etc.
A sketch of the tree reduction is shown below.
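A minimal sequential simulation of the tree reduction; on an EREW PRAM each inner loop would run as one unit-time parallel step. This is an illustration, not the slide's own code.

```python
# Hypothetical sketch: simulating the O(log n) PRAM summation on a single
# machine. Each "round" halves the number of active processors; on a real
# EREW PRAM all additions in one round happen in the same unit-time step.
def pram_sum(values):
    a = list(values)          # shared memory cells A[0..n-1]
    n = len(a)
    stride = 1
    while stride < n:         # log2(n) synchronous rounds
        for i in range(0, n - stride, 2 * stride):   # done in parallel on a PRAM
            a[i] += a[i + stride]
        stride *= 2
    return a[0]

print(pram_sum(range(1, 9)))  # 36, computed in 3 rounds instead of 7 serial adds
```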

Example 2
Given a p-processor PRAM and n numbers (p <= n): does x occur among the n numbers?
Algorithm:
1. Inform every processor what x is
2. Every processor checks ⌈n/p⌉ numbers and sets a flag if it finds x
3. Check whether any of the flags is set to 1
Cost per step:
Step                                       EREW      CREW      CRCW (common)
1. Inform everyone what x is               log(p)    1         1
2. Each processor checks ⌈n/p⌉ numbers     n/p       n/p       n/p
3. Check if any flag is set to 1           log(p)    log(p)    1
A code sketch of the local-checking step follows.
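A minimal sequential sketch of the algorithm above; the loop over i stands in for the p processors working in parallel, and the function name is illustrative only. The EREW/CREW/CRCW differences show up in the broadcast and flag-check phases, not in the local scan.

```python
# Hypothetical sketch: p processors each scan ceil(n/p) numbers and set a flag;
# the "inform" and "check flags" phases cost log(p) or O(1) depending on the
# PRAM variant, as in the table above.
import math

def pram_search(numbers, x, p):
    n = len(numbers)
    chunk = math.ceil(n / p)
    flags = [0] * p                       # one flag cell per processor
    for i in range(p):                    # each iteration = one processor's work
        if x in numbers[i * chunk:(i + 1) * chunk]:
            flags[i] = 1
    return any(flags)                     # O(log p) on EREW/CREW, O(1) on common CRCW

print(pram_search(list(range(100)), 42, p=8))  # True
```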

Some variants of PRAM
- Bounded number of shared memory cells: small-memory PRAM (if the input data set exceeds the capacity of the shared memory, i/o values can be distributed evenly among the processors)
- Bounded number of processors: small PRAM (if the number of threads of execution is higher, processors may interleave several threads)
- Bounded size of a machine word: the word size of the PRAM
- Handling access conflicts: constraints on simultaneous access to shared memory cells

Lemma (processor scaling)
Assume p' < p. Any problem that can be solved on a p-processor PRAM in t steps can be solved on a p'-processor PRAM in t' = O(tp/p') steps (assuming the same size of shared memory).
Proof sketch:
- Partition the p processors into p' groups of size p/p'
- Associate each of the p' simulating processors with one of these groups
- Each simulating processor simulates one step of its group by first executing all of the group's READ and local computation substeps, and then executing all of the group's WRITE substeps

Lemma (memory scaling)
Assume m' < m. Any problem that can be solved on a p-processor, m-cell PRAM in t steps can be solved on a max(p, m')-processor, m'-cell PRAM in O(tm/m') steps.
Proof sketch:
- Partition the m simulated shared memory cells into m' contiguous segments Si of size m/m' each
- Each simulating processor P'i (1 <= i <= p) simulates processor Pi of the original PRAM
- Each simulating processor P'i (1 <= i <= m') stores the initial contents of Si in its local memory and uses M'[i] as an auxiliary cell for simulating accesses to the cells of Si
- Simulation of one original READ operation: each P'i (i = 1, …, max(p, m')) repeats for k = 1, …, m/m':
  - (i = 1, …, m') write the value of the k-th cell of Si into M'[i]
  - (i = 1, …, p) read the value that the simulated processor Pi would read in this substep, if it appears in shared memory
- The local computation substep of Pi (i = 1, …, p) is simulated in one step by P'i
- Simulation of one original WRITE operation is analogous to that of READ
Reference: http://www.cs.yale.edu/homes/arvind/cs424/readings/pram.pdf

BSP model
(adapted from Michael C. Scherger)

What Is Bulk Synchronous Parallelism?
BSP is a parallel programming model based on synchronizer automata, proposed by Leslie Valiant of Harvard University. The model consists of:
- A set of processor-memory pairs
- A communication network that delivers messages in a point-to-point manner
- A mechanism for efficient barrier synchronization of all, or a subset of, the processes
- No special combining, replicating, or broadcasting facilities
[Figure: processor-memory nodes (computation cost w) connected by a communication network (cost g), with a barrier (cost l)]

BSP Programming Style
Vertical structure: sequential composition of "supersteps", each consisting of
- Local computation
- Process communication
- Barrier synchronization
Horizontal structure: concurrency among a fixed number of virtual processors
- Processes do not have a particular order
- Locality plays no role in the placement of processes on processors
- p = number of processors
[Figure: one superstep across p virtual processors: local computation, then global communication, then barrier synchronization]

BSP Programming Style: Properties
- Programs are simple to write
- Independent of the target architecture
- Performance of the model is predictable
- Considers computation and communication at the level of the entire program and executing machine, instead of individual processes and individual communications
- Renounces locality as a performance optimization; this is both good and bad: BSP may not be the best choice for applications where locality is critical, e.g., low-level image processing

How Does Communication Work?
- BSP considers communication en masse: the time to deliver a whole set of data is bounded by treating all communication actions of a superstep as a unit
- If the maximum number of incoming or outgoing messages per processor is h, the communication pattern is called an h-relation
- The parameter g measures the permeability of the network to continuous traffic addressed to uniformly random destinations; it is defined so that delivering an h-relation takes time hg
- BSP does not distinguish between sending 1 message of length m and m messages of length 1: the cost is mgh in both cases

Barrier Synchronization
Barriers are "often expensive and should be used as sparingly as possible." The developers of BSP claim that barriers are not as expensive as high-performance-computing folklore suggests.
The cost of a barrier synchronization has two parts:
- The cost caused by the variation in completion time of the participating computation steps
- The cost of reaching a globally consistent state in all processors
This cost is captured by the parameter l ("ell"); a lower bound on l is the diameter of the network.

Predictability of the BSP Model
Characteristics:
- p = number of processors
- s = processor computation speed (flops/s), used to calibrate g and l
- l = synchronization periodicity: the minimal number of time steps between successive synchronization operations
- g = (total number of local operations performed by all processors in one second) / (total number of words delivered by the communication network in one second)
Cost of a superstep (standard cost model): max(w_i) + max(h_i) * g + l (or just w + hg + l)
Cost of a superstep (overlapping cost model): max(w, hg) + l
A small cost calculation is sketched below.
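A minimal sketch of the standard cost formula; the numeric values are illustrative and not from the slides.

```python
# Hypothetical sketch of the standard BSP cost model: for one superstep with
# per-processor work w_i and h_i words sent/received per processor,
# cost = max(w_i) + max(h_i) * g + l.
def superstep_cost(work, h, g, l):
    return max(work) + max(h) * g + l

# Illustrative example: 4 processors, g = 4 time units per word, l = 50 units
# for the barrier.
print(superstep_cost(work=[1000, 1200, 900, 1100], h=[10, 30, 20, 25], g=4, l=50))
# 1200 + 30*4 + 50 = 1370
```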

Predictability of the BSP Model
Strategies for writing efficient BSP programs:
- Balance the computation in each superstep between processes: w is the maximum of all computation times, and the barrier synchronization must wait for the slowest process (techniques: prioritization, bounded staleness, soft synchronization)
- Balance the communication between processes: h is the maximum fan-in and/or fan-out of data (techniques: message grouping, prioritization)
- Minimize the number of supersteps: this determines the number of times the synchronization cost l appears in the final cost

LogP model
(adapted from Michael C. Scherger)

LogP Model
Motivation: the BSP model is limited by the bandwidth of the network (g) and the number of processors, and requires a large load per superstep; better models are needed for portable algorithms.
Assumptions:
- Each node is a powerful processor with large memory
- The interconnection structure has limited bandwidth
- The interconnection structure has significant latency

LogP Parameters
- L: latency: the delay on the network, i.e., the time from sender to receiver
- o: overhead: the time either processor is occupied sending or receiving a message; it can do nothing else for o cycles
- g: gap: the minimum interval between consecutive messages (due to bandwidth limits)
- P: number of processors
Notes:
- L, o, g are independent of P and of node distances, and are measured in cycles
- Message length: for short messages, L, o, g are per word or per message of fixed length
- A k-word message behaves like k short messages (k*o overhead); L is independent of message length

LogP Parameters (continued)
- Bandwidth: (1/g) * unit message length
- At most L/g messages can be in transit to or from any processor at a time
- Send-to-receive total time: L + 2o (o can be ignored when it is small relative to L and g)
- Similar to BSP, except that there is no synchronization step
- No overlapping of communication and computation; overlapping them would give a speed-up of at most a factor of two
A worked message-cost calculation is sketched below.
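A small sketch of LogP message costs, under the assumption that a k-word message is injected as k fixed-size messages (matching the L + 2o short-message case above); the parameter values reuse the broadcast example on the next slide.

```python
# Hypothetical sketch: end-to-end time of a k-word message under LogP when a
# word is the fixed message unit, so k words go out as k back-to-back messages.
# The sender pays o per message and must respect the gap g between injections.
def logp_message_time(k, L, o, g):
    last_injection = (k - 1) * max(g, o)     # when the k-th word leaves the sender
    return last_injection + o + L + o        # + send overhead + latency + receive overhead

print(logp_message_time(k=1, L=6, o=2, g=4))   # 10 = 2o + L, the short-message case
print(logp_message_time(k=4, L=6, o=2, g=4))   # 22: three gaps of 4, then 2 + 6 + 2
```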

Broadcast in LogP
Optimal broadcast tree for P = 8, L = 6, g = 4, o = 2.
[Figure: broadcast tree rooted at P0; each edge costs o + L + o, a sender must wait g between consecutive sends, and the reception times are 10, 14, 18, 20, 22, 24]
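A sketch of a greedy schedule that reproduces the reception times of the optimal broadcast tree for these parameters; the heap-based simulation is an illustration, not the slide's construction.

```python
# Hypothetical sketch: greedy broadcast under LogP. Each informed node keeps
# forwarding; a new node learns the value at (send_start + o + L + o), and the
# sender can start another send g time units later (here g >= o).
import heapq

def logp_broadcast(P, L, o, g):
    ready = [(0, 0)]                 # (next possible send start, processor id); P0 starts
    informed_at = {0: 0}
    while len(informed_at) < P:
        start, pid = heapq.heappop(ready)
        arrive = start + o + L + o                   # receiver finishes its receive
        new = len(informed_at)                       # next uninformed processor id
        informed_at[new] = arrive
        heapq.heappush(ready, (start + g, pid))      # sender's next send slot
        heapq.heappush(ready, (arrive, new))         # new node can forward immediately
    return informed_at

print(logp_broadcast(P=8, L=6, o=2, g=4))
# {0: 0, 1: 10, 2: 14, 3: 18, 4: 20, 5: 22, 6: 24, 7: 24} -- matches the
# reception times in the figure above; the broadcast completes at time 24.
```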

BSP vs. LogP
BSP differs from LogP in three ways:
- LogP uses a form of message passing based on pairwise synchronization
- LogP adds an extra parameter, o, for the overhead involved in sending a message; it applies to every communication
- LogP defines g in local terms: it regards the network as having a finite capacity and treats g as the minimal permissible gap between message sends from a single process
In both models, g is the reciprocal of the available per-processor network bandwidth: BSP takes a global view of g, LogP a local one.

BSP vs. LogP
- When analyzing the performance of the LogP model, it is often necessary (or convenient) to use barriers
- Message overhead is present but decreasing; the only overhead is transferring the message from user space to a system buffer
- Roughly: LogP + barriers - overhead = BSP
- Each model can efficiently simulate the other

BSP vs. PRAM
- BSP can be regarded as a generalization of the PRAM model
- If the BSP architecture has a small value of g (g = 1), it can be regarded as a PRAM; hashing can be used to achieve efficient memory management automatically
- The value of l determines the degree of parallel slackness required to achieve optimal efficiency
- l = g = 1 corresponds to an idealized PRAM, where no slackness is required

MapReduce

MapReduce (Google)
A programming model with two primitive functions:
- Map: <k1, v1> -> list(k2, v2)
- Reduce: <k2, list(v2)> -> list(k3, v3)
How does it work?
- Input: a list of key-value pairs <k1, v1>
- Map is applied to each pair and computes intermediate key-value pairs <k2, v2>
- The intermediate key-value pairs are hash-partitioned on k2; each partition (k2, list(v2)) is sent to a reducer
- Reduce takes a partition as input and computes key-value pairs <k3, v3>
- The process may reiterate: multiple map/reduce steps
A word-count sketch of the two primitives follows.
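A minimal in-memory sketch of the two primitives on the classic word-count example; run_mapreduce stands in for the framework's shuffle and is purely illustrative.

```python
# Hypothetical sketch: map emits (word, 1) pairs, the framework groups them by
# key, and reduce sums each group. Types follow the <k1,v1> -> <k2,v2> -> <k3,v3>
# scheme on the slide.
from collections import defaultdict

def map_fn(doc_id, text):                 # <k1, v1> = (doc_id, text)
    return [(word, 1) for word in text.split()]   # list of <k2, v2>

def reduce_fn(word, counts):              # <k2, list(v2)>
    return (word, sum(counts))            # <k3, v3>

def run_mapreduce(inputs):
    groups = defaultdict(list)            # stands in for the hash partition / shuffle on k2
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [reduce_fn(k2, v2s) for k2, v2s in groups.items()]

print(run_mapreduce([(1, "a b a"), (2, "b c")]))  # [('a', 2), ('b', 2), ('c', 1)]
```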

Architecture (Hadoop)
[Figure: dataflow of a MapReduce job]
- Input <k1, v1> pairs are stored in a distributed file system (DFS), partitioned into blocks (64 MB); one block per mapper (a map task)
- Mappers emit <k2, v2> pairs into the mappers' local stores, hash-partitioned on k2
- Reducers consume their partitions and emit <k3, v3>, aggregating the results; the process may run for multiple steps
- The programmer does not need to worry about how the data is stored and sent

Data-partitioned parallelism
[Figure: <k1, v1> -> mapper -> <k2, v2> -> reducer -> <k3, v3>, executed in parallel]
What parallelism does MapReduce exploit? Data-partitioned parallelism: the same map and reduce computations run in parallel on different partitions of the data.

MapReduce in industry
Popular systems:
- Apache Hadoop, used by Facebook, Yahoo, …
- Hive (Facebook), HiveQL (SQL)
- Pig (Yahoo), Pig Latin (SQL-like)
- SCOPE (Microsoft), SQL
- Cassandra (Facebook), CQL (no join)
- HBase, a distributed store in the style of Google's BigTable
- MongoDB, document-oriented (NoSQL)
Scalability:
- Yahoo!: 10,000 cores for Web search queries (2008)
- Facebook: 100 PB, growing by about half a PB per day
- Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3); the New York Times used 100 EC2 instances to process 4 TB of image data for $240
Study Spark: https://spark.apache.org/

Advantages of MapReduce
- Simple: one only needs to define two functions; no need to worry about how the data is stored or distributed, or how the operations are scheduled
- Scalability: a large number of low-end machines
  - scale out (scale horizontally): add new computers to the distributed application; low-cost "commodity" hardware
  - scale up (scale vertically): upgrade or add (costly) resources to a single node
- Independence: it can work with various storage layers
- Flexibility: independent of data models or schemas
- Fault tolerance: why? (next slide)

Fault tolerance
[Figure: <k1, v1> -> mapper -> <k2, v2> -> reducer -> <k3, v3>, with input blocks triplicated]
- Input data is replicated (triplicated) in the DFS
- Failures are detected and the tasks of failed nodes are reassigned to healthy nodes
- Redundancy checking is used to achieve load balancing
- Able to handle an average of 1.2 failures per analysis job

MapReduce algorithms on graphs
Input: a query Q and a graph G. Output: the answers Q(G) to Q in G.
map(key: node, value: (adjacency-list, others)) { computation; emit (mkey, mvalue) }
reduce(key: mkey, value: list[mvalue]) { …; emit (rkey, rvalue) }
Compatibility: the reduce input must match the map output (mkey, mvalue); and when multiple iterations of MapReduce are needed, the reduce output (rkey, rvalue) must match the map input of the next round.
A one-round BFS sketch in this style follows.
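A minimal sketch of one MapReduce round of BFS distance computation in the map(node, (adjacency-list, distance)) style above; bfs_round stands in for the framework's shuffle and all names are illustrative only.

```python
# Hypothetical sketch: each node re-emits its adjacency list and current
# distance, and proposes dist+1 to its neighbors; reduce keeps the minimum.
from collections import defaultdict

def bfs_map(node, adj, dist):
    yield (node, ("graph", (adj, dist)))            # carry structure + current distance
    if dist is not None:
        for nb in adj:
            yield (nb, ("dist", dist + 1))          # tentative distance via this node

def bfs_reduce(node, values):
    adj, dists = [], []
    for tag, payload in values:
        if tag == "graph":
            adj, d = payload
            if d is not None:
                dists.append(d)
        else:
            dists.append(payload)
    return node, (adj, min(dists) if dists else None)

def bfs_round(state):                               # state: node -> (adjacency list, distance)
    groups = defaultdict(list)                      # stands in for the shuffle on mkey
    for node, (adj, dist) in state.items():
        for k, v in bfs_map(node, adj, dist):
            groups[k].append(v)
    return dict(bfs_reduce(n, vs) for n, vs in groups.items())

g = {"a": (["b", "c"], 0), "b": (["d"], None), "c": ([], None), "d": ([], None)}
print(bfs_round(bfs_round(g)))  # distances after two rounds: a=0, b=1, c=1, d=2
```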

Control flow
Copy files from the input directory to staging dir 1; preprocessing.
while (termination condition is not satisfied) do {
  map from staging dir 1;
  reduce into staging dir 2;
  move files from staging dir 2 to staging dir 1
}   // iterations of MapReduce
Postprocessing; move files from staging dir 2 to the output directory.
- Termination is decided by a non-MapReduce driver program
- Functional programming style: no global data structures accessible and mutable by all
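A minimal sketch of such a driver, with in-memory state standing in for the staging directories and a fixpoint check standing in for the termination condition; one_round can be any map+reduce pass, e.g. the bfs_round sketch above.

```python
# Hypothetical sketch: a non-MapReduce driver that repeats map/reduce rounds
# until a fixpoint (or a round budget) is reached.
def driver(state, one_round, max_rounds=50):
    for _ in range(max_rounds):            # while termination condition not satisfied
        new_state = one_round(state)       # one map pass + one reduce pass
        if new_state == state:             # nothing changed: terminate
            return new_state               # postprocessing would happen here
        state = new_state
    return state
```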

Conclusion
We need models to reason about, compare, analyze, and design parallel algorithms.
- PRAM: simple and easy to understand, with a rich set of theoretical results, but over-simplistic and often not realistic; programs written on these machines are, in general, MIMD
- BSP: a computational model based on supersteps; does not use locality of reference for the assignment of processes to processors; predictability is defined in terms of three parameters; BSP is a generalization of PRAM, and roughly BSP = LogP + barriers - overhead
- LogP: refines BSP with a per-message overhead and a locally defined gap
- MapReduce: a data-partitioned parallel programming model that hides storage, distribution, and scheduling

Papers for you to review
- W. Fan, F. Geerts, and F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013.
- Y. Tao, W. Lin, X. Xiao. Minimal MapReduce Algorithms (MMC). http://www.cse.cuhk.edu.hk/~taoyf/paper/sigmod13-mr.pdf
- L. Qin, J. Yu, L. Chang, H. Cheng, C. Zhang, X. Lin. Scalable Big Graph Processing in MapReduce. SIGMOD 2014. http://www1.se.cuhk.edu.hk/~hcheng/paper/SIGMOD2014qin.pdf
- W. Lu, Y. Shen, S. Chen, B. Ooi. Efficient Processing of k Nearest Neighbor Joins using MapReduce. PVLDB 2012. http://arxiv.org/pdf/1207.0141.pdf
- V. Rastogi, A. Machanavajjhala, L. Chitnis, A. Sarma. Finding Connected Components in Map-Reduce in Logarithmic Rounds. ICDE 2013. http://arxiv.org/pdf/1203.5387.pdf
More on the course website.