1 CPT-S 580-06 Advanced Databases
Yinghui Wu, EME 49

2 Parallel machine models
Adapted from David Rodriguez et al.

3 A Parallel Machine Model
What is a machine model? It describes a "machine" and assigns a cost to the operations on that machine. Why do we need a model? It makes it easier to reason about algorithms, to derive complexity bounds, and to analyze the maximum available parallelism.

4 Parallel Computer Models
Performance attributes: machine size (number of processors), clock rate (speed of processors, MHz), workload (number of computation operations, Mflop), speedup, efficiency, utilization, and startup time. Three abstract machine models: PRAM, BSP, and LogP. Programming model: MapReduce.

5 PRAM model
Adapted from Michael C. Scherger

6 RAM (Random Access Machine)
Unbounded number of local memory cells, each of which can hold an integer of unbounded size. The instruction set includes simple operations (add, multiply, compare, etc.), data operations, and branches. All operations take unit time. Time complexity = number of instructions executed; space complexity = number of memory cells used.

7 PRAM (Parallel Random Access Machine)
Definition: an abstract machine for designing algorithms applicable to parallel computers. M' is a system <M, X, Y, A> of infinitely many RAMs M1, M2, …; each Mi is called a processor of M'. All processors are assumed to be identical, and each can recognize its own index i. Input cells X(1), X(2), …; output cells Y(1), Y(2), …; shared memory cells A(1), A(2), …. PRAM is a synchronous, MIMD, shared-memory parallel computer: processors share a common clock but may execute different instructions in each cycle.

8 PRAM (Parallel RAM)
Unbounded collection of RAM processors P0, P1, …; the processors have no tapes, and each has unbounded registers. Unbounded collection of shared memory cells. All processors can access all memory cells in unit time; all communication is via the shared memory.

9 PRAM (step in a computation)
A step consists of 5 phases, carried out in parallel by all processors. Each processor: (1) reads a value from one of the input cells x(1), …, x(N); (2) reads one of the shared memory cells A(1), A(2), …; (3) performs some internal computation; (4) may write into one of the output cells y(1), y(2), …; (5) may write into one of the shared memory cells A(1), A(2), …. E.g., "for all i, do A[i] = A[i-1] + 1": the read of A[i-1], the add, and the write of A[i] happen synchronously on all processors, as the sketch below shows.
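A minimal Python sketch (illustrative, not from the slides) of that synchrony: because every processor finishes its read phase before any processor writes, A[i] = A[i-1] + 1 sees the old neighbor values, not the freshly written ones.

```python
# One synchronous PRAM step for "for all i, do A[i] = A[i-1] + 1":
# all reads happen before all writes, so old values are used.
def pram_step(A):
    n = len(A)
    # read phase: every processor i (1 <= i < n) reads A[i-1] at once
    reads = [A[i - 1] for i in range(1, n)]
    # compute phase: each processor adds 1 locally
    results = [v + 1 for v in reads]
    # write phase: every processor i writes A[i] simultaneously
    for i in range(1, n):
        A[i] = results[i - 1]
    return A

print(pram_step([0, 0, 0, 0]))  # -> [0, 1, 1, 1], not [0, 1, 2, 3]
```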

10 PRAM (Parallel RAM)
Some subset of the processors can remain idle. Two or more processors may read simultaneously from the same cell; a write conflict occurs when two or more processors try to write simultaneously into the same cell. [Figure: processors P0, …, PN attached to the shared memory cells]

11 Shared Memory Access Conflicts
PRAMs are classified by their read/write abilities (realistic and useful): Exclusive Read (ER): all processors can simultaneously read from distinct memory locations. Exclusive Write (EW): all processors can simultaneously write to distinct memory locations. Concurrent Read (CR): all processors can simultaneously read from any memory location. Concurrent Write (CW): all processors can write to any memory location. The combinations give EREW, CREW, and CRCW PRAMs.

12 Concurrent Write (CW)
What value gets written in the end? Priority CW: processors have priorities, and the processor with the highest priority is allowed to complete the WRITE. Common CW: all processors are allowed to complete the WRITE iff all the values to be written are equal. Arbitrary/Random CW: one arbitrarily chosen processor is allowed to complete the WRITE.
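The three policies can be stated as a small Python sketch (names are illustrative): attempts are (processor_id, value) pairs all aimed at the same cell in the same cycle, with a lower id meaning higher priority.

```python
import random

# Resolve a concurrent write under the three CW policies.
def resolve_cw(attempts, policy):
    if policy == "priority":      # highest-priority (lowest id) wins
        return min(attempts)[1]
    if policy == "common":        # legal only if all values agree
        values = {v for _, v in attempts}
        assert len(values) == 1, "illegal common-CW write"
        return values.pop()
    if policy == "arbitrary":     # any one processor may win
        return random.choice(attempts)[1]

print(resolve_cw([(2, 7), (0, 5), (1, 9)], "priority"))  # -> 5
```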

13 Strengths of PRAM
PRAM is an attractive and important model for designers of parallel algorithms. Why? It is natural: the number of operations executed per cycle on p processors is at most p. It is strong: any processor can read or write any shared memory cell in unit time. It is simple: it abstracts away all communication and synchronization overhead, which makes the complexity and correctness of PRAM algorithms easier to analyze. It can be used as a benchmark: if a problem has no feasible/efficient solution on a PRAM, it has no feasible/efficient solution on any parallel machine.

14 An initial example
How do you add N numbers residing in memory locations M[0, 1, …, N]? A serial algorithm takes O(N). A PRAM algorithm using processors P0, P1, P2, …: time needed = log(n) steps, with n/2 processors; speed-up = n / log(n); efficiency = 1 / log(n). The same pattern applies to other associative operations: +, *, <, >, etc. A sketch follows below.

15 Example 2
A p-processor PRAM holds n numbers (p <= n). Does x exist among the n numbers? Algorithm: (1) inform everyone what x is; (2) every processor checks ⌈n/p⌉ numbers and sets a flag; (3) check whether any of the flags is set to 1. Step costs per model:

Step                                     EREW     CREW     CRCW (common)
Inform everyone what x is                log p    1        1
Each processor checks n/p numbers        n/p      n/p      n/p
Check if any flag is set to 1            log p    log p    1
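A sketch of the algorithm in Python, simulating the p processors sequentially; the final any() stands in for the O(1) common-CRCW step in which every flag-holder writes the same 1 into a single cell.

```python
import math

def pram_search(numbers, x, p):
    n = len(numbers)
    block = math.ceil(n / p)              # each processor's share
    flags = [0] * p
    for i in range(p):                    # conceptually in parallel
        chunk = numbers[i * block:(i + 1) * block]
        flags[i] = 1 if x in chunk else 0
    return any(flags)                     # O(1) under common-CRCW

print(pram_search([7, 3, 9, 4, 1, 8], 9, p=3))  # -> True
```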

16 Some variants of PRAM
Bounded number of shared memory cells: the small-memory PRAM (if the input data set exceeds the capacity of the shared memory, I/O values can be distributed evenly among the processors). Bounded number of processors: the small PRAM (if the number of threads of execution is higher, processors may interleave several threads). Bounded size of a machine word: the word size of the PRAM. Handling access conflicts: constraints on simultaneous access to shared memory cells.

17 Lemma
Assume p' < p. Any problem that can be solved on a p-processor PRAM in t steps can be solved on a p'-processor PRAM in t' = O(tp/p') steps (assuming the same size of shared memory). Proof: partition the p processors into p' groups of size p/p', and associate each of the p' simulating processors with one of these groups. Each simulating processor simulates one step of its group by first executing all of the group's READ and local computation substeps, and then executing the group's WRITE substeps.

18 Lemma
Assume m' < m. Any problem that can be solved on a p-processor, m-cell PRAM in t steps can be solved on a max(p, m')-processor, m'-cell PRAM in O(tm/m') steps. Proof: partition the m simulated shared memory cells into m' contiguous segments Si of size m/m' each. Each simulating processor P'i (1 <= i <= p) simulates processor Pi of the original PRAM. Each simulating processor P'i (1 <= i <= m') stores the initial contents of Si in its local memory and uses M'[i] as an auxiliary cell for simulating accesses to the cells of Si. To simulate one original READ operation, each P'i (i = 1, …, max(p, m')) repeats for k = 1, …, m/m': write the value of the k-th cell of Si into M'[i] (i = 1, …, m'), then read the value that the simulated processor Pi (i = 1, …, p) would read in this substep if it appeared in the shared memory. The local computation substep of Pi (i = 1, …, p) is simulated in one step by P'i. Simulation of one original WRITE operation is analogous to that of READ.

19 BSP model
Adapted from Michael C. Scherger

20 What Is Bulk Synchronous Parallelism?
BSP is a parallel programming model based on the synchronizer automata, proposed by Leslie Valiant of Harvard University. The model consists of: a set of processor-memory pairs; a communications network that delivers messages in a point-to-point manner; and a mechanism for efficient barrier synchronization of all, or a subset of, the processes. There are no special combining, replicating, or broadcasting facilities. [Figure: processor-memory nodes (computation cost w) connected by a communication network (cost parameter g), with barriers (cost parameter l)]

21 BSP Programming Style
Vertical structure: sequential composition of "supersteps", each consisting of local computation, process communication, and barrier synchronization. Horizontal structure: concurrency among a fixed number of virtual processors; processes do not have a particular order, and locality plays no role in the placement of processes on processors; p = number of processors. [Figure: a superstep: virtual processors perform local computation, then global communication, then barrier synchronization]

22 BSP Programming Style
Properties: programs are simple to write, independent of the target architecture, and the performance of the model is predictable. The model considers computation and communication at the level of the entire program and executing computer, instead of considering individual processes and individual communications, and it renounces locality as a performance optimization. This is both good and bad: BSP may not be the best choice for applications where locality is critical, e.g., low-level image processing.

23 How Does Communication Work?
BSP considers communication en masse. This makes it possible to bound the time to deliver a whole set of data by treating all the communication actions of a superstep as a unit. If the maximum number of incoming or outgoing messages per processor is h, then such a communication pattern is called an h-relation. The parameter g measures the permeability of the network to continuous traffic addressed to uniformly random destinations, and is defined such that it takes time hg to deliver an h-relation. BSP does not distinguish between sending 1 message of length m and sending m messages of length 1: either way the cost is mg.
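A toy helper (illustrative names, not a real library) that applies the definition directly: a superstep's communication costs h·g, with h the largest number of words any single processor sends or receives.

```python
# words_out[i] / words_in[i]: words processor i sends / receives
def h_relation_cost(words_out, words_in, g):
    h = max(max(words_out), max(words_in))  # the "h" of the h-relation
    return h * g

# 4 processors, g = 4: one processor sends 8 words, so h = 8
print(h_relation_cost([8, 2, 1, 0], [3, 3, 3, 2], g=4))  # -> 32
```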

24 Barrier Synchronization
Barriers are "often expensive and should be used as sparingly as possible." The developers of BSP claim that barriers are not as expensive as they are believed to be in high-performance computing folklore. The cost of a barrier synchronization has two parts: the cost caused by the variation in completion time of the computation steps that participate, and the cost of reaching a globally consistent state in all processors. This cost is captured by the parameter l ("ell"), which is also tied to parallel slackness; a lower bound on l is the diameter of the network.

25 Predictability of the BSP Model
Characteristics: p = number of processors; s = processor computation speed (flops/s), used to calibrate g and l; l = synchronization periodicity, the minimal number of time steps between successive synchronization operations; g = (total number of local operations performed by all processors in one second) / (total number of words delivered by the communications network in one second). Cost of a superstep (standard cost model): MAX(wi) + MAX(hi g) + l, or just w + hg + l. Cost of a superstep (overlapping cost model): MAX(w, hg) + l.
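The two cost formulas as a small Python sketch (function and parameter names are mine, not a BSP library's):

```python
# Cost of one superstep from per-processor work w[i] and words h[i].
def superstep_cost(w, h, g, l, overlapping=False):
    w_max, hg_max = max(w), max(h) * g
    if overlapping:                # MAX(w, hg) + l
        return max(w_max, hg_max) + l
    return w_max + hg_max + l      # standard: w + hg + l

w, h = [100, 80, 120, 90], [10, 4, 8, 6]
print(superstep_cost(w, h, g=4, l=50))                    # -> 210
print(superstep_cost(w, h, g=4, l=50, overlapping=True))  # -> 170
```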

26 Predictability of the BSP Model
Strategies used in writing efficient BSP programs: Balance the computation in each superstep between processes, since w is the maximum over all computation times and the barrier synchronization must wait for the slowest process (techniques: prioritization, bounded staleness, soft synchronization). Balance the communication between processes, since h is the maximum fan-in and/or fan-out of data (techniques: message grouping, prioritization). Minimize the number of supersteps, which determines the number of times the parallel slackness term l appears in the final cost.

27 LogP model
Adapted from Michael C. Scherger

28 LogP Model
The BSP model is limited to the bandwidth of the network (g) and the number of processors, and it requires a large load per superstep; better models are needed for portable algorithms. Assumptions: each node is a powerful processor with large memory; the interconnection structure has limited bandwidth; the interconnection structure has significant latency.

29 Parameters
L: latency, the delay on the network: the time from sender to receiver. o: overhead, the time either processor is occupied sending or receiving a message; it can do nothing else for o cycles. g: gap, the minimum interval between consecutive messages (due to bandwidth). P: number of processors. Note: L, o, and g are independent of P and of node distances, and are measured in cycles. Message length: L, o, and g are per word, or per message of fixed short length; a k-word message counts as k short messages (k·o overhead), and L is independent of message length.

30 Parameters (continued)
Bandwidth: 1/g per unit message length. Number of messages in flight to or from each processor: at most L/g. Send-to-receive total time: L + 2o; if o ≫ g, the gap can be ignored, since the overhead dominates the message rate. Similar to BSP, except there is no synchronization step. No communication/computation overlapping is assumed; such overlap would yield a speed-up factor of at most two.
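A sketch of these rules in Python (assuming g >= o, so consecutive sends are spaced by the gap): k short messages between two nodes take (k-1)·g of injection time plus L + 2o for the last one to land.

```python
# End-to-end time for k short messages from one node to another.
def logp_time(k, L, o, g):
    last_send = (k - 1) * max(o, g)  # sends spaced by the gap
    return last_send + o + L + o     # overhead + latency + overhead

print(logp_time(1, L=6, o=2, g=4))   # -> 10  (= L + 2o)
print(logp_time(3, L=6, o=2, g=4))   # -> 18
```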

31 Broadcast
Optimal broadcast tree for P = 8, L = 6, g = 4, o = 2. [Figure: the root P0 sends at intervals of g, so its messages arrive at times 10, 14, 18, and 22; recipients forward the value in turn, producing further arrivals at 20 and 24; edges are annotated with o, g, and L]
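The figure's numbers can be reproduced with a greedy simulation (a sketch, assuming g >= o): every informed node starts forwarding as soon as it knows the value, one send every g cycles, and each message lands o + L + o cycles after its send begins.

```python
import heapq

# Earliest time every processor learns the value in an optimal
# LogP broadcast (greedy: all informed nodes keep sending).
def broadcast_times(P, L, o, g):
    arrivals = [0]          # the root knows the value at time 0
    ready = [0]             # times at which some node can start a send
    while len(arrivals) < P:
        t = heapq.heappop(ready)
        arrival = t + o + L + o
        arrivals.append(arrival)
        heapq.heappush(ready, t + g)      # sender free after the gap
        heapq.heappush(ready, arrival)    # new node starts forwarding
    return sorted(arrivals)

print(broadcast_times(8, L=6, o=2, g=4))
# -> [0, 10, 14, 18, 20, 22, 24, 24]: completion at t = 24
```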

32 BSP vs. LogP
BSP differs from LogP in three ways: LogP uses a form of message passing based on pairwise synchronization. LogP adds an extra parameter representing the overhead involved in sending a message, which applies to every communication. LogP defines g in local terms: it regards the network as having a finite capacity and treats g as the minimal permissible gap between message sends from a single process. In both models the parameter g is the reciprocal of the available per-processor network bandwidth, but BSP takes a global view of g, while LogP takes a local view.

33 BSP vs. LogP
When analyzing the performance of the LogP model, it is often necessary (or convenient) to use barriers. Message overhead is present but decreasing: the only remaining overhead is transferring the message from user space to a system buffer. LogP + barriers − overhead = BSP. Each model can efficiently simulate the other.

34 BSP vs. PRAM
BSP can be regarded as a generalization of the PRAM model: if the BSP architecture has a small value of g (g = 1), it can be regarded as a PRAM, using hashing to automatically achieve efficient memory management. The value of l determines the degree of parallel slackness required to achieve optimal efficiency; l = g = 1 corresponds to the idealized PRAM, where no slackness is required.

35 MapReduce

36 MapReduce
A programming model with two primitive functions (Google): Map: <k1, v1> → list(k2, v2); Reduce: <k2, list(v2)> → list(k3, v3). How does it work? The input is a list of key-value pairs <k1, v1>. Map is applied to each pair and computes intermediate key-value pairs <k2, v2>. The intermediate pairs are hash-partitioned on k2, and each partition (k2, list(v2)) is sent to a reducer. Reduce takes a partition as input and computes key-value pairs <k3, v3>. The process may reiterate over multiple map/reduce steps.
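The canonical word-count example, written as plain Python functions that mirror the two signatures above (a sketch of the model itself, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(k1, v1):            # <k1, v1> -> list(k2, v2)
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):     # <k2, list(v2)> -> list(k3, v3)
    return [(k2, sum(values))]

docs = [("d1", "big data big graphs"), ("d2", "big graphs")]
pairs = sorted(p for k1, v1 in docs for p in map_fn(k1, v1))
result = [kv for k2, grp in groupby(pairs, key=itemgetter(0))
             for kv in reduce_fn(k2, [v for _, v in grp])]
print(result)  # -> [('big', 3), ('data', 1), ('graphs', 2)]
```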

37 Architecture (Hadoop)
The input <k1, v1> pairs are stored in the DFS, partitioned into blocks (64 MB), with one block for each mapper (a map task). Each mapper emits <k2, v2> pairs into its local store; these are hash-partitioned on k2 and shipped to the reducers. The reducers emit <k3, v3> pairs and the results are aggregated, possibly over multiple steps. There is no need to worry about how the data is stored and sent.

38 Data partitioned parallelism
What parallelism? The same computation runs in parallel on each partition of the data: <k1, v1> → mapper → <k2, v2> → reducer → <k3, v3>. This is data-partitioned parallelism.

39 Study Spark: https://spark.apache.org/
MapReduce-style systems are popular in industry: Apache Hadoop, used by Facebook, Yahoo, …; Hive (Facebook), HiveQL (SQL); PIG (Yahoo), Pig Latin (SQL-like); SCOPE (Microsoft), SQL; Cassandra (Facebook), CQL (no join); HBase (Google), a distributed BigTable; MongoDB, document-oriented (NoSQL). Scalability: Yahoo! used 10,000 cores for Web search queries (2008); Facebook holds 100 PB, growing by about half a PB per day; using Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3), The New York Times processed 4 TB of image data on 100 EC2 instances for $240.

40 Advantages of MapReduce
Simple: one only needs to define two functions, with no need to worry about how the data is stored or distributed, or how the operations are scheduled. Scalability: a large number of low-end machines; scale out (scale horizontally) by adding new low-cost "commodity" computers to a distributed software application, rather than scale up (scale vertically) by adding costly resources to a single node. Independence: it can work with various storage layers. Flexibility: independent of data models or schemas. Fault tolerance: why?

41 Fault tolerance
The input <k1, v1>, intermediate <k2, v2>, and output <k3, v3> data are triplicated. Failures are detected and the tasks of failed nodes are reassigned to healthy nodes, with redundancy checking used to achieve load balancing. As a result, MapReduce is able to handle an average of 1.2 failures per analysis job.

42 MapReduce algorithms
Input: a query Q and a graph G. Output: the answers Q(G) to Q in G.
map(key: node, value: (adjacency-list, others)) { computation; emit (mkey, mvalue) }
reduce(key: mkey, value: list[mvalue]) { … emit (rkey, rvalue) }
The reduce input must match (mkey, mvalue), and (rkey, rvalue) must in turn be compatible with (mkey, mvalue) when multiple iterations of MapReduce are needed. A sketch of one round in this template follows.
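A sketch of one round in this template (with a hypothetical query: the in-degree of every node), using an in-memory dict to stand in for the hash partitioning:

```python
from collections import defaultdict

def map_fn(node, adjacency_list):        # emit (mkey, mvalue)
    return [(neighbor, 1) for neighbor in adjacency_list]

def reduce_fn(node, counts):             # emit (rkey, rvalue)
    return (node, sum(counts))

G = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
partitions = defaultdict(list)           # shuffle: hash-partition on mkey
for node, adj in G.items():
    for mkey, mvalue in map_fn(node, adj):
        partitions[mkey].append(mvalue)

print(sorted(reduce_fn(k, vs) for k, vs in partitions.items()))
# -> [('a', 1), ('b', 1), ('c', 2)]
```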

43 Control flow
Copy files from the input directory to staging dir 1 (preprocessing). While the termination condition is not satisfied, do { map from staging dir 1; reduce into staging dir 2; move files from staging dir 2 to staging dir 1 }: these are the iterations of MapReduce. Postprocessing: move the files to the output directory. Termination is checked by a non-MapReduce driver program, since this is functional programming: there are no global data structures accessible and mutable by all. A driver sketch follows.

44 Conclusion
We need models to reason about, compare, analyze, and design algorithms. PRAM is simple and easy to understand, with a rich set of theoretical results, but it is over-simplistic and often not realistic; the programs written on these machines are, in general, of type MIMD. BSP is a computational model based on supersteps: it does not use locality of reference for the assignment of processes to processors, and its predictability is defined in terms of three parameters. BSP is a generalization of PRAM, and BSP = LogP + barriers − overhead. LogP refines the communication cost model further; MapReduce is the corresponding programming model.

45 Papers for you to review
W. Fan, F. Geerts, and F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013.
Y. Tao, W. Lin, and X. Xiao. Minimal MapReduce Algorithms (MMC).
L. Qin, J. Yu, L. Chang, H. Cheng, C. Zhang, and X. Lin. Scalable Big Graph Processing in MapReduce. SIGMOD.
W. Lu, Y. Shen, S. Chen, and B. Ooi. Efficient Processing of k Nearest Neighbor Joins using MapReduce. PVLDB.
V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. Sarma. Finding Connected Components in Map-Reduce in Logarithmic Rounds. ICDE 2013.
More on the course website.

