
1 Distributed Systems CS 15-440
Programming Models - Part IV
Lecture 18, Nov 16, 2015
Mohammad Hammoud

2 Today…
Last Session: Quiz II; Programming Models - Part III: MapReduce
Today's Session: Programming Models - Part IV: MapReduce (Cont'd) & Pregel
Announcements:
P3 is due today by midnight
PS4 is due on Thursday, Nov 19 by midnight
Quiz II grades are out

3 The MapReduce Analytics Engine
Basics A Closer Look Combiner Functions Task & Job Scheduling Fault-Tolerance

4 Combiner Functions
MapReduce applications are limited by the bandwidth available on the cluster, so it pays off to minimize the data shuffled between Map and Reduce tasks. Hadoop allows users to specify combiner functions (written just like reduce functions) to be run on Map outputs before the shuffle.
[Figure: Map tasks spread across racks and nodes emit (Year, Temperature) pairs; one Map task outputs (1950, 0), (1950, 20), and (1950, 10), and a combiner collapses them into the single pair (1950, 20) before the data is shuffled to the Reduce task.]
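As a rough illustration only (Hadoop's real combiners are Java Reducer subclasses; the C++ sketch below, with hypothetical names, just shows the effect of local aggregation on a Map task's output):

    // Hypothetical sketch: a max-temperature combiner applied locally to one
    // Map task's output before it is shuffled to reducers.
    #include <cstdio>
    #include <map>
    #include <vector>

    using KV = std::pair<int, int>;  // (year, temperature)

    // Keep only the maximum temperature observed per year.
    std::map<int, int> CombineLocalMax(const std::vector<KV>& map_output) {
      std::map<int, int> combined;
      for (const auto& [year, temp] : map_output) {
        auto it = combined.find(year);
        if (it == combined.end() || temp > it->second) combined[year] = temp;
      }
      return combined;  // shuffled output: one pair per year instead of many
    }

    int main() {
      std::vector<KV> map_output = {{1950, 0}, {1950, 20}, {1950, 10}};
      for (const auto& [year, temp] : CombineLocalMax(map_output))
        std::printf("(%d, %d)\n", year, temp);  // prints (1950, 20)
    }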

5 The MapReduce Analytics Engine
Basics A Closer Look Combiner Functions Task & Job Scheduling Fault-Tolerance

6 Task Scheduling in MapReduce
MapReduce adopts a master-slave architecture: the master node is referred to as the JobTracker (JT), and each slave node is referred to as a TaskTracker (TT). MapReduce adopts a pull-based scheduling strategy (rather than a push-based one); i.e., JT does not push Map and Reduce tasks to TTs, but rather TTs pull them by sending requests to JT.
[Figure: each TT advertises its free task slots and sends a request to JT; JT replies with tasks (T0, T1, T2, ...) drawn from its tasks queue.]

7 Map and Reduce Task Scheduling
Every TT periodically sends a heartbeat message to JT that includes a request for a Map or a Reduce task.
Map Task Scheduling: JT satisfies requests for Map tasks by attempting to schedule them in the vicinity of their input splits (i.e., it exploits data locality).
Reduce Task Scheduling: In contrast, JT simply assigns the next yet-to-run Reduce task to a requesting TT, regardless of that TT's network location and its implied effect on the reducer's shuffle time (i.e., it does not exploit data locality). A sketch of this asymmetry follows below.
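The following is a minimal sketch (not JobTracker code; all names are hypothetical, and the real scheduler also considers rack-level locality) contrasting the two policies: Map tasks are matched to the requesting node's local splits when possible, while Reduce tasks are simply handed out in order.

    #include <deque>
    #include <optional>
    #include <string>

    struct MapTask    { int id; std::string split_location; };  // node holding the input split
    struct ReduceTask { int id; };

    // Map scheduling: prefer a task whose input split lives on the requesting node.
    std::optional<MapTask> ScheduleMap(std::deque<MapTask>& pending,
                                       const std::string& requesting_node) {
      for (auto it = pending.begin(); it != pending.end(); ++it) {
        if (it->split_location == requesting_node) {   // data-local task found
          MapTask t = *it;
          pending.erase(it);
          return t;
        }
      }
      if (pending.empty()) return std::nullopt;
      MapTask t = pending.front();                      // fall back to any pending task
      pending.pop_front();
      return t;
    }

    // Reduce scheduling: hand out the next yet-to-run task, locality ignored.
    std::optional<ReduceTask> ScheduleReduce(std::deque<ReduceTask>& pending) {
      if (pending.empty()) return std::nullopt;
      ReduceTask t = pending.front();
      pending.pop_front();
      return t;
    }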

8 Job Scheduling in MapReduce
In MapReduce, an application is represented by one or more jobs, and a job consists of one or more Map and Reduce tasks. Hadoop MapReduce comes with various choices of job schedulers:
FIFO Scheduler: schedules jobs in order of submission
Fair Scheduler: aims at giving every user a "fair" share of the cluster capacity over time
Capacity Scheduler: similar to the Fair Scheduler, but does not apply job preemption

9 The MapReduce Analytics Engine
Basics A Closer Look Combiner Functions Task & Job Scheduling Fault-Tolerance

10 Fault Tolerance in Hadoop: Node Failures
MapReduce can guide jobs to successful completion even on large clusters, where the probability of failures increases. Hadoop MapReduce achieves fault-tolerance by restarting tasks:
If a TT fails to communicate with JT for a period of time (by default, 1 minute), JT assumes that the TT in question has crashed
If the job is still in the Map phase, JT asks another TT to re-execute all Map tasks that previously ran at the failed TT
If the job is in the Reduce phase, JT asks another TT to re-execute all Reduce tasks that were in progress on the failed TT
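As a rough sketch of the policy above (hypothetical types and names, not Hadoop source), the re-execution decision boils down to:

    #include <vector>

    enum class Phase { kMap, kReduce };
    struct Task { int id; bool in_progress; };

    // Decide which of the failed TaskTracker's tasks must be re-executed elsewhere.
    // Map outputs live on the failed node's local disk, so during the Map phase all
    // of its Map tasks are redone; during the Reduce phase, only the Reduce tasks
    // that were still in progress on the failed node are redone.
    std::vector<Task> TasksToRerun(Phase job_phase,
                                   const std::vector<Task>& map_tasks_on_failed_tt,
                                   const std::vector<Task>& reduce_tasks_on_failed_tt) {
      if (job_phase == Phase::kMap) return map_tasks_on_failed_tt;   // all of them
      std::vector<Task> rerun;
      for (const Task& t : reduce_tasks_on_failed_tt)
        if (t.in_progress) rerun.push_back(t);                        // only in-progress reducers
      return rerun;
    }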

11 Fault Tolerance in Hadoop: Speculative Execution
A MapReduce job is dominated by its slowest task. MapReduce attempts to locate slow tasks (or stragglers) and run replicated (or speculative) copies that will, optimistically, commit before the corresponding stragglers. In general, this strategy is known as task resiliency or task replication (as opposed to data replication); in Hadoop it is referred to as speculative execution. Only one speculative copy of a straggler is allowed at a time. Whichever of the two copies of a task commits first becomes the definitive copy, and the other copy is killed by JT.

12 But, How to Locate Stragglers?
Hadoop monitors each task's progress using a progress score between 0 and 1. If a task's progress score is less than (average - 0.2), and the task has run for at least 1 minute, it is marked as a straggler. For example, a task T1 with a progress score of 2/3 is not a straggler, while a task T2 with a progress score of 1/12 is; see the sketch below.
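A minimal sketch of that rule (names are hypothetical; the real Hadoop heuristic also distinguishes Map and Reduce progress phases):

    // Straggler test applied to each running task: progress score well below the
    // average of its peers, and old enough that the gap is meaningful.
    bool IsStraggler(double progress_score, double avg_progress_score,
                     double runtime_seconds) {
      constexpr double kProgressGap = 0.2;        // the "average - 0.2" threshold
      constexpr double kMinRuntimeSeconds = 60;   // must have run for at least 1 minute
      return runtime_seconds >= kMinRuntimeSeconds &&
             progress_score < avg_progress_score - kProgressGap;
    }

    // Example: with an average progress score of 0.375, T2 at 1/12 after 2 minutes
    // is a straggler, while T1 at 2/3 is not:
    //   IsStraggler(1.0 / 12, 0.375, 120) == true
    //   IsStraggler(2.0 / 3,  0.375, 120) == false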

13 To this End…

14 What Makes MapReduce Unique?
MapReduce is characterized by:
Its simplified programming model, which allows the user to quickly write and test distributed systems
Its efficient and automatic distribution of data and workload across cluster machines
Its flat scalability curve: after a MapReduce program is written and executed on a 10-machine cluster, very little (if any) work is required to make the same program run on a 1000-machine cluster
Minimization of communication overhead as much as possible

15 Comparison With Traditional Models
Aspect             | Shared Memory               | Message Passing         | MapReduce
Communication      | Implicit (via loads/stores) | Explicit messages       | Limited and implicit
Synchronization    | Explicit                    | Implicit (via messages) | Immutable (K, V) pairs
Hardware Support   | Typically required          | None                    |
Development Effort | Lower                       | Higher                  | Lowest
Tuning Effort      |                             |                         |

16 Discussion on Programming Models
Objectives
Discussion on Programming Models: MapReduce, Pregel and GraphLab (cont'd over 3 sessions)
Message Passing Interface (MPI)
Types of Parallel Programs
Traditional Models of Parallel Programming
Parallel Computer Architectures
Why Parallelize Our Programs?

17 The Pregel Analytics Engine
Motivation & Definition The Computation & Programming Models Input and Output Architecture & Execution Flow Fault-Tolerance

18 Motivation for Pregel
How to implement algorithms to process Big Graphs?
Create a custom distributed infrastructure for each new algorithm: difficult!
Rely on existing distributed analytics engines like MapReduce: inefficient and cumbersome! Graph algorithms are usually processed more efficiently using a message-passing programming model.
Use a single-computer graph algorithm library like BGL, LEDA, or NetworkX: Big Graphs might be too large to fit on a single machine!
Use a parallel graph processing system like Parallel BGL or CGMGraph: not suited for large-scale distributed systems! Parallel BGL and CGMGraph do not apply the fault-tolerance mechanisms that are necessary in large-scale deployments.

19 What is Pregel?
Pregel is a large-scale, graph-parallel, distributed analytics engine. Some characteristics:
In-memory (as opposed to MapReduce)
High scalability
Automatic fault-tolerance
Flexibility in expressing graph algorithms
Message-passing programming model
Tree-style, master-slave architecture
Synchronous
Pregel is inspired by Valiant's Bulk Synchronous Parallel (BSP) model.

20 The Pregel Analytics Engine
Motivation & Definition The Computation & Programming Models Input and Output Architecture & Execution Flow Fault-Tolerance

21 The BSP Model
[Figure: in the BSP model, computation proceeds as a sequence of super-steps (Super-Step 1, 2, 3, ...); within each super-step, CPUs 1-3 compute on their local data in parallel and exchange data, then all wait at a global barrier before the next super-step (iteration) begins.]

22 Entities and Super-Steps
The computation is described in terms of vertices, edges, and a sequence of super-steps. You give Pregel a directed graph consisting of vertices and edges. Each vertex is associated with a modifiable, user-defined value. Each edge is associated with a source vertex, a value, and a destination vertex.
During a super-step S:
A user-defined function F is executed at each vertex V
F can read messages sent to V in super-step S - 1 and send messages to other vertices that will be received at super-step S + 1
F can modify the state of V and its outgoing edges
F can alter the topology of the graph
A super-step acts as a global synchronization barrier.

23 Topology Mutations
The graph structure can be modified during any super-step: vertices and edges can be added or deleted. Mutating graphs can create conflicting requests, where multiple vertices at a super-step might try to alter the same edge/vertex. Conflicts are avoided using partial ordering and handlers.
Partial orderings: edges are removed before vertices; vertices are added before edges. Mutations performed at super-step S only take effect at super-step S + 1, and all mutations precede calls to the actual computations.
Handlers: among multiple conflicting requests, one request is selected arbitrarily.

24 Algorithm Termination
Algorithm termination is based on every vertex voting to halt. In super-step 0, every vertex is active, and all active vertices participate in the computation of any given super-step. A vertex deactivates itself by voting to halt and enters an inactive state; it can return to the active state if it receives an external message. A Pregel program terminates when all vertices are simultaneously inactive and there are no messages in transit.
[Vertex state machine: Active -> (vote to halt) -> Inactive; Inactive -> (message received) -> Active]

25 Finding the Max Value in a Graph
[Figure: a four-vertex example with initial values 3, 6, 2, 1; blue arrows are messages and blue vertices have voted to halt. In super-step S each vertex sends its value to its neighbors; over super-steps S + 1 and S + 2 the value 6 propagates through the graph, and by super-step S + 3 every vertex holds 6 and has voted to halt.]

26 The Programming Model
Pregel adopts the message-passing programming model:
Messages can be passed from any vertex to any other vertex in the graph
Any number of messages can be passed
The message order is not guaranteed
Messages will not be duplicated
Combiners can be used to reduce the number of messages passed between super-steps (see the sketch below)
Aggregators are available for reduction operations (e.g., sum, min, and max)
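For illustration only (the Combiner and Aggregator interfaces are not shown in these slides, so the shape below is an assumed sketch rather than the real Pregel API), a max-combiner that collapses all messages bound for one vertex into a single message could look like:

    // Hypothetical sketch of a message combiner for the max-value algorithm:
    // all messages addressed to one vertex within a super-step are folded into a
    // single message carrying only the largest value, reducing network traffic.
    #include <algorithm>
    #include <vector>

    struct MaxCombiner {
      // Fold a non-empty batch of messages destined for one vertex into one value.
      double Combine(const std::vector<double>& messages) const {
        double max_so_far = messages.front();
        for (double m : messages) max_so_far = std::max(max_so_far, m);
        return max_so_far;
      }
    };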

27 The Pregel API in C++
A Pregel program is written by sub-classing the Vertex class:

    // Template arguments define the types for vertices, edges, and messages
    template <typename VertexValue, typename EdgeValue, typename MessageValue>
    class Vertex {
     public:
      // Override Compute to define the computation performed at each super-step
      virtual void Compute(MessageIterator* msgs) = 0;

      const string& vertex_id() const;
      int64 superstep() const;

      const VertexValue& GetValue();   // get the value of the current vertex
      VertexValue* MutableValue();     // modify the value of the vertex
      OutEdgeIterator GetOutEdgeIterator();

      // Pass messages to other vertices
      void SendMessageTo(const string& dest_vertex, const MessageValue& message);
      void VoteToHalt();
    };

28 Pregel Code for Finding the Max Value
    class MaxFindVertex : public Vertex<double, void, double> {
     public:
      virtual void Compute(MessageIterator* msgs) {
        // Start from the current vertex value and advertise it to all neighbors
        double currMax = GetValue();
        SendMessageToAllNeighbors(currMax);
        // Fold in any larger values received from the previous super-step
        for (; !msgs->Done(); msgs->Next()) {
          if (msgs->Value() > currMax)
            currMax = msgs->Value();
        }
        if (currMax > GetValue())
          *MutableValue() = currMax;  // value grew: stay active for the next super-step
        else
          VoteToHalt();               // no change: go inactive until a larger value arrives
      }
    };

29 The Pregel Analytics Engine
Motivation & Definition The Computation & Programming Models Input and Output Architecture & Execution Flow Fault-Tolerance

30 Input, Graph Flow and Output
The input graph in Pregel is stored in a distributed storage layer (e.g., GFS or Bigtable; a Bigtable is a sparse, distributed, persistent, multidimensional sorted map). The input graph is divided into partitions consisting of vertices and their outgoing edges:
The default partitioning function is hash(ID) mod N, where N is the number of partitions (see the sketch below)
Partitions are stored in node memories for the duration of the computation (hence an in-memory model, not a disk-based one)
Outputs in Pregel are typically graphs isomorphic (or mutated with respect) to the input graphs; yet outputs can also be aggregated statistics mined from the input graphs (this depends on the graph algorithm).
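A one-line sketch of that default placement rule (the vertex-ID type and hash are placeholders; Pregel's real partitioner hashes its own vertex IDs):

    #include <cstdint>
    #include <functional>
    #include <string>

    // Default Pregel-style partitioning: a vertex lives in partition hash(ID) mod N,
    // independent of the graph structure.
    uint32_t PartitionFor(const std::string& vertex_id, uint32_t num_partitions) {
      return std::hash<std::string>{}(vertex_id) % num_partitions;
    }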

31 The Pregel Analytics Engine
Motivation & Definition The Computation & Programming Models Input and Output Architecture & Execution Flow Fault-Tolerance

32 The Architectural Model
Pregel assumes a tree-style network topology and a master-slave architecture.
[Figure: a master and five workers (Worker1-Worker5) hang off rack switches connected through a core switch; the master pushes work (i.e., partitions) to all workers, and workers send completion signals back.]
When the master receives the completion signal from every worker in super-step S, it starts super-step S + 1.

33 The Execution Flow Steps of Program Execution in Pregel:
Copies of the program code are distributed across all machines; one copy is designated as the master and every other copy is deemed a worker/slave.
The master partitions the graph and assigns each worker one or more partitions, along with portions of the input "graph data".
Every worker executes the user-defined function on each vertex; workers can communicate with one another.

34 The Execution Flow Steps of Program Execution in Pregel:
The master coordinates the execution of super-steps.
The master calculates the number of inactive vertices after each super-step and signals workers to terminate if all vertices are inactive (and no messages are in transit); a sketch of this loop follows below.
Each worker may be instructed to save its portion of the graph.
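A minimal sketch of the coordination loop implied by slides 33-34 (all names are hypothetical placeholders, not Pregel internals):

    #include <vector>

    // Hypothetical worker handle as seen by the master (stubbed out for the sketch).
    struct Worker {
      long active_vertices = 0;    // vertices that have not voted to halt
      long pending_messages = 0;   // messages queued for the next super-step
      void RunSuperstep(int /*s*/) { /* run Compute() on each active local vertex */ }
      void SaveGraphPortion() { /* persist this worker's partitions if instructed */ }
    };

    // Master-side loop: run super-steps until every vertex is inactive and no
    // messages are in transit, then ask workers to save their portion of the graph.
    void CoordinateSupersteps(std::vector<Worker>& workers) {
      for (int s = 0; ; ++s) {
        for (Worker& w : workers) w.RunSuperstep(s);   // global barrier between super-steps
        long active = 0, in_transit = 0;
        for (const Worker& w : workers) {
          active += w.active_vertices;
          in_transit += w.pending_messages;
        }
        if (active == 0 && in_transit == 0) break;      // termination condition
      }
      for (Worker& w : workers) w.SaveGraphPortion();
    }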

35 The Pregel Analytics Engine
Motivation & Definition The Computation & Programming Models Input and Output Architecture & Execution Flow Fault-Tolerance

36 Fault Tolerance in Pregel
Fault-tolerance is achieved through checkpointing:
At the start of a super-step, the master may instruct the workers to save the state of their partitions to stable storage
The master uses "ping" messages to detect worker failures
If a worker fails, the master re-assigns the corresponding vertices and input graph data to another available worker and restarts the super-step
The available worker re-loads the partition state of the failed worker from the most recent available checkpoint
The state of a partition includes vertex values, edge values, and incoming messages. A sketch of the recovery path follows below.
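As an illustration of the recovery path only (hypothetical names and simplified state; Pregel's real checkpointing is more involved), the master's reaction to a missed ping might look like:

    #include <map>
    #include <vector>

    struct PartitionState {                     // what a checkpoint stores per partition
      std::vector<double> vertex_values;
      std::vector<double> edge_values;
      std::vector<double> incoming_messages;
    };

    struct Checkpoint {
      int superstep;                             // super-step at which it was taken
      std::map<int, PartitionState> partitions;  // partition id -> saved state
    };

    // When a worker stops answering pings, move its partitions to a healthy worker,
    // reload their state from the latest checkpoint, and restart the super-step.
    int RecoverFromFailure(const Checkpoint& latest,
                           const std::vector<int>& partitions_of_failed_worker,
                           std::map<int, PartitionState>& healthy_worker_memory) {
      for (int pid : partitions_of_failed_worker)
        healthy_worker_memory[pid] = latest.partitions.at(pid);  // reload partition state
      return latest.superstep;  // computation resumes from this super-step
    }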

37 How Does Pregel Compare to MapReduce?

38 Pregel versus MapReduce
Aspect                       | Hadoop MapReduce                                        | Pregel
Programming Model            | Shared-Memory (abstraction)                             | Message-Passing
Computation Model            | Synchronous                                             | Synchronous
Parallelism Model            | Data-Parallel                                           | Graph-Parallel
Architectural Model          | Master-Slave                                            | Master-Slave
Task/Vertex Scheduling Model | Pull-Based                                              | Push-Based
Application Suitability      | Loosely-Connected/Embarrassingly Parallel Applications | Strongly-Connected Applications

39 Next Class GraphLab

40 Back-up Slides

41 PageRank
PageRank is a link analysis algorithm: the rank value indicates the importance of a particular web page. A hyperlink to a page counts as a vote of support, so a page that is linked to by many pages with high PageRank receives a high rank itself. PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page; a PageRank of 0.5 means there is a 50% chance that a person clicking a random link will be directed to that document.

42 PageRank (Cont'd)
Iterate:
$R[i] = \alpha + (1 - \alpha) \sum_{j \to i} \frac{R[j]}{L[j]}$
Where: α is the random reset probability and L[j] is the number of links on page j.
[Figure: a small example web graph with six pages, numbered 1-6.]
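To tie this back to the Vertex API of slide 27, here is a hedged sketch of a PageRank vertex in the style of the MaxFindVertex example; the 30-super-step cutoff, the out-degree helper on GetOutEdgeIterator(), and α = 0.15 are assumptions made for illustration, not taken from these slides:

    // Sketch only: PageRank expressed with the Pregel Vertex API shown earlier.
    class PageRankVertex : public Vertex<double, void, double> {
     public:
      virtual void Compute(MessageIterator* msgs) {
        if (superstep() >= 1) {
          // R[i] = alpha + (1 - alpha) * sum over in-neighbors j of R[j] / L[j]
          double sum = 0;
          for (; !msgs->Done(); msgs->Next()) sum += msgs->Value();
          *MutableValue() = 0.15 + 0.85 * sum;
        }
        if (superstep() < 30) {
          // Each out-neighbor receives this page's rank divided by its out-degree L[i]
          int64 n = GetOutEdgeIterator().size();   // assumed out-degree helper
          SendMessageToAllNeighbors(GetValue() / n);
        } else {
          VoteToHalt();   // stop after a fixed number of iterations
        }
      }
    };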

