Distributed Systems CS 15-440 Hadoop Lecture 15, November 07, 2018 Mohammad Hammoud
Today
Last Session: MPI
Today’s Session: Hadoop Distributed File System and MapReduce
Announcements:
  P2 grades are out
  PS4 will be out today
  P3 is due on Nov 26 by midnight
We Live in a World of Data…
What Do We Do With Big Data?
Store, share, access, process, encrypt, … and more!
We want to do all of these seamlessly...
Where to Store Big Data?
The underlying storage system is a key component for enabling Big Data querying/mining/analytics
Typically, the storage system “partitions” and “distributes” Big Data, using striping (or partitioning) and placement techniques
This allows concurrent access to the data and improves fault tolerance
[Figure: a logical file divided into striping units 1 to 15 and spread across Server 1 to Server 4; the units are grouped as {4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, and {3, 7, 11, 15}, one group per server. Labels: Logical File, Striping Unit, Stripe Size.]
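The grouping of striping units in the figure is consistent with a round-robin placement. A minimal sketch, assuming units are numbered from 1 and assigned by `unit % num_servers`; the function name `stripe` is hypothetical, and which group lands on which physical server is not recoverable from the slide:

```python
def stripe(num_units, num_servers):
    """Group striping units 1..num_units round-robin into num_servers groups."""
    groups = {g: [] for g in range(num_servers)}
    for unit in range(1, num_units + 1):
        # Assumption: consecutive units go to consecutive groups, wrapping around.
        groups[unit % num_servers].append(unit)
    return groups


placement = stripe(15, 4)
for group, units in placement.items():
    print(f"group {group}: {units}")
```

Because consecutive units land in different groups, a sequential read of the file can be served by all four servers at once.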
Example: The Google File System
GFS partitions large files into fixed-size blocks and distributes them randomly across cluster machines
[Figure: a large file split into blocks Blk 0 to Blk 6 at offsets 0M, 64M, 128M, 192M, 256M, 320M, and 384M; copies of the blocks are spread across Server 0 (Writer), Server 1, Server 2, and Server 3.]
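The partitioning and random placement described above can be sketched as follows. This is an illustration, not GFS's actual placement policy: the 64 MB block size is read off the offsets in the figure, while the 448 MB file size, the three copies per block, and the function names are assumptions made for the example.

```python
import random

BLOCK_SIZE_MB = 64  # taken from the 64M offset spacing in the figure


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return (block_index, start_offset_mb) pairs for fixed-size blocks."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return [(i, i * block_size_mb) for i in range(num_blocks)]


def place_blocks(num_blocks, servers, replicas=3, seed=None):
    """Pick `replicas` distinct servers at random for every block (assumed policy)."""
    rng = random.Random(seed)
    return {blk: rng.sample(servers, replicas) for blk in range(num_blocks)}


blocks = split_into_blocks(448)  # hypothetical 448 MB file
servers = ["Server 0", "Server 1", "Server 2", "Server 3"]
placement = place_blocks(len(blocks), servers, seed=0)
```

Sampling without replacement guarantees that the copies of one block sit on different servers, which is what makes the loss of a single machine survivable.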
Example: The Google File System
GFS adopts a master-slave architecture
[Figure: the GFS client sends a file name to the Master and receives a contact address; it then sends a chunk ID and range to a Chunk Server and receives the chunk data. Each Chunk Server stores its chunks on a Linux File System.]
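The message flow in the figure can be sketched as a toy, in-process model. All class and method names below are hypothetical, and passing a chunk index along with the file name is an assumption; the point is only that the master serves metadata while the chunk server serves data.

```python
class Master:
    """Holds metadata only: which chunk of which file lives on which server."""

    def __init__(self):
        self.chunk_table = {}  # (file_name, chunk_index) -> (chunk_id, server)

    def lookup(self, file_name, chunk_index):
        return self.chunk_table[(file_name, chunk_index)]


class ChunkServer:
    """Holds the chunk data itself."""

    def __init__(self):
        self.chunks = {}  # chunk_id -> bytes

    def read(self, chunk_id, start, end):
        return self.chunks[chunk_id][start:end]


def client_read(master, file_name, chunk_index, start, end):
    # Step 1: ask the master where the chunk lives (metadata only).
    chunk_id, server = master.lookup(file_name, chunk_index)
    # Step 2: fetch the byte range directly from the chunk server.
    return server.read(chunk_id, start, end)


master, server = Master(), ChunkServer()
server.chunks["c0"] = b"hello world"
master.chunk_table[("demo.txt", 0)] = ("c0", server)
data = client_read(master, "demo.txt", 0, 0, 5)
```

Keeping bulk data off the master's path is what lets a single master coordinate many chunk servers.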
How to Process Big Data?
One alternative: create a custom distributed system (or program) for each new algorithm
  Cumbersome!
Another alternative: utilize modern distributed analytics frameworks, which:
  Relieve programmers from concerns with many of the difficult aspects of developing distributed programs
  Allow programmers to focus ONLY on the sequential parts of their programs
Examples:
  Hadoop MapReduce
  Google’s Pregel
  CMU’s Distributed GraphLab
Distributed Analytics Frameworks
Hadoop MapReduce
  Introduction
  Programming Model
  Execution Model
  Architectural & Scheduling Models
Hadoop
Hadoop is one of the most successful realizations of large-scale “data-parallel” distributed analytics frameworks
Hadoop MapReduce is an open source implementation of Google’s MapReduce
Hadoop uses the Hadoop Distributed File System (HDFS) as its storage layer
HDFS is an open source implementation of GFS
Hadoop MapReduce: A Bird’s Eye View
Hadoop MapReduce incorporates two phases, the Map phase and the Reduce phase, each of which encompasses multiple tasks (Map tasks and Reduce tasks, respectively)
[Figure: a dataset stored in HDFS is divided into splits (Split 0 to Split 3), one per HDFS block; each split feeds a Map task, which produces partitions (Map phase). In the Reduce phase, the partitions pass through the Shuffle stage, the Merge & Sort stage, and the Reduce stage of the Reduce tasks, whose output is written to HDFS.]
The Programming Model
Hadoop MapReduce employs a shared-based programming model, which entails that:
  Tasks can interact (if needed) by reading from and writing to a shared space
    HDFS provides the shared space for all Map and Reduce tasks
  Programmers write only sequential code, without defining functions that send/receive messages between tasks
[Figure: Map tasks MT1 to MT6 and Reduce tasks RT1 to RT3, placed between shared address spaces (provided by HDFS). Communication between the Map and Reduce tasks is “implicit” (provided by the MapReduce engine): programmers do not write or call any communication routines.]
Example: Word Count
[Figure: a text file containing “Mohammad is delivering a lecture to the 15-440 class” and “The course name of 15-440 is Distributed Systems” is divided into two chunks, one per sentence. Each chunk is presented to a Map function as Key1/Value1 pairs; the Map function parses and counts (“Parse & Count”), emitting Key2/Value2 pairs such as (Mohammad, 1) and (The, 1). A Reduce function iterates over the pairs and sums them (“Iterate & Sum”), producing totals such as (Mohammad, 1) and (is, 2).]
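The word count example can be sketched in plain Python, without Hadoop, to show how little the programmer has to write. The function names, the use of a byte offset as Key1, and the in-memory grouping step (which stands in for the shuffle performed by the MapReduce engine) are assumptions made for illustration.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Parse & Count: emit (word, 1) for every word in one line of input."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Iterate & Sum: add up all the counts emitted for one word."""
    return word, sum(counts)


def word_count(chunks):
    # Grouping by key stands in for the engine's shuffle; the user never writes it.
    grouped = defaultdict(list)
    for chunk in chunks:
        for offset, line in chunk:
            for word, one in map_fn(offset, line):
                grouped[word].append(one)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


chunks = [
    [(0, "Mohammad is delivering a lecture to the 15-440 class")],
    [(0, "The course name of 15-440 is Distributed Systems")],
]
result = word_count(chunks)
```

Only `map_fn` and `reduce_fn` correspond to user code; both are purely sequential and contain no communication calls.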
The Execution Model
Hadoop MapReduce adopts a synchronous execution model
A distributed program (or system) is said to be synchronous if and only if its constituent tasks operate in lock-step mode
  No two tasks can run concurrently under two different iterations
In MapReduce:
  Each iteration is treated as a MapReduce job
  A job can encompass 1 or many Map tasks and 0 or many Reduce tasks
  Programs with multiple iterations (i.e., iterative programs) are executed using multiple chained MapReduce jobs
  When all Reduce tasks within job i are committed, a new job i + 1 is started (if any)
  Hence, two different tasks cannot run in parallel under two different jobs (or iterations)
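The chaining of jobs described above can be illustrated with a sequential sketch. The driver functions `run_job` and `run_chained_jobs` and the toy map/reduce functions are hypothetical; the only property being shown is that job i + 1 consumes the fully committed output of job i.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run one job: every Map task finishes before any Reduce output is produced."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [(key, reduce_fn(key, values)) for key, values in sorted(groups.items())]


def run_chained_jobs(records, map_fn, reduce_fn, iterations):
    for _ in range(iterations):
        # The next job starts only after the previous job's output is complete.
        records = run_job(records, map_fn, reduce_fn)
    return records


def double_map(key, value):
    yield key, value * 2


def sum_reduce(key, values):
    return sum(values)


final = run_chained_jobs([("a", 1), ("a", 2), ("b", 3)], double_map, sum_reduce, 2)
print(final)
```

The loop boundary plays the role of the barrier between iterations: no task of the second job can observe partial results of the first.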
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture
A pull-based task scheduling strategy is used, whereby:
  Map tasks are scheduled in proximity of HDFS blocks
  Reduce tasks are scheduled anywhere
[Figure: a core switch connects Rack Switch 1 and Rack Switch 2, under which sit the JobTracker (the master) and TaskTracker1 to TaskTracker5 (a TaskTracker is a slave), with Map tasks MT1, MT2, and MT3 shown on the TaskTrackers. TaskTracker1 requests a Map task, and the JobTracker schedules a Map task at an empty Map slot on TaskTracker1.]
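A minimal sketch of the pull-based, locality-aware assignment described above. It assumes that "proximity" means the requesting TaskTracker stores a copy of the task's HDFS block, with a fallback to any pending task; the function name and data structures are hypothetical and do not reflect the JobTracker's actual code.

```python
def assign_map_task(tracker, pending_tasks, block_locations):
    """Return a pending Map task for the requesting tracker, preferring local data."""
    for task in pending_tasks:
        if tracker in block_locations[task]:
            pending_tasks.remove(task)  # safe: we return immediately afterwards
            return task
    if pending_tasks:
        # No local block: hand out the oldest pending task instead (assumed fallback).
        return pending_tasks.pop(0)
    return None


# Hypothetical placement of the blocks read by MT1..MT3.
block_locations = {
    "MT1": {"TaskTracker1"},
    "MT2": {"TaskTracker2"},
    "MT3": {"TaskTracker1", "TaskTracker3"},
}
pending = ["MT1", "MT2", "MT3"]

first = assign_map_task("TaskTracker2", pending, block_locations)
second = assign_map_task("TaskTracker1", pending, block_locations)
third = assign_map_task("TaskTracker4", pending, block_locations)
fourth = assign_map_task("TaskTracker1", pending, block_locations)
```

The scheduler is pull-based because nothing happens until a tracker calls in with a free slot; the master never pushes work to a busy node.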
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture
With the above setup, how many Map tasks can run in parallel?
  Each TaskTracker has by default two Map slots, and thus can run two Map tasks concurrently
  With 4 TaskTrackers and 2 Map slots on each TaskTracker, 8 Map tasks can be executed in parallel
  The maximum number of Map tasks that can run in parallel is referred to as a Map wave
[Figure: the cluster diagram from the previous slide, repeated.]
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture
For a dataset with a size of 1024MB, how many Map waves are needed?
  The size of each HDFS block is by default 64MB, and each split encompasses by default 1 HDFS block
  Hence, there will be a total of 1024/64 = 16 HDFS blocks, or 16 splits
  The input to each Map task is a single split, thus there will be a total of 16 Map tasks
  Therefore, 16 tasks/8 slots = 2 Map waves will be needed
[Figure: the cluster diagram from the previous slide, repeated.]
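The arithmetic above can be checked with a short sketch. The function name `map_waves` is hypothetical, the defaults are the values stated on the slides, and rounding up for a dataset that does not divide evenly is an assumption (a partly filled last block or wave still costs a full task or wave).

```python
import math


def map_waves(dataset_mb, block_mb=64, task_trackers=4, slots_per_tracker=2):
    """Return (map_tasks, parallel_slots, waves) for one split per HDFS block."""
    map_tasks = math.ceil(dataset_mb / block_mb)
    parallel_slots = task_trackers * slots_per_tracker
    waves = math.ceil(map_tasks / parallel_slots)
    return map_tasks, parallel_slots, waves


tasks, slots, waves = map_waves(1024)
print(tasks, slots, waves)
```

For the 1024MB dataset this gives 16 Map tasks, 8 parallel slots, and 2 Map waves, matching the slide.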
Hadoop MapReduce: Summary
Aspect                  Hadoop MapReduce
Programming Model       Shared-Based
Execution Model         Synchronous
Architectural Model     Master-Slave
Scheduling Model        Pull-Based
Suitable Applications   Loosely-Connected/Embarrassingly-Parallel Applications
Next Class: Pregel and GraphLab