1
Distributed Systems CS 15-440
Hadoop Lecture 13, October 25, 2017 Mohammad Hammoud
2
Today
Last Session: MPI (Concluded)
Today's Session: Hadoop Distributed File System and MapReduce
Announcements:
P2 grades are out
PS4 is out. It is due on Nov 1st by midnight
P3 is due on Nov 12th by midnight
3
We Live in a World of Data…
4
What Do We Do With Big Data?
Store, share, access, process, encrypt ... and more! We want to do all of these seamlessly...
5
Where to Store Big Data?
The underlying storage system is a key component for enabling Big Data querying/mining/analytics.
Typically, the storage system "partitions" and "distributes" Big Data using striping (or partitioning) and placement techniques.
This allows for concurrent accesses to data and improves fault tolerance.
Figure: a logical file is divided into fixed-size striping units (1-15, one per stripe), which are placed round-robin across Server 1-Server 4.
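As a concrete illustration of the round-robin placement in the figure, here is a minimal sketch; the class and method names are made up for the example and are not tied to any particular storage system.

public class RoundRobinStriping {

    // Returns the index of the server that stores a given striping unit.
    static int serverFor(int unitIndex, int numServers) {
        return unitIndex % numServers;
    }

    public static void main(String[] args) {
        int numUnits = 15;   // striping units 1..15, as in the figure
        int numServers = 4;  // Server 1..Server 4

        for (int unit = 1; unit <= numUnits; unit++) {
            // unit - 1 so that unit 1 lands on Server 1, unit 5 wraps back to Server 1, etc.
            int server = serverFor(unit - 1, numServers) + 1;
            System.out.println("Striping unit " + unit + " -> Server " + server);
        }
    }
}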
6
Example: The Google File System
GFS partitions large files into fixed-size blocks and distributes them randomly across cluster machines.
Figure: a large file (0M-384M) is divided at the writer (Server 0) into 64MB blocks Blk 0-Blk 6, and copies of the blocks are spread across Servers 1-3.
7
Example: The Google File System
GFS adopts a master-slave architecture.
Figure: a GFS client sends a file name to the master and receives back a contact address; it then sends a chunk id and byte range to the corresponding chunk server, which returns the chunk data. Each chunk server stores its chunks on a local Linux file system.
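The interaction in the figure can be sketched as follows. Master, ChunkServer, and GfsClientSketch are hypothetical interfaces used purely for illustration; they are not the real GFS (or HDFS) client API.

// Conceptual sketch of the GFS read path: metadata from the master,
// data directly from a chunk server.
interface Master {
    // Given a file name and chunk index, return the address of a chunk server.
    String lookupChunkServer(String fileName, long chunkIndex);
}

interface ChunkServer {
    // Read `length` bytes starting at `offset` inside chunk `chunkId`.
    byte[] readChunk(String chunkId, long offset, int length);
}

class GfsClientSketch {
    private final Master master;
    private final java.util.Map<String, ChunkServer> servers;

    GfsClientSketch(Master master, java.util.Map<String, ChunkServer> servers) {
        this.master = master;
        this.servers = servers;
    }

    // Contact the master once for metadata, then fetch the bytes from the chunk server.
    byte[] read(String fileName, long chunkIndex, long offset, int length) {
        String address = master.lookupChunkServer(fileName, chunkIndex); // contact address from master
        ChunkServer server = servers.get(address);                        // talk to the chunk server directly
        String chunkId = fileName + "#" + chunkIndex;                     // illustrative chunk id
        return server.readChunk(chunkId, offset, length);                 // chunk data
    }
}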
8
How to Process Big Data?
One alternative: create a custom distributed system (or program) for each new algorithm. Cumbersome!
Another alternative: utilize modern distributed analytics frameworks, which:
Relieve programmers from many of the difficult aspects of developing distributed programs
Allow programmers to focus ONLY on the sequential parts of their programs
Examples: Hadoop MapReduce, Google's Pregel, CMU's Distributed GraphLab
9
Distributed Analytics Frameworks
Hadoop MapReduce
Introduction
Programming Model
Execution Model
Architectural & Scheduling Models
10
Hadoop
Hadoop is one of the most successful realizations of large-scale "data-parallel" distributed analytics frameworks.
Hadoop MapReduce is an open-source implementation of Google's MapReduce.
Hadoop uses the Hadoop Distributed File System (HDFS) as a storage layer.
HDFS is an open-source implementation of GFS.
11
Hadoop MapReduce: A Bird’s Eye View
Hadoop MapReduce incorporates two phases, a Map phase and a Reduce phase, which encompass multiple Map and Reduce tasks.
Figure: each HDFS block of the input dataset forms a split that is consumed by one Map task; each Map task writes its output as partitions; the Shuffle stage routes the partitions to the Reduce tasks, which merge and sort them in the Merge & Sort stage and write their results back to HDFS in the Reduce stage.
12
Distributed Analytics Frameworks
Hadoop MapReduce
Introduction
Programming Model
Execution Model
Architectural & Scheduling Models
13
The Programming Model
Hadoop MapReduce employs a shared-based programming model, which entails that:
Tasks can interact (if needed) via reading and writing to a shared space
HDFS provides the shared space for all Map and Reduce tasks
Programmers write only sequential code, without defining functions that send/receive messages between tasks
Figure: Map tasks (MT1-MT6) and Reduce tasks (RT1-RT3) communicate "implicitly" through a shared address space provided by HDFS; communication is provided by the MapReduce engine, so programmers do not write or call any communication routines.
14
Example: Word Count
Figure: a text file is split into chunks (e.g., "Mohammad is delivering a lecture to the 15-440 class" and "The course name of 15-440 is Distributed Systems"). Each chunk is fed to a Map function as (Key1, Value1) pairs of the form (byte offset, line of text). The Map function parses its chunk and counts words, emitting (Key2, Value2) pairs of the form (word, 1). A Reduce function then iterates over the counts of each word and sums them, so a word that appears in both chunks, such as "is", ends up with the count 2.
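For reference, the word-count example above maps onto the Hadoop MapReduce Java API roughly as in the standard WordCount program below: the Mapper does the Parse & Count step and the Reducer does the Iterate & Sum step. Input and output paths are taken from the command line; the combiner and other optimizations are omitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: parse each input line and emit (word, 1) for every word (Parse & Count).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: iterate over all counts of a word and sum them (Iterate & Sum).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}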
15
Distributed Analytics Frameworks
Hadoop MapReduce
Introduction
Programming Model
Execution Model
Architectural & Scheduling Models
16
The Execution Model
Hadoop MapReduce adopts a synchronous execution model.
A distributed program (or system) is said to be synchronous if and only if its constituent tasks operate in lock-step mode; no two tasks can run concurrently under two different iterations.
In MapReduce:
Each iteration is treated as a MapReduce job
A job can encompass 1 or many Map tasks and 0 or many Reduce tasks
Programs with multiple iterations (i.e., iterative programs) are executed using multiple chained MapReduce jobs (see the driver sketch below)
When all Reduce tasks within job i are committed, a new job i + 1 is started (if any)
Hence, two different tasks cannot run in parallel under two different jobs (or iterations)
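A minimal driver sketch of how chained jobs realize an iterative program with the Hadoop Job API. The identity Mapper/Reducer classes and the path naming are placeholders standing in for an actual algorithm's logic and directory layout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    int iterations = Integer.parseInt(args[0]);
    String input = args[1];

    for (int i = 0; i < iterations; i++) {
      Job job = Job.getInstance(new Configuration(), "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(Mapper.class);    // identity placeholder: the algorithm's map logic goes here
      job.setReducerClass(Reducer.class);  // identity placeholder: the algorithm's reduce logic goes here
      FileInputFormat.addInputPath(job, new Path(input));
      Path output = new Path(input + "-iter" + (i + 1));
      FileOutputFormat.setOutputPath(job, output);

      // Job i+1 starts only after every Reduce task of job i has committed:
      // waitForCompletion blocks until the whole job finishes (or fails).
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
      input = output.toString(); // output of job i becomes the input of job i+1
    }
  }
}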
17
Distributed Analytics Frameworks
Hadoop MapReduce
Introduction
Programming Model
Execution Model
Architectural & Scheduling Models
18
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
A pull-based task scheduling strategy is used, whereby:
Map tasks are scheduled in proximity to the HDFS blocks that hold their input
Reduce tasks are scheduled anywhere
Figure: the JobTracker (the master) and TaskTracker1-TaskTracker5 (the slaves) are connected through two rack switches and a core switch; a TaskTracker requests a Map task from the JobTracker, and the JobTracker schedules a Map task at an empty Map slot on TaskTracker1.
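A rough sketch of the pull-based, locality-aware decision described above. The types and method names here are invented for illustration and are not the actual JobTracker code.

import java.util.List;
import java.util.Optional;
import java.util.Set;

class SchedulerSketch {

    // A pending Map task knows which nodes hold replicas of its input split.
    record MapTask(int id, Set<String> nodesWithInputBlock) {}

    // Called when a TaskTracker with a free Map slot sends a heartbeat ("pulls" work):
    // prefer a task whose input HDFS block is local to that TaskTracker,
    // otherwise hand out any pending task.
    static Optional<MapTask> assignMapTask(String taskTracker, List<MapTask> pending) {
        Optional<MapTask> local = pending.stream()
                .filter(t -> t.nodesWithInputBlock().contains(taskTracker))
                .findFirst();
        Optional<MapTask> chosen = local.or(() -> pending.stream().findFirst());
        chosen.ifPresent(pending::remove); // the chosen task is no longer pending
        return chosen;
    }
}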
19
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
With the above setup, how many Map tasks can run in parallel?
Each TaskTracker has by default two Map slots, and thus can run two Map tasks concurrently.
With 4 TaskTrackers and 2 Map slots on each TaskTracker, 8 Map tasks can be executed in parallel.
The maximum number of Map tasks that can run in parallel is denoted as a Map wave.
Figure: the same cluster as on the previous slide (the JobTracker as the master; TaskTracker1-TaskTracker5 as slaves behind two rack switches and a core switch).
20
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
For a dataset with a size of 1024MB, how many Map waves are needed?
The size of each HDFS block is by default 64MB, and each split encompasses by default 1 HDFS block.
Hence, there will be a total of 1024/64 = 16 HDFS blocks, or 16 splits.
The input to each Map task is a single split, so there will be a total of 16 Map tasks.
Therefore, 16 tasks / 8 slots = 2 Map waves will be needed.
Figure: the same cluster as on the previous slide (the JobTracker as the master; TaskTracker1-TaskTracker5 as slaves behind two rack switches and a core switch).
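The same arithmetic, spelled out as a small calculation. The 64MB block size and the 2 Map slots per TaskTracker are the defaults assumed on the slide; the numbers below simply reproduce the example.

public class MapWaves {
    public static void main(String[] args) {
        long datasetMB    = 1024; // dataset size
        long splitMB      = 64;   // default HDFS block (= split) size
        int  taskTrackers = 4;    // TaskTrackers running Map tasks in the example
        int  slotsPerTT   = 2;    // default Map slots per TaskTracker

        long splits     = (datasetMB + splitMB - 1) / splitMB;     // 16 splits -> 16 Map tasks
        long totalSlots = (long) taskTrackers * slotsPerTT;        // 8 Map tasks per wave
        long waves      = (splits + totalSlots - 1) / totalSlots;  // ceil(16 / 8) = 2 waves

        System.out.println(splits + " Map tasks / " + totalSlots + " slots = " + waves + " Map waves");
    }
}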
21
Hadoop MapReduce: Summary
Programming Model: Shared-Based
Execution Model: Synchronous
Architectural Model: Master-Slave
Scheduling Model: Pull-Based
Suitable Applications: Loosely-Connected/Embarrassingly-Parallel Applications
23
Next Class: Pregel and GraphLab