1
Distributed Systems CS 15-440
Hadoop Lecture 13, October 25, 2017 Mohammad Hammoud
2
Today
Last Session: MPI (Concluded)
Today's Session: Hadoop Distributed File System and MapReduce
Announcements:
P2 grades are out
PS4 is out. It is due on Nov 1st by midnight
P3 is due on Nov 12th by midnight
3
We Live in a World of Data…
4
What Do We Do With Big Data?
Store, share, access, process, encrypt ... and more! We want to do all of these seamlessly...
5
Where to Store Big Data?
The underlying storage system is a key component for enabling Big Data querying/mining/analytics.
Typically, the storage system "partitions" and "distributes" Big Data using striping (or partitioning) and placement techniques.
This allows for concurrent accesses to data and improves fault tolerance.
Figure: a logical file is divided into fixed-size striping units (1-15, one per stripe), which are placed round-robin across Server 1-Server 4.
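As a concrete illustration of the round-robin placement in the figure, here is a minimal sketch; the class and method names are made up for the example and are not tied to any particular storage system.

public class RoundRobinStriping {

    // Returns the index of the server that stores a given striping unit.
    static int serverFor(int unitIndex, int numServers) {
        return unitIndex % numServers;
    }

    public static void main(String[] args) {
        int numUnits = 15;   // striping units 1..15, as in the figure
        int numServers = 4;  // Server 1..Server 4

        for (int unit = 1; unit <= numUnits; unit++) {
            // unit - 1 so that unit 1 lands on Server 1, unit 5 wraps back to Server 1, etc.
            int server = serverFor(unit - 1, numServers) + 1;
            System.out.println("Striping unit " + unit + " -> Server " + server);
        }
    }
}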
6
Example: The Google File System
GFS partitions large files into fixed-size blocks and distributes them randomly across cluster machines.
Figure: a large file (0M-384M) is divided at the writer (Server 0) into 64MB blocks Blk 0-Blk 6, and copies of the blocks are spread across Servers 1-3.
7
Example: The Google File System
GFS adopts a master-slave architecture.
Figure: a GFS client sends a file name to the master and receives back a contact address; it then sends a chunk id and byte range to the corresponding chunk server, which returns the chunk data. Each chunk server stores its chunks on a local Linux file system.
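The interaction in the figure can be sketched as follows. Master, ChunkServer, and GfsClientSketch are hypothetical interfaces used purely for illustration; they are not the real GFS (or HDFS) client API.

// Conceptual sketch of the GFS read path: metadata from the master,
// data directly from a chunk server.
interface Master {
    // Given a file name and chunk index, return the address of a chunk server.
    String lookupChunkServer(String fileName, long chunkIndex);
}

interface ChunkServer {
    // Read `length` bytes starting at `offset` inside chunk `chunkId`.
    byte[] readChunk(String chunkId, long offset, int length);
}

class GfsClientSketch {
    private final Master master;
    private final java.util.Map<String, ChunkServer> servers;

    GfsClientSketch(Master master, java.util.Map<String, ChunkServer> servers) {
        this.master = master;
        this.servers = servers;
    }

    // Contact the master once for metadata, then fetch the bytes from the chunk server.
    byte[] read(String fileName, long chunkIndex, long offset, int length) {
        String address = master.lookupChunkServer(fileName, chunkIndex); // contact address from master
        ChunkServer server = servers.get(address);                        // talk to the chunk server directly
        String chunkId = fileName + "#" + chunkIndex;                     // illustrative chunk id
        return server.readChunk(chunkId, offset, length);                 // chunk data
    }
}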
8
How to Process Big Data?
One alternative: create a custom distributed system (or program) for each new algorithm. Cumbersome!
Another alternative: utilize modern distributed analytics frameworks, which:
Relieve programmers from many of the difficult aspects of developing distributed programs
Allow programmers to focus ONLY on the sequential parts of their programs
Examples: Hadoop MapReduce, Google's Pregel, CMU's Distributed GraphLab
9
Distributed Analytics Frameworks
Hadoop MapReduce
Introduction
Programming Model
Execution Model
Architectural & Scheduling Models
10
Hadoop
Hadoop is one of the most successful realizations of large-scale "data-parallel" distributed analytics frameworks.
Hadoop MapReduce is an open-source implementation of Google's MapReduce.
Hadoop uses the Hadoop Distributed File System (HDFS) as a storage layer.
HDFS is an open-source implementation of GFS.
11
Hadoop MapReduce: A Bird’s Eye View
Hadoop MapReduce incorporates two phases, a Map phase and a Reduce phase, which encompass multiple Map and Reduce tasks.
Figure: each HDFS block of the input dataset forms a split that is consumed by one Map task; each Map task writes its output as partitions; the Shuffle stage routes the partitions to the Reduce tasks, which merge and sort them in the Merge & Sort stage and write their results back to HDFS in the Reduce stage.
12
Distributed Analytics Frameworks
Hadoop MapReduce
Introduction
Programming Model
Execution Model
Architectural & Scheduling Models
13
The Programming Model
Hadoop MapReduce employs a shared-based programming model, which entails that:
Tasks can interact (if needed) via reading and writing to a shared space
HDFS provides the shared space for all Map and Reduce tasks
Programmers write only sequential code, without defining functions that send/receive messages between tasks
Figure: Map tasks (MT1-MT6) and Reduce tasks (RT1-RT3) communicate "implicitly" through a shared address space provided by HDFS; communication is provided by the MapReduce engine, so programmers do not write or call any communication routines.
14
Example: Word Count
Figure: a text file is split into chunks (e.g., "Mohammad is delivering a lecture to the 15-440 class" and "The course name of 15-440 is Distributed Systems"). Each chunk is fed to a Map function as (Key1, Value1) pairs of the form (byte offset, line of text). The Map function parses its chunk and counts words, emitting (Key2, Value2) pairs of the form (word, 1). A Reduce function then iterates over the counts of each word and sums them, so a word that appears in both chunks, such as "is", ends up with the count 2.
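For reference, the word-count example above maps onto the Hadoop MapReduce Java API roughly as in the standard WordCount program below: the Mapper does the Parse & Count step and the Reducer does the Iterate & Sum step. Input and output paths are taken from the command line; the combiner and other optimizations are omitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: parse each input line and emit (word, 1) for every word (Parse & Count).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: iterate over all counts of a word and sum them (Iterate & Sum).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}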
15
Distributed Analytics Frameworks
Hadoop MapReduce
Introduction
Programming Model
Execution Model
Architectural & Scheduling Models
16
The Execution Model
Hadoop MapReduce adopts a synchronous execution model.
A distributed program (or system) is said to be synchronous if and only if its constituent tasks operate in lock-step mode; no two tasks can run concurrently under two different iterations.
In MapReduce:
Each iteration is treated as a MapReduce job
A job can encompass 1 or many Map tasks and 0 or many Reduce tasks
Programs with multiple iterations (i.e., iterative programs) are executed using multiple chained MapReduce jobs (see the driver sketch below)
When all Reduce tasks within job i are committed, a new job i + 1 is started (if any)
Hence, two different tasks cannot run in parallel under two different jobs (or iterations)
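A minimal driver sketch of how chained jobs realize an iterative program with the Hadoop Job API. The identity Mapper/Reducer classes and the path naming are placeholders standing in for an actual algorithm's logic and directory layout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    int iterations = Integer.parseInt(args[0]);
    String input = args[1];

    for (int i = 0; i < iterations; i++) {
      Job job = Job.getInstance(new Configuration(), "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(Mapper.class);    // identity placeholder: the algorithm's map logic goes here
      job.setReducerClass(Reducer.class);  // identity placeholder: the algorithm's reduce logic goes here
      FileInputFormat.addInputPath(job, new Path(input));
      Path output = new Path(input + "-iter" + (i + 1));
      FileOutputFormat.setOutputPath(job, output);

      // Job i+1 starts only after every Reduce task of job i has committed:
      // waitForCompletion blocks until the whole job finishes (or fails).
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
      input = output.toString(); // output of job i becomes the input of job i+1
    }
  }
}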
17
Distributed Analytics Frameworks
Hadoop MapReduce
Introduction
Programming Model
Execution Model
Architectural & Scheduling Models
18
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
A pull-based task scheduling strategy is used, whereby:
Map tasks are scheduled in proximity to the HDFS blocks that hold their input
Reduce tasks are scheduled anywhere
Figure: the JobTracker (the master) and TaskTracker1-TaskTracker5 (the slaves) are connected through two rack switches and a core switch; a TaskTracker requests a Map task from the JobTracker, and the JobTracker schedules a Map task at an empty Map slot on TaskTracker1.
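A rough sketch of the pull-based, locality-aware decision described above. The types and method names here are invented for illustration and are not the actual JobTracker code.

import java.util.List;
import java.util.Optional;
import java.util.Set;

class SchedulerSketch {

    // A pending Map task knows which nodes hold replicas of its input split.
    record MapTask(int id, Set<String> nodesWithInputBlock) {}

    // Called when a TaskTracker with a free Map slot sends a heartbeat ("pulls" work):
    // prefer a task whose input HDFS block is local to that TaskTracker,
    // otherwise hand out any pending task.
    static Optional<MapTask> assignMapTask(String taskTracker, List<MapTask> pending) {
        Optional<MapTask> local = pending.stream()
                .filter(t -> t.nodesWithInputBlock().contains(taskTracker))
                .findFirst();
        Optional<MapTask> chosen = local.or(() -> pending.stream().findFirst());
        chosen.ifPresent(pending::remove); // the chosen task is no longer pending
        return chosen;
    }
}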
19
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
With the above setup, how many Map tasks can run in parallel?
Each TaskTracker has by default two Map slots, and thus can run two Map tasks concurrently.
With 4 TaskTrackers and 2 Map slots on each TaskTracker, 8 Map tasks can be executed in parallel.
The maximum number of Map tasks that can run in parallel is denoted as a Map wave.
Figure: the same cluster as on the previous slide (the JobTracker as the master; TaskTracker1-TaskTracker5 as slaves behind two rack switches and a core switch).
20
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
For a dataset with a size of 1024MB, how many Map waves are needed?
The size of each HDFS block is by default 64MB, and each split encompasses by default 1 HDFS block.
Hence, there will be a total of 1024/64 = 16 HDFS blocks, or 16 splits.
The input to each Map task is a single split, so there will be a total of 16 Map tasks.
Therefore, 16 tasks / 8 slots = 2 Map waves will be needed.
Figure: the same cluster as on the previous slide (the JobTracker as the master; TaskTracker1-TaskTracker5 as slaves behind two rack switches and a core switch).
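The same arithmetic, spelled out as a small calculation. The 64MB block size and the 2 Map slots per TaskTracker are the defaults assumed on the slide; the numbers below simply reproduce the example.

public class MapWaves {
    public static void main(String[] args) {
        long datasetMB    = 1024; // dataset size
        long splitMB      = 64;   // default HDFS block (= split) size
        int  taskTrackers = 4;    // TaskTrackers running Map tasks in the example
        int  slotsPerTT   = 2;    // default Map slots per TaskTracker

        long splits     = (datasetMB + splitMB - 1) / splitMB;     // 16 splits -> 16 Map tasks
        long totalSlots = (long) taskTrackers * slotsPerTT;        // 8 Map tasks per wave
        long waves      = (splits + totalSlots - 1) / totalSlots;  // ceil(16 / 8) = 2 waves

        System.out.println(splits + " Map tasks / " + totalSlots + " slots = " + waves + " Map waves");
    }
}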
21
Hadoop MapReduce: Summary
Programming Model: Shared-Based
Execution Model: Synchronous
Architectural Model: Master-Slave
Scheduling Model: Pull-Based
Suitable Applications: Loosely-Connected/Embarrassingly-Parallel Applications
23
Next Class: Pregel and GraphLab