Distributed Systems CS

Distributed Systems
CS 15-440
Hadoop
Lecture 15, November 07, 2018
Mohammad Hammoud

Today
Last Session: MPI
Today’s Session: Hadoop Distributed File System and MapReduce
Announcements:
  P2 grades are out
  PS4 will be out today
  P3 is due on Nov 26 by midnight

We Live in a World of Data…

What Do We Do With Big Data?
Store, share, access, process, encrypt ... and more!
We want to do all of these seamlessly.

Where to Store Big Data?
The underlying storage system is a key component for enabling Big Data querying, mining, and analytics.
Typically, the storage system “partitions” and “distributes” Big Data, using striping (or partitioning) and placement techniques.
This allows concurrent access to the data and improves fault tolerance.
[Figure: a logical file is divided into striping units (1 to 15), and a stripe spans one unit per server. The units are spread over Server 1 to Server 4 in the groups 4, 8, 12 / 1, 5, 9, 13 / 2, 6, 10, 14 / 3, 7, 11, 15.]
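The grouping in the figure is consistent with round-robin placement, which can be sketched in Python as follows. This is a minimal sketch, not the lecture's code: it assumes zero-based striping units, assumes that unit i is stored on server i mod N, and the function name stripe_round_robin is hypothetical.

```python
def stripe_round_robin(num_units, num_servers):
    """Assign striping units 0..num_units-1 to servers in round-robin order."""
    placement = [[] for _ in range(num_servers)]
    for unit in range(num_units):
        # Assumption: unit i is stored on server (i mod num_servers).
        placement[unit % num_servers].append(unit)
    return placement


if __name__ == "__main__":
    # 16 units over 4 servers, mirroring the sizes suggested by the figure.
    for server, units in enumerate(stripe_round_robin(16, 4), start=1):
        print(f"Server {server}: {units}")
```

Because consecutive units land on different servers, a sequential read of the file can be served by several servers at once.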

Example: The Google File System
GFS partitions large files into fixed-size blocks and distributes them randomly across cluster machines.
[Figure: a large file is split into Blk 0 to Blk 6 at 64 MB boundaries (0M, 64M, 128M, 192M, 256M, 320M, 384M). Copies of the blocks are spread over Server 0 (Writer), Server 1, Server 2, and Server 3.]
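The partitioning and placement described on this slide might be sketched as follows. This is an illustrative Python sketch, not GFS code: the 448 MB file size, the replica count of 3, the uniform random choice of servers, and the function names split_into_blocks and place_blocks are all assumptions made for the example.

```python
import random

BLOCK_SIZE_MB = 64  # fixed block size, matching the 64M boundaries in the figure


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return (block_id, start_mb, end_mb) for each fixed-size block of a file."""
    blocks = []
    start = 0
    while start < file_size_mb:
        end = min(start + block_size_mb, file_size_mb)
        blocks.append((len(blocks), start, end))
        start = end
    return blocks


def place_blocks(blocks, servers, replicas=3, seed=0):
    """Pick `replicas` distinct servers at random for every block.

    Assumption: replicas=3 and uniform random choice are illustrative only.
    """
    rng = random.Random(seed)
    return {block_id: rng.sample(servers, replicas) for block_id, _, _ in blocks}


if __name__ == "__main__":
    servers = ["Server 0", "Server 1", "Server 2", "Server 3"]
    blocks = split_into_blocks(448)  # hypothetical 448 MB file, giving Blk 0 to Blk 6
    for block_id, hosts in place_blocks(blocks, servers).items():
        print(f"Blk {block_id}: {hosts}")
```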

Example: The Google File System
GFS adopts a master-slave architecture.
[Figure: the GFS client sends a file name to the master and gets back a contact address; it then sends a chunk id and range to a chunk server and gets back chunk data. Each chunk server stores its chunks on a local Linux file system.]

How to Process Big Data?
One alternative: create a custom distributed system (or program) for each new algorithm. Cumbersome!
Another alternative: utilize modern distributed analytics frameworks, which:
  Relieve programmers from concerns with many of the difficult aspects of developing distributed programs
  Allow programmers to focus on ONLY the sequential parts of their programs
Examples:
  Hadoop MapReduce
  Google’s Pregel
  CMU’s Distributed GraphLab

Distributed Analytics Frameworks
Hadoop MapReduce:
  Introduction
  Programming Model
  Execution Model
  Architectural & Scheduling Models

Hadoop
Hadoop is one of the most successful realizations of large-scale “data-parallel” distributed analytics frameworks.
Hadoop MapReduce is an open source implementation of Google’s MapReduce.
Hadoop uses the Hadoop Distributed File System (HDFS) as a storage layer.
HDFS is an open source implementation of GFS.

Hadoop MapReduce: A Bird’s Eye View
Hadoop MapReduce incorporates two phases, the Map and Reduce phases, which encompass multiple Map and Reduce tasks.
[Figure: a dataset stored in HDFS is divided into Split 0 to Split 3, one per HDFS block. In the Map phase, each split is processed by a Map task that produces partitions. In the Reduce phase, each Reduce task collects its partitions from the Map tasks (shuffle stage), merges and sorts them (merge & sort stage), and runs the Reduce function (reduce stage), writing its output to HDFS.]

The Programming Model
Hadoop MapReduce employs a shared-based programming model, which entails that:
  Tasks can interact (if needed) by reading from and writing to a shared space
  HDFS provides the shared space for all Map and Reduce tasks
  Programmers write only sequential code, without defining functions that send/receive messages between tasks
[Figure: Map tasks MT1 to MT6 and Reduce tasks RT1 to RT3 sit between two shared address spaces provided by HDFS. Communication between them is “implicit”: it is provided by the MapReduce engine, and programmers do not write or call any communication routines.]

Example: Word Count
A text file:
  Mohammad is delivering a lecture to the 15-440 class
  The course name of 15-440 is Distributed Systems
[Figure: the file is divided into two chunks, one per sentence. Each chunk is read as (Key1, Value1) records, e.g., (20, "delivering a"), (38, "lecture to the"), and (60, "15-440 class") in the first chunk, and (17, "name of 15-440"), (40, "is Distributed"), and (58, "Systems") in the second. A Map function parses each record and counts its words, emitting (Key2, Value2) pairs such as (Mohammad, 1) and (The, 1). A Reduce function iterates over the values of each key and sums them, producing pairs such as (Mohammad, 1) and (is, 2).]
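The word-count flow above can be imitated in plain Python. This is a single-process sketch that only mimics the roles of the Map and Reduce functions; it does not use the Hadoop API, the names map_fn, reduce_fn, and word_count are hypothetical, and each chunk is simplified to a single record with key 0.

```python
from collections import defaultdict


def map_fn(key, value):
    """Parse & count: emit (word, 1) for every word in the record."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Iterate & sum: add up all counts emitted for one word."""
    return key, sum(values)


def word_count(chunks):
    # Map phase: one Map call per (offset, text) record of each chunk.
    intermediate = defaultdict(list)
    for chunk in chunks:
        for offset, text in chunk:
            for word, count in map_fn(offset, text):
                # Grouping by word stands in for the shuffle and merge stages.
                intermediate[word].append(count)
    # Reduce phase: one Reduce call per distinct word.
    return dict(reduce_fn(word, counts) for word, counts in intermediate.items())


if __name__ == "__main__":
    chunks = [
        [(0, "Mohammad is delivering a lecture to the 15-440 class")],
        [(0, "The course name of 15-440 is Distributed Systems")],
    ]
    for word, total in sorted(word_count(chunks).items()):
        print(word, total)
```

The programmer supplies only the two sequential functions; grouping the intermediate pairs by key is the part that the MapReduce engine performs on the programmer's behalf.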

The Execution Model
Hadoop MapReduce adopts a synchronous execution model.
A distributed program (or system) is said to be synchronous if and only if its constituent tasks operate in a lock-step mode:
  No two tasks can run concurrently under two different iterations
In MapReduce:
  Each iteration is treated as a MapReduce job
  A job can encompass 1 or many Map tasks and 0 or many Reduce tasks
  Programs with multiple iterations (i.e., iterative programs) are executed using multiple chained MapReduce jobs
  When all Reduce tasks within job i are committed, a new job i + 1 is started (if any)
  Hence, two different tasks cannot run in parallel under two different jobs (or iterations)
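Job chaining can be illustrated with a small in-memory sketch. The Python below is an assumption-laden toy, not Hadoop code: run_job and run_chain are hypothetical helpers, and the two example jobs (count words, then count how many words share each count) are made up for illustration. The point it shows is that job i + 1 starts only after job i has returned its complete output.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run one in-memory MapReduce job and return its full output."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # The job is finished only once every key has been reduced.
    return [(key, reduce_fn(key, values)) for key, values in sorted(groups.items())]


def run_chain(records, jobs):
    """Run jobs one after another; each job consumes the previous job's output."""
    for map_fn, reduce_fn in jobs:
        # Lock-step: the next job cannot start before this call returns.
        records = run_job(records, map_fn, reduce_fn)
    return records


def total(key, values):
    return sum(values)


def emit_words(offset, line):
    return [(word, 1) for word in line.split()]


def emit_counts(word, count):
    return [(count, 1)]


if __name__ == "__main__":
    lines = [(0, "a b a"), (1, "b c")]
    # Job 1 counts words; job 2 counts how many words share each count.
    print(run_chain(lines, [(emit_words, total), (emit_counts, total)]))
```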

The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
A pull-based task scheduling strategy is used, whereby:
  Map tasks are scheduled in proximity of HDFS blocks
  Reduce tasks are scheduled anywhere
[Figure: a core switch connects Rack Switch 1 and Rack Switch 2. The master runs the JobTracker and the slaves run TaskTracker1 to TaskTracker5. A TaskTracker requests a Map task, and the JobTracker schedules a Map task (MT1, MT2, MT3) at an empty Map slot on TaskTracker1.]
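The pull-based strategy might be sketched as follows. This toy Python class is not the Hadoop JobTracker: the class and method names are hypothetical, locality is reduced to "is a replica of the block on the requesting node", and rack-level proximity is ignored.

```python
class JobTracker:
    """Toy master that hands out Map tasks when a TaskTracker asks for one."""

    def __init__(self, block_locations):
        # Map task id -> set of TaskTracker names that hold the task's HDFS block.
        self.pending = dict(block_locations)

    def request_map_task(self, tracker):
        """Called by a TaskTracker that has an empty Map slot (the 'pull')."""
        if not self.pending:
            return None
        # Prefer a task whose input block is stored on the requesting node.
        local = [task for task, hosts in self.pending.items() if tracker in hosts]
        # Assumption: fall back to the oldest pending task if none is local.
        task = local[0] if local else next(iter(self.pending))
        del self.pending[task]
        return task


if __name__ == "__main__":
    tracker = JobTracker({
        "MT1": {"TaskTracker1"},
        "MT2": {"TaskTracker2"},
        "MT3": {"TaskTracker2"},
    })
    for name in ["TaskTracker2", "TaskTracker1", "TaskTracker1"]:
        print(name, "runs", tracker.request_map_task(name))
```

The master never pushes work: a task is assigned only in response to a request from a node that has a free slot.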

The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
With the above setup, how many Map tasks can run in parallel?
  Each TaskTracker has by default two Map slots, and thus can run two Map tasks concurrently
  With 4 TaskTrackers and 2 Map slots on each TaskTracker, 4 x 2 = 8 Map tasks can be executed in parallel
  The maximum number of Map tasks that can run in parallel is referred to as a Map wave
[Figure: the same cluster diagram as on the previous slide.]

The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
For a dataset with a size of 1024MB, how many Map waves are needed?
  The size of each HDFS block is by default 64MB, and each split encompasses by default 1 HDFS block
  Hence, there will be a total of 1024/64 = 16 HDFS blocks, or 16 splits
  The input to each Map task is a single split, thus there will be a total of 16 Map tasks
  Therefore, 16 tasks / 8 slots = 2 Map waves will be needed
[Figure: the same cluster diagram as on the previous slide.]
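The arithmetic on this slide can be checked with a few lines of Python. The function name map_waves and its parameter names are hypothetical; the default values are the ones quoted in the lecture, and rounding up for sizes that do not divide evenly is an assumption added here.

```python
import math


def map_waves(dataset_mb, block_mb=64, task_trackers=4, map_slots_per_tracker=2):
    """Number of Map waves needed to process a dataset of the given size."""
    num_splits = math.ceil(dataset_mb / block_mb)             # one split per HDFS block
    num_map_tasks = num_splits                                # one Map task per split
    parallel_slots = task_trackers * map_slots_per_tracker    # tasks per wave
    return math.ceil(num_map_tasks / parallel_slots)


if __name__ == "__main__":
    print(map_waves(1024))  # 16 Map tasks over 8 slots
```

With the lecture's numbers the result is 2; a dataset of 1025MB would need a 17th Map task and therefore a third wave.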

Hadoop MapReduce: Summary
Aspect                 Hadoop MapReduce
Programming Model      Shared-Based
Execution Model        Synchronous
Architectural Model    Master-Slave
Scheduling Model       Pull-Based
Suitable Applications  Loosely-Connected/Embarrassingly-Parallel Applications

Next Class
Pregel and GraphLab