湖南大学-信息科学与工程学院-计算机与科学系

Slides:

Advertisements

Similar presentations

Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)

Advertisements

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.

Mapreduce and Hadoop Introduce Mapreduce and Hadoop

MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.

Distributed Computations

CS 345A Data Mining MapReduce. Single-node architecture Memory Disk CPU Machine Learning, Statistics “Classical” Data Mining.

MapReduce Simplified Data Processing on Large Clusters Google, Inc. Presented by Prasad Raghavendra.

CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 3: Mapreduce and Hadoop All slides © IG.

Distributed Computations MapReduce

L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.

Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.

MapReduce ： Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang

Google Distributed System and Hadoop Lakshmi Thyagarajan.

Hadoop & Cheetah. Key words Cluster  data center – Lots of machines thousands Node  a server in a data center – Commodity device fails very easily Slot.

Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.

SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.

Lecture 3-1 Computer Science 425 Distributed Systems CS 425 / CSE 424 / ECE 428 Fall 2010 Indranil Gupta (Indy) August 31, 2010 Lecture 3  2010, I. Gupta.

Introduction to Parallel Programming MapReduce Except where otherwise noted all portions of this work are Copyright (c) 2007 Google and are licensed under.

MapReduce. Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture emerging: – Cluster of.

Google MapReduce Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Presented by Conroy Whitney 4 th year CS – Web Development.

Jeffrey D. Ullman Stanford University. 2 Chunking Replication Distribution on Racks.

Süleyman Fatih GİRİŞ CONTENT 1. Introduction 2. Programming Model 2.1 Example 2.2 More Examples 3. Implementation 3.1 ExecutionOverview 3.2.

Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)

Parallel Programming Models Basic question: what is the “right” way to write parallel programs –And deal with the complexity of finding parallelism, coarsening.

MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.

1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.

MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.

f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read

Map Reduce: Simplified Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Google, Inc. OSDI ’04: 6 th Symposium on Operating Systems Design.

MAP REDUCE : SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS Presented by: Simarpreet Gill.

MapReduce How to painlessly process terabytes of data.

MapReduce M/R slides adapted from those of Jeff Dean’s.

MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.

Introduction to Search Engines Technology CS Technion, Winter 2013 Amit Gross Some slides are courtesy of: Edward Bortnikov & Ronny Lempel, Yahoo!

MapReduce: Simplified Data Processing on Large Clusters Lim JunSeok.

MapReduce ： Simplified Data Processing on Large Clusters P 謝光昱 P 陳志豪 Operating Systems Design and Implementation 2004 Jeffrey Dean, Sanjay.

MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.

MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.

Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.

Lecture 4. MapReduce Instructor: Weidong Shi (Larry), PhD

Lecture 4: Mapreduce and Hadoop

Hadoop Aakash Kag What Why How 1.

CS 525 Advanced Distributed Systems Spring 2017

Chapter 10 Data Analytics for IoT

Large-scale file systems and Map-Reduce

The Map-Reduce Framework

Hadoop MapReduce Framework

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.

Lecture 4: Mapreduce and Hadoop

Software Engineering Introduction to Apache Hadoop Map Reduce

MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner

Distributed Systems CS

CS 525 Advanced Distributed Systems Spring 2018

MapReduce Simplied Data Processing on Large Clusters

February 26th – Map/Reduce

Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA

Cse 344 May 4th – Map/Reduce.

CS110: Discussion about Spark

Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)

CS 345A Data Mining MapReduce This presentation has been altered.

Lecture 4: Mapreduce and Hadoop

Distributed Systems CS

Charles Tappert Seidenberg School of CSIS, Pace University

Introduction to MapReduce

5/7/2019 Map Reduce Map reduce.

COS 518: Distributed Systems Lecture 11 Mike Freedman

MapReduce: Simplified Data Processing on Large Clusters

Lecture 29: Distributed Systems

Presentation transcript:

湖南大学-信息科学与工程学院-计算机与科学系云计算技术陈果副教授湖南大学-信息科学与工程学院-计算机与科学系邮箱：guochen@hnu.edu.cn 个人主页：1989chenguo.github.io https://1989chenguo.github.io/Courses/CloudComputing2018Spring.html

What we have learned What is cloud computing Cloud Networking Cloud Distributed System Introduction to Cloud Distributed System What is cloud distributed system What we’ll look at in this part

MapReduce Part #2: Cloud Distributed System Most materials from UIUC MOOC Thanks Indranil Gupta

Outline Paradigm Examples Scheduling Fault-tolerance

Outline Paradigm Examples Scheduling Fault-tolerance

What is MapReduce? “MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.” Terms are borrowed from functional language (e.g., Lisp) Example: sum of squares Map: square (1,2,3,4) Output: (12,22,32,42) = (1,4,9,16) Reduce: sum (1,4,9,16) Output: (1+4+9+16) = 30 Let’s consider a sample application: WordCount You’re given a huge dataset and asked to list the count for each of the words in all the documents Processes each record independently Processes (set of) all records in batches

WordCount: Map Process individual records to generate intermediate key/value pairs

WordCount: Map Process individual records to generate intermediate key/value pairs in parallel Map task 1 Map task 2

WordCount: Map Process a large number of individual records to generate intermediate key/value pairs in parallel

WordCount: Reduce Processes and merges all intermediate values associated per key

WordCount: Reduce Each key assigned to one Reduce Parallelly processes and merges all intermediate values by partitioning keys Typically: hash partitioning, i.e., key is assigned to reduce # = hash(key)%number_of_reduce_servers

Hadoop code - Map

Hadoop code - Reduce

Hadoop code - Driver

Outline Paradigm Examples Scheduling Fault-tolerance

Some Applications of MapReduce #1 Distributed Grep Input: large set of files Output: lines that match pattern Map Emits a line if it matches the supplied pattern Reduce Copies the intermediate data to output

Some Applications of MapReduce #2 Reverse Web-Link Graph Input: Web graph: tuples (a, b) where (page a --> page b) Output For each page, list of pages that link to it Map For each input <source, target>, output <target, source> Reduce Emits <target, list(source)>

Some Applications of MapReduce #3 Count URL access frequence Input: Log of accessed URLs, e.g., from proxy server Output For each URL, % of total accesses for that URL Map: Output <URL, 1> Multiple reducers: Emits <URL, URL_count> Chain another MapReduce jog after above one Map: Process <URL, URL_count> and output <1, URL_count> One reducer: Sum up <URL_count> to calculate overall _count Now we got URL count, but still need % Get % by URL_count/overall_count

Outline Paradigm Examples Scheduling Fault-tolerance

Programming MapReduce Externally: For user Write a Map program, write a Reduce program Submit job; wait for result Need to know nothing about parallel/distributed programming! Internally: For the Paradigm and Scheduler Parallelize Map Transfer data from Map to Reduce Parallelize Reduce Implement storage for Map input/output, Reduce input/output

Inside MapReduce Parallelize Map Transfer data from Map to Reduce Easy! Each Map task is independent Transfer data from Map to Reduce All Map output records with same key assigned to same Reduce Partitioning function, e.g., hash(key)%number_of_reducers Parallelize Reduce Easy! Each Reduce task is independent Implement storage for Map input/output, Reduce input/output Map input: From distributed file system (GFS, HDFS, etc.) Map output: To local file system Reduce input: From remote file system Reduce output: To distributed file system

Assign Maps and Reduces to servers Inside MapReduce Parallelize Map Parallelize Reduce Scheduling: Assign Maps and Reduces to servers Transfer data from Map to Reduce

What kind of algorithms used in scheduling? The YARN Scheduler Used in Hadoop 2.x + YARN = Yet Another Resource Negotiator Treats each server as a collection of containers Container = some CPU + some memory Has 3 main components Global Resource Manager (RM) Scheduling Per-server Node Manager (NM) Daemon and server-specific functions Per-application (job) Application Master (AM) Container negotiation with RM and NMs Detecting task failures of that job RM Scheduler Server A NM AM 1 … Server B AM 2 Task (Container) 1. Need container 3. Container on Server B 2. Container completed What kind of algorithms used in scheduling? 4. Start task, please

Why Scheduling? Multiple “tasks” to schedule The processes on a single-core OS The tasks of a Hadoop job The tasks of multiple Hadoop jobs Limited resources that these tasks require Processor(s) Memory (Less contentious) disk, network Scheduling goals Good throughput or response time for tasks (or jobs) High utilization of resources

STF (Shortest Task First) is optimal! A special case of priority scheduling

Outline Paradigm Examples Scheduling Fault-tolerance

Fault-tolerance … Server failure RM failure Scheduler Server A NM AM 1 … Server B AM 2 Task (Container) Server failure NM heartbeats to RM If fails, RM notifies all affected AMs NM monitors tasks running at its server If fails, mark the task and restart it AM heartbeats to RM If fails, RM restarts AM, which then syncs up with its running tasks RM failure Use old checkpoints and bring up secondary RM Heartbeats also used to piggyback container requests Avoids extra messages

Slow Tasks (Stragglers) The slowest machine slows the entire job down Due to bad disk, network bandwidth, CPU, memory, etc. Keep track of “progress” of each task Perform backup (replicated) execution of straggler task

Locality Cloud networking has hierarchical topology Racks, pods, cluster, datacenter, … GFS/HDFS stores 3 replicas for each chunk Maybe on different racks MapReduce tries to schedule a Map task using the following priority: Same machine containing the input data, if fail Same rack containing the input data, if fail Anywhere

MapReduce Summary Paradigm Examples Scheduling Fault-tolerance Parallelization + Aggregation to deal with big data processing Examples Word count, Dsitributed Grep, URL count, … Scheduling For job: Maximize throughput, minimize latency For cloud: Maximize utilization Fault-tolerance Need to be very careful about various failures

湖南大学-信息科学与工程学院-计算机与科学系 Thanks! 陈果副教授湖南大学-信息科学与工程学院-计算机与科学系邮箱：guochen@hnu.edu.cn 个人主页：1989chenguo.github.io https://1989chenguo.github.io/Courses/CloudComputing2018Spring.html