CS-4513 Distributed Computing Systems
Hugh C. Lauer


Map-Reduce

Assumptions:
- Graduate-level operating systems background
- Making choices about operating systems

Why a micro-century? ...just about enough time for one concept.

MapReduce
- A programming model and implementation for processing very large data sets: many terabytes, on clusters of distributed computers
- Supports a broad variety of real-world tasks
- The foundation of Google's applications

Why MapReduce
- An important new model for distributed and parallel computing
- Fundamentally different from traditional models of parallelism: data parallelism, task parallelism, and pipelined parallelism
- An abstraction that automates the mechanics of data handling and lets the programmer concentrate on the semantics of the problem

Last Year in CS-4513
- Divided the class into four teams, each to research and teach one aspect:
  - The abstraction itself and its algorithms
  - Distributed MapReduce
  - The class of problems that MapReduce can help solve
  - The Google File System that supports MapReduce
- Today's material is drawn from those presentations

Google Cluster
- Thousands of PC-class systems: dual-processor x86, 4-8 GB RAM, commodity disks
- High-speed interconnect: 100-1000 Mb/sec
- A distributed, replicated file system, optimized for GByte-size files, reading, and appending
- Non-negligible failure rates

Typical Applications
- Search terabytes of data for words or phrases
- Compute PageRank among pages
- Conceptually simple, yet devilishly difficult to implement in a distributed environment

Basic Abstraction
- Partition the application into two functions, Map and Reduce, both written by the programmer
- Let the system partition execution among distributed platforms: scheduling, communication, synchronization, fault tolerance, reliability, etc.
- As of January 2008:
  - 10,000 separate MapReduce programs had been developed within Google
  - 100,000 MapReduce jobs run per day
  - 20 Petabytes of data processed per day

Map and Reduce
- Map (written by the programmer): takes input key-value pairs and generates a set of intermediate key-value pairs
- System: organizes the intermediate pairs by key
- Reduce (written by the programmer): processes or merges all values for a given key; the system iterates through all keys

Example: Count Occurrences of Words in a Collection of Documents

Pseudo-code:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

Note: the key is not used in this simple application.

Example: Count Occurrences of Words (continued)

Pseudo-code:

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
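
To make the data flow concrete, here is a minimal single-machine sketch in C++ (not Google's MapReduce library; the function names and the in-memory "shuffle" are illustrative) that runs the same map and reduce logic over two small documents:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Intermediate store: word -> list of counts. The sorted std::map
    // stands in for the shuffle/sort phase between the two real phases.
    static std::map<std::string, std::vector<int>> intermediate;

    // Map: for each word in the document contents, emit (word, 1).
    // The document name (key) is unused, as in the pseudo-code above.
    void MapFn(const std::string& key, const std::string& value) {
      std::istringstream words(value);
      std::string w;
      while (words >> w) intermediate[w].push_back(1);
    }

    // Reduce: sum all the counts emitted for one word.
    int ReduceFn(const std::string& key, const std::vector<int>& values) {
      int result = 0;
      for (int v : values) result += v;
      return result;
    }

    int main() {
      MapFn("doc1", "the quick brown fox");
      MapFn("doc2", "the lazy dog and the fox");
      for (const auto& [word, counts] : intermediate)
        std::cout << word << " " << ReduceFn(word, counts) << "\n";
    }

Running it prints each word with its total count (e.g., "the 3", "fox 2").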

Example: Count Occurrences of Words (continued)

The MapReduce specification:
- Names of input and output files
- Tuning parameters
- Expressed as a C++ main() function
- Linked with the MapReduce library
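
The OSDI paper's appendix gives such a driver for the word-frequency application; a loose sketch of its shape follows (the class and method names here approximate the appendix's style and are illustrative, not a verbatim API):

    // Illustrative driver shape only; not a compilable program.
    int main(int argc, char** argv) {
      MapReduceSpecification spec;
      for (int i = 1; i < argc; i++) {
        MapReduceInput* input = spec.add_input();
        input->set_filepattern(argv[i]);        // names of input files
        input->set_mapper_class("WordCounter");
      }
      MapReduceOutput* out = spec.output();
      out->set_filebase("/gfs/test/freq");      // names of output files
      out->set_reducer_class("Adder");
      spec.set_machines(2000);                  // a tuning parameter
      MapReduceResult result;
      if (!MapReduce(spec, &result)) abort();   // blocks until the job is done
      return 0;
    }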

Full C++ Text of the Word Frequency Application
- Approximately 70 lines of C++ code
- Dean, Jeffrey, and Ghemawat, Sanjay, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of Operating Systems Design and Implementation (OSDI), San Francisco, CA, 2004, pp. 137-150.
- Note: this paper is an earlier version of the CACM paper distributed by e-mail to the class; it contains some details not included in the CACM paper.

Other Examples
- Distributed grep: the key is the pattern to search for; the values are the lines to search
- Count of URL access frequency: similar to word count, computed from web access logs
- Reverse web-link graph: obtain the list of sources that link to a target URL
- Large-scale indexing: Google's production search service
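
In the same pseudo-code style as the word-count example, the reverse web-link graph might be sketched as follows (the link-extraction step is left abstract):

    map(String key, String value):
      // key: source page URL
      // value: page contents
      for each link target t in value:
        EmitIntermediate(t, key);

    reduce(String key, Iterator values):
      // key: a target URL
      // values: all source URLs that link to it
      Emit(AsList(values));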

What It Does
- Map: (k1, v1) → list(k2, v2)
- Reduce: (k2, list(v2)) → list(v2)
- The MapReduce library:
  - Converts the input arguments to many (k1, v1) pairs and calls Map for each pair
  - Reorganizes the intermediate lists produced by Map
  - Calls Reduce for each intermediate key k2
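
For the word-count example above, these types instantiate as:

    Map:    (document name, document contents) → list((word, "1"))
    Reduce: (word, list of counts) → the total count for that word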

Brute-Force Implementation
The following steps walk through the execution:

Step 0: Split the input files into pieces of 16-64 MBytes each.

Step 1: Fork the user program into many distributed processes, scattered across the cluster; one is designated as the Master.

Step 2: The Master assigns tasks to workers, manages the results, and monitors behavior and faults.

Step 3: Map workers read their input splits via GFS, parse out key-value pairs, pass each pair to the Map function, and buffer the output in local memory.

Step 4: The buffered intermediate output is written to files on local disk, and the workers notify the Master of their locations.
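
The intermediate output is partitioned into R regions, one per Reduce task; the paper's default partitioning function is a hash of the key. A one-function sketch (std::hash stands in for whatever hash the real library used):

    #include <functional>
    #include <string>

    // Default partitioning per the OSDI '04 paper: hash(key) mod R.
    int Partition(const std::string& key, int num_reduce_tasks) {
      return static_cast<int>(std::hash<std::string>{}(key) % num_reduce_tasks);
    }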

Step 5: Reduce workers read the intermediate data (streaming) and sort it by intermediate key.

Step 6: For each key and its list of values, the Reduce function is called; each Reduce worker writes its output file and notifies the Master.

Result
- One output file for each Reduce worker
- Combined by the application program, passed to another MapReduce call, or handed to another distributed application

Questions?

This presentation is stored at //www.cs.wpi.edu/~lauer/MapReduce--D-Term-09.ppt

Distributed System Issues
- Fault tolerance
- Distributed file access
- Scalable performance

Managing Faults and Failures
- In a cluster of 1800 nodes, there will always be a handful of failures
- Question: with 1800 hard drives, each with an MTBF of 100,000 hours, what is the mean time between drive failures somewhere in the cluster?
- Some processors may be "slow" (called stragglers), due to:
  - Intermittent memory or bus errors
  - Recoverable disk or network errors
  - Over-scheduling by the system
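
A worked answer to the question above, assuming independent failures: the cluster sees failures roughly 1800 times as often as a single drive does, so

    MTBF(cluster) ≈ 100,000 hours / 1800 drives ≈ 56 hours

i.e., expect a drive to fail somewhere in the cluster about every 2.3 days.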

Managing Faults and Failures (continued)
- The Master task periodically pings the worker tasks
- If a worker does not respond, the Master starts a new worker task with the same responsibility; the new worker reads its data from a different replica
- The Master also starts backup tasks for stragglers, just in case!
  - Whichever task finishes first "wins"; the other task(s) are shut down
- Performance penalty for backup tasks: a few percent loss in system resources, for an enormous improvement in response time
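
A schematic of this failure-handling loop, purely for illustration (the real master's logic is not published at this level of detail):

    // Illustration only: the master's ping-and-recover loop.
    for each worker w:
      if ping(w) times out:
        mark w's in-progress tasks (and its completed map tasks,
          whose output lived on w's local disk) for re-execution
        reschedule them on healthy workers, reading a different input replica
    near the end of the job:
      for each still-running ("straggler") task t:
        schedule a backup copy of t   // first finisher wins; the other is killed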

Questions?

Google File System Assumptions
- System failures are the norm
- The system stores mostly large (multi-gigabyte) files
- Expected read operations:
  - Large streaming accesses (> 1 MByte per access)
  - Few random accesses (a few KB out of someplace random)
- Expected write operations:
  - Long appending writes
  - Multiple clients appending concurrently
  - Updates in place to the middle of a file are extremely rare ... and expensive
- Bandwidth trumps latency

Google File System (continued)
- One Master server per cluster
- Many chunk servers in each cluster
- Clients

Google File System (continued)
- Files are partitioned into 64-MByte chunks
- Each chunk is replicated across chunk servers; a chunk server stores its chunks as traditional Linux files on a node of the cluster
- At least three replicas per chunk, on different servers
- No caching of file data (not useful in streaming!)
- Dynamic re-replication if a chunk server fails
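
Because chunks are a fixed 64 MBytes, translating a byte offset within a file into a chunk is simple arithmetic; a small sketch (the constant name is mine):

    #include <cstdint>

    // Fixed 64-MByte chunk size, as stated above.
    constexpr uint64_t kChunkSize = 64ULL * 1024 * 1024;

    // Which chunk of the file holds this byte offset, and where within it?
    uint64_t ChunkIndex(uint64_t offset)  { return offset / kChunkSize; }
    uint64_t ChunkOffset(uint64_t offset) { return offset % kChunkSize; }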

Google File System (continued)
- The Master maintains the metadata and chunk information, and also performs garbage collection
- All data transactions are between clients and chunk servers; transactions between a client and the Master are for control and information only
- Atomic transactions with a replicated log
- The Master can be restarted on a different node as necessary; so can chunk servers
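
This division of labor implies a read path like the following sketch (illustrative pseudo-code; the real GFS RPC interface differs in detail):

    // Metadata from the Master, data from a chunk server.
    chunk_index = offset / (64 MBytes)                   // computed by the client
    (handle, locations) = Master.Lookup(filename, chunk_index)  // control traffic only
    data = ChunkServer(locations[0]).Read(handle, chunk_offset, length)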

Reference
Ghemawat, Sanjay, Gobioff, Howard, and Leung, Shun-Tak, "The Google File System," Proceedings of the 2003 Symposium on Operating Systems Principles, Bolton Landing (Lake George), NY, October 2003.

Additional Reference
Dean, Jeffrey, and Ghemawat, Sanjay, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, January 2008, pp. 107-113.

Questions?