Intro to Map/Reduce
Andrew Rau-Chaplin
Adapted from "What is Cloud Computing? (and an intro to parallel/distributed processing)", Jimmy Lin, The iSchool, University of Maryland. Some material adapted from slides by Christophe Bisciglia, Aaron Kimball, and Sierra Michels-Slettvet.

MapReduce
– Functional programming
– MapReduce paradigm
– Distributed file system

Functional Programming
MapReduce = functional programming meets distributed processing, on steroids
– Not a new idea... dates back to the 50's (or even the 30's)
What is functional programming?
– Computation as the application of functions
– Theoretical foundation provided by the lambda calculus
How is it different?
– Traditional notions of "data" and "instructions" are not applicable
– Data flows are implicit in the program
– Different orders of execution are possible
Exemplified by LISP and ML

Overview of Lisp
Lisp ≠ Lost In Silly Parentheses
We'll focus on a particular dialect: "Scheme"
Lists are primitive data types
Functions written in prefix notation
(+ 1 2) → 3
(* 3 4) → 12
(sqrt (+ (* 3 3) (* 4 4))) → 5
(define x 3) → x
(* x 5) → 15
'(1 2 3 4 5)
'((a 1) (b 2) (c 3))

Functions
Functions = lambda expressions bound to variables:
(define foo
  (lambda (x y)
    (sqrt (+ (* x x) (* y y)))))
Syntactic sugar for defining functions – the expression above is equivalent to:
(define (foo x y)
  (sqrt (+ (* x x) (* y y))))
Once defined, a function can be applied:
(foo 3 4) → 5

Other Features
In Scheme, everything is an s-expression
– No distinction between "data" and "code"
– Easy to write self-modifying code
Higher-order functions
– Functions that take other functions as arguments:
(define (bar f x) (f (f x)))
(define (baz x) (* x x))
(bar baz 2) → 16
It doesn't matter what f is – bar just applies it twice.

Recursion is your friend
Simple factorial example:
(define (factorial n)
  (if (= n 1)
      1
      (* n (factorial (- n 1)))))
(factorial 6) → 720
Even iteration is written with recursive calls!
(define (factorial-iter n)
  (define (aux n top product)
    (if (= n top)
        (* n product)
        (aux (+ n 1) top (* n product))))
  (aux 1 n 1))
(factorial-iter 6) → 720

Lisp → MapReduce?
What does this have to do with MapReduce? After all, Lisp is about processing lists.
Two important concepts in functional programming:
– Map: do something to everything in a list
– Fold: combine the results of a list in some way

Map
Map is a higher-order function
How map works:
– The function is applied to every element in a list
– The result is a new list
[Diagram: f applied to each input element in turn, producing the output list]

Fold
Fold is also a higher-order function
How fold works:
– The accumulator is set to an initial value
– The function is applied to a list element and the accumulator
– The result is stored in the accumulator
– This is repeated for every item in the list
– The result is the final value in the accumulator
[Diagram: f repeatedly combines the initial value and each list element into the accumulator, yielding the final value]

Map/Fold in Action
Simple map example:
(map (lambda (x) (* x x)) '(1 2 3 4 5)) → '(1 4 9 16 25)
Fold examples:
(fold + 0 '(1 2 3 4 5)) → 15
(fold * 1 '(1 2 3 4 5)) → 120
Sum of squares:
(define (sum-of-squares v)
  (fold + 0 (map (lambda (x) (* x x)) v)))
(sum-of-squares '(1 2 3 4 5)) → 55
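The same examples in Python, for readers who do not speak Scheme: the built-in map plays the role of map and functools.reduce plays the role of fold. A minimal sketch:

from functools import reduce

nums = [1, 2, 3, 4, 5]

# map: apply a function to every element, producing a new list
squares = list(map(lambda x: x * x, nums))            # [1, 4, 9, 16, 25]

# fold: combine the elements of a list into an accumulator
total   = reduce(lambda acc, x: acc + x, nums, 0)     # 15
product = reduce(lambda acc, x: acc * x, nums, 1)     # 120

# sum of squares = fold over the mapped list
sum_sq  = reduce(lambda acc, x: acc + x,
                 map(lambda x: x * x, nums), 0)       # 55

print(squares, total, product, sum_sq)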

Lisp → MapReduce
Let's assume a long list of records: imagine if...
– We can parallelize map operations
– We have a mechanism for bringing map results back together in the fold operation
That's MapReduce! (and Hadoop)
Observations:
– There is no limit to map parallelization, since the maps are independent
– We can reorder the folding if the fold function is commutative and associative

Typical Problem
– Iterate over a large number of records
– Extract something of interest from each (Map)
– Shuffle and sort intermediate results
– Aggregate intermediate results (Reduce)
– Generate final output
Key idea: provide an abstraction at the point of these two operations, Map and Reduce.

MapReduce
Programmers specify two functions:
map (k, v) → <k', v'>*
reduce (k', v') → <k', v'>*
– All v' with the same k' are reduced together
Usually, programmers also specify:
partition (k', number of partitions) → partition for k'
– Often a simple hash of the key, e.g. hash(k') mod n
– Allows reduce operations for different keys to run in parallel
Implementations:
– Google has a proprietary implementation in C++
– Hadoop is an open-source implementation in Java (led by Yahoo)
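To make this contract concrete, here is a toy single-process sketch in Python; run_mapreduce and its parameter names are invented for illustration and are not part of Hadoop or of Google's implementation:

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, num_partitions=2, partition_fn=None):
    # Default partitioner: a simple hash of the key, hash(k') mod n
    if partition_fn is None:
        partition_fn = lambda key, n: hash(key) % n

    # Map phase: each input record may emit any number of (k', v') pairs
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            partitions[partition_fn(k2, num_partitions)][k2].append(v2)

    # Reduce phase: all v' with the same k' are reduced together;
    # each partition could be handled by a different reducer in parallel
    output = {}
    for part in partitions:
        for k2, values in part.items():
            output[k2] = reduce_fn(k2, values)
    return output

# Example: sum the values seen for each key
data = [("doc1", 3), ("doc2", 4), ("doc1", 5)]
print(run_mapreduce(data,
                    map_fn=lambda k, v: [(k, v)],
                    reduce_fn=lambda k, vs: sum(vs)))
# {'doc1': 8, 'doc2': 4} (key order may vary)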

It's just divide and conquer!
[Diagram: the data store hands initial (key, value) pairs to several map tasks running in parallel; each emits intermediate values grouped under keys k1, k2, k3; a barrier aggregates values by key; reduce tasks then produce the final values for k1, k2, and k3.]

Recall these problems? How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die?

MapReduce Runtime
Handles scheduling
– Assigns workers to map and reduce tasks
Handles "data distribution"
– Moves the process to the data
Handles synchronization
– Gathers, sorts, and shuffles intermediate data
Handles faults
– Detects worker failures and restarts
Everything happens on top of a distributed FS (later)

"Hello World": Word Count

Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
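The same word count written as ordinary Python functions, with a tiny driver standing in for the framework; this is a sketch of the programming model, not Hadoop's actual API:

import re

def wc_map(doc_name, doc_contents):
    # Emit ("word", 1) for every word occurrence in the document
    for word in re.findall(r"\w+", doc_contents.lower()):
        yield (word, 1)

def wc_reduce(word, counts):
    # All counts for the same word arrive together; just add them up
    return sum(counts)

docs = [("d1", "to be or not to be"), ("d2", "to map is to reduce")]
grouped = {}
for name, text in docs:                 # "map" phase
    for w, c in wc_map(name, text):
        grouped.setdefault(w, []).append(c)
print({w: wc_reduce(w, cs) for w, cs in grouped.items()})   # "reduce" phase
# {'to': 4, 'be': 2, 'or': 1, 'not': 1, 'map': 1, 'is': 1, 'reduce': 1}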

[Figure slide] Source: Dean and Ghemawat (OSDI 2004)

Bandwidth Optimization
Issue: the mappers produce a large number of key-value pairs
Solution: use "Combiner" functions
– Executed on the same machine as the mapper
– Results in a "mini-reduce" right after the map phase
– Reduces the number of key-value pairs shipped, saving bandwidth
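A combiner is essentially the reducer's logic run locally over a single mapper's output before anything crosses the network. A minimal word-count sketch in plain Python (not Hadoop's Combiner interface):

import re
from collections import Counter

def wc_map(doc_name, doc_contents):
    # Plain map: one ("word", 1) pair per occurrence
    for word in re.findall(r"\w+", doc_contents.lower()):
        yield (word, 1)

def wc_map_with_combiner(doc_name, doc_contents):
    # Combiner: "mini-reduce" the mapper's own output, emitting one
    # (word, partial_count) pair per distinct word instead of one
    # (word, 1) pair per occurrence.
    local = Counter()
    for word, one in wc_map(doc_name, doc_contents):
        local[word] += one
    yield from local.items()

print(len(list(wc_map("d1", "to be or not to be"))))                # 6 pairs
print(len(list(wc_map_with_combiner("d1", "to be or not to be"))))  # 4 pairs

This is safe for word count because addition is associative and commutative, so the reducer can add up partial counts just as it would add up raw 1s.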

Skew Problem Issue: reduce is only as fast as the slowest map Solution: redundantly execute map operations, use results of first to finish – Addresses hardware problems... – But not issues related to inherent distribution of data
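One way to picture this "speculative execution" idea (an illustrative sketch only, not how Hadoop's scheduler is actually implemented): launch the same task on more than one worker and keep whichever copy finishes first.

import concurrent.futures as cf
import random
import time

def flaky_map_task(task_id, replica):
    # Simulate a straggler: some replicas are much slower than others
    time.sleep(random.uniform(0.1, 2.0))
    return f"result of task {task_id} (replica {replica})"

with cf.ThreadPoolExecutor(max_workers=4) as pool:
    # Redundantly execute the same map task on two workers
    futures = [pool.submit(flaky_map_task, 7, r) for r in range(2)]
    done, not_done = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    print(next(iter(done)).result())    # use the first copy that finishes
    for f in not_done:
        f.cancel()                      # best-effort cancel of the straggler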

How do we get data to the workers?
[Diagram: compute nodes pulling data over the network from shared NAS/SAN storage]
What's the problem here?

Distributed File System
Don't move data to workers... move workers to the data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data local
Why?
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput is good
A distributed file system is the answer
– GFS (Google File System)
– HDFS (Hadoop Distributed File System) for Hadoop

GFS: Assumptions
– Commodity hardware over "exotic" hardware
– High component failure rates: inexpensive commodity components fail all the time
– "Modest" number of HUGE files
– Files are write-once, mostly appended to (perhaps concurrently)
– Large streaming reads over random access
– High sustained throughput over low latency
GFS slides adapted from material by Dean et al.

GFS: Design Decisions
Files stored as chunks
– Fixed size (64 MB)
Reliability through replication
– Each chunk replicated across 3+ chunkservers
Single master to coordinate access and keep metadata
– Simple centralized management
No data caching
– Little benefit, due to large data sets and streaming reads
Simplify the API
– Push some of the issues onto the client

Single Master
We know this is a:
– Single point of failure
– Scalability bottleneck
GFS solutions:
– Shadow masters
– Minimize master involvement: never move data through it, use it only for metadata (and cache metadata at clients); large chunk size; the master delegates authority to primary replicas for data mutations (chunk leases)
Simple, and good enough!

Master's Responsibilities (1/2)
Metadata storage
Namespace management/locking
Periodic communication with chunkservers
– Give instructions, collect state, track cluster health
Chunk creation, re-replication, rebalancing
– Balance space utilization and access speed
– Spread replicas across racks to reduce correlated failures
– Re-replicate data if redundancy falls below the threshold
– Rebalance data to smooth out storage and request load

Master's Responsibilities (2/2)
Garbage collection
– Simpler and more reliable than traditional file deletion
– The master logs the deletion and renames the file to a hidden name
– Hidden files are lazily garbage collected
Stale replica deletion
– Detect "stale" replicas using chunk version numbers

Metadata
Global metadata is stored on the master
– File and chunk namespaces
– Mapping from files to chunks
– Locations of each chunk's replicas
All in memory (64 bytes per chunk)
– Fast
– Easily accessible
The master keeps an operation log for persistent logging of critical metadata updates
– Persistent on local disk
– Replicated
– Checkpointed for faster recovery
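A quick back-of-the-envelope check of why keeping all metadata in the master's memory is feasible, using the 64 MB chunk size from the design slide and the roughly 64 bytes of metadata per chunk quoted above (the 1 PB dataset size is just an assumed example):

# Assumed example: 1 PB of data stored in 64 MB chunks,
# with ~64 bytes of master metadata per chunk.
data_bytes         = 1 * 1024**5        # 1 PB
chunk_bytes        = 64 * 1024**2       # 64 MB fixed-size chunks
metadata_per_chunk = 64                 # bytes of metadata per chunk

num_chunks   = data_bytes // chunk_bytes
metadata_ram = num_chunks * metadata_per_chunk

print(num_chunks)                       # 16777216 chunks (2**24)
print(metadata_ram / 1024**3, "GiB")    # 1.0 GiB of master RAM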

Mutations
Mutation = write or append
– Must be done for all replicas
Goal: minimize master involvement
Lease mechanism:
– The master picks one replica as the primary and gives it a "lease" for mutations
– The primary defines a serial order of mutations
– All replicas follow this order
– Data flow is decoupled from control flow
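A highly simplified sketch of the lease idea described above: the primary alone assigns serial numbers, and every replica applies mutations in that order. The classes below are invented for illustration and say nothing about GFS's real data structures or its data-flow pipeline:

class Replica:
    def __init__(self, name):
        self.name = name
        self.applied = []               # mutations, in the order applied

    def apply(self, serial_no, mutation):
        self.applied.append((serial_no, mutation))

class Primary(Replica):
    """The replica holding the chunk lease: it alone assigns serial numbers."""
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        serial_no = self.next_serial    # the primary defines the serial order
        self.next_serial += 1
        self.apply(serial_no, mutation)
        for s in self.secondaries:      # all replicas follow the same order
            s.apply(serial_no, mutation)

secondaries = [Replica("r2"), Replica("r3")]
primary = Primary("r1", secondaries)
primary.mutate("append('hello')")
primary.mutate("append('world')")
assert secondaries[0].applied == primary.applied    # same order everywhere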

Parallelization Problems How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? How is MapReduce different?

From Theory to Practice
[Diagram: you, working against a Hadoop cluster]
1. Scp data to the cluster
2. Move data into HDFS
3. Develop code locally
4. Submit the MapReduce job
4a. Go back to step 3
5. Move data out of HDFS
6. Scp data from the cluster

On Amazon: With EC2
[Diagram: the same workflow, now against your own Hadoop cluster allocated on EC2]
0. Allocate a Hadoop cluster on EC2
1. Scp data to the cluster
2. Move data into HDFS
3. Develop code locally
4. Submit the MapReduce job
4a. Go back to step 3
5. Move data out of HDFS
6. Scp data from the cluster
7. Clean up!
Uh oh. Where did the data go?

On Amazon: EC2 and S3
[Diagram: your Hadoop cluster runs in EC2 (the cloud), with S3 as the persistent store; copy data from S3 into HDFS before the job, and from HDFS back to S3 afterwards]

Example
Using Hadoop Streaming to count the number of times that words occur within a text collection:
– Write the word count map function
– Upload it to S3 using tools such as s3cmd or the Firefox plugin S3 Organizer

Running wordcount
Go to the AWS Management Console and create a new Job Flow.
Note:
– It uses the built-in reducer called aggregate
– This reducer adds up the counts of words emitted by the wordSplitter map function
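For reference, a Hadoop Streaming word-count mapper in this style might look like the sketch below; it is a generic example, not Amazon's actual wordSplitter script. The "LongValueSum:" prefix on each key is how the built-in aggregate reducer knows it should sum the values for that word:

#!/usr/bin/env python
# Streaming mapper: reads raw text lines on stdin and emits one
# tab-separated (key, value) line per word occurrence.
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"\w+", line.lower()):
        print("LongValueSum:" + word + "\t1")

The job is then submitted with this script as the -mapper and aggregate as the -reducer.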

HPC & the Cloud
– StarCluster: HPC in the Cloud
– Amazon: HPC on AWS
– Amazon: Cluster Compute Instances