DISTRIBUTED COMPUTING & MAP REDUCE
CS16: Introduction to Data Structures & Algorithms
Thursday, April 17, 2014

Outline
1) Distributed Computing Overview
2) MapReduce
   I. Application: Word Count
   II. Application: Mutual Friends

Distributed Computing: Motivation
Throughout the course of the semester, we have talked about many ways to optimize algorithm speed, for example:
- Using efficient data structures
- Taking greedy "shortcuts" when we are sure they are correct
- Dynamic programming algorithms
However, we left out a seemingly obvious one... using more than one computer!

Distributed Computing: Explanation
Distributed computing is the field of taking a computational problem and distributing it among many worker computers (nodes).
- Usually there is a master computer, which coordinates the distribution of work to, and the feedback from, the nodes.
- The distributed system can be represented as a graph where the vertices are the nodes and the edges are the network connections between them.

Distributed Computing: Primes

Distributed Computing: Primes (2)
Wrong! The largest known prime number was discovered last year: 2^57,885,161 - 1. This number has 17,425,170 digits! A single machine cannot verify the primality of this number!
- This prime (named M48) was discovered through GIMPS, the "Great Internet Mersenne Prime Search", a distributed program launched in 1996 to find large prime numbers.
- GIMPS searches for primes of the form 2^m - 1, where m itself is prime.
- GIMPS allows users to donate some of their computer's idle time toward testing the primality of numbers of this form.
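As a quick check of the digit count quoted above: a number of the form 2^m - 1 has floor(m * log10(2)) + 1 decimal digits. A minimal Python sketch of that calculation (not part of the original slides):

    import math

    # 2^m and 2^m - 1 have the same number of decimal digits
    # (2^m is never a power of 10), namely floor(m * log10(2)) + 1.
    m = 57885161
    digits = math.floor(m * math.log10(2)) + 1
    print(digits)  # 17425170, matching the figure on the slide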

Distributed Computing: Primes (3)
Is 101 prime? (Diagram: the master hands each of the three nodes its own set of candidate divisors, {2, 3, 4}, {5, 6, 7}, and {8, 9, 10}, and asks whether any of them divides 101. Every node answers "No", so 101 is prime.)
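A minimal Python sketch of the same idea follows. The master/worker function names and the sequential loop standing in for real network communication are illustrative assumptions, not part of the original slides:

    def worker(divisors, n):
        # Each worker checks only its assigned candidate divisors.
        return any(n % d == 0 for d in divisors)

    def master(n):
        # The master splits the candidate divisors among the workers
        # (for n = 101 it is enough to test divisors up to sqrt(101), i.e. 2..10).
        assignments = [[2, 3, 4], [5, 6, 7], [8, 9, 10]]
        answers = [worker(chunk, n) for chunk in assignments]  # in reality, run in parallel
        return not any(answers)  # prime iff no worker found a divisor

    print(master(101))  # True: 101 is prime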

More Applications
There are many problems being solved by massive distributed systems like GIMPS:
- Stanford's Folding@home project uses distributed computing to solve the problem of how proteins fold, helping scientists studying Alzheimer's, Parkinson's, and cancers.
- Berkeley's project uses distributed computing to generate a highly accurate 3D model of the Milky Way galaxy using data collected over the past decade.

Typical Large-Data Problem
As the Internet has grown, so has the amount of data in existence! To make sense of all this data, we generally want to:
- Iterate over a large number of records
- Extract something of interest from each
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate final output

Developing MapReduce
In 2004, Google was "indexing" websites: crawling pages and keeping track of words and their locations on each page. The size of the Web was huge! To handle all this information, Google researchers came up with the MapReduce framework.

MapReduce
MapReduce is a programming abstraction seeking to hide system-level details from developers while providing the advantages of distributed computation.
- Focus only on the "what", not the "how": the developer specifies the computation that needs to be performed, and the framework (MapReduce) handles the actual execution.
- MapReduce serves as a black box for distributed computing.

Typical Large-Data Problem
To make sense of all this data, we generally want to:
- Iterate over a large number of records  [MAP]
- Extract something of interest from each  [MAP]
- Shuffle and sort intermediate results
- Aggregate intermediate results  [REDUCE]
- Generate final output  [REDUCE]
The MapReduce framework takes care of the rest.

Counting Words
Suppose you want to know the most popular word on the Internet. One method you could use would be to create a hashtable with words as keys and counts as values, but looping through every word on every page and hashing it into the table takes a long time...
MapReduce allows you to speed up the entire process. You only need to determine the "mapping" and "reducing" portions, and the framework does the rest!
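For reference, here is what that single-machine hashtable approach looks like as a minimal Python sketch (the crawling is omitted and `pages` is assumed to already be a list of word lists):

    from collections import Counter

    def count_words(pages):
        # pages: a list of documents, each already split into a list of words
        counts = Counter()
        for page in pages:
            for word in page:
                counts[word] += 1  # one hashtable update per word, all on one machine
        return counts

    pages = [["do", "not", "forget", "to", "do", "what", "you", "do"],
             ["send", "me", "a", "forget", "me", "not"]]
    print(count_words(pages).most_common(3))  # e.g. [('do', 3), ('not', 2), ('forget', 2)]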

MapReduce Example
Suppose we have three documents and we would like to know the number of times each word occurs throughout all the documents:
- doc1: "do not forget to do what you do"
- doc2: "send me a forget me not"
- doc3: "two plus two is not one"

MapReduce Example
The MapReduce runtime takes care of calling our map routine three times, once for each document:
- doc1: "do not forget to do what you do"
- doc2: "send me a forget me not"
- doc3: "two plus two is not one"
What are the input arguments to map?

MapReduce Example: Mapping
    map(doc1, [do, not, forget, to, do, what, you, do])
    map(doc2, [send, me, a, forget, me, not])
    map(doc3, [two, plus, two, is, not, one])

MapReduce Example: Mapping
What should our map function return? Remember that the function must return a list of (key, value) tuples. The output from all of the map calls is grouped by key and then passed to the reducer.
    map(doc1, [do, not, forget, to, do, what, you, do])  ->  ?
    map(doc2, [send, me, a, forget, me, not])  ->  ?
    map(doc3, [two, plus, two, is, not, one])  ->  ?

MapReduce Example: Mapping
    map(doc1, [do, not, forget, to, do, what, you, do])
      -> [(do,1), (not,1), (forget,1), (to,1), (do,1), (what,1), (you,1), (do,1)]
    map(doc2, [send, me, a, forget, me, not])
      -> [(send,1), (me,1), (a,1), (forget,1), (me,1), (not,1)]
    map(doc3, [two, plus, two, is, not, one])
      -> [(two,1), (plus,1), (two,1), (is,1), (not,1), (one,1)]

MapReduce Example: Mapping
The framework takes care of grouping the returned tuples by key and passing the list of values to the reducers. All values for a certain key are sent to the same reducer.

    def map(map_k, map_v):
        # In this example, map_k is a document name and
        # map_v is a list of strings in the document
        tuples = []
        for v in map_v:
            tuples.append((v, 1))
        return tuples
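The grouping itself is done by the framework, not by our code, but it is easy to picture. Here is a minimal Python sketch of that shuffle step (the name group_by_key is illustrative, not a real framework API):

    from collections import defaultdict

    def group_by_key(mapped_tuples):
        # Collect every (key, value) tuple emitted by the mappers so that
        # each key ends up with the list of all values emitted for it.
        grouped = defaultdict(list)
        for key, value in mapped_tuples:
            grouped[key].append(value)
        return grouped

    pairs = [("do", 1), ("not", 1), ("do", 1), ("not", 1), ("do", 1)]
    print(dict(group_by_key(pairs)))  # {'do': [1, 1, 1], 'not': [1, 1]}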

MapReduce Example: Reducing
    do: [1, 1, 1]   not: [1, 1, 1]   forget: [1, 1]   me: [1, 1]   two: [1, 1]
    to: [1]   what: [1]   you: [1]   send: [1]   a: [1]   plus: [1]   is: [1]   one: [1]
Reduce is called for each of these (word, values) pairs, e.g. reduce(forget, [1, 1]).

MapReduce Example: Reducing
    def reduce(k, values):
        # In this example, k is a particular word and
        # values is a list of all 1's
        total = 0
        for v in values:
            total += 1
        return (k, total)

Once the reducers all return, we have our counts for all the words!
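Putting the pieces together, here is a small local simulation of the whole word-count job (single process, no real cluster; run_mapreduce, map_fn, and reduce_fn are illustrative names that mirror the slides' map and reduce):

    from collections import defaultdict

    def map_fn(doc_name, words):              # mirrors the slides' map
        return [(w, 1) for w in words]

    def reduce_fn(word, values):              # mirrors the slides' reduce
        return (word, sum(values))

    def run_mapreduce(docs):
        # 1) Map phase: call map_fn once per document.
        mapped = []
        for name, words in docs.items():
            mapped.extend(map_fn(name, words))
        # 2) Shuffle phase: the framework groups the emitted values by key.
        grouped = defaultdict(list)
        for key, value in mapped:
            grouped[key].append(value)
        # 3) Reduce phase: call reduce_fn once per key.
        return dict(reduce_fn(k, vs) for k, vs in grouped.items())

    docs = {
        "doc1": "do not forget to do what you do".split(),
        "doc2": "send me a forget me not".split(),
        "doc3": "two plus two is not one".split(),
    }
    print(run_mapreduce(docs))  # {'do': 3, 'not': 3, 'forget': 2, ...}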

The MapReduce Process
(Diagram: the framework sits between the mappers and the reducers, gathering the mappers' output, grouping it by key, and routing each group to a reducer.)

MapReduce Benefits
Note that all of the map operations can run entirely in parallel, as can all of the reduce operations (once the map operations terminate). With hundreds or thousands of computing nodes, this is a huge benefit! This example seemed trivial, but suppose we were counting billions of entries!

One More Example
For every person on Facebook, Facebook stores their friends in the format
    Person -> [List of friends]
How can we use MapReduce to precompute, for any two people on Facebook, all of the friends they have in common?
Hint: we'll have to use the property that identical keys from the mapper all go to the same reducer!
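One common way to do this, sketched below in Python as an illustration (an assumed solution, not necessarily the one the course intends): the mapper emits, for each friend B of person A, the sorted pair (A, B) as the key with A's whole friend list as the value. Because the key is sorted, A's mapper and B's mapper emit the same key, so one reducer receives both friend lists and can intersect them.

    from collections import defaultdict

    def mutual_friends_map(person, friends):
        # Emit one tuple per friendship edge; sorting the pair makes the key
        # identical whether it comes from person's side or the friend's side.
        return [(tuple(sorted((person, friend))), set(friends)) for friend in friends]

    def mutual_friends_reduce(pair, friend_lists):
        # friend_lists holds two sets: pair[0]'s friends and pair[1]'s friends.
        return (pair, set.intersection(*friend_lists))

    # Tiny hypothetical (and symmetric) friend lists:
    data = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["A", "C"]}

    # Group the mapper output by key, then reduce each group, as the framework would.
    grouped = defaultdict(list)
    for person, friends in data.items():
        for key, value in mutual_friends_map(person, friends):
            grouped[key].append(value)
    print(dict(mutual_friends_reduce(k, vs) for k, vs in grouped.items()))
    # e.g. the ('A', 'C') entry contains B and D, the friends that A and C share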

MapReduce Benefits (2)
The MapReduce framework:
- Handles scheduling: assigns workers to map and reduce tasks
- Handles "data distribution"
- Handles synchronization: gathers, sorts, and shuffles intermediate data
- Handles errors: worker failures and restarts
Everything happens on top of a distributed filesystem (for another lecture!). All you have to do (for the most part) is write two functions: map and reduce!
