MAP-REDUCE: WIN -OR- EPIC WIN CSC313: Advanced Programming Topics

Brief History of Google BackRub: disk drives 24 GB total storage

Brief History of Google Google: disk drives 366 GB total storage

Traditional Design Principles
 If the problem is big enough, a supercomputer can process the work
 It uses desktop CPUs, just a lot more of them
 It also provides huge bandwidth to memory, equivalent to many machines’ bandwidth at once
 But supercomputers are VERY, VERY expensive, and maintenance stays expensive once the machine is bought
 You do get something for the money: high quality == low downtime
 A safe, expensive solution to very large problems

Why Trade Money for Safety?

How Was Search Performed?  [figure: DNS]

Google’s Big Insight
 Performing search is “embarrassingly parallel”
 No need for a supercomputer and all that expense
 Can instead do this using lots & lots of desktops, with identical effective bandwidth & performance
 But the problem is that desktop machines are unreliable
 Budget for 2 replacements, since the machines are cheap
 Just expect failure; software provides the quality

Brief History of Google Google: 2012, ?0,000 total servers, ??? PB total storage

How Is Search Performed Now?  [figure: Spell Checker, Ad Server, Document Servers (TB), Index Servers (TB)]

Google’s Processing Model
 Buy cheap machines & prepare for the worst: machines are going to fail, but this is still the cheaper approach
 Important steps keep the whole system reliable: replicate data so that information losses are limited, and move data freely so loads can always be rebalanced
 These decisions lead to many other benefits: scalability is helped by the focus on balancing, search speed and overall performance improve, and resources are utilized fully since search demand varies

Heterogeneous Processing
 By buying the cheapest computers available, variance between machines is high
 Programs must handle both homogeneous & heterogeneous systems
 A centralized work queue helps cope with the different machine speeds
 This approach also brings a few small downsides: space, power consumption, and cooling costs

Complexity at Google

Google Abstractions
 Google File System: handles replication to provide scalability & durability
 BigTable: manages large structured data sets
 Chubby: (gonna skip past that joke) a distributed locking service
 MapReduce: if the job fits the model, easy parallelism is possible without much work

Remember Google’s Problem

MapReduce Overview
 The programming model provides a good Façade: automatic parallelization & load balancing, network and disk I/O optimization, and robust performance even if machines fail
 The idea came from 2 Lisp (functional) primitives, sketched below
 Map: process each entry in a list using some function
 Reduce: recombine the data using a given function

Typical MapReduce Problem
1. Read lots and lots of data (e.g., TBs)
2. Map: extract important data from each entry in the input
3. Combine the Maps’ output and sort entries by key
4. Reduce: process each key’s entries to get a result for that key
5. Output the final result & watch the money roll in

Pictorial View of MapReduce

Ex: Count Word Frequencies  Processes files separately Map Key=URL Value=text on page Key=URL Value=text on page

Ex: Count Word Frequencies  Processes files separately & count word freq. in each Map Key=URL Value=text on page Key=URL Value=text on page Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count

Ex: Count Word Frequencies
 In the shuffle step, the Maps’ outputs are combined & entries are sorted by key
 Reduce then combines each key’s results to compute the final output
 [figure: intermediate pairs (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) become (“to”,2), (“be”,2), (“or”,1), (“not”,1)]

Word Frequency Pseudo-code

Map(String input_key, String input_values) {
  // input_key: URL; input_values: the text on that page
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, "1");
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key: a word; intermediate_values: every count emitted for that word
  int result = 0;
  foreach v in intermediate_values {
    result += ParseInt(v);
  }
  Emit(result);
}
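The pseudo-code leans on two framework calls (EmitIntermediate and Emit) that the runtime supplies. As a minimal single-machine sketch, assuming nothing beyond the standard library, the Java toy below wires equivalent map and reduce functions to an in-memory shuffle so the whole dataflow is visible; the class name WordCountSketch and the sample input are invented for illustration, and the real system would run many map and reduce tasks across many machines.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map phase: called once per (URL, page text) input pair; the URL is unused here.
    static List<Map.Entry<String, Integer>> map(String url, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split(" ")) {
            out.add(Map.entry(w, 1));            // stands in for EmitIntermediate(w, "1")
        }
        return out;
    }

    // Reduce phase: called once per key, with all intermediate values for that key.
    static int reduce(String word, List<Integer> counts) {
        int result = 0;
        for (int c : counts) result += c;        // stands in for Emit(result)
        return result;
    }

    public static void main(String[] args) {
        // Shuffle: group intermediate pairs by key and sort the keys.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : map("http://example.org", "to be or not to be")) {
            shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // Reduce every key and print the final counts: be = 2, not = 1, or = 1, to = 2.
        shuffled.forEach((word, counts) ->
                System.out.println(word + " = " + reduce(word, counts)));
    }
}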

Ex: Build Search Index  Processes files separately & record words found on each Map Key=URL Value=text on page Key=URL Value=text on page Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=URL Key’=word Value’=URL

Ex: Build Search Index  Processes files separately & record words found on each  To get search Map, combine key’s results in Reduce Map Key=URL Value=text on page Key=URL Value=text on page Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=URL Key’=word Value’=URL Reduce Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=count Key’=word Value’=URL Key’=word Value’=URL Key=word Value=URLs with word Key=word Value=URLs with word

Search Index Pseudo-code

Map(String input_key, String input_values) {
  // input_key: URL; input_values: the text on that page
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, input_key);   // emit (word, URL)
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key: a word; intermediate_values: every URL where that word appeared
  List result = new ArrayList();
  foreach v in intermediate_values {
    result.add(v);
  }
  Emit(result);
}
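The same single-machine sketch shown after the word-frequency pseudo-code would run this job as well; only the map and reduce bodies change (emit the URL instead of "1", and collect the URLs into a list instead of summing), while the shuffle and driver code stay untouched. That interchangeability is the Façade at work: the programmer swaps two functions, and the partitioning, sorting, and scheduling stay the same.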

Ex: Page Rank Computation  Google’s algorithm ranking pages’ relevance

Ex: Page Rank Computation
 [figure: Map takes (Key=URL, Value=links on page) and emits, for each link, (Key’=link target URL, Value’=rank contribution); Reduce sums (+) the contributions arriving at each URL to produce that page’s new rank along with its links]
 Repeat the entire process (i.e., feed the Reduce results back into Map) until the page ranks stabilize (the sum of changes to the ranks drops below some threshold)

Advanced MapReduce Ideas  How to implement? One master, many workers  Split input data into tasks where each task size fixed  Will also be partitioning reduce phase into tasks  Dynamically assign tasks to workers during each step  Tasks assigned as needed & placed in in-process list  Once worker completes task, save result & retire task  Assume that a worker crashed, if not complete in time  Move incomplete tasks back into pool for reassignment
