CS639: Data Management for Data Science


CS639: Data Management for Data Science Lecture 10: Algorithms in MapReduce (continued) Theodoros Rekatsinas

Logistics/Announcements PA3 out by the end of the day. If you have grading questions, or questions about PA1 and PA2, please ask Frank and Huawei. No class on Monday; we will resume on Wednesday.

Today’s Lecture Recap of the MapReduce data and programming model, then more MapReduce examples.

Recap on MapReduce data and programming model

Recall: The MapReduce Abstraction for Distributed Algorithms

[Diagram: distributed data storage feeds many parallel map tasks; a shuffle phase groups intermediate pairs by key; parallel reduce tasks produce the final output.]

Recall: MapReduce’s Data Model Files! A file is a bag of (key, value) pairs, where a bag is a multiset. A map-reduce program takes as input a bag of (inputkey, value) pairs and produces as output a bag of (outputkey, value) pairs.

MapReduce Programming Model

Input & output: each a set of key/value pairs. The programmer specifies two functions:

map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair and produces a set of intermediate pairs.

reduce (out_key, list(intermediate_value)) -> (out_key, list(out_values))
Combines all intermediate values for a particular key and produces a set of merged output values (usually just one).

Example: Word count over a corpus of documents

map(String input_key, String input_value):
  // input_key: document id
  // input_value: document contents (bag of words)
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String intermediate_key, Iterator intermediate_values):
  // intermediate_key: word
  // intermediate_values: ????
  result = 0;
  for each v in intermediate_values:
    result += v;
  EmitFinal(intermediate_key, result);
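As a minimal sketch, the word-count job above can be simulated in plain Python (this is not Hadoop code; a dictionary stands in for the distributed shuffle, and all names here are illustrative):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all partial counts for this word.
    return (word, sum(counts))

def run_mapreduce(docs):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Reduce: one call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = {"d1": "the cat sat", "d2": "the cat ran"}
print(run_mapreduce(docs))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

Running this also answers the ???? above: each intermediate_values iterator is just the list of 1s emitted for that word.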

2. More MapReduce Examples

Relational Join

Employee:
Name | SSN
Sue | 9999999999
Tony | 7777777777

Assigned Departments:
EmpSSN | DepName
9999999999 | Accounts
7777777777 | Sales
7777777777 | Marketing

Employee ⨝SSN=EmpSSN Assigned Departments:
Name | SSN | EmpSSN | DepName
Sue | 9999999999 | 9999999999 | Accounts
Tony | 7777777777 | 7777777777 | Sales
Tony | 7777777777 | 7777777777 | Marketing

Relational Join

Remember the semantics!

join_result = []
for e in Employee:
  for d in AssignedDepartments:
    if e.SSN == d.EmpSSN:
      r = <e.Name, e.SSN, d.EmpSSN, d.DepName>
      join_result.append(r)
return join_result

Relational Join

Imagine we have a huge number of records! Let’s use MapReduce! We want the map phase to process each tuple. Is there a problem?

Relational Join

The relational join is a binary operation! But MapReduce is unary: each reduce call operates on a single key. Can we express the join using MapReduce?

Relational Join in MapReduce: Preprocessing before the Map Phase

Key idea: Flatten all tables and combine tuples from different tables into a single dataset, tagging each tuple with the name of its source table:

(Employee, Sue, 9999999999)
(Employee, Tony, 7777777777)
(Assigned Departments, 9999999999, Accounts)
(Assigned Departments, 7777777777, Sales)
(Assigned Departments, 7777777777, Marketing)

Relational Join in MapReduce: Preprocessing before the Map Phase

In practice, it’s not difficult to get this lineage information: for example, the distributed files in one directory hold all the Employee chunks, and the distributed files in another directory hold all the Assigned Departments chunks. We use the table name to keep track of which table each tuple came from. This label is attached to every tuple so that we know where it came from; we’ll use it later!

Relational Join in MapReduce: Map Phase

For each tuple in the flattened input we generate a key/value pair:

key=9999999999, value=(Employee, Sue, 9999999999)
key=7777777777, value=(Employee, Tony, 7777777777)
key=9999999999, value=(Assigned Departments, 9999999999, Accounts)
key=7777777777, value=(Assigned Departments, 7777777777, Sales)
key=7777777777, value=(Assigned Departments, 7777777777, Marketing)

Relational Join in MapReduce: Map Phase

Why use this particular value (the SSN) as the key of each pair?

Relational Join in MapReduce: Map Phase

Why use this value as the key? Because we are joining on SSN (for Employee) and EmpSSN (for Assigned Departments). Typically we wouldn’t bother removing the attribute used as the key from the value.

Relational Join in MapReduce: Map Phase (Two Tricks so far)

Trick 1: Flatten and combine the tables into a single input file, tagging each tuple with its source table.
Trick 2: Produce a key/value pair for each tuple where the key is the join attribute.

Relational Join in MapReduce: Reduce Phase (after the magic Shuffle)

Input to Reducer 1:
key=9999999999, value=[(Employee, Sue, 9999999999), (Assigned Departments, 9999999999, Accounts)]

Input to Reducer 2:
key=7777777777, value=[(Employee, Tony, 7777777777), (Assigned Departments, 7777777777, Sales), (Assigned Departments, 7777777777, Marketing)]

After the shuffle phase, all inputs with the same key end up in the same reducer. It does not matter which relation the different tuples came from!

Relational Join in MapReduce: Reduce Phase (after the magic Shuffle)

We have all the information we need to perform the join for each key on a single machine. This is how we scale.

Relational Join in MapReduce: Reduce Phase (after the magic Shuffle)

Desired output of the reduce function:

Output of Reduce Function (Reducer 1):
Sue, 9999999999, 9999999999, Accounts

Output of Reduce Function (Reducer 2):
Tony, 7777777777, 7777777777, Sales
Tony, 7777777777, 7777777777, Marketing

Relational Join in MapReduce: Reduce Phase (after the magic Shuffle)

In each output tuple, the first part (Name, SSN) came from the Employee table and the second part (EmpSSN, DepName) came from the Assigned Departments table. What is the reduce function implementation?

Relational Join in MapReduce: Reduce Phase Implementation

Simple pseudo-code for reduce:

reduce(String key, Iterator tuples):
  // key: the join key
  // tuples: all tuples with the same join key, each tagged with its source table
  join_result = []
  for t1 in tuples:
    for t2 in tuples:
      if t1[0] == "Employee" and t2[0] == "Assigned Departments":
        t = (t1[1:], t2[1:])
        join_result.append(t)
  return (key, join_result)

You’ve got a set of tuples from one relation, and you need to associate each of them with every possible tuple from the other relation. This is a cross-product operation! Relational algebra is everywhere! Notice that we need to keep track of where each tuple came from.
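Putting the three tricks together, the whole join pipeline can be sketched in plain Python (again a simulation, not Hadoop code; the dictionary plays the role of the shuffle):

```python
from collections import defaultdict

employee = [("Sue", "9999999999"), ("Tony", "7777777777")]
assigned = [("9999999999", "Accounts"), ("7777777777", "Sales"),
            ("7777777777", "Marketing")]

# Map: tag each tuple with its table name and key it by the join attribute.
def map_phase():
    for name, ssn in employee:
        yield (ssn, ("Employee", name, ssn))
    for emp_ssn, dep in assigned:
        yield (emp_ssn, ("AssignedDepartments", emp_ssn, dep))

# Shuffle: group values by key.
groups = defaultdict(list)
for key, value in map_phase():
    groups[key].append(value)

# Reduce: per key, cross-product of Employee tuples with AssignedDepartments tuples.
result = []
for key, tuples in groups.items():
    emps = [t[1:] for t in tuples if t[0] == "Employee"]
    deps = [t[1:] for t in tuples if t[0] == "AssignedDepartments"]
    for e in emps:
        for d in deps:
            result.append(e + d)

for row in result:
    print(row)
```

This prints the three joined tuples from the slides: (Sue, Accounts), (Tony, Sales), and (Tony, Marketing), each with the matching SSN columns.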

Social Network Analysis: Count Friends

Input: a graph of symmetric friendship edges among Jim, Sue, Lin, Joe, and Kai (each friendship is stored as an edge in both directions).
Desired output: Jim, 3; Lin, 2; Sue, 1; Kai, 1; Joe, 1

Social Network Analysis: Count Friends

MAP: for each edge, emit the key/value pair (left-hand person, 1).
SHUFFLE: Jim, (1, 1, 1); Lin, (1, 1); Sue, 1; Joe, 1; Kai, 1
REDUCE: sum the 1s per key, giving Jim, 3; Lin, 2; Sue, 1; Kai, 1; Joe, 1
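A short Python sketch of this job (the edge list below is one hypothetical choice consistent with the counts on the slide; the slide does not fix the exact edges):

```python
from collections import defaultdict

# Symmetric edge list: each friendship appears in both directions.
# Hypothetical edges chosen to match the counts Jim=3, Lin=2, Sue=Joe=Kai=1.
edges = [("Jim", "Sue"), ("Sue", "Jim"),
         ("Jim", "Lin"), ("Lin", "Jim"),
         ("Jim", "Joe"), ("Joe", "Jim"),
         ("Lin", "Kai"), ("Kai", "Lin")]

# Map + shuffle: emit (left-hand person, 1) per edge and group by person.
groups = defaultdict(list)
for person, _friend in edges:
    groups[person].append(1)

# Reduce: sum the 1s for each person.
counts = {person: sum(ones) for person, ones in groups.items()}
print(counts)  # {'Jim': 3, 'Sue': 1, 'Lin': 2, 'Joe': 1, 'Kai': 1}
```

Because the edges are symmetric, counting only the left-hand endpoint of every directed edge counts each friendship exactly once per participant.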

Matrix Multiply in MapReduce

C = A x B, where A has dimensions m x n and B has dimensions n x l.

In the map phase:
for each element (i,j) of A, emit ((i,k), ("A", j, A[i,j])) for each k in 1..l
for each element (j,k) of B, emit ((i,k), ("B", j, B[j,k])) for each i in 1..m

Every pair is keyed by the output cell (i,k) it contributes to; the tag and the index j let the reducer pair up matching elements.

In the reduce phase, for each key (i,k), emit value = Sum_j (A[i,j] * B[j,k]).

Matrix Multiply in MapReduce: Illustrated

We have one reducer per output cell (i,k). Each reducer computes Sum_j (A[i,j] * B[j,k]).
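The scheme above can be sketched in Python with small concrete matrices (a simulation with 0-based indices; the grouping dictionary stands in for the shuffle):

```python
from collections import defaultdict

A = [[1, 2], [3, 4]]   # m x n = 2 x 2
B = [[5, 6], [7, 8]]   # n x l = 2 x 2
m, n, l = 2, 2, 2

# Map: each A[i][j] is sent to every output cell (i, k); each B[j][k] is
# sent to every output cell (i, k). Values are tagged with their matrix
# and the shared index j so the reducer can pair them up.
groups = defaultdict(list)
for i in range(m):
    for j in range(n):
        for k in range(l):
            groups[(i, k)].append(("A", j, A[i][j]))
for j in range(n):
    for k in range(l):
        for i in range(m):
            groups[(i, k)].append(("B", j, B[j][k]))

# Reduce: one reducer per output cell (i, k) computes sum_j A[i][j] * B[j][k].
C = {}
for (i, k), values in groups.items():
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    C[(i, k)] = sum(a[j] * b[j] for j in a)

print(C)  # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```

Note the replication cost: every element of A is emitted l times and every element of B is emitted m times, which is the price of keying everything by its output cell.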