Lecture 6: Practical Computing with Large Data Sets CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier Special thanks to Haeberlen and Ives at UPenn

Big Data in the 1800s The US Census in 1880 surveyed a record 50,189,209 people. – In 1870 the total had been roughly 39 million – Took 8 years to count and calculate!

Census in 1890 Had to find a new way to calculate the results. Hired a major company to come up with a new method.

Census in 1890 New methods were effective! – Population counted grew from 50M to 62M – Processing time dropped from 8 years to 1 year!

Dealing with Big Data How do we work smarter, instead of harder? Dealing with the modern census: – How could we quickly determine the population of a city?

Older Analogies Say you have 10,000 employees whose job it is to collate census forms by city. How would you organize and complete this task?

Challenges Suppose people take vacations, get sick, or work at different rates… Suppose some forms are filled out incorrectly and need to be corrected or thrown out… Suppose a supervisor gets sick… How do we monitor progress…

Define the Challenges What is the main challenge? – Are the individual tasks complicated? – What makes this challenging? How resilient is our solution? How well does it balance work across employees? – What factors affect this? How general is the set of techniques?

Dealing with the Details Wouldn't it be nice if there were some system that took care of all these details for you? Ideally, you'd just tell the system what needs to be done. That's the MapReduce framework.

Abstracting into a digital data flow [Diagram: census forms flow into a bank of "Filter+Stack" workers, whose stacks feed a bank of "Count Stack" workers, producing per-city totals such as Cincinnati: 4k, Columbus: 4k, Cleveland: 3k, Toledo: 1k, Indianapolis: 4k]

Abstracting once more There are two kinds of workers: – Those that take input data items and produce output items for the “stacks” – Those that take the stacks and aggregate the results to produce outputs on a per-stack basis We’ll call these: – map: takes (item_key, value), produces one or more (stack_key, value’) pairs – reduce: takes (stack_key, {set of value’}), produces one or more output results – typically (stack_key, agg_value) We will refer to this key as the reduce key
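To make this abstraction concrete, here is a minimal single-machine sketch in Java (not from the original slides); the CensusForm record and the in-memory "stacks" map are hypothetical stand-ins for the real input data and the framework's grouping machinery.

import java.util.*;

// A toy, single-machine sketch of the two worker roles described above.
public class CensusSketch {
    record CensusForm(long formId, String city) {}

    // map: takes (item_key, value), produces one or more (stack_key, value') pairs
    static List<Map.Entry<String, Integer>> map(long formId, CensusForm form) {
        return List.of(Map.entry(form.city(), 1));   // one tally mark per form
    }

    // reduce: takes (stack_key, {set of value'}), produces (stack_key, agg_value)
    static Map.Entry<String, Integer> reduce(String city, List<Integer> tallies) {
        int total = 0;
        for (int t : tallies) total += t;
        return Map.entry(city, total);
    }

    public static void main(String[] args) {
        List<CensusForm> forms = List.of(
                new CensusForm(1, "Cincinnati"), new CensusForm(2, "Columbus"),
                new CensusForm(3, "Cincinnati"), new CensusForm(4, "Toledo"));

        // "Filter+Stack" workers: run map on every form, group results by stack_key
        Map<String, List<Integer>> stacks = new TreeMap<>();
        for (CensusForm f : forms)
            for (Map.Entry<String, Integer> kv : map(f.formId(), f))
                stacks.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // "Count Stack" workers: run reduce once per stack
        for (Map.Entry<String, List<Integer>> stack : stacks.entrySet())
            System.out.println(reduce(stack.getKey(), stack.getValue()));
    }
}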

Why MapReduce? Scenario: – You have a huge amount of data, e.g., all the Google searches of the last three years – You would like to perform a computation on the data, e.g., find out which search terms were the most popular How would you do it? Analogy to the census example: – The computation isn't necessarily difficult, but parallelizing and distributing it, as well as handling faults, is challenging Idea: A programming language! – Write a simple program to express the (simple) computation, and let the language runtime do all the hard work

What is MapReduce? A famous distributed programming model. In many circles, considered the key building block for much of Google’s data analysis. – A programming language built on it: Sawzall – “… Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB).” – Other similar languages: Yahoo’s Pig Latin and Pig; Microsoft’s Dryad – Cloned in open source: Hadoop

The MapReduce programming model Simple distributed functional programming primitives Modeled after Lisp primitives: – map (apply function to all items in a collection) and – reduce (apply function to set of items with a common key) We start with: – A user-defined function to be applied to all data, map: (key, value) → (key, value) – Another user-specified operation reduce: (key, {set of values}) → result – A set of n nodes, each with data All nodes run map on all of their data, producing new data with keys – This data is collected by key, then shuffled, and finally reduced – Dataflow is through temp files on GFS
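A minimal sketch of this model on a single machine (illustration only; the real runtime distributes these phases across nodes and temp files, and the names here are purely illustrative):

import java.util.*;
import java.util.function.BiFunction;

// A tiny, single-node illustration of the programming model above: run a
// user-supplied map over every input pair, collect the outputs by key
// (the "shuffle"), then run a user-supplied reduce once per key.
public class MiniMapReduce {
    static <K1, V1, K2, V2, R> List<R> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, R> reduceFn) {

        // Map phase: apply mapFn to every (key, value) pair, collect outputs by key
        Map<K2, List<V2>> shuffled = new LinkedHashMap<>();
        for (Map.Entry<K1, V1> kv : input.entrySet())
            for (Map.Entry<K2, V2> out : mapFn.apply(kv.getKey(), kv.getValue()))
                shuffled.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());

        // Reduce phase: one call per intermediate key, over all of its values
        List<R> results = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : shuffled.entrySet())
            results.add(reduceFn.apply(group.getKey(), group.getValue()));
        return results;
    }

    public static void main(String[] args) {
        // Word count expressed against this toy runtime
        Map<Integer, String> lines = Map.of(1, "the apple", 2, "is an apple");
        BiFunction<Integer, String, List<Map.Entry<String, Integer>>> mapFn = (lineNo, text) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : text.split(" ")) out.add(Map.entry(w, 1));
            return out;
        };
        BiFunction<String, List<Integer>, String> reduceFn =
                (word, ones) -> word + ": " + ones.size();
        run(lines, mapFn, reduceFn).forEach(System.out::println);
    }
}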

Simple example: Word count Goal: Given a set of documents, count how often each word occurs – Input: Key-value pairs (document:lineNumber, text) – Output: Key-value pairs (word, #occurrences) – What should be the intermediate key-value pairs?

map(String key, String value) {
    // key: document name, line no
    // value: contents of line
    for each word w in value:
        emit(w, "1")
}

reduce(String key, Iterator values) {
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    emit(key, result)
}
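For comparison, here is roughly how the same pseudocode is written against Hadoop's Java MapReduce API (a sketch closely following the standard WordCount example; tokenization and configuration details are simplified):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map: (byte offset, line of text) -> (word, 1) for each word in the line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);           // emit(w, "1")
            }
        }
    }

    // reduce: (word, {1, 1, ...}) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);             // emit(key, result)
        }
    }
}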

Simple example: Word count (tracing the dataflow)
Input pairs: (1, the apple) (2, is an apple) (3, not an orange) (4, because the) (5, orange) (6, unlike the apple) (7, is orange) (8, not green)
Mapper (1-2) emits: (the, 1) (apple, 1) (is, 1) (an, 1) (apple, 1)
Mapper (3-4) emits: (not, 1) (an, 1) (orange, 1) (because, 1) (the, 1)
Mapper (5-6) emits: (orange, 1) (unlike, 1) (the, 1) (apple, 1)
Mapper (7-8) emits: (is, 1) (orange, 1) (not, 1) (green, 1)
Reducer input, grouped by key (each reducer owns a key range: A-G, H-N, O-U, V-Z): (an, {1, 1}) (apple, {1, 1, 1}) (because, {1}) (green, {1}) (is, {1, 1}) (not, {1, 1}) (orange, {1, 1, 1}) (the, {1, 1, 1}) (unlike, {1})
Reducer output: (an, 2) (apple, 3) (because, 1) (green, 1) (is, 2) (not, 2) (orange, 3) (the, 3) (unlike, 1)
Each mapper receives some of the KV-pairs as input and processes them one by one. Each KV-pair output by a mapper is sent to the reducer responsible for it (based on the key range that node is responsible for). The reducers sort their input by key, group it, and process their input one group at a time.

MapReduce dataflow [Diagram: Input data → Mappers → intermediate (key, value) pairs → "The Shuffle" → Reducers → Output data] What is meant by a 'dataflow'? What makes this so scalable?

More examples Distributed grep – all lines matching a pattern – Map: filter by pattern – Reduce: output set Count URL access frequency – Map: output each URL as key, with count 1 – Reduce: sum the counts Reverse web-link graph – Map: output (target, source) pairs when a link to target is found in source – Reduce: concatenate the values and emit (target, list(source)) Inverted index – Map: emit (word, documentID) – Reduce: combine these into (word, list(documentID))
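As a concrete instance, a Hadoop-style sketch of the inverted index (class names are illustrative, and it assumes an input format such as KeyValueTextInputFormat that delivers (documentID, contents) pairs):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
    // Map: emit (word, documentID) for every word in the document
    public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
        private final Text word = new Text();

        @Override
        protected void map(Text docId, Text contents, Context context)
                throws IOException, InterruptedException {
            for (String w : contents.toString().split("\\s+")) {
                word.set(w);
                context.write(word, docId);
            }
        }
    }

    // Reduce: combine the document IDs for one word into (word, list(documentID))
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text id : docIds) {
                if (list.length() > 0) list.append(',');
                list.append(id.toString());
            }
            context.write(word, new Text(list.toString()));
        }
    }
}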

Common mistakes to avoid Mapper and reducer should be stateless – Don't use static variables - after map + reduce return, they should remember nothing about the processed data! – Reason: No guarantees about which key-value pairs will be processed by which workers! Don't try to do your own I/O! – Don't try to read from, or write to, files in the file system – The MapReduce framework does all the I/O for you: All the incoming data will be fed as arguments to map and reduce Any data your functions produce should be output via emit HashMap h = new HashMap(); map(key, value) { if (h.contains(key)) { h.add(key,value); emit(key, "X"); } } Wrong! map(key, value) { File foo = new File("xyz.txt"); while (true) { s = foo.readLine();... } } Wrong!
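For contrast, a minimal sketch of a well-behaved, stateless map: it only inspects the single pair it was handed and sends all output through emit (the Emitter interface here is a hypothetical stand-in for the framework's output channel):

import java.util.function.BiConsumer;

// A stateless counterpart to the "Wrong!" snippets above: no static or
// shared state, no direct file I/O; all output goes through emit().
public class StatelessMapExample {
    // Hypothetical stand-in for the framework's output channel
    interface Emitter extends BiConsumer<String, String> {}

    // map only looks at the single pair it was handed, and only emits
    static void map(String key, String value, Emitter emit) {
        emit.accept(key, "X");
    }

    public static void main(String[] args) {
        map("someKey", "someValue", (k, v) -> System.out.println(k + " -> " + v));
    }
}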

More common mistakes to avoid Mapper must not map too much data to the same key – In particular, don't map everything to the same key!! – Otherwise the reduce worker will be overwhelmed! – It's okay if some reduce workers have more work than others Example: In WordCount, the reduce worker that works on the key 'and' has a lot more work than the reduce worker that works on 'syzygy'. map(key, value) { emit("FOO", key + " " + value); } reduce(key, value[]) { /* do some computation on all the values */ } Wrong!

Designing MapReduce algorithms Key decision: What should be done by map, and what by reduce ? – map can do something to each individual key-value pair, but it can't look at other key-value pairs Example: Filtering out key-value pairs we don't need – map can emit more than one intermediate key-value pair for each incoming key-value pair Example: Incoming data is text, map produces (word,1) for each word – reduce can aggregate data; it can look at multiple values, as long as map has mapped them to the same (intermediate) key Example: Count the number of words, add up the total cost,... Need to get the intermediate format right! – If reduce needs to look at several values together, map must emit them using the same key!
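A small sketch of the filtering pattern mentioned above: map looks at each key-value pair in isolation and simply drops the pairs that don't match (a local illustration with made-up names; in a real job the surviving pairs would be emitted to the framework and reduce could be the identity):

import java.util.*;

// Illustration of the "filter in map" pattern: emit a pair unchanged if it
// matches a predicate, otherwise emit nothing.
public class FilterPatternSketch {
    static Optional<Map.Entry<String, String>> map(String key, String value, String pattern) {
        return value.contains(pattern) ? Optional.of(Map.entry(key, value)) : Optional.empty();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> lines = List.of(
                Map.entry("doc1:1", "the quick brown fox"),
                Map.entry("doc1:2", "jumps over the lazy dog"));
        for (Map.Entry<String, String> kv : lines)
            map(kv.getKey(), kv.getValue(), "fox").ifPresent(System.out::println);
    }
}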

More details on the MapReduce data flow [Diagram: data partitions → map computation partitions → redistribution by the output’s key ("shuffle") → reduce computation partitions, supervised by a coordinator (default MapReduce uses the filesystem)]

Some additional details To make this work, we need a few more parts… The file system (distributed across all nodes): – Stores the inputs, outputs, and temporary results The driver program (executes on one node): – Specifies where to find the inputs, the outputs – Specifies what mapper and reducer to use – Can customize behavior of the execution The runtime system (controls nodes): – Supervises the execution of tasks – Esp. JobTracker
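For concreteness, a sketch of such a driver in Hadoop, assuming the WordCount mapper and reducer classes sketched earlier; the input and output paths are placeholders taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");      // the driver runs on one node
        job.setJarByClass(WordCountDriver.class);

        // Which mapper and reducer to use, and their output types
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Where to find the inputs and where to put the outputs
        // (placeholder paths in the shared/distributed file system)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Hand the job to the runtime system and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}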

Some details Fewer computation partitions than data partitions – All data is accessible via a distributed filesystem with replication – Worker nodes produce data in key order (makes it easy to merge) – The master is responsible for scheduling, keeping all nodes busy – The master knows how many data partitions there are, which have completed – atomic commits to disk Locality: Master tries to do work on nodes that have replicas of the data Master can deal with stragglers (slow machines) by re-executing their tasks somewhere else

What if a worker crashes? We rely on the file system being shared across all the nodes Two types of (crash) faults: – Node wrote its output and then crashed Here, the file system is likely to have a copy of the complete output – Node crashed before finishing its output The JobTracker sees that the task isn’t making progress, and restarts that task elsewhere on the system (Of course, we have fewer nodes to do work…) But what if the master crashes?

Other challenges Locality – Try to schedule map task on machine that already has data Task granularity – How many map tasks? How many reduce tasks? Dealing with stragglers – Schedule some backup tasks Saving bandwidth – E.g., with combiners Handling bad records – "Last gasp" packet with current sequence number
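On saving bandwidth: a combiner pre-aggregates each mapper's output locally before it crosses the network. A toy illustration of the effect (not Hadoop code; the pre-aggregation is valid here only because summing counts is associative and commutative):

import java.util.*;

// Illustration of why combiners save bandwidth: pre-aggregate one mapper's
// (word, 1) output locally before the shuffle, so far fewer intermediate
// pairs cross the network.
public class CombinerSketch {
    public static void main(String[] args) {
        List<String> mapperInput = List.of("the apple", "is an apple", "unlike the apple");

        // Without a combiner: one (word, 1) pair per word occurrence
        List<Map.Entry<String, Integer>> raw = new ArrayList<>();
        for (String line : mapperInput)
            for (String w : line.split(" "))
                raw.add(Map.entry(w, 1));

        // With a combiner: locally sum the counts for this mapper's output
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : raw)
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);

        System.out.println("pairs shuffled without combiner: " + raw.size());
        System.out.println("pairs shuffled with combiner:    " + combined.size());
        System.out.println(combined);
    }
}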

Scale and MapReduce From a particular Google paper on a language built over MapReduce: – … Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB). Source: Interpreting the Data: Parallel Analysis with Sawzall (Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan)

For next time Midterms due Thursday.