Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA

Slides:



Advertisements
Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
Advertisements

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Overview of MapReduce and Hadoop
Distributed Computations
MapReduce: Simplified Data Processing on Large Clusters Cloud Computing Seminar SEECS, NUST By Dr. Zahid Anwar.
MapReduce Technical Workshop This presentation includes course content © University of Washington Redistributed under the Creative Commons Attribution.
MapReduce Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January Shahram.
MapReduce Simplified Data Processing on Large Clusters Google, Inc. Presented by Prasad Raghavendra.
Distributed Computations MapReduce
Lecture 2 – MapReduce: Theory and Implementation CSE 490H This presentation incorporates content licensed under the Creative Commons Attribution 2.5 License.
Lecture 2 – MapReduce: Theory and Implementation CSE 490h – Introduction to Distributed Computing, Spring 2007 Except as otherwise noted, the content of.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
Hadoop Ida Mele. Parallel programming Parallel programming is used to improve performance and efficiency In a parallel program, the processing is broken.
MapReduce.
Introduction to Parallel Programming MapReduce Except where otherwise noted all portions of this work are Copyright (c) 2007 Google and are licensed under.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
MapReduce. Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture emerging: – Cluster of.
Google MapReduce Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Presented by Conroy Whitney 4 th year CS – Web Development.
Introduction to MapReduce Amit K Singh. “The density of transistors on a chip doubles every 18 months, for the same cost” (1965) Do you recognize this.
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Parallel Programming Models Basic question: what is the “right” way to write parallel programs –And deal with the complexity of finding parallelism, coarsening.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
MapReduce How to painlessly process terabytes of data.
Google’s MapReduce Connor Poske Florida State University.
MapReduce M/R slides adapted from those of Jeff Dean’s.
Mass Data Processing Technology on Large Scale Clusters Summer, 2007, Tsinghua University All course material (slides, labs, etc) is licensed under the.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
1 MapReduce: Theory and Implementation CSE 490h – Intro to Distributed Computing, Modified by George Lee Except as otherwise noted, the content of this.
PPCC Spring Map Reduce1 MapReduce Prof. Chris Carothers Computer Science Department
Information Retrieval Lecture 9. Outline Map Reduce, cont. Index compression [Amazon Web Services]
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
MapReduce : Simplified Data Processing on Large Clusters P 謝光昱 P 陳志豪 Operating Systems Design and Implementation 2004 Jeffrey Dean, Sanjay.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
Map Reduce. Functional Programming Review r Functional operations do not modify data structures: They always create new ones r Original data still exists.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
MapReduce: simplified data processing on large clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
MapReduce: Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc.
MapReduce using Hadoop Jan Krüger … in 30 minutes...
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Lecture 4. MapReduce Instructor: Weidong Shi (Larry), PhD
Map reduce Cs 595 Lecture 11.
MapReduce: Simplified Data Processing on Large Clusters
Map Reduce.
Adapted from: Google & UWash’s Creative Common MR Deck
Introduction to MapReduce and Hadoop
MapReduce Simplied Data Processing on Large Clusters
湖南大学-信息科学与工程学院-计算机与科学系
February 26th – Map/Reduce
Hadoop Basics.
Cse 344 May 4th – Map/Reduce.
CS-4513 Distributed Computing Systems Hugh C. Lauer
Distributed System Gang Wu Spring,2018.
Cloud Computing MapReduce, Batch Processing
Introduction to MapReduce
CS639: Data Management for Data Science
5/7/2019 Map Reduce Map reduce.
Google Map Reduce OSDI 2004 slides
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA Workshops Grids vs. Clouds Beijing, 18.05.2011

Introduction Classical data intensive and data parallel applications Huge amount of data to be analyzed Analysis can be done: splitting data in a great number of small chunks applying analysis procedure to each chunk independently applying a specific procedure to the set of previous results to obtain the final goal In this case can be used the Map & Reduce method It fits nicely cloud computing

What is MapReduce? Simple data-parallel programming model designed for scalability and fault-tolerance Pioneered by Google Processes 20 petabytes of data per day Popularized by open-source Hadoop project Used at Yahoo!, Facebook, Amazon, …

What is MapReduce used for? At Google: Index construction for Google Search Article clustering for Google News Statistical machine translation At Yahoo!: “Web map” powering Yahoo! Search Spam detection for Yahoo! Mail At Facebook: Data mining Ad optimization Spam detection

What is MapReduce used for? In research: Astronomical image analysis (Washington) Bioinformatics (Maryland) Analyzing Wikipedia conflicts (PARC) Natural language processing (CMU) Particle physics (Nebraska) Ocean climate simulation (Washington)

Example Given: snapshot of the internet To find: The top N pages with the highest number of incoming links Also given: Infinite computing and storage resources Goal: Optimize it and make it simple enough so that anybody knowing a computer language can do this. CS591CLD Cloud Computing Seminar

Map Reduce Automatic parallelization & distribution Fault-tolerant Provides status and monitoring tools Clean abstraction for programmers CS591CLD Cloud Computing Seminar

Programming Model Borrows from functional programming Users implement interface of two functions: map (in_key, in_value) -> (out_key, intermediate_value) list reduce (out_key, intermediate_value list) -> out_value list CS591CLD Cloud Computing Seminar

Map Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line). map() produces one or more intermediate values along with an output key from the input of the form <key, value> CS591CLD Cloud Computing Seminar

Reduce Reduce After the map phase is over, all the intermediate values for a given output key are combined together into a list reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key) CS591CLD Cloud Computing Seminar

Map Reduce CS591CLD Cloud Computing Seminar

Parallelism map() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently Bottleneck: reduce phase can’t start until map phase is completely finished. CS591CLD Cloud Computing Seminar

Example: Count word occurrences map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1"); reduce(String output_key, Iterator intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result)); CS591CLD Cloud Computing Seminar

Example vs. Actual Source Code Example is written in pseudo-code Actual implementation is in Java, using Hadoop True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.) CS591CLD Cloud Computing Seminar

Locality Master program divides up tasks based on location of data: tries to have map() tasks on same machine as physical file data, or at least same rack map() task inputs are divided into 64 MB blocks: same size as Google File System chunks CS591CLD Cloud Computing Seminar

Fault Tolerance Master detects worker failures Re-executes completed & in-progress map() tasks Re-executes in-progress reduce() tasks Master notices particular input key/values cause crashes in map(), and skips those values on re-execution. Effect: Can work around bugs in third-party libraries! CS591CLD Cloud Computing Seminar

Optimizations No reduce can start until map is complete: A single slow disk controller can rate-limit the whole process Master redundantly executes “slow-moving” map tasks; uses results of first copy to finish CS591CLD Cloud Computing Seminar

Sort Backup tasks reduce job completion time a lot! Normal No backup tasks 200 processes killed M = 15000 R = 4000 Backup tasks reduce job completion time a lot! System deals well with failures CS591CLD Cloud Computing Seminar

Optimizations “Combiner” functions can run on same machine as a mapper Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth CS591CLD Cloud Computing Seminar

Some Applications Count of URL access frequency Reverse Web-Link Graph Distributed Grep: Map - Emits a line if it matches the supplied pattern Reduce - Copies the the intermediate data to output Count of URL access frequency Map – Process web log and outputs <URL, 1> Reduce - Emits <URL, total count> Reverse Web-Link Graph Map – process web log and outputs <target, source> Reduce - emits <target, list(source)> CS591CLD Cloud Computing Seminar

MapReduce Conclusions MapReduce has proven to be a useful abstraction Greatly simplifies large-scale computations at Google Functional programming paradigm can be applied to large-scale applications Fun to use: focus on problem, let library deal with messy details CS591CLD Cloud Computing Seminar

What is Hadoop? At Google MapReduce operation are run on a special file system called Google File System (GFS) that is highly optimized for this purpose. GFS is not open source. Doug Cutting and Yahoo! reverse engineered the GFS and called it Hadoop Distributed File System (HDFS). The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop. This is open source and distributed by Apache. 11/22/2018

Resources “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat GFS, BigTable, Sawzall, Chubby, Protocol Buffers Google Map Reduce Lecture Series HDFS Architecture CS591CLD Cloud Computing Seminar