MapReduce
Computer Engineering Department, Distributed Systems Course
Assoc. Prof. Dr. Ahmet Sayar, Kocaeli University - Fall 2015

What Does "Scalable" Mean?
Operationally:
- In the past: works even if the data does not fit in main memory
- Now: can make use of 1000s of cheap computers
Algorithmically:
- In the past: given N data items, you must do no more than N^m operations (a polynomial-time algorithm)
- Now: given N data items, you should do no more than N^m / k operations, for some large k
- Polynomial-time algorithms must be parallelized

Example: Find Matching DNA Sequences
Given a set of sequences, find all sequences equal to "GATTACGATTATTA".

Sequential (Linear) Search
Scan the records one at a time, comparing each against the target until a match is found.

Sequential (Linear) Search
40 records take 40 comparisons; N records take N comparisons. The algorithmic complexity is order N: O(N).

What if the Sequences Are Sorted?
Compare against the middle record: GATATTTTAAGC < GATTACGATTATTA, so there is no match in the lower half; keep searching in the other half. Each step halves the remaining records: O(log N).
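A minimal sketch of the idea in Python, using the standard bisect module on a sorted list (the sequence values here are invented for illustration):

    import bisect

    # bisect assumes the list is sorted
    sequences = sorted(["AACGT", "GATATTTTAAGC", "GATTACGATTATTA", "TTGCA"])
    target = "GATTACGATTATTA"

    i = bisect.bisect_left(sequences, target)             # O(log N) comparisons
    print(i < len(sequences) and sequences[i] == target)  # True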

New Task: Read Trimming
Given a set of DNA sequences, trim the final n bps of each sequence and generate a new dataset.
Can we use an index? No: we have to touch every record no matter what, so the cost is O(N). Can we do any better?

Parallelization: O(?) If the records are split across k machines that trim in parallel, the elapsed cost drops to roughly O(N/k).

New Task: Convert 405K TIFF Images to PNG

Another Example: Computing the Word Frequency of Every Word in a Single Document

There is a pattern here…
- A function that maps a read to a trimmed read
- A function that maps a TIFF image to a PNG image
- A function that maps a document to its most common word
- A function that maps a document to a histogram of word frequencies

Compute Word Frequency Across all Documents

The result is a set of (word, count) pairs.

MAPREDUCE
- How to split things into pieces
- How to write map and reduce

Map Reduce
Map-reduce is a high-level programming model and implementation for large-scale data processing.
- The programming model is based on functional programming: every record is assumed to be in the form of a (key, value) pair.
- Google published the map-reduce paper in 2004.
- The free variant is Hadoop: Java, from Apache.

Example: Upper-case Mapper in ML
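The slide's ML code is an image and is not in this transcript; a minimal Python sketch of the same idea, with one output pair emitted per input pair and both halves upper-cased, might look like this:

    def map_uppercase(key, value):
        # one input pair -> exactly one output pair
        yield (key.upper(), value.upper())

    print(list(map_uppercase("apple", "pie")))  # [('APPLE', 'PIE')]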

Example: Explode Mapper
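The explode mapper's code is likewise an image; the pattern it illustrates is one input pair fanning out to many output pairs. A hedged Python sketch, emitting one pair per character:

    def map_explode(key, value):
        # one input pair -> many output pairs
        for ch in value:
            yield (key, ch)

    print(list(map_explode("doc1", "cat")))
    # [('doc1', 'c'), ('doc1', 'a'), ('doc1', 't')]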

Example: Filter Mapper
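A filter mapper emits zero or one pairs per input, depending on a predicate. A sketch, with an even-number predicate invented for illustration:

    def map_filter(key, value):
        # one input pair -> zero or one output pairs
        if value % 2 == 0:
            yield (key, value)

    print(list(map_filter("a", 4)))  # [('a', 4)]
    print(list(map_filter("b", 3)))  # []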

Example: Chaining Keyspaces (the output key is an int)
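A sketch of the chaining idea: the mapper below consumes string keys but emits integer keys (word lengths), so the next job in the chain runs over a different keyspace. The function name is illustrative:

    def map_word_length(key, word):
        # input key is a string; output key is an int
        yield (len(word), word)

    print(list(map_word_length("doc1", "pancakes")))  # [(8, 'pancakes')]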

Data Model: Files
A file = a bag of (key, value) pairs.
A map-reduce program:
- Input: a bag of (inputkey, value) pairs
- Output: a bag of (outputkey, value) pairs

Step 1: Map Phase
The user provides the Map function:
- Input: an (input key, value) pair
- Output: a bag of (intermediate key, value) pairs
The system applies the map function in parallel to all (input key, value) pairs in the input file.

Step 2: Reduce Phase
The user provides the Reduce function:
- Input: (intermediate key, bag of values)
- Output: a bag of output values
The system groups all pairs with the same intermediate key and passes the bag of values to the reduce function.

Reduce
After the map phase is over, all the intermediate values for a given output key are combined into a list. Reduce() combines those intermediate values into one or more final values for that same output key. In practice there is usually only one final value per key.

Example: Sum Reducer
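The sum reducer's code is also an image; a minimal Python sketch, collapsing the bag of intermediate values for a key into a single total:

    def reduce_sum(key, values):
        # all intermediate values for one key -> one final value
        yield (key, sum(values))

    print(list(reduce_sum("pancakes", [1, 1, 1])))  # [('pancakes', 3)]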

In Summary
Input and output: each is a set of key/value pairs.
The programmer specifies two functions:
Map(in_key, in_value) -> list(out_key, intermediate_value)
- Processes an input key/value pair
- Produces a set of intermediate pairs
Reduce(out_key, list(intermediate_value)) -> list(out_value)
- Combines all intermediate values for a particular key
- Produces a set of merged output values (usually just one)
The model is inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell.

Example: What Does This Do?
It is the word count application of map-reduce: count how often each word occurs across the input documents.
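A self-contained sketch in plain Python: map and reduce for word count, plus a toy sequential driver standing in for the framework's parallel map, shuffle, and reduce phases (run_mapreduce and the sample input are invented for illustration):

    from collections import defaultdict

    def map_fn(doc_id, text):
        for word in text.split():
            yield (word, 1)               # one pair per occurrence

    def reduce_fn(word, counts):
        yield (word, sum(counts))

    def run_mapreduce(pairs, map_fn, reduce_fn):
        groups = defaultdict(list)
        for k, v in pairs:                # map phase
            for ik, iv in map_fn(k, v):
                groups[ik].append(iv)     # shuffle: group by intermediate key
        results = []
        for ik, ivs in groups.items():    # reduce phase
            results.extend(reduce_fn(ik, ivs))
        return results

    print(run_mapreduce([("d1", "a rose is a rose")], map_fn, reduce_fn))
    # [('a', 2), ('rose', 2), ('is', 1)]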

Example: Word Length Histogram

- Big = yellow = 10+ letters
- Medium = red = 5..9 letters
- Small = blue = 2..4 letters
- Tiny = pink = 1 letter
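A sketch of the histogram job using the bucket boundaries above; these functions plug into the same toy run_mapreduce driver sketched earlier:

    def map_fn(doc_id, text):
        for word in text.split():
            n = len(word)
            if n >= 10:
                bucket = "big"       # 10+ letters
            elif n >= 5:
                bucket = "medium"    # 5..9 letters
            elif n >= 2:
                bucket = "small"     # 2..4 letters
            else:
                bucket = "tiny"      # 1 letter
            yield (bucket, 1)

    def reduce_fn(bucket, counts):
        yield (bucket, sum(counts))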

More Examples: Building an Inverted Index
Input:
- Tweet1, ("I love pancakes for breakfast")
- Tweet2, ("I dislike pancakes")
- Tweet3, ("what should I eat for breakfast")
- Tweet4, ("I love to eat")
Desired output:
- "pancakes", (tweet1, tweet2)
- "breakfast", (tweet1, tweet3)
- "eat", (tweet3, tweet4)
- "love", (tweet1, tweet4)
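A sketch of map and reduce for the inverted index; set() keeps each word at most once per tweet, and the lower-casing is an assumption of this toy version:

    def map_fn(tweet_id, text):
        for word in set(text.lower().split()):
            yield (word, tweet_id)        # posting: word -> tweet

    def reduce_fn(word, tweet_ids):
        yield (word, sorted(tweet_ids))   # the word's posting list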

More Examples: Relational Joins

Relational Join MapReduce: Before Map Phase

Relational Join MapReduce: Map Phase

Relational Join MapReduce: Reduce Phase

Relational Join in MapReduce: Another Example
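A sketch of the standard reduce-side join: the mapper tags each row with its table of origin and re-keys it by the join attribute, and the reducer pairs up rows from the two sides. The table names and the emp_id join attribute are invented for illustration:

    def map_fn(table_name, row):
        # re-key each row by the join attribute, remembering its table
        yield (row["emp_id"], (table_name, row))

    def reduce_fn(join_key, tagged_rows):
        employees   = [r for t, r in tagged_rows if t == "Employee"]
        departments = [r for t, r in tagged_rows if t == "Department"]
        for e in employees:               # pair every matching row
            for d in departments:
                yield (join_key, {**e, **d})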

Simple Social Network Analysis: Count Friends
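A sketch of the friend count: the mapper emits a 1 for each endpoint of every undirected friendship edge, the shuffle groups the 1s by person, and the reducer sums them (the edge-list input format is an assumption):

    def map_fn(_, edge):
        a, b = edge          # one undirected friendship edge
        yield (a, 1)
        yield (b, 1)

    def reduce_fn(person, ones):
        yield (person, sum(ones))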

Taxonomy of Parallel Architectures

Cluster Computing
- A large number of commodity servers, connected by a high-speed commodity network
- A rack holds a small number of servers; a data center holds many racks
- Massive parallelism: 100s or 1000s of servers, jobs running for many hours
- Failure: if the mean time between failures for a single server is 1 year, then 1000 servers see roughly one failure every 9 hours (8760 hours per year / 1000 servers)

Distributed File System (DFS)
- For very large files: TBs, PBs
- Each file is partitioned into chunks, typically 64 MB
- Each chunk is replicated several times (>2), on different racks, for fault tolerance
- Implementations: Google's DFS (GFS, proprietary); Hadoop's DFS (HDFS, open source)
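A toy sketch of the chunking-and-replication bookkeeping described above; the round-robin rack placement is a simplification for illustration, not GFS's or HDFS's actual policy:

    CHUNK_MB = 64                                  # typical chunk size

    def place_chunks(file_mb, racks, replicas=3):
        # split a file into 64 MB chunks; put each replica on a different rack
        plan = {}
        n_chunks = -(-file_mb // CHUNK_MB)         # ceiling division
        for c in range(n_chunks):
            plan[c] = [racks[(c + r) % len(racks)] for r in range(replicas)]
        return plan

    print(place_chunks(200, ["rack1", "rack2", "rack3", "rack4"]))
    # 4 chunks, each with 3 replicas on distinct racks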

HDFS: Motivation
- Based on Google's GFS
- Redundant storage of massive amounts of data on cheap and unreliable computers
- Why not use an existing file system? Different workload and design priorities; HDFS handles much bigger dataset sizes than other file systems

Assumptions
- High component failure rates: inexpensive commodity components fail all the time
- Modest number of HUGE files: just a few million, each 100 MB or larger; multi-GB files are typical
- Files are write-once and mostly appended to, perhaps concurrently
- Large streaming reads
- High sustained throughput favored over low latency

HDFS Design Decisions
- Files are stored as blocks, much larger than in most file systems (the default is 64 MB)
- Reliability through replication: each block is replicated across 3+ DataNodes
- A single master (the NameNode) coordinates access and metadata: simple, centralized management
- No data caching: little benefit, due to large data sets and streaming reads
- Familiar interface, but a customized API: simplify the problem and focus on distributed apps

Based on GFS Architecture
