NETE4631 Network Information Systems (NISs): Big Data and Scaling in the Cloud. Suronapee, PhD

Presentation transcript:


Huge Amount of Data: today's statistics

Statistical facts: number of devices

Big Data
- A collection of data sets so large and complex that it is impossible to process on one computer with the usual databases and tools
- Big Data represents information assets characterized by “High Volume, Velocity, and Variety”
- Because of its size and complexity, Big Data is hard to capture, store, copy, delete (privacy), search, share, analyze, and visualize

Big Data Processing
- Combining Big Data with the cloud makes processing it possible
- Big Data requires specific technology and analytical methods to transform it into value
- Flow: input, process, derived data

MapReduce
- What is MapReduce?
  - Programming model inspired by the map and reduce primitives of LISP
  - Scatter and gather principles
  - Many problems can be phrased this way
  - Large input data makes an otherwise simple computation impossible on one machine
- Advantages
  - Easy to process and generate large data sets
  - Hides the difficulty of writing parallel code
  - System takes care of scheduling, load balancing, handling machine failures, etc.

MapReduce Programming Model
- The computation takes a set of input key/value pairs and produces a set of output key/value pairs.
- Users express the computation as two functions: Map and Reduce.
- Map
  - Takes an input pair and produces a set of intermediate key/value pairs.
- Reduce
  - Accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values (typically one output value).

Word Count
- Count the number of times each distinct word appears in the file
- MAP(KEY = LINE, VALUE = CONTENTS):
- REDUCE(KEY, VALUES):
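The bodies of MAP and REDUCE did not survive in the transcript. The following is a minimal sketch of how word count is usually expressed in this model, not the slide's own code. The names `map_fn`, `reduce_fn`, and `run_word_count` are made up for illustration, and the shuffle step is simulated in memory on one machine.

```python
from collections import defaultdict


def map_fn(line_no, contents):
    """Map: emit (word, 1) for every word in one line of the file."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: sum all the 1s emitted for a single word."""
    return word, sum(counts)


def run_word_count(lines):
    # Shuffle step, simulated in memory: group intermediate values by key.
    groups = defaultdict(list)
    for line_no, contents in enumerate(lines):
        for word, one in map_fn(line_no, contents):
            groups[word].append(one)
    # One reduce call per distinct word.
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


counts = run_word_count(["the quick brown fox", "the lazy dog", "The fox"])
```

Lower-casing the words is a choice made for this sketch; the slide does not say whether counting is case-sensitive.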

Word Count Illustrated

Observation
- Conceptually, the map and reduce functions supplied by the user have associated types
- The input keys and values are drawn from a different domain than the output keys and values.
- Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
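One way to make these domains concrete is to write them as type aliases. This is only a sketch: the names `K1`, `V1`, `K2`, `V2`, `MapFn`, and `ReduceFn` are illustrative labels, not part of any MapReduce library.

```python
from typing import Callable, Iterable, Tuple, TypeVar

# Input domain.
K1 = TypeVar("K1")
V1 = TypeVar("V1")
# Intermediate domain, shared with the output.
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map:    (k1, v1)       -> list of (k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: (k2, list(v2)) -> list of v2
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]


def wc_map(offset: int, line: str) -> Iterable[Tuple[str, int]]:
    """Word count: input is (int, str), intermediate is (str, int)."""
    return [(word, 1) for word in line.split()]


def wc_reduce(word: str, counts: Iterable[int]) -> Iterable[int]:
    """Output values are ints, the same domain as the intermediate values."""
    return [sum(counts)]
```

For word count the input pair is (line offset, line text), while both the intermediate and the output pairs are (word, count).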

PageRank Algorithm
- Phase 1: Propagation
- Phase 2: Aggregation
- Input: a pool of objects, including both vertices and edges

PageRank: Propagation
- Map: for each object
  - If object is a vertex, emit key=URL, value=object
  - If object is an edge, emit key=source URL, value=object
- Reduce: (input is a web page and all its outgoing links)
  - Find the number of edge objects -> outgoing links
  - Read the PageRank value from the vertex object
  - Assign PR(edges) = PR(vertex) / num_outgoing

PageRank: Aggregation
- Map: for each object
  - If object is a vertex, emit key=URL, value=object
  - If object is an edge, emit key=destination URL, value=object
- Reduce: (input is a web page and all its incoming links)
  - Add the PR values of all incoming links
  - Assign PR(vertex) = Σ PR(incoming links)
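The two phases can be sketched as one in-memory iteration. The dictionary layout for vertex and edge objects and the helper names are assumptions made for this example. It follows the slides literally, so there is no damping factor, and it assumes every URL has a vertex object.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Simulated shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def propagation(objects):
    # Map: vertices keyed by their URL, edges keyed by their source URL.
    pairs = [(o["url"] if o["kind"] == "vertex" else o["src"], o) for o in objects]
    out = []
    for url, group in group_by_key(pairs).items():
        vertex = next(o for o in group if o["kind"] == "vertex")
        edges = [o for o in group if o["kind"] == "edge"]
        out.append(vertex)
        # Reduce: PR(edge) = PR(vertex) / num_outgoing
        out.extend({**e, "pr": vertex["pr"] / len(edges)} for e in edges)
    return out


def aggregation(objects):
    # Map: vertices keyed by their URL, edges keyed by their destination URL.
    pairs = [(o["url"] if o["kind"] == "vertex" else o["dst"], o) for o in objects]
    out = []
    for url, group in group_by_key(pairs).items():
        vertex = next(o for o in group if o["kind"] == "vertex")
        edges = [o for o in group if o["kind"] == "edge"]
        # Reduce: PR(vertex) = sum of PR over incoming edges
        out.append({**vertex, "pr": sum(e["pr"] for e in edges)})
        out.extend(edges)  # keep the edges for the next iteration
    return out


graph = [
    {"kind": "vertex", "url": "A", "pr": 1 / 3},
    {"kind": "vertex", "url": "B", "pr": 1 / 3},
    {"kind": "vertex", "url": "C", "pr": 1 / 3},
    {"kind": "edge", "src": "A", "dst": "B", "pr": 0.0},
    {"kind": "edge", "src": "A", "dst": "C", "pr": 0.0},
    {"kind": "edge", "src": "B", "dst": "C", "pr": 0.0},
    {"kind": "edge", "src": "C", "dst": "A", "pr": 0.0},
]
one_round = aggregation(propagation(graph))
ranks = {o["url"]: o["pr"] for o in one_round if o["kind"] == "vertex"}
```

Feeding `one_round` back into `propagation` runs the next iteration; in a real deployment each phase would be its own MapReduce job.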

More Examples
- Distributed Grep:
  - Map: emits a line if it matches a supplied pattern
  - Reduce: copies the supplied intermediate data to the output
- Count of URL Access Frequency:
  - Map: processes logs of web page requests, outputs (URL, 1)
  - Reduce: adds together all values for the same URL and emits (URL, total count) pairs
- Distributed Sort:
  - Map: extracts a key from each record, and emits a (key, record) pair
  - Reduce: emits all pairs unchanged
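As a rough illustration, the first two examples can be written as plain functions and driven by hand. The log format `<client> <URL>` and all function names here are hypothetical.

```python
import re
from collections import defaultdict


def grep_map(line, pattern):
    """Distributed grep, map side: emit the line only if it matches."""
    if re.search(pattern, line):
        yield line, ""


def grep_reduce(line, values):
    """Distributed grep, reduce side: identity, copy the line to the output."""
    return line


def url_map(log_line):
    """URL access frequency, map side: emit (URL, 1) per request."""
    client, url = log_line.split()  # assumed log format: "<client> <URL>"
    yield url, 1


def url_reduce(url, counts):
    """URL access frequency, reduce side: emit (URL, total count)."""
    return url, sum(counts)


logs = ["10.0.0.1 /index.html", "10.0.0.2 /about.html", "10.0.0.3 /index.html"]

matches = [grep_reduce(k, v) for line in logs for k, v in grep_map(line, r"index")]

groups = defaultdict(list)
for line in logs:
    for url, one in url_map(line):
        groups[url].append(one)
totals = dict(url_reduce(url, counts) for url, counts in groups.items())
```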

Implementation: overall flow of a MapReduce operation

Execution Overview
- When the user program calls MapReduce, the following sequence of actions occurs:
  - The MapReduce library first splits the input files into M pieces (typically 16-64 MB each) and starts up many copies of the program on a cluster of machines
  - The master, which is one of the program copies, assigns work to the workers
  - A worker that is assigned a map task does the following:
    - Reads the contents of the corresponding input split
    - Parses key/value pairs from the input data and passes each pair to the Map function
    - Buffers the intermediate key/value pairs it produces in memory
    - Writes the buffered pairs to local disk, partitioned into R regions by the partitioning function (their locations are passed back to the master)
  - The master forwards these locations to the reduce workers
  - A reduce worker reads the intermediate data and sorts it by key
  - The reduce worker applies the Reduce function and appends the result to the output
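The partitioning step is the least obvious part of this flow. Here is a small sketch, assuming the common `hash(key) mod R` partitioner; the slide only says "the partitioning function". `zlib.crc32` stands in for the hash so that results are stable across runs, and the function names are illustrative.

```python
import zlib


def partition(key, num_regions):
    """Assumed partitioner: hash(key) mod R, with CRC32 as a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_regions


def spill_to_regions(buffered_pairs, num_regions):
    """Split one map worker's buffered pairs into R regions on 'local disk'."""
    regions = [[] for _ in range(num_regions)]
    for key, value in buffered_pairs:
        regions[partition(key, num_regions)].append((key, value))
    return regions


R = 3
mapper_a = spill_to_regions([("fox", 1), ("the", 1), ("dog", 1)], R)
mapper_b = spill_to_regions([("the", 1), ("fox", 1)], R)

# Reduce worker r reads region r from every map worker.
reducer_inputs = [mapper_a[r] + mapper_b[r] for r in range(R)]
```

Because every map worker uses the same function, all pairs for a given key end up at the same reduce worker.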

Parallelism
- map() functions run in parallel, creating different intermediate values from different input data sets
- reduce() functions also run in parallel, each working on a different output key
- All values are processed independently
- Bottleneck: the reduce phase can’t start until the map phase is completely finished
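The barrier between the two phases can be seen in a toy version where a thread pool stands in for the worker machines. This is an analogy on a single computer, not how a cluster runs; `map_task` and `reduce_task` are illustrative names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """One map task: word count over a single input split."""
    return [(word, 1) for word in split.split()]


def reduce_task(item):
    """One reduce task: handles exactly one output key."""
    key, values = item
    return key, sum(values)


splits = ["a b a", "b c", "a c c"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Map tasks run in parallel, one per split. list() waits for all of them,
    # which is the barrier: no reduce task starts before every map is done.
    map_outputs = list(pool.map(map_task, splits))

    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)

    # Reduce tasks also run in parallel, each on a different key.
    result = dict(pool.map(reduce_task, groups.items()))
```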

Combiners
- Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
  - E.g., popular words in Word Count
- Can save network time by pre-aggregating at the mapper
  - Combine (k1, list(v1)) -> v2
  - Usually the same as the reduce function
- Works only if the reduce function is commutative and associative
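A short sketch of the saving for word count, where the combiner is the same summation as the reducer. The function name `map_with_combiner` is made up for this example.

```python
from collections import Counter


def map_with_combiner(split):
    """Return the raw map output and the combined output for one split."""
    raw = [(word, 1) for word in split.split()]
    # Combiner: pre-aggregate on the map worker before anything is sent.
    combined = Counter()
    for word, count in raw:
        combined[word] += count
    return raw, list(combined.items())


raw, combined = map_with_combiner("the cat and the hat and the bat")
pairs_saved = len(raw) - len(combined)
```

Summation qualifies because partial sums can be added in any order and grouping. Averaging does not: the mean of partial means is generally not the overall mean, so a mean reducer cannot simply be reused as its own combiner.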

Hadoop

Hadoop Execution
1. Client submits the “wordcount” job, indicating code and input files
2. JobTracker breaks the input file into k chunks (64 MB each) and assigns work to TaskTrackers
3. After map(), TaskTrackers exchange map output so that it is grouped by key
4. JobTracker breaks the reduce() keyspace into m chunks and assigns work
5. reduce() output may go to HDFS
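As a small worked example of step 2, the number of chunks k follows from the file size and the 64 MB chunk size quoted above. The helper name and the file sizes are illustrative.

```python
import math

CHUNK_MB = 64  # chunk size quoted on the slide


def num_map_chunks(file_size_mb, chunk_mb=CHUNK_MB):
    """Number of chunks k; the last chunk may be only partly full."""
    return math.ceil(file_size_mb / chunk_mb)


k_for_1000_mb = num_map_chunks(1000)  # 1000 / 64 = 15.625, rounded up
```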

Map-Machine
- Reads the contents of its assigned portion of the input file
- Parses and prepares data for input to the map function
- Passes data into the map function and saves the result in memory
- Periodically writes completed work to local disk
- Notifies the Master of this partially completed work (intermediate data)

Reduce-Machine
- Receives notification from the Master of partially completed work
- Retrieves intermediate data from the Map-Machine via remote read
- Sorts intermediate data by key
- Iterates over intermediate data
  - For each unique key, sends the corresponding set of values through the reduce function
- Appends the result of the reduce function to the final output file (HDFS)
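The sort, iterate, and reduce steps map naturally onto `sorted` and `itertools.groupby`. This is a sketch under the assumption that the fetched data fits in memory; `run_reduce_worker` is a made-up name, and the remote read and the HDFS write are replaced by plain lists.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_worker(intermediate_pairs, reduce_fn):
    """Sort by key, then call reduce_fn once per unique key."""
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append(reduce_fn(key, values))  # "append to final output file"
    return output


# Pairs as they might arrive from several map machines, in no particular order.
fetched = [("fox", 1), ("the", 1), ("dog", 1), ("the", 1), ("fox", 1), ("the", 1)]
final_output = run_reduce_worker(fetched, lambda key, values: (key, sum(values)))
```

Sorting first matters because `groupby` only merges adjacent equal keys.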

Data Flow
- Input and final output are stored on a distributed file system
  - Scheduler tries to schedule map tasks “close” to the physical storage location of the input data
- Intermediate results are stored on the local file system of the map and reduce workers
- Output is often the input to another MapReduce task

Capacity planning
- Cloud providers have to scale on demand
- Capacity planning
  - Match demand to available resources

Scaling in cloud
- Scale vertically (scale up)
  - Add resources to a node (or a server) to make it more powerful
- Scale horizontally (scale out)
  - Add more nodes (or commodity servers)

Building blocks in cloud
- Data center
  - Server: what we want to connect
  - Switch control: who is connected right now (enabling data to flow)
- Switch
  - A layer 2 device that deals with local networking
  - Switching a connection is based on its own internal hardware

Scaling the servers
- Add more ports to the switch
  - Must support hundreds of thousands of gigabits per second
  - Hundreds of thousands of servers in a data center
  - Each of which requires up to 1 Gbps
  - Infeasible
- Add more switches
  - Imagine a tree-like structure
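The arithmetic behind "infeasible" can be spelled out as follows. The 1 Gbps figure is from the slide; the server count of 100,000 and the 48 downlinks per switch are hypothetical values chosen for illustration.

```python
def aggregate_gbps(num_servers, gbps_per_server=1):
    """Bandwidth a single giant switch would have to carry."""
    return num_servers * gbps_per_server


def tree_levels(num_servers, downlinks_per_switch):
    """Levels of switches needed to reach every server in a simple tree."""
    levels, reachable = 0, 1
    while reachable < num_servers:
        reachable *= downlinks_per_switch
        levels += 1
    return levels


total_gbps = aggregate_gbps(100_000)  # 100,000 Gbps, i.e. 100 Tbps
levels = tree_levels(100_000, 48)     # 48, 2304, 110592 servers reachable
```

A tree of small switches reaches every server in a few levels, but it does not create that aggregate bandwidth near the root.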

What happens as we keep going up the tree?
- It is technologically impossible to build the enormous root switch
  - Increasing the number of ports is expensive
- What happens if the root fails?
- Switches can’t handle that much load
  - Max per switch = 2 Gbps
  - The other 2 connections are useless

From tree to fat tree
- A 4x4 switch represented as 2 sets of 2x2 switches
- Enforce the “criss-cross” pattern

A large fat tree: the 8x8 switch (4x(2x2))
- A scalable tree, using only 2x2 switches (smaller switches)

The Clos Network
- Non-blocking property
  - “Any unused server can connect to any other unused server at any time, no matter what the other connections are.”
- Adding another set of switches in the middle
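For a three-stage Clos network, the quoted property corresponds to Clos's classical strict-sense condition. The sketch below uses the standard textbook notation, which does not appear on the slide: `n` is the number of inputs on each first-stage switch and `m` is the number of middle-stage switches.

```python
def is_strictly_nonblocking(m, n):
    """Strict-sense non-blocking: a free path always exists, so m >= 2n - 1."""
    return m >= 2 * n - 1


def is_rearrangeably_nonblocking(m, n):
    """Weaker condition: existing connections may be rerouted, so m >= n."""
    return m >= n


# First-stage switches with n = 2 inputs each:
needs_three_middle = is_strictly_nonblocking(3, 2)
two_is_not_enough = not is_strictly_nonblocking(2, 2)
```

This shows why adding switches in the middle helps: with two inputs per edge switch, two middle switches suffice only if existing connections may be rerouted, and a third removes that need.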

Scale out is better than scale up
- Scale out
  - Having a lot of smaller switches
- Scale up
  - Having a few big switches

Scaling comparison
- Cost
  - Normally, scaling up costs more than scaling out.
  - Scale out enables you to try smaller, specialized configurations.
- Maintenance
  - Scale out increases the number of systems you must manage.
- Communication
  - Scale out increases the amount of communication between systems.
  - Scale out introduces additional latency to your system.
  - Scale out increases the availability of your system.
