Distributed Systems Lecture 3: Big Data and MapReduce


Previous lecture
Overview of main cloud computing aspects:
– Definition
– Models
– Elasticity
– Cloud stack
– Virtualization
AWS

Data-intensive computing
Clouds are designed for data-intensive applications. Approach: move the application to the data.
Computation-intensive computing:
– Example areas: MPI-based high-performance computing, Grids
– Typically runs on supercomputers (e.g., NCSA Blue Waters)
– High CPU utilization
Data-intensive computing:
– Typically stores data at datacenters
– Uses compute nodes nearby (same datacenter or rack, chosen by latency)
– Compute nodes run computation services
– High I/O utilization
In data-intensive computing the focus shifts from computation to the data: CPU utilization is no longer the most important resource metric.

Big Data (1)
Data mining over the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance:
– Sloan Digital Sky Survey (2000): 200 GB/night
– Large Synoptic Survey Telescope (2016): 140 TB every five days
– NASA Center for Climate Simulation: 32 PB of information
– eBay: two data warehouses of 7.5 PB and 40 PB, respectively
– Amazon: among the world's largest databases, at 7.5 TB, 18.5 TB, and 24.7 TB
– Facebook: 50 million photos
Michael Dell (CEO): "Top thinkers are no longer the people who can tell you what happened in the past, but those who can predict the future. Welcome to the Data Economy, which holds the promise for significantly advancing society and economic growth on a global scale."

Big Data (2): dimensions
Volume
– Analyzing large volumes (TBs) of distributed data
– E.g., deriving insights from large historical data sets
Velocity
– Fast processing of variable data sets (streaming data)
– E.g., trend analysis, weather forecasting
Variety (Complexity)
– Highly complex analysis at large scale, e.g., audio/video analysis at web scale, speech-to-text, etc.
– Unstructured or structured data, e.g., relational databases, graphs

Big Data (3)
[Figure: the three Big Data dimensions plotted on axes — Volume (MB, GB, TB, PB), Velocity (historic, batch, periodic O(days/hours), realtime O(seconds)), and Variety (relational data, audio, video, graphs, photos)]

Big Data (4): the Data Economy
Data is an important asset to any organization:
– Discovery of knowledge; enabling discovery; annotation of data
– Complex computational models
– No single environment is good enough: elastic, on-demand capacities are needed → cloud computing
– New programming models for Big Data on the cloud
– Supporting algorithms and data structures

Big Data analytics (1): MapReduce
Cloud-based Big Data programming:
– Large number of low-cost "commodity" machines
– Good performance per dollar
– Failure rate increases with the number of resources
– High-volume, distributed data (usually unstructured or tuple-based)
Embarrassingly parallel computations on big data.
Programming model:
– Split the data and process each chunk independently; join the intermediate results and output the aggregated result
– The same computation is performed at different nodes on different pieces of the dataset
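The split/process/join model above can be simulated on a single machine. The following is an illustrative sketch in plain Java (no Hadoop; the class and method names are ours): each chunk is summed independently, then the partial results are joined into the final answer.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitProcessJoin {
    // Split the input into fixed-size chunks, process each chunk
    // independently, then join (aggregate) the intermediate results.
    static int sumInChunks(int[] data, int chunkSize) {
        List<Integer> partials = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            int partial = 0;                      // independent per-chunk work
            for (int i = start; i < end; i++) partial += data[i];
            partials.add(partial);                // intermediate result
        }
        return partials.stream().mapToInt(Integer::intValue).sum(); // join step
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sumInChunks(data, 3)); // prints 28
    }
}
```

In a real cluster each chunk would be processed on a different node; here the loop iterations stand in for those nodes.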

Big Data analytics (2): Hadoop
[Figure: Hadoop architecture — a master node running the MapReduce JobTracker and the HDFS NameNode, and multiple slave (worker) nodes, each running a MapReduce TaskTracker and an HDFS DataNode that stores data blocks]

Hadoop Distributed File System (HDFS) (3)
[Figure: the NameNode maps each file to its blocks and their locations across racks, e.g., filename "x", size 1 GB → block 1 on DataNodes d1, d3, d4; block 2 on DataNodes d2, d4, d5; ...]
Each block is replicated (3 times by default). One replica of each block is kept on the same rack; the rest are spread across the cluster.
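The rack-aware placement policy can be sketched as follows. This is a toy illustration in plain Java, not the actual HDFS placement code: one replica goes to a node on the writer's rack, and the remaining replicas go to nodes on other racks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReplicaPlacement {
    // Toy rack-aware placement: first replica on the writer's rack,
    // remaining replicas on nodes from other racks.
    static List<String> placeReplicas(Map<String, String> nodeToRack,
                                      String writerRack, int replicas) {
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, String> e : nodeToRack.entrySet())
            if (chosen.isEmpty() && e.getValue().equals(writerRack))
                chosen.add(e.getKey());            // local-rack replica
        for (Map.Entry<String, String> e : nodeToRack.entrySet())
            if (chosen.size() < replicas && !e.getValue().equals(writerRack))
                chosen.add(e.getKey());            // off-rack replicas
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("d1", "rack1"); cluster.put("d2", "rack1");
        cluster.put("d3", "rack2"); cluster.put("d4", "rack3");
        System.out.println(placeReplicas(cluster, "rack1", 3)); // prints [d1, d3, d4]
    }
}
```

Keeping one replica local makes writes cheap, while the off-rack replicas survive a whole-rack failure.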

Hadoop MR execution (4)
The terms are borrowed from functional languages (e.g., Lisp).
Example: sum of squares
(map square '(1 2 3 4))
– Output: (1 4 9 16) [processes each record sequentially and independently]
(reduce + '(1 4 9 16))
– (+ 16 (+ 9 (+ 4 1)))
– Output: 30 [processes the set of all records in groups]
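The same sum-of-squares pipeline written with Java streams (an illustrative aside, not part of the original slides): map squares each record independently, and reduce folds all intermediate values into one result.

```java
import java.util.List;

public class SumOfSquares {
    public static void main(String[] args) {
        List<Integer> records = List.of(1, 2, 3, 4);
        // map: square each record independently
        List<Integer> squared = records.stream().map(x -> x * x).toList();
        // reduce: fold all intermediate values into a single result
        int sum = squared.stream().reduce(0, Integer::sum);
        System.out.println(squared + " -> " + sum); // prints [1, 4, 9, 16] -> 30
    }
}
```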

Hadoop MR execution (5)
[Figure: the JobTracker receives a request to execute "wc" on file "x" with 4 mappers and 3 reducers. It launches the mappers, preferring collocation with the data blocks (b1 ... b7); each mapper emits intermediate key/value pairs (k1 => v1, v2, ...). In the shuffle stage the pairs are routed to reducers by key (e.g., k1, k2, k3 to one reducer; k4, k5 to another), and each reducer writes an output file (o1, o2, o3)]

Hadoop MR programming model (6)
Input → Map → Shuffle → Reduce → Output
User-defined functions:
– Map(k, v) → (k', v')
– Reduce(k', v'[]) → (k'', v'')
The shuffle is performed by the MR system.

void map(String key, String value) {
  // do work
  // emit key, value pairs to reducers
}

void reduce(String key, Iterator values) {
  // for each key, iterate through all values
  // aggregate results
  // emit final result
}

Hadoop - example: word count (7)
Input: a large number of documents. Output: the count of each word occurring across the documents.

void map(String key, String value) {
  // key: document name (ignored)
  // value: document contents
  for each word w in value
    Emit(w, 1)
}

void reduce(String key, Iterator values) {
  // key: word
  // values: list of counts
  int count = 0;
  for each v in values
    count += v;
  Emit(key, count)
}

Hadoop – example: word count (8)

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Hadoop – example: word count (9)

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Source: job.html

Hadoop – example: word count (10)

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}

Hadoop - example: word count (11)
[Figure: the input "Peter Piper picked a peck of pickled peppers / A peck of pickled peppers Peter Piper picked" is split across mappers, each emitting (word, 1) pairs; after a local sort and the shuffle, each reducer receives all pairs for its keys and sums them, producing: peter 2, piper 2, picked 2, a 2, peck 2, of 2, pickled 2, peppers 2]
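The whole pipeline above can be simulated in plain Java on one machine (an illustrative sketch only, no Hadoop; class and method names are ours): the map phase emits (word, 1) pairs, grouping by key plays the role of shuffle and sort, and the reduce phase sums each group.

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // Simulates map -> shuffle (group by key) -> reduce on a single machine.
    static Map<String, Integer> wordCount(String... documents) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like post-shuffle
        for (String doc : documents)                   // "map": emit (word, 1) per token
            for (String word : doc.toLowerCase().split("\\s+"))
                counts.merge(word, 1, Integer::sum);   // "reduce": sum per key
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(
            "Peter Piper picked a peck of pickled peppers",
            "A peck of pickled peppers Peter Piper picked");
        System.out.println(counts.get("peppers")); // prints 2
    }
}
```

Words are lowercased here so that "A" and "a" count as the same key, matching the aggregated totals in the figure.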

Hadoop - more examples (12)
Distributed search (grep):
– Map: emit the line if it matches the pattern
– Reduce: concatenate
Analysis of large-scale system logs
More:
– Jerry Zhao, Jelena Pjesivac-Grbovic, MapReduce: The Programming Model and Practice. Sigmetrics tutorial. research.google.com/pubs/archive/36249.pdf
– mapreduce/
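The distributed grep pattern can be sketched locally in plain Java (an illustration with names of our own choosing, not Hadoop code): the map side keeps only the lines that match the pattern, and the reduce side concatenates them.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LocalGrep {
    // map: emit a line only if it matches the pattern; reduce: concatenate.
    static String grep(List<String> lines, String regex) {
        Pattern p = Pattern.compile(regex);
        return lines.stream()
                .filter(line -> p.matcher(line).find())   // map-side filter
                .collect(Collectors.joining("\n"));       // reduce-side concat
    }

    public static void main(String[] args) {
        List<String> log = List.of("INFO start", "ERROR disk full",
                                   "INFO done", "ERROR timeout");
        System.out.println(grep(log, "ERROR")); // prints the two ERROR lines
    }
}
```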

Hadoop - other patterns and optimizations (13)
Map-Reduce-Reduce...: chain several reduce stages after the map stage.
Iterative Map-Reduce: repeat map and reduce rounds over the data.
Local combiners: run a reduce-like combiner on each mapper's local output before the shuffle (e.g., pre-aggregating (A, 1), (A, 1) into (A, 2)).
Custom data partitioners: e.g., hash(line) mod R.
Goal: increase local work (mappers + combiners) and reduce the data transferred over the network.
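The partitioner idea above can be sketched in a few lines of plain Java (illustrative only, not Hadoop's Partitioner API): hashing the key modulo R decides which reducer receives a pair, and since identical keys always hash the same way, all pairs for a key meet at the same reducer.

```java
public class ToyPartitioner {
    // Route a key to one of numReducers partitions. Identical keys always
    // map to the same partition, which is what makes the reduce phase correct.
    static int partition(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers); // always non-negative
    }

    public static void main(String[] args) {
        // The same key is routed consistently across calls.
        System.out.println(partition("peppers", 3) == partition("peppers", 3)); // prints true
    }
}
```

Math.floorMod is used instead of % because hashCode() may be negative.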

Hadoop - other important aspects (14)
Data locality:
– Map-heavy jobs: execute the mappers on the data partitions, then move the data to the reducers
– Reduce-heavy jobs: do not move the data to the reducers; instead, move the reducers to the intermediate data
Fault tolerance:
– Worker failure: re-execute on failure (Hadoop), or regenerate the current state with minimum re-execution (e.g., Resilient Distributed Datasets in Spark)
– Master failure: primary/secondary master, with regular backups to the secondary; on failure of the primary, the secondary takes over

Hadoop in real life
It is easy to write and run highly parallel programs in the new cloud programming paradigms:
– Google: MapReduce and Pregel (and many others)
– Amazon: Elastic MapReduce service (pay-as-you-go)
Google (MapReduce):
– Indexing: a chain of 24 MapReduce jobs
– ~200K jobs processing 50 PB/month (in 2006)
Yahoo! (Hadoop + Pig):
– WebMap: a chain of 100 MapReduce jobs
– 280 TB of data, 2500 nodes, 73 hours
Facebook (Hadoop + Hive):
– ~300 TB total, adding 2 TB/day (in 2008)
– 3K jobs processing 55 TB/day
Similar numbers from other companies, e.g., Yieldex, eharmony.com, etc.
NoSQL: MySQL has been an industry standard for a while, but Cassandra is 2400 times faster!

Useful links
Installing Hadoop on a single node (Linux)
WordCount example:
– client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Running WordCount on AWS Elastic MapReduce:
– amazon-elastic-mapreduce-job.html

Next lecture
Failure detection