Map Reduce & Hadoop June 3, 2015 HS Oh, HR Lee, JY Choi


1 Map Reduce & Hadoop June 3, 2015 HS Oh, HR Lee, JY Choi
YS Lee, SH Choi

2 Outline
Part 1: Introduction to Hadoop; MapReduce Tutorial with Simple Example; Hadoop v2.0: YARN
Part 2: MapReduce; Hive; Stream Data Processing: Storm; Spark; Up-to-date Trends

3 MapReduce Overview
Task flow; Shuffle configurables; Combiner; Partitioner; Custom Partitioner Example; Number of Maps and Reduces; How to write MapReduce functions

4 MapReduce Overview (diagram: a mixed stream of A and B records is mapped, then shuffled so that all A's and all B's are grouped together by key)

5 MapReduce Task flow

6 MapReduce Shuffle Configurables

7 Combiner
A mini reducer, functionally the same as the reducer. It runs on each map task (locally), reducing communication cost. Use a combiner only when the reduce function is both commutative and associative.
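The combiner idea can be sketched in plain Python (illustrative only, not the Hadoop API): running the reduce logic locally on each map task's output shrinks the data that must cross the network during the shuffle, which is safe here because integer addition is commutative and associative.

```python
from collections import defaultdict

def combine(map_output):
    """Locally sum counts per key on one map task (acts like a mini reducer)."""
    local = defaultdict(int)
    for word, count in map_output:
        local[word] += count
    return list(local.items())

def reduce_all(partitions):
    """Final reduce over the (possibly combined) output of every map task."""
    totals = defaultdict(int)
    for partition in partitions:
        for word, count in partition:
            totals[word] += count
    return dict(totals)

# Two map tasks emit (word, 1) pairs; each is combined locally first.
task1 = [("a", 1), ("b", 1), ("a", 1)]
task2 = [("a", 1), ("b", 1)]
result = reduce_all([combine(task1), combine(task2)])
print(result)  # {'a': 3, 'b': 2}
```

Because the operation is commutative and associative, reducing the combined output gives the same totals as reducing the raw map output, only with fewer pairs shuffled.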

8 Partitioner
Divides the map output (key, value) pairs among reducers by a rule. The default strategy is hashing (HashPartitioner):

public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
  public void configure(JobConf job) {}
  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
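The same strategy can be mirrored in Python (a sketch, not the Hadoop class): masking the hash with the maximum positive 32-bit integer clears the sign bit, so even keys with negative hash codes map to a valid partition.

```python
INT_MAX = 2**31 - 1  # the equivalent of Java's Integer.MAX_VALUE

def hash_partition(key, num_reduce_tasks):
    # Mask off the sign bit so negative hash values still yield a
    # non-negative partition index, then take the modulus.
    return (hash(key) & INT_MAX) % num_reduce_tasks

# Every key lands deterministically on exactly one of the reducers.
parts = [hash_partition(k, 4) for k in ["apple", "banana", "cherry"]]
print(parts)
```

The same key always goes to the same reducer within a job, which is exactly what the shuffle relies on to group identical keys.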

9 Custom Partitioner Example
Input rows carry name, age, sex, and score; map output is divided by age range:

public static class AgePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String[] nameAgeScore = value.toString().split("\t");
        int ageInt = Integer.parseInt(nameAgeScore[1]);
        // this is done to avoid performing mod with 0
        if (numReduceTasks == 0)
            return 0;
        // if the age is <= 20, assign partition 0
        if (ageInt <= 20)
            return 0;
        // else if the age is between 21 and 50, assign partition 1
        if (ageInt <= 50)
            return 1 % numReduceTasks;
        // otherwise assign partition 2
        return 2 % numReduceTasks;
    }
}

10 Number of Maps and Reduces
The number of maps = the number of DFS blocks, so adjust the DFS block size to change the number of maps. The right level of parallelism for maps is roughly 10-100 maps per node. The mapred.map.tasks parameter is only a hint.
The number of reduces is set with conf.setNumReduceTasks(int num). Suggested values: a little fewer reduce tasks than the total number of slots, task times between 5 and 15 minutes, and as few output files as possible.

11 How to write MapReduce functions [1/2]
Java word count example:

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
}

Reporter: a facility for MapReduce applications to report progress and to update counters and status information. Mappers and reducers can use the Reporter to report progress or simply to indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial, since the framework might otherwise assume the task has timed out and kill it. Applications can also update Counters via the Reporter.

12 How to write MapReduce functions [2/2]
Python word count example (Hadoop Streaming).

Mapper.py:
#!/usr/bin/python
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%d" % (word, 1)

Reducer.py:
#!/usr/bin/python
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.strip().split('\t')
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print "%s\t%d" % (current_word, current_count)
        current_word = word
        current_count = int(count)
if current_word is not None:
    print "%s\t%d" % (current_word, current_count)

How to execute:
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming.jar \
    -files /home/hduser/Mapper.py,/home/hduser/Reducer.py \
    -mapper /home/hduser/Mapper.py \
    -reducer /home/hduser/Reducer.py \
    -input /input/count_of_monte_cristo.txt \
    -output /output

13 Hadoop Ecosystem
Hive & Stream Data Processing: Storm

14 The World of Big Data Tools (from Bingjing Zhang)
MapReduce model: Hadoop. DAG model: Dryad/DryadLINQ, Spark, Flink, Tez. Graph model: Giraph, GraphLab, GraphX. BSP/collective model: MPI, Hama, Harp. For iterations/learning: HaLoop, Twister, REEF. For query: Pig/PigLatin, Hive, Drill, SparkSQL (Shark), MRQL. For streaming: S4, Storm, Samza, Spark Streaming.

15 Hive
Data warehousing on top of Hadoop. Designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. HiveQL statements are automatically translated into MapReduce jobs.

16 Advantages
A higher-level query language that simplifies working with large amounts of data. Lower learning curve than Pig or MapReduce: HiveQL is much closer to SQL than Pig, so there is less trial and error.

17 Disadvantages
Updating data is complicated, mainly because of HDFS: you can add records and overwrite partitions, but not update records in place. No real-time access to data (use other means like HBase or Impala). High latency.

18 Hive Architecture (speaker note: how is data stored and moved through the Metastore? What does the Compiler do?)

19 Metastore
The Metastore contains things like IDs of databases, IDs of tables, and IDs of indexes.

20 Compiler
Parser: converts the query into a parse tree representation. Semantic Analyzer: converts the parse tree into a block-based internal query representation. Logical Plan Generator: converts the internal query representation into a logical plan, a tree of operators (join, filter, ...). Query Plan Generator: converts the logical plan into physical plans (MapReduce jobs).

21 Hive Architecture

22 HiveQL
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT, but only basic support for indexes. HiveQL lacks support for transactions and materialized views, and offers only limited subquery support. Support for INSERT, UPDATE, and DELETE with full ACID functionality was made available with release 0.14.

23 Datatypes in Hive
Primitive datatypes: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING

24 HiveQL – Group By
(Tables: pv_users(pageid, age) is aggregated into pageid_age_sum(pageid, age, count).)
HiveQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

25 HiveQL – Group By in MapReduce
(Diagram: the map phase emits key <pageid, age> with value 1 for each pv_users row; the shuffle brings identical keys to the same reducer; the reduce phase sums the values, e.g. key <2, 25> appearing twice yields the row (2, 25, 2).)
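The map/shuffle/reduce flow that Hive generates for this GROUP BY can be sketched in plain Python (the sample rows are illustrative, not the slide's exact data):

```python
from collections import defaultdict

def map_phase(rows):
    # Emit ((pageid, age), 1) for every row, like the Hive-generated mapper.
    return [((pageid, age), 1) for pageid, age in rows]

def shuffle(pairs):
    # Group values by key, like Hadoop's shuffle between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the 1s per (pageid, age), producing the GROUP BY counts.
    return {key: sum(values) for key, values in groups.items()}

pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]
print(reduce_phase(shuffle(map_phase(pv_users))))
# {(1, 25): 1, (2, 25): 2, (1, 32): 1}
```

Each distinct (pageid, age) pair becomes one reducer key, so the reducer's sum is exactly the count(1) column of pageid_age_sum.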

26 Stream Data Processing

27 Distributed Stream Processing Engine
Stream data: an unbounded sequence of event tuples, e.g., sensor data, stock trading data, web traffic data. Since large volumes of data flow in from many sources, centralized systems can no longer process them in real time.

28 Distributed Stream Processing Engine
General stream processing model: stream processing handles data before storing it; by contrast, batch systems (like Hadoop) process data after storing it. Processing Element (PE): a processing unit in a stream engine. A stream processing engine generally creates a logical network of PEs connected in a directed acyclic graph (DAG).

29 Distributed Stream Processing Engine

30 DSPE Systems
Apache Storm (current release: 0.10): developed by Twitter, donated to the Apache Software Foundation in 2013; pull-based messaging.
Apache S4 (current release: 0.6): developed by Yahoo, donated to the Apache Software Foundation in 2011; S4 stands for Simple Scalable Streaming System; push-based messaging.
Apache Samza (current release: 0.9): developed by LinkedIn; messaging via a message broker (Kafka).

31 Apache Storm System Architecture

32 Apache Storm
Topology: a PE DAG on Storm.
Spout: the starting point of a data stream; it can listen on an HTTP port or pull from a queue.
Bolt: processes incoming stream tuples. A bolt pulls messages from its upstream PE, so bolts do not take on an excessive number of messages.
Stream grouping: shuffle grouping, fields grouping, partial key grouping, all grouping, global grouping, ...
Message processing guarantee: each PE keeps an output message until the downstream PE has processed it and sent an acknowledgement.

33 Apache Storm: Spouts
Source of streams (diagram: a spout emitting tuples).

34 Apache Storm: Bolts
Processes input streams and produces new streams (diagram: tuples flowing through a bolt).

35 Apache Storm: Topology
Network of spouts and bolts

36 Apache Storm: Task Spouts and bolts execute as many tasks across the cluster

37 Apache Storm: Stream grouping
Shuffle grouping: pick a random task Fields grouping: consistent hashing on a subset of tuple fields All grouping: send to all tasks Global grouping: pick task with lowest id
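These grouping rules can be sketched in plain Python (illustrative only, not the Storm API): each function returns the index of the downstream task that should receive a tuple.

```python
import random

def shuffle_grouping(num_tasks, rng=random):
    # Pick a random downstream task: balances load, ignores tuple content.
    return rng.randrange(num_tasks)

def fields_grouping(tuple_, field, num_tasks):
    # Hash a chosen field so equal field values always reach the same task.
    return hash(tuple_[field]) % num_tasks

def global_grouping(num_tasks):
    # Send everything to the task with the lowest id.
    return 0

t = {"word": "storm", "count": 1}
# Fields grouping is deterministic: 10 routings of the same word agree.
targets = {fields_grouping(t, "word", 4) for _ in range(10)}
print(targets)  # a single task id
```

All grouping would simply return the full range of task ids instead of one; the key property shown here is that fields grouping keeps state for a given field value on one task.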

38 Apache Storm
Supported languages: Python, Java, Clojure.
Tutorial: bolt 'exclaim1' appends the string "!!" to its input; bolt 'exclaim2' appends the string "**" to its input.
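The exclaim topology from the tutorial can be simulated in plain Python (a sketch, not the Storm API): each bolt consumes tuples from its upstream component and emits transformed tuples downstream.

```python
def word_spout():
    # Source of the stream: emits a fixed set of names.
    for name in ["Rice", "Bob", "John"]:
        yield name

def exclaim_bolt(stream, suffix):
    # Appends a suffix to every tuple, like exclaim1 ("!!") and exclaim2 ("**").
    for tup in stream:
        yield tup + suffix

# Wire spout -> exclaim1 -> exclaim2, mirroring the tutorial topology.
out = list(exclaim_bolt(exclaim_bolt(word_spout(), "!!"), "**"))
print(out)  # ['Rice!!**', 'Bob!!**', 'John!!**']
```

Chaining the generators mirrors the DAG: exclaim2 only ever sees tuples that exclaim1 has already processed.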

39 Apache Storm
(Diagram: the 'word' spout emits names such as Rice, Bob, John; exclaim1 appends "!!", e.g. Rice becomes Rice!!.)

40 References
Apache Hive, https://hive.apache.org/
Design - Apache Hive (Hive wiki)
Apache Storm

41 Spark
Fast, Interactive, Language-Integrated Cluster Computing

42 Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage (Input → Map → Reduce → Output). Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

43 Motivation Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data: Iterative algorithms (machine learning, graphs) Interactive data mining tools (R, Excel, Python) With such frameworks, apps reload data from stable storage on each query

44 Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse. Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability. Support a wide range of applications: batch, query processing, stream processing, graph processing, machine learning.

45 RDD Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues.
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey.
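The split between lazy transformations and eager actions can be sketched with Python generators (an analogy, not the Spark API): transformations only build a pipeline, and no work happens until an action forces evaluation.

```python
def rdd_map(data, f):
    # Transformation: builds a lazy pipeline, computes nothing yet.
    return (f(x) for x in data)

def rdd_filter(data, pred):
    # Transformation: also lazy.
    return (x for x in data if pred(x))

def rdd_count(data):
    # Action: forces evaluation and returns a result to the "driver".
    return sum(1 for _ in data)

nums = range(10)
evens_squared = rdd_map(rdd_filter(nums, lambda x: x % 2 == 0),
                        lambda x: x * x)
# Only now, at the action, does any computation actually run:
print(rdd_count(evens_squared))  # 5
```

Spark additionally records the lineage of each transformation and can cache intermediate results; this sketch shows only the laziness.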

46 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = sc.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR")) // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count   // action
cachedMsgs.filter(_.contains("bar")).count

Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

47 RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

(Lineage: HDFS file → filter (func = _.contains(...)) → filtered RDD → map (func = _.split(...)) → mapped RDD)
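Lineage-based recovery can be sketched in plain Python (illustrative, not Spark internals): instead of replicating derived data, each partition records the chain of functions that produced it, so a lost partition is rebuilt by re-running that chain on the source block.

```python
def rebuild(source_partition, lineage):
    # Re-apply the recorded transformations, in order, to recompute
    # a lost partition from its source data.
    data = source_partition
    for fn in lineage:
        data = fn(data)
    return data

# Lineage of the log-mining example: keep ERROR lines, take field 2.
lineage = [
    lambda lines: [l for l in lines if l.startswith("ERROR")],
    lambda lines: [l.split("\t")[2] for l in lines],
]
hdfs_block = ["INFO\tx\tok", "ERROR\tdisk\tfull", "ERROR\tnet\tdown"]
print(rebuild(hdfs_block, lineage))  # ['full', 'down']
```

Storing the (small) lineage instead of replicated data is why the slide later reports lineage graphs of at most a few kilobytes.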

48 Performance: Logistic Regression
Refer to the RDD paper for the experimental setup and which values are compared.

49 Fault Recovery
K-means run on a 75-node cluster; each iteration consists of 400 tasks working on 100 GB of data. A lost RDD is reconstructed using lineage. Recovery overhead: 24 s (≈30%); lineage graph: ≤10 KB. (Matei et al., Resilient Distributed Datasets, NSDI '12)

50 Generality
Various types of applications can be built atop RDDs and can be combined in a single application running on the Spark runtime.

51 Interactive Analytics
An interactive shell is provided; the program returns results directly, so you can run ad-hoc queries.

52 Demo
WordCount in the Scala API; show the result on the shell by replacing counts.saveAsTextFile() with counts.collect().

53 Conclusion
Performance: fast, due to caching data in memory. Fault tolerance: fast recovery using lineage history. Programmability: multiple languages supported; a simple, integrated programming model.

54 Up-to-date Trends As we have seen, the areas the Hadoop ecosystem covers are broadening widely.

55 Up-to-date Trends Batch + Real-time Analytics Big-Data-as-a-Service

56 Trend 1: Batch + Real-time Analytics
Lambda Architecture: data is dispatched to both the batch layer and the speed layer. The batch layer manages the master dataset (an immutable, append-only set of raw data) and pre-computes the batch views.

57 Trend 1: Batch + Real-time Analytics
Lambda Architecture (continued). Serving layer: indexes the batch views so they can be queried in a low-latency, ad-hoc way. Speed layer: deals with recent data only (the serving layer's update cost is high). Queries are answered by merging results from the batch views and the real-time views.
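The query-time merge of the two layers can be sketched in a few lines of Python (names and sample numbers are illustrative): the batch view holds pre-computed totals for the master dataset, and the real-time view holds only the delta that arrived after the last batch run.

```python
def answer_query(batch_view, realtime_view, key):
    # Merge the pre-computed batch view with the speed layer's recent delta.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

# Batch layer: page-view counts pre-computed from the master dataset.
batch_view = {"page_a": 100, "page_b": 40}
# Speed layer: counts for events that arrived after the last batch run.
realtime_view = {"page_a": 3}

print(answer_query(batch_view, realtime_view, "page_a"))  # 103
```

When the next batch run completes, its view absorbs the recent events and the speed layer's view for that period is discarded, which is what keeps the real-time state small.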

58 Trend 2: Big-Data-as-a-Service
Big data analytics systems are provided as a cloud service, with a programming API and a monitoring interface. The infrastructure can also be provided as a service, so there is no need to worry about distributing data, resource optimization, resource provisioning, etc. Users can focus on the data itself.

59 Trend 2: Big-Data-as-a-Service
Google Cloud Dataflow (screenshots: programming API and monitoring UI).

60 References
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI '12.
Apache Spark
Databricks
Lambda Architecture
Google Cloud Dataflow

61 Questions? Thank you

