1
YSmart: Yet Another SQL-to-MapReduce Translator
Rubao Lee1, Tian Luo1, Yin Huai1, Fusheng Wang2, Yongqiang He3, Xiaodong Zhang1
1 Dept. of Computer Sci. & Eng., The Ohio State University
2 Center for Comprehensive Informatics, Emory University
3 Data Infrastructure Team, Facebook
Thanks for the introduction. Good morning everyone. My name is Yin Huai. Today, I will present our work on YSmart, Yet Another SQL-to-MapReduce Translator. This is joint work with colleagues at The Ohio State University, Emory University, and the Facebook Data Infrastructure Team.
2
Data Explosion in Human Society
[Chart: the global storage capacity (analog vs. digital) and the amount of digital information created and replicated in a year, 2000 to 2007. In 2007, analog storage was about 18.86 billion GB, while digital storage had grown to roughly 280 billion GB.]
We are entering an era of data explosion. Since the year 2000, the total digital storage capacity has been increasing rapidly. By 2007 it had reached around 280 billion GB, about 45 percent of which came from PC hard disks. According to another report, by 2020 the amount of digital data created and replicated in a single year will reach 35 trillion GB, 44 times as much as in the report's base year. With this data explosion, companies, organizations and research projects need to deal with big data more and more often. Big data is not just about the huge volume of data; it is also about the complexity of the data and of the analysis methods used to process it.
Source: Exabytes: Documenting the 'digital age' and huge growth in computing capacity, The Washington Post
3
Challenges of Big Data Management and Analytics
Existing DB technology is not prepared for such huge volumes
Storage and system scale (e.g. 70 TB/day and 3000 nodes in Facebook)
Many applications in science, medicine and business follow the same trend
Big data demands more complex analytics than data transactions
Data mining, time series analysis, etc., to gain deep insights
The conventional DB business model is not affordable
Software licenses and maintenance fees, $10,000/TB*
The conventional DB processing model is to “scale up” by upgrading CPU/memory/storage/networks (the HPC model)
The big data processing model is to “scale out”: continuously adding low-cost computing and storage nodes in a distributed way
The MapReduce (MR) programming model has become an effective data processing engine for big data analytics
Hadoop is the most widely used implementation of MapReduce, adopted in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, EBay, Facebook, IBM, NY Times, Yahoo! …
*:
4
MapReduce Overview
A simple but effective programming model designed to process huge volumes of data concurrently on large-scale clusters
Key/value pair transitions
Map: (k1, v1) => (k2, v2)
Reduce: (k2, a list of v2) => (k3, v3)
Shuffle: by Partition Key (it could be the same as k2, or not)
Partition Key: determines how a key/value pair in the map output is transferred to a reduce task
The MapReduce programming model and its implementations have proven to be a very important framework for dealing with big data. MapReduce is a simple but effective programming model designed to process huge volumes of data concurrently on a cluster. A data processing procedure is composed of one or more MR jobs. In every MR job, data is organized as key/value pairs. For example, we can have a kind of key/value pair with a person's name as the key and his or her age as the value. Typically, an MR job has two parts, a map function and a reduce function; these two functions transform one kind of key/value pair into another, and they are invoked by map and reduce tasks, respectively. Right now, Hadoop is the most widely used implementation of MapReduce. Hundreds of society-dependent corporations/organizations have chosen Hadoop and its ecosystem as their big data analytics platform, such as Facebook, IBM, NY Times, Yahoo! …
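As a generic illustration of these key/value transitions (a word-count sketch, not code taken from YSmart or this talk), the map function below turns (k1, v1) = (byte offset, text line) into (k2, v2) = (word, 1), and the reduce function turns (k2, a list of v2) into (k3, v3) = (word, total count):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
    // Map: (k1, v1) = (byte offset, text line)  =>  (k2, v2) = (word, 1)
    public static class TokenizerMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, one);   // k2 also acts as the partition key by default
            }
        }
    }
    // Reduce: (k2, list of v2) = (word, [1, 1, ...])  =>  (k3, v3) = (word, count)
    public static class SumReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}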
5
MR(Hadoop) Job Execution Overview
MR program (job): the execution of an MR job involves 6 steps
Master node: control-level work, e.g. job scheduling and task assignment
Worker nodes: do the data processing work specified by the Map or Reduce function (Map Tasks, Reduce Tasks)
Data is stored in a distributed file system (e.g. the Hadoop Distributed File System)
1: Job submission; 2: Assign tasks
Let's quickly review the execution of a MapReduce job, or more specifically a Hadoop job. In a Hadoop cluster, there is a master node, which takes care of all control-level work, such as job scheduling and task assignment, and a number of worker nodes, which do the data processing work specified by the Map or Reduce function. Data is stored in a distributed file system; in Hadoop, it is the Hadoop Distributed File System. The execution of an MR job has 6 steps. First, a user submits an MR program, also called an MR job. Then, when this job is ready to be executed, the master assigns map and reduce tasks to worker nodes.
6
MR(Hadoop) Job Execution Overview
MR program: the execution of an MR job involves 6 steps
1: Job submission; 3: Map phase (concurrent tasks); 4: Shuffle phase
Map output will be shuffled to different reduce tasks based on Partition Keys (PKs), usually the map output keys
[Diagram: master node and worker nodes; map output flows from map tasks to reduce tasks]
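To make the role of the Partition Key concrete, the sketch below shows a custom Partitioner whose getPartition method mirrors Hadoop's default hash partitioning; the class name is an illustrative assumption. Every map-output pair whose key maps to the same partition number is shuffled to the same reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: how a partition key decides which reduce task receives a map-output pair.
// By default Hadoop hashes the map output key (k2); a custom Partitioner can use
// any part of the key/value pair instead.
public class OrderKeyPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        // Same logic as Hadoop's default HashPartitioner: identical keys
        // (e.g. the same l_orderkey) always land on the same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// Usage (in the job driver): job.setPartitionerClass(OrderKeyPartitioner.class);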
7
MR(Hadoop) Job Execution Overview
MR program: the execution of an MR job involves 6 steps
1: Job submission; 3: Map phase (concurrent tasks); 4: Shuffle phase; 5: Reduce phase (concurrent tasks); 6: Output will be stored back to the distributed file system
There are three phases in the job execution. The first is the map phase, in which the worker nodes associated with map tasks concurrently process the input data and generate intermediate key/value pairs. Then, in the shuffle phase, the intermediate results are distributed to the worker nodes associated with reduce tasks; all key/value pairs with the same key are sent to the same reduce task. After that, the worker nodes associated with reduce tasks concurrently process the intermediate key/value pairs and generate the results of the job. Finally, the output is stored back to HDFS.
8
MR(Hadoop) Job Execution Overview
MR program: the execution of an MR job involves 6 steps (1: job submission; 3: Map phase, concurrent tasks; 4: Shuffle phase; 5: Reduce phase, concurrent tasks; 6: output stored back to the distributed file system)
A MapReduce (MR) job is resource-consuming:
1: Input data scan in the Map phase => local or remote I/Os
2: Storing intermediate results of the Map output => local I/Os
3: Transferring data across nodes in the Shuffle phase => network costs
4: Storing the final results of the MR job => local I/Os + network costs (replicating data)
Although the MapReduce programming model hides the complexity of parallel programming, programmers should be aware that an MR job is resource-consuming. There are several kinds of nontrivial costs: job-scheduling and task-startup costs, the network transfer in the shuffle phase, and the potential cost of remotely accessing map input data when a map task is not co-located with its input.
9
This complex code is for a simple MR job
MR programming is not that “simple”!

package tpch;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Q18Job1 extends Configured implements Tool {

    // Map: tag each record with its source table ('l' for lineitem, 'o' for orders)
    // and emit it under the shared join key (l_orderkey / o_orderkey).
    public static class Map extends Mapper<Object, Text, IntWritable, Text> {
        private final static Text value = new Text();
        private IntWritable word = new IntWritable();
        private String inputFile;
        private boolean isLineitem = false;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
            if (inputFile.compareTo("lineitem.tbl") == 0) {
                isLineitem = true;
            }
            System.out.println("isLineitem:" + isLineitem + " inputFile:" + inputFile);
        }

        public void map(Object key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = (line.toString()).split("\\|");
            if (isLineitem) {
                word.set(Integer.valueOf(tokens[0]));
                value.set(tokens[4] + "|l");
                context.write(word, value);
            } else {
                value.set(tokens[1] + "|" + tokens[4] + "|" + tokens[3] + "|o");
                word.set(Integer.valueOf(tokens[0]));   // restored from context: the join key is the order key
                context.write(word, value);
            }
        }
    }

    // Reduce: sum the lineitem quantities per order key, keep the order record,
    // and emit the order only if the total quantity exceeds 314.
    public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
        private Text result = new Text();

        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sumQuantity = 0.0;
            IntWritable newKey = new IntWritable();
            boolean isDiscard = true;
            String thisValue = new String();
            int thisKey = 0;
            for (Text val : values) {
                String[] tokens = val.toString().split("\\|");
                if (tokens[tokens.length - 1].compareTo("l") == 0) {
                    sumQuantity += Double.parseDouble(tokens[0]);
                } else if (tokens[tokens.length - 1].compareTo("o") == 0) {
                    thisKey = Integer.valueOf(tokens[0]);
                    thisValue = key.toString() + "|" + tokens[1] + "|" + tokens[2];
                } else {
                    continue;
                }
            }
            if (sumQuantity > 314) {
                isDiscard = false;
            }
            if (!isDiscard) {
                thisValue = thisValue + "|" + sumQuantity;
                newKey.set(thisKey);
                result.set(thisValue);
                context.write(newKey, result);
            }
        }
    }

    // Driver: configure and submit the job.
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: Q18Job1 <orders> <lineitem> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "TPC-H Q18 Job1");
        job.setJarByClass(Q18Job1.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Q18Job1(), args);
        System.exit(res);
    }
}

This complex code is for a simple MR job. Low productivity! Do you miss something like … “SELECT * FROM Book WHERE price > …”? Although the MR model provides a much easier environment for parallel programming, the programming work itself is not that easy: you can easily end up writing a lot of code even for a simple MR job, so the productivity of MR programming is actually quite low. To enhance productivity, high-level languages, such as SQL-like declarative languages, on top of MapReduce are desired.
10
SQL-to-MapReduce Translator Hadoop Distributed File System (HDFS)
High-Level Programming and Data Processing Environment on top of Hadoop
A job description in a SQL-like declarative language
An interface between users and MR programs (jobs)
SQL-to-MapReduce Translator: writes MR programs (jobs)
With SQL-like declarative languages, users do not need to write MR jobs or even know how to write them. They can quickly describe their data processing procedure in a few lines of query; a SQL-to-MapReduce translator then translates these SQL-like queries into one or more MR jobs, and the MR framework executes those jobs. Right now, Hive and Pig are the two representative translators on top of Hadoop.
Workers, Hadoop Distributed File System (HDFS)
11
SQL-to-MapReduce Translator Hadoop Distributed File System (HDFS)
High-Level Programming and Data Processing Environment on top of Hadoop
A job description in a SQL-like declarative language
An interface between users and MR programs (jobs)
SQL-to-MapReduce Translator: writes MR programs (jobs)
Improves productivity over hand-coding MapReduce programs: 95%+ of Hadoop jobs in Facebook are generated by Hive, and 75%+ of Hadoop jobs in Yahoo! are invoked by Pig*
Hive: a data warehousing system (Facebook); Pig: a high-level programming environment (Yahoo!)
Hive is a data warehousing system and Pig is a high-level dataflow system. Although the purposes of these two systems are different, they share one core component: the SQL-to-MapReduce translator. These two systems can tremendously improve productivity over hand-coding MR programs. Two examples from the largest big data analytics deployments: at Facebook, more than 95 percent of Hadoop jobs are generated by Hive; at Yahoo!, more than 75 percent of Hadoop jobs are invoked by Pig.
Workers, Hadoop Distributed File System (HDFS)
*
12
Translating SQL-like Queries to MapReduce Jobs: Existing Approach
“Sentence by sentence” translation [C. Olston et al., SIGMOD 2008], [A. Gates et al., VLDB 2009] and [A. Thusoo et al., ICDE 2010]
Implementation: Hive and Pig
Three steps:
Identify major sentences with operations that shuffle the data, such as Join, Group by and Order by
For every operation in a major sentence, a corresponding MR job is generated, e.g. a join op. => a join MR job
Add other operations, such as selection and projection, into the corresponding MR jobs
Let's first look at the existing approach to translating SQL-like queries to MapReduce jobs and find out whether it can translate queries efficiently. The existing approach can be generalized as sentence-by-sentence translation, involving three steps. A query is composed of multiple sentences. First, the major sentences with operations that shuffle the data are identified; for example, join, group by and order by need to shuffle the data. Then, for every major sentence, a corresponding MR job is generated; for example, a join MR job is generated for a join operation. Finally, other operations such as selection and projection are added into the corresponding MR jobs.
A basic question we want to ask: can existing SQL-to-MapReduce translators provide performance comparable to optimized MR job(s)?
13
An Example: TPC-H Q21 What’s wrong?
One of the most complex and time-consuming queries in the TPC-H benchmark for data warehousing performance Optimized MR Jobs vs. Hive in a Facebook production cluster 3.7x What’s wrong?
14
The Execution Plan of TPC-H Q21
Hive handles this sub-tree in a different way from our optimized MR jobs; it is the dominant part of the execution time (~90%).
[Plan tree: SORT over AGG3 over Join4 (left-outer-join), Join3 and Join2 with supplier and nation, and AGG1, AGG2, Join1 over lineitem, orders, lineitem, lineitem]
15
What’s wrong with existing SQL-to-MR translators?
However, inter-job correlations exist. Look at the partition keys: J1, J2 and J4 all need the input table 'lineitem', and J1 to J5 all use the same partition key 'l_orderkey'.
[Diagram: jobs J1 to J5 (JOIN, AGG and composite MR jobs) over the lineitem and orders tables, each using key l_orderkey]
Existing translators are correlation-unaware: they ignore common data input, ignore common data transitions, and add unnecessary data re-partitioning.
Corollary: they generate more unnecessary MR jobs, which significantly hurts performance.
16
Approaches of Big Data Analytics in MR: The landscape
[Chart: the landscape of approaches plotted by Productivity (the effort needed to implement a data processing procedure in the MapReduce environment) and Performance (the execution time of the whole data processing procedure in the MapReduce environment); the correlation-aware SQL-to-MR translator sits in the high-productivity, high-performance corner.]
Hand-coding MR jobs. Pro: high-performance MR programs. Con: 1: lots of coding even for a simple job; 2: redundant coding is inevitable; 3: hard to debug [J. Tan et al., ICDCS 2010].
Existing SQL-to-MR translators. Pro: easy programming, high productivity. Con: poor performance on complex queries (and complex queries are usual in daily operations).
So let's review the landscape of existing approaches to big data analytics in the MR environment along two dimensions. The first dimension is productivity: the effort needed to implement a data processing procedure in the MapReduce environment. The second dimension is performance: the execution time of the whole data processing procedure in the MapReduce environment. First, we can write MR jobs by ourselves. The advantage of this approach is that we can write highly efficient jobs with high performance. The disadvantages are that we have to write lots of code, much of it redundant, and we are more likely to make mistakes as the code grows, so the productivity is low. Alternatively, we can choose high productivity by using existing SQL-to-MR translators. But once such a system is in production and faces queries more complex than toy test cases, the poor performance of the translated MR jobs becomes apparent. Can we combine the high productivity of a SQL-to-MapReduce translator with the high performance of hand-coded MR jobs? A correlation-aware SQL-to-MR translator is the solution.
17
Outline Background and Motivation
Our Solution: YSmart, a correlation-aware SQL-to-MapReduce translator Performance Evaluation YSmart in Hive Conclusion
18
Identify Correlations Merge Correlated MR jobs
Our Approaches and Critical Challenges
Correlation-aware SQL-to-MR translator: SQL-like queries => Primitive MR Jobs => Identify Correlations => Merge Correlated MR Jobs => MR jobs for best performance
Critical challenges:
1: Correlation possibilities and detection
2: Rules for automatically exploiting correlations
3: Implementing high-performance and low-overhead MR jobs
19
Primitive MR Jobs Selection-Projection (SP) MR Job:
Simple query with only selection and/or projection Aggregation (AGG) MR Job: Group input relation and apply aggregation functions e.g. SELECT count(*) FROM faculty GROUP BY age. Partition key can be any column(s) in the GROUP BY clause Join (JOIN) MR Job: For an equi-join on two relations (inner or left/right/full outer join) e.g. SELECT * FROM faculty LEFT OUTER JOIN awards ON … Partition key is in the JOIN predicate Sort (SORT) MR Job: Sort input relation Usually the last step in a complex query
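As an illustration of the AGG primitive, a hand-written MR job for the example query SELECT count(*) FROM faculty GROUP BY age could look like the sketch below; the |-delimited input layout and the position of the age column are assumptions, and a translator such as YSmart would generate a job of this shape automatically.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of an AGG primitive job for "SELECT count(*) FROM faculty GROUP BY age".
// The GROUP BY column (age) serves as the partition key; its column position is assumed.
public class AggJobSketch {
    public static class AggMap extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final IntWritable age = new IntWritable();
        public void map(LongWritable key, Text row, Context context)
                throws IOException, InterruptedException {
            String[] cols = row.toString().split("\\|");
            age.set(Integer.parseInt(cols[2]));   // assumed position of the age column
            context.write(age, one);              // partition key = GROUP BY key
        }
    }
    public static class AggReduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        public void reduce(IntWritable age, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : ones) {
                count += v.get();
            }
            context.write(age, new IntWritable(count));   // count(*) per age group
        }
    }
}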
20
Input Correlation (IC)
Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint J2 J1 lineitem orders lineitem A shared input relation set Map Func. of MR Job 1 Map Func. of MR Job 2
21
Transit Correlation (TC)
Multiple MR jobs have transit correlation (TC) if they have input correlation (IC), and they have the same Partition Key Key: l_orderkey Key: l_orderkey J2 J1 lineitem orders lineitem Two MR jobs should first have IC Partition Key A shared input relation set Other Data Map Func. of MR Job 1 Map Func. of MR Job 2
22
Job Flow Correlation (JFC)
An MR job has Job Flow Correlation (JFC) with one of its child MR jobs if it has the same partition key as that MR job J1 J2 J2 Partition Key Output of MR Job 2 Other Data J1 Map Func. of MR Job 1 Reduce Func. of MR Job 1 Map Func. of MR Job 2 lineitem orders Reduce Func. of MR Job 2
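Putting the three definitions together, a minimal sketch of how they could be tested over a query's job graph is shown below; the JobNode class, its fields and the method names are hypothetical and only restate the definitions above, they are not YSmart's internals.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical job descriptor used only to illustrate the three correlation tests.
class JobNode {
    Set<String> inputRelations;   // e.g. {"lineitem", "orders"}
    String partitionKey;          // e.g. "l_orderkey"
    List<JobNode> children;       // downstream jobs consuming this job's output

    // Input Correlation: the input relation sets are not disjoint.
    static boolean hasIC(JobNode a, JobNode b) {
        Set<String> shared = new HashSet<String>(a.inputRelations);
        shared.retainAll(b.inputRelations);
        return !shared.isEmpty();
    }

    // Transit Correlation: IC plus the same partition key.
    static boolean hasTC(JobNode a, JobNode b) {
        return hasIC(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    // Job Flow Correlation: a parent job shares its partition key with a child job.
    static boolean hasJFC(JobNode parent, JobNode child) {
        return parent.children.contains(child)
            && parent.partitionKey.equals(child.partitionKey);
    }
}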
23
Query Optimization Rules for Automatically Exploiting Correlations
JOB-Merging: Exploiting both Input Correlation and Transit Correlation AGG-Pushdown: Exploiting the Job Flow Correlation associated with Aggregation jobs JOIN-Merging-Full: Exploiting the Job Flow Correlation associated with JOIN jobs and their transit-correlated parent jobs JOIN-Merging-Half: Exploiting the Job Flow Correlation associated with JOIN jobs
24
Rule 1: JOB MERGING J1 lineitem orders lineitem lineitem orders
Condition: Job 1 and Job 2 have both IC and TC.
Action: Merge the Map/Reduce Func. of Job 1 and Job 2 into a single Map/Reduce Func.
The Reduce Func. of the merged job does all the work of the Job 1 Reduce Func. and the Job 2 Reduce Func., and generates two outputs; a sketch of the merged map side follows below.
[Diagram: the Map Func. of the merged job replaces the Job 1 and Job 2 Map Func., so the shared inputs lineitem and orders are read only once.]
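As an illustration of what the merged map side could look like for a J1/J2 pair whose shared input is the lineitem table, consider the hypothetical sketch below; the column positions and record tags follow the style of the hand-written Q18Job1 code earlier and are assumptions, not YSmart's generated code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of a merged Map Func. (Rule 1): one scan of the shared input serves both
// correlated jobs. Each record is emitted once under the shared partition key and
// carries a source tag, so the merged Reduce Func. can do the work of both jobs.
public class MergedMapSketch extends Mapper<Object, Text, IntWritable, Text> {
    private final IntWritable orderKey = new IntWritable();
    private final Text tagged = new Text();
    private boolean isLineitem = false;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        isLineitem = file.startsWith("lineitem");
    }

    public void map(Object key, Text row, Context context)
            throws IOException, InterruptedException {
        String[] cols = row.toString().split("\\|");
        orderKey.set(Integer.parseInt(cols[0]));   // shared partition key: l_orderkey / o_orderkey
        if (isLineitem) {
            // A lineitem record is needed by both merged jobs; it is emitted only once.
            tagged.set(cols[4] + "|l");
            context.write(orderKey, tagged);
        } else {
            // An orders record is needed by the join job only.
            tagged.set(cols[1] + "|" + cols[4] + "|" + cols[3] + "|o");
            context.write(orderKey, tagged);
        }
    }
}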
25
Rule 2: AGG-Pushdown J1 lineitem orders lineitem orders
Condition: the AGG job and its parent Job 1 have JFC.
Action: Merge the Map and Reduce Func. of the AGG job into the Reduce Func. of Job 1.
The Reduce Func. of the merged job does all the work of the Job 1 Reduce Func., the AGG Map Func. and the AGG Reduce Func.
[Diagram: Job 1 (Map and Reduce Func. over lineitem and orders) and the AGG job (Map and Reduce Func.) collapse into a single merged job.]
26
Reduce Func. of Merged Job
Rule 3: Join-Merging-Full
Condition 1: Job 1 and Job 2 have Transit Correlation (TC).
Condition 2: the JOIN job has JFC with Job 1 and Job 2.
Action: Merge the Job 1 Reduce Func., the Job 2 Reduce Func. and the JOIN job into a single Reduce Func.
The Reduce Func. of the merged job does all the work of the Job 1 Reduce Func., the Job 2 Reduce Func., the JOIN Map Func. and the JOIN Reduce Func.
[Diagram: the Map Func. of the merged job replaces the Job 1 and Job 2 Map Func. over lineitem and orders.]
27
Reduce Func. of Merged Job
Rule 4: Join-Merging-Half
Condition: Job 2 and the JOIN job have JFC.
Action: Merge the Reduce Func. of Job 2 and that of the JOIN job into a single Reduce Func.
For data from Job 1, the merged Reduce Func. does the work of the JOIN Reduce Func.; for data from Job 2, it does all the work of the Job 2 Reduce Func., the JOIN Map Func. and the JOIN Reduce Func.
Data dependency must be considered: Job 1 must be executed before the merged job.
28
The Common MapReduce Framework
Aim: provide a framework for executing merged tasks with low overhead
To minimize the size of intermediate data
To minimize data access in the reduce phase
Common Map Func. and Common Reduce Func.
Common Map Func.: tag a record with the Job-IDs it does not belong to => less intermediate data
Common Reduce Func.: dispatch key/value pairs => post-job computations
Refer to our paper for details
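To make the dispatching idea concrete, here is a minimal sketch of a Common Reduce Func.; the tag convention ("!J1|", "!J2|"), the class name and the per-job computations are illustrative assumptions, not YSmart's actual implementation (see the paper for the real Common MapReduce Framework).

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a Common Reduce Func. for two merged jobs. A value tagged "!J1|..."
// is not meant for Job 1, "!J2|..." is not meant for Job 2; untagged values
// belong to both jobs. Tag format and post-job computations are assumptions.
public class CommonReduceSketch extends Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> forJob1 = new ArrayList<String>();
        List<String> forJob2 = new ArrayList<String>();
        for (Text val : values) {
            String v = val.toString();
            boolean notJob1 = v.startsWith("!J1|");
            boolean notJob2 = v.startsWith("!J2|");
            String payload = (notJob1 || notJob2) ? v.substring(4) : v;
            if (!notJob1) forJob1.add(payload);   // dispatch to Job 1's post-job computation
            if (!notJob2) forJob2.add(payload);   // dispatch to Job 2's post-job computation
        }
        // Post-job computation for Job 2 (e.g. an aggregation): here just a count.
        context.write(key, new Text("J2|" + forJob2.size()));
        // Job 1's post-job computation (e.g. a join over forJob1) would follow and
        // typically write to a separate named output (omitted in this sketch).
    }
}

Tagging a record with the jobs it does not belong to keeps the tags short in the common case where a record is needed by every merged job, which is what reduces the size of the intermediate data.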
29
Outline Background and Motivation
Our Solution: YSmart, a correlation-aware SQL-to-MapReduce translator Performance Evaluation YSmart in Hive Conclusion
30
Exp1: Four Cases of TPC-H Q21
1: Sentence-by-sentence translation: 5 MR jobs (Case 1: do not exploit correlations)
2: Input Correlation + Transit Correlation: 3 MR jobs (Case 2: Join1, AGG1 and AGG2 are combined)
3: Input Correlation + Transit Correlation + Job Flow Correlation: 1 MR job (Case 3: all five operations are combined)
4: Hand-coding (similar to Case 3): all five operations are combined, and the reduce-function code is optimized according to the query semantics
[Plan fragments: Join1, AGG1, AGG2, Join2 and the left-outer-join over lineitem and orders]
31
Breakdowns of Execution Time (sec)
From a total of 888 sec to 510 sec. From a total of 768 sec to 567 sec: only a 17% difference. Hand-coding does not need to follow the structure of the query plan.
We first show the execution time breakdowns of Case 1, which does not exploit any correlation: for each job, the execution times of the map phase and the reduce phase are shown as separate bars. In this case, five jobs are generated, and the dominant parts of the whole query are the map phases of Job 1, Job 2 and Job 4; in fact, they all execute a full table scan of lineitem, the largest table in the database. (2) If we use input correlation and transit correlation, we need three jobs; the first job actually does the work of three jobs of the previous case. (3) If we further consider job flow correlation, one job is enough to execute the three jobs of Case 2. Its map phase is almost identical to the map phase of Job 1 in Case 2, but its reduce phase executes the code of all the other operations. (4) Finally, we show the performance of the hand-optimized code. Its map phase is almost identical to the map phase of Case 3, but its reduce phase is optimized according to the query semantics.
[Chart groups: No Correlation; Input Correlation + Transit Correlation; Input Correlation + Transit Correlation + JobFlow Correlation; Hand-Coding]
32
Exp2: Clickstream Analysis
A typical query in production clickstream analysis: “what is the average number of pages a user visits between a page in category ‘X’ and a page in category ‘Y’?” In YSmart JOIN1, AGG1, AGG2, JOIN2 and AGG3 are executed in a single MR job 8.4x 4.8x
33
YSmart in the Hadoop Ecosystem
See patch HIVE-2206 at apache.org Hive + YSmart YSmart Hadoop Distributed File System (HDFS)
34
Conclusion
YSmart is a correlation-aware SQL-to-MapReduce translator
YSmart can outperform Hive by 4.8x and Pig by 8.4x
YSmart is being integrated into Hive
The individual version of YSmart will be released soon
Thank You!
36
Backup Slides
37
Performance Evaluation
Exp 1. How do correlations affect query execution performance? A small local cluster in which each worker node is a quad-core machine; workload: TPC-H Q21, 10 GB dataset. Exp 2. How much can YSmart outperform Hive and Pig? Amazon EC2 cluster; workload: click-stream analysis, 20 GB; compare the execution times of YSmart, Hive and Pig.
38
Why MapReduce is effective data processing engine for big data analytics?
Two unique properties
Minimum dependency among tasks (almost sharing nothing)
Simple task operations in each node (low-cost machines are sufficient)
Two strong merits for big data analytics
Scalability (Amdahl's Law): increase throughput by increasing the number of nodes
Fault tolerance: quick and low-cost recovery from task failures
Hadoop is the most widely used implementation of MapReduce, adopted in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, EBay, Facebook, IBM, NY Times, Yahoo! …
39
An Example
A typical question in click stream analysis: “what is the average number of pages a user visits between a page in category ‘Travel’ and a page in category ‘Shopping’?”
User   Category   Timestamp (TS)
Alice  Travel     2011-01-01 04:05:06
Alice  Travel     2011-01-01 04:05:07
Alice  Sports     2011-01-01 04:06:01
Alice  Shopping   2011-01-01 04:07:06
Alice  Shopping   2011-01-01 04:07:20
Clicks(user id int, page id int, category id int, ts timestamp)
40
An Example
A typical question in click stream analysis: “what is the average number of pages a user visits between a page in category ‘Travel’ and a page in category ‘Shopping’?”
3 steps:
1: Identify the time window for every user-visit path from a page in category ‘Travel’ to a page in category ‘Shopping’;
2: Count the number of pages in each time window (exclude start and end pages);
3: Calculate the average of the counts generated by step 2.
Clicks(user id int, page id int, category id int, ts timestamp)
41
Step 1: Identify the time window for every user-visit path from a page in category ‘Travel’ to a page in category ‘Shopping’
Step 1-1: For each user, identify pairs of clicks from category ‘Travel’ to ‘Shopping’
Join1: C1.user = C2.user, over Clicks (C1) and Clicks (C2)
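If Join1 were translated as a stand-alone JOIN MR job, its map function could be sketched as below; the column layout of the Clicks table and the 'T'/'S' tags are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the map function for Join1 (self-join of Clicks on user). Clicks rows
// in category 'Travel' feed the C1 side and rows in category 'Shopping' feed the
// C2 side; both are emitted under the join/partition key, the user.
public class Join1MapSketch extends Mapper<Object, Text, Text, Text> {
    private final Text user = new Text();
    private final Text tagged = new Text();

    public void map(Object key, Text row, Context context)
            throws IOException, InterruptedException {
        String[] cols = row.toString().split("\\|");   // assumed layout: user|page|category|ts
        user.set(cols[0]);                             // partition key: user
        if (cols[2].equals("Travel")) {
            tagged.set("T|" + cols[3]);                // C1 side: a click in category 'Travel'
            context.write(user, tagged);
        }
        if (cols[2].equals("Shopping")) {
            tagged.set("S|" + cols[3]);                // C2 side: a click in category 'Shopping'
            context.write(user, tagged);
        }
    }
}
// The reduce function would pair each 'T' timestamp with the 'S' timestamps of the
// same user, producing the (User, TS_Travel, TS_Shopping) rows shown below.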
42
User   Category   Timestamp (TS)
Alice  Travel     2011-01-01 04:05:06
Alice  Travel     2011-01-01 04:05:07
Alice  Sports     2011-01-01 04:06:01
Alice  Shopping   2011-01-01 04:07:06
Alice  Shopping   2011-01-01 04:07:20
Join1 (C1.user = C2.user) over Clicks (C1) and Clicks (C2) pairs each ‘Travel’ click with each ‘Shopping’ click of the same user:
User   TS_Travel            TS_Shopping
Alice  2011-01-01 04:05:06  2011-01-01 04:07:06
Alice  2011-01-01 04:05:06  2011-01-01 04:07:20
Alice  2011-01-01 04:05:07  2011-01-01 04:07:06
Alice  2011-01-01 04:05:07  2011-01-01 04:07:20
43
Step 1-2: Find the end TS for every time window
Step 1-3: Find the start TS for every time window
AGG1 (group by user, TS_Travel) turns the Join1 output (User, TS_Travel, TS_Shopping) into the end TS of each window:
User   TS_Travel            TS_End
Alice  2011-01-01 04:05:06  2011-01-01 04:07:06
Alice  2011-01-01 04:05:07  2011-01-01 04:07:06
AGG2 (group by user, TS_End) then yields the start TS of each window:
User   TS_Start             TS_End
Alice  2011-01-01 04:05:07  2011-01-01 04:07:06
[Plan so far: AGG2 (group by user, TS_End) over AGG1 (group by user, TS_Travel) over Join1 (C1.user = C2.user) of Clicks (C1) and Clicks (C2)]
44
Step 2-1: Identify the clicks in each time window
Step 2: Count the number of pages in each time window (exclude start and end pages)
Step 2-1: Identify the clicks in each time window
Join2 (user = C3.user) joins the time windows from Step 1 with the Clicks table (C3):
User   TS_Start             TS_End
Alice  2011-01-01 04:05:07  2011-01-01 04:07:06
User   Category   Timestamp (TS)
Alice  Travel     2011-01-01 04:05:06
Alice  Travel     2011-01-01 04:05:07
Alice  Sports     2011-01-01 04:06:01
Alice  Shopping   2011-01-01 04:07:06
Alice  Shopping   2011-01-01 04:07:20
[Plan so far: Join2 (user = C3.user) of Clicks (C3) with AGG2 (group by user, TS_End) over AGG1 (group by user, TS_Travel) over Join1 (C1.user = C2.user) of Clicks (C1) and Clicks (C2)]
45
Step 2-2: Count the number of pages in each time window (exclude start and end pages);
AGG3 (group by user, TS_Start) counts the clicks strictly inside each time window:
User   Count
Alice  1
[Plan so far: AGG3 (group by user, TS_Start) over Join2 (user = C3.user) of Clicks (C3) with AGG2 over AGG1 over Join1 (C1.user = C2.user) of Clicks (C1) and Clicks (C2)]
46
Step 3: Calculate the average number of counts generated by step 2.
AGG4 computes the average of the per-window counts:
Average Count
1
This is the average number of pages a user visits between a page in category ‘Travel’ and a page in category ‘Shopping’.
[Full plan: AGG4 over AGG3 (group by user, TS_Start) over Join2 (user = C3.user) of Clicks (C3) with AGG2 (group by user, TS_End) over AGG1 (group by user, TS_Travel) over Join1 (C1.user = C2.user) of Clicks (C1) and Clicks (C2)]
47
A sentence-by-sentence translation generates 6 MR jobs (Join1, AGG1, AGG2, Join2, AGG3, AGG4). The 5 MR jobs Join1, AGG1, AGG2, Join2 and AGG3 are correlated (they share the same partition key), and thus can be merged into 1 MR job; together with AGG4 this gives 2 MR jobs instead of 6 (~3x).
48
MR Job Execution Overview
MR program: the execution of an MR job involves 6 steps (1: job submission; 3: Map phase, concurrent tasks; 4: Shuffle phase; 5: Reduce phase, concurrent tasks; 6: output stored back to the distributed file system)
Hadoop is the most widely used implementation of MapReduce, adopted in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, EBay, Facebook, IBM, NY Times, Yahoo! …
A MapReduce (MR) job is resource-consuming: 1: job-scheduling and task-startup costs are nontrivial; 2: intermediate results of the Map output traverse the “slow” network; 3: Map tasks may need remote data access.
[Diagram: master node, worker nodes running Map and Reduce tasks, reduce output]
49
Rule 1: Exploiting the Transit Correlation (TC)
Condition: Job 1 and Job 2 have TC. Action: Merge the Reduce Func. of Job 1 and Job 2 into a single Reduce Func. [Diagram legend: Partition Key, Other Data, Job 1 output, Job 2 output; Map Func. of MR Job 1, Map Func. of MR Job 2, Reduce Func. of MR Job 1, Reduce Func. of MR Job 2, Reduce Func. of Merged Job]
50
Condition: AGG Job and its parent Job 1 have JFC
Rule 2: Exploiting the Job Flow Correlation (JFC) associated with Aggregation (AGG) jobs Condition: AGG Job and its parent Job 1 have JFC MR Job 1 AGG Job Partition Key Output of AGG Job Other Data Map Func. of MR Job 1 Reduce Func. of MR Job 1 Map Func. of AGG Job Reduce Func. of AGG Job
51
Condition: AGG Job and its parent Job 1 have JFC
Rule 2: Exploiting the Job Flow Correlation (JFC) associated with Aggregation (AGG) jobs. Condition: the AGG Job and its parent Job 1 have JFC. Action: Merge the Map and Reduce Func. of the AGG job into the Reduce Func. of Job 1. [Diagram legend: Merged Job; Partition Key, Output of AGG Job, Other Data; Map Func. of Merged Job, Reduce Func. of Merged Job]
52
Exp2: Evaluation Environment
Amazon EC2 cluster: 11 nodes, 20 GB click-stream dataset; compare the execution times of YSmart, Hive and Pig. Facebook cluster: 747 nodes, 8 cores per node, 1 TB TPC-H benchmark dataset; compare the execution times of YSmart and Hive. Every query was executed three times for both YSmart and Hive.
53
Amazon EC2 11-node cluster
Query: “what is the average number of pages a user visits between a page in category ‘X’ and a page in category ‘Y’?” In YSmart JOIN1, AGG1, AGG2, JOIN2 and AGG3 are executed in a single MR job 8.4x 4.8x
54
Facebook Cluster Query: TPC-H Suppliers Who Kept Orders Waiting Query (Q21) 3.3x 3.1x 3.7x
55
Data Explosion in Human Society
[Chart: the global storage capacity (analog vs. digital), 1986 vs. 2007, and the amount of digital information created and replicated in a year. In 1986: analog 2.62 billion GB, digital 0.02 billion GB. In 2007: analog 18.86 billion GB, digital roughly 280 billion GB, of which PC hard disks account for 123 billion GB (44.5%).]
We are entering an era of data explosion. Since 1986, the total global storage capacity has increased tremendously. In 1986, the total digital storage capacity was only 0.02 billion GB, compared with 2.62 billion GB of analog storage. By 2007, however, digital storage had reached around 280 billion GB, of which PC hard disks held 123 billion GB, while analog storage was just under 20 billion GB in that year. According to another report, by 2020 the amount of digital data created and replicated in a single year will reach 35 trillion GB, 44 times as much as in the report's base year. With this data explosion, companies, organizations and research projects need to deal with big data more and more often. Big data is not just about the huge volume of data; it is also about the complexity of the data and of the analysis methods used to process it. Most importantly, big data has the potential to make a big impact.
Source: Exabytes: Documenting the 'digital age' and huge growth in computing capacity, The Washington Post
56
Challenges of Big Data Management and Analytics
Existing DB technology is not prepared for such huge volumes
Storage and system scale (e.g. 70 TB/day and 3000 nodes in Facebook)
Many applications in science, medicine and business follow the same trend
Big data demands more complex analytics than data transactions
Data mining, time series analysis, etc., to gain deep insights
The conventional DB business model is not affordable
Software licenses and maintenance fees, $10,000/TB*
The conventional DB processing model is to “scale up” by upgrading CPU/memory/storage/networks (the HPC model)
The big data processing model is to “scale out”: continuously adding low-cost computing and storage nodes in a distributed way
The MapReduce (MR) programming model has become an effective data processing engine for big data analytics
*: