1
Clydesdale: Structured Data Processing on MapReduce Jackie
2
Runs on unmodified Hadoop
Aims at workloads where the data fits a star schema
Draws on existing techniques: columnar storage, tailored join plans, block iteration
Introduction
3
Background
Clydesdale architecture
Challenges
Experiments
Outline
4
InputFormats and OutputFormats
InputFormat implements two methods: getSplits() and getRecordReader()
MapRunners
Scheduling and JVM reuse
Background
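To make the InputFormat contract concrete, here is a minimal plain-Java analogue of the two methods named above. The interface and class names are hypothetical; the real types live in `org.apache.hadoop.mapred` and additionally take `JobConf` and `Reporter` arguments.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for Hadoop's old-API InputFormat:
// getSplits() partitions the input, getRecordReader() iterates one split.
interface SimpleInputFormat<R> {
    List<List<R>> getSplits(List<R> input, int numSplits);
    Iterator<R> getRecordReader(List<R> split);
}

class RoundRobinInputFormat<R> implements SimpleInputFormat<R> {
    @Override
    public List<List<R>> getSplits(List<R> input, int numSplits) {
        List<List<R>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < input.size(); i++) {
            splits.get(i % numSplits).add(input.get(i)); // deal records round-robin
        }
        return splits;
    }

    @Override
    public Iterator<R> getRecordReader(List<R> split) {
        return split.iterator();
    }
}
```

In real Hadoop the framework calls these two methods to decide task parallelism and to feed records to each map task.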
5
Clydesdale Architecture
6
Avoids I/O for columns that are not used
Stores each column in a separate HDFS file
ColumnInputFormat (CIF) ensures that the different columns of a row are co-located on the same datanode
Columnar Storage
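The column-at-a-time layout can be sketched as follows. This is a hypothetical in-memory model: each column container stands in for the separate HDFS file per column, and a scan touches only the columns the query names, so unused columns cost nothing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of columnar storage: one container per column (standing in for one
// HDFS file per column). scan() materializes only the requested columns.
class ColumnStore {
    private final Map<String, List<Object>> columns = new HashMap<>();
    private int rowCount = 0;

    void addColumn(String name, List<?> values) {
        columns.put(name, new ArrayList<Object>(values));
        rowCount = values.size();
    }

    // Reassemble rows from just the wanted columns; other columns are never read.
    List<Object[]> scan(String... wanted) {
        List<Object[]> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            Object[] row = new Object[wanted.length];
            for (int c = 0; c < wanted.length; c++) {
                row[c] = columns.get(wanted[c]).get(r);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Co-locating a row's column files on the same datanode (what CIF does) is what keeps this row reassembly local, avoiding network reads.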
7
SQL-like structured data processing
The map phase joins the fact table with the dimension tables
The reduce phase performs the grouping and aggregation
Join Strategy
8
Flow of Clydesdale’s Join Job
9
Consider the following query
Examples
10
Map phase
Build a hash table for each dimension table, applying its predicates
Each map task checks whether the input record is in the hash tables
Output the records that satisfy the join conditions
The key is formed from the subset of columns needed for grouping
Reduce phase
Aggregate the values with the same key
Sort at the client
Execution Process
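The steps above can be sketched in plain Java (hypothetical data and names; in the real system the probe loop runs inside map tasks over fact-table splits and the aggregation happens in the reduce phase). A dimension table is pre-filtered into a hash table keyed on the join column; each fact row that hits the hash table emits a (group key, measure) pair, which is then summed per key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of the hash-join + aggregation strategy with a single dimension.
// Fact rows are int[]{dimKey, measure}; dim maps surviving dimension keys to
// their grouping column value.
class StarJoin {
    static Map<String, Long> run(List<int[]> facts,
                                 Map<Integer, String> dim,
                                 Predicate<int[]> factPredicate) {
        Map<String, Long> agg = new HashMap<>();
        for (int[] f : facts) {
            if (!factPredicate.test(f)) continue;     // local predicate on the fact row
            String group = dim.get(f[0]);             // probe the dimension hash table
            if (group == null) continue;              // no match: row fails the join
            agg.merge(group, (long) f[1], Long::sum); // "reduce": sum per group key
        }
        return agg;
    }
}
```

With several dimension tables, the probe is repeated against each hash table and the row survives only if every probe matches.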
11
Pseudocode for the Query
12
Exploit multi-core parallelism
A single map task per node
Uses a custom MapRunner class to run a multi-threaded map task
MultiCIF packs multiple input splits into a single multi-split
Shared across consecutive map tasks that run on the same node
Task scheduling
Block iteration
Optimizing for the Native Implementation
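The multi-threaded map task idea can be illustrated with a plain-Java sketch (hypothetical shape, not Clydesdale's actual MapRunner code): one thread per core pulls splits and probes a single read-only dimension hash table shared by all threads, so the table is built once per node instead of once per map slot.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a multi-threaded map task: a fixed thread pool (one thread per
// available core) processes the packed splits, all sharing one read-only
// dimension hash table. Counts how many fact keys match the dimension.
class MultiThreadedMapRunner {
    static long run(List<List<Integer>> splits, Map<Integer, String> sharedDim, int cores) {
        AtomicLong matches = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (List<Integer> split : splits) {
            pool.submit(() -> {
                for (int key : split) {
                    if (sharedDim.containsKey(key)) {  // probe the shared hash table
                        matches.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return matches.get();
    }
}
```

Because the shared table is read-only during the probe phase, no locking is needed on it; only the output counter is synchronized.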
13
Pseudocode for MapRunner
14
Schedule only one map task from the join job on a given node at a time
Schedule subsequent map tasks on nodes where the dimension hash tables have already been built
Communicate to the map task the number of slots (processor cores) it can use on the node
Task Scheduling
15
Row-at-a-time iteration has high per-row overheads
B-CIF: returns an array of rows over the same input
Block Iteration
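A plain-Java sketch of the block-iteration idea (hypothetical class, not the actual B-CIF code): instead of handing the consumer one row per call, the reader returns an array of rows per call, amortizing the per-row method-call and bookkeeping overhead.

```java
import java.util.Iterator;
import java.util.List;

// Sketch of a block-at-a-time record reader: each next() returns up to
// blockSize rows at once instead of a single row.
class BlockReader implements Iterator<String[]> {
    private final List<String> rows;
    private final int blockSize;
    private int pos = 0;

    BlockReader(List<String> rows, int blockSize) {
        this.rows = rows;
        this.blockSize = blockSize;
    }

    @Override
    public boolean hasNext() {
        return pos < rows.size();
    }

    @Override
    public String[] next() {
        int end = Math.min(pos + blockSize, rows.size());
        String[] block = rows.subList(pos, end).toArray(new String[0]);
        pos = end;
        return block;
    }
}
```

With a block size of a few hundred rows, the per-call overhead is paid once per block rather than once per row, while the consumer loops over a plain array.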
16
Supports two join plans: repartition join and map join
Repartition join is a robust technique that works with any combination of table sizes
Map join is designed for the case where one table is significantly smaller than the other
Hive Background
17
Hive’s Mapjoin plan
18
SQL → Logical Plan → Physical Plan → MR Workflow
A workflow with six jobs
Hive: SQL-Like Language
19
Clusters
Cluster A: 9 nodes, two quad-core processors, 16 GB memory, 8 x 250 GB disks, 1 Gb Ethernet switch
Cluster B: 42 nodes, two quad-core processors, 32 GB memory, 5 x 500 GB disks, 1 Gb Ethernet switch
Clydesdale on Hadoop 0.21; Hive on Hadoop 0.20.2
Workload
Storage format: Clydesdale fact tables stored in MultiCIF; Hive tables in RCFile
Experimental Setup
20
Comparison with Hive
21
Hive joins one dimension table at a time with the fact table
Hive maintains many copies of the hash table
Hive creates the hash tables on a single node and pays the cost of disseminating them to the entire cluster
Each task in Hive has to load and deserialize the hash table when it starts
Result Analysis
22
Analysis of Clydesdale
23
Limitations