Clydesdale: Structured Data Processing on MapReduce
Jackie
Introduction
- Runs on unmodified Hadoop
- Aims at workloads where the data fits a star schema
- Draws on existing techniques: columnar storage, tailored join plans, block iteration
Outline
- Background
- Clydesdale architecture
- Challenges
- Experiments
Background
- InputFormats and OutputFormats
  - InputFormat implements two methods: getSplits() and getRecordReader() (see the sketch below)
- MapRunners
- Scheduling and JVM reuse
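For concreteness, a minimal sketch of that contract on the classic org.apache.hadoop.mapred API (the era of the Hadoop versions used later in this deck); the class itself is illustrative, only the two methods are the point:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Minimal InputFormat sketch: getSplits() carves the input into
    // InputSplits, getRecordReader() turns one split into a record stream.
    public class SketchInputFormat extends FileInputFormat<LongWritable, Text> {

      @Override
      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // Reuse FileInputFormat's block-aligned splitting; a columnar format
        // like Clydesdale's would instead emit splits over column files.
        return super.getSplits(job, numSplits);
      }

      @Override
      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        // Delegate to the standard line reader for illustration.
        return new LineRecordReader(job, (FileSplit) split);
      }
    }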
Clydesdale Architecture
Columnar Storage
- Avoid I/O for columns that are not used
- Store each column in a separate HDFS file (see the read sketch below)
- ColumnInputFormat (CIF) ensures that the different columns of a row are co-located on the same datanode
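A minimal sketch of the access pattern a column-per-file layout enables; the paths and the one-value-per-line encoding are hypothetical, not Clydesdale's actual CIF format:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reads only the two columns a query touches; the other column files
    // are never opened, so their bytes are never read off disk.
    public class ColumnScan {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical layout: one HDFS file per column, one value per line.
        try (BufferedReader price = new BufferedReader(
                 new InputStreamReader(fs.open(new Path("/fact/lo_extendedprice"))));
             BufferedReader discount = new BufferedReader(
                 new InputStreamReader(fs.open(new Path("/fact/lo_discount"))))) {
          String p, d;
          long sum = 0;
          // Advance both readers in lockstep: line i of each file is row i.
          while ((p = price.readLine()) != null && (d = discount.readLine()) != null) {
            sum += Long.parseLong(p) * Long.parseLong(d);
          }
          System.out.println("sum = " + sum);
        }
      }
    }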
Join Strategy
- SQL-like structured data processing
- The map phase joins the fact table with the dimension tables
- The reduce phase performs the grouping and aggregation
Flow of Clydesdale’s Join Job
Examples
- Consider the following query
Execution Process
- Map phase
  - Build a hash table for each dimension table, applying that table's predicates
  - Each map task probes the hash tables with incoming fact-table records
  - Output the records that satisfy the join conditions
  - The key is formed from the subset of columns needed for grouping
- Reduce phase
  - Aggregate the values that share the same key
- Sort at the client
Pseudocode for the Query
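A minimal sketch of what such a map task could look like, assuming a star-join query that filters two dimensions and aggregates one measure; loadDimension() and all column names are hypothetical, not Clydesdale's actual code:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Sketch of a Clydesdale-style map-side star join: probe per-dimension
    // hash tables with each fact row and emit (grouping columns, measure)
    // pairs for the reduce phase to aggregate.
    public class StarJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      // Hypothetical: dimension key -> attribute needed for grouping.
      private Map<Long, String> dateDim;
      private Map<Long, String> customerDim;

      @Override
      public void configure(JobConf job) {
        // Build a hash table per dimension table, applying its predicates
        // while loading (loadDimension is a hypothetical helper).
        dateDim = loadDimension(job, "date", "d_year = 1997");
        customerDim = loadDimension(job, "customer", "c_region = 'ASIA'");
      }

      @Override
      public void map(LongWritable offset, Text factRow,
                      OutputCollector<Text, LongWritable> out, Reporter reporter)
          throws IOException {
        // Hypothetical fact-row encoding: dateKey,custKey,revenue
        String[] f = factRow.toString().split(",");
        String year = dateDim.get(Long.parseLong(f[0]));
        String region = customerDim.get(Long.parseLong(f[1]));
        if (year == null || region == null) return;  // failed a join/predicate
        // Key = grouping columns; value = the measure to aggregate.
        out.collect(new Text(year + "," + region),
                    new LongWritable(Long.parseLong(f[2])));
      }

      private Map<Long, String> loadDimension(JobConf job, String table, String predicate) {
        // Placeholder: a real job would scan the dimension table from HDFS,
        // filter by the predicate, and keep only the needed columns.
        return new HashMap<Long, String>();
      }
    }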
Optimizing for the Native Implementation
- Exploit multi-core parallelism
  - A single map task per node
  - A custom MapRunner class runs a multi-threaded map task
- MultiCIF packs multiple input splits into a single multi-split
- Dimension hash tables are shared across consecutive map tasks that run on the same node
- Task scheduling
- Block iteration
Pseudocode for MapRunner
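A minimal sketch of a multi-threaded MapRunnable in the spirit of this slide; the thread count, configuration key, and locking discipline are assumptions, not Clydesdale's actual implementation (Hadoop's own lib.MultithreadedMapRunner also guards the OutputCollector, omitted here for brevity):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.ReflectionUtils;

    // Sketch: one map task drives all cores on its node by running the user
    // map function on several threads over a single shared RecordReader.
    public class MultiThreadedMapRunner
        implements MapRunnable<LongWritable, Text, Text, LongWritable> {

      private JobConf job;
      private int numThreads;

      @Override
      public void configure(JobConf job) {
        this.job = job;
        // Hypothetical knob: how many cores the scheduler granted this task.
        this.numThreads = job.getInt("clydesdale.map.threads", 8);
      }

      @Override
      @SuppressWarnings("unchecked")
      public void run(final RecordReader<LongWritable, Text> input,
                      final OutputCollector<Text, LongWritable> output,
                      final Reporter reporter) throws IOException {
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
          workers[i] = new Thread(() -> {
            // One mapper instance per thread; the reader is shared.
            Mapper<LongWritable, Text, Text, LongWritable> mapper =
                ReflectionUtils.newInstance(job.getMapperClass(), job);
            LongWritable key = input.createKey();
            Text value = input.createValue();
            try {
              while (true) {
                // The reader is not thread-safe: fetch a row under a lock,
                // then run the CPU-heavy map logic outside it.
                synchronized (input) {
                  if (!input.next(key, value)) break;
                }
                mapper.map(key, value, output, reporter);
              }
            } catch (IOException e) {
              throw new RuntimeException(e);
            }
          });
          workers[i].start();
        }
        for (Thread w : workers) {
          try {
            w.join();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      }
    }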
Task Scheduling
- Schedule only one map task from the join job on a given node at a time
- Schedule subsequent map tasks on nodes where the dimension hash tables have already been built (see the sketch below)
- Communicate to each map task the number of slots, i.e., processor cores, it can use on its node
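Why this placement pays off: with Hadoop's JVM reuse (mapred.job.reuse.jvm.num.tasks), static state survives across consecutive map tasks running in the same task JVM, so a per-JVM cache keeps the dimension hash tables warm. A sketch of that idea, with hypothetical names:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: per-JVM cache of dimension hash tables. With JVM reuse
    // enabled, consecutive map tasks on a node share one JVM, so this
    // static map survives between tasks and each dimension table is
    // loaded from HDFS only once per node.
    public final class DimensionCache {

      private static final Map<String, Map<Long, String>> CACHE =
          new ConcurrentHashMap<>();

      private DimensionCache() {}

      public static Map<Long, String> get(JobConf job, String table) {
        // First task on the node builds the hash table; later tasks
        // scheduled on the same node reuse it for free.
        return CACHE.computeIfAbsent(table, t -> load(job, t));
      }

      private static Map<Long, String> load(JobConf job, String table) {
        // Placeholder: scan the dimension table from HDFS, apply its
        // predicates, and keep only the columns the query needs.
        return new ConcurrentHashMap<>();
      }
    }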
Block Iteration
- Per-row iteration incurs high overheads
- B-CIF: return an array of rows over the same input (see the sketch below)
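A minimal sketch of the amortization idea, batching an ordinary per-row reader; class and method names are illustrative, not the actual B-CIF interface:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.LineRecordReader;

    // Sketch of block iteration: instead of one next() call per row, hand
    // the map logic an array of rows, so the per-call overhead (virtual
    // dispatch, progress reporting, bounds checks) is paid once per block.
    public class BlockReader {

      private final LineRecordReader rows;   // underlying per-row reader
      private final int blockSize;

      public BlockReader(LineRecordReader rows, int blockSize) {
        this.rows = rows;
        this.blockSize = blockSize;
      }

      /** Returns the next block of rows; an empty list signals end of input. */
      public List<String> nextBlock() throws IOException {
        List<String> block = new ArrayList<>(blockSize);
        LongWritable key = rows.createKey();
        Text value = rows.createValue();
        while (block.size() < blockSize && rows.next(key, value)) {
          block.add(value.toString());
        }
        return block;
      }
    }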
Hive Background
- Hive supports two join plans: repartition join and mapjoin
- Repartition join is a robust technique that works with any combination of table sizes
- Mapjoin is designed for the case where one table is significantly smaller than the other
Hive’s Mapjoin plan
Hive: SQL-Like Language
- SQL -> logical plan -> physical plan -> MapReduce workflow
- Workflow with six jobs
Experimental Setup
- Cluster A: 9 nodes, two quad-core processors, 16 GB memory, 8 x 250 GB disks, 1 Gb Ethernet switch
- Cluster B: 42 nodes, two quad-core processors, 32 GB memory, 5 x 500 GB disks, 1 Gb Ethernet switch
- Software: Clydesdale on Hadoop 0.21; Hive on Hadoop
- Workload
- Storage format: Clydesdale fact tables stored in MultiCIF; Hive uses RCFile
Comparison with Hive
Result Analysis
- Hive joins one dimension table at a time with the fact table
- Hive maintains many copies of the hash table
- Hive builds the hash tables on a single node and pays the cost of disseminating them to the entire cluster
- Each Hive task has to load and deserialize the hash table when it starts
Analysis of Clydesdale
Limitations