1
Clydesdale: Structured Data Processing on MapReduce Jackie
2
Runs on unmodified Hadoop
Aims at workloads where the data fits a star schema
Draws on existing techniques: columnar storage, tailored join plans, block iteration
Introduction
3
Background
Clydesdale architecture
Challenges
Experiments
Outline
4
InputFormats and OutputFormats
InputFormat implements two methods: getSplits() and getRecordReader()
MapRunners
Scheduling and JVM reuse
Background
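To make the InputFormat contract concrete, here is a minimal plain-Java analogue of the two methods named above. The interface and class names are hypothetical; the real types live in `org.apache.hadoop.mapred` and additionally take `JobConf` and `Reporter` arguments.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for Hadoop's old-API InputFormat:
// getSplits() partitions the input, getRecordReader() iterates one split.
interface SimpleInputFormat<R> {
    List<List<R>> getSplits(List<R> input, int numSplits);
    Iterator<R> getRecordReader(List<R> split);
}

class RoundRobinInputFormat<R> implements SimpleInputFormat<R> {
    @Override
    public List<List<R>> getSplits(List<R> input, int numSplits) {
        List<List<R>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < input.size(); i++) {
            splits.get(i % numSplits).add(input.get(i)); // deal records round-robin
        }
        return splits;
    }

    @Override
    public Iterator<R> getRecordReader(List<R> split) {
        return split.iterator();
    }
}
```

In real Hadoop the framework calls these two methods to decide task parallelism and to feed records to each map task.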
5
Clydesdale Architecture
6
Avoids I/O for columns that are not used
Stores each column in a separate HDFS file
ColumnInputFormat (CIF) ensures that the different columns of a row are co-located on the same datanode
Columnar Storage
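The column-at-a-time layout can be sketched as follows. This is a hypothetical in-memory model: each column container stands in for the separate HDFS file per column, and a scan touches only the columns the query names, so unused columns cost nothing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of columnar storage: one container per column (standing in for one
// HDFS file per column). scan() materializes only the requested columns.
class ColumnStore {
    private final Map<String, List<Object>> columns = new HashMap<>();
    private int rowCount = 0;

    void addColumn(String name, List<?> values) {
        columns.put(name, new ArrayList<Object>(values));
        rowCount = values.size();
    }

    // Reassemble rows from just the wanted columns; other columns are never read.
    List<Object[]> scan(String... wanted) {
        List<Object[]> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            Object[] row = new Object[wanted.length];
            for (int c = 0; c < wanted.length; c++) {
                row[c] = columns.get(wanted[c]).get(r);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Co-locating a row's column files on the same datanode (what CIF does) is what keeps this row reassembly local, avoiding network reads.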
7
SQL-like structured data processing
The map phase joins the fact table with the dimension tables
The reduce phase performs the grouping and aggregation
Join Strategy
8
Flow of Clydesdale’s Join Job
9
Consider the following query
Examples
10
Map phase
Build a hash table for each dimension table, applying its predicates
Each map task checks whether the input record is in the hash tables
Output the records that satisfy the join conditions
The key is formed from the subset of columns needed for grouping
Reduce phase
Aggregate the values with the same key
Sort at the client
Execution Process
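The steps above can be sketched in plain Java (hypothetical data and names; in the real system the probe loop runs inside map tasks over fact-table splits and the aggregation happens in the reduce phase). A dimension table is pre-filtered into a hash table keyed on the join column; each fact row that hits the hash table emits a (group key, measure) pair, which is then summed per key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of the hash-join + aggregation strategy with a single dimension.
// Fact rows are int[]{dimKey, measure}; dim maps surviving dimension keys to
// their grouping column value.
class StarJoin {
    static Map<String, Long> run(List<int[]> facts,
                                 Map<Integer, String> dim,
                                 Predicate<int[]> factPredicate) {
        Map<String, Long> agg = new HashMap<>();
        for (int[] f : facts) {
            if (!factPredicate.test(f)) continue;     // local predicate on the fact row
            String group = dim.get(f[0]);             // probe the dimension hash table
            if (group == null) continue;              // no match: row fails the join
            agg.merge(group, (long) f[1], Long::sum); // "reduce": sum per group key
        }
        return agg;
    }
}
```

With several dimension tables, the probe is repeated against each hash table and the row survives only if every probe matches.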
11
Pseudocode for the Query
12
Exploit multi-core parallelism
A single map task per node
Uses a custom MapRunner class to run a multi-threaded map task
MultiCIF packs multiple input splits into a single multi-split
Shared across consecutive map tasks that run on the same node
Task scheduling
Block iteration
Optimizing for the Native Implementation
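The multi-threaded map task idea can be illustrated with a plain-Java sketch (hypothetical shape, not Clydesdale's actual MapRunner code): one thread per core pulls splits and probes a single read-only dimension hash table shared by all threads, so the table is built once per node instead of once per map slot.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a multi-threaded map task: a fixed thread pool (one thread per
// available core) processes the packed splits, all sharing one read-only
// dimension hash table. Counts how many fact keys match the dimension.
class MultiThreadedMapRunner {
    static long run(List<List<Integer>> splits, Map<Integer, String> sharedDim, int cores) {
        AtomicLong matches = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (List<Integer> split : splits) {
            pool.submit(() -> {
                for (int key : split) {
                    if (sharedDim.containsKey(key)) {  // probe the shared hash table
                        matches.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return matches.get();
    }
}
```

Because the shared table is read-only during the probe phase, no locking is needed on it; only the output counter is synchronized.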
13
Pseudocode for MapRunner
14
Schedule only one map task from the join job on a given node at a time
Schedule subsequent map tasks on nodes where the dimension hash tables have already been built
Communicate to the map task the number of slots (processor cores) it can use on the node
Task Scheduling
15
Row-at-a-time iteration has high per-row overheads
B-CIF: returns an array of rows over the same input
Block Iteration
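A plain-Java sketch of the block-iteration idea (hypothetical class, not the actual B-CIF code): instead of handing the consumer one row per call, the reader returns an array of rows per call, amortizing the per-row method-call and bookkeeping overhead.

```java
import java.util.Iterator;
import java.util.List;

// Sketch of a block-at-a-time record reader: each next() returns up to
// blockSize rows at once instead of a single row.
class BlockReader implements Iterator<String[]> {
    private final List<String> rows;
    private final int blockSize;
    private int pos = 0;

    BlockReader(List<String> rows, int blockSize) {
        this.rows = rows;
        this.blockSize = blockSize;
    }

    @Override
    public boolean hasNext() {
        return pos < rows.size();
    }

    @Override
    public String[] next() {
        int end = Math.min(pos + blockSize, rows.size());
        String[] block = rows.subList(pos, end).toArray(new String[0]);
        pos = end;
        return block;
    }
}
```

With a block size of a few hundred rows, the per-call overhead is paid once per block rather than once per row, while the consumer loops over a plain array.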
16
Supports two join plans: repartition join and map join
Repartition join is a robust technique that works with any combination of table sizes
Map join is designed for the case where one table is significantly smaller than the other
Hive Background
17
Hive’s Mapjoin plan
18
SQL → Logical Plan → Physical Plan → MR Workflow
A workflow with six jobs
Hive: SQL-Like Language
19
Clusters
Cluster A: 9 nodes, two quad-core processors, 16 GB memory, 8 x 250 GB disks, 1 Gb Ethernet switch
Cluster B: 42 nodes, two quad-core processors, 32 GB memory, 5 x 500 GB disks, 1 Gb Ethernet switch
Clydesdale on Hadoop 0.21; Hive on Hadoop 0.20.2
Workload
Storage format: Clydesdale fact tables stored in MultiCIF; Hive tables in RCFile
Experimental Setup
20
Comparison with Hive
21
Hive joins one dimension table at a time with the fact table
Hive maintains many copies of the hash table
Hive creates the hash tables on a single node and pays the cost of disseminating them to the entire cluster
Each task in Hive has to load and deserialize the hash table when it starts
Result Analysis
22
Analysis of Clydesdale
23
Limitations