1 Tree and Graph Processing On Hadoop Ted Malaska
2 Schedule Intro Overview of Hadoop and Eco-System Summarize Tree Rooting MR Overview/Implementation Options HBase Overview/Implementation Options Giraph Overview/Implementation Options Spark Overview/Implementation Options Summary Questions
3 Intro Hi there
4 Overview of Hadoop and Eco-System Search NoSQL Machine Learning LFP RTQ Streaming Ingestion Batch HDFS Security and Access Controls Auditing and Monitoring Map Reduce Pig Crunch Hive Giraph Sqoop Flume Kafka Storm Spark Streaming Spark Impala Mahout Oryx RR Python Streaming SAS HBase Accumulo NFS Search Solr
5 In Scope for Tonight Search NoSQL Machine Learning LFP RTQ Streaming Ingestion Batch HDFS Security and Access Controls Auditing and Monitoring Map Reduce Pig Crunch Hive Giraph Sqoop Flume Kafka Storm Spark Streaming Spark Impala Mahout Oryx RR Python Streaming SAS HBase Accumulo NFS Search Solr
6 Summarize Tree Rooting Basic Tree True Root Leaves Branches Vertex Edge Depth
7 Summarize Tree Rooting More Complex Tree Circular Link Multiple Parents
8 Summarize Tree Rooting Merging Trees Borderline True Graph Problem Multi Rooted Vertex Multi Rooted Vertex True Root
9 Summarize Tree Rooting Know your data
10 Basic Storage Format | Example | | |
11 Preprocessing Terming Data Nodes and edges have data Data has weight Normally linkage information is under 10% of true data size Organize Data by Partitioning
12 Basic Solution Step 1: Identify roots. Echo to all edges. Vertices that receive no echoes are roots. Root the root. Step 2: Walk the tree. Echo from the last newly rooted vertices to all edges. If a vertex is not already rooted, root it. | | | |R: |201|R: |202|R: |R:Null 202|301|R:Null 301|R:Null 101|R: |201|R: |202|R: |R: |301|R: |R:Null 101|R: |201|R: |202|R: |R: |301|R: |R:101
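The two steps above can be sketched in plain Python on a single machine. This is a minimal illustration, not the implementation from the talk: the function name `root_tree` and the adjacency-dict input are assumptions, and the sample ids (101, 201, 202, 301) are only loosely borrowed from the example data on the slide.

```python
def root_tree(edges):
    """Return {vertex: root_id}, or None for vertices no root can reach.

    edges maps a vertex id to the list of vertex ids it points at
    (hypothetical input format chosen for this sketch).
    """
    vertices = set(edges)
    for children in edges.values():
        vertices.update(children)

    # Step 1: every vertex echoes to all of its edges.
    echoed = set()
    for children in edges.values():
        echoed.update(children)

    # Vertices that received no echo are roots; each root roots itself.
    root_of = {v: None for v in vertices}
    frontier = sorted(vertices - echoed)
    for v in frontier:
        root_of[v] = v

    # Step 2: echo from the newly rooted vertices until nothing new is rooted.
    while frontier:
        next_frontier = []
        for v in frontier:
            for child in edges.get(v, ()):
                if root_of[child] is None:  # skip vertices already rooted
                    root_of[child] = root_of[v]
                    next_frontier.append(child)
        frontier = next_frontier
    return root_of


if __name__ == "__main__":
    print(root_tree({101: [201, 202], 202: [301]}))
```

The "already rooted" check is what keeps circular links from looping forever and makes a vertex with multiple parents keep the first root that reaches it. A cycle with no true root is never reached and stays `None`.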
13 Map Reduce Massively parallel processing on Hadoop Based on Google's 2004 MapReduce white paper Able to process PBs of data
14 Map Reduce Data Blocks Mapper Sort & Shuffle Reducer Data Blocks
15 Map Reduce Self Joins Always dumping two outputs: Newly Rooted Still Un-Rooted All Data Un-Rooted Newly Rooted Un-Rooted Newly Rooted Old Rooted 0 MR - Stage0 Root Identifying MR - Stage0 Root Identifying MR – Stage1 Rooting MR – Stage1 Rooting Un-Rooted Newly Rooted Old Rooted 0 MR – Stage2 Rooting MR – Stage2 Rooting Old Rooted 1
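One way to picture a single rooting stage is as a map, shuffle, and reduce over the two inputs, simulated here in plain Python rather than on Hadoop. This is a sketch under assumptions: the record layouts, the `"ROOT"`/`"VERTEX"` tags, and all function names are hypothetical, and the input is taken to be the output of Stage 0 (roots already identified).

```python
from collections import defaultdict


def map_stage(newly_rooted, unrooted):
    """Mapper: emit (key, value) pairs keyed by the vertex they concern."""
    for _vertex, (edges, root) in newly_rooted.items():
        for child in edges:
            yield child, ("ROOT", root)   # echo the root down each edge
    for vertex, edges in unrooted.items():
        yield vertex, ("VERTEX", edges)   # carry the un-rooted record itself


def shuffle(pairs):
    """Sort & shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(groups):
    """Reducer: the self join. Always produces two outputs."""
    newly_rooted, still_unrooted = {}, {}
    for vertex in sorted(groups):
        values = groups[vertex]
        edges = next((v[1] for v in values if v[0] == "VERTEX"), None)
        roots = sorted(v[1] for v in values if v[0] == "ROOT")
        if edges is None:
            continue  # echo hit a vertex that is not in the un-rooted set
        if roots:
            newly_rooted[vertex] = (edges, roots[0])
        else:
            still_unrooted[vertex] = edges
    return newly_rooted, still_unrooted


def run_stages(newly_rooted, unrooted):
    """Driver: one MR job per stage until no vertex is newly rooted."""
    old_rooted = []
    while newly_rooted:
        old_rooted.append(newly_rooted)  # becomes "Old Rooted N"
        pairs = map_stage(newly_rooted, unrooted)
        newly_rooted, unrooted = reduce_stage(shuffle(pairs))
    return old_rooted, unrooted


if __name__ == "__main__":
    stage0_rooted = {101: ([201, 202], 101)}
    stage0_unrooted = {201: [], 202: [301], 301: []}
    print(run_stages(stage0_rooted, stage0_unrooted))
```

Each pass through the loop stands in for one full MR job, which is why the number of jobs grows with tree depth and the whole un-rooted set is re-read every stage.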
16 Map Reduce Great for large batch operations No memory limit Not good at iterations
17 HBase Largest and most used NoSQL implementation in the world Based on Google's 2006 BigTable white paper Imagine it as a giant HashMap of keys and values Handles 100k operations a second, even on a small 10-node cluster
18 HBase Getting Client HBase Master HBase Region Server Block Cache
19 HBase Putting Client HBase Master HBase Region Server WAL MemStore HFile WAL MemStore WAL MemStore
20 HBase Good for graph traversing Bad for large batch processing Scan rate about 8x slower than HDFS Good for end of a long tail
21 Giraph System built for large batch graph processing Based on the Pregel 2009 white paper Hardened by LinkedIn and Facebook Reported to handle up to a trillion edges
22 Giraph Loading Data Blocks Worker Master
23 Giraph (Bulk Synchronous Parallel) Worker Local vertex computing Communication Barrier synchronization Local vertex computing
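The Bulk Synchronous Parallel cycle (local vertex computing, communication, barrier synchronization) can be imitated in a few lines of single-process Python. This is only a sketch of the model, not Giraph's API: `run_bsp`, its arguments, and the message format are assumptions, and every vertex is assumed to appear as a key in `edges`.

```python
def run_bsp(edges, roots, max_supersteps=50):
    """Propagate root ids through the graph in BSP supersteps.

    edges: {vertex: [neighbor, ...]}, every vertex present as a key.
    roots: vertices already identified as roots.
    Returns ({vertex: root_id or None}, number_of_supersteps_run).
    """
    value = {v: None for v in edges}
    inbox = {r: [r] for r in roots}  # superstep 0: each root gets its own id
    supersteps = 0
    while inbox and supersteps < max_supersteps:
        outbox = {}
        # Local vertex computing: a vertex only sees its own messages.
        for vertex, messages in inbox.items():
            if value[vertex] is not None:
                continue  # already rooted, so it stays halted
            value[vertex] = min(messages)
            # Communication: messages are queued, not delivered yet.
            for neighbor in edges[vertex]:
                outbox.setdefault(neighbor, []).append(value[vertex])
        # Barrier synchronization: delivery happens only after every
        # vertex has finished computing for this superstep.
        inbox = outbox
        supersteps += 1
    return value, supersteps


if __name__ == "__main__":
    tree = {101: [201, 202], 201: [], 202: [301], 301: []}
    print(run_bsp(tree, roots=[101]))
```

The run ends when no messages are in flight, which mirrors how a Pregel-style job finishes once every vertex has halted. Compared with chained MR jobs, the graph stays in memory between supersteps instead of being rewritten to disk each stage.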
24 Giraph Most mature bulk graph processing out there Of all the solutions, most graph focused
25 Spark At Berkeley, around 2011, some asked if we could do better than MR Take advantage of lower cost memory Building on everything before
26 Spark Worker DAG Scheduler (like a queue planner) DAG Scheduler (like a queue planner) Spark Worker RDD Objects Task Threads Block Manager rdd1.join(rdd2).groupBy(…).filter(…) Task Scheduler Threads Block Manager Cluster Manager Cluster Manager
27 Spark Implementations Onion MR approach with basic Spark Pregel approach with Bagel or GraphX Bagel is a façade over generic Spark functionality GraphX is an effort to extend Spark Less code Learning curve It's raw and will be changing a lot in the next year