1 Tree and Graph Processing On Hadoop Ted Malaska.

2 Schedule: Intro; Overview of Hadoop and Eco-System; Summarize Tree Rooting; MR Overview/Implementation Options; HBase Overview/Implementation Options; Giraph Overview/Implementation Options; Spark Overview/Implementation Options; Summary; Questions

3 Intro Hi there

4 Overview of Hadoop and Eco-System [Diagram labels: Search, NoSQL, Machine Learning, LFP, RTQ, Streaming, Ingestion, Batch, HDFS, Security and Access Controls, Auditing and Monitoring, Map Reduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, Storm, Spark Streaming, Spark, Impala, Mahout, Oryx, R, Python, Streaming, SAS, HBase, Accumulo, NFS, Search, SolR]

5 In Scope for Tonight [Ecosystem diagram repeated from slide 4]

6 Summarize Tree Rooting: Basic Tree [Diagram: tree with vertices labeled by depth, 0 to 3. Callouts: True Root, Leaves, Branches, Vertex, Edge, Depth]

7 Summarize Tree Rooting: More Complex Tree [Diagram: tree with vertices labeled by depth, 0 to 3. Callouts: Circular Link, Multiple Parents]

8 Summarize Tree Rooting: Merging Trees. Borderline True Graph Problem [Diagram: merged trees with vertices labeled by depth. Callouts: Multi Rooted Vertex (twice), True Root]

9 Summarize Tree Rooting Know your data

10 Basic Storage Format (pipe-delimited). Example records: 101; 101|201; 101|202; 201; 202|301; 301
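The example records can be loaded into an adjacency map with a few lines of Python. This is a minimal sketch that assumes each record is either a bare vertex ID or `vertex|neighbor`; that reading is an interpretation of the example, not something the slide states, and `parse_records` is a hypothetical helper name.

```python
# Example records from the slide. Reading them as "vertex" or
# "vertex|neighbor" is an assumption about the format.
RECORDS = ["101", "101|201", "101|202", "201", "202|301", "301"]

def parse_records(records):
    """Build an adjacency map {vertex: [neighbors]} from pipe-delimited records."""
    adjacency = {}
    for record in records:
        vertex, *neighbors = record.split("|")
        adjacency.setdefault(vertex, [])        # keep vertices that have no edges
        for neighbor in neighbors:
            adjacency.setdefault(neighbor, [])  # the far end of an edge is a vertex too
            adjacency[vertex].append(neighbor)
    return adjacency

adjacency = parse_records(RECORDS)
```

Keeping only IDs and edges in this structure matches the idea of processing linkage separately from the heavier payload data.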

11 Preprocessing: Trimming Data. Nodes and edges have data. Data has weight. Normally linkage information is under 10% of true data size. Organize data by partitioning.

12 Basic Solution
Step 1: Identify roots. Echo to all edges. Vertices that receive no echoes are roots. Root the root.
Step 2: Walk the tree. Echo from the last newly rooted vertex to all edges. If a vertex is not already rooted, then root it.
Example records: 101; 101|201; 101|202; 201; 202|301; 301
After Step 1: 101|R:101; 101|201|R:101; 101|202|R:101; 201|R:Null; 202|301|R:Null; 301|R:Null
After the first pass of Step 2: 101|R:101; 101|201|R:101; 101|202|R:101; 201|R:101; 202|301|R:101; 301|R:Null
After the second pass of Step 2: 101|R:101; 101|201|R:101; 101|202|R:101; 201|R:101; 202|301|R:101; 301|R:101
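The two steps can be sketched on a single machine in Python. This is an illustrative in-memory version, not the talk's implementation; `root_tree` and the `{vertex: [children]}` input shape are assumptions made for the example.

```python
def root_tree(adjacency):
    """Return {vertex: root} using the two echo steps, adjacency = {vertex: [children]}."""
    # Step 1: every vertex echoes along its edges; vertices that hear nothing are roots.
    echoed = {child for children in adjacency.values() for child in children}
    roots = {v: v for v in adjacency if v not in echoed}  # "root the root"

    # Step 2: echo from the newly rooted vertices until nothing new gets rooted.
    frontier = list(roots)
    while frontier:
        next_frontier = []
        for vertex in frontier:
            for child in adjacency[vertex]:
                if child not in roots:  # only root vertices that are still un-rooted
                    roots[child] = roots[vertex]
                    next_frontier.append(child)
        frontier = next_frontier
    return roots

ADJACENCY = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
rooted = root_tree(ADJACENCY)
```

Because a vertex is rooted only once, a circular link cannot cause an endless loop. A component with no true root, such as a pure cycle, is simply left un-rooted.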

13 Map Reduce: Massively parallel processing on Hadoop. Based on Google's 2004 MapReduce white paper. Able to process petabytes of data.

14 Map Reduce [Diagram labels: Data Blocks, Mapper, Sort & Shuffle, Mapper, Data Blocks]

15 Map Reduce: Self joins. Always dumping two outputs: Newly Rooted and Still Un-Rooted. [Diagram labels: All Data; MR Stage0 (Root Identifying); MR Stage1 (Rooting); MR Stage2 (Rooting); data sets Un-Rooted, Newly Rooted, Old Rooted 0, Old Rooted 1]
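The staged self-join can be imitated in plain Python to show how each stage splits its output in two. This is a sketch under stated assumptions: `shuffle`, `stage0`, `rooting_stage`, `run_stages` and the `"SELF"`/`"ECHO"`/`"ROOT"` tags are hypothetical names, and no Hadoop API is used.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key, standing in for sort & shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def stage0(vertices):
    """Root identifying: split {vertex: [children]} into newly rooted and un-rooted."""
    pairs = []
    for vertex, children in vertices.items():
        pairs.append((vertex, ("SELF", children)))
        pairs.extend((child, ("ECHO", vertex)) for child in children)
    newly_rooted, unrooted = {}, {}
    for vertex, values in shuffle(pairs).items():
        children = next((v[1] for v in values if v[0] == "SELF"), [])
        if any(v[0] == "ECHO" for v in values):
            unrooted[vertex] = children
        else:
            newly_rooted[vertex] = (vertex, children)  # no echo received: a root
    return newly_rooted, unrooted

def rooting_stage(newly_rooted, unrooted):
    """One rooting stage: join the newly rooted set against the un-rooted set."""
    pairs = []
    for _, (root, children) in newly_rooted.items():
        pairs.extend((child, ("ROOT", root)) for child in children)
    for vertex, children in unrooted.items():
        pairs.append((vertex, ("SELF", children)))
    next_new, still_unrooted = {}, {}
    for vertex, values in shuffle(pairs).items():
        selfs = [v[1] for v in values if v[0] == "SELF"]
        roots = [v[1] for v in values if v[0] == "ROOT"]
        if not selfs:
            continue  # already rooted in an earlier stage; drop the echo
        if roots:
            next_new[vertex] = (roots[0], selfs[0])
        else:
            still_unrooted[vertex] = selfs[0]
    return next_new, still_unrooted

def run_stages(vertices):
    """Chain stages until a stage produces no newly rooted vertices."""
    newly, unrooted = stage0(vertices)
    old_rooted = {}
    while newly:
        old_rooted.update({v: root for v, (root, _) in newly.items()})
        newly, unrooted = rooting_stage(newly, unrooted)
    return old_rooted, unrooted

VERTICES = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
rooted, leftover = run_stages(VERTICES)
```

Each stage reads only the previous stage's newly rooted and un-rooted outputs, so the old rooted data never has to be rescanned. The cost is one full job per level of tree depth.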

16 Map Reduce: Great for large batch operations. No memory limit. Not good at iterations.

17 HBase: Largest and most used NoSQL implementation in the world. Based on Google's 2006 BigTable white paper. Imagine it like a giant HashMap with keys and values. Handles on the order of 100k operations a second on even a small 10-node cluster.

18 HBase: Getting [Diagram labels: Client, HBase Master, HBase Region Server, Block Cache]

19 HBase: Putting [Diagram labels: Client, HBase Master, HBase Region Server, WAL, MemStore, HFile]

20 HBase: Good for graph traversing. Bad for large batch processing. Scan rate about 8x slower than HDFS. Good for the end of a long tail.
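Traversal by point lookups can be sketched with a plain dict standing in for an HBase table, in the spirit of the "giant HashMap" picture. The row layout (vertex ID mapped to parent ID) and the name `find_root` are hypothetical, and no HBase client API is shown.

```python
# A dict stands in for an HBase table; the layout "vertex ID -> parent ID"
# is an assumption made for this example.
PARENT_TABLE = {"201": "101", "202": "101", "301": "202"}

def find_root(table, vertex):
    """Follow parent pointers with one point lookup per hop."""
    seen = set()
    while vertex in table and vertex not in seen:
        seen.add(vertex)        # guard against circular links
        vertex = table[vertex]  # one "get" per hop
    return vertex

root = find_root(PARENT_TABLE, "301")
```

A walk like this costs one lookup per level, which suits a small number of stragglers better than another full batch pass.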

21 Giraph: System built for large batch graph processing. Based on the 2009 Pregel white paper. Hardened by LinkedIn and Facebook. Recorded to handle up to a trillion edges.

22 Giraph: Loading [Diagram labels: Data Blocks, Worker, Master]

23 Giraph (Bulk Synchronous Parallel) [Diagram labels: Worker; Local vertex computing, Communication, Barrier synchronization, Local vertex computing]
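The superstep cycle of local computing, communication and barrier can be imitated in Python for the rooting problem. This is a single-process sketch, not Giraph code: `bsp_root` is a hypothetical name, and choosing the smallest root ID for a multi-rooted vertex is an assumption made only to keep the result deterministic.

```python
def bsp_root(adjacency):
    """Vertex-centric rooting in supersteps, adjacency = {vertex: [children]}."""
    # Superstep 0: every vertex sends its own ID along its edges.
    inbox = {v: [] for v in adjacency}
    for vertex, children in adjacency.items():
        for child in children:
            inbox[child].append(vertex)
    # Barrier: vertices that received nothing are roots.
    root_of = {v: v for v, messages in inbox.items() if not messages}
    active = set(root_of)

    while active:
        # Local computing and communication: active vertices send their root onward.
        outbox = {v: [] for v in adjacency}
        for vertex in active:
            for child in adjacency[vertex]:
                outbox[child].append(root_of[vertex])
        # Barrier: messages become visible only after every vertex has finished.
        active = set()
        for vertex, messages in outbox.items():
            if messages and vertex not in root_of:
                root_of[vertex] = min(messages)  # tie-break for multi-rooted vertices
                active.add(vertex)
    return root_of

ADJACENCY = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
rooted = bsp_root(ADJACENCY)
```

Unlike the chained batch jobs, the graph stays in memory between supersteps, and only the vertices rooted in the previous superstep do any work.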

24 Giraph: Most mature bulk graph processing out there. Of all the solutions, the most graph focused.

25 Spark: At Berkeley around 2011, some asked if we could do better than MR. Take advantage of lower-cost memory. Building on everything before.

26 Spark [Diagram labels: Spark Worker; RDD Objects, e.g. rdd1.join(rdd2).groupBy(…).filter(…); DAG Scheduler (like a queue planner); Task Scheduler; Cluster Manager; Task Threads; Block Manager]

27 Spark Implementations: Onion MR approach with basic Spark. Pregel approach with Bagel or GraphX. Bagel is a façade over generic Spark functionality. GraphX is an effort to extend Spark. Less code. Learning curve. It's raw and will be changing a lot in the next year.
