A Peta-Scale Graph Mining System


PEGASUS: A Peta-Scale Graph Mining System – Implementation and Observations Presenter: Vivi Ma

CONTENT Background PEGASUS PageRank Example Performance and Scalability Real World Applications

BACKGROUND What is Graph Mining? Graph mining is an area of data mining that finds patterns, rules, and anomalies in graphs. Typical graph mining tasks include PageRank, diameter estimation, and connected components.

BACKGROUND What is the problem? Why is it important? The volume of available graph data is huge, but existing tools have limited scalability: many rely on a shared-memory model, which limits their ability to handle large disk-resident graphs. Graphs and networks are everywhere, so we need graph mining algorithms that are faster and scale to billions of nodes and edges to tackle real-world applications.

PEGASUS: the first open-source Peta-scale Graph Mining library

PEGASUS Based on Hadoop Handles graphs with billions of nodes and edges Unifies seemingly different graph mining tasks via Generalized Iterated Matrix-Vector multiplication (GIM-V)

PEGASUS Linear runtime in the number of edges Scales up well with the number of available machines A combination of optimizations yields a 5x speedup Analyzed Yahoo's web graph (around 6.7 billion edges)

GIM-V Generalized Iterated Matrix-Vector multiplication Main idea: given an n-by-n matrix M, a vector v of size n, and m_{i,j} denoting the (i,j)-th element of M, GIM-V is built from three operations: combine2: multiply m_{i,j} and v_j CombineAll: sum the n multiplication results for node i Assign: overwrite the previous value of v_i with the new result to obtain v_i'

GIM-V Main idea Operator x_G, where the three functions can be defined arbitrarily: v' = M x_G v, where v_i' = assign(v_i, combineAll_i({x_j | j = 1…n, and x_j = combine2(m_{i,j}, v_j)})) Three functions: combine2(m_{i,j}, v_j), combineAll(x_1, …, x_n), assign(v_i, v_new) Strong connection of GIM-V with SQL
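The operator above can be made concrete with a minimal, single-machine sketch (plain Python, not the Hadoop implementation; the function names `gim_v`, `combine2`, `combine_all`, and `assign` are illustrative, not the library's API):

```python
# One GIM-V iteration: v'_i = assign(v_i, combineAll_i({combine2(m_ij, v_j)}))
def gim_v(M, v, combine2, combine_all, assign):
    """M is a dict {(i, j): m_ij} holding only non-zero entries, v is a list;
    the three GIM-V operations are passed in as plain functions."""
    n = len(v)
    partial = {i: [] for i in range(n)}          # x_j values collected per row i
    for (i, j), m_ij in M.items():               # combine2 on each non-zero entry
        partial[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partial[i])) if partial[i] else v[i]
            for i in range(n)]

# Plain matrix-vector multiplication is the special case:
# combine2 = multiply, combineAll = sum, assign = overwrite.
M = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
v = [1.0, 1.0]
print(gim_v(M, v, lambda m, x: m * x, sum, lambda old, new: new))  # [3.0, 3.0]
```

Swapping in different combine2/combineAll/assign functions turns this same loop into PageRank or connected components, which is the unification the slides describe.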

Example 1. PageRank PageRank: calculate the relative importance of web pages Main idea Formula (from the paper): p = (cE^T + (1-c)U)p, where E is the row-normalized adjacency matrix, U is the uniform matrix, and c is the damping factor

Example 1. PageRank Three operations Combine2(m_{i,j}, v_j) = c x m_{i,j} x v_j CombineAll(x_1, …, x_n) = (1-c)/n + sum of x_j Assign(v_i, v_new) = v_new
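A hedged sketch of one PageRank power-iteration step expressed with these three operations (single-machine Python; `pagerank_step` is an illustrative name, and `c` is the damping factor, `n` the node count):

```python
# One PageRank step via GIM-V:
#   combine2   -> c * m_ij * v_j
#   combineAll -> (1 - c)/n + sum of the partial products
#   assign     -> overwrite with the new value
def pagerank_step(M, v, c=0.85):
    n = len(v)
    new = [(1 - c) / n] * n                 # combineAll's constant term
    for (i, j), m_ij in M.items():          # combine2 on each non-zero entry
        new[i] += c * m_ij * v[j]
    return new                              # assign: overwrite

# Tiny example: page 1 links to page 0. M holds the column-normalized
# adjacency matrix transposed, so m_ij = 1/out_degree(j) when j links to i.
M = {(0, 1): 1.0}
v = [0.5, 0.5]
v = pagerank_step(M, v)
```

After one step, page 0 (which receives a link) ends up with a higher score than page 1, as expected.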

GIM-V Base: Naive Multiplication How can we implement matrix-vector multiplication in MapReduce? Stage 1: performs the combine2 operation by combining columns of the matrix with rows of the vector. Stage 2: combines all partial results from Stage 1 and assigns the new vector to the old vector.
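The two stages can be simulated in plain Python (an illustrative sketch of the MapReduce dataflow, not Hadoop code; `stage1`/`stage2` are hypothetical names):

```python
# Simulation of the two MapReduce stages of GIM-V Base.
from collections import defaultdict

def stage1(M, v):
    # Stage 1 "map": key matrix entries by column j so each can be joined
    # with v_j; each "reducer" then emits the partial product keyed by row i.
    by_col = defaultdict(list)
    for (i, j), m_ij in M.items():
        by_col[j].append((i, m_ij))
    partials = []                              # (i, m_ij * v_j) pairs
    for j, entries in by_col.items():
        for i, m_ij in entries:
            partials.append((i, m_ij * v[j]))  # combine2
    return partials

def stage2(partials, n):
    # Stage 2 "reduce" on row i: combineAll (sum), then assign (overwrite).
    acc = defaultdict(float)
    for i, x in partials:
        acc[i] += x
    return [acc.get(i, 0.0) for i in range(n)]

M = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
v = [1.0, 1.0]
print(stage2(stage1(M, v), len(v)))   # [3.0, 3.0]
```

In real Hadoop the grouping between the two stages happens in the shuffle phase, which is exactly the bottleneck the block optimization below attacks.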

[Figure: Stage 1 and Stage 2 — distribution of work among machines during GIM-V execution]

Example 2. Connected Components Main idea Formula Three operations Combine2(m_{i,j}, v_j) = m_{i,j} x v_j CombineAll(x_1, …, x_n) = MIN{x_j | j = 1…n} Assign(v_i, v_new) = MIN(v_i, v_new)
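A hedged single-machine sketch of this scheme (the paper calls the algorithm HCC; `hcc` below is an illustrative re-implementation, not the Hadoop code): each node repeatedly replaces its component id with the minimum id among itself and its neighbors until nothing changes.

```python
# Connected components via repeated min-propagation (GIM-V with
# combineAll = MIN and assign = MIN).
def hcc(edges, n):
    v = list(range(n))                      # start: component id = node id
    changed = True
    while changed:
        changed = False
        new = v[:]
        for i, j in edges:                  # undirected edge i -- j
            for a, b in ((i, j), (j, i)):
                if v[b] < new[a]:           # combineAll = MIN, assign = MIN
                    new[a] = v[b]
                    changed = True
        v = new
    return v

# Two components: {0, 1, 2} and {3, 4}
print(hcc([(0, 1), (1, 2), (3, 4)], 5))    # [0, 0, 0, 3, 3]
```

Nodes in the same component converge to the same (minimum) id, so component membership can be read off the final vector.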

Fast Algorithm for GIM-V GIM-V BL: Block Multiplication Main idea: group elements of the input matrix into b-by-b blocks; elements of the input vector are likewise divided into blocks of length b

Only blocks with at least one non-zero element are saved to disk. GIM-V BL is more than 5 times faster than GIM-V Base because: (1) the bottleneck of the naive implementation is the grouping stage, implemented by sorting in Hadoop's shuffle stage, and GIM-V BL reduces the number of elements sorted (e.g., 36 elements sorted before vs. 9 now); (2) compression: the data size decreases significantly when edges and vectors are converted to block format, so fewer I/O operations are needed.
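The block encoding itself is simple to sketch (illustrative Python; `to_blocks` is a hypothetical helper, not part of PEGASUS): each non-zero entry is keyed by its block id, and empty blocks are never materialized.

```python
# Group non-zero matrix entries into b-by-b blocks; only blocks with at
# least one non-zero element are stored, shrinking both the number of
# sort keys in the shuffle and the on-disk size.
from collections import defaultdict

def to_blocks(M, b):
    blocks = defaultdict(dict)
    for (i, j), m_ij in M.items():
        # Block id is (i // b, j // b); within-block offset is (i % b, j % b).
        blocks[(i // b, j // b)][(i % b, j % b)] = m_ij
    return dict(blocks)

M = {(0, 0): 1.0, (0, 1): 2.0, (5, 5): 3.0}
blocks = to_blocks(M, b=2)
# Only 2 of the 9 possible 2x2 blocks of this 6x6 matrix are stored.
print(sorted(blocks))    # [(0, 0), (2, 2)]
```

During multiplication, the shuffle then moves and sorts one key per block instead of one key per element, which is where the claimed speedup comes from.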

Fast Algorithm for GIM-V GIM-V CL: Clustered Edges Main idea: clustering is a pre-processing step that can be run only once on the data file and reused in the future. It reduces the number of blocks used, and is only useful when combined with block encoding.

Fast Algorithm for GIM-V GIM-V DI: Diagonal Block Iteration Main idea: reduce the number of iterations required to converge. In HCC (the algorithm for finding connected components), the idea is to multiply diagonal matrix blocks with the corresponding vector blocks repeatedly until the vector's contents stop changing within an iteration.

Performance and Scalability How does the performance of the methods change as we add more machines?

Performance and Scalability GIM-V DI vs. GIM-V BL-CL for HCC

Real World Applications PEGASUS can be useful for finding patterns, outliers, and interesting observations. Connected Components of Real Networks PageRanks of Real Networks Diameter of Real Networks

Interesting discussion question… Standalone alternatives like SNAP, NetworkX, iGraph Distributed alternatives like Pregel, PowerGraph, GraphX It would be interesting to compare PEGASUS against all these alternatives not only on performance but also on the types of problems each can handle

Reference [1] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. PEGASUS: A Peta-Scale Graph Mining System – Implementation and Observations. Proc. Intl. Conf. on Data Mining (ICDM), 2009, pp. 229–238. [2] U Kang. "Mining Tera-Scale Graphs: Theory, Engineering and Discoveries." Diss. Carnegie Mellon U, 2012. Print. [3] Kenneth Shum. "Notes on PageRank Algorithm." http://home.ie.cuhk.edu.hk/~wkshum/papers/pagerank.pdf. N.p., n.d. Web. 20 Sept. 2016.

Thanks! Any questions?