Graph Processing

Recap: data-intensive cloud computing
– Just database management on the cloud
– But scaling it to thousands of nodes
– Handling partial failures gracefully
– Sacrificing strong ACID

Graph processing frameworks (Pregel, …)
– Graph databases on the cloud
– But scaling them to thousands of nodes
– Handling partial failures gracefully
– Sacrificing strong ACID

MapReduce vs Graph Processing

When to use MapReduce vs graph processing frameworks?
– The two models can be reduced to each other
– MR on Pregel: have every vertex send to every other vertex
– Pregel on MR: run one MR job per superstep, similar to PageRank on MR

Use a graph framework for sparse graphs and matrices (a vertex-program sketch follows)
– Communication is limited to a small neighborhood (given a good partitioning)
– Faster than MR's shuffle/sort/reduce cycle
– Local state is kept between iterations (side effects)
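To make the vertex-centric model concrete, here is a minimal single-process sketch of a Pregel-style PageRank vertex program. The class name and the compute(superstep, messages, send) interface are assumptions loosely modeled on the Pregel paper's API, not any real framework's code; the point is that messages travel only along edges and that self.value persists across supersteps without a shuffle.

```python
class PageRankVertex:
    """Hypothetical Pregel-style vertex: state lives on the vertex,
    messages travel only along out-edges (sparse communication)."""

    def __init__(self, vertex_id, out_edges, num_vertices):
        self.id = vertex_id
        self.out_edges = out_edges             # neighbor ids only
        self.num_vertices = num_vertices
        self.value = 1.0 / num_vertices        # initial rank

    def compute(self, superstep, messages, send):
        if superstep > 0:
            # Fold in rank received from in-neighbors; no global shuffle.
            self.value = 0.15 / self.num_vertices + 0.85 * sum(messages)
        if superstep < 30 and self.out_edges:  # fixed iteration budget
            share = self.value / len(self.out_edges)
            for neighbor in self.out_edges:
                send(neighbor, share)          # state carries to next superstep
```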

MapReduce vs Graph Processing (2)

Use MR for all-to-all problems
– Distributed joins (a reduce-side join sketch follows this list)
– When there is little or no state carried between tasks (otherwise it has to be written to disk)

Graph processing sits in a middle ground between message passing (MPI) and data-intensive computing
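Below is a minimal in-process sketch of a reduce-side join, the classic MapReduce formulation of a distributed join. The table names ("users", "orders") and all function names are hypothetical; a real job would distribute map_phase and reduce_phase across workers, with the framework performing the shuffle between them.

```python
from collections import defaultdict
from itertools import product

def map_phase(records):
    """records: (table, key, value) tuples; tag each value with its table."""
    for table, key, value in records:
        yield key, (table, value)

def reduce_phase(key, tagged_values):
    """Emit the cross product of both tables' rows sharing this key."""
    left = [v for t, v in tagged_values if t == "users"]
    right = [v for t, v in tagged_values if t == "orders"]
    for l, r in product(left, right):
        yield key, (l, r)

def run_join(records):
    """Single-process stand-in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for key, tagged in map_phase(records):
        groups[key].append(tagged)
    for key, tagged_values in sorted(groups.items()):
        yield from reduce_phase(key, tagged_values)

records = [("users", 1, "alice"), ("orders", 1, "book"), ("orders", 1, "pen")]
print(list(run_join(records)))  # [(1, ('alice', 'book')), (1, ('alice', 'pen'))]
```

Note that everything reduce needs arrives through the shuffle; no state is carried between iterations, which is exactly why MR fits stateless all-to-all work.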

Pregel vs GraphLab

Distributed system models
– Asynchronous model vs synchronous model
– A well-known tradeoff between the two:
  – Synchronous: concurrency control and failure handling are easy, but performance is poorer
  – Asynchronous: concurrency control and failure handling are hard, but performance is better

Pregel is a synchronous system
– No concurrency control, no consistency worries
– Fault tolerance: checkpoint at each barrier (sketched below)

GraphLab is an asynchronous system
– Consistency of updates is harder (sequential and vertex consistency models)
– Fault tolerance is harder (needs a consistent snapshot)
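The sketch below illustrates why the synchronous model makes fault tolerance easy: a BSP driver loop where a global barrier separates supersteps, so a snapshot taken at the barrier is consistent by construction. The run_bsp name and the vertex interface (compute(superstep, messages, send), as in the PageRank sketch above) are assumptions for illustration, not Pregel's actual implementation.

```python
import copy

def run_bsp(vertices, max_supersteps, checkpoint_every=5):
    inbox = {v.id: [] for v in vertices}
    checkpoint = None
    for superstep in range(max_supersteps):
        outbox = {v.id: [] for v in vertices}

        def send(dst, msg):
            outbox[dst].append(msg)

        for v in vertices:                 # parallel across workers in reality
            v.compute(superstep, inbox[v.id], send)
        # ---- global barrier: every vertex has finished this superstep ----
        if superstep % checkpoint_every == 0:
            # No message is in flight here, so a plain copy is a consistent
            # snapshot (a real system would persist it to stable storage).
            checkpoint = copy.deepcopy(vertices)
        inbox = outbox                     # deliver messages next superstep
    return vertices, checkpoint
```

For example, run_bsp([PageRankVertex(v, out, 3) for v, out in {0: [1], 1: [2], 2: [0]}.items()], 30) runs the earlier vertex sketch on a 3-vertex cycle.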

Pregel vs GraphLab (2)

The synchronous model requires waiting for every node
– Good for problems that require it
– Bad when waiting on stragglers or under load imbalance

The asynchronous model can make faster progress (a scheduler-driven sketch follows)
– Some problems still work in the presence of drift
– Fast, but harder to reason about the computation
– The scheduler can load-balance to deal with load skew
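For contrast, here is a minimal sketch of an asynchronous, scheduler-driven loop in the spirit of GraphLab: vertices are pulled off a work queue and updated against live neighbor state with no barrier, and a vertex is rescheduled only when a value it depends on changes. All names are hypothetical; a real system adds consistency models and a distributed scheduler.

```python
from collections import deque

def run_async(dependents, values, update, tolerance=1e-6):
    """dependents: {v: vertices whose update reads v's value}.
    update(v, values) computes v's new value from current (live) state."""
    queue = deque(values)                  # schedule every vertex once
    scheduled = set(values)
    while queue:                           # no barrier, no waiting on stragglers
        v = queue.popleft()
        scheduled.discard(v)
        old = values[v]
        values[v] = update(v, values)      # may read mid-iteration state: drift
        if abs(values[v] - old) > tolerance:
            for w in dependents[v]:        # reschedule only affected vertices
                if w not in scheduled:
                    scheduled.add(w)
                    queue.append(w)
    return values
```

There is no point at which all vertices sit at the same iteration, which is precisely what makes consistent snapshots and reasoning harder here; on the other hand, swapping the FIFO for a priority queue lets such a scheduler focus work where values are still changing and spread load under skew.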