Graph Analysis with High Performance Computing by Bruce Hendrickson and Jonathan W. Berry, Sandia National Laboratories. Published in the March/April 2008 issue of Computing in Science and Engineering.

Presentation transcript:

Graph Analysis with High Performance Computing by Bruce Hendrickson and Jonathan W. Berry, Sandia National Laboratories. Published in the March/April 2008 issue of Computing in Science and Engineering. Presented by Darlene Barker, 2/9/11.

Overview
- Explored the use of high-performance computing to run graph algorithms on large, complex graphs.
- Presented the challenges of running graph algorithms with explicit message passing (MPI) on distributed-memory computers.
- Proposed solution: develop graph algorithms on nontraditional, massively multithreaded supercomputers.

Distributed-memory computers
The most popular class of parallel machines, programmed with explicit message passing (MPI).
- The user divides the data among processors and determines which processor performs which task.
- The processors exchange data via user-controlled messages.
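A minimal sketch of this explicit message-passing style (not from the paper; the sizes and the ring-style boundary exchange are illustrative):

```cpp
// Sketch: owner-computes message passing with MPI (compile with mpicxx).
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each rank owns a contiguous block of vertices; the programmer decides
    // the data distribution and which rank performs which work.
    const int n_global = 1000;                     // illustrative problem size
    const int n_local  = n_global / nprocs;
    std::vector<int> local_degree(n_local, 0);     // per-vertex data lives locally

    // Exchange boundary data explicitly: send our last value to the next rank,
    // receive the previous rank's last value (a typical "ghost" exchange).
    int send_val = n_local ? local_degree.back() : 0, recv_val = 0;
    int next = (rank + 1) % nprocs, prev = (rank + nprocs - 1) % nprocs;
    MPI_Sendrecv(&send_val, 1, MPI_INT, next, 0,
                 &recv_val, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    std::printf("rank %d of %d received %d\n", rank, nprocs, recv_val);
    MPI_Finalize();
    return 0;
}
```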

Alternatives to using explicit message passing (MPI) to program distributed-memory parallel computers:
- Partitioned global address-space computing: Unified Parallel C (UPC)
- Shared-memory computing: cache-coherent parallel computers; massively multithreaded architectures
- Software: OpenMP, which is restricted to shared-memory machines but offers some portability
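For contrast with the MPI sketch above, a minimal OpenMP sketch of the shared-memory style (the array and loop are illustrative): every thread sees the same address space, so no explicit messages are needed.

```cpp
// Sketch: shared-memory parallelism with OpenMP (compile with -fopenmp).
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1'000'000;       // illustrative size
    std::vector<double> x(n, 1.0);
    double sum = 0.0;

    // All threads read and write the same memory; the runtime, not the
    // programmer, decides which iterations each thread executes.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(n); ++i)
        sum += x[i];

    std::printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```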

UPC
The number of control threads in a UPC program is constant and is generally equal to the number of processors or cores.

Cache-coherent parallel computers
Global memory is universally accessible to every processor, which presents challenges such as latency:
- Caches provide faster hardware access to memory, but keeping them coherent adds overhead that degrades performance.
- Requires a protocol for thread synchronization and scheduling.

Massively multithreaded architectures
Examples: Cray MTA-2, XMT.
Addresses the latency challenge by ensuring that the processor always has other work to do while waiting for a memory request to be satisfied: when a memory request is issued, the processor immediately switches its attention to another thread that is ready to execute.

Drawbacks
- Custom (rather than commodity) processors are expensive and have a much slower clock rate than mainstream processors.
- The MTA-2's programming model, although simple and elegant, is not portable to other architectures.

To fix the cross-architecture portability problem with the MTA-2 programming model:
- Use generic programming libraries that hide machine-specific details. Generic programming underlies:
  » the C++ Standard Template Library
  » the Boost C++ Libraries
  » the Boost Graph Library (BGL)
- Use the massively multithreaded architecture with an extended subset of the Boost Graph Library.
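A small illustrative example of the generic-programming style: plain Boost Graph Library BFS written once against graph concepts, with the concrete graph type supplied as a template parameter. The toy graph below is made up, and this uses stock BGL rather than the authors' multithreaded extension.

```cpp
// Sketch: generic breadth-first search with the Boost Graph Library.
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/breadth_first_search.hpp>
#include <boost/graph/visitors.hpp>
#include <iostream>
#include <vector>

int main() {
    // breadth_first_search works for any graph type modeling the required
    // concepts; adjacency_list is just one concrete choice.
    using Graph = boost::adjacency_list<boost::vecS, boost::vecS, boost::undirectedS>;
    Graph g(6);
    boost::add_edge(0, 1, g);
    boost::add_edge(1, 2, g);
    boost::add_edge(2, 5, g);
    boost::add_edge(0, 3, g);
    boost::add_edge(3, 4, g);

    // Record hop counts from vertex 0 with a distance-recording visitor.
    std::vector<int> dist(boost::num_vertices(g), 0);
    boost::breadth_first_search(
        g, boost::vertex(0, g),
        boost::visitor(boost::make_bfs_visitor(
            boost::record_distances(dist.data(), boost::on_tree_edge()))));

    for (std::size_t v = 0; v < dist.size(); ++v)
        std::cout << "hops(0 -> " << v << ") = " << dist[v] << "\n";
    return 0;
}
```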

Studied two fundamental graph algorithms on different platforms:
- s-t connectivity: find a path from vertex s to vertex t that traverses the fewest possible edges.
- Single-source shortest paths (SSSP): find the shortest-length path from a specific vertex to all other vertices in the graph.
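A plain C++ sketch of s-t connectivity as defined above: breadth-first search from s, stopping as soon as t is reached, which yields the minimum edge count (SSSP on weighted graphs would use Dijkstra's algorithm instead). The tiny adjacency list is illustrative.

```cpp
// Sketch: s-t connectivity via breadth-first search (fewest edges).
#include <cstdio>
#include <queue>
#include <vector>

// Returns the number of edges on a shortest s-t path, or -1 if no path exists.
int st_distance(const std::vector<std::vector<int>>& adj, int s, int t) {
    std::vector<int> dist(adj.size(), -1);
    std::queue<int> q;
    dist[s] = 0;
    q.push(s);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        if (u == t) return dist[u];          // stop as soon as t is reached
        for (int v : adj[u])
            if (dist[v] == -1) { dist[v] = dist[u] + 1; q.push(v); }
    }
    return -1;
}

int main() {
    // Tiny illustrative graph with edges 0-1, 1-2, 2-4, 0-3.
    std::vector<std::vector<int>> adj = {{1, 3}, {0, 2}, {1, 4}, {0}, {2}};
    std::printf("edges on shortest 0-4 path: %d\n", st_distance(adj, 0, 4));
    return 0;
}
```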

Focused on two different classes of graphs:
- Erdos-Renyi random graphs: constructed by assigning a uniform edge probability to each possible edge and then using a random number generator to determine which edges exist.
- Inverse power-law graphs (R-MAT): constructed by recursively adding adjacencies to a matrix in an intentionally uneven way.
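Sketches of the two constructions as described above; the sizes and quadrant probabilities below are illustrative, not the paper's parameters.

```cpp
// Sketch: Erdos-Renyi G(n, p) and an R-MAT-style edge generator.
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Erdos-Renyi: every possible edge is present independently with probability p.
std::vector<Edge> erdos_renyi(int n, double p, std::mt19937& rng) {
    std::bernoulli_distribution coin(p);
    std::vector<Edge> edges;
    for (int u = 0; u < n; ++u)
        for (int v = u + 1; v < n; ++v)
            if (coin(rng)) edges.push_back({u, v});
    return edges;
}

// R-MAT: place each edge in the adjacency matrix by recursively choosing a
// quadrant with skewed probabilities (a, b, c, d), giving heavy-tailed degrees.
Edge rmat_edge(int log2_n, double a, double b, double c, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    int u = 0, v = 0;
    for (int bit = 0; bit < log2_n; ++bit) {
        double r = uni(rng);
        if (r < a)              { /* top-left quadrant: both bits stay 0 */ }
        else if (r < a + b)     { v |= 1 << bit; }
        else if (r < a + b + c) { u |= 1 << bit; }
        else                    { u |= 1 << bit; v |= 1 << bit; }
    }
    return {u, v};
}

int main() {
    std::mt19937 rng(42);
    auto er = erdos_renyi(100, 0.05, rng);                     // illustrative n, p
    std::vector<Edge> rmat;
    for (int i = 0; i < 500; ++i)                              // illustrative edge count
        rmat.push_back(rmat_edge(7, 0.57, 0.19, 0.19, rng));   // 2^7 vertices
    std::printf("ER edges: %zu, RMAT edges: %zu\n", er.size(), rmat.size());
    return 0;
}
```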

Example of an Erdos-Renyi random graph (illustration only; not taken from the paper).

Results
Only the MTA-2 has a programming model and architecture sufficiently robust to easily test instances of inverse power-law graphs with close to a billion edges.

Challenges for distributed-memory machines
- The standard scientific-computing practice of storing ghost nodes breaks down for high-degree vertices.
- High-degree vertices require very large message buffers.
- Ghost nodes limit memory scalability even as they help runtime scalability.

s-t connectivity results

Type of graph | No. of nodes | Results on # of processors | Note
Erdos-Renyi   | 3.2 billion  | 32,768 processors          | Used the world's largest machine, BlueGene/L.
Erdos-Renyi   | Fixed size   | Speedup of roughly 36 on 450 processors |
Erdos-Renyi   |              | 32,768 processors          | The number of vertices visited should be roughly 177 times larger than for a graph on a single node.

Conclusion - 1
- Unlike most scientific computing kernels, graph algorithms exhibit complex memory access patterns and perform limited amounts of actual processing.
- Performance is determined by the computer's ability to access memory, not by processor speed.
- The authors believe a broad trend exists in the scientific computing community toward increasingly complex and memory-limited simulations.

Conclusion - 2
With current microprocessors having silicon to spare, the authors believe that this space should be used to support massive multithreading, resulting in processors and parallel machines that are applicable to a broader range of problems than current offerings.