Global Trees: A Framework for Linked Data Structures on Distributed Memory Parallel Systems. D. Brian Larkins, James Dinan, Sriram Krishnamoorthy, Srinivasan Parthasarathy, Atanas Rountev, P. Sadayappan.

Presentation transcript:

Global Trees: A Framework for Linked Data Structures on Distributed Memory Parallel Systems. D. Brian Larkins, James Dinan, Sriram Krishnamoorthy, Srinivasan Parthasarathy, Atanas Rountev, P. Sadayappan

Background
- Trees and graphs can concisely represent relationships between data
- Data sets are becoming increasingly large and can require compute-intensive processing
- Developing efficient, memory hierarchy-aware applications is hard

Sample Applications
- n-body simulation
- Fast Multipole Methods (FMM)
- multiresolution analysis
- clustering and classification
- frequent pattern mining

Key Contributions
- Efficient fine-grained data access with a global view of data
- Exploits the linked structure to provide fast global pointer dereferencing
- High-level, locality-aware, parallel operations on linked data structures
- Application-driven customization
- Empirical validation of the approach

Framework Design

Global Chunk Layer (GCL)
- API and run-time library for managing chunks, built on ARMCI
- Abstracts common functionality for handling irregular, linked data
- Provides a global namespace with access and modification operations
- Extensible and highly customizable to maximize functionality and performance

Chunks
A chunk is:
- Contiguous memory segment
- Globally accessible
- Physically local to only one process
- Collection of user-defined elements
- Unit of data transfer
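To make the chunk abstraction concrete, here is a minimal C sketch of what a chunk descriptor could look like; the gcl_chunk_t type and all field names are illustrative assumptions, not the actual GCL definitions.

```c
#include <stddef.h>

/* Hypothetical chunk descriptor; the type and field names are illustrative
 * assumptions, not the actual GCL definitions. */
typedef struct gcl_chunk {
    int    owner_rank;  /* process that physically holds the chunk's memory  */
    size_t elem_size;   /* size of one user-defined element                  */
    size_t num_elems;   /* capacity of the chunk, in elements                */
    size_t next_free;   /* next unallocated element slot                     */
    void  *base;        /* contiguous local buffer (valid only on the owner) */
} gcl_chunk_t;
```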

Programming Model
- SPMD with MIMD-style parallelism
- Global pointers permit fine-grained access
- Chunks allow coarse-grained data movement
- Uses get/compute/put model for globally shared data access
- Provides both uniform global view and chunked global view of data
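The get/compute/put idiom can be sketched as follows. Every name here (global_ptr_t, my_node_t, gcl_get(), gcl_put()) is a hypothetical placeholder standing in for the runtime's actual types and one-sided transfer calls, which GCL layers over ARMCI.

```c
#include <stddef.h>

/* Hedged sketch of the get/compute/put idiom; every name here is a
 * hypothetical placeholder, not the actual GCL API. */
typedef struct { int ci, no; } global_ptr_t;     /* (chunk index, node offset)   */
typedef struct { double value; } my_node_t;      /* example user-defined element */

extern void gcl_get(global_ptr_t gp, void *buf, size_t n);        /* one-sided read  */
extern void gcl_put(global_ptr_t gp, const void *buf, size_t n);  /* one-sided write */

void update_node(global_ptr_t gp) {
    my_node_t local;
    gcl_get(gp, &local, sizeof local);   /* GET: fetch a private copy of the node */
    local.value *= 2.0;                  /* COMPUTE: operate on the local copy    */
    gcl_put(gp, &local, sizeof local);   /* PUT: write the result back            */
}
```

The point of the idiom is that all computation happens on a private copy, so fine-grained remote accesses are batched into a small number of one-sided transfers.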

Global Pointers
[Figure: dereferencing a global pointer, c = &p.child[i], resolved from the child pointer's chunk index (ci) and node offset (no) fields.]
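Read literally, the slide suggests that a global pointer pairs a chunk index (ci) with a node offset (no), and that dereferencing reduces to pointer arithmetic once the target chunk is locally addressable. A hedged sketch, repeating the hypothetical global_ptr_t from the previous sketch:

```c
#include <stddef.h>

/* Hypothetical global pointer layout: a (chunk index, node offset) pair,
 * as suggested by the .ci / .no fields on the slide. */
typedef struct { int ci; int no; } global_ptr_t;

/* Dereference against a table of locally mapped chunk base addresses.
 * If the target chunk is remote, it would first have to be fetched. */
static void *gp_deref(global_ptr_t gp, void *chunk_base[], size_t node_size) {
    return (char *)chunk_base[gp.ci] + (size_t)gp.no * node_size;
}
```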

Global Trees (GT)
- Run-time library and API for global-view programming of trees on distributed-memory clusters
- Built on the GCL chunk communication framework
- High-level tree operations which work in parallel and are locality aware
- Each process can asynchronously access any portion of the shared tree structure

GT Concepts
- Tree Groups: a set of global trees; allocations are made from the same chunk pool
- Global Node Pointers
- Tree Nodes: link structure managed by GT; body is a user-defined structure
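A hedged sketch of a tree node along these lines: a GT-managed link structure followed by an application-defined body. GT_MAX_CHILDREN, gt_node_t, and the Barnes-Hut-style body are assumptions for illustration; global_ptr_t is the hypothetical type from the Global Pointers sketch.

```c
#define GT_MAX_CHILDREN 8                          /* illustrative bound, not a GT constant */

typedef struct { int ci; int no; } global_ptr_t;   /* hypothetical, as sketched earlier */

/* Link structure managed by GT, followed by an application-defined body. */
typedef struct gt_node {
    global_ptr_t parent;
    global_ptr_t child[GT_MAX_CHILDREN];
    int          n_children;
    struct { double mass, pos[3]; } body;          /* e.g., a Barnes-Hut body */
} gt_node_t;
```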

Example: Copying a Tree
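The copy example on the original slide is an image; as a stand-in, here is a hedged sketch of how a subtree copy could be written in the get/compute/put style using the hypothetical types from the sketches above. gt_get_node(), gt_put_node(), and gt_node_alloc() are invented names, not the actual GT API.

```c
/* Hypothetical GT calls used by this sketch (not the actual GT API). */
extern void         gt_get_node(global_ptr_t src, gt_node_t *buf);        /* fetch a node    */
extern void         gt_put_node(global_ptr_t dst, const gt_node_t *buf);  /* store a node    */
extern global_ptr_t gt_node_alloc(global_ptr_t hint);                     /* allocate a node */

/* Recursively copy the subtree rooted at 'src'; returns the new root. */
global_ptr_t copy_subtree(global_ptr_t src) {
    gt_node_t node;
    gt_get_node(src, &node);                         /* GET the source node           */
    global_ptr_t dst = gt_node_alloc(src);           /* allocate the copy (hint: src) */
    for (int i = 0; i < node.n_children; i++)        /* COMPUTE: rewrite child links  */
        node.child[i] = copy_subtree(node.child[i]);
    gt_put_node(dst, &node);                         /* PUT the finished copy         */
    return dst;
}
```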

Tree Traversals
- GT provides optimized, parallel traversals for common traversal orders
- Visitor callbacks are application-defined computations on a single node
- GT currently provides top-down, bottom-up, and level-wise traversals
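A hedged sketch of the visitor pattern the slide describes: the application supplies a per-node callback and the library drives the (possibly parallel) traversal. gt_visitor_fn, gt_traverse_top_down(), and their signatures are assumptions for illustration, reusing the hypothetical gt_node_t from above.

```c
/* Hypothetical visitor-style traversal API (names and signatures are assumptions). */
typedef void (*gt_visitor_fn)(gt_node_t *node, void *arg);
extern void gt_traverse_top_down(global_ptr_t root, gt_visitor_fn visit, void *arg);

/* Application-defined callback: counts leaf nodes, one node per call. */
static void count_leaves(gt_node_t *node, void *arg) {
    if (node->n_children == 0)
        ++*(long *)arg;
}

/* Usage: long leaves = 0; gt_traverse_top_down(root, count_leaves, &leaves); */
```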

Sample Traversal Usage

Node Mapping

Custom Allocation
- No single mapping of data elements to chunks will be optimal
- GT/GCL supports custom allocators to improve spatial locality
- Allocators can use a hint from the call site and can keep state between calls
- The default allocation policy is local-open
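To illustrate the idea of an allocator that takes a call-site hint and keeps state between calls, here is a hedged sketch; the callback shape, the hint argument, and the local-open fallback are assumptions, not the actual GT/GCL allocator interface.

```c
/* Hypothetical allocator state kept between calls. */
typedef struct { int open_chunk; } alloc_state_t;

/* Pick the chunk for a newly allocated node. 'hint_chunk' could be the
 * chunk of a related node passed from the call site (-1 means no hint);
 * falling back to the currently open local chunk mimics a local-open
 * default policy. */
static int place_node(int hint_chunk, alloc_state_t *st) {
    int target = (hint_chunk >= 0) ? hint_chunk : st->open_chunk;
    st->open_chunk = target;   /* remember the choice for the next allocation */
    return target;
}
```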

Experimental Results
Evaluated using:
- Barnes-Hut from SPLASH-2
- Compression operation from MADNESS
GT compared with:
- Intel's Cluster OpenMP and TreadMarks runtime
- UPC

Global Pointer Overhead
[Figure: results for Barnes-Hut and compress()]

Chunk Size and Bandwidth
Experiments run on the department WCI cluster (Intel Xeon, 6 GB RAM, InfiniBand).

Impact of Chunk Utilization (Barnes-Hut)
Experiments run on the department WCI cluster (Intel Xeon, 6 GB RAM, InfiniBand).

Barnes-Hut Chunk Size Selection
Barnes-Hut application from the SPLASH-2 suite.

Barnes-Hut Scaling
Chunk size = 256, bodies = 512k.

Local vs. Remote Access
MADNESS compress()

Related Approaches
- Distributed Shared Memory (DSM): Cluster OpenMP, TreadMarks, Cashmere, Midway
- Distributed Shared Objects (DSO): Charm++, Linda, Orca
- Partitioned Global Address Space (PGAS) languages and systems: UPC, Titanium, CAF, ARMCI, SHMEM, GASNet
- Shared pointer-based data structure support on distributed-memory clusters: Parallel Irregular Trees, Olden
- HPCS programming languages: Chapel, X10

Future Work
- Global Graphs
- GT data reference locality tools
- More applications

Questions