Galois Performance
Mario Méndez-Lojo, Donald Nguyen

Overview
- The Galois system is a test bed for exploring optimizations
  - Safe, but not fast out of the box
- Important optimizations:
  - Select the least transactional overhead
  - Select the right scheduling
  - Select the appropriate data structure
- Quantify these optimizations on applications

Algorithms
Irregular algorithms classified along three axes:
- topology: general graph, grid, tree
- operator: morph, local computation, reader
- ordering: unordered, ordered
Case studies: 1. Barnes-Hut, 2. Delaunay Mesh Refinement, 3. Preflow-push

Methodology
- Per-thread time broken down into: compute, idle, serial, GC
- Abort ratio: aborted iterations / total iterations
- GC options: UseParallelGC, UseParallelOldGC, NewRatio=1

Terms
- Base: default scheduling, default graph
- Serial: Galois classes replaced by classes with no concurrency control
- Speedup: measured against the best mean performance of a serial variant
- Throughput: # serial iterations / time
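The speedup and throughput definitions above are simple ratios; a minimal sketch of the arithmetic (class and method names are hypothetical, not part of the Galois API):

```java
// Hypothetical helper illustrating the metrics defined above; not Galois code.
public class Metrics {
    // Speedup: best serial time divided by the parallel time being evaluated.
    public static double speedup(double bestSerialMs, double parallelMs) {
        return bestSerialMs / parallelMs;
    }
    // Throughput: serial iterations completed per millisecond.
    public static double throughput(long serialIterations, double timeMs) {
        return serialIterations / timeMs;
    }
    public static void main(String[] args) {
        // Illustrative numbers only.
        System.out.println(speedup(10250.0, 1553.0)); // roughly 6.6
        System.out.println(throughput(1000, 500.0));  // 2.0
    }
}
```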

Numbers
- Runtime: last of 5 runs in the same VM; ignores time to read and construct the initial graph
- Other statistics: last of 5 runs

Test Environment
- 2 x Xeon X5570 (4 cores, 2.93 GHz)
- Java 1.6.0_0-b11
- Linux x86_64
- 20 GB heap size

BARNES-HUT
(Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field)

Barnes-Hut
- N-body algorithm
  - Oct-tree acceleration structure
  - Serial phases: tree build, center of mass, particle update
  - Parallel phase: force computation
- Structure: reader on the tree
- Variants: Splash-2, Reader Galois
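The serial "center of mass" phase mentioned above aggregates mass bottom-up through the tree. A toy sketch of that pass, using a binary tree in one dimension instead of the real oct-tree (all names hypothetical):

```java
// Simplified sketch of Barnes-Hut center-of-mass aggregation.
// The real algorithm uses an oct-tree in 3D; this 1D binary-tree toy
// only shows the bottom-up mass-weighted averaging.
public class CenterOfMass {
    static class Node {
        double mass, x;   // total mass and x-coordinate of center of mass
        Node left, right; // both null for leaves (bodies)
        Node(double mass, double x) { this.mass = mass; this.x = x; }
    }
    // Bottom-up pass: an internal node's mass and position are the
    // mass-weighted combination of its children.
    static void computeCenterOfMass(Node n) {
        if (n.left == null && n.right == null) return; // leaf: a body
        double m = 0, mx = 0;
        for (Node c : new Node[]{ n.left, n.right }) {
            if (c == null) continue;
            computeCenterOfMass(c);
            m += c.mass;
            mx += c.mass * c.x;
        }
        n.mass = m;
        n.x = mx / m;
    }
    public static void main(String[] args) {
        Node root = new Node(0, 0);
        root.left = new Node(1.0, 0.0);  // body of mass 1 at x = 0
        root.right = new Node(3.0, 4.0); // body of mass 3 at x = 4
        computeCenterOfMass(root);
        System.out.println(root.mass + " " + root.x); // 4.0 3.0
    }
}
```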

Reader Optimization
Before:
  child = octree.getNeighbor(nn, 1);
After (skip concurrency-control overhead on read-only access):
  child = octree.getNeighbor(nn, 1, MethodFlag.NONE);

ParaMeter Profile (figure)

Barnes-Hut Results
- 100,000 points, 1 time step
- Best serial: base
- Serial time: ms
- Best parallel time: 1553 ms
- Best speedup: 6.6x

Barnes-Hut Scalability (figure)

DELAUNAY MESH REFINEMENT

Delaunay Mesh Refinement
- Refine "bad" triangles, maintained in a worklist
- Structure: cautious operator on a graph
- Variants: flag-optimized, local LIFO
- base: Priority.defaultOrder()
- local LIFO: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)
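The worklist-driven pattern above, where processing one bad item may create new bad items, can be sketched generically (integers stand in for triangles; names hypothetical, not the Galois API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Generic sketch of a refinement-style worklist loop like DMR's:
// pull a "bad" item, process it, and push any newly created bad items.
public class Worklist {
    public static int refineAll(Deque<Integer> worklist) {
        int processed = 0;
        while (!worklist.isEmpty()) {
            int bad = worklist.poll();
            processed++;
            // Toy "refinement": produces one smaller bad item until it is good.
            if (bad > 1) worklist.add(bad / 2);
        }
        return processed;
    }
    public static void main(String[] args) {
        Deque<Integer> wl = new ArrayDeque<Integer>();
        wl.add(4); // 4 -> 2 -> 1: three iterations until quiescence
        System.out.println(refineAll(wl)); // 3
    }
}
```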

Cautious Optimization
Before:
  mesh.contains(item);
  ...
  mesh.remove(preNodes.get(i));
  ...
  mesh.add(node);
After:
  mesh.contains(item, MethodFlag.CHECK_CONFLICT);
  ...
  mesh.remove(preNodes.get(i), MethodFlag.NONE);
  ...
  mesh.add(node, MethodFlag.NONE);
- No need to save undo information
- Only check conflicts up to the first write
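The effect of these method flags can be illustrated with a toy graph wrapper that counts the bookkeeping it skips; this is a hypothetical sketch of the idea, not the real Galois runtime:

```java
// Hypothetical sketch of MethodFlag-style dispatch: the flag controls
// how much transactional bookkeeping a data-structure call performs.
public class FlaggedGraph {
    public enum MethodFlag { ALL, CHECK_CONFLICT, NONE }
    private int conflictChecks = 0, undoLogs = 0;
    public int getConflictChecks() { return conflictChecks; }
    public int getUndoLogs() { return undoLogs; }
    public String getNeighbor(String node, int i, MethodFlag flag) {
        if (flag == MethodFlag.ALL || flag == MethodFlag.CHECK_CONFLICT)
            conflictChecks++; // acquire an abstract lock on the node
        if (flag == MethodFlag.ALL)
            undoLogs++;       // record undo information for rollback
        return node + "/child" + i; // the actual access is identical
    }
    public static void main(String[] args) {
        FlaggedGraph g = new FlaggedGraph();
        g.getNeighbor("nn", 1, MethodFlag.ALL);            // safe but slow
        g.getNeighbor("nn", 1, MethodFlag.NONE);           // reader: no overhead
        g.getNeighbor("nn", 1, MethodFlag.CHECK_CONFLICT); // cautious: no undo
        System.out.println(g.getConflictChecks() + " " + g.getUndoLogs()); // 2 1
    }
}
```

A cautious operator reads all the data it needs before its first write, so once the first write succeeds, the remaining calls can safely pass NONE.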

LIFO Optimization
Before:
  GaloisRuntime.foreach(..., Priority.defaultOrder());
After:
  GaloisRuntime.foreach(..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
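The "ChunkedFIFO then locally LIFO" policy can be sketched single-threaded: a worker grabs a chunk of items FIFO from the global list, then runs its own items LIFO, so freshly created work is processed immediately for better locality. This is a hypothetical illustration of the policy, not the Galois scheduler:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Single-threaded sketch of a chunked-FIFO global queue feeding a
// per-worker LIFO stack; newly created work goes onto the local stack.
public class LocalLifoScheduler {
    private final Deque<Integer> global = new ArrayDeque<Integer>(); // FIFO
    private final Deque<Integer> local = new ArrayDeque<Integer>();  // LIFO
    private final int chunkSize;
    public LocalLifoScheduler(int chunkSize) { this.chunkSize = chunkSize; }
    public void add(int item) { local.push(item); }       // new work: local LIFO
    public void addGlobal(int item) { global.add(item); } // initial work: FIFO
    public Integer next() {
        if (local.isEmpty()) {                            // refill a chunk
            for (int i = 0; i < chunkSize && !global.isEmpty(); i++)
                local.push(global.poll());
        }
        return local.isEmpty() ? null : local.pop();
    }
    public static void main(String[] args) {
        LocalLifoScheduler s = new LocalLifoScheduler(2);
        for (int i = 1; i <= 3; i++) s.addGlobal(i);
        System.out.println(s.next()); // pulls chunk {1,2}, pops 2 (top of stack)
        s.add(99);                    // freshly created work
        System.out.println(s.next()); // 99: LIFO within the worker
    }
}
```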

ParaMeter Profile (figure)

DMR Results
- 0.5M triangles, 0.25M bad triangles
- Best serial: locallifo.flagopt
- Serial time: ms
- Best parallel time: 3745 ms
- Best speedup: 4.5x

PREFLOW-PUSH

Preflow-push
- Max-flow algorithm: nodes push flow downhill
- Structure: cautious operator, local computation
- Variants: flag-optimized, local computation graph
- base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
- base (relabel): Priority.first(ChunkedFIFO.class, 8)
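The push/relabel idea ("nodes push flow downhill") can be sketched generically; this is a minimal textbook-style preflow-push on a capacity matrix, not the Galois implementation:

```java
// Minimal generic preflow-push (push-relabel) max-flow sketch.
// Nodes with excess push flow to lower neighbors; a node with no lower
// residual neighbor is relabeled (lifted) so it can push again.
public class PreflowPush {
    public static int maxFlow(int[][] cap, int s, int t) {
        int n = cap.length;
        int[][] flow = new int[n][n];
        int[] height = new int[n], excess = new int[n];
        height[s] = n;                   // source starts at height n
        for (int v = 0; v < n; v++) {    // saturate all edges out of s
            flow[s][v] = cap[s][v];
            flow[v][s] = -cap[s][v];
            excess[v] = cap[s][v];
        }
        boolean active = true;
        while (active) {                 // sweep until no node has excess
            active = false;
            for (int u = 0; u < n; u++) {
                if (u == s || u == t || excess[u] <= 0) continue;
                active = true;
                int minH = Integer.MAX_VALUE;
                for (int v = 0; v < n; v++) {
                    if (cap[u][v] - flow[u][v] <= 0) continue; // no residual
                    if (height[v] < height[u]) {               // push downhill
                        int d = Math.min(excess[u], cap[u][v] - flow[u][v]);
                        flow[u][v] += d; flow[v][u] -= d;
                        excess[u] -= d; excess[v] += d;
                        if (excess[u] == 0) break;
                    } else {
                        minH = Math.min(minH, height[v]);
                    }
                }
                if (excess[u] > 0 && minH != Integer.MAX_VALUE)
                    height[u] = minH + 1;                      // relabel
            }
        }
        int total = 0;
        for (int v = 0; v < n; v++) total += flow[s][v];
        return total;
    }
    public static void main(String[] args) {
        // s=0, t=3: two disjoint paths of capacity 2 and 3 -> max flow 5.
        int[][] cap = { {0,2,3,0}, {0,0,0,2}, {0,0,0,3}, {0,0,0,0} };
        System.out.println(maxFlow(cap, 0, 3)); // 5
    }
}
```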

Local Computation Optimization
Before:
  graph = ...
After:
  b = new LocalComputationGraph.ObjectGraphBuilder();
  graph = b.from(graph).create();

ParaMeter Profile (figure)

Preflow-push Results
- From challenge problem (genmf-wide): 14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
- C: ms
- Java: ms
- Best serial: lc.flagopt
- Serial time: ms
- Best parallel time: ms
- Best speedup: 3.1x

Preflow-push Scalability (figure)

What performance did we expect?
- Per-thread time components: idle, serial, GC, parallel compute, mis-speculation
- Some components are measured directly; the rest (synchronization, ...) are accounted for indirectly as error

What performance did we expect?
- Naive: r(x) = t_1 / x
- Amdahl: r(x) = t_p / x + t_s, where t_1 = t_p + t_s and t_s = t_idle + t_gc + t_serial
- Simple: r(x) = (t_p * (i_x / i_1)) / x + t_s
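The three models above differ only in which terms scale with the thread count x. A sketch with made-up component times (all numbers hypothetical):

```java
// Sketch of the three runtime models above; all inputs are illustrative.
public class ScalingModels {
    // Naive: the entire single-thread time t1 scales perfectly.
    public static double naive(double t1, int x) { return t1 / x; }
    // Amdahl: only the parallel part tp scales; ts stays serial.
    public static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }
    // Simple: additionally scale tp by ix/i1, the iterations executed with
    // x threads vs. 1 thread, to account for speculative re-execution.
    public static double simple(double tp, double ts, long ix, long i1, int x) {
        return (tp * ((double) ix / i1)) / x + ts;
    }
    public static void main(String[] args) {
        double tp = 8000, ts = 2000;           // hypothetical ms; t1 = 10000
        System.out.println(naive(tp + ts, 4)); // 2500.0
        System.out.println(amdahl(tp, ts, 4)); // 4000.0
        System.out.println(simple(tp, ts, 1100, 1000, 4)); // ~4200 (10% aborts)
    }
}
```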

Barnes-Hut (figure)

Delaunay Mesh Refinement (figure)

Preflow-push (figure)

Summary
- Many profitable optimizations: selecting among method flags, worklists, and graph variants
- Open topics:
  - Automation
  - Static, dynamic, and performance analysis
  - Efficient ordered algorithms
