Force Directed Placement: GPU Implementation


Force Directed Placement: GPU Implementation
Bernice Chan, Ivan So, Mark Teper

Force Directed Placement
- Nodes are represented by masses
- Edges are represented by springs
- The motion of the elements is simulated iteratively until a steady state is reached (one possible force model is sketched below)
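The slides do not give the exact force laws, so the following is a minimal sketch, assuming a Hooke's-law spring along each edge and an inverse-square (Coulomb-like) repulsion between every pair of nodes; the formulas, constants, and the vec2 helper type are assumptions for illustration.

    #include <math.h>

    typedef struct { float x, y; } vec2;

    // Repulsive force exerted on node i by node j.
    static vec2 repulsive_force(vec2 pi, vec2 pj, float k_rep)
    {
        float dx = pi.x - pj.x, dy = pi.y - pj.y;
        float d2 = dx * dx + dy * dy + 1e-6f;   // guard against divide-by-zero
        vec2 f = { dx * k_rep / d2, dy * k_rep / d2 };
        return f;
    }

    // Attractive (spring) force on node i pulling it toward its neighbor j.
    static vec2 spring_force(vec2 pi, vec2 pj, float k_spring, float rest_len)
    {
        float dx = pj.x - pi.x, dy = pj.y - pi.y;
        float d  = sqrtf(dx * dx + dy * dy) + 1e-6f;
        float s  = k_spring * (d - rest_len) / d;   // Hooke's law along the edge
        vec2 f = { dx * s, dy * s };
        return f;
    }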

Basic Implementation
The Force Directed Placement algorithm is implemented on both the CPU and the GPU.

CPU pseudo-code:

    do {
        calculate_forces()
        calculate_velocity()
        update_positions()
        compute_kinetic_energy()
    } while (kinetic_energy > threshold)

- The initial GPU algorithm achieved results similar to CPU performance.
- Each iteration was parallelized at the node level.
- Two kernels were used: one computes the new velocities and positions, the second computes the kinetic energy (a CUDA sketch of both follows).
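A minimal CUDA sketch of the two kernels described above, assuming one thread per node, float2 positions and velocities, and a simple damped integration step. Kernel names, constants, and the force law are assumptions, not the authors' code; the spring terms along incident edges are omitted for brevity.

    // Kernel 1: one thread per node accumulates the forces acting on
    // its node, then integrates velocity and position.
    __global__ void step_nodes(const float2 *pos_in, float2 *pos_out,
                               float2 *vel, int n,
                               float k_rep, float dt, float damping)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float2 p = pos_in[i];
        float2 f = make_float2(0.0f, 0.0f);

        // Repulsion from every other node (O(N) work per thread);
        // spring forces along incident edges would be added here too.
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            float dx  = p.x - pos_in[j].x;
            float dy  = p.y - pos_in[j].y;
            float inv = k_rep / (dx * dx + dy * dy + 1e-6f);
            f.x += dx * inv;
            f.y += dy * inv;
        }

        float2 v = vel[i];
        v.x = (v.x + f.x * dt) * damping;
        v.y = (v.y + f.y * dt) * damping;
        vel[i] = v;
        pos_out[i] = make_float2(p.x + v.x * dt, p.y + v.y * dt);
    }

    // Kernel 2: per-block partial sums of squared speed (proportional
    // to kinetic energy for unit masses); the partial sums are added on
    // the host.  Launch with blockDim.x a power of two and with
    // blockDim.x * sizeof(float) bytes of dynamic shared memory.
    __global__ void kinetic_energy(const float2 *vel, float *partial, int n)
    {
        extern __shared__ float s[];
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;
        s[tid] = (i < n) ? vel[i].x * vel[i].x + vel[i].y * vel[i].y : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0) partial[blockIdx.x] = s[0];
    }

The host loop then mirrors the CPU pseudo-code: launch step_nodes, launch kinetic_energy, sum the partial results, and stop once the total falls below the threshold.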

Additional Optimizations
- Increasing parallelism: compute the force between each pair of nodes in parallel (N x N threads), and add a new kernel that updates velocities and positions with one node per thread (a sketch of the pairwise kernel follows this list).
- Reducing functional units: reorder floating-point operations to reduce the total number required.
- Reducing synchronization overhead: compute the kinetic energy while updating velocities and positions, increasing the work done per thread.
- Improving memory coalescing: combine the x and y positions into a single float2, and change graph edge weights from char to int.
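In the N x N parallelization sketched below, thread (i, j) is assigned the force that node j exerts on node i. Accumulating with atomicAdd is an assumption (a per-row reduction would be an alternative), and the force buffer is assumed to be zeroed before each launch. Packing positions into float2 lets a warp read consecutive 8-byte elements, which is the coalescing gain noted above.

    // Pairwise force kernel: thread (i, j) computes node j's repulsive
    // contribution to node i and accumulates it atomically.
    __global__ void pair_forces(const float2 *pos, float2 *force,
                                int n, float k_rep)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // target node
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // source node
        if (i >= n || j >= n || i == j) return;

        float dx  = pos[i].x - pos[j].x;
        float dy  = pos[i].y - pos[j].y;
        float inv = k_rep / (dx * dx + dy * dy + 1e-6f);

        atomicAdd(&force[i].x, dx * inv);
        atomicAdd(&force[i].y, dy * inv);
    }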

Additional Optimizations (2)
- Using local memory: cache all memory values locally before performing operations.
- Reducing bank conflicts: transpose the force-calculation kernel so that data can be coalesced in the position-update kernel.
- Reducing memory accesses: perform the force calculations on a block of the N x N pair matrix (a tiled sketch follows this list).
- Using float4 instead of float: as with the float2 positions, use float4 to store both the attractive and repulsive forces (in the x and y directions).
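A tiled variant along the lines of the "block of the N x N" and local-caching points above: each thread owns one node while the block cooperatively stages a tile of positions in shared memory, so every global position is read once per tile instead of once per pair. TILE, the names, and the force law are assumptions, not the authors' code; the kernel must be launched with blockDim.x == TILE.

    #define TILE 256

    __global__ void tiled_forces(const float2 *pos, float2 *force,
                                 int n, float k_rep)
    {
        __shared__ float2 tile[TILE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float2 p = (i < n) ? pos[i] : make_float2(0.0f, 0.0f);
        float2 f = make_float2(0.0f, 0.0f);

        for (int base = 0; base < n; base += TILE) {
            // Stage one tile of positions in shared memory.
            int j = base + threadIdx.x;
            tile[threadIdx.x] = (j < n) ? pos[j] : make_float2(0.0f, 0.0f);
            __syncthreads();

            // Accumulate repulsion from every node in the tile.
            for (int t = 0; t < TILE && base + t < n; ++t) {
                if (i < n && base + t != i) {
                    float dx  = p.x - tile[t].x;
                    float dy  = p.y - tile[t].y;
                    float inv = k_rep / (dx * dx + dy * dy + 1e-6f);
                    f.x += dx * inv;
                    f.y += dy * inv;
                }
            }
            __syncthreads();
        }
        if (i < n) force[i] = f;
    }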